Make mboxgrep recognize dates

amenex
Offline
Joined: 01/03/2015

Alas, on further study it turns out that the dates in the "From - [date]" line have the %c format according to Man date:
grep "From -" 1998-2022.Newest |sort -u |head -n 5
From - Fri Aug 17 15:47:13 2018
From - Fri Aug 17 15:50:59 2018
From - Fri Dec 14 07:28:23 2018
From - Fri Dec 14 07:28:24 2018
From - Fri Dec 14 07:28:25 2018

Whereas the dates returned by
grep "Delivery-date:" 1998-2022.Newest |sort -u |head -n 5
are in --rfc-email format:
Delivery-date: Fri, 01 Feb 2019 00:07:49 -0800
Delivery-date: Fri, 01 Feb 2019 05:16:21 -0800
Delivery-date: Fri, 01 Feb 2019 11:18:14 -0800
Delivery-date: Fri, 01 Feb 2019 17:14:04 -0800
Delivery-date: Fri, 01 Jan 2021 02:28:21 -0800

Supposedly, one writes
date -d [string date] +FORMAT -&c for the "From - [string date]" searches, but the syntax escapes me.

Although my "long form" workaround appears to be successful:
mboxgrep -e "Thu Jun 24 08:35:24 2021" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep01.txt
mboxgrep -e "Thu Jun 24 08:35:24 2021" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
mboxgrep -e "Thu Nov 29 16:50:27 2018" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
mboxgrep -e "Sat Dec 8 08:42:32 2018" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
...
mboxgrep -e "Mon Mar 14 08:53:38 2022" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
mboxgrep -e "Wed Mar 16 08:27:29 2022" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
mboxgrep -e "Fri Mar 18 09:05:17 2022" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;

Saved as Script-No-Date-Emails.Found.by.mboxgrep-GL.txt
cp Script-No-Date-Emails.Found.by.mboxgrep-GL.txt Script-No-Date-Emails.Found.by.mboxgrep-GL
sudo chmod +x Script-No-Date-Emails.Found.by.mboxgrep-GL
./Script-No-Date-Emails.Found.by.mboxgrep-GL
Took 1-1/3 minutes; 83.6 MB
Only a few of these dates match the pattern's dates:
grep "From -" No-Date-Emails.Found.by.mboxgrep.txt |sort -u |head -n 65 > Check-Pattern-Matches.txt
comm -12 <(sort -k 1,1 Dates.from.Emails.no.Dates.txt) <(sort -k 1,1 Check-Pattern-Matches-Edit.txt) > Dates-Common-to-Both-Files.txt

Fri Dec 21 12:21:19 2018
Fri Dec 7 17:28:57 2018
Fri Mar 18 09:05:17 2022
Fri Oct 1 10:52:57 2021
Fri Oct 1 10:53:20 2021
Fri Sep 17 14:53:57 2021
Mon Apr 1 17:41:34 2019
Mon Aug 6 17:50:58 2018
It would appear that the mboxgrep searches are selecting partial matches; mboxgrep must
somehow be held to verbatim matches.
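One way to check which dates were really matched is to compare whole lines with comm, after sorting both lists in the byte order that comm itself will use. The sketch below runs on two small made-up lists; wanted.txt and found.txt are hypothetical names, not the attachments of this post, and LC_ALL=C is an assumption made so that sort and comm agree on the ordering.

```shell
# Sketch on made-up sample data; wanted.txt and found.txt are hypothetical names.
workdir=$(mktemp -d)
printf '%s\n' 'Fri Dec 21 12:21:19 2018' 'Mon Apr 1 17:41:34 2019' \
    'Fri Oct 1 10:52:57 2021' > "$workdir/wanted.txt"
printf '%s\n' 'Mon Apr 1 17:41:34 2019' 'Fri Dec 21 12:21:19 2018' \
    'Tue Jan 1 00:00:00 2019' > "$workdir/found.txt"

# comm needs both inputs sorted on the whole line, in the same collation order.
LC_ALL=C sort -u "$workdir/wanted.txt" > "$workdir/wanted.sorted"
LC_ALL=C sort -u "$workdir/found.txt"  > "$workdir/found.sorted"

# Lines present in both lists.
common=$(LC_ALL=C comm -12 "$workdir/wanted.sorted" "$workdir/found.sorted")
# Lines that were wanted but never found.
missed=$(LC_ALL=C comm -23 "$workdir/wanted.sorted" "$workdir/found.sorted")

printf 'common:\n%s\nmissed:\n%s\n' "$common" "$missed"
```

An empty "missed" list means every wanted date was found verbatim; anything listed there was either not matched or matched under a slightly different spelling (spacing, for instance).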

Attachment  Size
Check-Pattern-Matches-Edit.txt 1.59 KB
Dates.from_.Emails.no_.Dates_.txt 1.49 KB
amenex

Did a Google search "mboxgrep dates" and https://trisquel.info/en/forum/make-mboxgrep-recognize-dates
was at the top of the list.
A little farther down in the Google search was an item about grepmail:
https://packages.debian.org/buster/grepmail
See also: https://metacpan.org/pod/grepmail
Available from the Trisquel repository but not included in Add/Remove Applications
Install with apt-get ... man pages have lots of detail.
First try:
grepmail -u -f Dates.from.Emails.no.Dates.txt 1998-2022.Newest > First-grepmail-search.txt
Took 18 seconds.
Proof:
grep "From -" First-grepmail-search.txt | sed 's/From\ -\ //g' > Check-Pattern-Matches-grepmail.txt ;
comm -12 <(sort -k 1,1 Dates.from.Emails.no.Dates.txt) <(sort -k 1,1 Check-Pattern-Matches-grepmail.txt) > Dates-Common-to-Both-Files-grepmail.txt ;
comm -23 <(sort -k 1,1 Dates.from.Emails.no.Dates.txt) <(sort -k 1,1 Check-Pattern-Matches-grepmail.txt) > Dates-missed-by-grepmail.txt

Grepmail appears to have a lot of promise for use in my spam project.

Attachment  Size
Dates-Common-to-Both-Files-grepmail.txt 1.49 KB
Dates-missed-by-grepmail.txt 0 byte
Magic Banana

I am a member!

Offline
Joined: 07/24/2010

the syntax escapes me.

$ date -d 'Fri, 01 Feb 2019 00:07:49 -0800' +%c
Fri Feb 1 06:07:49 2019
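The layout that %c produces depends on the locale, and the hour shown depends on the time zone of the machine running the command. A minimal sketch that pins both, assuming GNU date (for -d); the function name to_from_line_format is made up for this example, and whether the "From - " lines of the mailbox are in UTC or in local time is not something this thread establishes.

```shell
# Convert an RFC 5322 date string to the layout of the "From - " lines.
# Assumes GNU date. LC_ALL=C and TZ=UTC are choices made here so that the
# result does not depend on the locale or time zone of the machine.
to_from_line_format() {
    LC_ALL=C TZ=UTC date -d "$1" +%c
}

to_from_line_format 'Fri, 01 Feb 2019 00:07:49 -0800'
```

In the C locale the day of the month is padded with a space, so single-digit days come out with two spaces after the month name; a search pattern typed with a single space would not match such a line verbatim.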

amenex

That's not "it."
date -d 'Wed, 13 Jan 2021 00:14:24 -0700' +%c > Wed 13 Jan 2021 02:14:24 AM EST
date -R 'Wed, 13 Jan 2021 00:14:24 -0700' > date: invalid date ‘Wed, 13 Jan 2021 00:14:24 -0700’
but January 13 2021 really was a Wednesday ...
Man date says:
-R, --rfc-email
output date and time in RFC 5322 format. Example: Mon, 14 Aug 2006 02:34:56 -0600

My dates follow RFC 5322; what does bash date not like about those dates ?
grepmail won't find [verbatim] 'Wed, 13 Jan 2021 00:14:24 -0700' with a 'Wed 13 Jan 2021 02:14:24 AM EST' pattern.
One of the patterns is actually truncated ==> '18 Apr 2019 12:21:15 +0300' which is even worse.
Leafpad takes several minutes each time, but does find each target date in the 326 MB 1998-2022.Newest email file.
My attempts to force the %a, &d %b %Y %X %z sequence of fields have all been unsuccessful.

However, there is a way that doesn't require any hard labor: escape the spaces in the dates, making them simple strings.
sed 's/\ /\\ /g' Search-pattern.txt > Escaped-pattern.txt
Then execute these six grepmail commands:
grepmail -u 'Wed\, 13\ Jan\ 2021\ 00:14:24\ \-0700' 1998-2022.Newest > Email-01.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u '18\ Apr\ 2019\ 12:21:15\ \+0300' 1998-2022.Newest > Email-02.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Thu,\ 24\ Jun\ 2021\ 18:09:36\ \+0200' 1998-2022.Newest > Email-03.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Thu,\ 24\ Jun\ 2021\ 18:05:29\ \+0200' 1998-2022.Newest > Email-04.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 05:10:55\ \+0200' 1998-2022.Newest > Email-05.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 06:25:11\ \+0200' 1998-2022.Newest > Email-06.Found.by.grepmail-Escaped-pattern.txt ;

cat Email-0?.Found.by.grepmail-Escaped-pattern.txt > EmailsFound.by.grepmail-Escaped-pattern.txt
LC_ALL=C ./MB-Scrape-Emails-GL-PO EmailsFound.by.grepmail-Escaped-pattern.txt | sed 's/\t\t/\tnull\t/g' > Hairball-Escaped-pattern-Emails.txt
All six date-strings are in their respective Date column ($2).

Attachment  Size
EmailsFound.by_.grepmail-Escaped-pattern.txt 147.4 KB
Hairball-Escaped-pattern-Emails.txt 6.3 KB
MB-Scrape-Emails-GL-PO.txt 2.11 KB
Search-pattern.txt 198 byte
Escaped-pattern.txt 228 byte
Magic Banana


what does bash date not like about those dates ?

The date command (not Bash) returns the error. It does so because you need the --date (-d) option to use another date and time than the current one:
$ date -Rd 'Wed, 13 Jan 2021 00:14:24 -0700'
Wed, 13 Jan 2021 04:14:24 -0300

grepmail won't find [verbatim] 'Wed, 13 Jan 2021 00:14:24 -0700' with a 'Wed 13 Jan 2021 02:14:24 AM EST' pattern.

grepmail searches a regular expression and "Wed, 13 Jan 2021 00:14:24 -0700" does not match "Wed 13 Jan 2021 02:14:24 AM EST", obviously (additional comma, "2" instead of "0", and "-0700" instead of "AM EST").

My attempts to force the %a, &d %b %Y %X %z sequence of fields have all been unsuccessful.

If you are referring to date's output format, your "&d" should be "%d":
$ date '+%a, %d %b %Y %X %z'
Wed, 30 Mar 2022 21:04:39 -0300
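Note that %X is the locale's own time representation, which in some locales is a 12-hour clock with AM/PM. A sketch that avoids this by using %T (always HH:MM:SS) under LC_ALL=C, assuming GNU date; the function name rfc_like and its optional second argument are made up for this example.

```shell
# Print a date in the "Day, DD Mon YYYY HH:MM:SS +ZZZZ" layout.
# $1: date string to parse; $2: optional TZ value (defaults to UTC).
# Assumes GNU date (for -d).
rfc_like() {
    LC_ALL=C TZ="${2:-UTC}" date -d "$1" '+%a, %d %b %Y %T %z'
}

# Rendered in UTC: same instant, different hour and offset.
rfc_like 'Wed, 13 Jan 2021 00:14:24 -0700'

# Rendered seven hours west of UTC (POSIX TZ syntax inverts the sign).
rfc_like 'Wed, 13 Jan 2021 00:14:24 -0700' 'UTC+7'
```

Only the second call reproduces the original string character for character, which matters if the result is meant to be used as a verbatim search pattern.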

All six date-strings are in their respective Date column ($2).

That column contains the value of "Date" in the headers of the input emails, hence neither the "'From - [date]' line" nor the "Delivery-date" you were talking about in the original post.

Also, I do not see the point of separating six emails from the rest (in a terrible way; using Leafpad, six hard-coded dates, etc.). Why not simply apply MB-Scrape-Emails-GL-PO to all the emails? You can then select the relevant output lines, maybe with something like this (which only searches the values of the "Date" field, not the whole emails):
awk -F \\t '$2 == "Wed, 13 Jan 2021 00:14:24 -0700" || $2 == "18 Apr 2019 12:21:15 +0300" || $2 == "Thu, 24 Jun 2021 18:09:36 +0200" || $2 == "Thu, 24 Jun 2021 18:05:29 +0200" || $2 == "Fri, 03 Sep 2021 05:10:55 +0200" || $2 == "Fri, 03 Sep 2021 06:25:11 +0200"'

amenex

Only a short clarification; other matters will soon interfere.
Those six dates were scraped independently (about which, more later). They're in the six remaining unidentified emails and
serve to mark the "lost" emails. As grepmail is meant to search for strings, my escaping all the spaces in the six emails
with the sed command converted those dates to strings eliminating the dates' format from the syntax dilemma. The strings
that grepmail found in the 326 MB 1998-2022.Newest file are unique and are found in none of those other 10,000+ emails,
and they match perfectly the original unaltered dates.
I need to do a better job explaining how I found those six dates; I hastily skipped over that part of the task; I'm sorry.

amenex

Magic Banana's suggested script:
MB's suggested scraping script applied to 1998-2022.Newest appears to do what my longform script and the final script in my posting below do ==>
awk -F \\t '$2 == "Wed, 13 Jan 2021 00:14:24 -0700" || $2 == "18 Apr 2019 12:21:15 +0300" || $2 == "Thu, 24 Jun 2021 18:09:36 +0200" || $2 == "Thu, 24 Jun 2021 18:05:29 +0200" || $2 == "Fri, 03 Sep 2021 05:10:55 +0200" || $2 == "Fri, 03 Sep 2021 06:25:11 +0200"' Hairball-1998-2022.Newest.txt > Output-MBscript.txt
The one-line script is two hundred times faster and outputs about as much total information; 6 kB vs. 6.3 kB. The hare beats the tortoise, yet again showing the power of scripting for analyzing text files.

Attachment  Size
Output-MBscript.txt 5.98 KB
Hairball-Six-Dates-longform.txt 6.3 KB
Magic Banana


The one-line script is two hundred times faster and outputs about as much total information; 6 kB vs. 6.3 kB.

Apart from the order of the lines ("my" AWK selection preserves the order in the input), the only difference is the absence of the header. If you want it and if the input has it, you can write:
awk -F \\t 'NR == 1 || $2 == "Wed, 13 Jan 2021 00:14:24 -0700" || $2 == "18 Apr 2019 12:21:15 +0300" || $2 == "Thu, 24 Jun 2021 18:09:36 +0200" || $2 == "Thu, 24 Jun 2021 18:05:29 +0200" || $2 == "Fri, 03 Sep 2021 05:10:55 +0200" || $2 == "Fri, 03 Sep 2021 06:25:11 +0200"'

amenex

I concur; adding the headers stretched the processing time to one second for those six dates.
The outputs are close indeed; see screenshot attached.
The remaining 10,302 emails in 1998-2022.Newest are all accessible by their X-UIDL fields.

MagicBananaOverAmenex.png
amenex

If you were given a list of 10,000+ dates enclosed in quotes, how would you build a script like the one from
Thu, 03/31/2022 - 19:18 ?
Leafpad cannot place that many fields in a single row.
On the other hand it is easy to build a 10,000+ row longform script with Leafpad, but that script took several hours to finish.

Attachment  Size
Double-quoted-Plain-10277-UID-based-Dates.txt 336.29 KB
Double-quoted-10277-Escaped-space-UID-based-Dates.txt 385.58 KB
Magic Banana


AWK can first store the dates in an array and then test whether $2 is in the array (FNR, unlike NR, restarts at 1 for each input file, so FNR == 1 selects the header of the second file):
$ awk -F \\t 'FILENAME == ARGV[1] { gsub(/"/, ""); a[$0] } FILENAME == ARGV[2] && (FNR == 1 || $2 in a)' Double-quoted-Plain-10277-UID-based-Dates.txt Hairball-1998-2022.Newest.txt
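A small self-contained run of this array-lookup idea, on made-up data: dates.txt and summary.tsv are hypothetical stand-ins for the quoted date list and the tab-separated summary, and the column contents are invented. FNR is used for the header test because it restarts at 1 for each input file, whereas NR keeps counting across files.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in for the list of double-quoted dates.
printf '%s\n' '"Wed, 13 Jan 2021 00:14:24 -0700"' \
    '"18 Apr 2019 12:21:15 +0300"' > "$workdir/dates.txt"

# Hypothetical stand-in for the summary: header plus three rows, date in column 2.
printf '%s\t%s\t%s\n' \
    '# 1:Content' '2:Date' '3:Delivery-date' \
    'a' 'Wed, 13 Jan 2021 00:14:24 -0700' 'x' \
    'b' 'Mon, 14 Aug 2006 02:34:56 -0600' 'y' \
    'c' '18 Apr 2019 12:21:15 +0300' 'z' > "$workdir/summary.tsv"

selected=$(awk -F '\t' '
    # First file: strip the double quotes and remember each date as an array key.
    FILENAME == ARGV[1] { gsub(/"/, ""); wanted[$0]; next }
    # Second file: print the header line and every row whose column 2 is wanted.
    FNR == 1 || $2 in wanted
' "$workdir/dates.txt" "$workdir/summary.tsv")

printf '%s\n' "$selected"
```

The lookup is a hash test per row, so the running time grows with the size of the two files rather than with the number of dates times the number of rows.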

amenex

Continuing to account for all the emails in 1998-2022.Newest:

Steps to account for every email:
LC_ALL=C ./MB-Scrape-Emails-GL-PO 1998-2022.Newest | sed 's/\t\t/\tnull\t/g' > Hairball-1998-2022.Newest.txt
Gathers all 10308 of the individual emails. Takes about six seconds; 19.3 MB.
head -n 1 Hairball-1998-2022.Newest.txt | sed 's/\t/\n/g' | sed 's/:/\t/g' | sed 's/\#\ //g' > Headers-1998-2022.Newest.txt
Lists the headers of the numbered columns in the email summary:
1 Content analysis details
2 Date
3 Delivery-date
4 IPv4-addr-in-contents
5 IPv4-addr-in-headers
6 Message-ID
7 Return-Path
8 Subject
9 URL-in-contents
10 URL-in-headers
11 X-Account-Key
12 X-Spam-Score
13 X-UIDL
14 email-addr-in-contents
15 email-addr-in-headers
16 email-addr-with-at-in-contents
17 email-addr-with-at-in-headers

cut -f 2,13 Hairball-1998-2022.Newest.txt | awk -F '[\t]' 'NR > 1 {print $2"\t"$1}' | sort -nk 1,1 > X-UIDL.and.Date.txt
Took half a second; has a couple of malformed lines at the beginning, 10277 OK, 25 nulls in Col.$2, and four malformed Col.$1 at the end ==> 2 + 10277 + 25 + 4 = 10308, totaling 19.3 MB. These thirty-one malformed sets of email data need further analysis.
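The tally above can be cross-checked mechanically. The sketch below counts the rows of a two-column UID/date file by category; the sample rows are invented, uid-date.tsv is a hypothetical name, and it assumes that a missing value appears as the literal word "null", as produced by the sed substitution used in this thread.

```shell
workdir=$(mktemp -d)

# Invented sample: column 1 is the UID, column 2 the date.
printf '%s\t%s\n' \
    'UID1-1' 'Wed, 13 Jan 2021 00:14:24 -0700' \
    'UID2-1' 'null' \
    'null'   '18 Apr 2019 12:21:15 +0300' \
    'UID3-1' 'Fri, 03 Sep 2021 05:10:55 +0200' > "$workdir/uid-date.tsv"

tally=$(awk -F '\t' '
    $1 == "null" { no_uid++;  next }   # rows without a UID
    $2 == "null" { no_date++; next }   # rows without a date
    { ok++ }                           # rows with both
    END { printf "ok=%d no_date=%d no_uid=%d total=%d\n", ok, no_date, no_uid, NR }
' "$workdir/uid-date.tsv")

printf '%s\n' "$tally"
```

If the three category counts do not add up to the total, some rows fall outside the assumed layout and deserve a closer look.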

grep "null" X-UIDL.and.Date.txt | grep "UID" | awk '{print $1}' > UID-pattern.txt <== use this as a starting point to grepmail those 25 dates.

grepmail -u -f UID-pattern.txt 1998-2022.Newest > Emails-UID-null.txt ==> 1.1 MB <== Twenty-five emails containing the target UIDs.

LC_ALL=C ./MB-Scrape-Emails-GL-PO Emails-UID-null.txt | sed 's/\t\t/\tnull\t/g' > Hairball-UID-null.txt <== MB's script-summary: Twenty-five dates, 18 of them in Col.$2, the other seven in Col.$3.

sed 's/\ /\\ /g' X-UIDL.and.Date.txt | grep -v "UID" | awk -F '[\t]' '{print $2}' > Six-strung-out-dates.txt
'Wed,\ 13\ Jan\ 2021\ 00:14:24\ -0700'
'18\ Apr\ 2019\ 12:21:15\ +0300'
'Thu,\ 24\ Jun\ 2021\ 18:09:36\ +0200'
'Thu,\ 24\ Jun\ 2021\ 18:05:29\ +0200'
'Fri,\ 03\ Sep\ 2021\ 05:10:55\ +0200'
'Fri,\ 03\ Sep\ 2021\ 06:25:11\ +0200'

Unsuccessful short-form script:
while read pattern
do grepmail -e $pattern 1998-2022.Newest > Mailbox-Six-Dates.txt
done < Six-strung-out-dates.txt
Saved as Script-Mailbox-Six-Dates.txt but fails.

Fall back to GL's longform technique, which is constructed on the Leafpad platform: ==>
grepmail -u 'Wed\, 13\ Jan\ 2021\ 00:14:24\ \-0700' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u '18\ Apr\ 2019\ 12:21:15\ \+0300' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Thu,\ 24\ Jun\ 2021\ 18:09:36\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Thu,\ 24\ Jun\ 2021\ 18:05:29\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 05:10:55\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 06:25:11\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
Saved as Script-Mailbox-Six-Dates-longform.txt
./Script-Mailbox-Six-Dates-longform ==> Took 5.624 seconds; 150.9 kB
LC_ALL=C ./MB-Scrape-Emails-GL-PO Emails.Found.by.grepmail-Escaped-pattern.txt | sed 's/\t\t/\tnull\t/g' > Hairball-Six-Dates-longform.txt
Lists the original six dates in Col.$2.
This series of steps thereby enables MB's main script to summarize all of the Emails in 1998-2022.Newest.

Attachment  Size
X-UIDL.and_.Date_.txt 519.1 KB
UID-pattern.txt 518 byte
Emails-UID-null.txt 1.08 MB
Hairball-UID-null.txt 78.65 KB
Six-strung-out-dates.txt 228 byte
Script-Mailbox-Six-Dates.txt 115 byte
Script-Mailbox-Six-Dates-longform.txt 707 byte
Hairball-Six-Dates-longform.txt 6.3 KB
Magic Banana


Saved as Script-Mailbox-Six-Dates.txt but fails.

Every output of grepmail overwrites the previous one because you redirect with ">" (not ">>") inside the loop (using ">" after "done" would work). Like grep, grepmail (which I have never used) must have an option -f to match a set of regular expressions:
$ grepmail -f Six-strung-out-dates.txt 1998-2022.Newest > Mailbox-Six-Dates.txt
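The effect of the redirection placement can be seen on a toy example. In this sketch grep -F stands in for grepmail (a substitution made only so the example runs anywhere), and mail.txt and patterns.txt are made-up files.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'Date: Wed, 13 Jan 2021 00:14:24 -0700' \
    'Date: 18 Apr 2019 12:21:15 +0300' \
    'Date: Mon, 14 Aug 2006 02:34:56 -0600' > "$workdir/mail.txt"
printf '%s\n' 'Wed, 13 Jan 2021 00:14:24 -0700' \
    '18 Apr 2019 12:21:15 +0300' > "$workdir/patterns.txt"

# ">" inside the loop: every pass truncates the file, so only the last result survives.
while IFS= read -r pattern; do
    grep -F -e "$pattern" "$workdir/mail.txt" > "$workdir/inside.txt"
done < "$workdir/patterns.txt"

# ">" after "done": the file is opened once and collects the output of every pass.
while IFS= read -r pattern; do
    grep -F -e "$pattern" "$workdir/mail.txt"
done < "$workdir/patterns.txt" > "$workdir/after.txt"

inside_count=$(wc -l < "$workdir/inside.txt" | tr -d ' ')
after_count=$(wc -l < "$workdir/after.txt" | tr -d ' ')
printf 'inside the loop: %s line(s); after done: %s line(s)\n' "$inside_count" "$after_count"
```

Using ">>" inside the loop also keeps every result, but then the output file has to be emptied before each new run.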

amenex

The plot thickens ...

Magic Banana found errors in my script:
while read pattern
do grepmail -e $pattern 1998-2022.Newest > Mailbox-Six-Dates.txt
done < Six-strung-out-dates.txt

Based on the premise that this script processes one date at a time and concatenates the outputs:
while read pattern
do grepmail -u $pattern 1998-2022.Newest >> Mailbox-Six-Dates.Corrected-U.txt
done < Six-strung-out-dates.txt

Saved as Script-Mailbox-Six-Dates.Corrected-U.txt
cp Script-Mailbox-Six-Dates.Corrected-U.txt Script-Mailbox-Six-Dates.Corrected-U
sudo chmod +x Script-Mailbox-Six-Dates.Corrected-U
./Script-Mailbox-Six-Dates.Corrected-U

Fails to treat each escaped-space date as a single string, yet creates a 246 kB email file containing 16 emails,
none containing a sought date.

The proposed three-line script outwardly adheres to the syntax of the six-line longform script, but the difference
is that in the longform script the escaped-spaces dates are explicitly written out, whereas the three-line script
depends on verbatim transfers of said dates, one long string at a time. What have I missed ?

Magic Banana


Double quotes around $pattern, I believe. Also escaping the spaces looks useless and I still do not understand why you would not use option -f.
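Two shell behaviours are probably at work in the failing loop, and both can be observed without grepmail. An unquoted $pattern is split on spaces into several arguments, and read without -r removes backslashes from its input, so backslash-escaped spaces do not survive the trip through the loop. The helper count_args below is made up for this demonstration.

```shell
# Report how many arguments the shell actually passes.
count_args() { printf '%s\n' "$#"; }

pattern='Fri, 03 Sep 2021 06:25:11 +0200'
unquoted=$(count_args $pattern)      # split on spaces: one argument per word
quoted=$(count_args "$pattern")      # kept together: a single argument
printf 'unquoted=%s quoted=%s\n' "$unquoted" "$quoted"

# read without -r treats the backslash as an escape character and drops it;
# read -r keeps the line exactly as it is in the file.
escaped_line='Fri,\ 03\ Sep'
plain=$(printf '%s\n' "$escaped_line" | { read line; printf '%s' "$line"; })
raw=$(printf '%s\n' "$escaped_line" | { IFS= read -r line; printf '%s' "$line"; })
printf 'plain=%s\n' "$plain"
printf 'raw=%s\n' "$raw"
```

Quote characters stored in a file are ordinary data as well: if the lines of the pattern file include single quotes, those quotes are handed to the command literally instead of being interpreted by the shell.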

amenex

Here's a script that works OK, but the twenty-five patterns are all single strings:
grepmail -u -f UID-pattern.txt 1998-2022.Newest > Emails-UID-null-GL.txt
The longform script has escaped spaces and single-quote-enclosed fields and works OK:
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 06:25:11\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt
Zero bytes output:
grepmail -u "Fri, 03 Sep 2021 06:25:11 +0200" 1998-2022.Newest >> Emails.Found.by.grepmail-Double-quoted-pattern.txt
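A likely reason for the empty output is the unescaped "+" rather than the spaces. In a regular expression "+" is a quantifier, so " +0200" means "one or more spaces followed by 0200" and cannot match the literal text "+0200"; the longform command works because it writes "\+". The sketch below shows the same effect with grep -E, which stands in for grepmail on the assumption that grepmail treats its pattern as a Perl-style regular expression.

```shell
line='Date: Fri, 03 Sep 2021 06:25:11 +0200'

# Unescaped "+": read as "one or more of the preceding space".
if printf '%s\n' "$line" | grep -qE 'Fri, 03 Sep 2021 06:25:11 +0200'; then
    unescaped=match
else
    unescaped=nomatch
fi

# Escaped "+": a literal plus sign. The spaces need no escaping.
if printf '%s\n' "$line" | grep -qE 'Fri, 03 Sep 2021 06:25:11 \+0200'; then
    escaped=match
else
    escaped=nomatch
fi

# Fixed-string search: no character is special at all.
if printf '%s\n' "$line" | grep -qF 'Fri, 03 Sep 2021 06:25:11 +0200'; then
    fixed=match
else
    fixed=nomatch
fi

printf 'unescaped=%s escaped=%s fixed=%s\n' "$unescaped" "$escaped" "$fixed"
```

Dates with a negative offset such as "-0700" are not affected, since "-" is not special outside a bracket expression.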

EDIT ==> Found a strawman to get around mboxgrep's and grepmail's aversions to dates. The squeamish should look away.

First, in plain English:
(1) Use grep to find a suitable quantity of text around each date;
(2) Search grep's output for another email identifier, such as X-UIDL;
(3) Capture the found X-UIDL data;
(4) Save that data as an individual pattern (the strawman) file;
(5) Apply grepmail and the strawman to collect the associated emails;
(6) Confirm by generating a dedicated hairball with MB-Scrape-Emails-GL-PO.

Here is the first line of seventeen in the said script:
grep -B 25 -e "10 Mar 2021 12:39:12 -0200" 1998-2022.Newest | grep -e "X-UIDL:" '-' | sed 's/X-UIDL:\ //g' | tr -d '<>' > Pattern-X-UIDL ; grepmail -u -f Pattern-X-UIDL 1998-2022.Newest >> Combined-App-Invalid-Date-Emails.txt

When I attempted a one-line version of the script:
grep -B 25 -ef Invalid-Double-Quoted-Dates.txt 1998-2022.Newest | grep -e "X-UIDL:" '-' | sed 's/1998-2022.Newest-X-UIDL: //g' | tr -d '<>' > Pattern-X-UIDL ; grepmail -u -f Pattern-X-UIDL 1998-2022.Newest
I was stymied because the pattern data accumulated in Pattern-X-UIDL rather than overwriting the file, resulting in multiple copies of the emails in the output.
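One thing to note about the one-line attempt: grep reads "-ef FILE" as "-e f" (the pattern is the single letter f) followed by FILE as an additional input file, which would explain both the flood of matches and the "1998-2022.Newest-" prefix on the context lines. With -f alone, the double quotes in the date file would still be searched for literally. As an alternative, steps (1) to (3) can be done in a single pass with AWK. The sketch below is an assumption-laden illustration, not the script from this thread: the mailbox and date list are made up, messages are assumed to start at lines beginning with "From ", and each message is assumed to carry at most one X-UIDL header.

```shell
workdir=$(mktemp -d)

# Made-up two-message mailbox.
cat > "$workdir/sample.mbox" <<'EOF'
From - Wed Mar 10 14:39:12 2021
X-UIDL: <UID1001-1>
Date: 10 Mar 2021 12:39:12 -0200
Subject: first

body one
From - Thu Mar 11 09:00:00 2021
X-UIDL: <UID1002-1>
Date: Thu, 11 Mar 2021 09:00:00 +0000
Subject: second

body two
EOF

# Made-up list with one double-quoted date.
printf '%s\n' '"10 Mar 2021 12:39:12 -0200"' > "$workdir/dates.txt"

uids=$(awk '
    # Print the UID of the message just finished if it contained a wanted date.
    function emit() { if (hit && uid != "") print uid; hit = 0; uid = "" }
    # First file: strip the quotes and remember each non-empty date.
    FILENAME == ARGV[1] { gsub(/"/, ""); if ($0 != "") wanted[$0]; next }
    # Second file: a line starting with "From " opens a new message.
    /^From / { emit() }
    /^X-UIDL: / { uid = substr($0, 9); gsub(/[<>]/, "", uid) }
    # Literal substring test, so "+" and "-" need no escaping.
    { for (d in wanted) if (index($0, d)) hit = 1 }
    END { emit() }
' "$workdir/dates.txt" "$workdir/sample.mbox")

printf '%s\n' "$uids"
```

The resulting list has one UID per matching message and could be saved to a fresh pattern file for grepmail -u -f, as in step (5); writing it with ">" on every run avoids the accumulation described above.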

Attachment  Size
UID-pattern.txt 518 byte
Emails-UID-null-GL.txt 1.08 MB
Script-Mailbox-Six-Dates-longform.txt 707 byte
Emails.Found_.by_.grepmail-Escaped-pattern.txt 159.95 KB
Six-double-quoted-Dates.txt 199 byte
Invalid-Double-Quoted-Dates.txt 1.11 KB
Script-Invalid-Double-Quoted-Dates.txt 3.79 KB
Combined-App-Invalid-Date-Emails.txt 249.75 KB
Hairball-Invalid-Date-Emails.txt 17.22 KB