Make mboxgrep recognize dates
Alas, on further study it turns out that the dates in the "From - [date]" line have the %c format according to Man date:
grep "From -" 1998-2022.Newest |sort -u |head -n 5
From - Fri Aug 17 15:47:13 2018
From - Fri Aug 17 15:50:59 2018
From - Fri Dec 14 07:28:23 2018
From - Fri Dec 14 07:28:24 2018
From - Fri Dec 14 07:28:25 2018
Whereas, the dates returned by
grep "Delivery-date:" 1998-2022.Newest |sort -u |head -n 5
are in --rfc-mail format.
Delivery-date: Fri, 01 Feb 2019 00:07:49 -0800
Delivery-date: Fri, 01 Feb 2019 05:16:21 -0800
Delivery-date: Fri, 01 Feb 2019 11:18:14 -0800
Delivery-date: Fri, 01 Feb 2019 17:14:04 -0800
Delivery-date: Fri, 01 Jan 2021 02:28:21 -0800
Supposedly, one writes
date -d [string date] +FORMAT -&c
for the "From - [string date]" searches, but the syntax escapes me.
Although my "long form" workaround appears to be successful:
mboxgrep -e "Thu Jun 24 08:35:24 2021" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep01.txt
mboxgrep -e "Thu Jun 24 08:35:24 2021" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
mboxgrep -e "Thu Nov 29 16:50:27 2018" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
mboxgrep -e "Sat Dec 8 08:42:32 2018" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
...
mboxgrep -e "Mon Mar 14 08:53:38 2022" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
mboxgrep -e "Wed Mar 16 08:27:29 2022" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
mboxgrep -e "Fri Mar 18 09:05:17 2022" 1998-2022.Newest >> No-Date-Emails.Found.by.mboxgrep.txt ;
Saved as Script-No-Date-Emails.Found.by.mboxgrep-GL.txt
cp Script-No-Date-Emails.Found.by.mboxgrep-GL.txt Script-No-Date-Emails.Found.by.mboxgrep-GL
sudo chmod +x Script-No-Date-Emails.Found.by.mboxgrep-GL
./Script-No-Date-Emails.Found.by.mboxgrep-GL
Took 1-1/3 minutes; 83.6 MB
Only a few of these dates match the pattern's dates:
grep "From -" No-Date-Emails.Found.by.mboxgrep.txt |sort -u |head -n 65 > Check-Pattern-Matches.txt
comm -12 <(sort -k 1,1 Dates.from.Emails.no.Dates.txt) <(sort -k 1,1 Check-Pattern-Matches-Edit.txt) > Dates-Common-to-Both-Files.txt
Fri Dec 21 12:21:19 2018
Fri Dec 7 17:28:57 2018
Fri Mar 18 09:05:17 2022
Fri Oct 1 10:52:57 2021
Fri Oct 1 10:53:20 2021
Fri Sep 17 14:53:57 2021
Mon Apr 1 17:41:34 2019
Mon Aug 6 17:50:58 2018
It would appear that the mboxgrep searches are selecting partial matches; mboxgrep must
somehow be held to verbatim matches.
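One way to hold the match to the envelope line alone is to compare only the text after "From - " with the wanted dates, instead of searching whole messages. A minimal sketch in awk, assuming every message in the mailbox starts with a "From - [date]" line as in the listing above and that the dates file holds one date per line; the function name select_by_from_date is made up for illustration:

```shell
# Hypothetical helper: print only the messages whose "From - " envelope date
# is listed (one date per line) in the first file.
# Assumption: every message in the mbox begins with a "From - <date>" line.
select_by_from_date() {
    awk '
        # Collapse runs of spaces so "Dec  7" and "Dec 7" compare equal.
        function squeeze(s) { gsub(/ +/, " ", s); return s }
        FILENAME == ARGV[1] { wanted[squeeze($0)]; next }
        /^From - / { keep = (squeeze(substr($0, 8)) in wanted) }
        keep
    ' "$1" "$2"
}

# e.g. select_by_from_date Dates.from.Emails.no.Dates.txt 1998-2022.Newest
```

A body line that happens to start with "From - " would be mistaken for a new message, so this is only as reliable as the mailbox's own escaping.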
Attachment | Size |
---|---|
Check-Pattern-Matches-Edit.txt | 1.59 KB |
Dates.from_.Emails.no_.Dates_.txt | 1.49 KB |
Did a Google search "mboxgrep dates" and https://trisquel.info/en/forum/make-mboxgrep-recognize-dates
was at the top of the list.
A little farther down in the Google search was an item about grepmail:
https://packages.debian.org/buster/grepmail
See also: https://metacpan.org/pod/grepmail
Available from the Trisquel repository but not included in Add/Remove Applications
Install with apt-get ... man pages have lots of detail.
First try:
grepmail -u -f Dates.from.Emails.no.Dates.txt 1998-2022.Newest > First-grepmail-search.txt
Took 18 seconds.
Proof:
grep "From -" First-grepmail-search.txt | sed 's/From\ -\ //g' > Check-Pattern-Matches-grepmail.txt ;
comm -12 <(sort -k 1,1 Dates.from.Emails.no.Dates.txt) <(sort -k 1,1 Check-Pattern-Matches-grepmail.txt) > Dates-Common-to-Both-Files-grepmail.txt ;
comm -23 <(sort -k 1,1 Dates.from.Emails.no.Dates.txt) <(sort -k 1,1 Check-Pattern-Matches-grepmail.txt) > Dates-missed-by-grepmail.txt
Grepmail appears to have a lot of promise for use in my spam project.
Attachment | Size |
---|---|
Dates-Common-to-Both-Files-grepmail.txt | 1.49 KB |
Dates-missed-by-grepmail.txt | 1 byte |
the syntax escapes me.
$ date -d 'Fri, 01 Feb 2019 00:07:49 -0800' +%c
Fri Feb 1 06:07:49 2019
That's not "it."
$ date -d 'Wed, 13 Jan 2021 00:14:24 -0700' +%c
Wed 13 Jan 2021 02:14:24 AM EST
$ date -R 'Wed, 13 Jan 2021 00:14:24 -0700'
date: invalid date ‘Wed, 13 Jan 2021 00:14:24 -0700’
but January 13 2021 really was a Wednesday ...
Man date says:
-R, --rfc-email
output date and time in RFC 5322 format. Example: Mon, 14 Aug 2006 02:34:56 -0600
My dates follow RFC 5322; what does bash date not like about those dates ?
grepmail won't find [verbatim] 'Wed, 13 Jan 2021 00:14:24 -0700' with a 'Wed 13 Jan 2021 02:14:24 AM EST' pattern.
One of the patterns is actually truncated ==> '18 Apr 2019 12:21:15 +0300' which is even worse.
Leafpad takes several minutes each time, but does find each target date in the 326 MB 1998-2022.Newest email file.
My attempts to force the %a, &d %b %Y %X %z sequence of fields have all been unsuccessful.
However, there is a way that doesn't require any hard labor: escape the spaces in the dates, making them simple strings.
sed 's/\ /\\ /g' Search-pattern.txt > Escaped-pattern.txt
Then execute these six grepmail commands:
grepmail -u 'Wed\, 13\ Jan\ 2021\ 00:14:24\ \-0700' 1998-2022.Newest > Email-01.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u '18\ Apr\ 2019\ 12:21:15\ \+0300' 1998-2022.Newest > Email-02.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Thu,\ 24\ Jun\ 2021\ 18:09:36\ \+0200' 1998-2022.Newest > Email-03.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Thu,\ 24\ Jun\ 2021\ 18:05:29\ \+0200' 1998-2022.Newest > Email-04.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 05:10:55\ \+0200' 1998-2022.Newest > Email-05.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 06:25:11\ \+0200' 1998-2022.Newest > Email-06.Found.by.grepmail-Escaped-pattern.txt ;
cat Email-0?.Found.by.grepmail-Escaped-pattern.txt > EmailsFound.by.grepmail-Escaped-pattern.txt
LC_ALL=C ./MB-Scrape-Emails-GL-PO EmailsFound.by.grepmail-Escaped-pattern.txt | sed 's/\t\t/\tnull\t/g' > Hairball-Escaped-pattern-Emails.txt
All six date-strings are in their respective Date column ($2).
Attachment | Size |
---|---|
EmailsFound.by_.grepmail-Escaped-pattern.txt | 147.4 KB |
Hairball-Escaped-pattern-Emails.txt | 6.3 KB |
MB-Scrape-Emails-GL-PO.txt | 2.11 KB |
Search-pattern.txt | 198 bytes |
Escaped-pattern.txt | 228 bytes |
what does bash date not like about those dates ?
The date command (not Bash) returns the error. It does so because you need the --date (-d) option to use another date and time than the current one:
$ date -Rd 'Wed, 13 Jan 2021 00:14:24 -0700'
Wed, 13 Jan 2021 04:14:24 -0300
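Note that date prints the result in the local time zone (here -0300), so the output is the same instant but not the same string as the input. A small sketch that makes this visible by pinning the zone to UTC; GNU date is assumed and the function name rfc_in_utc is only illustrative:

```shell
# Hypothetical helper: re-print an RFC 5322 date in UTC (GNU date assumed).
# The instant is preserved, but the hour and the offset change, so the
# result is generally not a verbatim copy of the input string.
rfc_in_utc() {
    LC_ALL=C TZ=UTC date -R -d "$1"
}

rfc_in_utc 'Wed, 13 Jan 2021 00:14:24 -0700'
# prints: Wed, 13 Jan 2021 07:14:24 +0000
```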
grepmail won't find [verbatim] 'Wed, 13 Jan 2021 00:14:24 -0700' with a 'Wed 13 Jan 2021 02:14:24 AM EST' pattern.
grepmail searches a regular expression and "Wed, 13 Jan 2021 00:14:24 -0700" does not match "Wed 13 Jan 2021 02:14:24 AM EST", obviously (additional comma, "2" instead of "0", and "-0700" instead of "AM EST").
My attempts to force the %a, &d %b %Y %X %z sequence of fields have all been unsuccessful.
If you are referring to date's output format, your "&d" should be "%d":
$ date '+%a, %d %b %Y %X %z'
Wed, 30 Mar 2022 21:04:39 -0300
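One caveat: %X follows the locale (an en_US locale prints a 12-hour time with AM/PM, as in the "02:14:24 AM EST" output quoted earlier in the thread), so a string meant to match mail headers is safer built from explicit fields under the C locale. A sketch, assuming GNU date; both function names are invented for this example, and the second one targets the "From - [date]" layout from the first post:

```shell
# Hypothetical helpers (GNU date assumed). LC_ALL=C keeps English day and
# month names; %H:%M:%S avoids the locale-dependent %X.
to_rfc_style() {
    LC_ALL=C date -d "$1" '+%a, %d %b %Y %H:%M:%S %z'
}

# Same instant in the "From - " line layout (%e pads the day with a space).
to_from_line_style() {
    LC_ALL=C date -d "$1" '+%a %b %e %H:%M:%S %Y'
}
```

Both still convert to the local time zone; export TZ (for example TZ=UTC) beforehand to control that.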
All six date-strings are in their respective Date column ($2).
That column contains the value of "Date" in the headers of the input emails, hence neither the "'From - [date]' line" nor the "Delivery-date" you were talking about in the original post.
Also, I do not see the point of separating six emails from the rest (in a terrible way; using Leafpad, six hard-coded dates, etc.). Why not simply apply MB-Scrape-Emails-GL-PO to all the emails? You can then select the relevant output lines, maybe with something like this (which only searches the values of the "Date" field, not the whole emails):
awk -F \\t '$2 == "Wed, 13 Jan 2021 00:14:24 -0700" || $2 == "18 Apr 2019 12:21:15 +0300" || $2 == "Thu, 24 Jun 2021 18:09:36 +0200" || $2 == "Thu, 24 Jun 2021 18:05:29 +0200" || $2 == "Fri, 03 Sep 2021 05:10:55 +0200" || $2 == "Fri, 03 Sep 2021 06:25:11 +0200"'
Only a short clarification; other matters will soon interfere.
Those six dates were scraped independently (about which, more later). They're in the six remaining unidentified emails and
serve to mark the "lost" emails. As grepmail is meant to search for strings, my escaping all the spaces in the six emails
with the sed command converted those dates to strings eliminating the dates' format from the syntax dilemma. The strings
that grepmail found in the 326 MB 1998-2022.Newest file are unique and are found in none of those other 10,000+ emails,
and they match perfectly the original unaltered dates.
I need to do a better job explaining how I found those six dates; I hastily skipped over that part of the task; I'm sorry.
Magic Banana's suggested script:
MB's suggested scraping script applied to 1998-2022.Newest appears to do what my longform script and the final script in my posting below do ==>
The one-line script is two hundred times faster and outputs about as much total information; 6 kB vs. 6.3 kB. The hare beats the tortoise, yet again showing the power of scripting for analyzing text files.
awk -F \\t '$2 == "Wed, 13 Jan 2021 00:14:24 -0700" || $2 == "18 Apr 2019 12:21:15 +0300" || $2 == "Thu, 24 Jun 2021 18:09:36 +0200" || $2 == "Thu, 24 Jun 2021 18:05:29 +0200" || $2 == "Fri, 03 Sep 2021 05:10:55 +0200" || $2 == "Fri, 03 Sep 2021 06:25:11 +0200"' Hairball-1998-2022.Newest.txt > Output-MBscript.txt
Attachment | Size |
---|---|
Output-MBscript.txt | 5.98 KB |
Hairball-Six-Dates-longform.txt | 6.3 KB |
The one-line script is two hundred times faster and outputs about as much total information; 6 kB vs. 6.3 kB.
Apart from the order of the lines ("my" AWK selection preserves the order in the input), the only difference is the absence of the header. If you want it and if the input has it, you can write:
awk -F \\t 'NR == 1 || $2 == "Wed, 13 Jan 2021 00:14:24 -0700" || $2 == "18 Apr 2019 12:21:15 +0300" || $2 == "Thu, 24 Jun 2021 18:09:36 +0200" || $2 == "Thu, 24 Jun 2021 18:05:29 +0200" || $2 == "Fri, 03 Sep 2021 05:10:55 +0200" || $2 == "Fri, 03 Sep 2021 06:25:11 +0200"'
I concur; adding the headers stretched the processing time to one second for those six dates.
The outputs are close indeed; see screenshot attached.
The remaining 10,302 emails in 1998-2022.Newest are all accessible by their X-UIDL fields.
If you were given a list of 10,000+ dates enclosed in quotes, how would you build a script like the one from
Thu, 03/31/2022 - 19:18 ?
Leafpad cannot place that many fields in a single row.
On the other hand it is easy to build a 10,000+ row longform script with Leafpad, but that script took several hours to finish.
Attachment | Size |
---|---|
Double-quoted-Plain-10277-UID-based-Dates.txt | 336.29 KB |
Double-quoted-10277-Escaped-space-UID-based-Dates.txt | 385.58 KB |
AWK can first store the dates in an array and then test whether $2 is in the array (FNR is the line number within the current file, so FNR == 1 keeps the header line of the second file):
$ awk -F \\t 'FILENAME == ARGV[1] { gsub(/"/, ""); a[$0] } FILENAME == ARGV[2] && (FNR == 1 || $2 in a)' Double-quoted-Plain-10277-UID-based-Dates.txt Hairball-1998-2022.Newest.txt
Continuing to account for all the emails in 1998-2022.Newest:
Steps to account for every email:
LC_ALL=C ./MB-Scrape-Emails-GL-PO 1998-2022.Newest | sed 's/\t\t/\tnull\t/g' > Hairball-1998-2022.Newest.txt
Gathers all 10308 of the individual emails. Takes about six seconds; 19.3 MB.
head -n 1 Hairball-1998-2022.Newest.txt | sed 's/\t/\n/g' | sed 's/:/\t/g' | sed 's/\#\ //g' > Headers-1998-2022.Newest.txt
Lists the headers of the numbered columns in the email summary:
1 Content analysis details
2 Date
3 Delivery-date
4 IPv4-addr-in-contents
5 IPv4-addr-in-headers
6 Message-ID
7 Return-Path
8 Subject
9 URL-in-contents
10 URL-in-headers
11 X-Account-Key
12 X-Spam-Score
13 X-UIDL
14 email-addr-in-contents
15 email-addr-in-headers
16 email-addr-with-at-in-contents
17 email-addr-with-at-in-headers
cut -f 2,13 Hairball-1998-2022.Newest.txt | awk -F '[\t]' 'NR -1 {print $2"\t"$1}' | sort -nk 1,1 > X-UIDL.and.Date.txt
Took half a second; has a couple of malformed lines at the beginning, 10277 OK, 25 nulls in Col.$2, and four malformed Col.$1 at the end ==> 2 + 10277 + 25 + 4 = 10308, totaling 19.3 MB. These thirty-one malformed sets of email data need further analysis.
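The tally above can be reproduced mechanically instead of by eye. A rough sketch, assuming X-UIDL.and.Date.txt is tab-separated with the X-UIDL in column 1 and the Date in column 2, and that a usable X-UIDL contains the text "UID" (as the grep commands that follow suggest); the category names and the function name tally_uidl_dates are made up here:

```shell
# Hypothetical helper: count rows of a two-column (X-UIDL <TAB> Date) file
# by category. Assumption: a usable X-UIDL contains the text "UID".
tally_uidl_dates() {
    awk -F '\t' '
        $1 ~ /UID/ && $2 == "null" { n["uid_without_date"]++; next }
        $1 ~ /UID/                 { n["uid_with_date"]++;    next }
                                   { n["no_uid"]++ }
        END {
            printf "uid_with_date\t%d\n",    n["uid_with_date"]
            printf "uid_without_date\t%d\n", n["uid_without_date"]
            printf "no_uid\t%d\n",           n["no_uid"]
        }
    ' "$1"
}

# e.g. tally_uidl_dates X-UIDL.and.Date.txt
```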
grep "null" X-UIDL.and.Date.txt | grep "UID" | awk '{print $1}' > UID-pattern.txt
<== use this as a starting point to grepmail those 25 dates.
grepmail -u -f UID-pattern.txt 1998-2022.Newest > Emails-UID-null.txt ==> 1.1 MB
<== Twenty-five emails containing the target UIDs.
LC_ALL=C ./MB-Scrape-Emails-GL-PO Emails-UID-null.txt | sed 's/\t\t/\tnull\t/g' > Hairball-UID-null.txt
<== MB's script-summary: Twenty-five dates, 18 of them in Col.$2, the other seven in Col.$3.
sed 's/\ /\\ /g' X-UIDL.and.Date.txt | grep -v "UID" | awk -F '[\t]' '{print $2}' > Six-strung-out-dates.txt
'Wed,\ 13\ Jan\ 2021\ 00:14:24\ -0700'
'18\ Apr\ 2019\ 12:21:15\ +0300'
'Thu,\ 24\ Jun\ 2021\ 18:09:36\ +0200'
'Thu,\ 24\ Jun\ 2021\ 18:05:29\ +0200'
'Fri,\ 03\ Sep\ 2021\ 05:10:55\ +0200'
'Fri,\ 03\ Sep\ 2021\ 06:25:11\ +0200'
Unsuccessful short-form script:
while read pattern
do grepmail -e $pattern 1998-2022.Newest > Mailbox-Six-Dates.txt
done < Six-strung-out-dates.txt
Saved as Script-Mailbox-Six-Dates.txt but fails.
Fall back to GL's longform technique, which is constructed on the Leafpad platform: ==>
grepmail -u 'Wed\, 13\ Jan\ 2021\ 00:14:24\ \-0700' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u '18\ Apr\ 2019\ 12:21:15\ \+0300' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Thu,\ 24\ Jun\ 2021\ 18:09:36\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Thu,\ 24\ Jun\ 2021\ 18:05:29\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 05:10:55\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 06:25:11\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt ;
Saved as Script-Mailbox-Six-Dates-longform.txt
./Script-Mailbox-Six-Dates-longform
==> Took 5.624 seconds; 150.9 kB
LC_ALL=C ./MB-Scrape-Emails-GL-PO Emails.Found.by.grepmail-Escaped-pattern.txt | sed 's/\t\t/\tnull\t/g' > Hairball-Six-Dates-longform.txt
Lists the original six dates in Col.$2.
This series of steps thereby enables MB's main script to summarize all of the emails in 1998-2022.Newest.
Attachment | Size |
---|---|
X-UIDL.and_.Date_.txt | 519.1 KB |
UID-pattern.txt | 518 bytes |
Emails-UID-null.txt | 1.08 MB |
Hairball-UID-null.txt | 78.65 KB |
Six-strung-out-dates.txt | 228 bytes |
Script-Mailbox-Six-Dates.txt | 115 bytes |
Script-Mailbox-Six-Dates-longform.txt | 707 bytes |
Hairball-Six-Dates-longform.txt | 6.3 KB |
Saved as Script-Mailbox-Six-Dates.txt but fails.
Every output of grepmail overwrites the previous one because you redirect with ">" (not ">>") inside the loop (using ">" after "done" would work). Like grep, grepmail (which I have never used) must have an option -f to match a set of regular expressions:
$ grepmail -f Six-strung-out-dates.txt 1998-2022.Newest > Mailbox-Six-Dates.txt
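If the loop is kept, here is a minimal sketch of the fix described above: quote the variable, read with -r so backslashes survive, and redirect once after "done" so every iteration writes into the same output. The function name search_each and its argument order are only for illustration; the search command is passed in as a parameter, and with grepmail one would presumably pass "grepmail -u" (an assumption based on the commands used earlier in the thread):

```shell
# Hypothetical helper: run a search command once per pattern line and
# collect all results in a single stream.
# usage: search_each PATTERN_FILE MAILBOX COMMAND [OPTIONS...]
search_each() {
    patterns=$1
    mailbox=$2
    shift 2
    while IFS= read -r pattern
    do
        # "$pattern" is quoted so a date with spaces stays one argument.
        "$@" "$pattern" "$mailbox"
    done < "$patterns"
}

# e.g. search_each plain-dates.txt 1998-2022.Newest grepmail -u > Mailbox-Six-Dates.txt
# (plain-dates.txt is a hypothetical file of bare dates, one per line)
```

Because the single ">" sits outside the function, the output file is opened once and nothing is overwritten between iterations.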
The plot thickens ...
Magic Banana found errors in my script:
while read pattern
do grepmail -e $pattern 1998-2022.Newest > Mailbox-Six-Dates.txt
done < Six-strung-out-dates.txt
Based on the premise that this script processes one date at a time and concatenates the outputs:
while read pattern
do grepmail -u $pattern 1998-2022.Newest >> Mailbox-Six-Dates.Corrected-U.txt
done < Six-strung-out-dates.txt
Saved as Script-Mailbox-Six-Dates.Corrected-U.txt
cp Script-Mailbox-Six-Dates.Corrected-U.txt Script-Mailbox-Six-Dates.Corrected-U
sudo chmod +x Script-Mailbox-Six-Dates.Corrected-U
./Script-Mailbox-Six-Dates.Corrected-U
Fails to treat each escaped-space date as a single string, yet creates 246 kB email file containing 16 emails,
none containing a sought date.
The proposed three-line script outwardly adheres to the syntax of the six-line longform script, but the difference
is that in the longform script the escaped-spaces dates are explicitly written out, whereas the three-line script
depends on verbatim transfers of said dates, one long string at a time. What have I missed ?
Double quotes around $pattern, I believe. Also escaping the spaces looks useless and I still do not understand why you would not use option -f.
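Two things happen to a line of Six-strung-out-dates.txt on its way through that loop, and both can be checked without grepmail. Quote characters read from a file are plain data (the shell only treats quotes as syntax when they are typed in the script itself), read without -r removes the backslashes, and an unquoted $pattern is then split at every space. A small sketch; count_args and demo_word_splitting are made-up names, and count_args just reports how many arguments it received:

```shell
# Hypothetical helper: print the number of arguments received.
count_args() {
    echo "$#"
}

demo_word_splitting() {
    # One line exactly as stored in the pattern file: the single quotes and
    # backslashes are literal characters. read without -r drops the
    # backslashes; the single quotes stay.
    read pattern <<'EOF'
'Fri,\ 03\ Sep\ 2021\ 06:25:11\ +0200'
EOF
    echo "after read: $pattern"
    echo "unquoted:   $(count_args $pattern) arguments"
    echo "quoted:     $(count_args "$pattern") argument"
}

demo_word_splitting
# after read: 'Fri, 03 Sep 2021 06:25:11 +0200'
# unquoted:   6 arguments
# quoted:     1 argument
```

So the unquoted loop hands grepmail only the first word as the pattern and the remaining words as extra arguments, while the longform script, where the quotes are real shell syntax, passes one argument.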
Here's a script that works OK, but the twenty-five patterns are all single strings:
grepmail -u -f UID-pattern.txt 1998-2022.Newest > Emails-UID-null-GL.txt
The longform script has escaped spaces and single-quote-enclosed fields and works OK:
grepmail -u 'Fri,\ 03\ Sep\ 2021\ 06:25:11\ \+0200' 1998-2022.Newest >> Emails.Found.by.grepmail-Escaped-pattern.txt
Zero bytes output:
grepmail -u "Fri, 03 Sep 2021 06:25:11 +0200" 1998-2022.Newest >> Emails.Found.by.grepmail-Double-quoted-pattern.txt
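A likely reason for the empty result, assuming grepmail treats the pattern as a regular expression (as noted earlier in the thread): an unescaped "+" is a quantifier, so " +0200" means "one or more spaces, then 0200" and cannot match a literal plus sign. The longform commands would then work because they contain "\+", not because the spaces are escaped. The sketch below escapes the metacharacters mechanically and uses grep -E as a stand-in to show the effect; escape_regex is an invented name, and grepmail's regular expressions may differ from grep -E in other details:

```shell
# Hypothetical helper: backslash-escape regular-expression metacharacters
# so that a date is matched literally.
escape_regex() {
    sed 's/[][(){}.*+?^$|\\]/\\&/g'
}

sample='Date: Fri, 03 Sep 2021 06:25:11 +0200'
raw='Fri, 03 Sep 2021 06:25:11 +0200'
escaped=$(printf '%s\n' "$raw" | escape_regex)

# Count matching lines with the raw and with the escaped pattern.
printf '%s\n' "$sample" | grep -Ec "$raw" || true   # prints 0: "+" read as a quantifier
printf '%s\n' "$sample" | grep -Ec "$escaped"       # prints 1: "\+" matches the plus sign
```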
EDIT ==> Found a strawman to get around mboxgrep's and grepmail's aversions to dates. The squeamish should look away.
First, in plain English:
(1) Use grep to find a suitable quantity of text around each date;
(2) Search grep's output for another email identifier, such as X-UIDL;
(3) Capture the found X-UIDL data;
(4) Save that data as an individual pattern (the strawman) file;
(5) Apply grepmail and the strawman to collect the associated emails;
(6) Confirm by generating a dedicated hairball with MB-Scrape-Emails-GL-PO.
Here is the first line of seventeen in the said script:
grep -B 25 -e "10 Mar 2021 12:39:12 -0200" 1998-2022.Newest | grep -e "X-UIDL:" '-' | sed 's/X-UIDL:\ //g' | tr -d '<>' > Pattern-X-UIDL ; grepmail -u -f Pattern-X-UIDL 1998-2022.Newest >> Combined-App-Invalid-Date-Emails.txt
When I attempted a one-line version of the script:
grep -B 25 -ef Invalid-Double-Quoted-Dates.txt 1998-2022.Newest | grep -e "X-UIDL:" '-' | sed 's/1998-2022.Newest-X-UIDL: //g' | tr -d '<>' > Pattern-X-UIDL ; grepmail -u -f Pattern-X-UIDL 1998-2022.Newest
I was stymied because the pattern data accumulated in Pattern-X-UIDL rather than overwriting the file, resulting in multiple copies of the emails in the output.
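One detail that may explain part of the trouble: grep reads "-ef" as "-e f" (search for the pattern "f"), which leaves Invalid-Double-Quoted-Dates.txt as a second file to search instead of a pattern file; that would also account for the "1998-2022.Newest-" prefix the sed command has to strip. Reading patterns from a file takes -f, and the double quotes in the file would then need removing. As an alternative that avoids the "-B 25" guess altogether, here is a sketch in awk that remembers each message's X-UIDL and prints it when the Date header is one of the listed dates. It assumes messages start with a "From " line, that X-UIDL comes before Date within a message, and that the dates file holds one double-quoted date per line; uidl_for_dates is a name invented for this example:

```shell
# Hypothetical helper: print the X-UIDL of every message whose Date header
# equals one of the (double-quoted) dates listed in the first file.
# Assumptions: "From " starts a message; X-UIDL precedes Date in the headers.
uidl_for_dates() {
    awk '
        FILENAME == ARGV[1] { gsub(/"/, ""); wanted[$0]; next }
        /^From /    { uidl = "" }                  # new message begins
        /^X-UIDL: / { uidl = substr($0, 9) }       # text after "X-UIDL: "
        /^Date: /   { if (uidl != "" && (substr($0, 7) in wanted)) print uidl }
    ' "$1" "$2"
}

# e.g. uidl_for_dates Invalid-Double-Quoted-Dates.txt 1998-2022.Newest > Pattern-X-UIDL
```

Writing the list with a single ">" overwrites Pattern-X-UIDL on each run, and the file can then be passed once to "grepmail -u -f" as in the commands above.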
Attachment | Size |
---|---|
UID-pattern.txt | 518 bytes |
Emails-UID-null-GL.txt | 1.08 MB |
Script-Mailbox-Six-Dates-longform.txt | 707 bytes |
Emails.Found_.by_.grepmail-Escaped-pattern.txt | 159.95 KB |
Six-double-quoted-Dates.txt | 199 bytes |
Invalid-Double-Quoted-Dates.txt | 1.11 KB |
Script-Invalid-Double-Quoted-Dates.txt | 3.79 KB |
Combined-App-Invalid-Date-Emails.txt | 249.75 KB |
Hairball-Invalid-Date-Emails.txt | 17.22 KB |