Scraping data from a list of emails
Starting from a script prepared by Magic Banana, the need arises to:
(1) Add a few arguments such that all the emails can be accounted for
Here's the pertinent portion of Magic Banana's script:
# One pair (name, regexp) per line in $DISTINCT_IN_HEADERS_OR_CONTENT to get the distinct matching strings in the headers and the content.
To which I'd like to add a function that returns the epochal time:
DISTINCT_IN_HEADERS_OR_CONTENT='URL https?://[a-zA-Z0-9./?=_%:-]*
IPv4-addr (([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])
email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+
email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+'
Epochal-date date -d "string" +%s
Can $string simply be $date ?
(2) Handle the few that seem to "get away"
There are four rows at the bottom of the following list that aren't "following the rules"
cut -f 2,13 TargetHairball-MBGL.txt | awk -F '[\t]' 'NR -1 {print $2"\t"$1}' | sort -nk 1,1 > X-UIDL.and.Date.test.txt
The first three attachments are some of the corresponding emails.
(3) Capture suitable dates for all the emails
LC_ALL=C ./MB-Scrape-Emails-GL TargetEmailFile.txt | sed 's/\t\t/\tnull\t/g' > TargetHairball-MBGL.txt
Capturing just "Date" misses about 25 emails:
cut -f 2,13 Hairball-MBGL.1998-2022.Newest.txt | awk -F '[\t]' 'NR -1 {print $2"\t"$1}' | sort -nk 1,1 > X-UIDL.and.Date.txt
sort -k 2,2 X-UIDL.and.Date.txt | grep "null" '-' | awk '{print $1}' > X-UIDL.withno.Date.txt
The 25 nulls select the X-UIDLs listed in X-UIDL.withno.Date.txt, but the emails for all 25 make too large a file for the forum, so I shortened the list (X-UIDL.withno.Date01.txt):
mboxgrep "($(cut -d ' ' -f 1 X-UIDL.withno.Date01.txt | tr \\n \| |sed 's/|$//'))" 1998-2022.Newest > Emails.withno.Date03.txt
Leafpad finds the pertinent Emails easily ... and it turns out that they do have "Date" fields, somehow missed by the script.
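One caveat about the selection step: grep "null" matches that string anywhere on a line, not only in the Date column. A minimal sketch of a stricter selection, assuming the two-column, tab-separated layout of X-UIDL.and.Date.txt (X-UIDL first, Date second); the function name null_ids is hypothetical:

```shell
# Print the first column of every row whose second tab-separated
# field is exactly "null" (not merely a line containing "null").
null_ids() {
  awk -F '\t' '$2 == "null" { print $1 }' "$@"
}
```

Usage would be along the lines of: null_ids X-UIDL.and.Date.txt > X-UIDL.withno.Date.txt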
Attachment | Size |
Email-end-of-Date-File-01.txt | 6.23 KB |
Email-end-of-Date-File-02.txt | 6.1 KB |
Email-end-of-Date-File-03.txt | 12.54 KB |
TargetEmailFile.txt | 1.02 MB |
TargetHairball-MBGL.txt | 82.09 KB |
Emails.withno.Date03.txt | 350.35 KB |
MB-Scrape-Emails-GL.txt | 2.09 KB |
X-UIDL.and_.Date_.test_.txt | 2.36 KB |
X-UIDL.withno.Date01.txt | 265 bytes |
MB-Scrape-Emails-GL.txt already scrapes the "Date" in the headers of TargetEmailFile.txt. It is the second column of the output. As far as I understand (but you are unclear), you want in an 18th column, named "18:Epochal-date", that date in seconds since 1970-01-01 00:00:00 UTC. If so, you can post-process MB-Scrape-Emails-GL.txt's output with:
awk -F \\t '{ printf "%s\t", $0; if (NR - 1) { cmd = "date -d \"" $2 "\" +%s"; cmd | getline; close(cmd); print } else print "18:Epochal-date" }'
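The same post-processing can also be sketched as a plain shell loop, which makes it easier to write "null" when the Date field is empty or cannot be parsed. This is only a sketch: it assumes GNU date, tab-separated input with a header row, and the Date in the second column; the function name add_epoch is hypothetical.

```shell
# Append an epoch-seconds column to tab-separated input read from stdin.
# Row 1 is treated as the header; rows whose 2nd field is empty or
# unparseable get "null" instead of a number. Requires GNU date.
add_epoch() {
  local first=1 line d epoch
  while IFS= read -r line; do
    if [ "$first" -eq 1 ]; then
      printf '%s\t%s\n' "$line" "18:Epochal-date"
      first=0
      continue
    fi
    d=$(printf '%s\n' "$line" | cut -f 2)
    # An empty string would make GNU date return today's midnight, so guard it.
    if [ -n "$d" ] && epoch=$(date -d "$d" +%s 2>/dev/null); then
      :
    else
      epoch=null
    fi
    printf '%s\t%s\n' "$line" "$epoch"
  done
}
```

It would be used as a filter, for example: ./MB-Scrape-Emails-GL TargetEmailFile.txt | add_epoch. It starts one date process per row, so it is no faster than the awk version.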
Here's the kind of post-processing script that Magic Banana probably had in mind:
cut -f 2,13 Hairball-MBGL.1998-2022.Newest.txt | awk -F '[\t]' 'NR -2 {print $2"\t"$1}' | sort -nk 1,1 | awk -F \\t '{ printf $0 "\t"; if (NR - 1) { cmd = "date -d \"" $2 "\" +%s"; cmd | getline; close(cmd); print } else print "18:Epochal-date" }' >
which accounts for nearly all of the target emails. Thanks !
The ones left out may yet have some importance.
First, there are the two at the top of the output file's list:
account8 Wed, 13 Jan 2021 00:14:24 -0700 18:Epochal-date See ErrantEmail-Rows-1198996-1199093.txt
where the date is buried below the headers.
null 18 Apr 2019 12:21:15 +0300 1555579275 See ErrantEmail-Rows-931849-933292.txt
wherein that date is similarly buried. Neither date is accounted for in the scripts that we have devised, nor are their X-UIDL's.
There are also the last four rows of the output file of our script, where the Col.$1 numbers are all X-Spam-Scores instead of X-UIDLs:
135 Thu, 24 Jun 2021 18:09:36 +0200 1624550976 See: ErrantEmail-Rows-2727248-2747365.txt
149 Thu, 24 Jun 2021 18:05:29 +0200 1624550729 See: ErrantEmail-Rows-2747129-2747245.txt
244 Fri, 03 Sep 2021 05:10:55 +0200 1630638655 See: ErrantEmail-Rows-3431322-3431485.txt
244 Fri, 03 Sep 2021 06:25:11 +0200 1630643111 See: ErrantEmail-Rows-3431563-3431726.txt
None of these four look out of the ordinary to me.
The "null" emails' data have already been scraped with:
cut -f 2,13 Hairball-MBGL.1998-2022.Newest.txt | awk -F '[\t]' 'NR -1 {print $2"\t"$1}' | sort -nk 1,1 > X-UIDL.and.Date.txt
sort -k 2,2 X-UIDL.and.Date.txt | grep "null" '-' | awk '{print $1}' > X-UIDL.withno.Date.txt
That lists the 25 X-UIDLs with null dates, which were used to select Emails.withno.Date03.txt; the corresponding emails and their dates can only be found in TargetEmailFile.txt by searching with Leafpad.
Is there some modification of the scripts that could reduce the labor of tracking these 35-odd missing emails ?
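One way to reduce the manual searching might be an independent listing of X-UIDL and Date taken straight from the mailbox, to compare against the scraper's output. The following is a rough sketch, not a fix for the scraper itself. It assumes the file is in mbox format (each message starts with a "From " line, headers end at the first blank line), it matches header names case-insensitively, and it ignores folded (multi-line) headers. The function name list_dates is hypothetical.

```shell
# Print "X-UIDL<TAB>Date" for every message in an mbox file,
# writing "null" for a header that is absent from the header block.
list_dates() {
  awk '
    function emit(   u, d) {
      if (!seen) return
      u = (uidl == "" ? "null" : uidl)
      d = (date == "" ? "null" : date)
      print u "\t" d
    }
    /^From / { emit(); seen = 1; in_headers = 1; uidl = ""; date = ""; next }
    in_headers && /^$/ { in_headers = 0; next }
    in_headers && uidl == "" && tolower($0) ~ /^x-uidl:/ {
      sub(/^[^:]*:[ \t]*/, ""); uidl = $0; next
    }
    in_headers && date == "" && tolower($0) ~ /^date:/ {
      sub(/^[^:]*:[ \t]*/, ""); date = $0; next
    }
    END { emit() }
  ' "$@"
}
```

Rows that show a date here but "null" in the scraper's column 2 would be the ones to inspect; a date that appears only below the headers (as in the two emails at the top of the list) will still show as "null".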
Attachment | Size |
ErrantEmail-Rows-931849-933292.txt | 101.63 KB |
ErrantEmail-Rows-1198996-1199093.txt | 8.35 KB |
ErrantEmail-Rows-2747129-2747245.txt | 6.1 KB |
ErrantEmail-Rows-2727248-2747365.txt | 6.23 KB |
ErrantEmail-Rows-3431322-3431485.txt | 12.58 KB |
ErrantEmail-Rows-3431563-3431726.txt | 12.57 KB |
X-UIDL.and_.CalendarDate.to_.EpochalTime.txt | 630.58 KB |
In my original post I overlooked Magic Banana's superb scripts that capture the emails with proper dates:
Fri, 02/18/2022 - 11:47
Magic Banana wrote: As far as I understand, you want:
the whole lines whose last field is an IPv4 address:
$ grep -E ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt > Resolved-2k-Domains.txt
[and] the first field of the remaining lines:
$ grep -vE ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt | cut -d ' ' -f 1 > Unresolved-two-dozen-Domains.txt
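The pattern [0-9]{1,3} also accepts octets above 255, such as 999. If that matters, the stricter octet expression from the IPv4-addr line of the scraper could be reused. A small sketch under that assumption; the variable and function names and the sample lines are hypothetical:

```shell
# Octet pattern copied from the scraper's IPv4-addr regexp (0..255 only).
octet='([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])'
# A space, then four dot-separated octets at the end of the line.
ipv4_last_field=" (${octet}\.){3}${octet}\$"

# Keep only the lines whose last field is a valid IPv4 address.
resolved_lines() {
  grep -E "$ipv4_last_field" "$@"
}

printf '%s\n' 'example.org 93.184.216.34' 'bad.example 999.1.1.1' 'unresolved.example' | resolved_lines
```

Only the first sample line is printed; adding -v to grep would give the complementary set, as in the second command above.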
Attachment | Size |
Mixed-Types.txt | 1.84 KB |
Do the "missing emails" contain numeric potions after the @ sign? If I understand your regex it seems to only cater for alphabetical characters after the @ sign?
email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+
email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+'
I mean numeric "portions" although regex is somehow akin to a numeric potion
PrimeOrdeal has hit us with an insightful question; the answer is "yes".
Let me explain ...
First, I modified Magic Banana's script as follows:
email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+ ==> email-addr [a-zA-Z0-9._]+@[a-zA-Z0-9._]+.[a-zA-Z0-9._]+
email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+' ==> email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z0-9._]+.[a-zA-Z0-9._]+'
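To see what the change does, the two patterns can be tried on a single address with a digit in its domain. The sample line below is made up for illustration. The third pattern, strict, is not from the thread: it is only a suggestion that escapes the dot (an unescaped . matches any character) and admits hyphens in the domain.

```shell
old='[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+'
new='[a-zA-Z0-9._]+@[a-zA-Z0-9._]+.[a-zA-Z0-9._]+'
# Hypothetical stricter variant: literal dots, hyphens allowed.
strict='[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+'

# Hypothetical sample header line.
sample='Return-Path: <bounce@mail123.example.com>'

# Print only the part of the sample matched by the given pattern.
extract() {
  printf '%s\n' "$sample" | LC_ALL=C grep -oE "$1"
}

extract "$old"     # bounce@mail  (cut short at the first digit)
extract "$new"     # bounce@mail123.example.com
extract "$strict"  # bounce@mail123.example.com
```

The truncated match from the original pattern is consistent with short entries such as "mail" or "srv" turning up as domains when digits are not admitted.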
Then I executed the following pairs of scripts, first, as originally written and then suitably modified:
LC_ALL=C ./MB-Scrape-Emails-GL Emails.withno.Date.txt | sed 's/\t\t/\tnull\t/g'> Hairball.nodate.Newest-GL
cut -f 13,14,15,16,17 Hairball.nodate.Newest-GL | awk -F '[\t]' 'NR -2 {print $1"\t"$2"\t"$3"\t"$4"\t"$5}' > CheckEmailsForum.txt
Followed by:
LC_ALL=C ./MB-Scrape-Emails-GL-PO Emails.withno.Date.txt | sed 's/\t\t/\tnull\t/g'> Hairball.nodate.Newest-GL-PO
cut -f 13,14,15,16,17 Hairball.nodate.Newest-GL-PO | awk -F '[\t]' 'NR -2 {print $1"\t"$2"\t"$3"\t"$4"\t"$5}' > CheckEmailsForum-PO.txt
The second "CheckEmailsForum" file is larger than the first; both are attached. The header row of each file names the five columns, 13 through 17 of the scraper's output, which appear here as $1 through $5; $1 (originally column 13) is the X-UIDL of each email. "null" marks fields for which the scraper found nothing.
There are about 1.6 kB of additional numerical domains. More scripting is necessary to enumerate exclusively what those domains are.
Fifty-two of sixty-one listed emails have increased quantities of email addresses, but we're not finding the missing emails this way.
Attachment | Size |
CheckEmailsForum.txt | 11.27 KB |
CheckEmailsForum-PO.txt | 12.82 KB |
As promised, here is a further analysis of the effect of admitting alphanumeric domains:
cut -f 1,3 CheckEmailsForum.txt | awk -F '[\t]' 'NR!=1 {print $2}' | sed 's/\,/\n/g' > Alphabetical.Emails.txt ;
awk -F '[\@]' '{print $2}' Alphabetical.Emails.txt | sed 's/\//\ /g' | awk -F '[\ ]' '{print $1}' | sort -u > Alphabetical.Domains.txt
cut -f 1,3 CheckEmailsForum-PO.txt | awk -F '[\t]' 'NR!=1 {print $2}' | sed 's/\,/\n/g' | sed 's/\?/\ /g' > Alphanumeric.Emails.txt ;
awk -F '[\@]' '{print $2}' Alphanumeric.Emails.txt | sed 's/\//\ /g' | awk -F '[\ ]' '{print $1}' | sort -u > Alphanumeric.Domains.txt
diff -y --suppress-common-lines --width=100 <(sort -k 1,1 Alphabetical.Domains.txt) <(sort -k 1,1 Alphanumeric.Domains.txt) > Alphanumeric.differencefrom.Alphabetical.Domains.txt
All those sed commands replace unwanted characters with field-separating spaces.
The difference file looks like this:
ak.mta | |
ams-node |
bnc |
bounce.govexec |
bounces.elasticemail |
dcfdpemail.verizon |
direct-siege |
e.defenseone |
host.maintained | |
jwm |
mail |
mail.groupage |
mailin.mcsv |
messages.cisa |
mx.sailthru | |
power.digitalesolution |
r3IZ | r3IZ-services
rareauctions.ccsend | | |
rtec-instruments |
scheduler.constantcontact |
server |
srv |
tracking.slcb |
unsubscribe.qemailserver |
unsub.spmta |
vip | | |
Some of the entries differ dramatically because the missed number cuts the domain short.
Other differences exist because the embedded numbers cause complete loss of the alphanumeric email addresses.
Some are a mystery to me. The comparison was made on the email addresses in the email headers.
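Since the side-by-side diff is hard to read, a one-column list of the domains that appear only in the alphanumeric file may be easier to work with. A minimal sketch using comm, which needs both inputs sorted in the same collation order; the function name only_in_second is hypothetical:

```shell
# Print the lines that occur in the second file but not in the first.
# Both files are re-sorted under LC_ALL=C so that comm sees a consistent order.
only_in_second() {
  local a b
  a=$(mktemp)
  b=$(mktemp)
  LC_ALL=C sort -u "$1" > "$a"
  LC_ALL=C sort -u "$2" > "$b"
  LC_ALL=C comm -13 "$a" "$b"
  rm -f "$a" "$b"
}
```

For example: only_in_second Alphabetical.Domains.txt Alphanumeric.Domains.txt. Swapping the two arguments lists the truncated domains that exist only in the alphabetical file.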
Attachment | Size |
CheckEmailsForum.txt | 12.48 KB |
CheckEmailsForum-PO.txt | 14.02 KB |
Alphabetical.Emails.txt | 8.36 KB |
Alphanumeric.Emails.txt | 9.25 KB |
Alphabetical.Domains.txt | 1.02 KB |
Alphanumeric.Domains.txt | 1.38 KB |