Scraping data from a list of emails

7 replies [Latest post]
amenex
Offline
Joined: 01/04/2015

Starting from a script prepared by Magic Banana, the need arises to:
(1) Add a few arguments such that all the emails can be accounted for
Here's the pertinent portion of Magic Banana's script:
# One pair (name, regexp) per line in $DISTINCT_IN_HEADERS_OR_CONTENT to get the distinct matching strings in the headers and the content.
DISTINCT_IN_HEADERS_OR_CONTENT='URL https?://[a-zA-Z0-9./?=_%:-]*
IPv4-addr (([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])
email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+
email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+'
To which I'd like to add a function that returns the epochal time:
Epochal-date date -d "string" +%s
Can $string simply be $date ?
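For reference, GNU date accepts such a scraped header string directly, whatever the shell variable is called; a minimal sketch (the sample value is invented):

```shell
# Minimal sketch, assuming GNU date: the scraped "Date:" value is passed as-is.
scraped='Thu, 01 Jan 1970 00:00:00 +0000'   # invented sample header value
date -d "$scraped" +%s
# → 0
```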

(2) Handle the few that seem to "get away"
There are four rows at the bottom of the following list that aren't "following the rules":
cut -f 2,13 TargetHairball-MBGL.txt | awk -F '[\t]' 'NR -1 {print $2"\t"$1}' | sort -nk 1,1 > X-UIDL.and.Date.test.txt
The first three attachments are some of the corresponding emails.

(3) Capture suitable dates for all the emails
LC_ALL=C ./MB-Scrape-Emails-GL TargetEmailFile.txt | sed 's/\t\t/\tnull\t/g' > TargetHairball-MBGL.txt
Capturing just "Date" misses about 25 emails:
cut -f 2,13 Hairball-MBGL.1998-2022.Newest.txt | awk -F '[\t]' 'NR -1 {print $2"\t"$1}' | sort -nk 1,1 > X-UIDL.and.Date.txt
sort -k 2,2 X-UIDL.and.Date.txt | grep "null" '-' | awk '{print $1}' > X-UIDL.withno.Date.txt
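As an aside, since the scraped Date sits in the second tab-separated column of X-UIDL.and.Date.txt, the sort | grep | awk selection could be collapsed into one awk pass; a sketch, with an invented two-line input standing in for the real file:

```shell
# One-pass selection of X-UIDLs whose Date column is "null"
# (column 1 = X-UIDL, column 2 = scraped Date; input lines are invented):
printf 'id1\tWed, 13 Jan 2021 00:14:24 -0700\nid2\tnull\n' |
awk -F '\t' '$2 == "null" { print $1 }'
# → id2
```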

The 25 nulls select X-UIDL.withno.Date.txt, but all 25 together produce too large a file for the forum ... shorten the list:
mboxgrep "($(cut -d ' ' -f 1 X-UIDL.withno.Date01.txt | tr \\n \| |sed 's/|$//'))" 1998-2022.Newest > Emails.withno.Date03.txt
Leafpad finds the pertinent Emails easily ... and it turns out that they do have "Date" fields, somehow missed by the script.
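The pattern handed to mboxgrep is just the first column of the list joined with |; the construction can be seen in isolation (the IDs are invented):

```shell
# Build an alternation pattern from the first column of a list (IDs invented):
printf 'id1 x\nid2 y\nid3 z\n' > X-UIDL.sample.txt
cut -d ' ' -f 1 X-UIDL.sample.txt | tr '\n' '|' | sed 's/|$//'
# → id1|id2|id3
```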

Attachment Size
Email-end-of-Date-File-01.txt 6.23 KB
Email-end-of-Date-File-02.txt 6.1 KB
Email-end-of-Date-File-03.txt 12.54 KB
TargetEmailFile.txt 1.02 MB
TargetHairball-MBGL.txt 82.09 KB
Emails.withno.Date03.txt 350.35 KB
MB-Scrape-Emails-GL.txt 2.09 KB
X-UIDL.and_.Date_.test_.txt 2.36 KB
X-UIDL.withno.Date01.txt 265 bytes
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

MB-Scrape-Emails-GL.txt already scrapes the "Date" in the headers of TargetEmailFile.txt; it is the second column of the output. As far as I understand (though you are unclear), you want an 18th column, named "18:Epochal-date", giving that date in seconds since 1970-01-01 00:00:00 UTC. If so, you can post-process MB-Scrape-Emails-GL.txt's output with:
awk -F \\t '{ printf $0 "\t"; if (NR - 1) { cmd = "date -d \"" $2 "\" +%s"; cmd | getline; close(cmd); print } else print "18:Epochal-date" }'
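One note for anyone reusing this: `cmd | getline` with no variable overwrites $0, which happens to be harmless here because $0 was already printed; reading into a variable makes that explicit. A self-contained check on a two-line sample (the column names and X-UIDL are invented):

```shell
# Append epoch seconds as a new column; the header row gets the column name.
printf '1:X-UIDL\t2:Date\nabc123\tThu, 01 Jan 1970 00:00:00 +0000\n' |
awk -F '\t' '{
  printf "%s\t", $0
  if (NR - 1) {                          # data rows
    cmd = "date -d \"" $2 "\" +%s"
    cmd | getline epoch                  # read the converted value into a variable
    close(cmd)
    print epoch
  } else
    print "18:Epochal-date"              # header row
}'
```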

amenex
Offline
Joined: 01/04/2015

Here's such a post-processing script that Magic Banana was probably visualizing:
cut -f 2,13 Hairball-MBGL.1998-2022.Newest.txt | awk -F '[\t]' 'NR -2 {print $2"\t"$1}' | sort -nk 1,1 | awk -F \\t '{ printf $0 "\t"; if (NR - 1) { cmd = "date -d \"" $2 "\" +%s"; cmd | getline; close(cmd); print } else print "18:Epochal-date" }' > X-UIDL.and.CalendarDate.to.EpochalTime.txt
which accounts for nearly all of the target emails. Thanks!
The ones left out may yet have some importance.

First, there are the two at the top of the output file's list:

account8 Wed, 13 Jan 2021 00:14:24 -0700 18:Epochal-date (see ErrantEmail-Rows-1198996-1199093.txt)
where the date is buried below the headers; and
null 18 Apr 2019 12:21:15 +0300 1555579275 (see ErrantEmail-Rows-931849-933292.txt)
wherein that date is similarly buried. Neither date is accounted for in the scripts that we have devised, nor are their X-UIDLs.

There are also the last four rows of the output file of our script, where the Col.$1 numbers are all X-Spam-Scores instead of X-UIDLs:
135 Thu, 24 Jun 2021 18:09:36 +0200 1624550976 See: ErrantEmail-Rows-2727248-2747365.txt
149 Thu, 24 Jun 2021 18:05:29 +0200 1624550729 See: ErrantEmail-Rows-2747129-2747245.txt
244 Fri, 03 Sep 2021 05:10:55 +0200 1630638655 See: ErrantEmail-Rows-3431322-3431485.txt
244 Fri, 03 Sep 2021 06:25:11 +0200 1630643111 See: ErrantEmail-Rows-3431563-3431726.txt
None of these four look out of the ordinary to me.

The "null" emails' data have already been scraped with:
cut -f 2,13 Hairball-MBGL.1998-2022.Newest.txt | awk -F '[\t]' 'NR -1 {print $2"\t"$1}' | sort -nk 1,1 > X-UIDL.and.Date.txt
then:
sort -k 2,2 X-UIDL.and.Date.txt | grep "null" '-' | awk '{print $1}' > X-UIDL.withno.Date.txt
listing the 25 nulls used to select Emails.withno.Date03.txt, whose corresponding emails and their dates can only be found in TargetEmailFile.txt with Leafpad.
Is there some modification of the scripts that could reduce the labor of tracking these 35-odd missing emails ?

Attachment Size
ErrantEmail-Rows-931849-933292.txt 101.63 KB
ErrantEmail-Rows-1198996-1199093.txt 8.35 KB
ErrantEmail-Rows-2747129-2747245.txt 6.1 KB
ErrantEmail-Rows-2727248-2747365.txt 6.23 KB
ErrantEmail-Rows-3431322-3431485.txt 12.58 KB
ErrantEmail-Rows-3431563-3431726.txt 12.57 KB
X-UIDL.and_.CalendarDate.to_.EpochalTime.txt 630.58 KB
amenex
Offline
Joined: 01/04/2015

In my original post I overlooked Magic Banana's superb scripts that capture the emails with proper dates:
Fri, 02/18/2022 - 11:47
Magic Banana wrote: As far as I understand, you want:
the whole lines whose last field is an IPv4 address:

$ grep -E ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt > Resolved-2k-Domains.txt
[and] the first field of the remaining lines:
$ grep -vE ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt | cut -d ' ' -f 1 > Unresolved-two-dozen-Domains.txt

Attachment Size
Mixed-Types.txt 1.84 KB
PrimeOrdeal
Offline
Joined: 09/15/2019

Do the "missing emails" contain numeric potions after the @ sign? If I understand your regex it seems to only cater for alphabetical characters after the @ sign?

email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+

email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+'

PrimeOrdeal
Offline
Joined: 09/15/2019

I mean numeric "portions" although regex is somehow akin to a numeric potion

amenex
Offline
Joined: 01/04/2015

PrimeOrdeal has hit us with an insightful question; the answer is "yes".
Let me explain ...
First, I modified Magic Banana's script as follows:
email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+ ==> email-addr [a-zA-Z0-9._]+@[a-zA-Z0-9._]+.[a-zA-Z0-9._]+
email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+' ==> email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z0-9._]+.[a-zA-Z0-9._]+'
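A quick way to see the effect of the widened character class is to run both expressions over one of the numeric domains from the diff output below (the local part "user" is invented):

```shell
addr='user@mail10.sea61.rsgsv.net'   # domain taken from the diff output below
printf '%s\n' "$addr" | grep -oE '[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+'
# → user@mail   (the match is cut short at the first digit)
printf '%s\n' "$addr" | grep -oE '[a-zA-Z0-9._]+@[a-zA-Z0-9._]+.[a-zA-Z0-9._]+'
# → user@mail10.sea61.rsgsv.net
```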

Then I executed the following pairs of scripts, first, as originally written and then suitably modified:
LC_ALL=C ./MB-Scrape-Emails-GL Emails.withno.Date.txt | sed 's/\t\t/\tnull\t/g'> Hairball.nodate.Newest-GL
cut -f 13,14,15,16,17 Hairball.nodate.Newest-GL | awk -F '[\t]' 'NR -2 {print $1"\t"$2"\t"$3"\t"$4"\t"$5}' > CheckEmailsForum.txt

Followed by:
LC_ALL=C ./MB-Scrape-Emails-GL-PO Emails.withno.Date.txt | sed 's/\t\t/\tnull\t/g'> Hairball.nodate.Newest-GL-PO
cut -f 13,14,15,16,17 Hairball.nodate.Newest-GL-PO | awk -F '[\t]' 'NR -2 {print $1"\t"$2"\t"$3"\t"$4"\t"$5}' > CheckEmailsForum-PO.txt

The second "CheckEmailsForum" file is larger than the first; both are attached. The header of each file lists the names of the five columns in the order 13,14,15,16,17, corresponding to awk's $1,$2,$3,$4,$5; column 13 ($1) is the X-UIDL of each email. "null" marks unresponsive queries.
There are about 1.6 kB of additional numeric domains; more scripting is needed to enumerate exactly which domains those are.
Fifty-two of the sixty-one listed emails show increased counts of email addresses, but we're not finding the missing emails this way.

Attachment Size
CheckEmailsForum.txt 11.27 KB
CheckEmailsForum-PO.txt 12.82 KB
amenex
Offline
Joined: 01/04/2015

As promised, here is a further analysis of the effect of admitting alphanumeric domains:
cut -f 1,3 CheckEmailsForum.txt | awk -F '[\t]' 'NR!=1 {print $2}' | sed 's/\,/\n/g' > Alphabetical.Emails.txt ;
awk -F '[\@]' '{print $2}' Alphabetical.Emails.txt | sed 's/\//\ /g' | awk -F '[\ ]' '{print $1}' | sort -u > Alphabetical.Domains.txt

cut -f 1,3 CheckEmailsForum-PO.txt | awk -F '[\t]' 'NR!=1 {print $2}' | sed 's/\,/\n/g' | sed 's/\?/\ /g' > Alphanumeric.Emails.txt ;
awk -F '[\@]' '{print $2}' Alphanumeric.Emails.txt | sed 's/\//\ /g' | awk -F '[\ ]' '{print $1}' | sort -u > Alphanumeric.Domains.txt
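Traced on a single invented address, the extraction chain isolates the domain like this:

```shell
# Keep what follows '@', turn '/' into a field break, keep the first field
# (the input line is invented):
printf 'user@mail10.sea61.rsgsv.net/extra\n' |
awk -F '@' '{print $2}' | sed 's,/, ,g' | awk '{print $1}'
# → mail10.sea61.rsgsv.net
```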

diff -y --suppress-common-lines --width=100 <(sort -k 1,1 Alphabetical.Domains.txt) <(sort -k 1,1 Alphanumeric.Domains.txt) > Alphanumeric.differencefrom.Alphabetical.Domains.txt
All those sed commands replace unwanted characters with field-separating spaces.
The difference file looks like this:
ak.mta | ak.mta1vrest.cc.prd.sparkpost
alum.mit | alum.mit.edu
ams-node | ams-node2.websitehostserver.net
bnc | bnc3.mailjet.com
bounce.govexec | bounce.govexec.com
bounces.elasticemail | bounces.elasticemail.net
dcfdpemail.verizon | dcfdpemail.verizon.com
direct-siege | direct-siege.com
e.defenseone | e.defenseone.com
host.maintained | host.maintained-it.co.uk
indosat.net | indosat.net.id
jwm | jwm12-app.iad1.qprod.net
> jwm19-app.iad1.qprod.net
> jwm9-app.iad1.qprod.net
mail | mail10.sea61.rsgsv.net
> mail159.sea51.mcsv.net
> mail208.atl101.mcdlv.net
> mail236.suw16.rsgsv.net
mail.groupage | mail.groupage.today
mailin.mcsv | mailin.mcsv.net
messages.cisa | messages.cisa.gov
mx.sailthru | mx.sailthru.com
pilotlogistics.com | pilotlogistics.com.sg
power.digitalesolution | power.digitalesolution.info
r3IZ | r3IZ-services
rareauctions.ccsend | rareauctions.ccsend.com
reeleffect.co | reeleffect.co.uk
rougeempire.co | rougeempire.co.nz
rtec-instruments | rtec-instruments.com
scheduler.constantcontact | scheduler.constantcontact.com
server | server01.mdqlab.com
srv | srv04.infranetdns.com
tracking.slcb | tracking.slcb.info
unsubscribe.qemailserver | unsubscribe.qemailserver.com
unsub.spmta | unsub.spmta.com
vip | vip.163.com
walla.co | walla.co.il
workmail.co | workmail.co.za

Some of the differences are dramatic because matching is cut short at the first digit encountered.
Other differences exist because the embedded numbers cause the alphanumeric email addresses to be lost entirely.
Some are a mystery to me. The comparison was made on the email addresses in the email headers.

Attachment Size
CheckEmailsForum.txt 12.48 KB
CheckEmailsForum-PO.txt 14.02 KB
Alphabetical.Emails.txt 8.36 KB
Alphanumeric.Emails.txt 9.25 KB
Alphabetical.Domains.txt 1.02 KB
Alphanumeric.Domains.txt 1.38 KB