Extract data from concatenated emails

amenex

The attached emailfile.txt contains a few days' spam emails. Each email appears to have a common
field, "X-UIDL: UID", which I am attempting to use as the record separator so that the emails can
be processed as separate multi-line records (including blank lines). I have worked out a series
of commands to gather appropriate data:
awk 'RS/"X-UIDL: UID" {print $0}' emailfile.txt | grep "X-UIDL:" |more
awk 'RS/"X-UIDL: UID" {print $0}' emailfile.txt | grep "Return-Path:" '-' |more
awk 'RS/"X-UIDL: UID" {print $0}' emailfile.txt | grep "Message-ID:" '-' |more
awk 'RS/"X-UIDL: UID" {print $0}' emailfile.txt | egrep -o "(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])" '-' |more
awk 'RS/"X-UIDL: UID" {print $0}' emailfile.txt | grep -oe "[a-zA-Z0-9._]\+@[a-zA-Z]\+.[a-zA-Z]\+" '-' |more
awk 'RS/"X-UIDL: UID" {print $0}' emailfile.txt | grep -oE "(http|https)://(.*).html" '-' |more

I want to make sure that the extracted data can still be correlated to the individual emails by
processing just one or two records at a time, but I haven't found suitable syntax for that.
Then there's the problem of handling the arguments of grep so as to use a pattern file ...
I'd like the suitable data to be listed in their own multi-line records.
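A minimal sketch of the kind of per-record selection wanted here, assuming the record-separator idea the thread settles on below:
awk -v RS='\nX-UIDL: UID' 'NR == 2 || NR == 3' emailfile.txt
would print records 2 and 3, i.e. roughly the first two emails (record 1 is only the start of the first email's metadata, as explained later in the thread).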

Attachment  Size
emailfile.txt  1.67 MB
amenex

See: https://stackoverflow.com/questions/33958347/can-you-print-a-record-in-awk?noredirect=1&lq=1 where it's said:
This is THE idiomatic awk way to do what you want and it works in all awks, not just gawk:
awk -v RS= -v ORS='\n\n' 'NR ~ /^(1|3)$/' file
Applied here:
sed 's/X-UIDL:/\f/g' emailfile.txt | awk -v RS= -v ORS='\f' 'NR ~ /^(1|3)$/' > First.and.3rd.records.txt
sed 's/X-UIDL:/\f/g' emailfile.txt | awk -v RS= -v ORS='\f' 'NR ~ /^(2)$/' > Second.record.txt

The RS symbol appears where I intended it to in the first output file, but the break between records is uncontrolled.

Attachment  Size
First.and_.3rd.records.txt  4.36 KB
Second.record.txt  130 bytes
Magic Banana

The thread you found on stackoverflow deals with records separated by a blank line.

Each email appears to have a common field, "X-UIDL: UID" which I am attempting to use as the record separator so that the emails can be processed as separate multi-line records (including blank lines).

Just define the record separator (RS) as "X-UIDL: UID" or, probably safer, as "\nX-UIDL: UID" to constrain "X-UIDL: UID" to start a line:
$ awk -v RS='\nX-UIDL: UID' '...' emailfile.txt
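As a quick sanity check of that separator (a hedged sketch; NR counts one extra record for the text before the first "X-UIDL: UID", hence the subtraction):
$ awk -v RS='\nX-UIDL: UID' 'END { print NR - 1, "emails" }' emailfile.txt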
For instance, to have every email in a separate file named after the X-UIDL, you can define the field separator (FS) as a newline so that the X-UIDL is the first field (except for the first record, the beginning of the metadata for the first email: the condition "NR - 1" excludes it):
$ mkdir emails
$ awk -v RS='\nX-UIDL: UID' -F \\n 'NR - 1 { print > "emails/" $1 }' emailfile.txt

To not have the X-UIDL in the file (it is already the file name), gensub can delete the first line of the record:
$ awk -v RS='\nX-UIDL: UID' -F \\n 'NR - 1 { print gensub(/[^\n]*\n/, "", 1) > "emails/" $1 }' emailfile.txt
You can then grep patterns in emails/*. It will prefix every output line with the file name, i.e., with the X-UIDL.
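For instance (hypothetical output shown as comments; the actual UIDs and addresses depend on the mailbox):
$ grep 'Return-Path:' emails/*
# emails/2781-1165006918:Return-Path: <someone@example.com>
# emails/2782-1165006918:Return-Path: <someone.else@example.com>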

amenex

Magic Banana said:
The thread you found on stackoverflow deals with records separated by a blank line.

Each email appears to have a common field, "X-UIDL: " which I am [today] attempting to use as the record
separator so that the emails can be processed as separate multi-line records (including blank lines). That
space turned out to be troublesome.

Just define the record separator (RS) as "X-UIDL: UID" or, probably safer, as "\nX-UIDL: UID" to constrain "X-UIDL: UID" to start a line:
$ awk -v RS='\nX-UIDL: UID' '...' emailfile.txt

Just the beginning of the thought "..." ?

For instance, to have every email in a separate file named after the X-UIDL, you can define the field
separator (FS) as a newline so that the X-UIDL is the first field (except for the first record, the beginning
of the metadata for the first email: the condition "NR - 1" excludes it):
$ mkdir emails
$ awk -v RS='\nX-UIDL: UID' -F \\n 'NR - 1 { print > "emails/" $1 }' emailfile.txt

awk -v RS='\nX-UIDL: ' -F \\n 'NR - 1 { print > "emails/" $1 }' emailfile.txt
This script works in the blink of an eye; for the whole email file, it'll be 100 times that.

In order to make the output files distinguishable from directories, I wrote the script:
paste -d ' ' front.txt back.txt > Script.rename.outputs.txt
cp Script.rename.outputs.txt Script.rename.outputs
sudo chmod +x Script.rename.outputs
./Script.rename.outputs

Ninety-four output files, soon to become nearly ten thousand.

To not have the X-UIDL in the file (it is already the file name), gensub can delete the first line of the record:
Actually, I prefer keeping X-UIDL in each email file.

For instance to grep the pieces of information you seem to be interested in, you can define the patterns in a file:
Return-Path: .*
Message-ID: .*
(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])
[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+
(http|https)://(.*).html

If that file is named patterns:
$ for email in emails/*; do grep -of patterns $email > $email.sel; done

for email in emails/*; do grep -of Patterns-MB-01032022.txt $email > $email.sel; done
Bash complains: "$email.sel: ambiguous redirect"
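The "ambiguous redirect" usually means the unquoted $email.sel expanded to zero or several words after the > (e.g., a file name containing whitespace). Quoting both expansions is the standard fix (a sketch of the same loop):
for email in emails/*; do grep -of Patterns-MB-01032022.txt "$email" > "$email.sel"; done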

Use my own [sophomoric] script:
Script-MB-Patterns-Grep-emailfile.txt ==>
grep -of Patterns-MB-01032022.txt emails/UID2781-1165006918.txt > emails/data/UID2781-1165006918.email.sel ;
grep -of Patterns-MB-01032022.txt emails/UID2782-1165006918.txt > emails/data/UID2782-1165006918.email.sel ;
grep -of Patterns-MB-01032022.txt emails/UID2783-1165006918.txt > emails/data/UID2783-1165006918.email.sel ;
...
grep -of Patterns-MB-01032022.txt emails/UID131107-1161171024.txt > emails/data/UID131107-1161171024.email.sel ;
grep -of Patterns-MB-01032022.txt emails/UID131124-1161171024.txt > emails/data/UID131124-1161171024.email.sel ;
grep -of Patterns-MB-01032022.txt emails/UID131127-1161171024.txt > emails/data/UID131127-1161171024.email.sel ;

cp Script-MB-Patterns-Grep-emailfile.txt Script-MB-Patterns-Grep-emailfile
sudo chmod +x Script-MB-Patterns-Grep-emailfile
./Script-MB-Patterns-Grep-emailfile

Alas, the script returns data for only two of the five patterns. Another script appears necessary,
as no pattern file can have control characters, but integration with Script-MB-Patterns-Grep-emailfile.txt looks unwieldy;
grep "UID" emails/UID2781-1165006918.txt >> emails/data/UID2781-1165006918.email.sel ;
grep "Return-path:" emails/UID2781-1165006918.txt >> emails/data/UID2781-1165006918.email.sel ;
grep "Message-ID:" emails/UID2781-1165006918.txt >> emails/data/UID2781-1165006918.email.sel ;
egrep -o "(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])" emails/UID2781-1165006918.txt | sort -u >> emails/data/UID2781-1165006918.email.sel ;
grep -oe "[a-zA-Z0-9._]\+@[a-zA-Z]\+.[a-zA-Z]\+" emails/UID2781-1165006918.txt | sort -u >> emails/data/UID2781-1165006918.email.sel ;
grep -oE "(http|https)://(.*).html" emails/UID2781-1165006918.txt | sort -u >> emails/data/UID2781-1165006918.email.sel ;
grep -oE "(http|https)://(.*)" emails/UID2781-1165006918.txt | sort -u >> emails/data/UID2781-1165006918.email.sel ;

Summary: Magic Banana's suggestions carried the task forward to place each separate email in its own file,
with the analytical data appended in a data subdirectory of the main email directory, all indexable by their UID's.
I haven't yet found a common grep syntax that will apply to all the data searches, though.

Attachment  Size
Script.rename.outputs.txt  6.23 KB
Script-MB-Patterns-Grep-emailfile.txt  10.18 KB
Patterns-MB-01032022.txt  190 bytes
Magic Banana

In order to make the output files distinguishable from directories, I wrote the script

Your ability to write horrible scripts always amazes me: one line per e-mail to append ".txt" to the file names, hence soon-to-become "nearly ten thousand" lines! The AWK program can append ".txt" to the file names, of course:
$ awk -v RS='\nX-UIDL: ' -F \\n 'NR - 1 { print > "emails/" $1 ".txt" }' emailfile.txt
Your Script-MB-Patterns-Grep-emailfile.txt (which should have been named "Script-MB-Patterns-Grep-emailfile.sh") is no better. One single execution of grep can process emails/*! It will prefix every output line with the file name. That is why you probably want every file name to only be the X-UIDL (not followed by ".txt") and why you do not need the X-UIDL in the content of the file.

Here are a dozen lines that probably do all you want, even what you have not asked for, such as standardizing the outputs (removing "<" and ">" in the return paths and the message ids) and separating the searches in the headers from the searches in the contents, where I removed the newlines to have whole URLs (broken in some emails):
#!/bin/sh
mkdir headers contents fields
awk -v RS='\nX-UIDL: UID' -F \\n 'NR - 1 { for (i = 2; $i; ++i) print $i >> "headers/" $1; while (++i != NF) printf "%s", $i >> "contents/" $1 }' "$@"
cd headers
grep ^Return-Path: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Return-Path
grep ^Message-ID: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Message-ID
for part in headers contents
do
cd ../$part
grep -Eo '(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])' * | sort -u > ../fields/IPv4-addr-in-$part
grep -Eo '[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+' * | sort -u > ../fields/email-addr-in-$part
grep -Eo '(http|https)://[a-zA-Z0-9./?=_%:-]*' * | sort -u > ../fields/URL-in-$part
done

Call the script with emailfile.txt or any other file(s) in arguments.
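That is, assuming the dozen lines are saved under a hypothetical name such as extract.sh:
$ chmod +x extract.sh
$ ./extract.sh emailfile.txt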

EDIT: removing the newlines in the content raises an issue with one email address in emailfile.txt and with any other address that would start/end a line.

amenex

Thank you Magic Banana for continuing to think about this task; what I wrote was prepared before seeing your
latest contribution. Yes, "txt" in the middle of the output files isn't pretty, but it helped immensely to
see with my own eyes that the ID.txt files were not directories. Separating the headers from the bodies of
the emails is very likely going to be useful; I had not thought of doing that.
Your ability to write horrible scripts always amazes me: one line per e-mail to append ".txt" to the file
names, hence soon-to-become "nearly ten thousand" lines!
That was to correct an error; I should have
just run the source script again after correcting it and need not have included it in my comments.
When I run Magic Banana's script:
awk -v RS='\nX-UIDL: ' -F \\n 'NR - 1 { print > "emails/" $1 ".txt" }' emailfile.txt
on the mail file (ca 8000 ID's), it cannot have so many files open at once. I'll subdivide the file
into sixteen parts and run a multi-step script:
awk -v RS='\nX-UIDL: ' -F \\n 'NR - 1 { print > "emails/" $1 ".txt" }' 1998-2021.Newest01 ;
awk -v RS='\nX-UIDL: ' -F \\n 'NR - 1 { print > "emails/" $1 ".txt" }' 1998-2021.Newest02 ;
...
awk -v RS='\nX-UIDL: ' -F \\n 'NR - 1 { print > "emails/" $1 ".txt" }' 1998-2021.Newest15 ;
awk -v RS='\nX-UIDL: ' -F \\n 'NR - 1 { print > "emails/" $1 ".txt" }' 1998-2021.Newest16 ;

In theory, split will do this, but I want to force the beginning of each part on an email boundary.
In practice I had to subdivide a few of the 16 parts to avoid awk complaints:
awk: program limit exceeded: maximum number of fields size=32767

Magic Banana

Attachments are certainly responsible for the emails with over 32767 lines (such as "AETNA PAYMENT CLAIM.pdf" starting at lines 4197 and 20846 of emailfile.txt).

amenex

Fortunately there are only a couple of emails that have this feature; they also block the grep script.

Magic Banana

I do not see how grep, with no concept of "field", would have trouble with that. Here is a solution to have fewer fields (separating the fields with "\n\n"; the first field is the header, where the first "\n" is searched; the id precedes it):
#!/bin/sh
mkdir headers contents fields
awk -v RS='\nX-UIDL: UID' -F '\n\n' 'NR - 1 {
  i = index($1, "\n")
  id = substr($1, 1, i - 1)
  print substr($1, i + 1) >> "headers/" id
  for (i = 2; i <= NF; ++i) {
    gsub(/\n/, "", $i)
    print $i >> "contents/" id } }' "$@"
cd headers
grep ^Return-Path: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Return-Path
grep ^Message-ID: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Message-ID
for part in headers contents
do
cd ../$part
grep -Eo '(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])' * | sort -u > ../fields/IPv4-addr-in-$part
grep -Eo '[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+' * | sort -u > ../fields/email-addr-in-$part
grep -Eo 'https?://[a-zA-Z0-9./?=_%:-]*' * | sort -u > ../fields/URL-in-$part
done

Are the files in the "fields" directory what you want?

amenex

Magic Banana noted:
EDIT: removing the newlines in the content raises an issue with one email address in emailfile.txt
and with any other address that would start/end a line.

The main concern with html links is the domain and its subfolder[s].
Better to let 'em be truncated to save the email addresses.

I worried that there are 8391 UID's in the main email file, but we found only 7370 with our scripts.
I checked by grepping all the UID's in the main file (8391) and then applying sort -u ==> 7371.
Leafpad counts the same 8391 (in about 30 minutes, compared to grep's 0.145 seconds).

Magic Banana

The main concern with html links is the domain and its subfolder[s]. Better to let 'em be truncated to save the email addresses.

Remove the line "gsub(/\n/, "", $i)" then. However, when processing emailfile.txt, you will get for instance "http://fmtra=" instead of "http://fmtra=ck1.campaign.panelcompliance.com/ls/click?upn=3DChgKqXKfaeCJvxAAADlH3A4DOQn=sChltPiOo6cXqpCR9aqVUIfH2O98votAFXSCsRZYQz5zJMQ2af1LUdO2dWk2ExQLwHt-2BVJfKM=v9nv9ivYwFH-2F9U5so1rN-2BZ1sDQ6dLt9H1SZJg0XEZqDWDNM-2BRKQusWmH3hh8RE2bdkaqE=P6cbY0tG82RBWwhFgOyMl-2BolLBqPzGv8k8s1YgotlfPnkakJMv6lRILJHaPy3CqVU9lAQOyHT=lC700-2F9WKbl9CJLwTanIZyoXsu6qnZ9V1TJtqu-2BoJ1TRp-2FNyez-2FZVubVA-3D0NCg_od=2XPTO-2Fx2EME4LAtyFHT6ACDrelgaGc0LSWJEWhq6jmbZw2gPcIqbBHk3-2B7Z-2BZNI4VMcJk=sh-2F5ykPbACZ4vdTY6JWPt1Mg5YPXBsDJgzWNE1n-2BV07y53OmrRPo9fj1fBOqND2n08o2enN=qX9LQ3rXTnELN1PP1XjI16wA2MOaEcZyhGFWtJdYJXOu8Mgyf6tuo4CfYlzfU2hPpnTGay8kMkv=iYgIVjgkKna19tRF6HpBR3XB3OLvTGy83YBrlufzWqCD5j82-2F4zkA-2BSwdmNC5kelyx7vmrT=o3rqkh3mFKki53d-2BiWjxYNPKCLV6D-2FvFJlKsxp0wYHab09l-2Bk5u-2BDTq-2FqSGwkayjJ=CXr1scwxlpKdI6JwFc6VfzMzjBYGR-2FNcRJK5w5-2BKemEOryj8sgwzSvWCosTUnjhRK7jGiMq=d4yaXzB9Y04aDytaPddaQrSyQPk1yF-2FTY0fxiZzm-2BiObwOp5rj5OunasV6CLWeTnASlRC-2=FMfhU1IIR3h2woWvfwry96QdvLh50KJDdDR3nLdzcyYPtnEPY-2B2mGO5zLioX0Uv1zVf2lVRIf=ckUh2wywo7-2Fv84t87n2uS60gmhhcEh73bfzpHX2AYRB9s-2B-2Bnts5NnDXER7rksSVP-2BQh=Kb9e4nxYXBaArnOwE1u7TDdIS9hVXSZPXu0gPJWNQDNcLZFwuxp6iZCRaMWdvYihRXflnuvC6Tm=ykHHEQnNMQCA1HuO1rdyVDkTwY3YOoTaQoiRjM6lKEpbhO7sxsWrQk3nPeDPx08Q8GAhtNSsRyb=XJzIwqNQV7yicYKLg5ruXCL-2B06YTKR97N5hSbZak8MSW2rXu-2FQwz21KKWNdTnjS1FqQKjDR=anItUCo-2BJdx5vfBiuPcQ3CkhKJiv3Kwnch-2FU9CL6BpEZkzUOGKoASyT4wD8xeWndcYsiqMW=9-2B6ge29ndVwk7CyMBYB4mEiPjWATuDNrg2W3EKRT1xu-2B8HTp-2Bpzol0q9XNV4Am5evrcBu=K5hFwnjIEsnU8FUrfMmUoFhuErTN-2BD". You do not even get the domain here.

I checked by grepping all the UID's in the main file (8391) and then applying sort -u ==> 7371.

So a same X-UIDL is in different emails, right? Since the AWK program appends, with ">>", lines to the files in "headers" and "contents", you get the concatenation of their headers/contents. If you prefer one header/content per email, the file names can be the positions plus one of the email in the input file(s) (and I now let the X-UIDL on the first line of every header):
#!/bin/sh
mkdir headers contents fields
awk -v RS='\nX-UIDL: UID' -F '\n\n' 'NR - 1 {
  print $1 > "headers/" NR
  for (i = 2; i <= NF; ++i) {
    gsub(/\n/, "", $i)
    print $i >> "contents/" NR } }' "$@"
cd headers
grep ^Return-Path: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Return-Path
grep ^Message-ID: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Message-ID
for part in headers contents
do
cd ../$part
grep -Eo '(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])' * | sort -u > ../fields/IPv4-addr-in-$part
grep -Eo '[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+' * | sort -u > ../fields/email-addr-in-$part
grep -Eo 'https?://[a-zA-Z0-9./?=_%:-]*' * | sort -u > ../fields/URL-in-$part
done

I kept "gsub(/\n/, "", $i)" in the above script, to not lose "the domain and its subfolder[s]" you want in every URL. A better reason to remove that line would be to avoid an issue I have already mentioned: if, in the content, an email address starts a line (and the previous line does not end with a blank character) or ends a line (and the next line does not start with a blank character), the grepped address is wrong (it includes the characters of the previous/next word).

amenex

Left to my own devices, I tried the following approach:

(1) Create Script-grep-data-emails.txt ==>
grep "UID" emails/$1 >> emails/data/$1.email.sel ;
grep "Return-path:" emails/$1 >> emails/data/$1.email.sel ;
grep "Message-ID:" emails/$1 >> emails/data/$1.email.sel ;
egrep -o "(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])" emails/$1 | sort -u >> emails/data/$1.email.sel ;
grep -oe "[a-zA-Z0-9._]\+@[a-zA-Z]\+.[a-zA-Z]\+" emails/$1 | sort -u >> emails/data/$1.email.sel ;
grep -oE "(http|https)://(.*).html" emails/$1 | sort -u >> emails/data/$1.email.sel ;
grep -oE "(http|https)://(.*)" emails/$1 | sort -u >> emails/data/$1.email.sel ;

After which I made the script executable by the usual means.

(2) Create Script-grep-data-all-emails.txt ==>
./Script-grep-data-emails UID2781-1165006918.txt ;
./Script-grep-data-emails UID2782-1165006918.txt ;
./Script-grep-data-emails UID2783-1165006918.txt ;
...
./Script-grep-data-emails UID131107-1161171024.txt ;
./Script-grep-data-emails UID131124-1161171024.txt ;
./Script-grep-data-emails UID131127-1161171024.txt ;
This script also made executable.
Then I tried just the first line of Script-grep-data-all-emails:
sudo ./Script-grep-data-emails UID2781-1165006918.txt
It worked OK.

(3) Run the script:
sudo Script-grep-data-all-emails
Filled the emails/data directory with the grepped data.
Note: Script-grep-data-emails.txt is a work in progress.

AllegatoDimensione
Script-grep-data-all-emails.txt 5.68 KB
Magic Banana

You may have missed my previous post, which I initially sent while you were writing yours. It provides a dozen lines that probably do all you want in a reasonable way (unlike what you write).

amenex

Magic Banana commented:
You may have missed my previous post, which I initially sent while you were writing yours. It provides a dozen lines that probably do all you want in a reasonable way (unlike what you write).
More about this tomorrow ...

Update: I used grep to find the UID's in the overall email file and got 8391;
then I applied sort -u, which left 7371. comm found 1020 not in the smaller set,
presumably all duplicates. Then I picked three UID's and used Leafpad's search
engine, which found another identical UID & email for each of the three in the
original email file. There are 7370 individual email files in the separated set.

Magic Banana

As I have already explained in https://trisquel.info/forum/extract-data-concatenated-emails#comment-164158, my script was concatenating the headers/contents of the emails with a same X-UIDL.

The (actually simpler) AWK program in that same post numbers the files according to the position of the related email in the input file(s) and not according to the X-UIDL (in the first line of the header). With it you will normally get the headers/contents of the 8391 emails in as many files numbered from 2 to 8392. Those same numbers identify the emails in fields/*.

You still have not answered my question:
Are the files in the "fields" directory what you want?
If you want the X-UIDL (I now notice that you were grepping it in your original post), add that instruction as the first line of the AWK action (right after "{"):
printf NR ":" substr($1, 1, index($1, "\n")) > "fields/X-UIDL"

Magic Banana

I have just realized the eight different types of information (all the files in "fields" but X-UIDL, populated with the line I suggested to add above) can be computed in parallel, by ending the related command lines with "&":
#!/bin/sh
mkdir headers contents fields
awk -v RS='\nX-UIDL: UID' -F '\n\n' 'NR - 1 {
  printf NR ":" substr($1, 1, index($1, "\n")) >> "fields/X-UIDL"
  print $1 > "headers/" NR
  for (i = 2; i <= NF; ++i) {
    gsub(/\n/, "", $i)
    print $i >> "contents/" NR } }' "$@"
cd headers
grep ^Return-Path: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Return-Path &
grep ^Message-ID: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Message-ID &
for part in headers contents
do
cd ../$part
grep -Eo '(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])' * | sort -u > ../fields/IPv4-addr-in-$part &
grep -Eo '[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+' * | sort -u > ../fields/email-addr-in-$part &
grep -Eo 'https?://[a-zA-Z0-9./?=_%:-]*' * | sort -u > ../fields/URL-in-$part &
done

Here, that makes the processing of emailfile.txt 2.4 times faster. For "nearly ten thousand" emails, that will make a perceivable difference.
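One caution, added here as an assumption about how the output is consumed: the script can exit while the backgrounded greps are still writing, so appending
wait
after the loop blocks until all eight jobs have finished before anything reads fields/*.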

amenex

Ducking Magic Banana's question about fields after seeing the output of but one email, here are
my experiences with grep, comm, and the UID's:

Here's where grep worked OK:
grep "X-UIDL: " 1998-2021.Newest > Count-UIDs-1998-2021.Newest.txt
wc -l Count-UIDs-1998-2021.Newest.txt ==> 8391
grep "X-UIDL: " 1998-2021.Newest | sort -u > Count-UIDs-1998-2021.Newest.sorted.txt
wc -l Count-UIDs-1998-2021.Newest.sorted.txt ==> 7371; 8391 - 7371 = 1020 duplicates.

Here's where grep fails:
grep -f Count-UIDs-1998-2021.Newest.sorted.txt Count-UIDs-1998-2021.Newest.txt > Duplicates-1998-2021.grep.txt
wc -l Duplicates-1998-2021.grep.txt ==> 8391

That's more hits than there are UID's in the pattern file.

Apply brute force:
grep "UID115026-1161171024" Count-UIDs-1998-2021.Newest.txt ==> UID115026-1161171024
One-at-a-time works ...
cp Script-Count-UIDs-1998-2021.Newest.sorted.quoted.txt Script-Count-UIDs-1998-2021.Newest.sorted.quoted
sudo chmod +x Script-Count-UIDs-1998-2021.Newest.sorted.quoted
./Script-Count-UIDs-1998-2021.Newest.sorted.quoted
wc -l Duplicates-1998-2021.grep.quoted.txt ==> 8391

One-at-a-time executed 7291 times in a row fails miserably; please, tell me why ...

Comm comes to the rescue with its simpler controls:
comm -13 <(sort Count-UIDs-1998-2021.Newest.sorted.txt) <(sort Count-UIDs-1998-2021.Newest.txt) > Duplicates-1998-2021.Newest.comm.txt
wc -l Duplicates-1998-2021.Newest.comm.txt ==> 1020

Further investigation suggests that these 1020 emails are simply duplicates,
based on three examples which occur in 1998-2021.Newest twice for each with the same message ID.

Magic Banana

Ducking Magic Banana's question about fields

Why?

Here's where grep fails: (...) That's more hits than in there are UID's in the pattern file.

That is the proper behavior. Grep here selects in 1998-2021.Newest the same 8391 lines as 'grep "X-UIDL: " 1998-2021.Newest' (but uselessly wastes time on testing 7371 patterns instead of one single pattern). I do not understand what else you were expecting.

With option -c, grep directly outputs the number of matching lines (no need to waste time outputting those lines and counting them with wc -l).
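For instance, the count above without wc:
$ grep -c 'X-UIDL: ' 1998-2021.Newest
8391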

One-at-a-time executed 7291 times

Please stop doing those horrible things! Grep properly works. You need to read its manual (info grep) to understand what it does. It has no technical limitation I am aware of. In particular, the input lines only need to fit into main memory and there can be infinitely many lines.

On the other hand, you reached limitations of AWK on the maximal number (32767) of fields and on the maximal number of files it (or any other process) is allowed to open. Error messages clearly explained them. I gave you a workaround for the first problem. I had forgotten about the second one. I believe you only need to close every file after writing it:
#!/bin/sh
mkdir headers contents fields
awk -v RS='\nX-UIDL: UID' -F '\n\n' 'NR - 1 {
  printf NR ":" substr($1, 1, index($1, "\n")) >> "fields/X-UIDL"
  close("fields/X-UIDL")
  print $1 > "headers/" NR
  close("headers/" NR)
  for (i = 2; i <= NF; ++i) {
    gsub(/\n/, "", $i)
    print $i >> "contents/" NR }
  close("contents/" NR) }' "$@"
cd headers
grep ^Return-Path: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Return-Path &
grep ^Message-ID: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' > ../fields/Message-ID &
for part in headers contents
do
cd ../$part
grep -Eo '(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])' * | sort -u > ../fields/IPv4-addr-in-$part &
grep -Eo '[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+' * | sort -u > ../fields/email-addr-in-$part &
grep -Eo 'https?://[a-zA-Z0-9./?=_%:-]*' * | sort -u > ../fields/URL-in-$part &
done

amenex

In the cold light of day, I now see my error in applying grep. It finds every UID from the pattern file
in the target file (7370) once, and some of them (1020) twice, totaling 8391. Only comm -13 can list
the duplicates: the extra occurrences that are neither in the smaller file nor counted as common to both files.
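A toy demonstration of those comm options (column 1 holds the lines only in the first file, column 2 the lines common to both, column 3 the lines only in the second file; -13 keeps column 3):
$ printf 'a\nb\n' > uniq.txt
$ printf 'a\na\nb\n' > all.txt
$ comm -13 uniq.txt all.txt
a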

Magic Banana's script of January 7th is a work of mathematical genius. Think of analyzing all the
statistics of every baseball game of an entire season all in one go. I made the script's argument
the entire ca.300MB collection of about nine thousand emails in one file, which the script processed
in roughly three minutes without error.

Questions & comments:
1. Some '=' characters appear, randomly placed, in the URL's (URL-in-contents). Apply sed 's/\=//g' & sort -u ?
Other '=' signs appear legitimate, as in URL-in-headers and in Return-Path.
2. Is the field preceding the ':' in several lists an index number $1 common to all those lists ?
3. Some non-IANA extensions appear in the analysis of emails Nos. 2140, 2142 & 2143, but grepping the corrupted
email addresses gets no hits, yet grepping the added strings reveals many instances of those strings alongside
but not part of the affected addresses. How do we deal with that ? The attached (incomplete) list appears targeted
towards one specific address.
4. Some emails are truncated but appear alongside the undamaged email in the same message (email-address-in-contents).

Attachment  Size
Corrupted-email-addresses.txt 1.79 KB
Magic Banana

Magic Banana's script of January 7th is a work of mathematical genius. Think of analyzing all the statistics of every baseball game of an entire season all in one go.

The script is actually pretty straightforward. It does no analysis. Only scrapping.

Questions & comments

Yes, in fields/*, the number preceding ":" identifies the e-mail: as I have already written, it is "the positions plus one of the email in the input file(s)". It is easy to have all fields in one single file (with huge lines) if you want.

I now understand that the broken lines end with "=", which is not content (and the reason for the supernumerary "="s in the URLs). In the content, I was deleting every "\n". Adding one single character, I now delete every "=\n":
#!/bin/sh
mkdir headers contents fields
awk -v RS='\nX-UIDL: UID' -F '\n\n' 'NR - 1 {
  printf NR ":" substr($1, 1, index($1, "\n"))
  print $1 > "headers/" NR
  close("headers/" NR)
  for (i = 2; i <= NF; ++i) {
    gsub(/=\n/, "", $i)
    print $i >> "contents/" NR }
  close("contents/" NR) }' "$@" > fields/X-UIDL
cd headers
grep -i ^Return-Path: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' | sort -k 1,1 > ../fields/Return-Path &
grep -i ^Message-ID: * | awk -F '[: <>]+' '{ print $1 ":" $3 }' | sort -k 1,1 > ../fields/Message-ID &
for part in headers contents
do
cd ../$part
grep -Eo '(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])' * | sort -u > ../fields/IPv4-addr-in-$part &
grep -Eo '[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+' * | sort -u > ../fields/email-addr-in-$part &
grep -Eo 'https?://[a-zA-Z0-9./?=_%:-]*' * | sort -u > ../fields/URL-in-$part &
done

That should not only fix the URLs but also the email addresses starting/ending a line (as in the file you have just attached). Is it the problem you were referring to in your third or fourth point? I do not really understand them: examples would help.

EDIT: Removal of close("fields/X-UIDL"), which was stupid: that file is not specific to an email (for each email, one additional X-UIDL is written in fields/X-UIDL) and it was a waste of time to open/close it for each email.

EDIT2: New modification because it is actually clearer to let awk write on the standard output and the shell redirect to fields/X-UIDL.

EDIT3: Now grepping in a case-insensitive way Return-Path and Message-ID, after noticing that an email in emailfile.txt has "Message-Id" instead of "Message-ID".

amenex

Magic Banana's improved script takes 2.6 seconds and accomplishes the desired & stated ends.
Note: "scrapped" in common Internet usage ought to be "scraped." The words have entirely
different meanings. English is a mess. It's time to get used to its idiosyncrasies.

The many fewer '=' characters now look as though they belong in context.

Magic Banana

In every email of emailfile.txt, there is always "Return-Path:" and "Return-path:" and both always indicate the same email address. That is why I believe you want to add option -u to the first two calls of sort.

amenex

Magic Banana is correct about using sort -u; I made the collection by combining spam lists from
different time periods that almost certainly overlapped. One of the [only] three that I checked
had different first-line time stamps, everything else being equal. Sort -u ought to cut the
number of duplicates substantially. I'll make the changes and run the script again in the AM.

amenex

Today's Magic Banana script took all of 12 seconds and produces clean outputs, with no
supernumerary '='s, nice one-line URL's, and real '='s where obviously appropriate.

Magic Banana

I wrote:
It is easy to have all fields in one single file (with huge lines) if you want.

Since you certainly want that, I wrote a function group_by_line and simply paste the resulting files (named pipes) at the end. The first output line identifies the fields. You can use cut to select a subset of the columns. While I was at it, I put the (now temporary) files in /tmp and turned the script more generic: you only need to add a line to one of the first two variables to scrap the related additional piece of information. Here is the result:
#!/bin/sh
# One key per line in $FIRST_IN_HEADERS to get its first value in the headers.
FIRST_IN_HEADERS='Return-Path
Message-ID'
# One pair regexp,name per line in $DISTINCT_IN_HEADERS_OR_CONTENT to get the distinct matching strings in the headers and the content.
DISTINCT_IN_HEADERS_OR_CONTENT='https?://[a-zA-Z0-9./?=_%:-]*,URL
(([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5]),IPv4-addr
[a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+,email-addr'
group_by_line () {
  awk -F '[ <>]+' -v i=2 '{
    if ($1 - i) {
      print str
      str = $2
      while ($1 - ++i)
        print "" }
    else
      if (str)
        str = str "," $2
      else
        str = $2 }
  END {
    print str }'
}
TMP=$(mktemp -d)
trap "rm -r $TMP 2>/dev/null" 0
mkdir $TMP/headers $TMP/contents $TMP/fields
mkfifo $(echo "$FIRST_IN_HEADERS" | sed s:^:$TMP/fields/: | tr \\n ' ') $(echo "$DISTINCT_IN_HEADERS_OR_CONTENT" | sed -e s:.*,:$TMP/fields/: -e s/\$/-in-headers/ | tr \\n ' ') $(echo "$DISTINCT_IN_HEADERS_OR_CONTENT" | sed -e s:.*,:$TMP/fields/: -e s/\$/-in-contents/ | tr \\n ' ')
files=$(seq 2 $(awk -v RS='\nX-UIDL: UID' -F '\n\n' -v tmp=$TMP/ 'NR - 1 {
  printf substr($1, 1, index($1, "\n")) >> tmp "fields/X-UIDL"
  print $1 > tmp "headers/" NR
  close(tmp "headers/" NR)
  for (i = 2; i <= NF; ++i) {
    gsub(/=\n/, "", $i)
    print $i >> tmp "contents/" NR }
  close(tmp "contents/" NR) }
END {
  print NR }' "$@"))
cd $TMP/headers
for header in $FIRST_IN_HEADERS
do
grep -im 1 ^$header: $files |
sed 's/:[^:]*:/ /' |
group_by_line > ../fields/$header &
done
for part in headers contents
do
cd ../$part
for pair in $DISTINCT_IN_HEADERS_OR_CONTENT
do
grep -Eo "$(echo $pair | sed 's/,.*//')" $files |
sort -ut : -k 1,1n -k 2 |
sed 's/:/ /' |
group_by_line > ../fields/$(echo $pair | sed 's/.*,//')-in-$part &
done
done
cd ../fields
printf '# '
echo * | sed 's/ /\t/g'
paste *

If only that example could make you value generic solutions without repeated code over ad-hoc ones...
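A usage sketch, with hypothetical file names: run the script on the mailbox, keep the table, and cut out columns, using the first output line to map field names to column numbers (cut takes the tabulation as its default separator, so no option is needed):
$ ./extract.sh emailfile.txt > table.tsv
$ head -1 table.tsv
$ cut -f 5,9 table.tsv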

amenex

Magic Banana wrote:
I wrote:
It is easy to have all fields in one single file (with huge lines) if you want.

Note to file: Be careful of that for which I ask ...

Since you certainly want that,... Did I say that ?
I wrote a function group_by_line and simply paste the resulting files (named tubes) at the end.
The first output line identifies the fields. You can use cut to select a subset of the columns.
While I was at it, I put the (now temporary) files in /tmp and turned the script more generic:
you only need to add a line to one of the first two variables to scrap[e] the related additional
piece of information. Here is the result:

After an initial shock, I sent the standard output to a file named "Hairball.txt" which I now recognize
as an 8392-row by nine-column table.

Use awk to pick & choose:
awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10}' Hairball.txt > Table.txt
Pick the columns you want ==>
($1)email-addr-in-contents
($2)email-addr-in-headers
($3)IPv4-addr-in-contents
($4)IPv4-addr-in-headers
($5)Message-ID
($6)Return-Path
($7)URL-in-contents
($8)URL-in-headers
($9)X-UIDL

Based on the previous script's outputs, here's an example of post-processing which might very
well be incorporated into either script:
awk '{print FILENAME":",$0}' headers/* | grep "Subject" | grep -v ":Subject" | sed 's/\*\*\*SPAM\*\*\* //g' | sed 's/headers\///g' > fields/email_Subject
which produces a fairly clean set of Subject names (without the "***SPAM***" distraction) for each email.
I haven't managed to write a similar script for the contents of Table.txt.

EDIT: Some email addresses are hiding in line[s] like this in the emails' headers:
0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider [miguelangel_stefanowsky[at]outlook.cl]
Those could be incorporated in the fields as "abused_enduser_email_provider"
grep "\[at\]" headers/* | sed 's/headers\///g' | tr -s ' ' | sed 's/\[at\]/\@/g' | tr -d '[]' > fields/abused_enduser_email_provider

EDIT2: Here's another one:
grep "Content analysis details:" headers/* | sed 's/headers\///g' | sed 's/ details//g' | tr -d '()' | tr -s ' ' | sort -nrk 4,4 > fields/Spam_Content_analysis

EDIT3: Yet one more:
grep "X-Account-Key: " contents/* | sed 's/contents\///g' | sed 's/X-Account-Key://g' | sed 's/account//g' | tr -s '' | sort -gk 2,2 > fields/x-Account-Key

These can be simplified by directing grep to the main email file.

Magic Banana

Use awk to pick & choose

Not like that: you need to set the field separator (for instance with option -F) to "\t". By default it is any run of blanks, hence a problem with empty fields. Or use cut, as I have already suggested: the tabulation is cut's default field separator.
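For instance, with the column order amenex listed above (Message-ID in $5, X-UIDL in $9):
$ awk -F '\t' '{ print $9 "\t" $5 }' Hairball.txt
$ cut -f 5,9 Hairball.txt
The awk line can reorder the columns; cut only selects them in their original order.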

Based on the previous script's outputs, here's an example of post-processing which might very well be incorporated into either script

The idea of the newest script is to not have to write ad-hoc code to get additional pieces of information. Here, you would just add a line with "Subject" to the variable FIRST_IN_HEADERS... except that it would not have worked because I had not considered the related value could have spaces. Below, I corrected that issue and added "Subject" to the variable FIRST_IN_HEADERS.

Those could be incorporated in the fields

Indeed, adding the line "email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+" to the variable DISTINCT_IN_HEADERS_OR_CONTENT. Again: you want to avoid ad-hoc code.

Here's another one

... that should simply have been an additional line ("Content analysis details") to FIRST_IN_HEADERS. Nevertheless, there was another issue with spaces in the field name (and the simple fix avoids the horrible mkfifo line of the previous version).

Here is the resulting script after those additions/corrections:
#!/bin/sh
# One key per line in $FIRST_IN_HEADERS to get its first value in the headers.
FIRST_IN_HEADERS='Return-Path
Message-ID
Subject
Content analysis details'
# One pair (name, regexp) per line in $DISTINCT_IN_HEADERS_OR_CONTENT to get the distinct matching strings in the headers and the content.
DISTINCT_IN_HEADERS_OR_CONTENT='URL https?://[a-zA-Z0-9./?=_%:-]*
IPv4-addr (([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])
email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+
email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+'
group_by_line () {
  awk -F : -v i=2 '{
    if ($1 - i) {
      print str
      str = gensub(/[^:]*:/, "", 1)
      while ($1 - ++i)
        print "" }
    else
      if (str)
        str = str "," gensub(/[^:]*:/, "", 1)
      else
        str = gensub(/[^:]*:/, "", 1) }
  END {
    print str }'
}
TMP=$(mktemp -d)
trap "rm -r $TMP 2>/dev/null" 0
mkdir $TMP/headers $TMP/contents $TMP/fields
files=$(seq 2 $(awk -v RS='\nX-UIDL: UID' -F '\n\n' -v tmp=$TMP/ 'NR - 1 {
  printf substr($1, 1, index($1, "\n")) >> tmp "fields/X-UIDL"
  print $1 > tmp "headers/" NR
  close(tmp "headers/" NR)
  for (i = 2; i <= NF; ++i) {
    gsub(/=\n/, "", $i)
    print $i >> tmp "contents/" NR }
  close(tmp "contents/" NR) }
END {
  print NR }' "$@"))
cd $TMP/headers
echo "$FIRST_IN_HEADERS" | while read -r header
do
mkfifo "../fields/$header"
grep -im 1 "^ *$header:" $files |
sed -e 's/:[^:]*: */:/' -e 's/<//' -e 's/>//' |
group_by_line > "../fields/$header" &
done
for part in headers contents
do
cd ../$part
echo "$DISTINCT_IN_HEADERS_OR_CONTENT" | while read -r name regexp
do
mkfifo "../fields/$name-in-$part"
grep -Eo "$regexp" $files |
sort -ut : -k 1,1n -k 2 |
group_by_line > "../fields/$name-in-$part" &
done
done
echo "# $(ls ../fields | awk '{ print NR ":" $0 }' | tr \\n \\t)"
paste ../fields/*

amenex

Magic Banana wrote:
In response to my statement to Use awk to pick & choose
Not like that: you need to define the field separator (for instance with option -F) to "\t".
That's what I did with the translation of Hairball.txt to Table.txt so the columns can be chosen
arbitrarily and in any order.
By default it is any number of blanks, hence a problem with empty fields.
Or use cut, as I have already suggested: the tabulation is cut's default field separator.

Back to school for me in the "cut" department ...

Regarding the latest script: There's a problem with the last (13th) column: X-UIDL. The output
is split between error messages (... grep: 8392: No such file or directory) and the lone column
in Hairball.txt (... 131058-1161171024) which appeared like this (8392:131058-1161171024) in the
last line of the output of my ad hoc script. The same errors appear when the script is applied
to emailfile.txt

Three-column table follows (also as an attachment):
# Columns accounted for Columns needed
1 Content analysis details Spam_Content_analysis
2 email-addr-in-contents email-addr-in-contents
3 email-addr-in-headers email-addr-in-headers
4 email-addr-with-at-in-contents Unnecessary ?
5 email-addr-with-at-in-headers abused_enduser_email_provider
6 IPv4-addr-in-contents IPv4-addr-in-contents
7 IPv4-addr-in-headers IPv4-addr-in-headers
8 Message-ID Message-ID
9 Return-Path Return-Path
10 Subject email_Subject
11 URL-in-contents URL-in-contents
12 URL-in-headers URL-in-headers
13 X-UIDL X-UIDL
14 Absent ? X-Account-Key

My column #5 (abused_enduser_email_provider) only appears in the headers

Examples:
Spam_Content_Analysis ==> 7360: Content analysis: 49.9 points, 5.0 required; 143: Content analysis: 5.1 points, 5.0 required; 1920: Content analysis: -4.6 points, 2.0 required
X-Account-Key ==> 2959: 3; 2038: 8; 8363: 16
abused_enduser_email_provider ==> 4: name at domain; 4090: name at domain; 8387: name at domain

Ad hoc scripts (tweaked somewhat since yesterday):
grep "\[at\]" headers/* | sed 's/headers\///g' | tr -s ' ' | sed 's/\[at\]/\@/g' | tr -d '[]' | sort -unk 1,1 > fields/abused_enduser_email_provider
Applied to emailfile.txt ==>
grep "\[at\]" emailfile.txt | tr -s ' ' | tr -d ' ' | sed 's/\[at\]/\@/g' | tr -d '[]' but the sequence number is absent.

grep "Content analysis details:" headers/* | sed 's/headers\///g' | sed 's/ details//g' | tr -d '()' | tr -s ' ' | sort -nrk 4,4 > fields/Spam_Content_analysis
Applied to emailfile.txt ==>
grep "Content analysis details:" emailfile.txt | sed 's/ details//g' | tr -d '()' | tr -s ' ' | sort -nrk 3,3 Output looks OK

grep "X-Account-Key: " contents/* | sed 's/contents\///g' | sed 's/X-Account-Key://g' | sed 's/account//g' | tr -s '' | sort -gk 2,2 > fields/X-Account-Key
Applied to emailfile.txt ==>
grep "X-Account-Key: " emailfile.txt | sed 's/X-Account-Key://g' | sed 's/account//g' | tr -s '' | sort -gk 1,1 but the sequence number is absent. Both sets of sequence numbers could easily be replaced with the UID's.

Attachment  Size
Trisquel-forum-table.ods 16.91 KB
Magic Banana

There's a problem with the last (13th) column: X-UIDL. The output is split between error messages (... grep: 8392: No such file or directory) and the lone column in Hairball.txt (... 131058-1161171024) which appeared like this (8392:131058-1161171024) in the last line of the output of my ad hoc script. The same errors appear when the script is applied to emailfile.txt

Here, "No such file or directory" never appears while processing emailfile.txt. The header of the output table was ordered incorrectly because ls' order differs from *. I fixed that below.

My column #5 (abused_enduser_email_provider) only appears in the headers

That is what I found too. Wouldn't you like the email addresses with "[at]" in the content if there were any? If not, I can add a variable DISTINCT_IN_HEADERS and a few lines of code.

grep "X-Account-Key: " contents/*

X-Account-Key should actually be in the header. As far as I can see in emailfile.txt, every header has an X-Account-Key at the beginning: that makes it a better record separator than X-UIDL. I therefore changed the script to use "X-Account-Key: " as the record separator and added X-UIDL to the variable FIRST_VALUE_IN_HEADERS.

Here is the result:
# One key per line in $FIRST_IN_HEADERS to get its first value in the headers.
FIRST_VALUE_IN_HEADERS='X-UIDL
Return-Path
Message-ID
Subject
Content analysis details'
# One pair (name, regexp) per line in $DISTINCT_IN_HEADERS_OR_CONTENT to get the distinct matching strings in the headers and the content.
DISTINCT_IN_HEADERS_OR_CONTENT='URL https?://[a-zA-Z0-9./?=_%:-]*
IPv4-addr (([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])
email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+
email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+'
group_by_line () {
  awk -F : -v i=2 '{
    if ($1 - i) {
      print str
      str = gensub(/[^:]*:/, "", 1)
      while ($1 - ++i)
        print "" }
    else
      if (str)
        str = str "," gensub(/[^:]*:/, "", 1)
      else
        str = gensub(/[^:]*:/, "", 1) }
  END {
    print str }'
}
TMP=$(mktemp -d)
trap "rm -r $TMP 2>/dev/null" 0
mkdir $TMP/headers $TMP/contents $TMP/fields
files=$(seq 2 $(awk -v RS='\nX-Account-Key: ' -F '\n\n' -v tmp=$TMP/ 'NR - 1 {
  printf substr($1, 1, index($1, "\n")) >> tmp "fields/X-Account-Key"
  print $1 > tmp "headers/" NR
  close(tmp "headers/" NR)
  for (i = 2; i <= NF; ++i) {
    gsub(/=\n/, "", $i)
    print $i >> tmp "contents/" NR }
  close(tmp "contents/" NR) }
END {
  print NR }' "$@"))
cd $TMP/headers
echo "$FIRST_VALUE_IN_HEADERS" | while read -r key
do
mkfifo "../fields/$key"
grep -im 1 "^ *$key:" $files |
sed -e 's/:[^:]*: */:/' -e 's/<//' -e 's/>//' |
group_by_line > "../fields/$key" &
done
for part in headers contents
do
cd ../$part
echo "$DISTINCT_IN_HEADERS_OR_CONTENT" | while read -r name regexp
do
mkfifo "../fields/$name-in-$part"
grep -Eo "$regexp" $files |
sort -ut : -k 1,1n -k 2 |
group_by_line > "../fields/$name-in-$part" &
done
done
cd ../fields/
set *
echo -n '# '
while [ -n "$2" ]
do
i=$(expr $i + 1)
echo -n "$i:$1\t"
shift
done
echo "$(expr $i + 1):$1"
paste *

What you suggest does additional processing, but it is specific to each field. For clarity and maintainability, that is better done in a post-processing step (piping the output of the above script).

EDIT: in the script, < and > and the characters in between were removed by the forum (which took all that for an HTML tag).

lanun

> in the script, < and > and the characters in between were removed by the forum

You can attach a .txt file instead, to be renamed as .sh before use.

amenex

Magic Banana & lanun both wrote:in the script, < and > and the characters in between were removed by the forum

On the other hand, the following script (3) from my reply below is exactly what I wrote:
diff --width=165 --suppress-common-lines -y <(wget https://amenex.com/images/emailfileAA.txt) <(wget https://trisquel.info/files/emailfile.txt) > Wget-Online-differences-emailfileAA.vs.emailfileTQ.txt
Both links to emailfile are intact & work as written.

amenex

Magic Banana observes:
Here, "No such file or directory" never appears while processing emailfile.txt.
9:37 screenshot:
diff --width=165 --suppress-common-lines -y emailfileAA.txt emailfileTQ.txt > Differences-emailfileAA.vs.emailfileTQ.txt
9:39 screenshot shows Leafpad of emailfileAA.txt on left, Abrowser of emailfileTQ.txt on right.
The two files look pretty much the same to the unaided eye.
Lines like this are unchanged except that (in the forum) everything from [caret01] to [caret02] is missing:
[caret01]mall10:1axcqxkquayjdqnmjoi9m1h2sv78h9ou5twmah[at]bf06x.hubspotemail.nut?[caret02]subject=unsubscribe>
You can find the files at MyScreenName.com/images/emailfile[s].txt online.
Observe the three-step process of getting the online differences:
(1) Use wget to scrape the emailfile[s] from the webpage[s]:
diff --width=165 --suppress-common-lines -y <(wget https://amenex.com/images/emailfileAA.txt) <(wget https://trisquel.info/files/emailfile.txt) > Wget-Online-differences-emailfileAA.vs.emailfileTQ.txt [produces a pair of wget log files.]

(2) Examine the output of the following script, which indicates the saving of the emailfile[s]:
diff --width=165 --suppress-common-lines -y wget-log.2 wget-log.3 > Wget-differences-emailfileAA.vs.emailfileTQ.txt
(3) Repeat the difference command on the emailfile[s] retrieved equally by wget:
diff --width=165 --suppress-common-lines -y emailfileAA.txt.1 emailfile.txt > OnlineWget-Differences-emailfileAA.vs.emailfileTQ.txt
which has zero bytes because ‘emailfileAA.txt.1’ saved [1753926/1753926] = ‘emailfile.txt’ saved [1753926/1753926].
Conclusion: the emailfile.txt stored at https://trisquel.info/files/emailfile.txt is the same as my original emailfile.txt
Your final observation ("EDIT: in the script, < and > and the characters in between were removed by the forum (which took all that for an HTML tag).") is therefore confirmed.

EDIT: The script choked again, except on a different key. Is this a [caret01]...link...[caret02] issue ?
In the error messages: grep: 8392: No such file or directory
In the Hairball.txt file: >em>account8
This is the X-Account-Key file, except that I couldn't sort the 2nd column unless I applied sed 's/account//g'

Applying the script to emailfile.txt:
./Script-12-MBgrep-MBdata-AllMBemails emailfile.txt >Hairball-II.txt
A different set of error messages: awk: line 14: function gensub never defined
From the output in Hairball-II.txt: account8
But there's no difference between the new script vs. yesterday's script around Line 14 (w/o blank lines).

Here are the headers from the first line of Hairball.txt after replacing the \t's with \n's:
1:Content analysis details
2:email-addr-in-contents
3:email-addr-in-headers
4:email-addr-with-at-in-contents
5:email-addr-with-at-in-headers
6:IPv4-addr-in-contents
7:IPv4-addr-in-headers
8:Message-ID
9:Return-Path
10:Subject
11:URL-in-contents
12:URL-in-headers
13:X-Account-Key
14:X-UIDL

Screenshot at 2022-01-11 09-37-01.png Screenshot at 2022-01-11 09-39-15.png
Attachment  Size
wget-02.log 2.5 KB
wget-03.log 1.73 KB
Magic Banana

function gensub never defined

Sorry about that: I forgot that the gensub function is a GNU extension (I use gawk and you probably use mawk). That script uses sub (which is in POSIX) instead of gensub:
# One key per line in $FIRST_IN_HEADERS to get its first value in the headers.
FIRST_VALUE_IN_HEADERS='X-UIDL
Return-Path
Message-ID
Subject
Content analysis details'
# One pair (name, regexp) per line in $DISTINCT_IN_HEADERS_OR_CONTENT to get the distinct matching strings in the headers and the content.
DISTINCT_IN_HEADERS_OR_CONTENT='URL https?://[a-zA-Z0-9./?=_%:-]*
IPv4-addr (([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])
email-addr [a-zA-Z0-9._]+@[a-zA-Z]+.[a-zA-Z]+
email-addr-with-at [a-zA-Z0-9._]+ *\[at\] *[a-zA-Z]+.[a-zA-Z]+'
group_by_line () {
  awk -F : -v i=2 '{
    if ($1 - i) {
      print str
      while ($1 - ++i)
        print ""
      sub(/[^:]*:/, "")
      str = $0 }
    else {
      sub(/[^:]*:/, "")
      if (str)
        str = str "," $0
      else
        str = $0 } }
  END {
    print str }'
}
TMP=$(mktemp -d)
trap "rm -r $TMP 2>/dev/null" 0
mkdir $TMP/headers $TMP/contents $TMP/fields
files=$(seq 2 $(awk -v RS='\nX-Account-Key: ' -F '\n\n' -v tmp=$TMP/ 'NR - 1 {
  printf substr($1, 1, index($1, "\n")) >> tmp "fields/X-Account-Key"
  print $1 > tmp "headers/" NR
  close(tmp "headers/" NR)
  for (i = 2; i <= NF; ++i) {
    gsub(/=\n/, "", $i)
    print $i >> tmp "contents/" NR }
  close(tmp "contents/" NR) }
END {
  print NR }' "$@"))
cd $TMP/headers
echo "$FIRST_VALUE_IN_HEADERS" | while read -r key
do
mkfifo "../fields/$key"
grep -im 1 "^ *$key:" $files |
sed -e 's/:[^:]*: */:/' -e 's/<//' -e 's/>//' |
group_by_line > "../fields/$key" &
done
for part in headers contents
do
cd ../$part
echo "$DISTINCT_IN_HEADERS_OR_CONTENT" | while read -r name regexp
do
mkfifo "../fields/$name-in-$part"
grep -Eo "$regexp" $files |
sort -ut : -k 1,1n -k 2 |
group_by_line > "../fields/$name-in-$part" &
done
done
cd ../fields/
set *
echo -n '# '
while [ -n "$2" ]
do
i=$(expr $i + 1)
echo -n "$i:$1\t"
shift
done
echo "$(expr $i + 1):$1"
paste *
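For reference, the difference between the two functions in one line each (sub, in POSIX, edits $0 in place; gensub, a GNU awk extension, returns the result):
$ echo '2781:Return-Path: x@y.z' | awk '{ sub(/[^:]*:/, ""); print }'
Return-Path: x@y.z
$ echo '2781:Return-Path: x@y.z' | gawk '{ print gensub(/[^:]*:/, "", 1) }'
Return-Path: x@y.z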

For some reason, I cannot attach the script, even with a txt extension:
An HTTP error 0 occurred.

grep: 8392: No such file or directory

Having the input would help (no such problem with the file emailfile.txt you originally attached, right?). An empty content could create such an issue. If that can happen, try to add those lines before 'cd $TMP/headers':
cd $TMP/contents
touch $files

amenex

Magic Banana lamented:For some reason, I cannot attach the script, even with a txt extension

Write the script in Leafpad; import the resulting text file into LibreOffice.writer, then save the
file in .ods format and convert it to .pdf format before attaching to the forum posting. The
reader can download the .pdf file and use copy & paste to transfer to Leafpad, whence it can be
saved to the HDD, converted with the terminal to executable mode, and then run in the terminal.

Having the input would help (no such problem with the file emailfile.txt you originally attached,
right?). An empty content could create such an issue.

The emailfile.txt in the trisquel.info/files/ files matched the one I posted online:
diff --width=165 --suppress-common-lines -y <(wget https://amenex.com/images/emailfileAA.txt) <(wget https://trisquel.info/files/emailfile.txt) > Wget-Online-differences-emailfileAA.vs.emailfileTQ.txt
You can try this at home ...
Alas, the forum's script stumbles similarly processing emailfile.txt.

I'll try Magic Banana's Tue, 01/11/2022 - 18:28 script with the suggested edit.

Magic Banana

The emailfile.txt in the trisquel.info/files/ files matched the one I posted online

That is not what I asked. I asked for the input on which you get the error "grep: 8392: No such file or directory" (that cannot be emailfile.txt, which only contains 94 emails) and whether you get such errors when processing emailfile.txt.

I have no idea why you want to show at great length (across three posts, with diff, screenshots, etc.) that two files posted to two different places are the same. That looks completely irrelevant.

amenex

Magic Banana said in response to my statement "The emailfile.txt in the trisquel.info/files/ files matched the one I posted online"

That is not what I asked. I asked for the input on which you get the error "grep: 8392: No such file or directory" (that cannot be emailfile.txt, which only contains 94 emails) and whether you get such errors when processing emailfile.txt.
Amenex processed both emailfile.txt and his own multi-MB 1998-2021.AAspam file; the latter gave the "grep: 8392 ..." error. Both arguments gave similar errors.

I have no idea why you want to show in great length (along three posts, with diff, screenshots, etc.)
that two files posted to two different places are the same. That looks completely irrelevant.
We are
having a discussion about the forum's handling of links, of which there are plenty in emailfile.txt.

At any rate, I followed my own advice to transfer the script to LibreOffice.writer and then convert it to
a .pdf file, which the forum accepts as an attachment (attached). Watch out for line-wrapping and correct
that effect. Then open the .pdf file, copy & paste the contents into Leafpad, make the Leafpad.txt file
executable with the terminal, and run with emailfile.txt as argument. There
is progress: The script completes all 94 emails, but then adds an interminable number of \n's ==> 5.0 GB.
Check the results with
ls -l | awk 'NR==90,NR==100' Hairball.txt
which amenex found to list only the final six complete lines and then blanks (of a demanded eleven) before running on ...

Attachment  Size
Script-13.1-MBgrep-MBdata-AllMBemails.pdf 20.67 KB
Script-13.1-MBgrep-MBdata-AllMBemails.pdf.txt 1.82 KB
Magic Banana

The script completes all 94 emails, but then adds an interminable number of \n's ==> 5.0 GB.

Here it properly works. I even tried the script you attached after your useless magic. The only place I see where empty lines could be printed forever is in group_by_line. But its input would have to not be ordered by increasing email position (before the first ":"). I do not see how that could be possible, since seq generates $files and sort -ut : -k 1,1n -k 2 is used before the second call of group_by_line. If you specify LC_ALL=C on the command line (before calling the script), does it make a difference?

Since you still do not provide 1998-2021.AAspam (or part of it causing the problem), I still cannot investigate the error "grep: 8392: No such file or directory".

amenex
Offline
Iscritto: 01/03/2015

OK; here's a verbatim portion of 1998-2021.AAspam that's within the forum's 2 MB limit.
Applying Magic Banana's suggested (and successful !) fix:
LC_ALL=C ./Script-13.1-MBgrep-MBdata-AllMBemails-pdf 1998-2021.AAspam-Part03.txt > Hairball-VI.txt
LC_ALL=C ./Script-13.1-MBgrep-MBdata-AllMBemails-pdf emailfile.txt > Hairball-VII.txt
LC_ALL=C ./Script-13.1-MBgrep-MBdata-AllMBemails-pdf 1998-2021.Newest > Hairball-VIII.txt

First & 2nd: 0.1 s each; 3rd: 7.8 s and 11 MB of output. No errors.

In order to maintain email-identity throughout, run the following scripts in post-processing:
1 Content analysis details: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 1 Hairball-VIII.txt) | tr -d '()' > Content_analysis_details
2 IPv4-addr-in-contents: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 2 Hairball-VIII.txt) > IPv4-addr-in-contents
3 IPv4-addr-in-headers: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 3 Hairball-VIII.txt) > IPv4-addr-in-headers
4 Message-ID: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 4 Hairball-VIII.txt) > Message-ID
5 Return-Path: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 5 Hairball-VIII.txt) > Return-Path
6 Subject: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 6 Hairball-VIII.txt) > Subject
7 URL-in-contents: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 7 Hairball-VIII.txt) > URL-in-contents
8 URL-in-headers: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 8 Hairball-VIII.txt) > URL-in-headers
9 X-Account-Key: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 9 Hairball-VIII.txt) > X-Account-Key
10 X-UIDL: cut -s -f 10 Hairball-VIII.txt > X-UIDL
11 email-addr-in-contents: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 11 Hairball-VIII.txt) > email-addr-in-contents
12 email-addr-in-headers: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 12 Hairball-VIII.txt) > email-addr-in-headers
13 email-addr-with-at-in-contents: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 13 Hairball-VIII.txt) > email-addr-with-at-in-contents
14 email-addr-with-at-in-headers: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 14 Hairball-VIII.txt) > email-addr-with-at-in-headers
14a Commonly-abused-email-addr: paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 14 Hairball-VIII.txt) | sed 's/\[at\]/@/g' > Commonly-abused-email-addr
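
The pairings above differ only in the field number and the output name, so one loop can generate all of them. A minimal sketch, assuming bash (for the process substitutions) and the field order listed above; the tr special case for field 1 and the 14a sed variant are left out for brevity:
n=1
for name in Content_analysis_details IPv4-addr-in-contents IPv4-addr-in-headers Message-ID Return-Path Subject URL-in-contents URL-in-headers X-Account-Key X-UIDL email-addr-in-contents email-addr-in-headers email-addr-with-at-in-contents email-addr-with-at-in-headers
do
  if [ $n -eq 10 ]
  then # field 10 is the X-UIDL key itself
    cut -s -f 10 Hairball-VIII.txt > "$name"
  else # prefix field $n with the X-UIDL key
    paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f $n Hairball-VIII.txt) > "$name"
  fi
  n=$((n + 1))
done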

Another characteristic is important:
grep "Delivery-date:" 1998-2021.AAspam-Part03.txt
but I've not been able to add that search to the script.

EDIT: Multiple joins are easy to accomplish; the same can be achieved with multiple cuts, executed in any order as done above, but sorting may have to be done on combinations with the same number of fields.

EDIT2: In order to enable sorting by a specific key, tag that key and use the tag as the field separator. Example, with field 1 as the key:
paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 14 Hairball-VIII.txt) <(cut -s -f 5 Hairball-VIII.txt) <(cut -s -f 9 Hairball-VIII.txt) > Part01 ; paste -d '\t' Part01 <(cut -s -f 1 Hairball-VIII.txt) | awk -F '\t' '{print $1, $2}'
This places the key field last without regard for the number of fields (when it's present). As luck would have it, that Content_analysis_details field is difficult to sort.

AllegatoDimensione
1998-2021.AAspam-Part03.txt 1.83 MB
Hairball-VI.txt 58.71 KB
Hairball-VII.txt 173.9 KB
Multi-cut-10-14-5-1-9.txt 769.64 KB
Magic Banana

I am a member!

I am a translator!

Offline
Iscritto: 07/24/2010

OK; here's a verbatim portion of 1998-2021.AAspam that's within the forum's 2 MB limit.

No error processing it here. Is there an error on your system?

Applying Magic Banana's suggested (and successful !) fix

I imagine it is more precisely sort that needs a redefinition of the locale (but aren't integers numerically ordered in the same way whatever the locale?!). That is why I added "LC_ALL=C " before "sort" in the attached script. (I discovered that JavaScript Restrictor, an extension to Abrowser, was the reason why I could not attach anything to my posts; your black magic involving Leafpad, LibreOffice and PDF is useless, of course.)

paste -d ' ' <(cut -s -f 10 Hairball-VIII.txt) <(cut -s -f 1 Hairball-VIII.txt)

Just write that:
$ awk -F \\t '{ print $10, $1 }' Hairball-VIII.txt
Or, if you do not mind keeping the fields ordered as in the input:
$ cut -f 1,10 Hairball-VIII.txt
Idem for the other pairs of fields. Nevertheless, I see no point in those many files: the script outputs everything.
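
If you really insist on those many files, though, a single awk pass could write them all at once instead of launching cut and paste repeatedly. A sketch, assuming the tab-separated Hairball-VIII.txt described above, with its header on line 1:
awk -F '\t' 'NR > 1 {
  print $10, $1 > "Content_analysis_details"
  print $10, $2 > "IPv4-addr-in-contents"
  print $10, $3 > "IPv4-addr-in-headers"
  # ... and so on for the remaining fields
}' Hairball-VIII.txt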

I've not been able to add that function to the script.

You just needed to add a new line, "Delivery-date", in FIRST_VALUE_IN_HEADERS. I did so in the attached script. Please, try to understand what I write to you: the script aims to easily scrape additional pieces of information, without having to write "functions".

Multiple join's are easy to accomplish

With the single output the script provides, there is no need for join.

In order to enable sorting by a specific key, tag that key and use the tag as the field separator.

Just use sort (here for the first field, which you deemed "difficult to sort"; change the key numbers for another field):
$ sort -t "$(printf \\t)" -k 1,1
Use 'tail +2' beforehand to remove the header of the table.

I could change the delimiter (using paste's -d option, at the very end of the script) to avoid having to provide a tabulation to sort's -t option. However, the subject of the email is one of the fields and I assume the tabulation is probably one of the rare characters that I would not expect in a subject.

AllegatoDimensione
scrape-emails.sh 2.12 KB
amenex
Offline
Iscritto: 01/03/2015

The old errors are not recurring; I successfully added a couple of new search topics:
LC_ALL=C ./Script-14-MBgrep-MBdata-AllMBemails emailfile.txt > Hairball-X.txt
This is the new order of the "columns":
1 Content analysis details
2 Delivery-date
3 IPv4-addr-in-contents
4 IPv4-addr-in-headers
5 Message-ID
6 Return-Path
7 Subject
8 URL-in-contents
9 URL-in-headers
10 X-Account-Key
11 X-Spam-Score
12 X-UIDL
13 email-addr-in-contents
14 email-addr-in-headers
15 email-addr-with-at-in-contents
16 email-addr-with-at-in-headers

Try 'em out:
paste -d ' ' <(cut -s -f 12 Hairball-X.txt) <(cut -s -f 2 Hairball-X.txt) <(cut -s -f 10 Hairball-X.txt) > Part01 ; paste -d '\t' Part01 <(cut -s -f 11 Hairball-X.txt) > Test01142022.txt ; awk -F \\t '{print $1, $2}' Test01142022.txt | sort -nrk 9,9
Even with the single tab delimiter, sort sees only space delimiters, so the column count varies whenever some of the script's outputs have more than one field. That's an issue.
The suggested use of awk runs into the same problem:
awk -F \\t '{print $12, $2, $10"&"$11}' Hairball-X.txt > Test02-01142022.txt ; sort -t&, nk9 Test02-01142022.txt
Sort has a syntax issue that's stumping me.
The script scrape-emails.sh is inaccessible to me: "You do not have permission..."

AllegatoDimensione
Hairball-X.txt 177.12 KB
Test01142022.txt 5.85 KB
Test02-01142022.txt 5.85 KB
Script-14-MBgrep-MBdata-AllMBemails.txt 2.09 KB
Magic Banana

I am a member!

I am a translator!

Offline
Iscritto: 07/24/2010

Even with the single tab delimiter, sort sees only space delimiters

No, it does not. To sort all lines in Hairball-X.txt but the first one (for some reason, that line is not tab-separated in your Hairball-X.txt, unlike in the output of the script I wrote) in reverse order of the numbers in the 11th column:
$ tail +2 Hairball-X.txt | sort -t "$(printf \\t)" -k 11,11nr

Sort has a syntax issue that's stumping me.

What you wrote makes no sense. sort's syntax does.

The script scrape-emails.sh is inaccessible to me "You do not have permission..."

I try again.

AllegatoDimensione
scrape-emails.txt 2.07 KB
amenex
Offline
Iscritto: 01/03/2015

We're seeing the same characters differently:
Amenex wrote: Even with the single tab delimiter, sort sees only space delimiters
Magic Banana demurs: No, it does not.

Output of the first part of my cut script:
paste -d ' ' <(cut -s -f 12 Hairball-X.txt) <(cut -s -f 2 Hairball-X.txt) <(cut -s -f 10 Hairball-X.txt) > Part01.txt ; paste -d '\t' Part01.txt <(cut -s -f 11 Hairball-X.txt) > Test01142022.txt
The first four fields are separated by spaces; the fifth is separated by a tab.
UID2781-1165006918 Mon, 27 Dec 2021 05:36:26 -0800 account3 -14
UID131064-1161171024 Sun, 26 Dec 2021 23:44:27 -0800 account5 211
UID29949-1465998636 account8 8
UID29962-1465998636 Mon, 27 Dec 2021 02:29:28 -0800 account8 209

The second section of the script attempts to separate the block of four fields from the fifth field with that tab:
awk -F \\t '{print $1, $2}' Test01142022.txt | sort -nrk 9,9
But the tab has disappeared (awk's comma inserts the output field separator, a space by default, so no tab survives), so I had to count fields separated by spaces to get to the sortable 9th field, which ought to have been the second field, separated from the four-field block by that tab. The Delivery-date field accounts for many of those extra seven fields in the list of strings being sorted.
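
For what it's worth, a sketch of how the tab can be preserved: set awk's output field separator to a tab as well, and hand that same tab to sort (field 2, after the tab, is the X-Spam-Score):
awk -F \\t -v OFS=\\t '{print $1, $2}' Test01142022.txt | sort -t "$(printf \\t)" -k 2,2nr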

EDIT: I downloaded scrape-emails.txt, added X-Spam-Score right after Delivery-date, and executed it as for Script14 ...
with the exact same outputs as already discussed here:
UID29990-1465998636 Tue, 28 Dec 2021 16:21:32 -0800 account8 478
UID29993-1465998636 Tue, 28 Dec 2021 23:49:35 -0800 account8 476
UID30010-1465998636 Thu, 30 Dec 2021 00:03:39 -0800 account8 474
UID29994-1465998636 Wed, 29 Dec 2021 03:56:55 -0800 account8 473

And so on. Sadly, the shorter lines, also ending in X-Spam-Score, are left out of this list. Sort remains unsolved.

AllegatoDimensione
Part01.txt 5.51 KB
Magic Banana

I am a member!

I am a translator!

Offline
Iscritto: 07/24/2010

The first four fields are separated by spaces; the fifth is separated by a tab.

The script I wrote for you outputs a tab-separated table. As I have already explained to you in my last two posts, tail +2 (to remove the header), possibly one (and only one!) cut (specifying all the desired columns after -f) and sort -t "$(printf \\t)" plus the specification of the desired ordering are a proper way to sort it. For instance:
$ cut -f 2,10-12 Hairball-X.txt | tail +2 | sort -t "$(printf \\t)" -k 3,3rn
(If, for some reason, you really want a different ordering of the columns, replace cut -f 2,10-12 with something like awk -F '\t' -v OFS='\t' '{ print $12, $2, $10, $11 }' and modify the argument of sort's -k accordingly.)

But you apparently refuse simple and maintainable solutions. You prefer several cuts and pastes, awk, sub-shells, intermediary files with horrible names, ... before finally sorting something where you cannot distinguish the columns anymore, because some tabs were replaced with spaces, which appear in some fields as well. Your sort is additionally wrong for ignoring the missing values. If you love mess, inefficiency and troubles, it is the way to go, I guess.

amenex
Offline
Iscritto: 01/03/2015

Magic Banana's cut & sort script is far more efficient and understandable:
cut -f 2,10-12 Hairball-X.txt | tail +2 | sort -t "$(printf \\t)" -k 1,1n
It sorts the times as expected, but not the dates.
In 1998-2021.Newest (i.e., the multi-megabyte main spam file) there are nearly a thousand emails with no Delivery-date field, which I managed to separate with mboxgrep, but there is a multiplicity of "Date:" entries in those emails. The "From -" field at the beginning of each such email is the date it was put into the overall file, not the date of receipt. The affected emails well deserve to be ranked as spam (probably they're malicious), but most have an X-Spam-Score of zero.
After some thought I added the subject "Date" to the FIRST_VALUE_IN_HEADERS list in the main script, and then attempted to include a script to translate the Delivery-date and Date fields to epoch time with the command
date -d "calendar date & time" +%s
which works in an ad hoc script when I put double quotes around the calendar date, but not in the main script, where it looks like this:
epoch-time01 (date -d "Delivery-date" +%s)
epoch-time02 (date -d "Date" +%s)
in the DISTINCT_IN_HEADERS_OR_CONTENT= list. That produced a longer list of fields in the main header:

1 Content analysis details
2 Date
3 Delivery-date
4 IPv4-addr-in-contents
5 IPv4-addr-in-headers
6 Message-ID
7 Return-Path
8 Subject
9 URL-in-contents
10 URL-in-headers
11 X-Account-Key
12 X-Spam-Score
13 X-UIDL
14 email-addr-in-contents
15 email-addr-in-headers
16 email-addr-with-at-in-contents
17 email-addr-with-at-in-headers
18 epoch-time01-in-contents
19 epoch-time01-in-headers
20 epoch-time02-in-contents
21 epoch-time02-in-headers

Which shows that the main script does what you say.
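
For the record, a sketch of the epoch conversion I was after, using gawk's mktime() instead of calling date once per email. Assumptions: a tab-separated table (here called table.txt, a hypothetical name) with a header line and the Delivery-date in column 3 as listed above; mktime() wants "YYYY MM DD HH MM SS", and the timezone offset is ignored:
gawk -F '\t' -v OFS='\t' 'BEGIN {
  split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", n)
  for (i in n) m[n[i]] = i                  # month name -> month number
}
NR > 1 {
  split($3, a, /[ :]/)                      # e.g. "Mon, 27 Dec 2021 05:36:26 -0800"
  $3 = mktime(a[4] " " m[a[3]] " " a[2] " " a[5] " " a[6] " " a[7])
}
{ print }' table.txt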

Magic Banana

I am a member!

I am a translator!

Offline
Iscritto: 07/24/2010

It sorts the times as expected, but not the dates.

A classical trick consists in concatenating the year, the month (its number), the day, the hours, the minutes and the seconds, in that order. For instance, AWK can convert to that format the dates in the second column of the table:
$ awk -F \\t -v OFS=\\t 'BEGIN { m["Jan"] = "01"; m["Feb"] = "02"; m["Mar"] = "03"; m["Apr"] = "04"; m["May"] = "05"; m["Jun"] = "06"; m["Jul"] = "07"; m["Aug"] = "08"; m["Sep"] = "09"; m["Oct"] = "10"; m["Nov"] = "11"; m["Dec"] = "12" } NR - 1 { split($2, a, /[ :]/); $2 = a[4] m[a[3]] a[2] a[5] a[6] a[7] } { print }'
sort -t "$(printf \\t)" -k 2,2n will then happily sort those dates/numbers.
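
Putting it together, assuming the program above is saved as convert-date.awk (a hypothetical name) and Hairball-X.txt still has its header line and the date in column 2:
awk -F \\t -v OFS=\\t -f convert-date.awk Hairball-X.txt | tail +2 | sort -t "$(printf \\t)" -k 2,2n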

Lef
Offline
Iscritto: 11/20/2021

There's a nifty tool called mboxgrep you may find useful.

amenex
Offline
Iscritto: 01/03/2015

Lef suggested mboxgrep, which I installed from the Trisquel repository and used to search for a specific term, which it did successfully, returning just two such emails, each in its entirety. Very useful!
I'd give the example, but the forum ate that contribution because of crossed postings from another user and me.
Thank you, Lef, for an excellent suggestion.

EDIT: After some expanded use I discovered there's no provision for a pattern file like the one grep uses, so I simply wrote a series of similar commands to process each email's distinctive field one at a time. That uncovered a lot of duplicate emails in the main email file, but after removing all the duplicates I found that the input list and output list are identical.

Edit2: From man mboxgrep: "If a mailbox name is omitted, or a single dash (-) is given instead, it reads from standard input. It can read mbox folders or output from another mboxgrep process from standard input."

Let's test that statement:
awk '{print "\""$1"\""}' Date-Translated-NDD-UIDs.Sorted.txt | mboxgrep - 1998-2021.AAspam-Part03.txt > Test-Output01.txt
Gets the entire target file; that's incorrect.
Do the mboxgrep searches one-at-a-time:
mboxgrep "UID25043-1465998636" 1998-2021.AAspam-Part03.txt >> Test-Output02.txt ;
mboxgrep "UID25098-1465998636" 1998-2021.AAspam-Part03.txt >> Test-Output02.txt ;
Each command gets just one email.
Repeat with the above two scripts and eighty-three more pasted into the terminal all at once so they run, one right after the other:
mboxgrep "UID25043-1465998636" 1998-2021.AAspam-Part03.txt >> Test-Output03.txt ;
mboxgrep "UID25098-1465998636" 1998-2021.AAspam-Part03.txt >> Test-Output03.txt ;
mboxgrep "UID25486-1465998636" 1998-2021.AAspam-Part03.txt >> Test-Output03.txt ;
...
mboxgrep "UID28099-1465998636" 1998-2021.AAspam-Part03.txt >> Test-Output03.txt ;
mboxgrep "UID28103-1465998636" 1998-2021.AAspam-Part03.txt >> Test-Output03.txt ;
mboxgrep "UID28132-1465998636" 1998-2021.AAspam-Part03.txt >> Test-Output03.txt ;
The entire list is attached. There's only one email with no "Delivery-date" field to be found in the much-abbreviated email file 1998-2021.AAspam-Part03.txt whose 1.8 MB represents 0.7 percent of the main file's 267 MB.
Another test:
mboxgrep -v "Delivery-date" 1998-2021.AAspam-Part03.txt > Test-Output04.txt
grep -v "Delivery-date" 1998-2021.AAspam-Part03.txt > Test-Output05.txt
The first script fails by getting the entire target file (Test-Output04.txt is the same as Test-Output01.txt), but the second script correctly captures just the one email that's missing a "Delivery-date" field.
Last test: make Part-0Fmboxgrep-Part03.txt executable as Script-Part-0Fmboxgrep-Part03.sh and then run it:
cp Part-0Fmboxgrep-Part03.txt Script-Part-0Fmboxgrep-Part03.sh ; chmod +x Script-Part-0Fmboxgrep-Part03.sh
Then
./Script-Part-0Fmboxgrep-Part03.sh
which produces Test-Output06.txt, which again is the same email file as Test-Output05.txt, Test-Output03.txt and Test-Output02.txt.

Conclusion: mboxgrep can't accept standard input as man mboxgrep claims, but the workaround is to make an executable script
listing the patterns individually in separate commands.
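
That said, a plain shell loop over the pattern file would also have replaced the eighty-five pasted commands. A sketch, assuming the UIDs sit in the first column of Date-Translated-NDD-UIDs.Sorted.txt (the output name is hypothetical):
while read -r uid rest
do
  mboxgrep "$uid" 1998-2021.AAspam-Part03.txt
done < Date-Translated-NDD-UIDs.Sorted.txt > Test-Output07.txt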

AllegatoDimensione
Date-Translated-NDD-UIDs.Sorted.txt 2.57 KB
Test-Output01.txt 1.83 MB
Test-Output02.txt 37.5 KB
Part-0Fmboxgrep-Part03.txt 6.81 KB
Test-Output05.txt 37.5 KB
Script-Part-0Fmboxgrep-Part03.sh 6.81 KB
Test-Output06.txt 37.5 KB
Test-Output03.txt 37.5 KB
Magic Banana

I am a member!

I am a translator!

Offline
Iscritto: 07/24/2010

Let's test that statement

What you "test" is not what the manual states. The manual states that the mailbox can be on the standard input, not a list of patterns. And, as usual, after misusing a command, you conclude that it is the command that works incorrectly...

Repeat with the above two scripts and eighty-three more pasted into the terminal all at once so they run, one right after the other

Back to messy and inefficient solutions. You should build a regular expression from the file:
$ mboxgrep "($(cut -d ' ' -f 1 Date-Translated-NDD-UIDs.Sorted.txt | tr \\n \| | sed 's/|$//'))" 1998-2021.AAspam-Part03.txt

amenex
Offline
Iscritto: 01/03/2015

Magic Banana is kind when he tells me nicely that I'm reading into the mboxgrep manual what I want to see and that I'm not as smart as I'd like to think I am.

Reading the logic of the proper construction of what amenex did with the 85-step script:
The backticks around the cut expression put its (UID) output on the same line with the target
of the mboxgrep expression, and the sed command gets rid of the "|" and "$" so the target
portion of the script can stay on that same line.

My next step in the project is to shoehorn the X-Spam-Score data from the output of Magic Banana's elegant script back into the UID and translated-date listing for those 85 emails, so I'd have something to plot with Grace to look for any time-related intensity of the malicious emails that are unfolding from the overall spam data in the 1998-2022.Newest file, which is way too large to post here.

Thanks to Magic Banana for his continuing interest & expert help.

Magic Banana

I am a member!

I am a translator!

Offline
Iscritto: 07/24/2010

The backticks around the cut expression put its (UID) output on the same line with the target of the mboxgrep expression, and the sed command gets rid of the "|" and "$" so the target portion of the script can stay on that same line.

No. Interpreting $(command), the shell substitutes the output of "command" into the command line (here adding parentheses and calling mboxgrep with that regular expression on 1998-2021.AAspam-Part03.txt). The output must be "UID25043-1465998636|UID25098-1465998636|...|UID28132-1465998636". The UIDs are in the first column of Date-Translated-NDD-UIDs.Sorted.txt, which cut selects. tr then turns every newline character, "\n", into "|". There may be one trailing "|" that must be deleted (because Date-Translated-NDD-UIDs.Sorted.txt ends with a newline). sed does that job, replacing (command s) the regular expression |$ (which means a trailing "|", $ being the end of the line in a regular expression) with nothing.
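
A tiny illustration of that pipeline, with a hypothetical three-line input in place of the UID file:
printf 'a\nb\nc\n' | tr \\n \| | sed 's/|$//'
prints "a|b|c", so the surrounding "($(...))" hands mboxgrep the regular expression "(a|b|c)".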

amenex
Offline
Iscritto: 01/03/2015

After considerable thought and multi-line scripting, I've managed to make a Grace plot of several thousand points, presented here for scholarly analysis. The data are the X-Spam-Scores from all the emails in 1998-2021.Newest that _have_ Delivery-date fields in Hairball-XV.txt (too large to attach here). The time scale is standard epochal time, for which there are relatively easy back-converters to local time. The actual data cover a wider range on both axes, but the negative spam scores are of little present interest (probably put in the spam folder for other-than-spam reasons), and earlier data on the time scale are rather too sparse.
There's one gap in the time scale when I wasn't able to use the computer for several weeks.

Edit: While searching for populations of similar emails amongst the many saved in the main spam file 1998-2021.Newest, I remembered one set of obviously malicious emails containing the sentence: Spam detection software, running on the system "biz291.inmotionhosting.com", has NOT identified this incoming email as spam.
Upon searching with
mboxgrep "NOT identified this incoming email as spam" 1998-2021.Newest
I discovered that about two-thirds of the emails so selected had the Delivery-date field filled, and the remaining one-third had no such field. That remaining group of emails (an excerpt is attached) cannot simply be searched with
grep "Date: " Emails-NOTspam-NDD-Excerpt.txt
because several extra emails are selected in error; the search term is scattered about elsewhere in some of the "NOTspam" emails (two extras in the present example). My aim is to gather the fields selected by "X-UIDL", "X-Spam-Score" and "Date" and then convert "Date" to epochal time with the transformation
echo "`echo UID25853-1465998636` `date -d "Sun, 9 May 2021 15:03:08 +0000 (UTC)" +%s`"
as one example. That dataset can be plotted as an overlay (red points and lines) to the second attached graph.

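A slightly cleaner form of that transformation, as a sketch: printf avoids the nested backquotes, and GNU date does the conversion exactly as in the one-off example above:
printf '%s %s\n' UID25853-1465998636 "$(date -d 'Sun, 9 May 2021 15:03:08 +0000 (UTC)' +%s)"
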
Edit2: The present embodiment of the graph of the three sets of data is attached. It is an attempt to annotate the horizontal axis with the months of the calendar years 2020 (Oct) through 2021 (Oct). There is a known procedure to overlay graphs having the same data but with axes in different units; it can't work here because the months have different numbers of days. Therefore the data4.txt file has the first day of each month expressed in epochal time, and those values are used to place "|" (character code 124 in xmgrace) along the top of the graph to signify the first day of each month.
A problem arises when I attempt to annotate each "|" with a string giving the month's name. Apparently there's a duplex procedure to do so, but the only place where it's described (https://www.originlab.com/doc/Quick-Help/Label-Ploints-with-XY) presents the steps a la YouTube, advancing so quickly that they're incomprehensible.
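
Those first-of-month values in data4.txt need not be computed by hand; a sketch, assuming GNU date, that prints the epochal time of the first day of each month from October 2020 through October 2021:
for m in 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07 2021-08 2021-09 2021-10
do
  date -d "$m-01" +%s
done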

Edit3: The approximate annotating script has been aligned somewhat better in the updated second xmgrace graph.

AllegatoDimensione
EpochalDate-Sorted-WithDD-Newest-H-XV.pdf 44.04 KB
Emails-NOTspam-NDD-Excerpt.txt 1.51 MB
data4.txt 2.42 KB
Grace_Data1_Data2_Data4_dotTxt-01252022.pdf 133.35 KB