Reversing the order of two columns, one of IPv4 addresses and the other, of email addresses

amenex
Offline
Joined: 01/03/2015

Without trying to post any malevolent material, here's the problem in plain English:
Starting with a 400+MB file of archival emails, I searched that file using grep to
list all the lines containing one of a long list of current spammy domain names.
Only a few of those domains turned up, but associated with many IPv4 addresses and
quite a few email addresses, all of them over a decade old. There being no way that I
know of for running dig or whois searches retrospectively, I'd like to concatenate
a number of lists of the email addresses and their long-ago-resolved IPv4 addresses,
some of them with the dates the emails were posted. As the headers and bodies of those
emails don't have a consistent pattern, processing leaves me with the data columns in
various orders. There are many sed commands needed to clean up the unwanted verbiage,
other commands to remove unwanted white space, and even to delete trailing dots. I've
looked up and tried a dozen or so methods claimed to be infallible, but I cannot change
the order of the various two-, three-, or four-column files that ensue. A typical
file is ten to fifty lines long, but I hesitate to apply manual editing.
As an example, with the simplest two-column file with the IPv4 addresses in column
one and the email addresses in column two, I'm finding that many solutions have no
effect at all, and a few drop about half of the IPv4 addresses in the new 2nd column.
Thinking that the double meanings of the dots are interfering, I used sed to replace
all the \. with \&, reverse the columns with awk, and then reverse the sed substitution.
Again, half the IPv4 addresses disappear, but the alphanumeric email addresses get
mangled even though the remaining IPv4 addresses are intact.
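When the columns arrive in varying order, one option is to stop relying on position and classify each field by what it looks like. The sketch below is only an illustration: the sample rows, the function name normalize_columns, and the deliberately loose IPv4 and email patterns are assumptions, not taken from the archive.

```shell
#!/bin/sh
# Hypothetical helper: print "email<TAB>IPv4" whatever order the fields arrive in.
normalize_columns() {
    awk '
    {
        ip = ""; mail = ""
        for (i = 1; i <= NF; i++) {
            f = $i
            sub(/\.$/, "", f)               # drop one trailing dot, if any
            if (f ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
                ip = f                      # looks like a dotted quad
            else if (f ~ /@/)
                mail = f                    # looks like an email address
        }
        if (ip != "" && mail != "")
            print mail "\t" ip
    }'
}

# Made-up sample rows with the two fields in both orders.
printf '%s\n' \
    '192.0.2.10 alice@example.com' \
    'bob@example.org 198.51.100.7.' |
normalize_columns
```

Because the dots are only ever matched inside a regular expression applied to a single field, they need no temporary substitution before the columns are rearranged.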

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

As always: please provide an input, exemplifying all possible cases, and the related expected output.

amenex
Offline
Joined: 01/03/2015

Magic Banana rightly requests provenance; alas, a previous attempt to do so, based
on an uploaded text file, vanished without a trace, so I reasoned that my problem
(perhaps also a 'puter problem) resulted from extraneous leftover B.S. that is
invisible to the human eye but plays havoc with scripts, so I printed the original
backwards (IPv4 first, then email) version, copied and pasted it into a new text
file, and applied the usual simple awk script to reverse the presentation of the
columns. Here is the result that I could not get yesterday with every trick I could
imagine:
24-113-123-181.wavecable.com. 24.113.123.181
24-113-145-220.wavecable.com. 24.113.145.220
24-113-170-18.wavecable.com 24.113.170.18
24-113-219-179.wavecable.com 24.113.219.179
24-113-72-6.wavecable.com. 24.113.72.6
76-14-122-19.rk.wavecable.com. 76.14.122.19
76-14-166-123.wsac.wavecable.com 76.14.166.123
76-14-186-170.wsac.wavecable.com. 76.14.186.170
76-14-198-2.or.wavecable.com. 76.14.198.2
76-14-214-53.or.wavecable.com. 76.14.214.53
76-14-217-79.or.wavecable.com 76.14.217.79
76-14-246-255.or.wavecable.com 76.14.246.255
76-14-246-255.or.wavecable.com. 76.14.246.255
static-76-14-252-167.or.wavecable.com. 76.14.252.167

There are pesky trailing dots still there which are exceedingly troublesome to remove.
Even if I had removed them manually, the original file was unresponsive to a sort -u command
because of the invisible B.S. that often causes "file not found" complaints which I grudgingly
deal with by copying and pasting the offending file name from the preceding outputs into the
current script candidate. Without attempting to upload the original file, here is the script
(with apologies for not memorizing more efficient scripting techniques):
cat Redacted-9KB.txt | grep "host\ " | sed 's/\(getting name\)//g' | tr -d '()=' | sed 's/host//g' | sed 's/cached//g' | awk '{$1=$1};1' | awk 'FS="\t" {print $1"\t"$2}' | sort -u
which produced the preceding lines of output in the reverse order.
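Invisible characters of the kind suspected above are often carriage returns or trailing blanks, and they can be made visible instead of being removed by copy and paste. A minimal sketch, assuming the culprit is a CR or trailing whitespace (the sample lines and the function name clean_lines are made up for illustration):

```shell
#!/bin/sh
# Made-up sample: line 2 ends in a carriage return, line 3 in a trailing space.
sample=$(printf 'a.example.com. 192.0.2.1\na.example.com. 192.0.2.1\r\nb.example.com 192.0.2.2 \n')

# od -c prints every byte, so \r and stray blanks become visible.
printf '%s\n' "$sample" | od -c

# Hypothetical cleanup: delete CRs, trim trailing blanks,
# and drop a dot that ends a field or the line.
clean_lines() {
    tr -d '\r' |
    sed -e 's/[[:space:]]*$//' -e 's/\.\([[:space:]]\)/\1/g' -e 's/\.$//'
}

# After cleaning, the three sample lines collapse to two under sort -u.
printf '%s\n' "$sample" | clean_lines | sort -u
```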

Magic Banana

Offline
Joined: 07/24/2010

Please provide Redacted-9KB.txt or an excerpt. In this way, we can remove as well the useless processing steps you apply (cat a single file to a pipe, useless parentheses in sed, several sed programs instead of using -e, the first awk program that does nothing as far as I understand, etc.).

amenex
Offline
Joined: 01/03/2015

Magic Banana hopes to get the source file for my inquiry. The original vanished when I tried
to upload it, so I'm trying again with a version within which I've redacted all the domain
names, but not the IPv4 addresses. Let's see what happens. The redaction script follows:
awk '{print $0}' SpamCop-QuickReports-ResolvedDomains-Part01.txt | sed 's/indosat\.net\.id/redact-318755393\.info/g' | sed 's/evisionmail\.com/redact-297281989\.info/g' | sed 's/wavecable\.com/redact-296962503\.info/g' | sed 's/maximumasp\.com/redact-296932254\.info/g' | sed 's/amega\.com/redact-254057415\.info/g' | sed 's/skjj\.com/redact-161638839\.info/g' | sed 's/cesmail\.net/redact-113344556\.info/g' | sed 's/inmotionhosting\.com/redact-998877665\.info/g' > SpamCop-QuickReports-RedactedDomains-Part01.txt
My inefficient processing scripts are to collect as many different email addresses containing the \@ symbol as well as their associated IPv4 addresses, plus receipt dates if present, presented with email address in column 1, IPv4 address in column 2, and the date script (later to be converted to epochal) in column 3. My method has been to make grep searches on key words in the 181-row list gleaned from the original 400+ MB email archive, aiming to concatenate and run sort -u on the total, followed later on by comparison to a currently amassed archive. I've been using sed to remove extraneous verbiage, tr to remove enclosing () and [] from IPv4 addresses, and various other "tricks" to remove unwanted white space, blank lines, etc.
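The redaction can probably be done by a single sed process that reads the file itself, with one -e expression per domain. The sketch below shows only two of the eight substitutions; the function name redact_domains and the sample lines are made up.

```shell
#!/bin/sh
# Hypothetical wrapper: one sed process, one -e per domain.
# With no file argument it reads standard input.
redact_domains() {
    sed -e 's/wavecable\.com/redact-296962503.info/g' \
        -e 's/cesmail\.net/redact-113344556.info/g' "$@"
}

# Made-up sample lines.
printf '%s\n' \
    'host 192.0.2.1 = 192-0-2-1.wavecable.com.' \
    'user@cesmail.net and other@cesmail.net' |
redact_domains
```

The dot only needs escaping on the pattern side of s///; in the replacement it is already literal.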

Attachment    Size
SpamCop-QuickReports-RedactedDomains-Part01.txt 9.78 KB

Magic Banana

Offline
Joined: 07/24/2010

$ grep 'host ' SpamCop-QuickReports-RedactedDomains-Part01.txt | sed -e 's/ *host //' -e 's/ (getting name)//' -e 's/ = /\t/' -e 's/ (cached)//' -e 's/\.$//' | sort -u
24.113.123.181 24-113-123-181.redact-296962503.info
24.113.145.220 24-113-145-220.redact-296962503.info
24.113.170.18 24-113-170-18.redact-296962503.info
24.113.219.179 24-113-219-179.redact-296962503.info
24.113.72.6 24-113-72-6.redact-296962503.info
76.14.122.19 76-14-122-19.rk.redact-296962503.info
76.14.166.123 76-14-166-123.wsac.redact-296962503.info
76.14.186.170 76-14-186-170.wsac.redact-296962503.info
76.14.198.2 76-14-198-2.or.redact-296962503.info
76.14.214.53 76-14-214-53.or.redact-296962503.info
76.14.217.79 76-14-217-79.or.redact-296962503.info
76.14.246.255 76-14-246-255.or.redact-296962503.info
76.14.252.167 static-76-14-252-167.or.redact-296962503.info
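
For readers without the attachment, here is a sketch of what each sed expression above does, followed by the column swap. The input line shapes are assumptions reconstructed from the expressions themselves, not copied from the real file, and the function names are made up. A literal tab is passed to sed so the sketch does not depend on sed interpreting \t.

```shell
#!/bin/sh
tab=$(printf '\t')   # literal tab character

# Assumed input shapes:
#   "   host ADDRESS (getting name) = NAME."
#   "host ADDRESS = NAME (cached)"
extract_hosts() {
    grep 'host ' |
    sed -e 's/ *host //' \
        -e 's/ (getting name)//' \
        -e "s/ = /$tab/" \
        -e 's/ (cached)//' \
        -e 's/\.$//' |
    sort -u
}

# Swap the two tab-separated columns: name first, address second.
swap_columns() {
    awk -F '\t' -v OFS='\t' '{ print $2, $1 }'
}

printf '%s\n' \
    '   host 192.0.2.1 (getting name) = 192-0-2-1.example.net.' \
    'host 192.0.2.1 = 192-0-2-1.example.net (cached)' \
    'unrelated line' |
extract_hosts | swap_columns
```

Both sample lines reduce to the same record once the trailing dot and the annotations are gone, so sort -u keeps a single row.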

amenex
Offline
Joined: 01/03/2015

Great ! I reversed the order of the two columns with awk.
(1) Now let's get all the dates, which has taken two steps so far:
grep 'SMTP; ' SpamCop-QuickReports-RedactedDomains-Part01.txt | sed -e 's/Received\:\ \ //' -e 's/1\:\ Received://' -e 's/2\:\ Received\://' -e 's/from\ //' -e 's/HELO\ //' -e 's/unknown\ //' -e 's/by\ //' -e 's/with\ SMTP\;\ //' | awk '{$1=$1};1' | tr -d '([;])' | sed -e 's/\%\:\%/\t/g' > SpamCop-QuickReports-RedactedDomains-Part01D-11282022.txt
I had to copy that output into another instance of featherpad to remove invisible [suspected] code, then apply another script:
awk 'FS="\t" {print $1,$2,$3,$4,$5,$6,$7,$8}' SpamCop-QuickReports-RedactedDomains-Part01D-11282022.txt > SpamCop-QuickReports-RedactedDomains-Part01D1-11282022.txt
The dates are all there, but the email addresses and IPv4 addresses are in a bizarre field order.
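A possible contributor to the odd field order: in awk 'FS="\t" {...}' the assignment sits in the pattern position, so it is evaluated once per record, after that record has already been split. On POSIX-conforming awks the first line is then split on blanks and only the later lines on tabs. A sketch of the more usual form, which sets the separator before any input is read (the sample data and the function name first_two_fields are made up):

```shell
#!/bin/sh
# Separator set with -F and OFS with -v, before the first record is read.
first_two_fields() {
    awk -F '\t' -v OFS='\t' '{ print $1, $2 }'
}

# Made-up sample: tab-separated, with a space inside the second field.
printf 'a\tb c\td\ne\tf g\th\n' | first_two_fields

# For comparison, the form used in the thread; with many awks its first
# output line differs from the corresponding line printed above.
printf 'a\tb c\td\ne\tf g\th\n' | awk 'FS="\t" {print $1"\t"$2}'
```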

(2) grep '2\:\ Received\:\ from ' SpamCop-QuickReports-RedactedDomains-Part01.txt | sed 's/2\:\ Received\:\ from\ //' | tr -d '([;])' | sed -e 's/by\ kkjxfuf\.com\ with\ SMTP\ //' > SpamCop-QuickReports-redact-Group01.info.txt
OK but the first two columns are mixed; date & time are the last six columns.

(3) Another amenex effort; the early tries are included with explanatory comments:
grep 'Tracking message source\:' SpamCop-QuickReports-RedactedDomains-Part01.txt | sed -e 's/Tracking\ message\ source\://' | sed -e 's/warning\:Looks\ like\ a\ forgery\ \ \ //' | sed -e 's/^[ \t]*//' | sed -e 's/\:\ \ \ Cached\ whois\ for\ /\ \:\ /' > SpamCop-QuickReports-RedactedDomains-Part01-11282022.txt
Copied & pasted to remove mysterious stuff:
awk 'FS=" : " {print $0}' SpamCop-QuickReports-RedactedDomains-Part01-11282022.txt | sed -e 's/\ /\%/g' | awk 'FS="%:%" {print $1"\t"$2"\t"$3}' | sed -e 's/\%\:\%/\t/g'
Stopped the mangling.
awk 'FS=" : " {print $0}' SpamCop-QuickReports-RedactedDomains-Part01-11282022.txt | sed -e 's/\ /\%/g' | awk 'FS="%:%" {print $1"\t"$2"\t"$3}' | sed -e 's/\%\:\%/\t/g' > SpamCop-QuickReports-RedactedDomains-Part01A-11282022.txt
Copied & pasted [again] into SpamCop-QuickReports-RedactedDomains-Part01A-11282022.txt.
awk 'FS="\t" {print $3"\t"$1}' SpamCop-QuickReports-RedactedDomains-Part01A-11282022.txt | sort -u
[ sort -u ] erases the first row of the output.
awk 'FS="\t" {print $3"\t"$1}' SpamCop-QuickReports-RedactedDomains-Part01A-11282022.txt > SpamCop-QuickReports-RedactedDomains-Part01B.txt
Perfect 19-row output.
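Note that sort -u both reorders lines and drops exact duplicates, so a row that seems to vanish may simply have moved or been merged with an identical one. A sketch for checking this, on made-up data (nothing here comes from the attached files):

```shell
#!/bin/sh
# Made-up sample with one duplicated row.
rows=$(printf 'b@example.org\t198.51.100.7\na@example.com\t192.0.2.10\nb@example.org\t198.51.100.7\n')

# Count lines before and after; LC_ALL=C makes the ordering byte-wise.
before=$(printf '%s\n' "$rows" | wc -l)
after=$(printf '%s\n' "$rows" | LC_ALL=C sort -u | wc -l)
echo "before: $before  after: $after"

# List the rows that occur more than once, i.e. the ones sort -u collapses.
dupes=$(printf '%s\n' "$rows" | LC_ALL=C sort | uniq -d)
printf 'duplicated: %s\n' "$dupes"
```

If the two counts differ by exactly the number of duplicated rows, nothing was lost; the first input row has only changed position.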

(4) Yet one more effort:
grep 'Cached whois for ' SpamCop-QuickReports-RedactedDomains-Part01.txt | sed -e 's/\:/\ \:\ /g' | sed 's/\ \ Cached\ whois\ for\ //' | sed -e 's/\Tracking\ message\ source\ \:\ //' | awk '{$1=$1};1' | sed -e 's/\warning\ \:\ Looks\ like\ a\ forgery\ //' | sed -e 's/\ /\%/g' | awk 'FS="%:%" {print $1"\t"$2"\t"$3}' | sed -e 's/%:%/\t/g' > SpamCop-QuickReports-RedactedDomains-Part01C-11282022.txt
Mixture of two-column records and three-column records. Output copied and pasted into SpamCop-QuickReports-RedactedDomains-Part01C1-11282022.txt
Mixed two- and three-column outputs, not yet in proper format. You probably won't find any difference between these last two files, but for me this step is often essential.
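Mixed two- and three-column output can be regularized by field count instead of by eye. A sketch, assuming tab-separated records; the sample rows, the "-" placeholder, and the function name pad_to_three are made up:

```shell
#!/bin/sh
# Hypothetical helper: give every tab-separated record exactly three columns,
# padding missing ones with "-" so later steps can rely on positions.
pad_to_three() {
    awk -F '\t' -v OFS='\t' '{ for (i = NF + 1; i <= 3; i++) $i = "-"; print }'
}

# Made-up sample: one two-column record, one three-column record.
printf 'a@example.com\t192.0.2.10\nb@example.org\t198.51.100.7\t2012-01-05\n' |
pad_to_three
```

Records that already have three fields pass through unchanged; a filter such as NF == 2 or NF == 3 in the same awk call would instead split the two kinds into separate outputs.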
All the output files are attached.

Attachment    Size
SpamCop-QuickReports-RedactedDomains-Part01D-11282022.txt 1.6 KB
SpamCop-QuickReports-RedactedDomains-Part01D1-11282022.txt 1.7 KB
SpamCop-QuickReports-redact-296962503.info_.txt 677 bytes
SpamCop-QuickReports-RedactedDomains-Part01-11282022.txt 1.18 KB
SpamCop-QuickReports-RedactedDomains-Part01A-11282022.txt 1.11 KB
SpamCop-QuickReports-RedactedDomains-Part01B.txt 871 bytes
SpamCop-QuickReports-RedactedDomains-Part01C-11282022.txt 1.54 KB
SpamCop-QuickReports-RedactedDomains-Part01C1-11282022.txt 1.54 KB