New task: find common lines in two lists of email addresses

6 replies
amenex
Offline
Joined: 01/03/2015

See https://trisquel.info/en/forum/separating-ip-addresses-mixed-list-hostnames-and-addresses

Files:
amenex.spam.emails-only.sorted.12262021.txt
419.scam-emails-partial.txt from https://419scam.org/419-bl.htm regarding advance fee fraud emails.

Work around the "@" character, which I took to be special, by temporarily replacing it with "AT" (which does not occur in the target list):
comm -12 <(sed 's/\@/AT/g' amenex.spam.emails-only.sorted.12262021.txt | sort -k 1,1) <(sed 's/\@/AT/g' 419.scam-emails-partial.txt | sort -k 1,1) | sed 's/AT/\@/g' > amenex.scam.AT.comm.419.scam.partial.txt
That produces a list of 36 matching emails in the blink of an eye. Am I missing any?

Finding the linked list of 419-fraud addresses was fortuitous; are there other lists online that can be used similarly?

The matched email addresses enable retrieving the original emails from an archived set of several thousand,
from each of which the associated source IPv4 addresses can easily be harvested and evaluated for current
activities. Knowing what the senders were originally attempting is a useful aid.
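As a rough illustration of that harvesting step, the sketch below pulls dotted-quad addresses out of the Received: lines of one message. The message content, the file names, and the assumption that the source address appears on the first line of each Received: header are all hypothetical; real archives may need continuation lines handled as well. It relies on grep -o, which GNU and BSD grep provide but POSIX does not require.

```shell
# Hypothetical sample message; addresses are from documentation ranges.
tmp=$(mktemp -d)
cat > "$tmp/sample.eml" <<'EOF'
Received: from mail.example.net (mail.example.net [203.0.113.7])
	by mx.example.org with ESMTP; Sun, 26 Dec 2021 10:00:00 +0000
Received: from unknown (HELO relay) (198.51.100.23)
	by mail.example.net; Sun, 26 Dec 2021 09:59:58 +0000
From: alice@example.com
Subject: hypothetical sample

Body text.
EOF

# Keep only lines starting with "Received:", extract anything shaped like
# a dotted quad (octet ranges are not validated), then deduplicate.
grep -i '^Received:' "$tmp/sample.eml" |
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' |
  LC_ALL=C sort -u > "$tmp/ips.txt"
cat "$tmp/ips.txt"
```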

Attachment                                     Size
amenex.spam_.emails-only.sorted.12262021.txt   570.91 KB
419.scam-emails-partial.txt                    1.19 MB
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Your command is correct, although unnecessarily complicated. There is no need to substitute every "@", nor to sort on a key (the whole lines must be sorted). In other words, it could be:
$ comm -12 <(sort amenex.spam_.emails-only.sorted.12262021.txt) <(sort 419.scam-emails-partial.txt) > amenex.scam.AT.comm.419.scam.partial.txt
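To see the same idea on a small scale, here is a minimal sketch with made-up addresses. The file names are placeholders rather than the attachments above, and intermediate sorted files stand in for the process substitution so that it also runs in shells without that feature.

```shell
# Hypothetical sample data; not the thread's attachments.
tmp=$(mktemp -d)
printf '%s\n' 'zed@example.org' 'alice@example.com' 'bob@example.net' > "$tmp/mine.txt"
printf '%s\n' 'bob@example.net' 'carol@example.com' 'alice@example.com' > "$tmp/blocklist.txt"

# comm expects both inputs sorted in the same order, so sort whole lines first.
# LC_ALL=C makes sort and comm agree on the collation order.
LC_ALL=C sort "$tmp/mine.txt" > "$tmp/mine.sorted"
LC_ALL=C sort "$tmp/blocklist.txt" > "$tmp/blocklist.sorted"

# -1 and -2 suppress the lines unique to each file; only common lines remain.
LC_ALL=C comm -12 "$tmp/mine.sorted" "$tmp/blocklist.sorted" > "$tmp/common.txt"
cat "$tmp/common.txt"
```

The "@" characters pass through sort and comm untouched; no substitution is involved.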

amenex

Magic Banana's script is much cleaner; I was assuming that comm could not cope with the "@" character
any better than grep, which was not working for the same task, so my script was overkill as a result.
sed will reliably put that "@" back where it belongs in place of "AT" in each email when there are no
other AT's anywhere in the list.
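The caveat about other occurrences of "AT" can be shown with a hypothetical address whose local part contains those two letters. The sketch also counts the lines that already contain "AT", which is one way to check the precondition before relying on the round trip.

```shell
# Hypothetical address whose local part happens to contain "AT".
addr='PATRICK@example.com'

# Same round trip as in the thread: "@" -> "AT", then "AT" -> "@".
roundtrip=$(printf '%s\n' "$addr" | sed 's/@/AT/g' | sed 's/AT/@/g')
printf '%s -> %s\n' "$addr" "$roundtrip"

# Precondition check: count the lines that already contain "AT".
# A count of 0 would mean the substitution is reversible for that list.
at_lines=$(printf '%s\n' "$addr" 'bob@example.net' | grep -c 'AT')
echo "lines already containing AT: $at_lines"
```

Here the sample address comes back as P@RICK@example.com, so the round trip is only safe when the count is zero.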

Magic Banana

> I was assuming that comm could not cope with the "@" character any better than grep, which was not working for the same task, so my script was overkill as a result.

'@' is not a special character for any of the text-processing commands. In particular, it is not a special character in regular expressions and does not cause grep any trouble.
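A quick check of that claim, using a made-up three-line list: the pattern below contains an unescaped "@" and grep matches it literally.

```shell
# Hypothetical sample list; the third line has no "@" at all.
tmp=$(mktemp -d)
printf '%s\n' 'alice@example.com' 'bob@example.net' 'alice.example.com' > "$tmp/list.txt"

# "@" stands for itself in basic and extended regular expressions,
# so this counts only the lines that really contain "@example".
matches=$(grep -c '@example' "$tmp/list.txt")
echo "lines containing @example: $matches"
```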

amenex

Nevertheless, letting one list be the pattern file and the other be the target doesn't work with grep:
it actually found twice as many matches as there were lines in the pattern file when I tried that.

Magic Banana

You must be doing something wrong: the '@' character should cause no trouble.

amenex

Magic Banana is correct; I was expecting grep to read my mind.
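The thread does not show the grep command that was tried, so the following is only a guess at one common cause of extra matches. By default, each line of a pattern file is a regular expression matched anywhere in a target line, so substrings match and "." matches any character. Adding -F (fixed strings) and -x (whole line) restricts grep to exact matches. The addresses and file names are hypothetical.

```shell
# Hypothetical pattern and target lists.
tmp=$(mktemp -d)
printf '%s\n' 'bob@example.net' > "$tmp/patterns.txt"
printf '%s\n' 'bob@example.net' 'jimbob@example.net' 'bob@example-net' 'carol@example.com' > "$tmp/target.txt"

# Default: the pattern is a regex and may match any part of a line.
loose=$(grep -c -f "$tmp/patterns.txt" "$tmp/target.txt")

# -F treats patterns as fixed strings; -x requires the whole line to match.
strict=$(grep -c -F -x -f "$tmp/patterns.txt" "$tmp/target.txt")

echo "regex/substring matches: $loose; exact whole-line matches: $strict"
```

With one pattern, the default form reports three matching lines and the -F -x form reports one, which is the count comm -12 would give for the same two lists.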