New task: find common lines in two lists of email addresses
See https://trisquel.info/en/forum/separating-ip-addresses-mixed-list-hostnames-and-addresses
Files:
amenex.spam.emails-only.sorted.12262021.txt
419.scam-emails-partial.txt from https://419scam.org/419-bl.htm regarding advance fee fraud emails.
Work around the "@" character by replacing it with "AT", which does not occur in the target list:
comm -12 <(sed 's/\@/AT/g' amenex.spam.emails-only.sorted.12262021.txt | sort -k 1,1) <(sed 's/\@/AT/g' 419.scam-emails-partial.txt | sort -k 1,1) | sed 's/AT/\@/g' > amenex.scam.AT.comm.419.scam.partial.txt
produces a list of 36 matching emails in the blink of an eye. Am I missing any?
Finding the linked list of 419-fraud addresses was fortuitous; are there other lists online that can be used similarly?
The matched email addresses enable retrieving the original emails from an archived set of several thousand,
from each of which the associated source IPv4 addresses can easily be harvested and evaluated for current
activities. Knowing what the senders were originally attempting is a useful aid.
Attachment | Size |
---|---|
amenex.spam_.emails-only.sorted.12262021.txt | 570.91 KB |
419.scam-emails-partial.txt | 1.19 MB |
Your command is correct, although unnecessarily complicated. There is no need to substitute every "@", nor to sort on a key (the whole lines must be sorted). In other words, it could be:
$ comm -12 <(sort amenex.spam_.emails-only.sorted.12262021.txt) <(sort 419.scam-emails-partial.txt) > amenex.scam.AT.comm.419.scam.partial.txt
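To see what the simplified command does, here is a minimal sketch of the same comparison on two tiny lists. The addresses and file names are invented for illustration, and the sorted copies are written to temporary files instead of using process substitution so that the sketch also runs in a plain POSIX shell.

```shell
# Toy data: these addresses and file names are made up, not taken from the real lists.
tmp=$(mktemp -d)
printf '%s\n' 'carol@example.org' 'alice@example.com' 'bob@example.net' > "$tmp/spam.txt"
printf '%s\n' 'dave@example.com' 'bob@example.net' 'alice@example.com' > "$tmp/scam.txt"

# comm expects both inputs sorted; sort the whole lines, no key needed.
sort "$tmp/spam.txt" > "$tmp/spam.sorted"
sort "$tmp/scam.txt" > "$tmp/scam.sorted"

# -1 suppresses lines unique to the first file, -2 those unique to the second,
# so only the lines common to both remain.
common=$(comm -12 "$tmp/spam.sorted" "$tmp/scam.sorted")
printf '%s\n' "$common"

rm -r "$tmp"
```

The two addresses present in both toy lists are printed, with the "@" left untouched throughout.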
Magic Banana's script is much cleaner; I was assuming that comm could not cope with the "@" character
any better than grep, which was not working for the same task, so my script was overkill as a result.
sed will reliably put that "@" back where it belongs in place of "AT" in each email when there are no
other AT's anywhere in the list.
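The caveat about other AT's can be illustrated with a small sketch. The address below is hypothetical; it simply contains an upper-case "AT" in its local part, which the round trip then mangles.

```shell
# Hypothetical address whose local part already contains "AT".
addr='PAT@example.com'

# Replace "@" with "AT", then turn every "AT" back into "@".
roundtrip=$(printf '%s\n' "$addr" | sed 's/@/AT/g' | sed 's/AT/@/g')

printf '%s\n' "$roundtrip"   # prints P@@example.com
```

Because the second sed cannot tell the inserted "AT" from the original one, both are converted, which is another reason to leave the "@" alone.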
I was assuming that comm could not cope with the "@" character any better than grep, which was not working for the same task, so my script was overkill as a result.
'@' is not a special character for any of the text-processing commands. In particular, it is not a special character in regular expressions and does not cause grep any trouble.
Nevertheless, using one list as grep's pattern file and the other as the target doesn't work;
when I tried that, it found twice as many matches as there were lines in the pattern file.
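One plausible explanation, not confirmed in the thread, is that grep treats each pattern line as a regular expression and reports any line containing a match: "." matches any character, and a pattern can match inside a longer address. The sketch below uses invented addresses and assumes the pattern list was passed with -f; adding -F (fixed strings) and -x (whole-line match) restricts grep to exact lines.

```shell
# Toy data: one pattern and four target lines, all invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'bob@example.net' > "$tmp/patterns.txt"
printf '%s\n' 'bob@example.net' 'jimbob@example.net' \
              'bob@example.net.au' 'bob@exampleXnet' > "$tmp/target.txt"

# Default: patterns are regular expressions matched anywhere in the line.
loose=$(grep -c -f "$tmp/patterns.txt" "$tmp/target.txt")

# -F: patterns are literal strings; -x: the whole line must match.
exact=$(grep -c -F -x -f "$tmp/patterns.txt" "$tmp/target.txt")

echo "loose=$loose exact=$exact"   # prints loose=4 exact=1

rm -r "$tmp"
```

With -F and -x, grep -f reports only the target lines that are identical to a pattern line, although duplicate lines in the target would still each be counted, unlike with comm on de-duplicated input.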