Highlight every instance of all the members of a group of strings found in a target file

16 replies [Latest content]
amenex
Offline
Joined: 01/04/2015

Here is a list of domains that have been active in the spam folder: Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt
Spam folder: 1998-2022.Newest 326 MB (over 11 thousand, too big to list here).
Magic Banana's (and PrimeOrdeal's) script that scrapes a list of emails for pertinent data:
LC_ALL=C ./MB-Scrape-Emails-GL-PO 1998-2022.Newest | sed 's/\t\t/\tnull\t/g' > Hairball-AatAdot.Com-Emails.txt
Takes 25 seconds; still too big at 20.2 MB.
Here are the identities of the columns in the hairball file:
head -n 1 Hairball-AatAdot.Com-Emails.txt | sed 's/\t/\n/g' | sed 's/:/\t/g' | sed 's/\#\ //g' > Headers-AatAdot.Com-Emails.txt

1 Content analysis details
2 Date
3 Delivery-date
4 IPv4-addr-in-contents
5 IPv4-addr-in-headers
6 Message-ID
7 Return-Path
8 Subject
9 URL-in-contents
10 URL-in-headers
11 X-Account-Key
12 X-Spam-Score
13 X-UIDL
14 email-addr-in-contents
15 email-addr-in-headers
16 email-addr-with-at-in-contents

Condense the above array to the minimum for the present analysis:
cut -f 2,3,13,15 Hairball-AatAdot.Com-Emails.txt | awk -F '[\t]' 'NR -1 {print $3"\t"$1"\t"$2"\t"$4}' | sort -nk 1,1 > Fields-Hairball-AatAdot.Com-Emails.txt
Also too big at 2.4 MB.
List only the lines that contain the active domains:
grep -ef Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt Fields-Hairball-AatAdot.Com-Emails.txt | sed 's/Fields-Hairball-AatAdot.Com-Emails.txt\://g' | awk '{print $0}' | grep -v "Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt:" '-' > Fields-4040-domains-Hairball-AatAdot.Com-Emails.txt
There has to be a better way to keep grep under control here, but how can we highlight every active domain among these long lists, not just one-at-a-time, but every instance of every active domain, all at once ?

Attachment Size
Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt 69.24 KB
Fields-4040-domains-Hairball-AatAdot.Com-Emails.txt 1.19 MB
MB-Scrape-Emails-GL-PO.txt 2.11 KB
Magic Banana

I am a member!

Offline
Joined: 07/24/2010

Also too big at 2.4 MB.

Compress. If you do not want to lose any time, choose zstd, in Trisquel's repository. You will actually save time, reading/writing less from the disk. Just pipe to zstd right before redirecting to a file. You then read it with zstdcat. For example (although I think you should try to find more meaningful file names):
$ LC_ALL=C ./MB-Scrape-Emails-GL-PO 1998-2022.Newest | zstd > Hairball-AatAdot.Com-Emails.zstd
$ zstdcat Hairball-AatAdot.Com-Emails.zstd | ...

To favor the compression ratio over the time to compress/decompress, you can choose another compression algorithm, up to xz/xzcat for an excellent compression ratio but much time spent compressing/decompressing.
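The pipe-then-redirect pattern described above can be sketched on throwaway data. This is only an illustration, assuming xz is installed (zstd and zstdcat slot in the same way); make_sample and the file names are hypothetical stand-ins for the scraper and its output.

```shell
# Hypothetical stand-in for the scraper: prints 20000 tab-separated lines.
make_sample() {
    awk 'BEGIN { for (i = 1; i <= 20000; i++) print i "\tuser@example.com" }'
}

tmp=$(mktemp -d)

# Compress on the fly: pipe into the compressor right before the redirection.
make_sample | xz > "$tmp/sample.txt.xz"

# Read the data back without writing an uncompressed copy to disk.
lines=$(xzcat "$tmp/sample.txt.xz" | wc -l | tr -d ' ')

# Compare the uncompressed and compressed sizes, in bytes.
raw_size=$(make_sample | wc -c | tr -d ' ')
xz_size=$(wc -c < "$tmp/sample.txt.xz" | tr -d ' ')
echo "lines=$lines raw=$raw_size compressed=$xz_size"

rm -rf "$tmp"
```

Any later step of a pipeline can start from xzcat (or zstdcat) in the same way, so the uncompressed file never needs to exist on disk.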

There has to be a better way to keep grep under control here, but how can we highlight every active domain among these long lists, not just one-at-a-time, but every instance of every active domain, all at once ?

I do not understand. Please give an excerpt of the input(s) and the expected output.

As usual, the command lines you write show you do not understand them much. For instance awk '{print $0}' does absolutely nothing but repeat the input.
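A likely reason the original pipeline needed the extra sed and grep -v steps (an interpretation, not something stated in the thread): in "grep -ef LIST TARGET", the -e option takes the very next character, "f", as its pattern, so both files are searched for the letter f and every hit is prefixed with its file name. A small sketch with made-up files:

```shell
tmp=$(mktemp -d)
printf 'foo.example\n' > "$tmp/list.txt"      # hypothetical domain list
printf 'a@foo.example\nb@bar.example\nfred@baz.example\n' > "$tmp/target.txt"

# "-ef" is -e with the pattern "f"; list.txt becomes just another input file.
wrong=$(cd "$tmp" && grep -ef list.txt target.txt)

# "-f list.txt" reads the patterns from list.txt and searches target.txt only.
right=$(cd "$tmp" && grep -f list.txt target.txt)

printf 'grep -ef:\n%s\n\ngrep -f:\n%s\n' "$wrong" "$right"
rm -rf "$tmp"
```

With -f alone there are no file-name prefixes to strip, because only one file is searched.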

amenex

Another closely held command (zstd) comes to light; it wasn't mentioned among several sets of useful commands
that I recently found among linux discussions. I can almost get it to work OK here:
cut -f 2,3,13,15 Hairball-AatAdotCom-Emails.txt | awk -F '[\t]' 'NR -1 {print $3"\t"$1"\t"$2"\t"$4}' | sort -nk 1,1 | zstd -11 - - -o Fields-Hairball-AatAdotCom-Emails-11.zstd
Took 1/5th second; 473.0 kB (compressed about five times). It'll have to be renamed before the forum will accept it. So near and yet so far ...
The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed. The actual compression ratio at -11 is five.
Even a twenty-fold setting is unlikely to satisfy the 2.0 MB limitation.
Regarding some ferinstances ... pick any domain from the 4,040 in Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt, renamed and recalculated,
and search Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt, also renamed and recalculated, and search it with the chosen name:
grep -e "credipoint.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-credipointdotcom-4040fields.txt
grep -e "amenex.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-amenexdotcom-4040fields.txt
grep -e "yandex.ru" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-yandexdotru-4040fields.txt
grep -e "gmail.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-gmaildotcom-4040fields.txt

The visual effect of highlighting can be seen after opening the output file with leafpad and then using its search function, which highlights
all instances of the search term. On further thought, it might be better not to select all 4040 domain names the first time, lest everything
be highlighted. If anyone outside North America wonders, a "hairball" is what a wise old owl spits out after digesting a small rodent swallowed
whole.

Attachment Size
Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt 69.24 KB
Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt 1.19 MB
Instances-credipointdotcom-4040fields.txt 173 byte
Instances-amenexdotcom-4040fields.txt 813.53 KB
Instances-yandexdotru-4040fields.txt 10.94 KB
Instances-gmaildotcom-4040fields.txt 86.06 KB
Magic Banana


The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed.

I repeat: use a different algorithm if you want to favor the compression ratio over the execution time. With XZ:
$ time xz Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt
real 0m0,362s
user 0m0,310s
sys 0m0,020s
$ ls -l Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz
-rw-r--r-- 1 banana banana 216100 25 avril 07:54 Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz

Regarding some ferinstances ... pick any domain from the 4,040 in Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt, renamed and recalculated,
and search Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt, also renamed and recalculated, and search it with the chosen name:

First of all, Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt contains quotation marks that should not be there. Delete them:
$ sed -i 's/"//g' Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt
Then, use grep's -f option to search in the input everything listed in the file in argument:
$ xzcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz | grep -Ff Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt
Importantly, the -F option makes grep interpret every line in Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt as a fixed string, not a regular expression. In particular, a dot matches a dot, not "any single character" as in a regular expression. I have explained that to you many times.
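The difference -F makes can be seen on a two-line sample. This is a minimal sketch; the file names and addresses are made up for illustration.

```shell
tmp=$(mktemp -d)
printf 'gmail.com\n' > "$tmp/domains.txt"
printf 'alice@gmail.com\nbob@gmailxcom.example\n' > "$tmp/emails.txt"

# Without -F the dot is a regular-expression wildcard, so "gmailxcom" matches too.
regex_hits=$(grep -c -f "$tmp/domains.txt" "$tmp/emails.txt")

# With -F every pattern line is a literal string, so only "gmail.com" matches.
fixed_hits=$(grep -c -Ff "$tmp/domains.txt" "$tmp/emails.txt")

echo "regex: $regex_hits line(s), fixed: $fixed_hits line(s)"
rm -rf "$tmp"
```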

amenex

Regarding my statement:
The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed.
Near the bottom of the complete man zstd page, it's said that
-- All arguments after -- are treated as files
which gave me leave to list the intended compression level before the "--umax" modifier. That got me a minuscule increase in compression ratio.
Magic Banana-style scripts suitably modified:
zstd -11 Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt -o Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt

Nice ! All the hits are highlighted in red typeface; exactly what I'd wished. Waitaminnit ... there's a catch:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Highlighted-4040-DomainFields-MB.txt
Took 0.024 second; same 5415-row list as before, but not retaining the highlighting that appears in the terminal.
How shall I retain that red typeface in the terminal output when redirecting to an output file ?

Attachment Size
Highlighted-4040-DomainFields-MB.txt 1.19 MB
Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt 1.19 MB
Magic Banana


With option --color=always:
$ zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Highlighted-4040-DomainFields-MB.txt
Read Highlighted-4040-DomainFields-MB.txt with less:
$ less Highlighted-4040-DomainFields-MB.txt

amenex

The --color=always option to grep changes the text color of all the grep matches when the output of the script is scrolled in the
terminal with |more, but there are over five thousand lines of output, and only the codes appear at each end of each match in the
file saved to disk. What open-source application puts those codes to use ?

Magic Banana


I told you: less.

amenex

Here's what almost works:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Colored-4040-DomainFields-MB.txt
Except that the highlighting disappears upon saving the output to a file.
Another set of domains:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff DomainsThruZ.txt > Colored-2107-DomainFields-MB.txt
Same number of output rows because one domain appears in both domain lists.
Online there appears a claimed solution to less's coyness:
https://superuser.com/questions/36022/getting-colored-results-when-using-a-pipe-from-grep-to-less
where it's said:
When you simply run grep --color it implies grep --color=auto which detects whether the output is a terminal and
if so enables colors. However, when it detects a pipe it disables coloring. The following command will always enable
coloring and override the automatic detection, and you will get the color highlighting in less:

grep --color=always -R "search string" * | less
Alas, when I attempt to send stdout to a file:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -R -Ff DomainsThruZ.txt | less > Output-04252022.txt
Trouble became apparent at 1.2 GB; by the time the script was aborted, 1.7 GB had accumulated. The first five hundred
lines of that output are attached.
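A possible explanation for the runaway output (a guess, not confirmed anywhere in the thread): with GNU grep, a recursive search that is given no file operand examines the working directory instead of standard input. The piped data would then be ignored while grep reads every file in the directory, possibly including the output file as it grows. A small sketch with made-up file names, assuming GNU grep:

```shell
tmp=$(mktemp -d)
printf 'needle on disk\n' > "$tmp/on-disk.txt"

# With -R and no file operand, the working directory is searched, not the pipe.
recursive=$(cd "$tmp" && printf 'needle from stdin\n' | grep -R -F needle)

# Without -R, grep reads the pipe as intended.
plain=$(cd "$tmp" && printf 'needle from stdin\n' | grep -F needle)

printf 'with -R:    %s\nwithout -R: %s\n' "$recursive" "$plain"
rm -rf "$tmp"
```

If that is what happened, dropping -R (and the "| less" stage, which adds nothing when the output is redirected) should bring the command back to the behavior of the earlier ones.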

Attachment Size
Colored-4040-DomainFields-MB.txt 1.6 MB
Colored-2107-DomainFields-MB.txt 1.52 MB
DomainsThruZ.txt 30.75 KB
Output-04252022.txt 184.81 KB
Magic Banana


Here's what almost works

Here, it does work. I repeat, read the output file with less:
$ less Colored-4040-DomainFields-MB.txt

Online there appears a claimed solution to less's coyness

It is the solution I gave you too: grep's option --color=always.

amenex

... Which was apparent some days ago, but only in the terminal. What application lets me scroll up & down through
several thousand lines of sometimes-highlighted text ? That's another question that's been repeated several times.

Magic Banana


Which was apparent some days ago, but only in the terminal.

Yes, only terminals and terminal emulators interpret the ANSI escape sequences grep (and other terminal programs, e.g., ls) writes to color its output: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors
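To see what grep --color=always actually writes, the escape sequences can be made visible with cat -v, or stripped again with sed. This is a minimal sketch on a one-line sample; the file names are hypothetical, and the sed expression only targets the "m" and "K" sequences that appear in grep's output here.

```shell
tmp=$(mktemp -d)
printf 'george@georgesbasement.com\n' > "$tmp/emails.txt"

# Force color even though the output goes to a file.
grep --color=always -F georgesbasement.com "$tmp/emails.txt" > "$tmp/colored.txt"

# cat -v prints the escape character as ^[ so the codes become readable.
visible=$(cat -v "$tmp/colored.txt")

# Strip the color (ESC[...m) and erase-line (ESC[K) sequences to get plain text back.
esc=$(printf '\033')
stripped=$(sed "s/${esc}\[[0-9;]*[mK]//g" "$tmp/colored.txt")

printf 'with codes: %s\nstripped:   %s\n' "$visible" "$stripped"
rm -rf "$tmp"
```

The extra bytes of these sequences are also why the colored attachments are larger than the plain one (1.6 MB against 1.19 MB).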

What application lets me scroll up & down through several thousand lines of sometimes-highlighted text ? That's another question that's been repeated several times.

I have repeated the answer as many times: less.

amenex

My abject humble reply: When in doubt, read the directions. man less describes how to scroll up, down, etc. throughout the file.

Two weeks later, output file at the ready; now using Trisquel 10
less Highlighted-4040-DomainFields-MB.txt elicits the query:
"Highlighted-4040-DomainFields-MB.txt" may be a binary file. See it anyway?
I replied "Y" but no colors appear; just ESC codes. Scrolling up, down & sideways fully functional.

Attachment Size
Highlighted-4040-DomainFields-MB.txt 1.6 MB
Magic Banana


Is Highlighted-4040-DomainFields-MB.txt compressed with zstd? If so:
$ zstdcat Highlighted-4040-DomainFields-MB.txt | less
less can read .gz files, .bz2 files, .xz files, ... but not .zstd files.

amenex

Didn't use zstd ...
cat Highlighted-4040-DomainFields-MB.txt | less
still doesn't activate the color.
Man less doesn't say anything about color. Color palette in terminal is activated.

Magic Banana


I had not noticed your attachment in your previous post. Here, on Debian 11 (but that should not make a difference), everything works as expected: 'less Highlighted-4040-DomainFields-MB_0.txt' shows the domains in red.

amenex

Even without the cat, there are escape codes but no color.
Here are the details of a typical "highlighted" email address:
george@ESC[01;31mESC[Kgeorgesbasement.comESC[mESC[K,
The attached screenshots show my feeble attempt to "fix" the problem.
See also https://github.com/microsoft/vscode/issues/21423 where it's said:
ESC[31mESC[1m must set the color to red and then switch to high intensity.
but that's not our escape sequence ...

EDIT: etiona's less & Mate terminal exhibit the identical behavior to nabia's: No colors.

EDIT: See: https://phoenixnap.com/kb/less-command-in-linux where it's said:
-g Highlights the string last found using search. By default, less highlights all strings matching the last search command.
The highlight codes are all present, but less fails to convert the codes to colors of the bracketed domains.

EDIT: See: https://askubuntu.com/questions/39731/terminal-colors-not-working where it's said:
A number of fixes are enumerated; I hesitate to try any or all without expert advice.
But my terminal displays the prompt in green and URL's in blue, so the failure to process less's codes is more specific.
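One thing not tried in the thread: less has an -R (--RAW-CONTROL-CHARS) option that sends ANSI color sequences through to the terminal instead of showing them as ESC. Whether a missing -R explains the difference between the two machines is an assumption; it would be consistent with the other system having -R set through the LESS environment variable, but nothing here confirms that. A minimal sketch on a made-up one-line sample:

```shell
tmp=$(mktemp -d)
colored="$tmp/colored.txt"   # hypothetical file name
printf 'george@georgesbasement.com\n' |
    grep --color=always -F georgesbasement.com > "$colored"

if [ -t 1 ]; then
    # Interactive terminal: -R lets the color sequences through (press q to quit).
    less -R "$colored"
else
    # Not a terminal (output piped or redirected): just emit the raw bytes.
    cat "$colored"
fi
```

For the file in question that would be "less -R Highlighted-4040-DomainFields-MB.txt"; running "export LESS=-R" beforehand makes -R the default for the rest of the shell session.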

Screenshot at 2022-05-16 07-42-39.png Screenshot at 2022-05-16 07-42-07.png