Highlight every instance of all the members of a group of strings found in a target file

25 replies [Last post]
amenex
Offline
Joined: 01/04/2015

Here is a list of domains that have been active in the spam folder: Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt
Spam folder: 1998-2022.Newest 326 MB (over 11 thousand, too big to list here).
Magic Banana's (and PrimeOrdeal's) script that scrapes a list of emails for pertinent data:
LC_ALL=C ./MB-Scrape-Emails-GL-PO 1998-2022.Newest | sed 's/\t\t/\tnull\t/g' > Hairball-AatAdot.Com-Emails.txt
Takes 25 seconds; still too big at 20.2 MB.
Here are the identities of the columns in the hairball file:
head -n 1 Hairball-AatAdot.Com-Emails.txt | sed 's/\t/\n/g' | sed 's/:/\t/g' | sed 's/\#\ //g' > Headers-AatAdot.Com-Emails.txt

1 Content analysis details
2 Date
3 Delivery-date
4 IPv4-addr-in-contents
5 IPv4-addr-in-headers
6 Message-ID
7 Return-Path
8 Subject
9 URL-in-contents
10 URL-in-headers
11 X-Account-Key
12 X-Spam-Score
13 X-UIDL
14 email-addr-in-contents
15 email-addr-in-headers
16 email-addr-with-at-in-contents

Condense the above array to the minimum for the present analysis:
cut -f 2,3,13,15 Hairball-AatAdot.Com-Emails.txt | awk -F '[\t]' 'NR -1 {print $3"\t"$1"\t"$2"\t"$4}' | sort -nk 1,1 > Fields-Hairball-AatAdot.Com-Emails.txt
Also too big at 2.4 MB.
List only the lines that contain the active domains:
grep -ef Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt Fields-Hairball-AatAdot.Com-Emails.txt | sed 's/Fields-Hairball-AatAdot.Com-Emails.txt\://g' | awk '{print $0}' | grep -v "Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt:" '-' > Fields-4040-domains-Hairball-AatAdot.Com-Emails.txt
There has to be a better way to keep grep under control here, but how can we highlight every active domain among these long lists, not just one-at-a-time, but every instance of every active domain, all at once ?

Attachment Size
Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt 69.24 KB
Fields-4040-domains-Hairball-AatAdot.Com-Emails.txt 1.19 MB
MB-Scrape-Emails-GL-PO.txt 2.11 KB
Magic Banana

I am a member!

Offline
Joined: 07/24/2010

Also too big at 2.4 MB.

Compress. If you do not want to lose any time, choose zstd, in Trisquel's repository. You will actually save time, reading/writing less from the disk. Just pipe to zstd right before redirecting to a file. You then read it with zstdcat. For example (although I think you should try to find more meaningful file names):
$ LC_ALL=C ./MB-Scrape-Emails-GL-PO 1998-2022.Newest | zstd > Hairball-AatAdot.Com-Emails.zstd
$ zstdcat Hairball-AatAdot.Com-Emails.zstd | ...

To favor the compression ratio over the time to compress/decompress, you can choose another compression algorithm, up to xz/xzcat for an excellent compression ratio but much time spent compressing/decompressing.

There has to be a better way to keep grep under control here, but how can we highlight every active domain among these long lists, not just one-at-a-time, but every instance of every active domain, all at once ?

I do not understand. Please give an excerpt of the input(s) and the expected output.

As usual, the command lines you write show you do not understand them much. For instance, awk '{print $0}' does absolutely nothing but repeat the input.
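
One more detail of the command quoted above may be worth spelling out: grep reads "-ef LIST FILE" as "-e f" (the pattern is the single letter f) followed by two files to search, which would explain why file-name prefixes had to be stripped afterwards. A minimal sketch with made-up throwaway files (domains.txt and fields.txt are hypothetical names, not the attachments):

```shell
# Hypothetical throwaway files, only for illustration.
tmp=$(mktemp -d)
printf 'foo.example\nbar.example\n' > "$tmp/domains.txt"
printf '1\tfoo.example\n2\tother.example\n' > "$tmp/fields.txt"

# "-ef" is parsed as "-e f": the pattern is the letter f and BOTH files are
# searched, so every output line is prefixed with the file it came from.
as_posted=$(cd "$tmp" && grep -ef domains.txt fields.txt)

# "-f domains.txt" reads the patterns from the list; "-F" treats them as fixed strings.
intended=$(cd "$tmp" && grep -Ff domains.txt fields.txt)

printf 'as posted:\n%s\n\nintended:\n%s\n' "$as_posted" "$intended"
rm -r "$tmp"
```

With -f given on its own, only one file is searched, so no prefix appears and the trailing sed and grep -v steps are unnecessary.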

amenex
Offline
Joined: 01/04/2015

Another closely held command (zstd) comes to light; it wasn't mentioned among several sets of useful commands
that I recently found among linux discussions. I can almost get it to work OK here:
cut -f 2,3,13,15 Hairball-AatAdotCom-Emails.txt | awk -F '[\t]' 'NR -1 {print $3"\t"$1"\t"$2"\t"$4}' | sort -nk 1,1 | zstd -11 - - -o Fields-Hairball-AatAdotCom-Emails-11.zstd
Took 1/5th second; 473.0 kB (compressed about five times). It'll have to be renamed before the forum will accept it. So near and yet so far ...
The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed. The actual compression ratio at -11 is five.
Even a twenty-fold setting is unlikely to satisfy the 2.0 MB limitation.
Regarding some ferinstances ... pick any domain from the 4,040 in Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt, renamed and recalculated,
and search Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt, also renamed and recalculated, and search it with the chosen name:
grep -e "credipoint.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-credipointdotcom-4040fields.txt
grep -e "amenex.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-amenexdotcom-4040fields.txt
grep -e "yandex.ru" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-yandexdotru-4040fields.txt
grep -e "gmail.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-gmaildotcom-4040fields.txt
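
Instead of one grep per domain, all listed domains can be tallied in a single pass. The following is only a sketch, assuming GNU grep (for -o) and using tiny made-up stand-ins for the domain list and the fields file:

```shell
# Hypothetical stand-ins for the domain list and the fields file.
tmp=$(mktemp -d)
printf 'gmail.com\nyandex.ru\n' > "$tmp/domains.txt"
printf '1\ta@gmail.com\n2\tb@yandex.ru,c@gmail.com\n3\td@example.org\n' > "$tmp/fields.txt"

# -o prints each match on its own line; -F and -f take fixed strings from the list.
# sort | uniq -c counts the occurrences of each domain; sort -rn puts the busiest first.
counts=$(grep -oFf "$tmp/domains.txt" "$tmp/fields.txt" | sort | uniq -c | sort -rn)

printf '%s\n' "$counts"
rm -r "$tmp"
```

Because matches on one line do not overlap, a domain that is a substring of another listed domain may be undercounted.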

The visual effect of highlighting can be seen after opening the output file with leafpad and then using its search function, which highlights
all instances of the search term. On further thought, it might be better not to select all 4040 domain names the first time, lest everything
be highlighted. If anyone outside North America wonders, a "hairball" is what a wise old owl spits out after digesting a small rodent swallowed
whole.

Attachment Size
Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt 69.24 KB
Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt 1.19 MB
Instances-credipointdotcom-4040fields.txt 173 bytes
Instances-amenexdotcom-4040fields.txt 813.53 KB
Instances-yandexdotru-4040fields.txt 10.94 KB
Instances-gmaildotcom-4040fields.txt 86.06 KB
Magic Banana

I am a member!

Offline
Joined: 07/24/2010

The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed.

I repeat: use a different algorithm if you want to favor the compression ratio over the execution time. With XZ:
$ time xz Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt
real 0m0,362s
user 0m0,310s
sys 0m0,020s
$ ls -l Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz
-rw-r--r-- 1 banana banana 216100 25 avril 07:54 Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz

Regarding some ferinstances ... pick any domain from the 4,040 in Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt, renamed and recalculated,
and search Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt, also renamed and recalculated, and search it with the chosen name:

First of all, Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt contains quotation marks that should not be there. Delete them:
$ sed -i 's/"//g' Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt
Then, use grep's -f option to search in the input everything listed in the file in argument:
$ xzcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz | grep -Ff Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt
Importantly, the -F option makes grep interpret every line in Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt as a fixed string, not as a regular expression. In particular, a dot matches a dot, not "any single character" as in a regular expression. I have explained that to you many times.
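
The difference is easy to see on a two-line sample. This is a small sketch with made-up addresses, not data from the attachments:

```shell
# Made-up sample: only the first line really contains "gmail.com".
sample=$(printf 'a@gmail.com\nb@gmailxcom.example\n')

# As a regular expression, the dot matches any single character: both lines match.
regex_hits=$(printf '%s\n' "$sample" | grep -c -e 'gmail.com')

# With -F the pattern is a fixed string: only a literal dot matches.
fixed_hits=$(printf '%s\n' "$sample" | grep -c -F -e 'gmail.com')

echo "regex: $regex_hits line(s), fixed string: $fixed_hits line(s)"
```

Here the regular expression counts two lines and the fixed string only one.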

amenex
Offline
Joined: 01/04/2015

Regarding my statement:
The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed.
Near the bottom of the complete man zstd page, it's said that:
-- All arguments after -- are treated as files
which gave me leave to list the intended compression level before the "--umax" modifier. That got me a minuscule increase in compression ratio.
Magic Banana-style scripts suitably modified:
zstd -11 Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt -o Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt

Nice ! All the hits are highlighted in red typeface; exactly what I'd wished. Waitaminnit ... there's a catch:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Highlighted-4040-DomainFields-MB.txt
Took 0.024 second; same 5415-row list as before, but not retaining the highlighting that appears in the terminal.
How shall I retain that red typeface in the terminal output when redirecting to an output file ?

Attachment Size
Highlighted-4040-DomainFields-MB.txt 1.19 MB
Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt 1.19 MB
Magic Banana

I am a member!

Offline
Joined: 07/24/2010

With option --color=always:
$ zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Highlighted-4040-DomainFields-MB.txt
Read Highlighted-4040-DomainFields-MB.txt with less:
$ less Highlighted-4040-DomainFields-MB.txt

amenex
Offline
Joined: 01/04/2015

The --color=always option to grep changes the text color of all the grep matches when the output of the script is scrolled in the
terminal with |more, but there are over five thousand lines of output, and only the codes appear at each end of each match in the
file saved to disk. What open-source application puts those codes to use ?
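
If the plain text is ever wanted back from such a file, the codes can be deleted again. The sketch below assumes the sequences have the usual "ESC [ ... m" and "ESC [ K" shape that GNU grep writes; the sample line is made up:

```shell
esc=$(printf '\033')   # the ESC character that starts every sequence

# Made-up line; --color=always makes grep wrap the match in escape sequences.
colored=$(printf 'george@georgesbasement.com,\n' | grep --color=always -F 'georgesbasement.com')

# Delete every "ESC [ digits-and-semicolons" sequence ending in m or K.
plain=$(printf '%s\n' "$colored" | sed "s/${esc}\[[0-9;]*[mK]//g")

printf '%s\n' "$plain"
```

The same sed expression can be applied to a saved file by giving the file name as sed's argument.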

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

I told you: less.

amenex
Offline
Joined: 01/04/2015

Here's what almost works:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Colored-4040-DomainFields-MB.txt
Except that the highlighting disappears upon saving the output to a file.
Another set of domains:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff DomainsThruZ.txt > Colored-2107-DomainFields-MB.txt
Same number of output rows because one domain appears in both domain lists.
Online there appears a claimed solution to less's coyness:
https://superuser.com/questions/36022/getting-colored-results-when-using-a-pipe-from-grep-to-less
where it's said:
When you simply run grep --color it implies grep --color=auto which detects whether the output is a terminal and
if so enables colors. However, when it detects a pipe it disables coloring. The following command will always enable
coloring and override the automatic detection, and you will get the color highlighting in less:

grep --color=always -R "search string" * | less
Alas, when I attempt to send stdout to a file:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -R -Ff DomainsThruZ.txt | less > Output-04252022.txt
Trouble became apparent at 1.2 GB; by the time the script was aborted, 1.7 GB had accumulated. The first five hundred
lines of that output are attached.

Attachment Size
Colored-4040-DomainFields-MB.txt 1.6 MB
Colored-2107-DomainFields-MB.txt 1.52 MB
DomainsThruZ.txt 30.75 KB
Output-04252022.txt 184.81 KB
Magic Banana

I am a member!

Offline
Joined: 07/24/2010

Here's what almost works

Here, it does work. I repeat, read the output file with less:
$ less Colored-4040-DomainFields-MB.txt

Online there appears a claimed solution to less's coyness

It is the solution I gave you too: grep's option --color=always.

amenex
Offline
Joined: 01/04/2015

... Which was apparent some days ago, but only in the terminal. What application lets me scroll up & down through
several thousand lines of sometimes-highlighted text ? That's another question that's been repeated several times.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

Which was apparent some days ago, but only in the terminal.

Yes, only terminals and terminal emulators interpret the ANSI escape sequences grep (and other terminal programs, e.g., ls) writes to color its output: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors
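
To see those sequences without a terminal interpreting them, the output can be passed through cat -v, which prints the ESC character as ^[. A small sketch on a made-up line, assuming a grep that supports --color:

```shell
line='george@georgesbasement.com,'

# --color=auto: stdout is a pipe here, not a terminal, so grep writes no escape sequences.
auto=$(printf '%s\n' "$line" | grep --color=auto -F 'georgesbasement.com' | cat -v)

# --color=always: the sequences are written regardless; cat -v makes ESC visible as ^[.
always=$(printf '%s\n' "$line" | grep --color=always -F 'georgesbasement.com' | cat -v)

printf 'auto:   %s\nalways: %s\n' "$auto" "$always"
```

The first line comes out unchanged, while the second carries the codes around the match. This is why a plain redirection loses the highlighting unless --color=always is given.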

What application lets me scroll up & down through several thousand lines of sometimes-highlighted text ? That's another question that's been repeated several times.

I have repeated the answer as many times: less.

amenex
Offline
Joined: 01/04/2015

My abject humble reply: When in doubt, read the directions. man less describes how to scroll up, down, etc. throughout the file.

Two weeks later, output file at the ready; now using Trisquel 10
less Highlighted-4040-DomainFields-MB.txt elicits the query:
"Highlighted-4040-DomainFields-MB.txt" may be a binary file. See it anyway?
I replied "Y" but no colors appear; just ESC codes. Scrolling up, down & sideways fully functional.

Attachment Size
Highlighted-4040-DomainFields-MB.txt 1.6 MB
Magic Banana

I am a member!

Offline
Joined: 07/24/2010

Is Highlighted-4040-DomainFields-MB.txt compressed with zstd? If so:
$ zstdcat Highlighted-4040-DomainFields-MB.txt | less
less can read .gz files, .bz2 files, .xz files, ... but not .zstd files.

amenex
Offline
Joined: 01/04/2015

Didn't use zstd ...
cat Highlighted-4040-DomainFields-MB.txt | less
still doesn't activate the color.
Man less doesn't say anything about color. Color palette in terminal is activated.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

I had not noticed your attachment in your previous post. Here, on Debian 11 (but that should not make a difference), everything works as expected: 'less Highlighted-4040-DomainFields-MB_0.txt' shows the domains in red.

amenex
Offline
Joined: 01/04/2015

Even without the cat, there are escape codes but no color.
Here's the details of a typical "highlighted" email address:
george@ESC[01;31mESC[Kgeorgesbasement.comESC[mESC[K,
The attached screenshots show my feeble attempt to "fix" the problem.
See also https://github.com/microsoft/vscode/issues/21423 where it's said:
ESC[31mESC[1m must set the color to red and then switch to high intensity.
but that's not our escape sequence ...

EDIT: etiona's less & Mate terminal exhibit the identical behavior to nabia's: No colors.

EDIT: See: https://phoenixnap.com/kb/less-command-in-linux where it's said:
-g Highlights the string last found using search. By default, less highlights all strings matching the last search command.
The highlight codes are all present, but less fails to convert the codes to colors of the bracketed domains.

EDIT: See: https://askubuntu.com/questions/39731/terminal-colors-not-working where it's said:
A number of fixes are enumerated; I hesitate to try any or all without expert advice.
But my terminal displays the prompt in green and URL's in blue, so the failure to process less's codes is more specific.

Screenshot at 2022-05-16 07-42-39.png Screenshot at 2022-05-16 07-42-07.png
amenex
Offline
Joined: 01/04/2015

After being dumbfounded by the myriad of suggested changes to enable the command less to display
highlighting as discussed at length in this discussion, I tried the most simple command sequence:
less Highlighted-4040-DomainFields-MB.txt
then at the less: prompt, type -r or -R
whereupon the ESC codes are all activated.

This brings another problem to the forefront: There appears to be no way of turning off the less: terminal's word-wrap "feature."
There are options to control horizontal (i.e., left-and-right) scrolling, which are not needed when word-wrap is in effect.

How do I turn off that word-wrap annoyance ?

EDIT Attempting to answer my own question, I found https://unix.stackexchange.com/questions/475005/turning-off-word-wrap-with-less-during-paging where it's said to type -S at the less: prompt, but that has the effect to either chop long lines or (invoked again) fold long lines but nothing actually happens. But further on in the same link: To disable line wrapping in terminal more generally, you can use:
setterm -linewrap off
before invoking the less: command.
Alas, at first, line-wrapping is indeed off, but scrolling horizontally with ctr-shift-> scrolls to the right but also turns on word-wrap.

EDIT Another workaround is to zoom out with control - which reduces the typeface size at the expense of readability.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

less -S Highlighted-4040-DomainFields-MB_0.txt works as expected here: with color and truncated lines.

amenex
Offline
Joined: 01/04/2015

The display looks very pretty without the line-wrap feature, but there's none of the scrolling left-to-right
that's listed in man less which isn't needed when line-wrap is in effect. Catch-22 ?

The terminal allows zooming out sufficiently to display all of each line, but then the text is illegible.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

man less says:
ESC-) or RIGHTARROW
Scroll horizontally right N characters, default half the screen width (see the -# option). If a number N is specified, it becomes the default for future RIGHTARROW and LEFTARROW commands. While the text is scrolled, it acts as though the -S option (chop lines) were in effect.
ESC-( or LEFTARROW
Scroll horizontally left N characters, default half the screen width (see the -# option). If a number N is specified, it becomes the default for future RIGHTARROW and LEFTARROW commands.
ESC-} or ^RIGHTARROW
Scroll horizontally right to show the end of the longest displayed line.
ESC-{ or ^LEFTARROW
Scroll horizontally left back to the first column.

It works here.

amenex
Offline
Joined: 01/04/2015

Magic Banana quotes the less man page and then states It works here.

Look more closely. When line-wrap is in effect, you don't need to scroll left-to-right or right-to-left
because the entire line is displayed, albeit folded-over. It's distracting when searching for patterns.

When line-wrap is turned off, all that is displayed is the front of each line; the rest is chopped off
(truncated) and horizontal scrolling to see the invisible portion is impossible.

Where are those right & left arrows ? On the T420's keyboard there's the Tab key and the > & < keys (shift+control+>).
Neither method works with line-wrap turned off, except that (shift+control+>) goes to the end of the file and
(shift+control+<) goes back to the beginning of the file. The Esc-) and Esc-( key combinations don't work at all.

The present exemplar file has all of the pertinent domains highlighted; once it starts to work satisfactorily, the
plan is to look for patterns based on the most obvious abusers of the ring buffer principle. Ring buffers are used
to smooth (buffer) inputs to popular sites, where one or two dozen domains reside at different addresses (servers).
Abused ring buffers have non-word domain names or absurd numbers of IP addresses, turned on & off frequently so as
to be less accessible.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

Where are those right & left arrows ?

Looking at https://guide-images.cdn.ifixit.com/igi/rjbNXb6CrOFxbfJE.medium they are in the bottom-right corner of the keyboard.

The Esc-) and Esc-( key combinations don't work at all.

They are not combinations. You press Esc and then either ( or ).

amenex
Offline
Joined: 01/04/2015

All solved:

To open the exemplar file per Magic Banana:
less -S Highlighted-4040-DomainFields-MB_0.txt
At the less: prompt, type -R to activate the highlighting colors.

Press Esc, then ): scroll one screen view to the right; repeat: scroll some more; repeat until the last row tail is uncovered.
Press Esc, then (: scroll to the left one screen view; repeat as above to get back to the beginning of the rows.

Mouse wheel: scroll up and down to see everything at each screen view. Think of studying wallpaper, one panel at a time.

Edit

If you have sharp eyesight, place the cursor in the terminal header and press [control -] a couple of times.
That can minimize the number of [Esc, )] scroll commands needed. This last step isn't reversible.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

At the less: prompt, type -R to activate the highlighting colors.

You should be able to directly call less with that option:
$ less -RS Highlighted-4040-DomainFields-MB_0.txt

Press Esc, then ): scroll one screen view to the right; repeat: scroll some more; repeat until the last row tail is uncovered.
Press Esc, then (: scroll to the left one screen view; repeat as above to get back to the beginning of the rows.

Using the arrows looks more convenient to me.

amenex
Offline
Joined: 01/04/2015

Magic Banana is correct on both counts; thank you !