Highlight every instance of all the members of a group of strings found in a target file
Here is a list of domains that have been active in the spam folder: Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt
Spam folder: 1998-2022.Newest 326 MB (over 11 thousand, too big to list here).
Magic Banana's (and PrimeOrdeal's) script that scrapes a list of emails for pertinent data:
LC_ALL=C ./MB-Scrape-Emails-GL-PO 1998-2022.Newest | sed 's/\t\t/\tnull\t/g' > Hairball-AatAdot.Com-Emails.txt
Takes 25 seconds; still too big at 20.2 MB.
Here are the identities of the columns in the hairball file:
head -n 1 Hairball-AatAdot.Com-Emails.txt | sed 's/\t/\n/g' | sed 's/:/\t/g' | sed 's/\#\ //g' > Headers-AatAdot.Com-Emails.txt
1 Content analysis details
2 Date
3 Delivery-date
4 IPv4-addr-in-contents
5 IPv4-addr-in-headers
6 Message-ID
7 Return-Path
8 Subject
9 URL-in-contents
10 URL-in-headers
11 X-Account-Key
12 X-Spam-Score
13 X-UIDL
14 email-addr-in-contents
15 email-addr-in-headers
16 email-addr-with-at-in-contents
Condense the above array to the minimum for the present analysis:
cut -f 2,3,13,15 Hairball-AatAdot.Com-Emails.txt | awk -F '[\t]' 'NR -1 {print $3"\t"$1"\t"$2"\t"$4}' | sort -nk 1,1 > Fields-Hairball-AatAdot.Com-Emails.txt
Also too big at 2.4 MB.
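For readers following along, here is a minimal sketch of what that cut/awk/sort pipeline does, run on a made-up three-line stand-in for the hairball file. The sample rows, the placeholder column names (c1, c4, ...) and the temporary file names are assumptions for illustration only; `NR > 1` is used as a more explicit way of skipping the header line than `NR -1`.

```shell
# Build a tiny tab-separated stand-in for the scraped file (hypothetical data).
tmp=$(mktemp -d)
{
  printf '# c1\tDate\tDelivery-date\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tc12\tX-UIDL\tc14\temail-addr-in-headers\tc16\n'
  printf 'x\t2022-01-02\t2022-01-03\tx\tx\tx\tx\tx\tx\tx\tx\tx\t20\tx\tb@example.org\tx\n'
  printf 'x\t2021-05-06\t2021-05-07\tx\tx\tx\tx\tx\tx\tx\tx\tx\t3\tx\ta@example.com\tx\n'
} > "$tmp/emails.tsv"

# cut keeps fields 2, 3, 13 and 15 in file order; awk skips the header line
# and moves the third kept field (X-UIDL) to the front; sort then orders the
# rows numerically on that first field.
cut -f 2,3,13,15 "$tmp/emails.tsv" |
  awk -F '\t' -v OFS='\t' 'NR > 1 {print $3, $1, $2, $4}' |
  sort -n -k 1,1 > "$tmp/fields.tsv"

cat "$tmp/fields.tsv"
```

The row with X-UIDL 3 comes out before the row with X-UIDL 20, which a plain (non-numeric) sort would not guarantee.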
List only the lines that contain the active domains:
grep -ef Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt Fields-Hairball-AatAdot.Com-Emails.txt | sed 's/Fields-Hairball-AatAdot.Com-Emails.txt\://g' | awk '{print $0}' | grep -v "Amenex.dot.com.Hairball-ILV-domain-list-SUQd.txt:" '-' > Fields-4040-domains-Hairball-AatAdot.Com-Emails.txt
There has to be a better way to keep grep under control here, but how can we highlight every active domain among these long lists, not just one-at-a-time, but every instance of every active domain, all at once ?
Attachment | Size |
---|---|
Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt | 69.24 KB |
Fields-4040-domains-Hairball-AatAdot.Com-Emails.txt | 1.19 MB |
MB-Scrape-Emails-GL-PO.txt | 2.11 KB |
Also too big at 2.4 MB.
Compress. If you do not want to lose any time, choose zstd, in Trisquel's repository. You will actually save time, reading/writing less from the disk. Just pipe to zstd right before redirecting to a file. You then read it with zstdcat. For example (although I think you should try to find more meaningful file names):
$ LC_ALL=C ./MB-Scrape-Emails-GL-PO 1998-2022.Newest | zstd > Hairball-AatAdot.Com-Emails.zstd
$ zstdcat Hairball-AatAdot.Com-Emails.zstd | ...
To favor the compression ratio over the time to compress/decompress, you can choose another compression algorithm, up to xz/xzcat for an excellent compression ratio but much time spent compressing/decompressing.
There has to be a better way to keep grep under control here, but how can we highlight every active domain among these long lists, not just one-at-a-time, but every instance of every active domain, all at once ?
I do not understand. Please give an excerpt of the input(s) and the expected output.
As usual, the command lines you write show you do not understand them much. For instance, awk '{print $0}' does nothing but repeat its input.
Another closely held command (zstd) comes to light; it wasn't mentioned among several sets of useful commands
that I recently found among linux discussions. I can almost get it to work OK here:
cut -f 2,3,13,15 Hairball-AatAdotCom-Emails.txt | awk -F '[\t]' 'NR -1 {print $3"\t"$1"\t"$2"\t"$4}' | sort -nk 1,1 | zstd -11 - - -o Fields-Hairball-AatAdotCom-Emails-11.zstd
Took 1/5th second; 473.0 kB (compressed about five times). It'll have to be renamed before the forum will accept it. So near and yet so far ...
The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed. The actual compression ratio at -11 is five.
Even a twenty-fold setting is unlikely to satisfy the 2.0 MB limitation.
Regarding some ferinstances ... pick any domain from the 4,040 in Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt, renamed and recalculated,
and search Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt, also renamed and recalculated, and search it with the chosen name:
grep -e "credipoint.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-credipointdotcom-4040fields.txt
grep -e "amenex.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-amenexdotcom-4040fields.txt
grep -e "yandex.ru" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-yandexdotru-4040fields.txt
grep -e "gmail.com" Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt > Instances-gmaildotcom-4040fields.txt
The visual effect of highlighting can be seen after opening the output file with leafpad and then using its search function, which highlights
all instances of the search term. On further thought, it might be better not to select all 4040 domain names the first time, lest everything
be highlighted. If anyone outside North America wonders, a "hairball" is what a wise old owl spits out after digesting a small rodent swallowed
whole.
Attachment | Size |
---|---|
Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt | 69.24 KB |
Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt | 1.19 MB |
Instances-credipointdotcom-4040fields.txt | 173 bytes |
Instances-amenexdotcom-4040fields.txt | 813.53 KB |
Instances-yandexdotru-4040fields.txt | 10.94 KB |
Instances-gmaildotcom-4040fields.txt | 86.06 KB |
The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed.
I repeat: use a different algorithm if you want to favor the compression ratio over the execution time. With XZ:
$ time xz Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt
real 0m0,362s
user 0m0,310s
sys 0m0,020s
$ ls -l Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz
-rw-r--r-- 1 banana banana 216100 25 avril 07:54 Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz
Regarding some ferinstances ... pick any domain from the 4,040 in Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt, renamed and recalculated,
and search Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt, also renamed and recalculated, and search it with the chosen name:
First of all, Amenexdotcom-Hairball-ILV-domain-list-SUQd.txt contains quotation marks that should not be there. Delete them:
$ sed -i 's/"//g' Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt
Then, use grep's -f option to search in the input everything listed in the file in argument:
$ xzcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.xz | grep -Ff Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt
Importantly, the -F option makes grep interpret every line in Amenex.dot_.com_.Hairball-ILV-domain-list-SUQd.txt as a fixed string, not a regular expression. In particular, a dot matches a dot, not "any single character" as in a regular expression. I have explained that to you many times.
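The difference -F makes can be seen on a toy example. This is only a sketch: the two-domain list and the four sample rows below are invented for illustration and stand in for the real domain list and fields file.

```shell
tmp=$(mktemp -d)
printf 'gmail.com\nyandex.ru\n' > "$tmp/domains.txt"
printf '1\tjoe@gmail.com\n2\tjoe@gmailxcom.example\n3\tann@yandex.ru\n4\tbob@example.org\n' > "$tmp/fields.txt"

# Without -F every line of domains.txt is a regular expression, so the dot
# in "gmail.com" matches any character and "gmailxcom" is a hit too.
grep -c -f "$tmp/domains.txt" "$tmp/fields.txt"      # prints 3

# With -F the same lines are literal strings: only rows 1 and 3 match.
grep -c -F -f "$tmp/domains.txt" "$tmp/fields.txt"   # prints 2
```

With thousands of domains in the list, -F also spares grep from compiling each line as a pattern.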
Regarding my statement:
The 20 MB file is beyond the reach of zstd until the --ultra secret handshake is revealed.
Near the bottom of the complete man zstd page, it's said that "-- All arguments after -- are treated as files",
which gave me leave to list the intended compression level before the "--umax" modifier.
That got me a minuscule increase in compression ratio.
Magic Banana-style scripts suitably modified:
zstd -11 Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt -o Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt
Nice ! All the hits are highlighted in red typeface; exactly what I'd wished. Waitaminnit ... there's a catch:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Highlighted-4040-DomainFields-MB.txt
Took 0.024 second; same 5415-row list as before, but not retaining the highlighting that appears in the terminal.
How shall I retain that red typeface in the terminal output when redirecting to an output file ?
Attachment | Size |
---|---|
Highlighted-4040-DomainFields-MB.txt | 1.19 MB |
Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt | 1.19 MB |
With option --color=always:
$ zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Highlighted-4040-DomainFields-MB.txt
Read Highlighted-4040-DomainFields-MB.txt with less:
$ less Highlighted-4040-DomainFields-MB.txt
The --color=always option to grep changes the text color of all the grep matches when the output of the script is scrolled in the
terminal with |more, but there are over five thousand lines of output, and only the codes appear at each end of each match in the
file saved to disk. What open-source application puts those codes to use ?
I told you: less.
Here's what almost works:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff Amenexdotcom-Hairball-ILV-domain-list-SU.txt > Colored-4040-DomainFields-MB.txt
Except that the highlighting disappears upon saving the output to a file.
Another set of domains:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -Ff DomainsThruZ.txt > Colored-2107-DomainFields-MB.txt
Same number of output rows because one domain appears in both domain lists.
Online there appears a claimed solution to less's coyness:
https://superuser.com/questions/36022/getting-colored-results-when-using-a-pipe-from-grep-to-less
where it's said:
When you simply run grep --color it implies grep --color=auto which detects whether the output is a terminal and
if so enables colors. However, when it detects a pipe it disables coloring. The following command will always enable
coloring and override the automatic detection, and you will get the color highlighting in less:
grep --color=always -R "search string" * | less
Alas, when I attempt to send stdout to a file:
zstdcat Fields-4040-domains-Hairball-AatAdotCom-Emails-MB.txt.zstd | grep --color=always -R -Ff DomainsThruZ.txt | less > Output-04252022.txt
Trouble became apparent at 1.2 GB; by the time the script was aborted, 1.7 GB had accumulated. The first five hundred
lines of that output are attached.
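A possible explanation for the runaway output, assuming GNU grep: with -R and no file operand, grep searches the working directory instead of reading the pipe, so it would keep re-reading Output-04252022.txt while that same file is being written. The sketch below drops -R and the pointless less (which only copies its input when its output is a file). The sample files are made up, and cat stands in for zstdcat so that the example runs without zstd.

```shell
tmp=$(mktemp -d)
printf 'yandex.ru\n' > "$tmp/domains.txt"
printf '1\tann@yandex.ru\n2\tbob@example.org\n' > "$tmp/fields.txt"

# No -R: grep reads only what arrives on the pipe.
# No less: the colored lines go straight to the output file.
cat "$tmp/fields.txt" |
  grep --color=always -Ff "$tmp/domains.txt" > "$tmp/colored.txt"

# The output has one line per matching input line and stops there.
wc -l < "$tmp/colored.txt"
```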
Attachment | Size |
---|---|
Colored-4040-DomainFields-MB.txt | 1.6 MB |
Colored-2107-DomainFields-MB.txt | 1.52 MB |
DomainsThruZ.txt | 30.75 KB |
Output-04252022.txt | 184.81 KB |
Here's what almost works
Here, it does work. I repeat, read the output file with less:
$ less Colored-4040-DomainFields-MB.txt
Online there appears a claimed solution to less's coyness
It is the solution I gave you too: grep's option --color=always.
... Which was apparent some days ago, but only in the terminal. What application lets me scroll up & down through
several thousand lines of sometimes-highlighted text ? That's another question that's been repeated several times.
Which was apparent some days ago, but only in the terminal.
Yes, only terminals and terminal emulators interpret the ANSI escape sequences grep (and other terminal programs, e.g., ls) writes to color its output: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors
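To see that the colors really are stored in the redirected file as extra bytes, here is a small sketch. The one-line input is invented for illustration; od -c is used because it prints the escape byte visibly (as octal 033) instead of letting the terminal interpret it.

```shell
tmp=$(mktemp -d)
printf 'george@georgesbasement.com,\n' > "$tmp/in.txt"

# --color=always writes the escape sequences even though stdout is a file.
grep --color=always -F 'georgesbasement.com' "$tmp/in.txt" > "$tmp/colored.txt"

# Dump the first bytes: the escape character shows up as 033.
od -c "$tmp/colored.txt" | head -n 3

# The colored file is larger than the input by the size of those sequences.
wc -c < "$tmp/in.txt"
wc -c < "$tmp/colored.txt"
```

A text editor shows those bytes as literal codes; only a terminal (or a pager told to pass them through) turns them into color.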
What application lets me scroll up & down through several thousand lines of sometimes-highlighted text ? That's another question that's been repeated several times.
I have repeated the answer as many times: less.
My abject humble reply: When in doubt, read the directions. man less
describes how to scroll up, down, etc. throughout the file.
Two weeks later, output file at the ready; now using Trisquel 10
less Highlighted-4040-DomainFields-MB.txt
elicits the query:
"Highlighted-4040-DomainFields-MB.txt" may be a binary file. See it anyway?
I replied "Y" but no colors appear; just ESC codes. Scrolling up, down & sideways fully functional.
Attachment | Size |
---|---|
Highlighted-4040-DomainFields-MB.txt | 1.6 MB |
Is Highlighted-4040-DomainFields-MB.txt compressed with zstd? If so:
$ zstdcat Highlighted-4040-DomainFields-MB.txt | less
less can read .gz files, .bz2 files, .xz files, ... but not .zstd files.
Didn't use zstd ...
cat Highlighted-4040-DomainFields-MB.txt | less
still doesn't activate the color.
Man less doesn't say anything about color. Color palette in terminal is activated.
I had not noticed your attachment in your previous post. Here, on Debian 11 (but that should not make a difference), everything works as expected: 'less Highlighted-4040-DomainFields-MB_0.txt' shows the domains in red.
Even without the cat, there are escape codes but no color.
Here are the details of a typical "highlighted" email address:
george@ESC[01;31mESC[Kgeorgesbasement.comESC[mESC[K,
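For reference, in that sequence ESC[01;31m selects bold red, ESC[m resets the attributes, and ESC[K erases to the end of the line. If a plain-text copy of an already colored file is ever wanted, the codes can be stripped again. The following is a sketch that only handles the two kinds of sequence shown above (those ending in m or K), not every possible escape sequence.

```shell
# The escape character itself, produced portably with printf.
esc=$(printf '\033')

# The sample address from above, with real escape bytes in place of "ESC".
colored=$(printf 'george@\033[01;31m\033[Kgeorgesbasement.com\033[m\033[K,')

# Delete ESC[ ... m (color settings) and ESC[K (erase to end of line).
plain=$(printf '%s\n' "$colored" | sed "s/${esc}\[[0-9;]*[mK]//g")

printf '%s\n' "$plain"    # prints george@georgesbasement.com,
```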
The attached screenshots show my feeble attempt to "fix" the problem.
See also https://github.com/microsoft/vscode/issues/21423 where it's said:
ESC[31mESC[1m must set the color to red and then switch to high intensity.
but that's not our escape sequence ...
EDIT: etiona's less & Mate terminal exhibit the identical behavior to nabia's: No colors.
EDIT: See: https://phoenixnap.com/kb/less-command-in-linux where it's said:
-g Highlights the string last found using search. By default, less highlights all strings matching the last search command.
The highlight codes are all present, but less fails to convert the codes to colors of the bracketed domains.
EDIT: See: https://askubuntu.com/questions/39731/terminal-colors-not-working where
a number of fixes are enumerated; I hesitate to try any or all without expert advice.
But my terminal displays the prompt in green and URLs in blue, so the failure to process less's codes is more specific.
After being dumbfounded by the myriad of suggested changes to enable the command less to display
highlighting as discussed at length in this discussion, I tried the most simple command sequence:
less Highlighted-4040-DomainFields-MB.txt
then at the less: prompt, type -r or -R
whereupon the ESC codes are all activated.
This brings another problem to the forefront: There appears to be no way of turning off the less: terminal's word-wrap "feature."
There are options to control horizontal (i.e., left-and-right) scrolling, which are not needed when word-wrap is in effect.
How do I turn off that word-wrap annoyance ?
EDIT Attempting to answer my own question, I found https://unix.stackexchange.com/questions/475005/turning-off-word-wrap-with-less-during-paging where it's said to type -S at the less: prompt. That is supposed to toggle between chopping long lines and (invoked again) folding them, but nothing actually happens. Further on in the same link: To disable line wrapping in the terminal more generally, you can use:
setterm -linewrap off
before invoking the less: command.
Alas, at first, line-wrapping is indeed off, but scrolling horizontally with ctrl-shift->
scrolls to the right but also turns on word-wrap.
EDIT Another workaround is to zoom out with control -
which reduces the typeface size at the expense of readability.
less -S Highlighted-4040-DomainFields-MB_0.txt works as expected here: with color and truncated lines.
The display looks very pretty without the line-wrap feature, but there's none of the scrolling left-to-right
that's listed in man less which isn't needed when line-wrap is in effect. Catch-22 ?
The terminal allows zooming out sufficiently to display all of each line, but then the text is illegible.
man less says:
ESC-) or RIGHTARROW
Scroll horizontally right N characters, default half the screen width (see the -# option). If a number N is specified, it becomes the default for future RIGHTARROW and LEFTARROW commands. While the text is scrolled, it acts as though the -S option (chop lines) were in effect.
ESC-( or LEFTARROW
Scroll horizontally left N characters, default half the screen width (see the -# option). If a number N is specified, it becomes the default for future RIGHTARROW and LEFTARROW commands.
ESC-} or ^RIGHTARROW
Scroll horizontally right to show the end of the longest displayed line.
ESC-{ or ^LEFTARROW
Scroll horizontally left back to the first column.
It works here.
Magic Banana quotes the less man page and then states It works here.
Look more closely. When line-wrap is in effect, you don't need to scroll left-to-right or right-to-left
because the entire line is displayed, albeit folded-over. It's distracting when searching for patterns.
When line-wrap is turned off, all that is displayed is the front of each line; the rest is chopped off
(truncated) and horizontal scrolling to see the invisible portion is impossible.
Where are those right & left arrows ? On the T420's keyboard there's the Tab key and the > & < keys (shift+control+>).
Neither method works with line-wrap turned off, except that (shift+control+>) goes to the end of the file and
(shift+control+<) goes back to the beginning of the file. The Esc-) and Esc-( key combinations don't work at all.
The present exemplar file has all of the pertinent domains highlighted; once it starts to work satisfactorily, the
plan is to look for patterns based on the most obvious abusers of the ring buffer principle. Ring buffers are used
to smooth (buffer) inputs to popular sites, where one or two dozen domains reside at different addresses (servers).
Abused ring buffers have non-word domain names or absurd numbers of IP addresses, turned on & off frequently so as
to be less accessible.
Where are those right & left arrows ?
Looking at https://guide-images.cdn.ifixit.com/igi/rjbNXb6CrOFxbfJE.medium they are in the bottom-right corner of the keyboard.
The Esc-) and Esc-( key combinations don't work at all.
They are not combinations. You press Esc and then either ( or ).
All solved:
To open the exemplar file per Magic Banana:
less -S Highlighted-4040-DomainFields-MB_0.txt
At the less: prompt, type -R to activate the highlighting colors.
Press Esc, then ): scroll one screen view to the right; repeat: scroll some more; repeat until the last row tail is uncovered.
Press Esc, then (: scroll to the left one screen view; repeat as above to get back to the beginning of the rows.
Mouse wheel: scroll up and down to see everything at each screen view. Think of studying wallpaper, one panel at a time.
Edit
If you have sharp eyesight, place the cursor in the terminal header and press [control -] a couple of times.
That can minimize the number of [Esc, )] scroll commands needed. This last step isn't reversible.
At the less: prompt, type -R to activate the highlighting colors.
You should be able to directly call less with that option:
$ less -RS Highlighted-4040-DomainFields-MB_0.txt
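The whole workflow can also be wrapped in a small shell function, so that no colored intermediate file is needed at all. This is only a sketch: highlight_domains is a hypothetical name, and the sample files are invented. It colors and pages the matches when the output is a terminal, and writes plain text when the output is redirected.

```shell
# highlight_domains is a hypothetical helper, not an existing command.
# usage: highlight_domains DOMAIN_LIST FILE
highlight_domains() {
  if [ -t 1 ]; then
    # Interactive use: color the matches; less -R renders the colors and
    # -S chops long lines instead of wrapping them.
    grep --color=always -Ff "$1" "$2" | less -RS
  else
    # Redirected or piped: plain text, no escape sequences in the file.
    grep --color=never -Ff "$1" "$2"
  fi
}

tmp=$(mktemp -d)
printf 'gmail.com\n' > "$tmp/domains.txt"
printf '1\tjoe@gmail.com\n2\tbob@example.org\n' > "$tmp/fields.txt"

# Redirected here, so the function takes the plain-text branch.
highlight_domains "$tmp/domains.txt" "$tmp/fields.txt" > "$tmp/out.txt"
cat "$tmp/out.txt"
```

Called without redirection in a terminal, the same function opens less with the matches in color.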
Press Esc, then ): scroll one screen view to the right; repeat: scroll some more; repeat until the last row tail is uncovered.
Press Esc, then (: scroll to the left one screen view; repeat as above to get back to the beginning of the rows.
Using the arrows looks more convenient to me.
Magic Banana is correct on both counts; thank you !