Grepping issue

amenex

Faced with a set of 137 files containing the phrases "F 1016", "G 1017", "H 1018" and thirteen others, I've
set about counting the occurrences of those phrases in each file.
I started with the simple grep command
grep -ef Pattern-words.txt <(awk '(print $3,$4}' Folder/File-001.txt)
which returns nothing. The phrases are in the 3rd & 4th columns, and the files are in a subfolder.
However, when I attempt to find the patterns one-at-a-time
grep "F 1016" <(awk '(print $3,$4}' Folder/File-001.txt)
That works OK.
Enclosing the patterns in Pattern-words.txt in quotes makes no difference; nor does escaping the space.
The files have four columns; if I copy and paste the 3rd and 4th columns of those files into the pattern
file rather than just typing them from the keyboard, that doesn't help either.
Changing the patterns to "1016", "1017", "1018" ... or just 1016, 1017, 1018 ... doesn't settle the issue.
Where is the error of my ways?

amenex

What does work (awkwardly) is
grep -e "1016" -e "1017" -e "1018" <(awk '(print $3,$4}' Folder/File-001.txt)
As does
grep -e "F 1016" -e "G 1017" -e "H 1018" <(awk '(print $3,$4}' Folder/File-001.txt)
but either form would require a script with 16 x 137 = 2192 lines.

Magic Banana

grep -ef Pattern-words.txt selects the lines containing the letter f in the file Pattern-words.txt and the standard input is ignored: f is taken as the argument of -e, so Pattern-words.txt is searched as an ordinary file rather than read as a list of patterns. That is why I believe your grep -ef Pattern-words.txt <(awk '(print $3,$4}' Folder/File-001.txt) (with a syntax error in the AWK program: the opening parenthesis should be an opening brace) should be:
$ awk '{ print $3, $4 }' Folder/File-001.txt | grep -f Pattern-words.txt
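The difference between -ef and -f can be checked on throwaway data. The following is only a minimal sketch: the file names patterns.txt and data.txt and their contents are made up for illustration, not taken from the real files.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'F 1016' 'G 1017' > "$tmp/patterns.txt"
printf '%s\n' 'a b F 1016' 'c d X 0000' 'e f G 1017' > "$tmp/data.txt"

# "-ef" is parsed as "-e f": the pattern is the letter f, and patterns.txt
# is just another file to search. Only the line containing a lowercase f
# is selected, prefixed with the name of the file it was found in.
wrong=$(grep -ef "$tmp/patterns.txt" "$tmp/data.txt")

# "-f FILE" reads one pattern per line from FILE.
right=$(awk '{ print $3, $4 }' "$tmp/data.txt" | grep -f "$tmp/patterns.txt")

printf 'with -ef:\n%s\n' "$wrong"
printf 'with -f:\n%s\n' "$right"
rm -r "$tmp"
```

With -ef, the only hit is the data line that happens to contain a lowercase f; with -f, the two phrases listed in the pattern file are selected.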

As usual, without an example of input (showing all possible cases) and the desired output, it is hard to understand what you actually want.

> counting the numbers of those phrases

The total number (in each file) or the number for each phrase? If the same line contains two phrases, does it count as one or as two?
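The two readings lead to different commands. A small sketch, assuming space-separated columns and using made-up sample data (-o is supported by GNU and BSD grep but is not in POSIX):

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'F 1016' 'G 1017' > "$tmp/patterns.txt"
printf '%s\n' 'x y F 1016' 'x y F 1016' 'x y G 1017' 'x y Z 9999' > "$tmp/data.txt"

# Total: -c counts matching lines, so a line holding two phrases counts once.
total=$(cut -d ' ' -f 3,4 "$tmp/data.txt" | grep -c -f "$tmp/patterns.txt")

# Per phrase: -o prints every match on its own line, so each occurrence
# is counted separately by sort | uniq -c.
per_phrase=$(cut -d ' ' -f 3,4 "$tmp/data.txt" |
  grep -o -f "$tmp/patterns.txt" | sort | uniq -c)

printf 'total: %s\n' "$total"
printf '%s\n' "$per_phrase"
rm -r "$tmp"
```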

> The phrases are in the 3rd & 4th columns

To be counted, must the phrase be the whole value in the column or only part of it?
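The answer matters because grep matches substrings by default. A sketch on made-up data: -F treats each pattern as a fixed string, and -x additionally requires the pattern to match the whole line.

```shell
#!/bin/sh
# Hypothetical sample data: the second line only contains the phrase as a part.
tmp=$(mktemp -d)
printf '%s\n' 'F 1016' > "$tmp/patterns.txt"
printf '%s\n' 'F 1016' 'FF 10160' > "$tmp/columns.txt"

# Substring match: both lines are counted.
partial=$(grep -c -F -f "$tmp/patterns.txt" "$tmp/columns.txt")

# Whole-line match: only the exact phrase is counted.
whole=$(grep -c -F -x -f "$tmp/patterns.txt" "$tmp/columns.txt")

printf 'partial: %s, whole: %s\n' "$partial" "$whole"
rm -r "$tmp"
```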

amenex

Magic Banana kindly discussed my grepping efforts ...
> grep -ef Pattern-words.txt selects the lines containing the letter f in the file Pattern-words.txt and the standard input is ignored.

Man grep led me astray:
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. If this option is used multiple times or is combined with the -f (--file) option, search for all patterns given.

So I chose to use (successfully, it seems)
grep -ef "pattern" Target_file ...
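Read literally, the man page excerpt has -e and -f each taking its own argument, so combining them means writing both options out in full. A minimal sketch of that combination, with made-up file names and contents:

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'F 1016' 'G 1017' > "$tmp/patterns.txt"
printf '%s\n' 'F 1016' 'G 1017' 'V 1101' 'Z 9999' > "$tmp/columns.txt"

# -f takes the pattern file, -e takes one extra pattern;
# grep searches for all of them in columns.txt.
combined=$(grep -f "$tmp/patterns.txt" -e 'V 1101' "$tmp/columns.txt")

printf '%s\n' "$combined"
rm -r "$tmp"
```

Here three of the four lines are selected: two through the pattern file and one through -e.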
> That is why I believe your grep -ef Pattern-words.txt <(awk '(print $3,$4}' Folder/File-001.txt) (with a syntax error in the AWK program: the opening parenthesis should be an opening brace) should be:
> $ awk '{ print $3, $4 }' Folder/File-001.txt | grep -f Pattern-words.txt

Alas, still not working; even the following real-world example:
awk '{ print $3,$4 }' AAspam-citi.com_.txt | grep -f Pattern-Words.txt

> As usual, without an example of input (showing all possible cases) and the desired output,
> it is hard to understand what you actually want.

Here's what does work and what I chose to do with the actual files (there are 136 more):
awk '{ print $3,$4 }' AAspam-citi.com_.txt | grep -e "F 1016" -e "G 1017" -e "H 1018" -e "I 1019" -e "J 1020" -e "K 1021" -e "L 1022" -e "M 1023" -e "N 1024" -e "O 1025" -e "P 1026" -e "Q 1027" -e "S 1029" -e "T 1030" -e "U 1031" -e "V 1101"

> counting the numbers of those phrases

What I was actually doing was determining which servers were changing their settings by checking each day's data;
the last numbers in the patterns are date codes, starting with October 16 [2021]. I missed the 28th.

> The total number (in each file) or the number for each phrase? If the same line contains two phrases, does it count as one or as two?

The code was meant to count them as pairs, one count for each instance of the pair. That's what it does count.

> The phrases are in the 3rd & 4th columns
> To be counted, must the phrase be the whole value in the column or only part of it?

It doesn't matter whether or not the patterns include the script code (F through V, excluding R); the outputs
include the entirety of each line in the target file. I was guarding against unintended inclusions.

In the end only NXDOMAIN, SERVFAIL, and two other files exhibited [intentional?] daily changes, contrasting
with more extensive data collected another way. I chose the spam files flagged by my ISP's "Box Trapper" code.
NXDOMAIN and SERVFAIL change because the non-multi-addressed server domains were being turned on and off.

Attachment               Size
AAspam-citi.com_.txt     7.42 KB
Pattern-Words.txt        144 bytes

Magic Banana

> grep -ef "pattern" Target_file ...

... selects the lines containing the letter f in the files pattern and Target_file.

> Alas, still not working

That is because Pattern-Words.txt contains quotation marks, which grep searches for literally. After removing them, you can use cut (simpler than awk) to select the third and fourth columns, grep the desired content, sort it and count with uniq -c:
$ sed -i 's/"//g' Pattern-Words.txt
$ cut -d ' ' -f 3,4 AAspam-citi.com_.txt | grep -f Pattern-Words.txt | sort | uniq -c
9 F 1016
9 G 1017
9 H 1018
9 I 1019
9 J 1020
9 K 1021
9 L 1022
9 M 1023
9 N 1024
9 O 1025
9 P 1026
9 Q 1027
9 S 1029
9 T 1030
9 U 1031
9 V 1101

Is that what you want? You still have not provided an expected output...
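The same pipeline can be wrapped in a loop so that no per-file command has to be written by hand. This is only a sketch: the folder layout and file contents below are made up, and -F is added on the assumption that the patterns are plain strings rather than regular expressions.

```shell
#!/bin/sh
# Hypothetical sample folder with two small files.
tmp=$(mktemp -d)
mkdir "$tmp/Folder"
printf '%s\n' 'F 1016' 'G 1017' > "$tmp/Pattern-Words.txt"
printf '%s\n' 'a b F 1016' 'a b G 1017' 'a b F 1016' > "$tmp/Folder/File-001.txt"
printf '%s\n' 'a b G 1017' 'a b Z 9999' > "$tmp/Folder/File-002.txt"

# For every file: print its name, then the per-phrase counts.
report=$(
  for file in "$tmp"/Folder/*.txt; do
    printf '%s\n' "${file##*/}"
    cut -d ' ' -f 3,4 "$file" |
      grep -F -f "$tmp/Pattern-Words.txt" | sort | uniq -c
  done
)

printf '%s\n' "$report"
rm -r "$tmp"
```

Pointing the glob at the real folder and the real Pattern-Words.txt (with the quotation marks removed) would produce one block of counts per file.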

amenex

Magic Banana wrote in reply to my lament:
> Alas, still not working
> That is because Pattern-Words.txt contains quotation marks, which grep searches for literally.
Removing the quotes in this command:
awk '{print $3,$4}' PTR-Spam/AAspam-citi.com_.txt | grep -e 1016 -e 1017 -e 1018 -e 1019 -e 1020 -e 1021 -e 1022 -e 1023 -e 1024 -e 1025 -e 1026 -e 1027 -e 1029 -e 1030 -e 1031 -e 1101
gives the same result as with the quotes in my previous post.
Quotes aren't the issue until they're used in the pattern file.

Yes; Magic Banana's result is what I expected. In my own scripts, collecting "citi.com" brought with it all
the subsidiary accounts ... but they were identical (all nine, from different servers) every day, which
was the case among the other 134 of 136 PTRs. NXDOMAIN and SERVFAIL are different stories ...

Magic Banana

> Quotes aren't the issue until they're used in the pattern file.

That is because, on the command line, the shell interprets the quotation marks and strips them from the arguments grep receives.
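A quick way to see the difference, sketched with made-up file names: a child process reports the length of the argument it receives, and the same phrase is then matched with a quoted and an unquoted pattern file.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# The shell removes the quotes before the child process starts,
# so the argument it receives is 6 characters long.
arg_len=$(sh -c 'printf "%s" "${#1}"' sh "F 1016")

# Nothing strips quotes inside a file: the pattern line keeps them,
# and grep looks for the quotation marks in the data.
printf '%s\n' '"F 1016"' > "$tmp/quoted.txt"
printf '%s\n' 'F 1016' > "$tmp/data.txt"
with_quotes=$(grep -c -f "$tmp/quoted.txt" "$tmp/data.txt")

# After removing the quotation marks, the pattern matches.
sed 's/"//g' "$tmp/quoted.txt" > "$tmp/unquoted.txt"
without_quotes=$(grep -c -f "$tmp/unquoted.txt" "$tmp/data.txt")

printf 'argument length: %s\n' "$arg_len"
printf 'matches with quotes: %s, without: %s\n' "$with_quotes" "$without_quotes"
rm -r "$tmp"
```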