Odd behavior of awk applied to nmap data

11 replies [Last post]
amenex
Offline
Joined: 01/03/2015

My nmap search ending with -oG - produced the snippet of data called TrisquelQuestion.txt,
which consists of a column of the IPv4 addresses under search and a second column of resolved PTR names,
very often "empty".

The usual awk script that I've been using to grab the good data in the second column
produces an odd result; read on.

The following script reproduces a file outwardly identical to the input file, TrisquelQuestion.txt:
awk '{print $0}' TrisquelQuestion.txt > Temp-12132020C1.txt

Here's what ought to have selected only the PTR data in the second column:
awk '{print $2}' TrisquelQuestion.txt > Temp-12132020C2.txt
but there are only blanks where the PTRs ought to be ...

Checking whether or not awk can correct whatever's wrong with those two columns:
awk '{print $2}' Temp-12132020C1.txt > Temp-12132020C22.txt
Temp-12132020C22.txt is identical to Temp-12132020C2.txt.

The following command selects only items containing letters of the alphabet (from a source I couldn't trace):
LC_ALL=C grep '[a-z]' <(awk '{print $0}' TrisquelQuestion.txt) > Temp-12132020C3.txt

The join command reunites the PTRs with their IPv4 addresses, which was the intent of the exercise:
join -a 1 -1 1 -2 2 <(sort Temp-12132020C3.txt) <(sort -k 2,2 TrisquelQuestion.txt) > Temp-12132020C4.txt

What is it about nmap or awk that is leading me astray?

Attachment              Size
TrisquelQuestion.txt    2.8 KB
Temp-12132020C1.txt     2.8 KB
Temp-12132020C2.txt     937 bytes
Temp-12132020C3.txt     2 KB
Temp-12132020C4.txt     2 KB
Temp-12132020C22.txt    937 bytes
Magic Banana

Offline
Joined: 07/24/2010

a column of the IPv4 addresses under search and a second column of resolved PTR names, very often "empty"

That is not your input. You have records with one single field and records with two fields. The records with two fields have the IP address first and the PTR name second. When you try, with $2, to access the second field of a record with one single field, $2 is the empty string. There is nothing "odd" about that: the second field does not exist.
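
A minimal illustration of that, on two made-up records (one field, then two fields); the first output line is empty because that record has no second field:
$ printf '66.249.64.1\n66.249.64.2 crawl.example.net\n' | awk '{ print $2 }'

crawl.example.net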

If the goal is to keep only the records with two fields, test the NF (which means "Number of Fields") variable:
$ awk 'NF == 2' TrisquelQuestion.txt
When, as above, an AWK program consists only of a condition, the default action '{ print }' is assumed. And calling the print function without any argument is the same as calling it with the whole record, $0, as its argument.
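
For instance, on a throwaway two-line input, the condition-only program and its explicit equivalent print the same record:
$ printf 'a\nb c\n' | awk 'NF == 2'
b c
$ printf 'a\nb c\n' | awk 'NF == 2 { print $0 }'
b c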

If you want an ordered output, sort it with the sort command:
$ awk 'NF == 2' TrisquelQuestion.txt | sort

In what you wrote:

  • "awk '{print $0}' TrisquelQuestion.txt > Temp-12132020C1.txt" is a uselessly complicated and inefficient way to do "cp TrisquelQuestion.txt Temp-12132020C1.txt";
  • "LC_ALL=C grep '[a-z]' <(awk '{print $0}' TrisquelQuestion.txt)" is a uselessly complicated and inefficient way to do "grep [a-z] TrisquelQuestion.txt" and, in general, it does not do what you want: you would miss the lines with PTR names containing no lower-case letter;
  • "join -a 1 -1 1 -2 2 <(sort Temp-12132020C3.txt) <(sort -k 2,2 TrisquelQuestion.txt)" makes little sense: Temp-12132020C3.txt contains all the information you want (no need to join it with anything else), just unsorted.

EDIT: if your actual input is large and if you care about performance, it is worth avoiding AWK. You can keep the lines with more than one field by grepping ' .' (a space followed by some character, hence not the end of the line):
$ grep ' .' TrisquelQuestion.txt
Timing on my system with 10,000 TrisquelQuestion.txt to process:
$ time -p awk 'NF == 2' $(yes TrisquelQuestion.txt | head -10000) > /dev/null
real 0.34
user 0.29
sys 0.03
$ time -p grep ' .' $(yes TrisquelQuestion.txt | head -10000) > /dev/null
real 0.04
user 0.01
sys 0.02

amenex
Offline
Iscritto: 01/03/2015

Here comes the ultimate in efficiency:
grep ' .' TrisquelQuestion.txt

That is readily put to use to test whether grep catches both alphabetic cases:
grep ' .' TrisquelQuestionSalted.txt > Temp-12132020TQ9S.txt
where my understanding is that the space detects the second column before the dot starts looking for letters, not numbers.

There's a catch:
awk '{print $1,$2}' TrisquelQuestionSalted.txt | grep ' .' '-' > Temp-12132020TQ10S.txt
vs.
awk '{print $1"\t"$2}' TrisquelQuestionSalted.txt | grep ' .' '-' > Temp-12132020TQ11S.txt

What does one do with a tabbed pair of columns?

Magic Banana

Offline
Joined: 07/24/2010

As I wrote: grep ' .' selects the lines with "a space followed by some character, hence not the end of the line". By some character, I mean "any single character", including letters, digits, spaces, punctuation, etc.
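
For example, on made-up lines, grep ' .' keeps every line with a space followed by something, whether that something is a digit, punctuation or a letter, and drops the one-field line:
$ printf '1.2.3.4\n5.6.7.8 9\n9.9.9.9 !\nhost name\n' | grep ' .'
5.6.7.8 9
9.9.9.9 !
host name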

If you want a tabulation instead of the space, you can substitute the space with a copy-pasted tabulation, but that is horrible. Prefer:
$ grep "$(printf \\t)." TrisquelQuestion.txt
And to select lines with either horizontal whitespace character (space or tab), followed by any single character:
$ grep '[[:blank:]].' TrisquelQuestion.txt
Nevertheless, in that situation, it would certainly be better to normalize the delimiter with tr:
$ tr '\t' ' ' < TrisquelQuestion.txt | grep ' .'

To select specific fields, without reordering them, you can use the cut command. It is faster to type and more efficient than awk. It is here also faster to use cut after (rather than before) grep, which greatly reduces the number of lines cut processes (you wrote that there is "very often" one single field). That gives, for space-separated fields:
$ grep ' .' TrisquelQuestionSalted.txt | cut -d ' ' -f -2
For tab-separated fields (cut's default):
$ grep "$(printf \\t)." TrisquelQuestionSalted.txt | cut -f -2
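
As a quick sanity check of what -f -2 means (all fields up to the second), on a throwaway line:
$ printf 'one two three\n' | cut -d ' ' -f -2
one two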

Also, I repeat: "There is no need to specify '-' as the sole file: given no file in argument, essentially any text processing command processes the standard input, -".
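
For instance, both of the following print the same thing; the trailing - adds nothing:
$ printf 'a b\nc\n' | grep ' .'
a b
$ printf 'a b\nc\n' | grep ' .' -
a b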

amenex
Offline
Joined: 01/03/2015

As usual, Magic Banana's analysis is richly informative.

The first script powerfully does what has driven me nuts for months ...
awk 'NF == 2' TrisquelQuestion.txt

I'll combine it with additional steps to clean up the general output of my varied nmap scripts;
sed 's/92.242.140.21//g; s/No_DNS//g' TrisquelQuestionSalted.txt | awk 'NF == 2' '-' > Temp-12132020C8.txt
where I salted the PTR field with some Barefruit Error Handling addresses and No_DNS's, plus a NO_DNS that sed shouldn't catch.

Here's an unintended result:
grep [a-z] TrisquelQuestionSalted.txt > Temp-12132020C7.txt
grep [a-z] didn't catch that all-caps NO_DNS.
There's one No_DNS and one NO_DNS in TrisquelQuestionSalted.txt; also, two 92.242.140.21's.

That join statement is meant to select _only_ those PTRs that are actually included in the original database;
it's redundant in the present instance, where I should have gone back to the Current Visitor data, within which it's
hard to pick out a selection of finite size that has any chance of matching any of the PTRs in TrisquelQuestion.txt.
The IPv6 searches from oh-so-long-ago were in that shotgun category.

Magic Banana

Offline
Joined: 07/24/2010

The first script powerfully does what has driven me nuts for months ...

Section 1.3, entitled "Some Simple Examples", of GNU AWK's manual includes an example to "Print every line that has at least one field":
awk 'NF > 0' data
https://www.gnu.org/software/gawk/manual/

If you had spent a few minutes reading the documentation, you would certainly have changed the 0 in the example into a 1 and not been "driven nuts for months".

sed 's/92.242.140.21//g; s/No_DNS//g' TrisquelQuestionSalted.txt | awk 'NF == 2' '-' > Temp-12132020C8.txt

There is no need to specify '-' as the sole file: given no file in argument, essentially any text processing command processes the standard input, -.

grep [a-z] didn't catch that all-caps NO_DNS.

Indeed: a-z only contains the lower-case letters, from a to z, as I explained above (my second point).
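
A two-line demonstration on the salted names; a bracket expression such as '[[:alpha:]]' (or '[a-zA-Z]') matches both cases:
$ printf 'No_DNS\nNO_DNS\n' | grep '[a-z]'
No_DNS
$ printf 'No_DNS\nNO_DNS\n' | grep '[[:alpha:]]'
No_DNS
NO_DNS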

amenex
Offline
Joined: 01/03/2015

Putting Magic Banana's several thoughtful suggestions/solutions to the test, I processed the half-dozen
nmap search results that were in progress at the beginning of this discussion in several different ways.

First, the scripts that produced the same result:
time cat GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92.242.140.21//g; s/No_DNS//g' '-' | awk 'NF == 2' '-' > Temp-12142020TQ01.txt
Took 5.30 seconds real time; 12.4 MB; 304181 rows.
time cat GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92.242.140.21//g; s/No_DNS//g' '-' | grep ' .' '-' | cut -d ' ' -f -2 > Temp-12142020TQ06.txt
Took 5.40 seconds real time; 12.4 MB; 304181 rows.
time cat GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92.242.140.21//g; s/No_DNS//g' '-' | grep ' .' '-' > Temp-12142020TQ02.txt
Took 5.44 seconds real time; 12.4 MB; 304181 rows.
time cat GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92.242.140.21//g; s/No_DNS//g' '-' | grep '[[:blank:]].' > Temp-12142020TQ05.txt
Took 20.1 seconds real time; 12.4 MB; 304181 rows.

The nmap data weren't tab-delimited, and the first three of these scripts differed from one another only marginally in speed.
Gathering the 178 MB of the original six sets of results took 0.2 seconds.

Second, scripts that leave something to be desired:
time cat GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92.242.140.21//g; s/No_DNS//g' '-' | grep [a-z] > Temp-12142020TQ03.txt
Took 19.1 seconds real time; 12.3 MB; 302668 rows.
time cat GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92.242.140.21//g; s/No_DNS//g' '-' | grep [A-Z] > Temp-12142020TQ04.txt
Took 18.7 seconds real time; 516kB; 11348 rows.
time cat Temp-12142020TQ03.txt Temp-12142020TQ04.txt | sort -u > Temp-12142020TQ07.txt
Took 0.945 seconds real time; 12.4 MB; 303549 rows.

In other words, downright ugly and thoroughly unsatisfactory. And I've used grep [a-z] in the past.

Conclusion: "awk 'NF == 2'" is best, and it doesn't care what the field separator is. Excellent lesson!

I'm not including even snippets of the original scan results because about 99% of them are innocent bystanders.
We're not killing oysters to find a few pearls ...

Now the non-redundant use of the join command comes into play to see how many hostnames in the original Current Visitors are resolved.

Magic Banana

Offline
Joined: 07/24/2010

cat GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92.242.140.21//g; s/No_DNS//g' '-' | ...

For the third time: '-' is useless. In fact, using cat is useless here. Write:
sed 's/92.242.140.21//g; s/No_DNS//g' GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | ...

More importantly, and also for the third time: '.', in a regular expression, is "any single character". As a consequence, your first sed substitution would erase "920242a140B21", for instance. I very much doubt it is what you want. You want '\.' for a literal dot.
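
A quick demonstration of the difference on that made-up string; the unescaped pattern erases it entirely (the first command prints an empty line), while the escaped one leaves it alone:
$ echo 920242a140B21 | sed 's/92.242.140.21//g'

$ echo 920242a140B21 | sed 's/92\.242\.140\.21//g'
920242a140B21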

Conclusion: "awk 'NF == 2'" is best

Actually, I am pretty sure the sed substitutions are the bottleneck, not what comes after. As far as I understand, you would get the same output (verify!) more rapidly (because the command line starts with the removal of most of the lines) if you commute the two commands. For instance, try:
$ grep ' .' GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92\.242\.140\.21//g; s/No_DNS//g'

it doesn't care what the field separator is.

By default, runs of spaces and/or tabs and/or newlines separate what awk understands as fields. But the separator can be any regular expression, specified with the option -F or by redefining the FS (which means "Field Separator") variable, possibly in the middle of the program.
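
For instance, with a colon as the field separator on a throwaway line:
$ printf 'a:b:c\n' | awk -F : '{ print $2 }'
b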

amenex
Offline
Joined: 01/03/2015

Quoting Magic Banana:
cat GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92.242.140.21//g; s/No_DNS//g' '-' | ...

For the third time: '-' is useless. In fact, using cat is useless here. Write:
sed 's/92.242.140.21//g; s/No_DNS//g' GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | ...

That's a three-character reduction in the length of the script, and it's better looking and more logical.

More importantly, and also for the third time: '.', in a regular expression, is "any single character".
As a consequence, your first sed substitution would erase "920242a140B21", for instance.

I tried it at home on the 178 MB data set, where my mistake does no harm ... There's another 400 MB to be done later.

I very much doubt it is what you want. You want '\.' for a literal dot.

To bring the point home with proper emphasis:
sed 's/92.242.140.21//g' is wrong; sed 's/92\.242\.140\.21//g' is correct

Magic Banana suggests (quite constructively) that I commute the two commands... For instance, try:
grep ' .' GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92\.242\.140\.21//g; s/No_DNS//g'
That's a five-fold improvement in speed, but grep adds 12 MB of references to the source file.
awk 'NF == 2' GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92\.242\.140\.21//g; s/No_DNS//g'
is only marginally slower without leaving any tramp text to be cleaned up afterwards:
grep ' .' GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | sed 's/92\.242\.140\.21//g; s/No_DNS//g' | sed 's/\:/ /g' | awk '{print $2,$3}'
Sed and awk accomplish this cleanup without slowing the script execution at all.

Thank you for your patient and helpful contributions to this process!
George Langford

Magic Banana

Offline
Joined: 07/24/2010

[grep ' .'] adds 12 MB of references to the source file. (...) [awk 'NF == 2'] is only marginally slower without leaving any tramp text to be cleaned up afterwards

I do not understand how that is possible, unless there are lines with more than two fields (which awk 'NF > 2' would list).

If the lines containing "92.242.140.21" or "No_DNS" must never be output, simply remove them with grep -ve '92\.242\.140\.21' -e No_DNS (not sed).

sed 's/\:/ /g'

Use instead the much simpler and more efficient tr command:
tr : ' '

awk '{print $2,$3}'

Use instead the much simpler and more efficient cut command:
cut -d ' ' -f 2,3
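
On a sample grep-prefixed line (the file name here is hypothetical), both pipelines produce the same two columns:
$ printf 'file.txt:host.example 1.2.3.4\n' | sed 's/\:/ /g' | awk '{print $2,$3}'
host.example 1.2.3.4
$ printf 'file.txt:host.example 1.2.3.4\n' | tr : ' ' | cut -d ' ' -f 2,3
host.example 1.2.3.4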

amenex
Offline
Joined: 01/03/2015

Magic Banana was put off by my grep results:
[grep ' .'] adds 12 MB of references to the source file.
(...) [awk 'NF == 2'] is only marginally slower without leaving any tramp text to be cleaned up afterwards

I do not understand how that is possible, unless there are lines with more than two fields (which awk 'NF > 2' would list).

Here's a snippet of one of those grep command results (before the removal of the extraneous text):
GLU.GB.1711-2011.04.UnRConf.oGnMap.txt:unknown.hwng.net 209.197.16.124
GLU.GB.1711-2011.04.UnRConf.oGnMap.txt:unknown.hwng.net 209.197.16.125
GLU.GB.1711-2011.04.UnRConf.oGnMap.txt:unknown.hwng.net 209.197.16.126

When grep is applied to a number of files, the filename is helpfully placed adjacent to the first column of the results,
using a colon as the field separator. In the snippet, the filename is:
GLU.GB.1711-2011.04.UnRConf.oGnMap.txt
I used sed (sed 's/\:/ /g') to remove the colon and awk (awk '{print $2,$3}')
to separate the unnecessary filename from the output file at no cost of time.

Magic Banana

Offline
Joined: 07/24/2010

Either command (grep ' .' or awk 'NF == 2') selects the three lines you show.

When grep is applied to a number of files, the filename is helpfully placed adjacent to the first column of the results, using a colon as the field separator.

That is the default behavior with several files. To have it with one single file, use option -H. To not have it with several files, use option -h.

I used sed (sed 's/\:/ /g') to remove the colon and awk (awk '{print $2,$3}') to separate the unnecessary filename from the output file at no cost of time.

Just use grep -h. Your handcrafted fixes create unnecessary problems. I insist: sed 's/\:/ /g' does the same as tr : ' ', just less efficiently. It translates *all* colons into spaces, including colons that could be somewhere else on the lines... And I insist: awk '{print $2,$3}' does here the same as cut -d ' ' -f 2,3, just less efficiently. The field delimiter can be specified through options (for awk, it can be any regular expression): with the colon as delimiter, awk -F : '{ print $2 }' or, more clearly and efficiently, cut -d : -f 2.
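
Putting the pieces of this thread together, here is a sketch of the whole cleanup under the assumptions above (the GLU.GB file names are the ones already used; the output file name is just a placeholder):
$ grep -h ' .' GLU.GB.1711-2011.0?.UnRConf.oGnMap.txt | grep -v -e '92\.242\.140\.21' -e No_DNS > cleaned.txt
grep -h keeps the two-field lines without prefixing file names, and the second grep drops the salted lines entirely, as suggested above.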

Just use grep -h. Your handcrafted fixes create unnecessary problems. I insist: sed 's/\:/ /g' does the same as tr : ' ', just less efficiently. It translates *all* colons into spaces. Including colons that could be somewhere else on the lines... And I insist: awk '{print $2,$3}' does here the same as cut -d ' ' -f 2,3, just less efficiently. The field delimiter can be specified through options (for awk, it can be any regular expression): awk -F : '{ print $2, $3 }' or, more clearly and efficiently, cut -d : -f 2,3.