Syntax trouble with awk while trying to process a list of PTR domains with subdomains

4 replies [Last post]
amenex
Offline
Joined: 01/03/2015

While (again) looking for duplicate hostnames, I'm trying to make a two-column list of all the domain names in a CIDR/19 block by processing the text file of the list, which includes three types of record:

1. IPv4 address with no PTR name;
2. IPv4 address with only a PTR name; and
3. IPv4 address with a PTR name, followed by a list of tab-delimited domain names.

My intention is to make this array into a two-column list that associates all the subdomains (i.e., A records) of the PTR with the PTR's IPv4 address, which is what nslookup or dig produces for each of the A records under that PTR.

I've gotten this far with awk:

time awk 'NR > 2 { print $1 "\t" $2 "\t" NF } { print $1 "\t" $3}' SourceFile.txt > OutputFile.txt

SourceFile.txt comes from a LibreOffice Calc spreadsheet in which all the A records of a PTR are in separate cells within each row of PTR data.
When I "select all" and produce SourceFile.txt by pasting into Leafpad, each A record is followed by a comma and then a tab, except the last one in the row, which has neither a comma nor a tab but [probably] has an end-of-line or carriage-return character. The first entry in each row is the IPv4 address of the row, and the second entry is the PTR record. Incidentally, a number of the PTR records are duplicated, but their associated lists of A records are different, and nslookup returns the correct IPv4 addresses of the domains in each copy of that PTR name. Almost none of the duplicated PTR records are actually live webpages. It's the A records that I want to check.

NF is the number of fields in each row; the fields are delimited by tab characters.

The awk script above correctly prints the IPv4 address and PTR record for each row of the SourceFile in one line, followed by a second line containing the IPv4 address of the PTR and the A record of the first subdomain in the PTR's list (when there is one) which is basically the pattern that I'm trying to achieve. I can get any other A record to appear in this list by changing the second print statement's concatenated hostname index, so it's not the leftover commas that are wrecking the syntax; I can remove those in the Leafpad file.

Here's the awk attempt that's not doing its job:

time awk -f SourceFile.txt 'NR > 2 { x=2 } BEGIN {while (x++<=NF) print $1 "\t" $x } ' ===> awk: 1: unexpected character '.'

No processor time is used, so the error(s) lie somewhere in my syntax. This portion of the script is meant to be appended to the first script above so as to print additional lines for all the A records in each row of the SourceFile.

It's the use of the loop statements "while ..." and "do ... while" that is tripping me up. I get error messages like "awk: 1: unexpected character '.'" that may emanate from the .txt in the SourceFile's extension, but which aren't going away.

Thanks,
George Langford

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Like last time: an excerpt of the input and the related expected output would help a lot to understand what you are trying to achieve.

Your first AWK program always prints the first and third fields but, before every such line except the first two, it prints the first field, the second field and the number of fields. As a consequence, the output has twice as many lines (minus two) as the input. Is that really what you want?

Your second AWK program makes no sense. First of all, the option -f does not set the input file (which should come after the program, as in your first command) but a file containing an AWK program (which therefore need not be on the command line). As a consequence, the error you get concerns whatever is in SourceFile.txt, which AWK tries to interpret as a program.
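To make the distinction concrete, here is a minimal sketch. The file names count.awk and sample.txt and the one-line input are hypothetical, made up for illustration; only the placement of -f, the program and the input file matters.

```shell
# Hypothetical input: one tab-separated row (address, PTR name, two A records).
tmp=$(mktemp -d)
printf '192.0.2.1\tptr.example\ta1.example\ta2.example\n' > "$tmp/sample.txt"

# An AWK program saved in its own file (hypothetical name).
cat > "$tmp/count.awk" <<'EOF'
{ print $1 "\t" NF }
EOF

# -f names the PROGRAM file; the input file still comes last.
from_file=$(awk -f "$tmp/count.awk" "$tmp/sample.txt")

# Same program given on the command line, with the input file after it.
inline=$(awk '{ print $1 "\t" NF }' "$tmp/sample.txt")

printf '%s\n%s\n' "$from_file" "$inline"
rm -r "$tmp"
```

Both commands print the same line: the address followed by the number of fields in the row.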

But the program on the command line (which is not interpreted because -f was used) makes no sense either. The condition BEGIN is only satisfied once, before any input is read. In the related action (what is between braces after BEGIN), x is automatically initialized to 0, NF is 0 and $0, $1, $2, ... are all "" (the empty string) because, as I have just written, no input has been read. Also, but that is a more technical issue, your loop actually accesses $(NF+1) (past the last field, although, again, there is no field when the BEGIN condition is satisfied). That is because you post-increment rather than pre-increment x. Finally, for every line starting with the third (the condition being NR > 2), you repeatedly set x to 2 (and never read it).
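The off-by-one caused by the post-increment can be seen on a single line of input. This is only a sketch: the three-field row is invented, and both loops run in a main action (not in BEGIN) so that there is a record to read.

```shell
# Hypothetical row with three tab-separated fields.
row='192.0.2.1\tptr.example\ta1.example\n'

# Post-increment: x is compared first and incremented afterwards, so the body
# still runs once with x = NF + 1 and prints an empty field.
post=$(printf "$row" | awk '{ x = 1; while (x++ <= NF) printf "[%s]", $x; print "" }')

# Pre-increment: i is incremented first and compared afterwards, so the loop
# stops right after $NF.
pre=$(printf "$row" | awk '{ i = 1; while (++i <= NF) printf "[%s]", $i; print "" }')

echo "$post"   # [ptr.example][a1.example][]
echo "$pre"    # [ptr.example][a1.example]
```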

Maybe I am over-interpreting your buggy second awk command (I assume "} BEGIN {" should have been ";"), but it looks like you want, for every line starting with the third one, to print on separate lines:

  • the first field followed by the second;
  • the first field followed by the third;
  • ...
  • the first field followed by the last.

If so, here is a proper command, using a "for" loop whose (pre)incremented variable ("i" is more common than "x", for integers) is initialized inside it (a "while" would be OK too, with "i = 1;" written before it):
$ awk 'NR > 2 { for (i = 1; ++i <= NF; ) print $1 "\t" $i }' SourceFile.txt
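As a sketch of the equivalence mentioned above, the same loop can be written with "while". The four-line input is hypothetical: two lines skipped by NR > 2, one row with a PTR and two A records, and one row with an address only.

```shell
# Hypothetical input; the first two lines are skipped by the condition NR > 2.
sample='skipped line 1\nskipped line 2\n192.0.2.1\tptr.example\ta1.example\ta2.example\n192.0.2.2\n'

# "for" version, as in the command above.
for_out=$(printf "$sample" | awk 'NR > 2 { for (i = 1; ++i <= NF; ) print $1 "\t" $i }')

# "while" version: i is initialized before the loop instead of inside it.
while_out=$(printf "$sample" | awk 'NR > 2 { i = 1; while (++i <= NF) print $1 "\t" $i }')

echo "$for_out"
```

A row holding only an address produces no output, because the loop body never runs when NF is 1.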

amenex
Offline
Joined: 01/03/2015

OK; the thoroughly salted and partially obfuscated source file is attached (SourceFile.txt):
SourceFile.txt ===> salted with one duplicated PTR and three duplicated A domains (fishandaman.example, bedtastic.example, and thsalesassociates.example).
===> Ended up with another duplicate after I obfuscated the hostnames: bristol-fire.example.

I adjusted Magic Banana's suggested (and untested !) command:
>> awk 'NR > 2 { for (i = 1; ++i <= NF; ) print $1 "\t" $i }' SourceFile.txt

in order to add the identifications of PTR vs. A records and to force the command to recognize SourceFile.txt correctly. I changed the initial value of the index (i) to 2, as it is immediately incremented by one after each new line of the Source File is read ... after I've already printed the PTR record. And I removed the NR > 2 as that is superfluous while no comparisons are being made:

>> time awk '{ print $2 "\t" $1 "\t" "PTR" } { for (i = 2; ++i <= NF; ) print $i "\t" $1 "\t" "A"}' 'SourceFile.txt' > Source-Hostnames.txt

The resulting three-column hostname list has to be sorted before ferreting out the duplicates:

>> sort Source-Hostnames.txt > Sorted-Hostnames.txt

The following command should catch all the duplicates; I've added the single quotes around Sorted-Hostnames.txt and the >= in NR >= 2:

time awk 'NR >= 2 { print $2, $1, $3 }' 'Sorted-Hostnames.txt' | uniq -Df 1 | awk '{ print $2 "\t" $1 "\t" $3}' > Source-Duplicates.txt

With the following result:

> cat Source-Duplicates.txt
> bedtastic.example 998.997.996.41 A
> bedtastic.example 998.997.996.6 A
> bristol-fire.example 998.997.996.17 A
> bristol-fire.example 998.997.996.7 A
> fishandaman.example 998.997.996.15 A
> fishandaman.example 998.997.996.9 A
> thsalesassociates.example 998.997.996.41 A
> thsalesassociates.example 998.997.996.6 A
> vps24271.serverexample.calm 998.997.996.41 PTR
> vps24271.serverexample.calm 998.997.996.4 PTR

With apologies to the owners of the obfuscated domains.

The two duplicated PTR records have no subdomains; several others have none either, but are correctly listed in Source-Hostnames.txt anyway. Don't try these IPv4 addresses at home.

As usual, Magic Banana has made the complex plain as day; thank you !

BTW, the man awk page says that incrementing a variable looks like this: "inc and dec ++ -- (both post and pre)" which requires careful reading.

George Langford

P.S. I tried these commands on the original 2MB file and got no duplicated A records, only PTR records, only a few of which appear to be webpages.

Attachment: SourceFile.txt (14.18 KB)
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

So... is your problem solved?

Pre and post increments can really be confusing. I generally avoid post-increments. Those operators are programming classics: https://en.wikipedia.org/wiki/Increment_and_decrement_operators

In your awk command, you can replace "} {" with ";" (the condition is the same, none, for your two actions: you simply have one single action executing several instructions in sequence). Also, concatenating constant strings (by separating them with a space) is kind of useless: you can directly write the concatenated string. With those two changes:
$ awk '{ print $2 "\t" $1 "\tPTR"; for (i = 2; ++i <= NF; ) print $i "\t" $1 "\tA"}' SourceFile.txt

The output can be directly piped to the next command, i.e., Source-Hostnames.txt is an unnecessary intermediary file. So is Sorted-Hostnames.txt. Thanks to its option -k, sort can actually sort the lines w.r.t. any field(s), which makes awk's reordering, in the last command, kind of useless: the first awk can output the fields in the order expected by 'uniq' (which can only skip the first fields). All in all:
$ awk '{ print $1 "\t" $2 "\tPTR"; for (i = 2; ++i <= NF; ) print $1 "\t" $i "\tA"}' SourceFile.txt | sort -k 2 | uniq -Df 1 | awk '{ print $2 "\t" $1 "\t" $3}'
bedtastic.example 998.997.996.41 A
bedtastic.example 998.997.996.6 A
bristol-fire.example 998.997.996.17 A
bristol-fire.example 998.997.996.7 A
fishandaman.example 998.997.996.15 A
fishandaman.example 998.997.996.9 A
thsalesassociates.example 998.997.996.41 A
thsalesassociates.example 998.997.996.6 A
vps24271.serverexample.calm 998.997.996.41 PTR
vps24271.serverexample.calm 998.997.996.4 PTR

If you really want the intermediary files, you can use the 'tee' command. See slides 11-13 of https://dcc.ufmg.br/~lcerf/slides/mda7.pdf
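A minimal sketch of the same pipeline with 'tee' keeping both intermediary files follows. The two-row input is hypothetical, and the -D option of uniq is assumed to be available (it is a GNU coreutils option, already used in the commands above).

```shell
tmp=$(mktemp -d)

# Hypothetical input: address, PTR name, then A records, all tab-separated.
printf '192.0.2.1\tptr1.example\tshared.example\n' > "$tmp/SourceFile.txt"
printf '192.0.2.2\tptr2.example\tshared.example\tonly.example\n' >> "$tmp/SourceFile.txt"

# tee writes a copy of the stream to a file and passes it on unchanged.
dups=$(awk '{ print $1 "\t" $2 "\tPTR"; for (i = 2; ++i <= NF; ) print $1 "\t" $i "\tA" }' "$tmp/SourceFile.txt" |
  tee "$tmp/Source-Hostnames.txt" |
  sort -k 2 |
  tee "$tmp/Sorted-Hostnames.txt" |
  uniq -Df 1 |
  awk '{ print $2 "\t" $1 "\t" $3 }')

# Number of lines kept in the sorted intermediary file.
kept_lines=$(awk 'END { print NR }' "$tmp/Sorted-Hostnames.txt")

echo "$dups"
rm -r "$tmp"
```

Only the hostname listed under both addresses is reported; the intermediary files hold every (address, hostname, type) line.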

amenex
Offline
Joined: 01/03/2015

As usual, Magic Banana is way ahead of me. Bear in mind that I'm happy when the awk command works; streamlining is a homework exercise.

Here's an example of the task at hand:

I started with these two entries: [zappros.ru ==> 80.87.197.27] and [zappros.ru ==> 92.63.99.154].
Running nslookup on each of those IPv4 addresses returns zappros.ru, but an nslookup of zappros.ru returns (surprise !):

Name: zappros.ru
Address: 90.156.201.87
Name: zappros.ru
Address: 90.156.201.22
Name: zappros.ru
Address: 90.156.201.34
Name: zappros.ru
Address: 90.156.201.47

It turns out that these additional four instances of the PTR record zappros.ru are in the CIDR block 90.156.201.0/24, also in Russia, but on the autonomous system AS25532, which wasn't in the search list where I started. Note also that our two original examples of zappros.ru are unresolvable according to nslookup. There is actually a company website residing at one of these instances of zappros.ru. The PTR records that Hurricane Electric gives for these four IPv4 addresses are different ... probably changed since they scanned that CIDR block; my nslookups were performed today (May 2, 2019). I wish there were a way of using Nmap to scan for A records ...

Let's check the four lists of A records that reside at each of the four additional zappros.ru PTR locations.

Hurricane Electric gives us a comma- and tab-delimited list of A records for each of the IPv4 addresses: 90.156.201.87, 90.156.201.22, 90.156.201.34, and 90.156.201.47. Convert each of these single-row arrays into two-column lists with an awk command like this one:

time awk ' { for (i = 0; ++i <= NF; ){ print $i"\t" "90.156.201.##" } }' 'BGP-90.156.201.##.ccs.txt' > zappros-##-hosts.txt

(for ## = 87, 22, 34 or 47, respectively)

where the files like BGP-90.156.201.87.ccs.txt have been stripped of commas, leaving only tabs between the entries in each list. The four zappros-##-hosts.txt files are then concatenated into one large LibreOffice-Calc spreadsheet, sorted on the first column, and saved as a two-column text file, zappros-all-hosts-sorted.txt, which can be checked for duplicate hostnames with the following awk command:

time awk 'NR >= 2 { print $2, $1 }' 'zappros-all-hosts-sorted.txt' | uniq -Df 1 | awk '{ print $2 "\t" $1 }' > Duplicates.txt

which has zero bytes and therefore displays as a blank screen

The zappros-all-hosts-sorted.txt file can be manually checked in a few minutes; yes, there are no duplicate A records there, either within an IPv4 address, or across more than one IPv4 address.
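The per-address conversion and the duplicate check described above can also be sketched as a single shell function, without the spreadsheet step. This is an illustration under assumptions: the function name find_duplicates is hypothetical, the four comma-stripped files BGP-90.156.201.##.ccs.txt are assumed to be in the current directory, and uniq is assumed to support -D (GNU coreutils). Unlike the command above, the sketch does not skip the first line.

```shell
# Hypothetical helper: print every hostname listed under more than one of
# the four addresses, as "hostname<TAB>address".
find_duplicates() {
  for n in 87 22 34 47; do
    # One output line per hostname: address first, then the hostname.
    awk -v ip="90.156.201.$n" '{ for (i = 0; ++i <= NF; ) print ip "\t" $i }' \
      "BGP-90.156.201.$n.ccs.txt"
  done |
    sort -k 2 |                      # sort on the hostname
    uniq -Df 1 |                     # keep all lines whose hostname repeats
    awk '{ print $2 "\t" $1 }'       # hostname first, address second
}
```

Running find_duplicates in the directory holding the four files prints nothing when, as here, no hostname appears more than once.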

Yes, Magic Banana is right to be showing how to streamline my roadmap of how to find all the duplicate PTR records on the Russian Internet (and, unfortunately, elsewhere, thanks to the Internet Research Agency) by starting with the blocklist for Russia (https://community.cisco.com/t5/firewalls/block-all-russia-public-ip-addresses/td-p/2094303).

Attachment: zappros-all-hosts-sorted.txt (91.07 KB)