Extracting CIDR and country code from an address block

amenex
Joined: 01/03/2015

After discovering that a set of suspicious IPv4 addresses gives different results from an nmap scan
from one day to the next, I'm trying to consolidate that data in order to find any patterns in it.

Nmap has a script that will return the AS number and country code for each of a series of IP addresses,
but my lists number in the tens to hundreds of thousands of addresses, so I'm making an assumption
(based on experience) that I can take only the first three octets of each address, expressed as
a /24 CIDR block, which shortens the lists considerably.

Whois (downloadable from the Trisquel repository) will respond with the address range of the main
block of addresses belonging to an entity as well as its two-character country code, but only for
viable CIDR blocks. The bad blocks stop my scripts in their tracks; and both the country codes and
address ranges often appear more than once in the output of a whois query. Presently I want only
the first (primary) whois responses to each whois query.
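
A minimal sketch of keeping only the first address range and the first country code of a whois reply. It assumes the reply uses the field names inetnum: and country: (the two patterns mentioned for grepsearch.txt); other registries may label these fields differently. The helper name first_whois_fields is hypothetical, and the sample reply is made up for illustration. The helper reads the reply on standard input, so it can be tried on saved text without querying any server:

```shell
# Hypothetical helper: print the first "inetnum:" line and the first
# "country:" line found on standard input, then stop reading.
first_whois_fields() {
  awk '
    /^inetnum:/ && !seen_range   { print; seen_range = 1 }
    /^country:/ && !seen_country { print; seen_country = 1 }
    seen_range && seen_country   { exit }
  '
}

# Canned sample standing in for a real reply (values are placeholders).
printf '%s\n' \
  'inetnum:        192.0.2.0 - 192.0.2.255' \
  'country:        ZZ' \
  'inetnum:        192.0.2.0 - 192.0.2.127' \
  'country:        ZZ' | first_whois_fields
```

Calling whois once per block inside a "while read" loop, and piping each reply through the helper, would probably also keep a failed lookup from stopping the rest of the run.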

Here's a typical script with which I'm starting:
awk '{print $1}' IPv4s.2004-2013.txt | sed 's/\./\t/g' | awk '{print $1"."$2"."$3".0/24"}' '-' | sort -u | whois -?? '-' | grep -f grepsearch.txt '-' |more
where grepsearch.txt is a list of the two search patterns, inetnum and country. The address list
IPv4s.2004-2013.txt contains 2.5 MB of addresses, a couple hundred thousand of them, gleaned
from the QuickSpam reports provided by a ten-year spamcop.net subscription. A lot of innocent
parties are represented along with the spammers' IPv4 addresses, so I'll be weeding out collateral
damage.

Suggestions are welcome.

Magic Banana

I am a member!

I am a translator!

Joined: 07/24/2010

Assuming there is no period in IPv4s.2004-2013.txt but in the IP addresses, all that...
awk '{print $1}' IPv4s.2004-2013.txt | sed 's/\./\t/g' | awk '{print $1"."$2"."$3".0/24"}' '-' | sort -u
... can be condensed a lot:
sed 's:[^.]*$:0/24:' IPv4s.2004-2013.txt | sort -u
If there may be periods after the first field in IPv4s.2004-2013.txt, you can write this instead (you may prefer it even when there are no such periods, because the sed expression is even simpler):
cut -d . -f -3 IPv4s.2004-2013.txt | sort -u | sed 's:$:.0/24:'

If you prefer awk to sed, you can write this in the first case (no period after the first field):
awk -F . -v OFS=. '{ $4 = "0/24"; print }' IPv4s.2004-2013.txt | sort -u
And this in the second case (possibly periods after the first field):
awk -F . '{ print $1 "." $2 "." $3 ".0/24" }' IPv4s.2004-2013.txt | sort -u
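
As a quick check of what these variants produce, here is a small sketch built on the cut-based command, applied to three addresses that appear later in this thread. The helper name to_slash24 is hypothetical, and the input is assumed to hold one address per line with nothing before it:

```shell
# Hypothetical helper: reduce each IPv4 address on standard input to its
# /24 block and list every block once.
to_slash24() {
  cut -d . -f -3 | sort -u | sed 's:$:.0/24:'
}

# Two of these addresses share their first three octets.
printf '%s\n' 123.17.248.240 123.17.248.9 123.17.249.104 | to_slash24
# prints:
# 123.17.248.0/24
# 123.17.249.0/24
```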

As usual, your '-' are useless.

amenex

It took a while for me to catch on to Magic Banana's perfect English:
Assuming there is no period in IPv4s.2004-2013.txt ...
What that means is there is no trailing dot on any line in the file containing the IPv4 addresses,
as in "no period at the end of a sentence." Those were my nemesis for a while ...

Magic Banana also demonstrates that bash scripts are state functions: the path to
the end doesn't matter (so long as there is no divide by zero). That said, the fifth
script is truly eloquent, as it's instantly understandable.

Imagine saying out loud any of the following: to, two, 10, too, 2.

amenex

Here's some progress, going back to Square One and starting with the raw IPv4 address list and nmap:
sudo nmap -Pn -sn -T2 --max-retries 16 --script asn-query -iL IPv4s.2021.sort.txt | grep -f GrepListnMap.txt - | sed 's/Nmap scan report for //g' | sed 's/(/ /g' | sed 's/|_asn-query: See the result for /ASN: see /g' | sed 's/\n| BGP: / /g' |more
GrepListnMap.txt is attached, along with a shorter (and different) address list that should serve to
illustrate the task at hand.
The outcomes below are not quite "there" yet; they are subdivided by class of result:
cm15.websitewelcome.com 100.42.49.9)
| BGP: 100.42.49.0/24 | Country: US
| Origin AS: 46606 - UNIFIEDLAYER-AS-1, US

Above, I'm having trouble removing that trailing ")" and newline character.
elec.cascadecompany.net 102.129.152.218)
ASN: see 102.129.153.241

Only two lines, because nmap refers us back to the data for 102.129.153.241, the following IPv4 address.
promo.banditthis.com 102.129.153.241)
| BGP: 102.129.153.0/24 and 102.129.144.0/20 | Country: ZA
| Origin AS: 174 - COGENT-174, US

I want to collapse the first pair of lines, and then the second set of three lines, each into single lines.

The following link shows how to replace newline characters with sed:
https://www.cyberciti.biz/faq/sed-remove-m-and-line-feeds-under-unix-linux-bsd-appleosx/
but I tried to piggyback the replacement of the preceding ")" so that sed would not collapse the entire
file into a single line. As it is, sed has to proceed all the way through the file to its end, and that
takes too long for me to find out whether or not it works. Compare:
sed ':a;N;$!ba;s/\n//g'
versus my hack:
sed ':a;N;$!ba;s/)\n//g'
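
One way to find out quickly whether the hack works is to feed it a few lines instead of the whole scan. The sketch below assumes GNU sed; the helper name join_paren_lines is hypothetical, and it replaces the ")" and newline with a space rather than with nothing, so the joined fields stay apart:

```shell
# Hypothetical helper: join a line ending in ")" with the line that follows it.
# GNU sed assumed (labels followed by ";" and "\n" in the pattern).
join_paren_lines() {
  sed ':a;N;$!ba;s/)\n/ /g'
}

# Three lines taken from the excerpt above, so the check runs instantly.
printf '%s\n' \
  'cm15.websitewelcome.com 100.42.49.9)' \
  '| BGP: 100.42.49.0/24 | Country: US' \
  '| Origin AS: 46606 - UNIFIEDLAYER-AS-1, US' | join_paren_lines
```

Only the newline that follows a ")" is removed, so the "| Origin AS" line still comes out on a line of its own; collapsing it too needs a further rule.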

Attachments (size):
GrepListnMap.txt (90 bytes)
IPv4s.2021.sort_.txt (39.89 KB)

Magic Banana

I'm having trouble removing that trailing ")" and newline character.

The trailing ")" apparently helps to identify the beginning of the blocks you want in single lines (or are there other lines ending with a closing parenthesis?). Using it as a record separator, maybe you want something like that:

awk -v RS=')\n' -v FS='\n' '{ e = $NF; if (--NF) print; printf e " " }'

I assume here that there is no one-line "block". If there is, replace "if (--NF)" with "--NF; if (NR != 1)", to not have the one-line block on the same line as the next block.

As it is, sed has to proceed all the way through the file to its end, and that takes too long for me to find out whether or not it works.

Just start your command line with 'head' when testing.

amenex

There's something about head that doesn't like capital P's, so I suffered through a few minutes' wait for our
script to produce outputs; I also subdivided the source file to cut the processing times for testing:
sudo nmap -Pn -sn -T2 --max-retries 16 --script asn-query -iL IPv4s.2004-2013.b.e.txt | grep -f GrepListnMap.txt - | sed 's/Nmap scan report for //g' | sed 's/(/ /g' | awk -v RS='\)\n' -v FS='\n' '{ e = $NF; if (--NF) print; printf e " " }' &> Test-07202021.b.e.excerpts.txt
As you may notice, I had to escape that lonely ")"; but the edit (which I skipped adding) for one-liners
removed all their New Line characters, so I added double line feeds to the output file to mark the places
where the awk script missed New Line characters. Here's a snippet of that file:
static.vnpt.vn 123.17.248.240 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.248.9 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.249.104 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.249.106 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.249.113 | BGP: 123.17.0.0/16 | Country: VN | Origin AS: 45899 - VNPT-AS-VN VNPT Corp, VN
static.vnpt.vn 123.17.251.145 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.252.117 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.252.224 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.253.101 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.253.169 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.253.253 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.254.188 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.255.132 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.255.31 |_asn-query: See the result for 123.17.249.113

123.172.7.131 | BGP: 123.172.0.0/19 | Country: CN | Origin AS: 4809 - CHINATELECOM-CORE-WAN-CN2 China Telecom Next Generation Carrier Network, CN

123.174.10.15 | BGP: 123.174.0.0/15 | Country: CN | Origin AS: 4134 - CHINANET-BACKBONE No.31,Jin-rong Street, CN

123.174.12.151 |_asn-query: See the result for 123.174.10.15

123.174.130.19 |_asn-query: See the result for 123.174.10.15

123.174.227.1 |_asn-query: See the result for 123.174.10.15

123.174.240.38 |_asn-query: See the result for 123.174.10.15
static.vnpt.vn 123.17.46.111 |_asn-query: See the result for 123.17.249.113
static.vnpt.vn 123.17.48.101 |_asn-query: See the result for 123.17.249.113

123.174.82.161 |_asn-query: See the result for 123.174.10.15

123.174.8.250 |_asn-query: See the result for 123.174.10.15

123.174.85.39 |_asn-query: See the result for 123.174.10.15

123.175.119.134 |_asn-query: See the result for 123.174.10.15

123.175.159.101 |_asn-query: See the result for 123.174.10.15

123.175.174.148 |_asn-query: See the result for 123.174.10.15

123.175.183.28 |_asn-query: See the result for 123.174.10.15
static.vnpt.vn 123.17.52.98 |_asn-query: See the result for 123.17.249.113

123.175.3.163 |_asn-query: See the result for 123.174.10.15

123.175.5.164 |_asn-query: See the result for 123.174.10.15

123.175.95.62 |_asn-query: See the result for 123.174.10.15

123.176.0.0 | BGP: 123.176.0.0/23 and 123.176.0.0/19 | Country: MV | Origin AS: 7642 - DHIRAAGU-MV-AP DHIVEHI RAAJJEYGE GULHUN PLC, MV 1

23.176.10.23 |_asn-query: See the result for 123.176.0.0

123.176.12.139 |_asn-query: See the result for 123.176.0.0

123.176.13.143 |_asn-query: See the result for 123.176.0.0

123.176.13.71 |_asn-query: See the result for 123.176.0.0

123.176.14.36 |_asn-query: See the result for 123.176.0.0
static.vnpt.vn 123.17.6.172 |_asn-query: See the result for 123.17.249.113

123.176.25.138 |_asn-query: See the result for 123.176.0.0

123.176.25.94 |_asn-query: See the result for 123.176.0.0

123.176.3.1 |_asn-query: See the result for 123.176.0.0

123.176.31.255 |_asn-query: See the result for 123.176.0.0
broadband.actcorp.in 123.176.35.45 | BGP: 123.176.35.0/24 | Country: IN | Origin AS: 18209 - BEAMTELE-AS-AP Atria Convergence Technologies pvt ltd, IN
broadband.actcorp.in 123.176.36.226 | BGP: 123.176.36.0/24 | Country: IN | Origin AS: 18209 - BEAMTELE-AS-AP Atria Convergence Technologies pvt ltd, IN
broadband.actcorp.in 123.176.36.83 |_asn-query: See the result for 123.176.36.226
broadband.actcorp.in 123.176.40.102 | BGP: 123.176.40.0/24 | Country: IN | Origin AS: 18209 - BEAMTELE-AS-AP Atria Convergence Technologies pvt ltd, IN
broadband.actcorp.in 123.176.40.2 |_asn-query: See the result for 123.176.40.102
broadband.actcorp.in 123.176.41.13 | BGP: 123.176.41.0/24 | Country: IN | Origin AS: 18209 - BEAMTELE-AS-AP Atria Convergence Technologies pvt ltd, IN
static.vnpt.vn 123.17.65.151 |_asn-query: See the result for 123.17.249.113
mail.mof.gov.ws 123.176.74.66 | BGP: 123.176.72.0/22 | Country: WS | Origin AS: 38227 - CSLSAMOA-WS-AS-AP Computer Services Limited CSL, WS
static.vnpt.vn 123.17.68.227 |_asn-query: See the result for 123.17.249.113

12.3.177.20 | BGP: 12.0.0.0/9 | Country: US | Origin AS: 7018 - ATT-INTERNET4, US

123.179.164.70 | BGP: 123.178.0.0/15 | Country: CN | Origin AS: 4134 - CHINANET-BACKBONE No.31,Jin-rong Street, CN

123.179.192.83 |_asn-query: See the result for 123.179.164.70

123.179.194.201 |_asn-query: See the result for 123.179.164.70
There's something nightmarish about the nmap output; I'll try separating the rows into groups depending
on their context.

Attachments (size):
IPv4s.2004-2013.b.e.txt (28.03 KB)
GrepListnMap.txt (90 bytes)
Test-07202021.b.e.excerpts.txt (55.62 KB)

Magic Banana

There's something about head that doesn't like capital P's, so I suffered through a few minutes' wait for our script to produce outputs

There is nothing wrong with head. If you want to test on the first 20 lines of IPv4s.2004-2013.b.e.txt, the command line should start with:
head -20 IPv4s.2004-2013.b.e.txt | ...

When a command takes time and you are unsure of the validity of its post-processing, save the output of the time-consuming command so that you never pay that price again. Here, nmap takes time: save its output. I need part of that output, and the transformation you want applied to it, to understand what you are after and possibly help you: I do not even have Nmap on my system (nor do I want to). By the way, have you checked Nmap's output formats to see if another one is better suited?
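
The "save the output" advice can be sketched as a small helper. The name run_once and the file name in the usage comment are hypothetical; the idea is only that the slow command runs when its output file is missing or empty, and every later experiment reads the saved file:

```shell
# Hypothetical helper: run a slow command only if its output file is
# missing or empty; the first argument is that file, the rest is the command.
run_once() {
  out=$1
  shift
  if [ ! -s "$out" ]; then
    "$@" > "$out"
  fi
}

# Intended use (not executed here):
#   run_once nmap-raw.txt sudo nmap -Pn -sn -T2 --max-retries 16 --script asn-query -iL IPv4s.2004-2013.b.e.txt
#   head -20 nmap-raw.txt | ...   # then try the post-processing on a few lines
```

If the slow command fails halfway, the partial file has to be deleted by hand before the next attempt; the sketch does not handle that case.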

the edit (which I skipped adding) for one-liners removed all their New Line characters

I wrote in my previous post:

I assume here that there is no one-line "block". If there is, replace "if (--NF)" with "--NF; if (NR != 1)", to not have the one-line block on the same line as the next block.

Doing that replacement gives:
awk -v RS=')\n' -v FS='\n' '{ e = $NF; --NF; if (NR != 1) print; printf e " " }'

amenex

Following Magic Banana's advice, I ran the nmap script:
sudo nmap -Pn -sn -T2 --max-retries 16 --script asn-query -iL IPv4s.2004-2013.b.e.txt | grep -f GrepListnMap.txt - > nMap-Output.b.e.02.txt
Counting the number of "Host is up" strings gives the same number as the quantity of IPv4 inputs, 1983.

Processing that output in different ways to reduce the complexity of the Outputs file:
awk '{print $0}' nMap-Output.b.e.02.txt | grep "|_asn-query: " -B 2 | sed 's/Host is up.//g' | sed 's/--//g' | sed '/./!d' > Test-07202021.asn-query.b.e.02.txt
==> 1501 pairs of lines

awk '{print $0}' nMap-Output.b.e.02.txt | grep "Origin AS:" -B 3 | sed 's/--//g' | sed '/./!d' | sed 's/Host is up.//g' | sed '/./!d' > Test-07202021.Origin_AS.b.e.02.txt
==> 222 groups of three lines

Those two scripts account for all but 260 of the input IPv4's. Let's go into sleuth mode:
awk '{print $0}' nMap-Output.b.e.02.txt | grep "Host is up." -B 1 | sed 's/Host is up.//g' | sed 's/--//g' | sed '/./!d' > Test-07202021.Host_is_up.b.e.02.txt ==> 1983 lines
comm -23 <(awk '{print $0}' Test-07202021.Host_is_up.b.e.02.txt | sort) <(awk '{print $0}' Test-07202021.BGP.b.e.02.txt | sort) > Test-07202021.comm23-BGP.b.e.02.txt ==> 1761 lines
Seem to account for those 260 lines ... which IPv4's are they ?
comm -23 <(awk '{print $0}' Test-07202021.Host_is_up.b.e.02.txt | sort) <(awk '{print $0}' Test-07202021.asn-query.b.e.02.txt | sort) > Test-07202021.comm23-ASN.b.e.02.txt ==> 482 lines
comm -12 <(awk '{print $0}' Test-07202021.Host_is_up.b.e.02.txt | sort) <(awk '{print $0}' Test-07202021.asn-query.b.e.02.txt | sort) > Test-07202021.comm12-ASN.b.e.02.txt ==> 222 lines
Same difference: 260 lines. Let's count the lines in Test-07202021.comm23-ASN.b.e.02.txt that aren't in Test-07202021.comm12-ASN.b.e.02.txt:
wc -l <(comm -13 <(sort Test-07202021.comm12-ASN.b.e.02.txt) <(sort Test-07202021.comm12-ASN.b.e.02.txt)) ==> 0
Apply the same logic to the other pair with a differential of 260 lines:
wc -l <(comm -13 <(sort Test-07202021.comm23-BGP.b.e.02.txt) <(sort Test-07202021.Host_is_up.b.e.02.txt)) ==> 222
Maybe the GrepListnMap.txt file is missing something; rerunning that nmap scan (two more minutes) to count the lines with parentheses:
sudo nmap -Pn -sn -T2 --max-retries 16 --script asn-query -iL IPv4s.2004-2013.b.e.txt | grep "(" - > nMap-Output.parens.b.e.02.txt ==> 267 (net) lines
wc -l <(comm -13 <(sort Test-07202021.comm12-ASN.b.e.02.txt) <(sort nMap-Output.parens.b.e.02.txt)) ==> 85 (net) lines
Here's where we are: There are 1983 input IPv4's, and nmap declared them all "up" (operational, even if not having DNS). My aim is to process all 1983 outputs into single-line records.
Even though nmap found no ASN's or CIDR blocks for quite a few, that may change tomorrow, as it does for the other addresses and their oft-changing PTR's.
Leafpad can flatten the three-line and two-line groups, so bash ought to be able as well. Leafpad processing can be tedious, though.
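
For readers following the comm bookkeeping above, a small sketch of what the -23 selection does: it prints the lines found only in the first input (-12 prints the lines common to both, -13 the lines found only in the second). Both inputs must be sorted the same way. The helper name only_in_first is hypothetical:

```shell
# Hypothetical helper: print the lines of file $1 that do not occur in file $2.
# Both files are sorted into temporary copies first, as comm requires.
only_in_first() {
  tmpdir=$(mktemp -d)
  sort "$1" > "$tmpdir/first"
  sort "$2" > "$tmpdir/second"
  comm -23 "$tmpdir/first" "$tmpdir/second"
  rm -rf "$tmpdir"
}

# Intended use (not executed here), with two of the files named above:
#   only_in_first Test-07202021.Host_is_up.b.e.02.txt Test-07202021.asn-query.b.e.02.txt | wc -l
```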

Attachments (size):
nMap-Output.b.e.02.txt (192.16 KB)
Test-07202021.Origin_AS.b.e.02.txt (32.99 KB)
Test-07202021.Host_is_up.b.e.02.txt (78.38 KB)
Test-07202021.comm23-BGP.b.e.02.txt (67.69 KB)
Test-07202021.comm23-ASN.b.e.02.txt (19.92 KB)
Test-07202021.comm12-ASN.b.e.02.txt (58.46 KB)
nMap-Output.parens.b.e.02.txt (17.72 KB)
GrepListnMap.txt (102 bytes)
IPv4s.2004-2013.b.e.txt (28.03 KB)

Magic Banana

There are 1983 input IPv4's, and nmap declared them all "up" (operational, even if not having DNS). My aim is to process all 1983 outputs into single-line records.

If that means you only want records where "Host is up" (a piece of information I therefore erase):

$ awk -v RS='Nmap scan report for ' -v FS='\n' '$2 == "Host is up." { $2 = ""; print }' nMap-Output.b.e.02.txt | sed 's/ *| */|/g'

"Nmap scan report for " is here a perfect record separator, that you were removing. The final sed only deletes the spaces (of inconsistent number) around the delimiters, '|'.

Please, stop writing "awk '{print $0}'". Just give the file to the subsequent command.

amenex

After a revelation a day or two ago that I'd missed a class of nmap outputs, I put all those results in purgatory and started over,
this time running the nmap scans closer together (twenty minutes difference in starting & ending times) or two scans at once (one
second difference in starting times, simultaneous ending time).
In the first comparison, essentially half of the addresses gave different outputs to nmap; in the second, 40% changed; the number
of returns stayed the same as the number of addresses, 1983 in this small sample, one-tenth of one-ninth of the ~180,000 addresses.
Here's the nmap lookup script, the same for all four tests:
sudo nmap -Pn -sn -T2 --max-retries 16 --script asn-query -iL IPv4s.2004-2013.b.e.txt | grep -f GrepListnMap03.txt - > NewASNData/nMap-Output.b.e.03a.txt
with a & b representing the twenty-minute differential and c & d the simultaneous scans, and a new GrepListnMap pattern file; there aren't any "No Answers" results.
In order to run the diff scripts, it's necessary to flatten & condense the following scripts into multi-column rows with the addresses in the $1 position; that
task was quickly accomplished with Leafpad (flatten the multi-line results) and LibreOffice.Calc (rearrange and condense the columns):
grep "Origin AS:" -B 3 NewASNData/nMap-Output.b.e.03a.txt | sed 's/--//g' | sed '/./!d' | sed 's/Host is up.//g' | sed '/./!d' > NewASNData/Test-07202021.Origin_AS.b.e.03a.txt
with a,b,c & d as above. There are two differential scripts, one for a-to-b, the second for c-to-d:
diff -y --suppress-common-lines <(sort -k 1,1 NewASNData/Test-07202021.Origin_AS.b.e.03a-Leafpad-Calc.txt) <(sort -k 1,1 NewASNData/Test-07202021.Origin_AS.b.e.03b-Leafpad-Calc.txt) > Diff.Origin_AS.b.e.20min.txt
and
diff -y --suppress-common-lines <(sort -k 1,1 NewASNData/Test-07202021.Origin_AS.b.e.03c-Leafpad-Calc.txt) <(sort -k 1,1 NewASNData/Test-07202021.Origin_AS.b.e.03d-Leafpad-Calc.txt) > Diff.Origin_AS.b.e.01sec.txt
There are further comparisons to be made to tease out what sort of patterns there are. Some appear to result from random responses to presentation of some IPv4 addresses. Others may be the result of actual PTR changes.

Attachments (size):
IPv4s.2004-2013.b.e.txt (28.03 KB)
GrepListnMap03.txt (127 bytes)
Test-07202021.Origin_AS.b.e.03a-Leafpad-Calc.txt (12.63 KB)
Test-07202021.Origin_AS.b.e.03b-Leafpad-Calc.txt (12.87 KB)
Test-07202021.Origin_AS.b.e.03c-Leafpad-Calc.txt (14.2 KB)
Test-07202021.Origin_AS.b.e.03d-Leafpad-Calc.txt (13.02 KB)
Diff.Origin_AS.b.e.20min.txt (8.34 KB)
Diff.Origin_AS.b.e.01sec.txt (6.72 KB)

Magic Banana

In order to run the diff scripts, it's necessary to flatten & condense the following scripts into multi-column rows with the addresses in the $1 position; that task was quickly accomplished with Leafpad (flatten the multi-line results) and LibreOffice.Calc (rearrange and condense the columns)

... and in a few months you will be back here asking for help because you made human errors manually processing the "~180,000 addresses" or because you forgot how you did it and want to redo the same. What is fundamentally wrong with the couple of commands I gave you (which can be saved for reuse)?

Considering what you attached, I modified that couple of commands. Below, 1) sed now removes the field names (although they look useful...), 2) awk now writes the IP address first and either the hostname or "No_DNS" second, 3) a tabulation is now used as the output field separator:
$ sed 's/ *|[^:]*: /\n/g' nMap-Output.b.e.02.txt | awk -F '\n+' -v OFS='\t' -v RS='\n*Nmap scan report for ' '$2 == "Host is up." { if (split($1, a, /[ ()]+/) == 1) $2 = "No_DNS"; else { $1 = a[2]; $2 = a[1] }; print }'

If you really want the parentheses around the IP addresses (as in your attachment), remove the characters '(' and ')' in the third argument of the split function.
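
To see what the split call is doing on its own, here is a small sketch. The helper name ip_and_host is hypothetical; it takes the text that follows "Nmap scan report for " and prints the IP address first, then the hostname or No_DNS:

```shell
# Hypothetical helper: turn "hostname (IP)" or a bare "IP" into "IP<TAB>name".
ip_and_host() {
  awk -v OFS='\t' '{
    # Spaces and parentheses act as separators, so "hostname (IP)" yields
    # more than one piece, while a bare IP address stays in one piece.
    n = split($0, a, /[ ()]+/)
    if (n == 1) print a[1], "No_DNS"
    else        print a[2], a[1]
  }'
}

printf '%s\n' \
  '123-205-139-5.adsl.dynamic.seed.net.tw (123.205.139.5)' \
  '1.232.166.65' | ip_and_host
```

With '(' and ')' left out of the bracket expression, the second piece would keep its parentheses, which is the variation described above.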

amenex

Magic Banana's script does nearly all of my former Leafpad text processing, leaving the subsequent steps that I've been doing with LibreOffice.Calc.:
(1) Remove "the result for" from the phrase "See the result for", as we only need "See" to locate the incomplete CIDR, CC, ASN data for the associated IPv4 address.
(2) Remove the smallest CIDR blocks plus their neighboring "and" to leave the largest CIDR block and no other in each row to get the column numbering under control.
(3) Once (1) & (2) are finished, sorting on Col.$4 and then on Col.$1 puts all the "See" in the same Col.$3 so the associated Col.$4 IPv4 are together in the following rows.
(4) Search Col.$1 for the IPv4 (next to each "See" in Col.$3) in Col.$4 and copy Col.$3,$4,$5 from Col.$1's row into Col.$3,$4,$5 of the source "See" Col.$3 and ensuing rows with the same IPv4.
(5) Print "AS"$5 instead of just $5 so the autonomous system numbers are in the usual format: AS###.
(6) The final print statement should be limited to five columns: IPv4,(PTR or No_DNS), CIDR, CC, ASN to remove the associated server names, etc.
Those last steps are extraordinarily tedious when done with Calc.'s assistance; tolerable for the example, but that's only 1/90th of the job's ~180,000 addresses.

In Step (2) we could alternatively cut the CIDR block list down to the smallest representative CIDR block in each row, but some large CIDR blocks encompass many smaller blocks.
If we start by listing just the largest blocks, we can subsequently look up all and evaluate the subsidiary CIDR blocks with the CIDR Report: https://www.cidr-report.org/as2.0/

Attached are the newest nmap scans, "a" being twenty minutes ahead of "b" with the same IPv4 list, and a partially processed Calc. (.ods) file and its companion (.txt) file.

Attachments (size):
Test-07202021.Origin_AS.b.e.03a.txt (32.48 KB)
Test-07202021.Origin_AS.b.e.03b.txt (32.99 KB)
Test-07202021.asn-query.b.e.03a.MB_.ods (37.78 KB)
Test-07202021.asn-query.b.e.03a.MB_.Calc_partial.txt (85.9 KB)

Magic Banana

If I properly understood:

$ sed -e 's/ *|[^:]*: /\n/g' nMap-Output.b.e.02.txt | awk -F '\n+' -v OFS='\t' -v RS='\n*Nmap scan report for ' '$2 == "Host is up." { if (split($1, a, /[ ()]+/) == 1) $2 = "No_DNS"; else { $1 = a[2]; $2 = a[1] }; if (sub(/See the result for /, "", $3)) b[$1 OFS $2] = $3; else { NF = 5; sub(/ .*/, "", $3); sub(/ .*/, "", $5); $5 = "AS" $5; print; c[$1] = $3 OFS $4 OFS $5 } } END { for (d in b) print d, c[b[d]] }'

Those last steps are extraordinarily tedious when done with Calc.'s assistance; tolerable for the example, but that's only 1/90th of the job's ~180,000 addresses.

A fraction of the time you spend manually processing data would be enough to properly learn regular expressions, AWK, ...
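
For readability, the one-liner above can be spread over several lines. The sketch below is meant to behave the same way; the names flatten_asn, ref and data are hypothetical replacements for the single-letter arrays, and it assumes GNU sed plus an awk that accepts a regular expression as RS and lets NF be assigned (gawk and mawk do):

```shell
# Hypothetical wrapper: reads the filtered nmap output on standard input and
# prints one tab-separated row per address: IP, name, CIDR, country, AS.
flatten_asn() {
  sed -e 's/ *|[^:]*: /\n/g' |
  awk -F '\n+' -v OFS='\t' -v RS='\n*Nmap scan report for ' '
    $2 == "Host is up." {
      # $1 is either "IP" or "hostname (IP)": put the IP first, the name second.
      if (split($1, a, /[ ()]+/) == 1) {
        $2 = "No_DNS"
      } else {
        $1 = a[2]
        $2 = a[1]
      }
      if (sub(/See the result for /, "", $3)) {
        ref[$1 OFS $2] = $3      # remember which address holds the data
      } else {
        NF = 5                   # keep IP, name, CIDR, country, AS
        sub(/ .*/, "", $3)       # keep the leftmost CIDR block
        sub(/ .*/, "", $5)       # keep the AS number only
        $5 = "AS" $5
        print
        data[$1] = $3 OFS $4 OFS $5
      }
    }
    END {
      # Complete the "See the result for" rows from the rows they point to.
      for (key in ref)
        print key, data[ref[key]]
    }'
}

# Intended use (not executed here):
#   flatten_asn < nMap-Output.b.e.02.txt
```

The rows completed in the END block come out in no particular order, so a final sort is probably wanted before comparing two runs.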

amenex

Magic Banana's universal script captures essentially everything in my wish list; I applied the script to a more recent nmap scan:
sed -e 's/ *|[^:]*: /\n/g' nMap-Output.b.e.03a.txt | awk -F '\n+' -v OFS='\t' -v RS='\n*Nmap scan report for ' '$2 == "Host is up." { if (split($1, a, /[ ()]+/) == 1) $2 = "No_DNS"; else { $1 = a[2]; $2 = a[1] }; if (sub(/See the result for /, "", $3)) b[$1 OFS $2] = $3; else { NF = 5; sub(/ .*/, "", $3); sub(/ .*/, "", $5); $5 = "AS" $5; print; c[$1] = $3 OFS $4 OFS $5 } } END { for (d in b) print d, c[b[d]] }' > Test-07202021.asn-query.b.e.03a.MB_Calc.txt
In order to facilitate the comparison between the script's output and my manual effort with LibreOffice.Calc., I sorted both output files:
sort -Vk 1,1 Test-07202021.asn-query.b.e.03a.MB_Calc.txt > Test-07202021.asn-query.b.e.03a.MB_Calc-Sort.txt
awk '{print $1"\t"$2"\t"$3"\t"$4"\tAS"$5}' Test-07202021.asn-query.b.e.03a.MB.Calc_partial.txt | tr -d '()' | sort -Vk 1,1 > Test-07202021.asn-query.b.e.03a.MB.Calc_partial-Sort.txt
Looking at the two sorted outputs line-by-line, I could see that the CIDR blocks were of differing sizes:
diff -y --suppress-common-lines <(sort Test-07202021.asn-query.b.e.03a.MB_Calc-Sort.txt) <(sort Test-07202021.asn-query.b.e.03a.MB.Calc_partial-Sort.txt) > Diff-07202021.asn-query.b.e.03a.MB-to-GL.txt
It would appear that we are truncating the CIDR lines in the nmap output differently, wherein I'm shortening the list by removing the left-most fields, whereas Magic Banana is removing the right-most ones.
Calc. sorts columns easily so I can delete field pairs ending in " and " with a few key strokes. A Bash script might search for " and " and then delete the " and " as well as the preceding field until there are no more and's.

Attachments (size):
Test-07202021.asn-query.b.e.03b.MB_Calc.txt (93.85 KB)
Test-07202021.asn-query.b.e.03a.MB_Calc-Sort.txt (93.85 KB)
Test-07202021.asn-query.b.e.03a.MB_.Calc_partial.txt (90.55 KB)
Test-07202021.asn-query.b.e.03a.MB_.Calc_partial-Sort.txt (93.83 KB)
Test-07202021.asn-query.b.e.03a.txt (126.99 KB)
Diff-07202021.asn-query.b.e.03a.MB-to-GL.txt (182.1 KB)

Magic Banana

Calc. sorts columns easily so I can delete field pairs ending in " and " with a few key strokes. A Bash script might search for " and " and then delete the " and " as well as the preceding field until there are no more and's.

Just change the regular expression, from / .*/ to /.* /:

$ sed -e 's/ *|[^:]*: /\n/g' nMap-Output.b.e.03a.txt | awk -F '\n+' -v OFS='\t' -v RS='\n*Nmap scan report for ' '$2 == "Host is up." { if (split($1, a, /[ ()]+/) == 1) $2 = "No_DNS"; else { $1 = a[2]; $2 = a[1] }; if (sub(/See the result for /, "", $3)) b[$1 OFS $2] = $3; else { NF = 5; sub(/.* /, "", $3); sub(/ .*/, "", $5); $5 = "AS" $5; print; c[$1] = $3 OFS $4 OFS $5 } } END { for (d in b) print d, c[b[d]] }'

You really should learn regular expressions...

Also, awk '{print $1"\t"$2"\t"$3"\t"$4"\tAS"$5}' Test-07202021.asn-query.b.e.03a.MB.Calc_partial.txt does absolutely nothing here (there are always five tab-separated fields; the AS variable being undefined, it is "" by default). As a consequence you could directly redirect with < the standard input of the subsequent tr -d '()'... except that there should never be any parentheses either! There is also no need to sort diff's inputs, which have already been sorted.

amenex

Magic Banana's universal script, intended to select the largest CIDR block, isn't doing that yet:
diff -y --suppress-common-lines --width=200 <(sort -Vk 1,1 Test-07202021.asn-query.b.e.03a.MB_Calc-Sort.txt) <(sort -Vk 1,1 Test-07202021.asn-query.b.e.03a.MB.Calc_partial-Sort.txt) > Diff-07202021.asn-query.b.e.03a.MBU-LP-Calc-to-GL.txt
Where Test-07202021.asn-query.b.e.03a.MB_Calc-Sort.txt has the "/.* /" expression;
and Test-07202021.asn-query.b.e.03a.MB.Calc_partial-Sort is the completed version of my tedious Calc. copy-and-paste effort, wherein I selected the right-most CIDR blocks.
All the differences in CIDR block sizes have the script's output blocks smaller than mine (larger X in CIDR/X).
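
Since the largest block is the one with the smallest X in CIDR/X, another way to pick it, independent of where it sits in the list, is to compare the prefix lengths directly. This is a different technique from the sub() calls used in the thread; the helper name largest_cidr is hypothetical, and the blocks may be separated by spaces or tabs around "and":

```shell
# Hypothetical helper: from a line such as "A/22 and B/19 and C/18",
# print the block with the smallest prefix length (the largest block).
largest_cidr() {
  awk '{
    n = split($0, block, /[ \t]+and[ \t]+/)
    best = block[1]
    for (i = 2; i <= n; i++) {
      split(block[i], p, "/")     # p[2] is the prefix length of this block
      split(best, q, "/")         # q[2] is the prefix length of the best so far
      if (p[2] + 0 < q[2] + 0) best = block[i]
    }
    print best
  }'
}

printf '%s\n' '123.205.136.0/22 and 123.205.128.0/19 and 123.205.128.0/18' | largest_cidr
# prints 123.205.128.0/18
```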

The awk print statement was restoring the tab separators, as a previous step had changed them to spaces.
The ()'s were a visual aid until my copy-and-paste task was completed, whereupon they indeed became unnecessary.
Starting the script with tr -d '()' drew tr's ire regarding something about squeezes.
Your tip that I should have used "tr -d '()' < filename" instead is well taken.

Attachments (size):
Test-07202021.asn-query.b.e.03a.MBU-LP-Calc.txt (93.77 KB)
Test-07202021.asn-query.b.e.03a.MB_.Calc_partial-Sort.txt (93.83 KB)
Diff-07202021.asn-query.b.e.03a.MBU-LP-Calc-to-GL.txt (49.92 KB)

Magic Banana

In AWK, sub(/.* /, "", $3) deletes from the third field everything up to the last space. For instance:
$ printf '123.176.0.0,No_DNS,123.176.0.0/23 and 123.176.0.0/19' | awk -F , '{ sub(/.* /, "", $3); print }'
123.176.0.0 No_DNS 123.176.0.0/19

Isn't that what you want? If not, show me a tiny input (Nmap's output for one single IP address should be enough) for which my command line does not provide the correct output, and what that correct output would be.

amenex

Here's my first attempt:
Snippet:
(123.205.97.203) 123-205-97-203.adsl.dynamic.seed.net.tw 123.205.96.0/20 and 123.205.64.0/18 and 123.205.0.0/17' TW 18049

GL-manual edit:
(123.205.97.203) 123-205-97-203.adsl.dynamic.seed.net.tw 123.205.0.0/17 TW 18049
MB's mini-script:
printf '(123.205.97.203),123-205-97-203.adsl.dynamic.seed.net.tw,123.205.96.0/20 and 123.205.64.0/18 and 123.205.0.0/17' | awk -F , '{ sub(/.* /, "", $3); print }'
The output of the mini-script was not what I was expecting ...

Falling back to basics:

Diff -y line (on the left, MB's script output; on the right, my manual edit with Calc.):
1.232.166.65 No_DNS 1.232.0.0/13 KR AS9318 | 1.232.166.65 No_DNS 1.224.0.0/11 KR AS9318
nMap-Output.b.e.03a.txt lines:
Nmap scan report for 1.232.166.65
Host is up.
| BGP: 1.232.0.0/13 and 1.224.0.0/11 | Country: KR
| Origin AS: 9318 - SKB-AS SK Broadband Co Ltd, KR

Diff -y line:
123.20.116.9 No_DNS 123.20.112.0/20 VN AS45899 | 123.20.116.9 No_DNS 123.20.64.0/18 VN AS45899
nMap-Output.b.e.03a.txt lines:
Nmap scan report for 123.20.116.9
Host is up.
| BGP: 123.20.112.0/20 and 123.20.64.0/18 | Country: VN
| Origin AS: 45899 - VNPT-AS-VN VNPT Corp, VN

Diff -y line:
123.205.139.5 123-205-139-5.adsl.dynamic.seed.net.tw 123.205.136.0/22 TW AS4780 | 123.205.139.5 123-205-139-5.adsl.dynamic.seed.net.tw 123.205.128.0/18 TW AS4780
nMap-Output.b.e.03a.txt lines:
Nmap scan report for 123-205-139-5.adsl.dynamic.seed.net.tw (123.205.139.5)
Host is up.
| BGP: 123.205.136.0/22 and 123.205.128.0/19 and 123.205.128.0/18 | Country: TW
| Origin AS: 4780 - SEEDNET Digital United Inc., TW

Diff -y line:
123.231.105.171 No_DNS 123.231.104.0/22 LK AS18001 | 123.231.105.171 No_DNS 123.231.0.0/17 LK AS18001
nMap-Output.b.e.03a.txt lines:
Nmap scan report for 123.231.105.171
Host is up.
| BGP: 123.231.104.0/22 and 123.231.96.0/19 and 123.231.0.0/17 | Country: LK
| Origin AS: 18001 - DIALOG-AS Dialog Axiata PLC., LK

Applying Magic Banana's script to the four nmap scan examples cited above;
sed -e 's/ *|[^:]*: /\n/g' Examples-4-MB-Script.txt | awk -F '\n+' -v OFS='\t' -v RS='\n*Nmap scan report for ' '$2 == "Host is up." { if (split($1, a, /[ ()]+/) == 1) $2 = "No_DNS"; else { $1 = a[2]; $2 = a[1] }; if (sub(/See the result for /, "", $3)) b[$1 OFS $2] = $3; else { NF = 5; sub(/ .*/, "", $3); sub(/ .*/, "", $5); $5 = "AS" $5; print; c[$1] = $3 OFS $4 OFS $5 } } END { for (d in b) print d, c[b[d]] }'
The outputs are the same as the left-hand side of the quoted diff -y outputs:
1.232.166.65 No_DNS 1.232.0.0/13 KR AS9318
123.20.116.9 No_DNS 123.20.112.0/20 VN AS45899
123.205.139.5 123-205-139-5.adsl.dynamic.seed.net.tw 123.205.136.0/22 TW AS4780
123.231.105.171 No_DNS 123.231.104.0/22 LK AS18001

While there are no three-"and" examples in the selected fraction of the data, they do exist, and the script may even have to deal with four-"and" lines.
These --asn-query versions of the nmap scans get their very stable data from "official sources," unlike the previous, quite volatile nmap scans. That's a revelation!

Attachment    Size
Examples-4-MB-Script.txt 703 bytes
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

The output of the mini-script was not what I was expecting ...

There are tabulations around "and" in your input, contrary to Nmap's output, which uses spaces.
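A minimal sketch of how such stray tabulations could be made visible and then neutralized before the line reaches the parsing script. The sample line is adapted from the 123.20.116.9 example above, and `sample_line` and `normalize_blanks` are hypothetical helper names, not part of the scripts in this thread.

```shell
#!/bin/sh
# Hypothetical sample: a BGP line with tabulations (not spaces) around "and".
sample_line() {
    printf '| BGP: 123.20.112.0/20\tand\t123.20.64.0/18 | Country: VN\n'
}

# Hypothetical helper: turn tabs into spaces, then squeeze runs of spaces,
# so that " and " separates the CIDR blocks as in Nmap's own output.
normalize_blanks() {
    tr '\t' ' ' | tr -s ' '
}

# The "l" command of sed prints non-printing characters visibly (tabs as \t).
sample_line | sed -n 'l'

# The same line, normalized.
sample_line | normalize_blanks
```

Piping the input through something like `normalize_blanks` first is only one option; copying Nmap's output without passing it through an editor avoids the problem altogether.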

Applying Magic Banana's script

That is the script with sub(/ .*/, "", $3), which indeed keeps the leftmost CIDR. In https://trisquel.info/forum/extracting-cidr-and-country-code-address-block#comment-159155 I replace it with sub(/.* /, "", $3) to keep the rightmost CIDR.
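The difference between the two substitutions can be seen in isolation with the sketch below. The value of `bgp` is the BGP field of the 123.205.139.5 example in this thread; `leftmost` and `rightmost` are hypothetical helper names used only for illustration.

```shell
#!/bin/sh
# BGP field as printed when several blocks match the address.
bgp='123.205.136.0/22 and 123.205.128.0/19 and 123.205.128.0/18'

# Leftmost CIDR: delete everything from the first space to the end.
leftmost() { awk '{ sub(/ .*/, ""); print }'; }

# Rightmost CIDR: the greedy .* consumes everything up to the last space.
rightmost() { awk '{ sub(/.* /, ""); print }'; }

echo "$bgp" | leftmost    # 123.205.136.0/22
echo "$bgp" | rightmost   # 123.205.128.0/18
```

When the field holds a single CIDR, there is no space to match, so both substitutions leave it unchanged.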

amenex
Offline
Joined: 01/03/2015

Not just one, but _two_ critical spaces. Strange; I was searching on [space]and[space]
with Leafpad, and it never let on to my misteak.
Man awk does a good job explaining regular expressions, but I have trouble with them
because I have not found an explanation of what are _not_ regular expressions.

I ran the comparative nmap scans (with the same or different start times) on the ten
times larger IPv4 list, IPv4s.2004-2013.b.txt, which takes about 180 minutes even with
four of them running at once, but I cannot report until the 29th because of other
business. No freeze intervened. The CPUs were kept within their temperature limits in
spite of hitting 96 Celsius for long periods. I used the scripts from
https://trisquel.info/en/forum/extracting-cidr-and-country-code-address-block#comment-159093,
but with the .e's removed from all of the file names. The input file is therefore:
IPv4s.2004-2013.b.txt The outputs are ca 2.5MB, too large for this forum.

Thank you.

Attachment    Size
IPv4s.2004-2013.b.txt 283.2 KB
GrepListnMap03.txt 127 bytes
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Man awk does a good job explaining regular expressions

I do not think so. Here is a better introduction: http://archive.flossmanuals.net/command-line/regular-expressions.html

I have not found an explanation of what are _not_ regular expressions.

A simple pattern that no regular expression can express is *any* number of nested parentheses, i.e., a pattern matched by '', '()', '(())', '((()))' and so on "up to" infinitely many nested parentheses.
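What the pattern needs is counting, which is exactly what a regular expression cannot do. As a rough sketch (the helper name `check_nesting` is hypothetical), a few lines of awk can count the opening and closing parentheses and compare the two numbers:

```shell
#!/bin/sh
# Hypothetical helper: print "nested" if the line consists of k opening
# parentheses followed by exactly k closing ones (k >= 0), else "not nested".
check_nesting() {
    awk '{
        n = length($0)
        opens = 0
        # Count the leading opening parentheses.
        while (opens < n && substr($0, opens + 1, 1) == "(") opens++
        closes = 0
        # Count the closing parentheses that follow immediately.
        while (opens + closes < n && substr($0, opens + closes + 1, 1) == ")") closes++
        # Both counts must agree and nothing else may remain on the line.
        if (opens == closes && opens + closes == n) print "nested"
        else print "not nested"
    }'
}

printf '%s\n' '' '()' '(())' '((()))' '(()' '())' | check_nesting
```

The first four lines are reported as nested, the last two as not nested. By contrast, a regular expression such as ^\(*\)*$ accepts '(()' too, because it cannot require that the two counts be equal.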

IPv4s.2004-2013.b.txt The outputs are ca 2.5MB, too large for this forum.

Plain text can be compressed a lot. I am talking about saving space on your disk. We do not need your outputs on the forum.

amenex
Offline
Joined: 01/03/2015

Turns out that nmap's --asn-query function is bedazzled by the variety of IPv4 addresses,
so the left-most CIDR block in the reported data too often is inappropriate for the actual
IPv4 address, especially in the CIDR/24 range. Therefore, the right-most (maximum range)
block is better for my purposes.

Also, diff ignores the right-hand ends of overly long file names; is there a standard
length that is always appropriate?

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

diff ignores the right-hand ends of overly long file names

No, it does not. Any good program can work with file names of any length. Well, the program may try to create an output file with a name longer than what the filesystem accepts (typically by adding a prefix or suffix to a processed file whose name length is already close to the limit). Nevertheless, essentially all the filesystems we encounter nowadays limit file names to 255 bytes (or more), a limit that is very rarely reached: https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits
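The actual limit can be queried rather than guessed. The sketch below assumes the POSIX getconf utility is installed and, as a labeled assumption, falls back to 255 if it is not; the file name is one of those used in this thread.

```shell
#!/bin/sh
# Longest file name allowed in the current directory, as reported by the
# system; the value depends on the filesystem.
# Assumption: if getconf is unavailable, fall back to the common 255 limit.
name_max=$(getconf NAME_MAX . 2>/dev/null || echo 255)
echo "Longest file name allowed here: $name_max bytes"

# Compare with the length of one of the names used in this thread.
name='Test-07202021.Origin_AS.b.03a.MB-RM.txt'
echo "Length of $name: ${#name} characters"
```

Names like this one stay far below the limit, so their length cannot explain a surprising diff result.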

amenex
Offline
Joined: 01/03/2015

Ahem; take a look:
Rightmost CIDRs (19809 rows):
sed -e 's/ *|[^:]*: /\n/g' LongASNData/nMap-Output.b.03a.txt | awk -F '\n+' -v OFS='\t' -v RS='\n*Nmap scan report for ' '$2 == "Host is up." { if (split($1, a, /[ ()]+/) == 1) $2 = "No_DNS"; else { $1 = a[2]; $2 = a[1] }; if (sub(/See the result for /, "", $3)) b[$1 OFS $2] = $3; else { NF = 5; sub(/.* /, "", $3); sub(/ .*/, "", $5); $5 = "AS" $5; print; c[$1] = $3 OFS $4 OFS $5 } } END { for (d in b) print d, c[b[d]] }' > LongASNData/Test-07202021.Origin_AS.b.03a.MB-RM.txt
Leftmost CIDRs (also 19809 rows):
sed -e 's/ *|[^:]*: /\n/g' LongASNData/nMap-Output.b.03a.txt | awk -F '\n+' -v OFS='\t' -v RS='\n*Nmap scan report for ' '$2 == "Host is up." { if (split($1, a, /[ ()]+/) == 1) $2 = "No_DNS"; else { $1 = a[2]; $2 = a[1] }; if (sub(/See the result for /, "", $3)) b[$1 OFS $2] = $3; else { NF = 5; sub(/ .*/, "", $3); sub(/ .*/, "", $5); $5 = "AS" $5; print; c[$1] = $3 OFS $4 OFS $5 } } END { for (d in b) print d, c[b[d]] }' > LongASNData/Test-07202021.Origin_AS.b.03a.MB-LM.txt
Different scripts, identical outputs:
diff -y --suppress-common-lines --width=235 <(sort -Vk 1,1 Diff.Origin_AS.b.20min-SCL.MB-RM.txt) <(sort -Vk 1,1 Diff.Origin_AS.b.20min-SCL.MB-LM.txt) |more
All lines printed, but no differences found:
diff -y --width=235 <(sort -Vk 1,1 Diff.Origin_AS.b.20min-All.MB-RM.txt) <(sort -Vk 1,1 Diff.Origin_AS.b.20min-All.MB-LM.txt) > Diff-MB.RM-vs-LM.txt
Rename the files, putting the differentiating characters at the beginning rather than at the ends:
diff -y --suppress-common-lines --width=235 <(sort -Vk 1,1 RM-Test-07202021.Origin_AS.b.03a.MB.txt) <(sort -Vk 1,1 LM-Test-07202021.Origin_AS.b.03a.MB.txt) > Diff-MB.RM-vs-LM-SCL.txt
4151 lines have differing CIDR sizes, the leftmost always being smaller blocks than the rightmost, as we should expect.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Whatever you are doing, diff is not wrong. You are, probably by misnaming files.

amenex
Offline
Joined: 01/03/2015

In the cold light of day, thinking through every step a second time, Magic Banana is right.
What exactly I did incorrectly is not yet clear, but it probably has something to do with
mixing LM=Leftmost with RM=Rightmost and typing LM when I should have typed RM, making LM
equal RM by comparing RM to RM as a result.
This time I copied the files into new folders and then renamed them to move the RM from one
end of the name to the other end, but changing the date each time so the diff -y comparison
saw only the move of the two characters each time.
Now the positions of the characters make no difference any more.
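A quick guard against this kind of mix-up is to check, before running diff -y, whether the two files really have different contents. The following is only a sketch: `same_content` is a hypothetical helper, and the throwaway files merely stand in for the RM and LM outputs.

```shell
#!/bin/sh
# Hypothetical helper: report whether two files have identical content,
# whatever their names are (cmp -s compares silently, byte by byte).
same_content() {
    if cmp -s "$1" "$2"; then
        echo "identical: $1 $2"
    else
        echo "different: $1 $2"
    fi
}

# Throwaway files standing in for the rightmost and leftmost outputs.
tmp=$(mktemp -d)
printf '123.20.116.9\tNo_DNS\t123.20.64.0/18\tVN\tAS45899\n'  > "$tmp/RM.txt"
printf '123.20.116.9\tNo_DNS\t123.20.112.0/20\tVN\tAS45899\n' > "$tmp/LM.txt"
# An RM file saved under an LM name by mistake.
cp "$tmp/RM.txt" "$tmp/LM-mistyped.txt"

same_content "$tmp/RM.txt" "$tmp/LM.txt"
same_content "$tmp/RM.txt" "$tmp/LM-mistyped.txt"
# Remove the directory "$tmp" once done.
```

If the helper reports "identical" for a pair that is supposed to hold leftmost and rightmost CIDRs, the files were mislabeled and an empty diff is to be expected.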