Removing unwanted carriage returns
The output from my nmap script for gleaning hostname, ASN, CIDR and country code from a list of IP addresses
generally looks like this:
Nmap scan report for 2a00:1298:8011:212::165
Host is up.
Host script results:
| asn-query:
| BGP: 2a00:1298::/32 | Country: SK
|_ Origin AS: 5578 - AS-BENESTRA Bratislava, Slovak Republic, SK
Nmap scan report for 2a00:1370:8110:3eea:ddea:8b70:415a:f33e
Host is up.
Host script results:
|_asn-query: See the result for 2a00:1370:8114:b2d1:45ee:f77e:facb:d2e8
Nmap scan report for 2a00:1370:8110:79d7:2821:a9b2:9315:cb0f
Host is up.
Host script results:
|_asn-query: See the result for 2a00:1370:8114:b2d1:45ee:f77e:facb:d2e8
I'm using the following grep script to separate the desired data:
grep -e "Nmap scan report for" -e "BGP:" -e "Origin AS:" -e "asn-query: See the result for" SS.IPv6-HN-GLU-MB-Domains-January2020-Uniq-nMap.txt > SS.IPv6-HN-GLU-MB-Domains-January2020-Resolve.txt
Which [nearly instantly] produces results that look like this (after stripping a few (9,000+) carriage returns with Leafpad):
Nmap scan report for 2a00:1298:8011:212::165 2a00:1298::/32 | Country: SK AS5578 - AS-BENESTRA Bratislava, Slovak Republic, SK
Nmap scan report for 2a00:1370:8110:3eea:ddea:8b70:415a:f33e
|_asn-query: See the result for 2a00:1370:8114:b2d1:45ee:f77e:facb:d2e8
Nmap scan report for 2a00:1370:8110:79d7:2821:a9b2:9315:cb0f
|_asn-query: See the result for 2a00:1370:8114:b2d1:45ee:f77e:facb:d2e8
I can remove "|_asn-query:" with sed:
sed 's/|_asn-query://g' SS.IPv6-HN-GLU-MB-Domains-January2020-ResolvePart.txt > SS.IPv6-HN-GLU-MB-Domains-January2020-ResolveStep01.txt
With the following general result:
Nmap scan report for 2a00:1298:8011:212::165 2a00:1298::/32 | Country: SK AS5578 - AS-BENESTRA Bratislava, Slovak Republic, SK
Nmap scan report for 2a00:1370:8110:3eea:ddea:8b70:415a:f33e
See the result for 2a00:1370:8114:b2d1:45ee:f77e:facb:d2e8
Nmap scan report for 2a00:1370:8110:79d7:2821:a9b2:9315:cb0f
See the result for 2a00:1370:8114:b2d1:45ee:f77e:facb:d2e8
Replacing the carriage return in the string "f33e [C.R.] See the result for" with a tab and just "See"
is proving problematic. In Leafpad it will take way too long (days ...), so I'm forced to learn some
more scripting tricks ... I need to do this without inadvertently stripping all 400,000 carriage returns.
George Langford
jaret remarked:
> I believe you are referencing to a new line character, ...
Not just _any_ new line character: A combination of the new line character
on the end of one row, plus the phrase at the beginning of the following row.
Removing the new line characters willy-nilly will leave a one-row file with
all 750,000 lines concatenated together ... I've done that inadvertently.
What I did do was to divide those 750,000 rows into twenty 50,000 row files
and then apply search & replace in Leafpad, which took a couple of minutes
for each file. It took longer to subdivide the original file by hand ...
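In hindsight, a filter that removes a newline only when the following line begins with
"See the result for" would have handled the whole file at once. A minimal AWK sketch
(the script name is a placeholder, and it assumes the "|_asn-query:" prefixes have
already been stripped, as in the ResolveStep01 file above):

#!/usr/bin/awk -f
# Join every line that begins with "See the result for" onto the end of
# the previous line, separated by a tab; every other newline is kept.
{
    if ($1 == "See" && $2 == "the" && $3 == "result" && $4 == "for")
        printf "\t%s", $0            # continuation line: append with a tab
    else {
        if (NR > 1) print ""         # terminate the previous output line
        printf "%s", $0              # start a new output line
    }
}
END { if (NR > 0) print "" }         # terminate the last output line

Saved as, say, join-see.awk and made executable, it would be run as
./join-see.awk SS.IPv6-HN-GLU-MB-Domains-January2020-ResolveStep01.txt > joined.txt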
George Langford
It looks like you could have nmap format its output in a way that would be easier to process (multiline patterns are a pain): https://nmap.org/book/output-formats-grepable-output.html
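For example (a sketch only: the scan options, target list and file names below are placeholders standing in for whatever you ran originally), the output options write machine-readable copies of the same scan alongside the normal output, and the XML copy is generally the most complete:

# -oG <file>  one-line-per-host "grepable" format
# -oX <file>  XML
# -oA <base>  normal + grepable + XML copies sharing a common basename
nmap -6 --script asn-query -iL targets.txt -oA asn-scan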
Magic Banana constructively added:
> It looks like you could have nmap format its output ...
Oh ! Gee ! That's a welcome suggestion. I have two more sets of IPv6 data
already nmap'ed over quite a few hours that are in the old grep-unfriendly
format. Fortunately, my brute-force workarounds are less time-consuming
than the original nmap scans, from which there is no escape.
Unfortunately, nmap's avoidance of excessive repetition runs afoul of my use of the
simple-to-use LibreOffice Calc: I'm faced with multiple days of filling in the empty
cells between the infrequent asn-query results, which nmap limits to one lookup per
CIDR block.
Another roadblock is Google's aversion to robots, so my search for "other"
IP addresses of multi-addressed PTR's is necessarily a manual task, what
with the scores of CIDR blocks filled with identically named PTR's.
Try chasing down hn.kd.ny.adsl, motorspb.fvds.ru or hosted-by.leaseweb.com.
George Langford
Here's my present dilemma, exemplified by a snippet from the spreadsheet:
2401:4900:1888:c07f:1:2:4283:5767 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:cd70:1:1:4a58:fc0c 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:d068:fce8:8739:a7a0:4c60 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:e8f5:1:2:4cde:e7ca 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:ee55:23c5:e0ec:79fb:59dd 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:fcb4:1:2:4282:aab3 2401:4900:1888::/48 IN AS45609
2401:4900:1889:9396:5693:8b98:3a70:da67 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:a2d9:382e:b73:73dd:8693 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:aa8c:730c:fa94:8c27:7bf9 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:aad7:1:1:7b54:1e4c 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:c648:2161:968a:1c9e:b1c1 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:c7c0:f461:a726:a208:3ccb 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:cd44:e950:74db:8fd2:c134 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:ec73:92c5:3a0a:76d5:10c0 2401:4900:1889::/48 IN AS45609
The positions (i.e. $2) ending in ...:aab3 have to be replaced with
2401:4900:1888::/48 IN AS45609
and the positions ending in ...:10c0 ($2) have to be replaced with
2401:4900:1889::/48 IN AS45609 (i.e., $2,$3,$4)
Those key rows, returned by nmap but not repeated by nmap, could have been anywhere in
the preceding rows. Of course nmap should not have to repeat the look-ups, but merely
restating the cached results would be helpful. It is open source ...
The entire text file has 387,000 rows, so even an inefficient script would be plenty
fast enough. I can fill in about five thousand rows an hour ... leading to a time
estimate of 387,000/5,000*1 hour = 77 hours ... not impossible while I'm housebound.
It may look silly when the spreadsheet is sorted by IPv6 address, but it's all very
necessary when it's sorted by the number of domains visited and/or the number of visits
per domain.
George Langford
The problem is not clear. At least to somebody like me who knows little about IP addresses. I assumed:
- the "keys" are in lines with four fields, in the second field, before the first (and only?) "::";
- the lines to rewrite have two fields, the key being before one of the ":" in the second field (I consider prefixes of growing size).
Here is a solution with AWK:
#!/usr/bin/awk -f
BEGIN {
    # First pass over the input: for every four-field line, remember the
    # prefix of $2 before "::" and the data to substitute for it.
    while ((getline < ARGV[1]) > 0)
        if (NF == 4) {
            split($2, a, "::")
            m[a[1]] = $2 " " $3 " " $4
        }
}
# Second pass: for every two-field line, grow a prefix of $2 one
# ":"-separated group at a time until it matches a remembered key.
NF == 2 {
    n = split($2, a, ":")
    k = a[1]
    for (i = 1; !(k in m) && ++i <= n; )
        k = k ":" a[i]
    print $1, m[k]
}
On the rows you gave, the output is:
2401:4900:1888:c07f:1:2:4283:5767 2401:4900:1888::/48 IN AS45609
2401:4900:1888:cd70:1:1:4a58:fc0c 2401:4900:1888::/48 IN AS45609
2401:4900:1888:d068:fce8:8739:a7a0:4c60 2401:4900:1888::/48 IN AS45609
2401:4900:1888:e8f5:1:2:4cde:e7ca 2401:4900:1888::/48 IN AS45609
2401:4900:1888:ee55:23c5:e0ec:79fb:59dd 2401:4900:1888::/48 IN AS45609
2401:4900:1889:9396:5693:8b98:3a70:da67 2401:4900:1889::/48 IN AS45609
2401:4900:1889:a2d9:382e:b73:73dd:8693 2401:4900:1889::/48 IN AS45609
2401:4900:1889:aa8c:730c:fa94:8c27:7bf9 2401:4900:1889::/48 IN AS45609
2401:4900:1889:aad7:1:1:7b54:1e4c 2401:4900:1889::/48 IN AS45609
2401:4900:1889:c648:2161:968a:1c9e:b1c1 2401:4900:1889::/48 IN AS45609
2401:4900:1889:c7c0:f461:a726:a208:3ccb 2401:4900:1889::/48 IN AS45609
2401:4900:1889:cd44:e950:74db:8fd2:c134 2401:4900:1889::/48 IN AS45609
> Those key rows, returned by nmap but not repeated by nmap, could have been anywhere in the preceding rows.
In your example, they are in *subsequent* lines. If they are actually in preceding lines, then it is easy to read the input only once (the program above reads it twice):
#!/usr/bin/awk -f
# Remember the substitution data carried by every four-field line...
NF == 4 {
    split($2, a, "::")
    m[a[1]] = $2 " " $3 " " $4
}
# ...and rewrite every two-field line whose $2 starts with a remembered key.
NF == 2 {
    n = split($2, a, ":")
    k = a[1]
    for (i = 1; !(k in m) && ++i <= n; )
        k = k ":" a[i]
    print $1, m[k]
}
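Either program can be saved to a file and run directly on the data, for instance (the script and file names here are only placeholders):

# fill-keys.awk holds one of the two programs above; data.txt holds the
# two- and four-field rows shown in the spreadsheet snippet.
awk -f fill-keys.awk data.txt > data-filled.txt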
> I can fill in about five thousand rows an hour ... leading to a time estimate of 387,000/5,000*1 hour = 77 hours ... not impossible while I'm housebound.
It makes no sense to spend so much time doing a repetitive task. Not to mention that you will err and the result will not be easily automatically processable anymore. 77 hours is more than enough to read about Nmap (and discover that it offers output formats that are easier to process automatically) *and* master AWK, or to learn the basics of a generic programming language and of a library to parse XML (one of the formatting options).
I'll restate the problem, unencumbered by distracting arrays of colons and hexadecimals.
All 387,000 rows fall into one of three types, each IP address appearing only once in the first column:
Type A: $1("key" IP address), $2(CIDR block), $3(country code), $4(AS number)
Type B: $1(IP address falling within the $2CIDR block of Type A), $2(Type A's "key" IP address, repeated many times in successive rows)
Type C: $1(hostname), $2(IP address from which $1hostname can be resolved), $3(CIDR block), $4(country code), $5(AS number)
(Type C is not very populous and can be handled with Leafpad)
The desired script:
awk should locate each Type A $1 Key, find all the Type B rows whose $2 Key matches it, and then
copy Type A's columns $2, $3 & $4 in place of Type B's column $2 in every matching row.
I have found a small number of Type A rows with no data, but those I can look up with whois and fix easily.
The already looked-up hostnames are the only non-IP data in the $1 columns of Types A & B, so awk can safely
concentrate on those $1 columns.
Also, the IP addresses of already-looked-up hostnames will not reappear as not-yet-looked-up IP addresses.
If awk can do everything described above with the first Type A $1Key before proceeding, even if that
involves searching the entire 370,000 rows once for each Type A $1Key, then we're on the right track.
George Langford
It is much simpler than what I thought... but still unclear: what to do with type-A/C lines and with type-B lines whose second field never appears in a type-A line? Notice in particular that, keeping type-A lines along with modified type-B lines, it becomes impossible to distinguish them (I believe).
So, I believe you actually want to first separate the three types, e.g., with this trivial AWK program:
#!/usr/bin/awk -f
NF == 4 { print > "type-A" }
NF == 2 { print > "type-B" }
NF == 5 { print > "type-C" }
Then, you can give "type-A" and "type-B" (in this order) as arguments of:
#!/usr/bin/awk -f
FILENAME == ARGV[1] { m[$1] = $2 " " $3 " " $4 }
FILENAME == ARGV[2] { if ($2 in m) $2 = m[$2]; print }
That two-line program outputs as many lines as there are in the second input file ("type-B"). Those that were modified have two additional fields. As a consequence, they are easily identifiable (in an AWK program: testing NF == 4).
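A possible end-to-end run, assuming the two programs above are saved as split-types.awk and fill-type-B.awk (placeholder names), would be:

awk -f split-types.awk File01.txt                  # writes type-A, type-B and type-C
awk -f fill-type-B.awk type-A type-B > type-B-filled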
Notice the simplicity of those programs. Again, you do not need to study AWK for 77 hours to be able to write such programs (8 hours should be enough, I believe). And they do not make typos.
These "trivial" AWK programs are presently beyond my ken. Way too compact for me at this hour.
In the meantime I started with this script:
awk '{print $2}' 'File01.txt' | sort | uniq -c > TempTQ01.txt ;
awk '{print $2, $1}' 'TempTQ01.txt' | sort -nrk 2 > TempTQ02.txt
where File01.txt is the original 370,000-row file, albeit truncated to exclude the 38,000 rows
that I've already filled in and also all the five-column resolved-hostname rows, leaving only
Type A and Type B IPv6 address data in 347,000 rows. The Type B key IPv6 addresses appear with
their occurrence counts; the Type A CIDR blocks, distinguishable by their '/' characters, appear
with theirs. We need the occurrence counts only as a reality check.
What remains is to print the $1 column of every IPv6 row that matches the first IPv6 key in
the TempTQ02 list, plus the $2, $3 and $4 columns of the corresponding key row, to make rows
of filled-in data, then move on to the next IPv6 key in the TempTQ02.txt file. The largest
number of occurrences (55,000) is one contiguous group of 55,000 rows, one of which contains
the IPv6 key address and its three columns of asn-query data. The occurrence counts (Column $2
of TempTQ02.txt) are also needed only as a reality check.
I also meddled with the asn-query source code (https://svn.nmap.org/nmap/scripts/asn-query.nse)
and learned how to store & retrieve it as a program file which returns the same data for those
eight IPv6 addresses given above, plus the asn-query data. Alas, further meddling (beyond just
putting something else between the quotes of "See the result for %s") has been unproductive.
George Langford
These "trivial" AWK programs are presently beyond my ken.
Study AWK for a few hours and they will not be. Believe me: you need nowhere near 77 hours to be able to write such small programs, which will save you those 77 hours of manual repetitive work... and much more in the future. In fact, reading Sections 1, 4 up to 4.5.1, 5.1, 5.6, 7.3, 7.4.1, 7.5.2 and 8.1.1 to 8.1.3 of https://www.gnu.org/software/gawk/manual/ is enough to understand the two programs in all their details. More generally, reading Part I of that manual (minus the sections that describe "advanced features" and are announced as such) and practicing (testing modifications of the examples and so on) for a few hours is all you need to get an excellent level in AWK.
Remember the Delta Process from Calculus 1.01 ?
https://www.brighthubeducation.com/homework-math-help/108376-finding-the-derivative-of-a-function-in-calculus/
That's where I am in Scripting 1.01 ...
Back to the problem at hand.
Step (1) selected the IPv6 addresses of the Type A & Type B rows in the cleansed File01.txt:
awk '{print $2}' 'File01.txt' | sort | uniq -c > TempTQ01.txt ;
awk '{print $2, $1}' 'TempTQ01.txt' | sort -nrk 2 > TempTQ02.txt
Step (2) selects and lists all the Type B entries in File01.txt (SS.IPv6-HN-GLU-MB-Domains-January2020-All.ods.txt):
awk '{print $1}' 'TempTQ02.txt' | sort > TempTQ10.txt ;
awk '{print $1}' 'TempTQ10.txt' - | grep - File01.txt | awk '{print $1,$2,$3,$4}' '-' > TempTQ11.txt
Never mind simplicity or efficiency; it took 0.006 second CPU time and 0.032 second real time.
It did reveal a number of Type C rows that I had missed in my visual inspection ==> TempTQ13.txt
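In hindsight, if Step (2) only needs to pull out the four-column key rows, a simpler equivalent
(assuming those key rows are exactly the rows with four fields) would have been:

awk 'NF == 4' File01.txt | sort > TempTQ11.txt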
Next step: For each row in TempTQ11.txt, print $2,$3,$4 to cache, find $1 in File10.txt's $2 column,
and print that $2 to Column $1 along with cache's $2,$3,$4 into a new file ...
Step (3) matches the Keys in Col.$2 of the Type B rows with the data in Col's $2, $3 & $4 of the Type A (key) rows:
join -a 2 -1 1 -2 2 <(sort -k 1,1 TempTQ11.txt) <(sort -k 2 SS.IPv6-HN-GLU-MB-Domains-January2020-All.ods.txt) > NewFile.txt
Seems to work; took 0.132 second CPU time, 0.765 second real time; a little messy what with hostnames, etc, 343,000 rows OK.
That's a 200,000:1 improvement over my 77 hour estimate, not counting an hour to perfect it and clean up the output.
Downside: The key rows of the original File10.txt are left out, but easily added back with cat and sort:
Step (4) Combine the key rows in TempTQ11.txt with the cleaned-up, filled-in rows in NewFile-Expunged.txt:
cat TempTQ11.txt NewFile-Expunged.txt | sort -k 1,1 > FinalFile-Complete.txt
Took 0.077 second CPU time, 1.013 seconds real time ==> 347,000 rows of No_DNS IPv6's with CIDR, ASN & CC.
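As a further reality check on the combined file, the distribution of fields per row can be tallied (rows that never matched a key row will show fewer fields than the filled-in ones):

awk '{ print NF }' FinalFile-Complete.txt | sort -n | uniq -c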
Step (5) I attempted to join the present spreadsheet with the domains-visited and visits-per-domain data:
join -a 2 -1 1 -2 1 <(sort -k 1 VpDCts-DVCts.txt) <(sort -k 1 FinalFile-Complete.txt) > SS.IPv6-NLU-Visitor-ASN-Data.txt
But the results look incomplete: only 13,000 rows of fully filled-in data with correct & complete counts,
yet there are 330,000 rows of the uncombined data ... adding up to 343,000 rows. Needs some work ...
George Langford
> awk '{print $2}' 'File01.txt' | sort | uniq -c > TempTQ01.txt ;
Are we still talking about the same task? I gave you a simple solution already. How is counting how many times each value occurs in the second field useful? That is what the above command does.
> join -a 2 -1 1 -2 1 <(sort -k 1 VpDCts-DVCts.txt) <(sort -k 1 FinalFile-Complete.txt)
Again: 'sort -k 1' is the same as 'sort': certainly not what you want. Also, 'join -1 1 -2 1' is the same as 'join'.
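In other words, 'sort -k 1' keys on everything from field 1 to the end of the line, i.e. it sorts whole lines. To order both inputs on the join field only, bound the key; a sketch of a corrected Step (5), assuming the first field is the IPv6 address in both files:

join -a 2 <(sort -k 1,1 VpDCts-DVCts.txt) <(sort -k 1,1 FinalFile-Complete.txt) > SS.IPv6-NLU-Visitor-ASN-Data.txt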
See: https://svn.nmap.org/nmap/scripts/asn-query.nse
where the applicable (?) script reads, noting especially "( "See the result for %s" ):format( last_ip )":
------------------------------------------------------
... begin snip
---
-- Checks whether the target IP address is within any BGP prefixes for which a query has
-- already been performed and returns a pointer to the HOST SCRIPT RESULT displaying the applicable answers.
-- @param ip String representing the target IP address.
-- @return Boolean true if there are cached answers for the supplied target, otherwise
-- false.
-- @return Table containing a string for each answer or nil if there are none.
function check_cache( ip )
  local ret = {}

  -- collect any applicable answers
  for _, cache_entry in ipairs( nmap.registry.asn.cache ) do
    if ipOps.ip_in_range( ip, cache_entry.cache_bgp ) then
      ret[#ret+1] = cache_entry
    end
  end
  if #ret < 1 then return false, nil end

  -- /0 signals that we want to kill this thread (all threads in fact)
  if #ret == 1 and type( ret[1].cache_bgp ) == "string" and ret[1].cache_bgp:match( "/0" ) then return true, nil end

  -- should return pointer unless there are more than one unique pointer
  local dirty, last_ip = false
  for _, entry in ipairs( ret ) do
    if last_ip and last_ip ~= entry.pointer then
      dirty = true; break
    end
    last_ip = entry.pointer
  end

  if not dirty then
    return true, ( "See the result for %s" ):format( last_ip )
  else
    return true, ret
  end

  return false, nil
end
... end snip
------------------------------------------------------
Where we should _print_ the result for %s instead of just pointing to it ...
George Langford
Previously, I had attempted a join script:
> Step (5) I attempted to join the present spreadsheet with the domains-visited and visits-per-domain data:
> join -a 2 -1 1 -2 1 <(sort -k 1 VpDCts-DVCts.txt) <(sort -k 1 FinalFile-Complete.txt) > SS.IPv6-NLU-Visitor-ASN-Data.txt
> But the results look incomplete: only 13,000 rows of fully filled-in data with correct & complete counts,
> yet there are 330,000 rows of the uncombined data ... adding up to 343,000 rows. Needs some work ...
You bet it needs some work ... I had made a couple of irreparable errors, so I restarted the construction
of the useless spreadsheet, which is now ready to be filled in per a previous posting. More about this later.
George Langford