Separating IP addresses in a mixed list of hostnames and addresses

amenex
Joined: 01/03/2015

Staring at a supply of a couple hundred files with mixed IP addresses and hostnames,
I'd like to separate the pure addresses (never gratuitously converted to PTR's by the
naive servers at Internet Service Providers) from the converted hostnames in the two
hundred files, some of them thousands of lines in length.

IPv6 addresses are easy, as they are the only strings containing colons.
Hostnames usually contain letters of the alphabet, so it should be easy
to invert the grep selection process to collect names that contain no
letters, but I haven't been able to find a grep syntax encompassing all the
uppercase and lowercase letters ...

However, this link helps:
https://www.shellhacks.com/regex-find-ip-addresses-file-grep/

Applied in the present context:
grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}" Redacted/MonthlyCVs/Redacted-2014-12.txt > Redacted/IPv4s/Redacted-2014-12.txt

However, when I check the output against the original lists, there seem to be less
than half as many IPv4's in the original list as in the list the script finds,
meaning that the script is extracting IPv4's from the hostnames, which is not what
I intend. Confirming that with grep:
grep -f Redacted/IPv4s/Redacted-2014-12.txt Redacted/No-IPv6s/Redacted-2014-12.txt > Temp-03312021-C01.txt
which produces a list of the same number of items as the first grep script, containing mixed IPv4's and hostnames.
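
One way to see why the -o extraction over-counts, and how a whole-line match avoids it, is to run both on a few lines. This is only a sketch: the three sample lines are made up (documentation-range addresses and example.org names), and the -x option (match the entire line) is a swapped-in variation, not what the linked article uses.

```shell
# Three made-up entries: one bare address and two hostnames.
sample=$(printf '%s\n' \
  '198.51.100.23' \
  'static-198-51-100-23.example.org' \
  'ip198.51.100.24.example.org')

# -o prints every matching fragment, including one embedded in a hostname.
extracted=$(printf '%s\n' "$sample" | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}')

# -x keeps only lines that consist of nothing but the dotted quad.
whole=$(printf '%s\n' "$sample" | grep -Ex '([0-9]{1,3}\.){3}[0-9]{1,3}')

printf 'with -o:\n%s\n' "$extracted"
printf 'with -x:\n%s\n' "$whole"
```

With -o, the address buried in the third line is reported as well; with -x, only the first line survives. Note that this pattern still accepts out-of-range octets such as 999.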

This brings me back to my original task: List the elements of the file (with no
IPv6's) that have no letters in them.
An inverse grep such as this:
grep -v [a A ...z Z] Mixed_List > IPv4-only_List
In all my notes I'm not finding that syntax.

George Langford

Magic Banana

I am a member!

I am a translator!

Joined: 07/24/2010

As always, your problem is not clearly specified: please show us an excerpt of the input and the corresponding expected output.

Assuming the strings to analyze are separated by horizontal and/or vertical white spaces, this may be what you want ("[file] ..." is your input files):
$ awk -F . 'BEGIN { RS = "[[:space:]]+" } NF == 4 { while (++i != 5 && $i >= 0 && $i < 256); } i == 5 { i = 0; print }' [file] ...

Magic Banana


Given what you tried, you apparently have every IPv4 address on a separate line. The BEGIN block could then be removed. More importantly, I forgot to reset i for lines with four dot-delimited numbers that are not four numbers between 0 and 255:
$ awk -F . 'NF == 4 { while (++i != 5 && $i >= 0 && $i < 256); if (i == 5) print; i = 0 }' [file] ...
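
To make the behaviour of that program concrete, here it is applied to five made-up lines (the sample data is an assumption, not taken from the poster's files). It should keep only the lines made of four dot-separated numbers between 0 and 255.

```shell
# Made-up input: two valid addresses, one out-of-range quad,
# one hostname, and one line with only three fields.
sample=$(printf '%s\n' \
  '192.0.2.1' \
  '256.1.1.1' \
  'www.example.net' \
  '10.0.0' \
  '203.0.113.77')

# Same program as above: walk the four fields, stop at the first one
# outside 0..255, and print only if all four passed (i reached 5).
ipv4_lines=$(printf '%s\n' "$sample" |
  awk -F . 'NF == 4 { while (++i != 5 && $i >= 0 && $i < 256); if (i == 5) print; i = 0 }')

printf '%s\n' "$ipv4_lines"
```

One possible tightening, offered as a suggestion rather than as part of the original program: awk compares fields numerically whenever they look like numbers, so a field such as 1e1 may pass the range test. Adding a test like $i ~ /^[0-9]+$/ inside the loop condition would restrict the fields to plain decimal digits.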

amenex

Setting aside Magic Banana's comments for the moment, I managed to get past this impasse:

This brings me back to my original task: List the elements of the file (with no
IPv6's) that have no letters in them.
An inverse grep such as this:
grep -v [a A ...z Z] Mixed_List > IPv4-only_List

In all my notes I'm not finding that syntax.

But somehow I stumbled upon it:
grep -v "[a-z, A-Z]" Mixed_List > IPv4-only_List
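
A note on that bracket expression, as a small sketch on made-up lines: inside [ ], the comma and the space are ordinary members of the set, so "[a-z, A-Z]" also rejects any line containing a comma or a space. A character class such as [[:alpha:]] names the letters only.

```shell
tmp=$(mktemp)
# Made-up lines: two addresses, one hostname, one letter-free line with a comma.
printf '%s\n' '192.0.2.7' 'host-192-0-2-7.example.net' '1,2' '203.0.113.9' > "$tmp"

# Drops lines containing a letter, a comma, or a space.
with_comma=$(grep -v '[a-z, A-Z]' "$tmp")

# Drops only lines containing a letter.
alpha_only=$(grep -v '[[:alpha:]]' "$tmp")

printf 'letters, comma, space excluded:\n%s\n' "$with_comma"
printf 'letters excluded:\n%s\n' "$alpha_only"
rm -f "$tmp"
```

Either form only tests for the absence of letters; neither checks that what remains is a well-formed IPv4 address.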

In a day or two I'll elucidate my task with a series of scripts which have been
effective, if not always very efficient (such as grep). They bring to mind my
chemistry classes from sixty-five years ago ...

amenex

Here are the promised steps ...
1. Collect the Webalizer data for the domain under scrutiny, and aggregate the first
and last columns (Hits and Hosts): Raw-PTRs/VisitorList.txt, which brings the list
below the forum's 2.0 MB limit.
Note that I've distinguished the varying properties of the VisitorList by placing it
in an appropriately named subdirectory, such as Raw-PTRs (above) and in the following
scripts.

2. Extract any IPv6 addresses:
grep ":" Raw-PTRs/VisitorList.txt
None were found, though some PTR's may have IPv6 addresses still to be resolved.

3. Remove the IPv6's:
grep -v ":" Raw-PTRs/VisitorList.txt > No-IPv6s/VisitorList.txt

4. Extract the IPv4's from the modified VisitorList:
grep -v "[a-z,A-Z]" No-IPv6s/VisitorList.txt > IPv4s/VisitorList.txt
6181 were found.

5. Remove those IPv4's from the modified VisitorList:
grep -v -f IPv4s/VisitorList.txt No-IPv6s/VisitorList.txt > No-IPv6s.or.4s/VisitorList.txt
20752 were found after a moderately long wait.

6. Use nmap to scan the No-IPv6s.or.4s/VisitorList.txt listed PTR's to resolve their IPv4 addresses:
sudo nmap -Pn -sn -T2 --max-retries 8 -iL No-IPv6s.or.4s/VisitorList.txt -oG - | grep "Host:" '-' |
awk '{print $3,$2}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Resolved.PTRs/VisitorList.A.txt

Took over an hour to produce 20444 rows of data.

7. Remove the "No_DNS" rows from the preceding nmap scan data:
grep -v "No_DNS" Resolved.PTRs/VisitorList.A.txt | sort -u | awk '{print $2,$1}' '-' > Resolved.PTRs/VisitorList.B.txt
14764 rows of resolved PTR's remain.

8. Resolve the IPv4's in the IPv4s/VisitorList.txt data:
sudo nmap -Pn -sn -T2 --max-retries 8 -iL IPv4s/VisitorList.txt -oG - | grep "Host:" '-' |
awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Resolved.PTRs/VisitorList.C.txt

9. Remove the "No_DNS" rows from the preceding nMap scan data:
grep -v "No_DNS" Resolved.PTRs/VisitorList.C.txt | sort -u | awk '{print $1,$2}' '-' > Resolved.PTRs/VisitorList.D.txt
1012 rows of resolved PTR's.

10. Aggregate the resolved PTR's:
cat Resolved.PTRs/VisitorList.B.txt Resolved.PTRs/VisitorList.D.txt > Resolved.PTRs/VisitorList.E.txt
5776 rows of resolved PTR's from nmap scans of PTR's & IPv4's.

11. Reconcile the resolved PTR's to the Resolved.PTRs/VisitorList.A.txt in Step(6):
awk '{print $2}' Resolved.PTRs/VisitorList.E.txt | grep -vf '-' <(cat Resolved.PTRs/VisitorList.A.txt Resolved.PTRs/VisitorList.C.txt) > No-IPv6s.or.4s/VisitorList.F.txt
5433 rows of as-yet-unresolved PTR's.

12. Extract four octets of IPv4 data from each PTR that has incorporated an address in its name:
grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' No-IPv6s.or.4s/VisitorList.F.txt > IPv4s-to-24/VisitorList.G.txt
5435 rows of IPv4's to be treated to the twenty-four 4321 permutations (Permutations.1234.txt) scripts.

13. Attempt resolution of all twenty-four permutations of the four octets extracted from the PTR's in VisitorList.G.txt:
awk '{print $0}' IPv4s-to-24/VisitorList.G.txt | sed 's/\./\t/g' | awk '{print $1"."$2"."$3"."$4}' '-' | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' oG - | grep "Nmap scan report for " '-' | awk '{print $6,$5}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' >> IPv4s-to-24/VisitorList.G.1234-oGnMap.txt ;
awk '{print $0}' IPv4s-to-24/VisitorList.G.txt | sed 's/\./\t/g' | awk '{print $1"."$2"."$4"."$3}' '-' | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' oG - | grep "Nmap scan report for " '-' | awk '{print $6,$5}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' >> IPv4s-to-24/VisitorList.G.1243-oGnMap.txt ;
...
awk '{print $0}' IPv4s-to-24/VisitorList.G.txt | sed 's/\./\t/g' | awk '{print $4"."$3"."$2"."$1}' '-' | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' oG - | grep "Nmap scan report for " '-' | awk '{print $6,$5}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' >> IPv4s-to-24/VisitorList.G.4321-oGnMap.txt ;

14. Eliminate the unresolved IPv4's from the preceding nMap scans and join with the original Current-Visitor data:
cat IPv4s-to-24/*-oGnMap.txt | grep "[a-z,A-Z]" '-' | join -1 2 -2 3 <(sort -k 2,2 '-') <(sort -k 3,3 CV-InputData/VisitorList.txt) | awk '{print $2,$1}' '-' | sort -u > Resolved.PTRs/VisitorList.H.txt
799 rows of newly resolved PTR's.

15. Aggregate the resolved PTR's and find the remaining unresolved PTR's that have to be found elsewhere:
cat Resolved.PTRs/VisitorList.H.txt Resolved.PTRs/VisitorList.E.txt | awk '{print $2}' '-' | sort -u | grep -vf '-' No-IPv6s.or.4s/VisitorList.txt > PTRs-to-GG/VisitorList.I.txt
2642 rows of as-yet-unresolved PTR's to be found in lists of malicious hosts aggregated online.

16. Script to join the just-obtained results to the original input data; looks like these:
join -a 2 -1 2 -2 9 <(sort -k 2,2 CV-InputData/VisitorList.2014-12.txt) <(sort -k 9,9 VisitorList-2014-12.txt) | awk '{print $2,$1,$3,$4}' '-' | sort -nrk 3 > Analysis/VisitorList-2014-12.txt ;
join -a 2 -1 2 -2 9 <(sort -k 2,2 CV-InputData/VisitorList.2015-01.txt) <(sort -k 9,9 VisitorList-2015-01.txt) | awk '{print $2,$1,$3,$4}' '-' | sort -nrk 3 > Analysis/VisitorList-2015-01.txt ;
...
join -a 2 -1 2 -2 9 <(sort -k 2,2 CV-InputData/VisitorList.2021-03.txt) <(sort -k 9,9 VisitorList-2021-03.txt) | awk '{print $2,$1,$3,$4}' '-' | sort -nrk 3 > Analysis/VisitorList-2021-03.txt ;

Gives four unmatched columns below the last month column; see the next step.

17. Join the nmap'ed data from the tail end of the analysis data to the analysis data:
grep -v "2014-12" Analysis/VisitorList-2014-12.txt | awk '{print $2}' '-' | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' | join -a 1 -1 2 -2 2 <(sort -k 2,2 '-') <(sort -k 2,2 Analysis/VisitorList-2014-12.txt) | awk '{print $2,$1}' '-' | sort -u | grep -v "No_DNS" '-' > Analysis/Join-VisitorList-2014-12.txt ;
grep -v "2015-01" Analysis/VisitorList-2015-01.txt | awk '{print $2}' '-' | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' | join -a 1 -1 2 -2 2 <(sort -k 2,2 '-') <(sort -k 2,2 Analysis/VisitorList-2015-01.txt) | awk '{print $2,$1}' '-' | sort -u | grep -v "No_DNS" '-' > Analysis/Join-VisitorList-2015-01.txt ;
...
grep -v "2021-03" Analysis/VisitorList-2021-03.txt | awk '{print $2}' '-' | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' | join -a 1 -1 2 -2 2 <(sort -k 2,2 '-') <(sort -k 2,2 Analysis/VisitorList-2021-03.txt) | awk '{print $2,$1}' '-' | sort -u | grep -v "No_DNS" '-' > Analysis/Join-VisitorList-2021-03.txt ;

18. The following seven-line script supplements the listed resolved-PTR data for one month's VisitorList.txt data:
grep -v "2014-12" Analysis/Join-VisitorList-2014-12.txt | awk '{print $1}' '-' | grep -f '-' VisitorList-2014-12.txt | awk '{print $9,$1}' '-' | join -a 1 -1 1 -2 1 <(sort -k 1,1 '-') <(sort -k 1,1 Analysis/Join-VisitorList-2014-12.txt | sed 's/2014-12//g') | awk '{print $1,$2}' '-' > Temp-04082021-C03.txt ;
grep -v "2014-12" Analysis/Join-VisitorList-2014-12.txt | awk '{print $2}' '-' | grep -if '-' VisitorList-2014-12.txt | awk '{print $9,$1}' '-' | join -a 1 -1 1 -2 2 <(sort -k 1,1 '-') <(sort -k 2,2 Analysis/Join-VisitorList-2014-12.txt | sed 's/2014-12//g') | awk '{print $1,$2}' '-' > Temp-04082021-D03.txt ;
cat Temp-04082021-C03.txt Temp-04082021-D03.txt | sort -u > VisitorList-2014-12.C03-and-D03.txt ;
awk '{print $1}' VisitorList-2014-12.C03-and-D03.txt | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Temp-2014-12.txt ;
awk '{print $0}' Temp-2014-12.txt > List-04082021-A01.txt ;
awk '{print $2" 2014-12"}' VisitorList-2014-12.C03-and-D03.txt > List-04082021-A02.txt ;
paste -d ' ' List-04082021-A01.txt List-04082021-A02.txt > Analysis/Add.VisitorList-2014-12.txt ;

Needs to be run seventy-five more times to cover the entire date range.
[skipping many confused steps akin to the making of sausage]
19. List the Analysis/VisitorList-2014-12.txt and Analysis/Add.VisitorList-2014-12.txt files:
cat Analysis/VisitorList-2014-12.txt Analysis/Add.VisitorList-2014-12.txt | sort -u | grep "2014-12" '-' | sort -nrk 3,3 > Analysis/Sum-VisitorList-2014-12.txt ;
cat Analysis/VisitorList-2015-01.txt Analysis/Add.VisitorList-2015-01.txt | sort -u | grep "2015-01" '-' | sort -nrk 3,3 > Analysis/Sum-VisitorList-2015-01.txt ;
...
cat Analysis/VisitorList-2021-03.txt Analysis/Add.VisitorList-2021-03.txt | sort -u | grep "2021-03" '-' | sort -nrk 3,3 > Analysis/Sum-VisitorList-2021-03.txt ;

The Analysis/Sum-VisitorList-2014-12.txt, Analysis/Sum-VisitorList-2015-01.txt, through
Analysis/Sum-VisitorList-2021-03.txt are to be passed on to the next step.
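
Since the same command lines recur for every month from 2014-12 through 2021-03, one option is to generate the month labels in a loop instead of copying each line seventy-six times. A minimal sketch, assuming only the YYYY-MM naming used above; the echo is a stand-in for whichever per-month pipeline is wanted.

```shell
# Build the list of month labels 2014-12 ... 2021-03.
months=""
y=2014
m=12
while [ "$y" -lt 2021 ] || { [ "$y" -eq 2021 ] && [ "$m" -le 3 ]; }; do
  months="$months $(printf '%04d-%02d' "$y" "$m")"
  m=$((m + 1))
  if [ "$m" -gt 12 ]; then
    m=1
    y=$((y + 1))
  fi
done

# Placeholder body: substitute the real per-month command here.
for ym in $months; do
  echo "would process Analysis/VisitorList-$ym.txt"
done
```

The loop variable $ym then takes the place of the literal 2014-12, 2015-01, and so on in each command line.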

George Langford

Attachment (name, size):
VisitorList.txt 788.11 KB
Magic Banana


20752 were found after a moderately long wait.

It takes time because, for every line of No-IPv6s/VisitorList.txt, you are checking whether each and every regular expression in IPv4s/VisitorList.txt matches it. They are regular expressions and not fixed strings because of the dots in the IPv4 addresses, which are interpreted as "any single character". As a consequence, worse than the unnecessarily long processing time, unwanted lines of No-IPv6s/VisitorList.txt may be returned (. matching another character than a dot). I have explained that to you many times.

I do not understand why you do not use the AWK program I gave you in this thread. It outputs the lines in the input file(s) that are composed of four dot-separated numbers between 0 and 255. That is safer than just checking that the line does not contain letters or commas. It can be easily complemented to get both the IPv4 addresses and the rest in two separate files:

$ awk -F . '{ while (++i != 5 && $i >= 0 && $i < 256); if (NF == 4 && i == 5) print >> "IPv4"; else print >> "Not-IPv4"; i = 0 }' No-IPv6s/VisitorList.txt

amenex

After running the first five steps explicitly to confirm that my interpretation of my own
40,000+ lines of notes on this subject were correct ...

Not to quibble with Magic Banana, I made two modifications to accommodate my directory structure:
awk -F . '{ while (++i != 5 && $i >= 0 && $i < 256); if (NF == 4 && i == 5) print >> "IPv4s/MB.IPv4"; else print >> "No-IPv6s.or.4s/MB.Not-IPv4"; i = 0 }' No-IPv6s/VisitorList.txt
which accomplished its task in about the same blink that it took me to realize that my modifications
correctly accorded to the script's intent.

Naturally, in line with Magic Banana's admonishments about accuracy, I discovered on further
analysis that my Step(4) script found one more field than Magic Banana's combination script did:
2015-10-10-12-46-47.113745, which is not an IPv4 address.

However, my Step(5) script has disastrous results, in that it misses 172 PTR's. Magic Banana's
script makes a binary choice; a field is either IPv4 or it isn't. I thought that my Step(5)
script was also making a binary choice with the -v argument to its grep statement. Not so fast.

Resorting to another binary choice not directly dependent on grep, I compared the two files with diff:
diff --suppress-common-lines <(sort Raw-PTRs/VisitorList.txt) <(sort IPv4s/VisitorList.txt) | grep "<" '-' | sed 's/^< //' '-' > Temp-04162021-A01.txt ; wc -l Temp-04162021-A01.txt ==> 20923
which accurately omits the 2015-10-10-12-46-47.113745 that is in my IPv4 list erroneously.

It may be that grep is easily fooled ...

The remaining steps now require corrections to account for the missed 172 PTR's.

Thank you, Magic Banana, for being my source of best resort.

Magic Banana


It may be that grep is easily fooled ...

grep does what it is supposed to do. Besides the use of dots to literally mean dots and not "any single character" (option -F makes grep interpret the patterns as fixed strings), you forgot option -x, to force the pattern to match the whole line (a hostname may "include" an IPv4 address, right?). I believe 'grep -Fvxf IPv4s/VisitorList.txt No-IPv6s/VisitorList.txt' would provide the correct output, if IPv4s/VisitorList.txt is correct in the first place (and it may not be, unless hostnames are somehow forced to include a letter).

However, it is inefficient: for the reason I gave in my last post, the execution time depends on the product of the size of IPv4s/VisitorList.txt and the size of No-IPv6s/VisitorList.txt. In contrast, the execution time of the small AWK program I gave you only depends on the size of No-IPv6s/VisitorList.txt.
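
The effect of -F and -x can be checked on a tiny made-up example (the file names and contents below are placeholders, not the poster's data):

```shell
dir=$(mktemp -d)

# One "address" used as the pattern list.
printf '%s\n' '1.2.3.4' > "$dir/ipv4.txt"

# Four candidate lines: the address itself, a hostname embedding it,
# a hostname that only matches when "." means "any character", and an unrelated one.
printf '%s\n' \
  '1.2.3.4' \
  'host-1.2.3.4.example.net' \
  '1x2y3z4.example.net' \
  'other.example.net' > "$dir/all.txt"

# Patterns read as regular expressions, matched anywhere in the line.
regex_result=$(grep -vf "$dir/ipv4.txt" "$dir/all.txt")

# Patterns read as fixed strings (-F) that must match the whole line (-x).
fixed_result=$(grep -Fvxf "$dir/ipv4.txt" "$dir/all.txt")

printf 'grep -vf:\n%s\n' "$regex_result"
printf 'grep -Fvxf:\n%s\n' "$fixed_result"
rm -rf "$dir"
```

Without -F and -x, three of the four lines are discarded, including two hostnames; with them, only the exact address line is removed.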

amenex

Substituting an awk script for one using grep ==>

See: https://www.theunixschool.com/2012/09/grep-vs-awk-examples-for-pattern-search.html?m=1
Where it's said:
Using awk, the [grep] thing can be done by placing the pattern within slashes:
awk '/Linux/' file

IGNORECASE is a special built-in variable present in GNU awk/gawk.
When it is set to a non-zero value, it does a case insensitive search.

awk '/linux/' IGNORECASE=1 file

Step(11) with the -x argument to grep challenged my T420's RAM & swap nearly to their limits:
awk '{print $2}' Resolved.PTRs/VisitorList.MBE.txt | grep -vxf '-' <(cat Resolved.PTRs/VisitorList.MBA.txt Resolved.PTRs/VisitorList.MBC.txt) > No-IPv6s.or.4s/VisitorList.MBF.txt
Note: VisitorList.MBA.txt has its columns reversed ! Grep didn't care, but awk & diff surely do.

Step(11) modified to do the search w/o grep:
awk '{print $2}' Resolved.PTRs/VisitorList.MBE.txt | awk '/-/' IGNORECASE=1 <(cat <(awk '{print $2,$1}' Resolved.PTRs/VisitorList.MBA.txt) Resolved.PTRs/VisitorList.MBC.txt) > Temp-04172021-A02.txt

This script finds the matches, not the anti-matches ... follow up with diff:
diff Temp-04172021-A02.txt <(cat <(awk '{print $2,$1}' Resolved.PTRs/VisitorList.MBA.txt) Resolved.PTRs/VisitorList.MBC.txt) > Temp-04172021-B02.txt

Separate diff's output lines containing angle brackets (> or <):
awk /">"/ Temp-04172021-B02.txt | awk '{print $2,$3}' '-' > No-IPv6s.or.4s/VisitorList.RightF.txt
Unique to VisitorList.MBA.txt or to VisitorList.MBC.txt: 14067 rows
awk /"<"/ Temp-04172021-B02.txt | awk '{print $2,$3}' '-' > No-IPv6s.or.4s/VisitorList.LeftF.txt
Unique to Temp-04172021-A02.txt: none

That's four non-grep scripts which can be run in sequence with lots more characters, but scarcely the blink
of an eye in execution time and no challenge to the 'puter's memory.
My original Step(11) was followed by a typographical error, here corrected:
15433 rows of as-yet-unresolved PTR's.

Magic Banana


Grep didn't care, but awk & diff surely do.

Again, those commands behave as specified in their documentations. They do not have free will. If you want to properly work with them, you need to understand what they do (or why they do not do what you expected). Instead, you apparently try random things, make the command lines unreadable doing so, until the output "appears" correct but is often not (it is only correct on the few lines you checked).

For instance, instead of "grep -vxf '-' <(cat Resolved.PTRs/VisitorList.MBA.txt Resolved.PTRs/VisitorList.MBC.txt)", you can write "grep -vxf - Resolved.PTRs/VisitorList.MBA.txt Resolved.PTRs/VisitorList.MBC.txt". Your uselessly more complicated command is also less efficient and not portable. And it is not what you want in the first place (as I have already explained).

As for efficiency, grep '...' is faster, clearer and shorter to type than awk '/.../', cut -f 2 is faster, clearer and shorter to type than awk '{ print $2 }', etc. More generally, specific commands (head, tail, shuf, cat, tr, wc, cut, paste, comm, join, uniq, sort, grep, etc.) are faster, clearer and shorter to type than what can be achieved with Turing-complete languages (sed, awk, python, etc.). Those languages are far more powerful. They can compute anything a computer can compute.

The blog you found (which looks good, at least at first sight), "The UNIX school", aims to teach you things. How to filter records with awk is definitely something to know. But if it is the only thing you want (no action on those filtered records, but printing them), then you had better use grep.

amenex

Step(14) Eliminate the unresolved IPv4's from the preceding nMap scans and join with the original
Current-Visitor data (this was the deprecated grep approach):
cat IPv4s-to-24/*-oGnMap.txt | grep "[a-z,A-Z]" '-' | join -1 2 -2 3 <(sort -k 2,2 '-') <(sort -k 3,3 CV-InputData/VisitorList.txt) | awk '{print $2,$1}' '-' | sort -u > Resolved.PTRs/VisitorList.H.txt
Becomes:
cat IPv4s-to-24/VisitorList.Non-GrepG.????-oGnMap.txt | awk '{print $2}' '-' | join -1 1 -2 3 <(sort -k 1,1 '-') <(sort -k 3,3 CV-InputData/VisitorList.txt) | awk '{print $1,$2,$3}' '-' | sort -u | sort -k 3,3 -nrk 2,2 > Resolved.PTRs/VisitorList.Non-GrepH.01.txt
Reunite Resolved.PTRs/VisitorList.Non-GrepH.txt with its IPv4 data:
join -1 2 -2 1 <(cat IPv4s-to-24/VisitorList.Non-GrepG.????-oGnMap.txt | sort -k 2,2) <(sort -k 1,1 Resolved.PTRs/VisitorList.Non-GrepH.txt) | awk '{print $2,$1,$3,$4}' '-' > Temp-04172021-C01.txt
Starting over with Step(15): Aggregate the resolved PTR's and find the remaining unresolved PTR's
that have to be found elsewhere:
cat Temp-04172021-C01.txt Resolved.PTRs/VisitorList.MBE.txt | awk '{print $2}' '-' | sort -u | grep -vxf '-' No-IPv6s.or.4s/VisitorList.txt > PTRs-to-GG/VisitorList.MBI.txt
Yields 8309 rows of unresolved PTR's.
Grep demanded a lot of RAM but no swap and took about five minutes to complete its work.
I tried but could not find the syntax to accomplish the same end as "grep -v" with awk, but
now I'm assured that grep properly applied is indeed very handy and accurate.

There are a great many IPv4's incorporated into those 8309 unresolved PTR's. I'll see how many
were missed by the previous IPv4-extraction script.

In high school my plane geometry instructor made us take a short quiz at the close of every
class, and somehow I managed to get every one right, except one. Under that time constraint,
I grabbed the first "proof" that came to mind and stuck with it, even when a better approach
would have led to less scribbling. Text processing is not yet even a second language to me.

Magic Banana


Writing a mathematical constructive proof is very similar to writing a program. The difference is that, for programs, we care about the time and memory it requires to be executed. For instance, it is important to understand that the execution of 'grep -f patterns file' requires (in the worst case) a time that depends on the product of the size of patterns and the size of file.
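
As an illustration of an approach whose running time grows with the sum of the two file sizes rather than their product, here is a sketch using an awk array as a lookup table. It is a common idiom, not something taken from this thread, and the file names and sample lines are made up.

```shell
dir=$(mktemp -d)
printf '%s\n' '192.0.2.1' '203.0.113.77' > "$dir/ipv4.txt"
printf '%s\n' \
  '192.0.2.1' \
  'www.example.net' \
  '203.0.113.77' \
  'host-192.0.2.1.example.org' > "$dir/all.txt"

# First file: remember every line as an array key.
# Second file: print the lines that are not among those keys.
# Caveat: the NR == FNR test assumes the first file is not empty.
rest=$(awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen)' "$dir/ipv4.txt" "$dir/all.txt")

printf '%s\n' "$rest"
rm -rf "$dir"
```

Each line of the second file costs one array lookup, whatever the length of the first file; the comparison is on whole lines and on exact strings, so dots are just dots.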

Anyway, performance and style (such as simply writing grep '[a-z,A-Z]' IPv4s-to-24/*-oGnMap.txt instead of your cat IPv4s-to-24/*-oGnMap.txt | grep "[a-z,A-Z]" '-') are not your main problem. Your main problem is that your programs/proofs are wrong, because you do not look at the definitions of the operators you use. You check what you come up with on a few examples and iteratively increase the complexity of your "solution" (without really understanding what you do) until it works for those examples. Nevertheless, it usually still does not work in the general case. In this thread, the difference between the output you got and the correct one (provided by the small AWK program I wrote for you) is one more example of that.

Worse, you apparently refuse to acknowledge that the commands you use behave according to a well-defined specification. For instance, I have told you many times that '.' in a regular expression does not necessarily match a dot but any single character, as specified on https://www.gnu.org/software/grep/manual/html_node/Fundamental-Structure.html#Fundamental-Structure

One more time, see for yourself:
$ echo 102:3A4 | grep 1.2.3.4
102:3A4

I tried but could not find the syntax to accomplish the same end as "grep -v" with awk

Precede the condition with an exclamation mark, as in C and as https://www.theunixschool.com/2012/09/grep-vs-awk-examples-for-pattern-search.html (the link you gave) teaches it.
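
For instance, on three made-up lines (the sample data is only an illustration), negating the pattern in awk selects the same lines as grep -v:

```shell
sample=$(printf '%s\n' '192.0.2.1' 'www.example.net' '203.0.113.77')

# Lines that contain a letter.
kept=$(printf '%s\n' "$sample" | awk '/[A-Za-z]/')

# Lines that do not: the "!" inverts the condition.
dropped=$(printf '%s\n' "$sample" | awk '!/[A-Za-z]/')

# The grep equivalent of the negated form.
via_grep=$(printf '%s\n' "$sample" | grep -v '[A-Za-z]')

printf 'awk negated:\n%s\n' "$dropped"
printf 'grep -v:\n%s\n' "$via_grep"
```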

amenex

Magic Banana's academic diagnosis is correct. While I am not an ad hoc student of lithography who is
in the class so I can make better $20 bills, I do see an unaddressed deficiency in Internet security
which is adversely affecting way too many folks. My present goal is to demonstrate that way too much
Internet traffic is malevolent or misused and can be detected by the lack of reversibility of the
packet identities from numerical (IPv4 or IPv6) to alphanumeric (hostnames or pointers (PTR's)) and
back.

Regarding awk and the exclamation point("!") for which I finally got a working syntax:
awk '{print $2}' Resolved.PTRs/VisitorList.MBE.txt | awk /-/ IGNORECASE=1 <(cat <(awk '{print $2,$1}' Resolved.PTRs/VisitorList.MBA.txt) Resolved.PTRs/VisitorList.MBC.txt) > Positive.txt
vs.
awk '{print $2}' Resolved.PTRs/VisitorList.MBE.txt | awk '!/-/' IGNORECASE=1 <(cat <(awk '{print $2,$1}' Resolved.PTRs/VisitorList.MBA.txt) Resolved.PTRs/VisitorList.MBC.txt) > Negative.txt

Testing whether awk is making a binary choice:
cat <(awk '{print $2,$1}' Resolved.PTRs/VisitorList.MBA.txt) Resolved.PTRs/VisitorList.MBC.txt > Neutral.txt
Followed by:
diff -s <(cat Positive.txt Negative.txt | sort) <(sort Neutral.txt)
Returns "Files /dev/fd/63 and /dev/fd/62 are identical."

Q.E.D. The two awk scripts are complementary.
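
A caveat on what that test establishes, sketched on made-up data: in awk, /-/ is a regular expression that matches a literal hyphen; it is not a stand-in for standard input the way '-' is as the file operand of grep -f. When awk is given a file to read, whatever is piped into it goes unused. The complement check therefore holds for any pattern, and does not by itself show that the piped list took part in the selection.

```shell
dir=$(mktemp -d)
printf '%s\n' \
  'a-b.example.net 192.0.2.1' \
  'plain.example.net 192.0.2.2' > "$dir/list.txt"

# The piped name plays no role: awk reads list.txt and tests each line for a hyphen.
matched=$(printf 'plain.example.net\n' | awk '/-/' "$dir/list.txt")
unmatched=$(printf 'plain.example.net\n' | awk '!/-/' "$dir/list.txt")

printf 'matched:   %s\n' "$matched"
printf 'unmatched: %s\n' "$unmatched"
rm -rf "$dir"
```

Here the piped name is plain.example.net, yet the line selected by /-/ is the other one, because it is the only line containing a hyphen.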

What's making this task difficult is that Webalizer data presented post hostname lookup
has to be back-converted to numerical address data to provide a reliable starting place.
When I did that for my own principal domain it took over a month to track down way too
many untraceable hostnames whose identities were revealed by their users' concurrent
malevolent email activity, detectable only by one-at-a-time Internet searches.

amenex

"What's the point of the tedious list of scripting steps," one might ask.

After much online grousing about the practice of gratuitous hostname lookups, I've realized
that those often-unresolvable hostnames (a.k.a. Pointers, abbreviated as PTR) are probably
hiding the addresses of malicious material residing on the addressed servers. They therefore
are gaining access to our machines whether or not hostname lookups are performed at the ISP's
servers.

I'm advocating that the hostname lookup process be extended to include an immediate attempt
to resolve the advertised PTR that is the response to the original hostname lookup, but I
don't have the programming knowledge to accomplish that step. Instead, I'm providing a
demonstration of what the Up/Down Lookup process reveals when applied to Webalizer data
published online, using text-processing scripts akin to what Magic Banana is teaching us.
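
The comparison at the heart of the Up/Down Lookup can be sketched offline, without any network access. Everything in this sketch is an assumption made for illustration: the two file layouts (address then PTR, and PTR then resolved address), the file names, and the sample entries. It is not the nmap-based procedure used in the steps that follow, only the matching logic.

```shell
dir=$(mktemp -d)

# Hypothetical "down" lookups: address, then the PTR it returned.
printf '%s\n' \
  '192.0.2.1 a.example.net' \
  '192.0.2.2 b.example.net' \
  '192.0.2.3 No_DNS' > "$dir/reverse.txt"

# Hypothetical "up" lookups of those PTRs: name, then the address it resolved to.
printf '%s\n' \
  'a.example.net 192.0.2.1' \
  'b.example.net 198.51.100.9' > "$dir/forward.txt"

# A PTR counts as resolvable only if looking it up leads back to the same address.
verdicts=$(awk '
  NR == FNR { fwd[$1] = $2; next }
  {
    status = (($2 in fwd) && fwd[$2] == $1) ? "Resolvable" : "Unresolvable"
    print $1, $2, status
  }' "$dir/forward.txt" "$dir/reverse.txt")

printf '%s\n' "$verdicts"
rm -rf "$dir"
```

In this made-up data, only the first address round-trips; the second PTR resolves to a different address and the third has no PTR at all.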

The starting points for the Up/Down Lookup comparison are in this selection of three months
of a six-year record of visiting hosts (attached):
Analysis/Sum-VisitorList-2014-12.txt
Analysis/Sum-VisitorList-2015-01.txt
...
Analysis/Sum-VisitorList-2021-03.txt

1. Perform nmap lookups for Col.$1 (the IPv4's) in all the month files,
followed by the opposite lookup of the resulting PTR's:
awk '{print $1}' Analysis/Sum-VisitorList-2014-12.txt | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Analysis/nMap-I-VisitorList-2014-12.txt ; awk '{print $2}' Analysis/nMap-I-VisitorList-2014-12.txt | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Analysis/nMap-II-VisitorList-2014-12.txt ;
awk '{print $1}' Analysis/Sum-VisitorList-2015-01.txt | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Analysis/nMap-I-VisitorList-2015-01.txt ; awk '{print $2}' Analysis/nMap-I-VisitorList-2015-01.txt | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Analysis/nMap-II-VisitorList-2015-01.txt ;
awk '{print $1}' Analysis/Sum-VisitorList-2021-03.txt | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Analysis/nMap-I-VisitorList-2021-03.txt ; awk '{print $2}' Analysis/nMap-I-VisitorList-2021-03.txt | sudo nmap -Pn -sn -T2 --max-retries 8 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' > Analysis/nMap-II-VisitorList-2021-03.txt ;

2. Keep track of the resolved vs. unresolved PTR's, requiring four steps for each month's data:
awk '{print $2}' Analysis/nMap-II-VisitorList-2014-12.txt | grep -i -e '-' -f Analysis/Sum-VisitorList-2014-12.txt | sort -nrk 3,3 | awk '{print $0" Resolvable"}' > Analysis/Resv-VisitorList-2014-12.txt ;
awk '{print $2}' Analysis/nMap-II-VisitorList-2014-12.txt | grep -iv -e '-' -f Analysis/Sum-VisitorList-2014-12.txt | sort -nrk 3,3 | awk '{print $0" Unresolvable"}' > Analysis/Unrs-VisitorList-2014-12.txt ;
cat Analysis/Resv-VisitorList-2014-12.txt Analysis/Unrs-VisitorList-2014-12.txt > Analysis/Website-VisitorList-2014-12.txt ;
echo "scale=4; $(wc -l < Analysis/Resv-VisitorList-2014-12.txt) / ($(wc -l < Analysis/Resv-VisitorList-2014-12.txt) + $(wc -l < Analysis/Unrs-VisitorList-2014-12.txt))" | bc -l | awk '{print "0"$1}' '-' > Analysis/FractionRes-VisitorList-2014-12.txt ;

awk '{print $2}' Analysis/nMap-II-VisitorList-2015-01.txt | grep -i -e '-' -f Analysis/Sum-VisitorList-2015-01.txt | sort -nrk 3,3 | awk '{print $0" Resolvable"}' > Analysis/Resv-VisitorList-2015-01.txt ;
awk '{print $2}' Analysis/nMap-II-VisitorList-2015-01.txt | grep -iv -e '-' -f Analysis/Sum-VisitorList-2015-01.txt | sort -nrk 3,3 | awk '{print $0" Unresolvable"}' > Analysis/Unrs-VisitorList-2015-01.txt ;
cat Analysis/Resv-VisitorList-2015-01.txt Analysis/Unrs-VisitorList-2015-01.txt > Analysis/Website-VisitorList-2015-01.txt ;
echo "scale=4; $(wc -l < Analysis/Resv-VisitorList-2015-01.txt) / ($(wc -l < Analysis/Resv-VisitorList-2015-01.txt) + $(wc -l < Analysis/Unrs-VisitorList-2015-01.txt))" | bc -l | awk '{print "0"$1}' '-' > Analysis/FractionRes-VisitorList-2015-01.txt ;

awk '{print $2}' Analysis/nMap-II-VisitorList-2021-03.txt | grep -i -e '-' -f Analysis/Sum-VisitorList-2021-03.txt | sort -nrk 3,3 | awk '{print $0" Resolvable"}' > Analysis/Resv-VisitorList-2021-03.txt ;
awk '{print $2}' Analysis/nMap-II-VisitorList-2021-03.txt | grep -iv -e '-' -f Analysis/Sum-VisitorList-2021-03.txt | sort -nrk 3,3 | awk '{print $0" Unresolvable"}' > Analysis/Unrs-VisitorList-2021-03.txt ;
cat Analysis/Resv-VisitorList-2021-03.txt Analysis/Unrs-VisitorList-2021-03.txt > Analysis/Website-VisitorList-2021-03.txt ;
echo "scale=4; $(wc -l < Analysis/Resv-VisitorList-2021-03.txt) / ($(wc -l < Analysis/Resv-VisitorList-2021-03.txt) + $(wc -l < Analysis/Unrs-VisitorList-2021-03.txt))" | bc -l | awk '{print "0"$1}' '-' > Analysis/FractionRes-VisitorList-2021-03.txt ;

3. Calculate the fraction resolved:
awk '{print FILENAME,$0}' Analysis/FractionRes-VisitorList-2014-12.txt >> PSI-Plot-data/PSI-Plot-VisitorList.txt ;
awk '{print FILENAME,$0}' Analysis/FractionRes-VisitorList-2015-01.txt >> PSI-Plot-data/PSI-Plot-VisitorList.txt ;
...
awk '{print FILENAME,$0}' Analysis/FractionRes-VisitorList-2021-03.txt >> PSI-Plot-data/PSI-Plot-VisitorList.txt ;
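
The fraction itself can also be computed in a single awk call, which sidesteps the need to prepend a "0" to bc's output (bc prints .7500, but 1.0000 when every row is resolvable, which the "0" prefix would turn into 01.0000). A minimal sketch with made-up row counts; the file names are placeholders, and note that printf rounds where bc with scale=4 truncates.

```shell
dir=$(mktemp -d)
# Made-up lists: three resolvable rows and one unresolvable row.
printf '%s\n' a b c > "$dir/resolvable.txt"
printf '%s\n' d > "$dir/unresolvable.txt"

r=$(wc -l < "$dir/resolvable.txt")
u=$(wc -l < "$dir/unresolvable.txt")

# Guard against an empty month, then print the fraction with four decimals.
fraction=$(awk -v r="$r" -v u="$u" \
  'BEGIN { if (r + u > 0) printf "%.4f", r / (r + u); else printf "NA" }')

printf '%s\n' "$fraction"
rm -rf "$dir"
```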

Once built into the source code of the servers, the Up/Down Lookup process would throttle malicious
Internet traffic, a benefit that should far outweigh the extra time and computing power needed to
perform the second resolution attempt for each incoming packet.

On April 18, 2021 I edited the syntax of grep in accordance with man grep.

George Langford

Attachments (name, size):
Sum-VisitorList-2014-12.txt 16.75 KB
Sum-VisitorList-2015-01.txt 17.34 KB
Sum-VisitorList-2101-03.txt 56.4 KB