Distinguish the end of a list of IPv4 addresses from the following alphanumeric strings

4 replies
amenex
Offline
Joined: 01/03/2015

The attached Mixed-Types.txt file is copied from a 2000-row list of resolved domain names.
The original list had been rearranged with sort -Vk 2,2, leaving about two dozen partially
resolved domains at the end. Those partially resolved domains have to be processed again
with dig, but by grep-ing six additional lines after the ANSWER SECTION because the IPv4
address is in the last line of the grep-ed output.
Selection of those last two dozen lines requires that they be counted somehow; for example:
awk '{print $1}' Mixed-Types.txt > PartAA.txt ;
grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' <(awk '{print $2}' Mixed-Types.txt) > PartBB.txt ;
paste -d ' ' PartAA.txt PartBB.txt > PartCC.txt ;
wc -l PartBB.txt | sed 's/PartBB.txt//g' ;
wc -l PartCC.txt | sed 's/PartCC.txt//g' ;

Then the main file can be separated into two parts with:
awk '{print $1,$2}' PartCC.txt | head -n 35 '-' > Resolved-2k-Domains.txt
awk '{print $1}' PartAA.txt | tail -n -25 '-' > Unresolved-two-dozen-Domains.txt

Perform a dig on any one of those last 25 domains and you're likely to get different answers
every day ...
Two questions remain:
(1) Is there a less cumbersome way of counting the resolved domains ?
(2) How does one move the wc -l [filename] counts into those last two scripts ?

Attachment: Mixed-Types.txt (1.84 KB)
Magic Banana

I am a member!

Online
Joined: 07/24/2010

As far as I understand, you want:

  • the whole lines whose last field is an IPv4 address:
    $ grep -E ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt > Resolved-2k-Domains.txt
  • the first field of the remaining lines:
    $ grep -vE ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt | cut -d ' ' -f 1 > Unresolved-two-dozen-Domains.txt

Both commands can run in parallel, and Mixed-Types.txt need not be sorted.
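
As a quick check of the two commands above, here is a minimal sketch on made-up data. The file name sample.txt, the domain names and the addresses (taken from the documentation ranges) are assumptions for illustration, not the contents of Mixed-Types.txt.

```shell
# Hypothetical stand-in for Mixed-Types.txt: "domain last-field" on each line.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
one.example 192.0.2.10
two.example 198.51.100.7
three.example alias.three.example.
four.example 203.0.113.25
five.example cdn.five.example.
EOF

# Same pattern as above: a space, four dot-separated groups of 1-3 digits, end of line.
ipv4=' ([0-9]{1,3}\.){3}[0-9]{1,3}$'

# Whole lines whose last field looks like an IPv4 address.
grep -E "$ipv4" "$workdir/sample.txt" > "$workdir/resolved.txt"

# First field of every other line.
grep -vE "$ipv4" "$workdir/sample.txt" | cut -d ' ' -f 1 > "$workdir/unresolved.txt"

cat "$workdir/resolved.txt"
cat "$workdir/unresolved.txt"
```

The pattern only checks the shape of the last field: it would also accept 999.999.999.999. That should not matter for addresses printed by dig.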

Answering your questions anyway:

(1) Is there a less cumbersome way of counting the resolved domains ?

Grep has option -c for that:
$ grep -Ec ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt
35
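
Combined with -v, the same option counts the remaining lines. A minimal sketch on hypothetical data (sample.txt and its three made-up rows are assumptions):

```shell
# Hypothetical three-row sample: two resolved rows, one unresolved.
workdir=$(mktemp -d)
printf '%s\n' 'a.example 192.0.2.1' 'b.example 192.0.2.2' 'c.example alias.c.example.' > "$workdir/sample.txt"

ipv4=' ([0-9]{1,3}\.){3}[0-9]{1,3}$'

grep -Ec  "$ipv4" "$workdir/sample.txt"   # lines ending in an IPv4-looking field: prints 2
grep -vEc "$ipv4" "$workdir/sample.txt"   # all other lines: prints 1
```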

(2) How does one move the wc -l [filename] counts into those last two scripts ?

With $(...):
$ head -$(wc -l < PartBB.txt) PartCC.txt > Resolved-2k-Domains.txt
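$ tail -n +$(( $(wc -l < PartBB.txt) + 1 )) PartAA.txt > Unresolved-two-dozen-Domains.txt
(paste pads the shorter file, so PartCC.txt has as many lines as PartAA.txt; the unresolved names are the lines of PartAA.txt that follow the first $(wc -l < PartBB.txt) ones.)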

Notice also the removal of the useless uses of awk (which only copies the input here) and of sed (made unnecessary by redirecting wc's input, so that it does not print the file name).
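
To see how $(...) and shell arithmetic feed a count into head and tail, here is a minimal sketch. The files names.txt and addrs.txt are hypothetical stand-ins for PartAA.txt and PartBB.txt, and the sketch assumes, as in the sorted list, that all the resolved rows come first.

```shell
workdir=$(mktemp -d)
# Hypothetical stand-ins: five names, of which only the first three have an address.
printf '%s\n' a.example b.example c.example d.example e.example > "$workdir/names.txt"
printf '%s\n' 192.0.2.1 192.0.2.2 192.0.2.3 > "$workdir/addrs.txt"

resolved=$(wc -l < "$workdir/addrs.txt")    # number of addresses found
total=$(wc -l < "$workdir/names.txt")       # number of names overall
unresolved=$((total - resolved))            # names left without an address

# The first $resolved rows, with their addresses pasted alongside.
paste -d ' ' "$workdir/names.txt" "$workdir/addrs.txt" | head -n "$resolved" > "$workdir/resolved.txt"

# The last $unresolved names.
tail -n "$unresolved" "$workdir/names.txt" > "$workdir/unresolved.txt"
```

Redirecting the file into wc (wc -l < file) keeps the file name out of the output, so the result can be used directly as a number.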

amenex

The lesson for today was to learn the use of "$" in sed and in head as well as the use of cut,
not to mention counting with grep.

Another related question comes up:
mboxgrep has no provision for a pattern file like that in grep, nor can one pipe in a pattern
one-at-a-time, as from awk. My workaround has been to create a huge script file with Leafpad,
which takes only a few keystrokes. Is there a better way of listing all the emails containing
the pattern(s) ? My big scripts are held up by the dig searches, not by processing their outputs.

Magic Banana

For the shell to read the lines on the standard input one by one:
while read pattern
do
(... "$pattern" is the line ...)
done

As always, < can redirect the standard input so that the lines are read from a file ("$1" below, i.e., the first argument of the shell script):
while read pattern
do
(... "$pattern" is the line ...)
done < "$1"

Of course, the variable containing the line need not be named "pattern".
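
A concrete version of that skeleton might look like the sketch below. The function name list_patterns and the file patterns.txt are hypothetical; IFS= and -r are added so that read keeps leading whitespace and backslashes as they are.

```shell
# Hypothetical helper: print each non-empty line of the file given as $1,
# numbered, to show that "$pattern" holds one line at a time.
list_patterns() {
    n=0
    while IFS= read -r pattern
    do
        [ -n "$pattern" ] || continue   # skip blank lines
        n=$((n + 1))
        printf '%d: %s\n' "$n" "$pattern"
    done < "$1"
}

# Usage with a made-up pattern file containing one blank line.
workdir=$(mktemp -d)
printf '%s\n' 'd.example' '' 'e.example' > "$workdir/patterns.txt"
list_patterns "$workdir/patterns.txt"
```

In a real script the printf line would be replaced by whatever command has to run once per pattern.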

amenex

Magic Banana's assigned homework problem:
while read pattern
do
(... "$pattern" is the line ...)
done < "$1"

Solutions proposed by amenex:
while read pattern
do dig "$pattern" | grep -A6 ";; ANSWER SECTION:" | sed 's/;; ANSWER SECTION://g' | awk '{print $1,$5} NR==6{exit}' >> Resolved-XX-Domains-MB.txt
done < Unresolved-two-dozen-Domains.txt

plus some sed coding to clean up Resolved-XX-Domains-MB.txt:
sed -e 's/\. / /g' -e 's/\.$//' -e 's/;; msec//g' -e 's/;;//g' Resolved-XX-Domains-MB.txt | grep '\S' > Resolved-YY-Domains-MB.txt
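
A possibly shorter route, offered as a sketch rather than a tested replacement: dig's +short option prints only the answer data (CNAME targets first, then addresses), so the last IPv4-looking line can be picked without any sed clean-up. The helper name last_ipv4 is hypothetical, and the canned input below only imitates the shape of dig +short output with made-up names.

```shell
# Hypothetical helper: keep only the last input line that looks like an IPv4 address.
last_ipv4() {
    grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' | tail -n 1
}

# Intended use (needs network access, so it is left commented out here):
#   while IFS= read -r domain
#   do
#       printf '%s %s\n' "$domain" "$(dig +short "$domain" A | last_ipv4)"
#   done < Unresolved-two-dozen-Domains.txt

# Canned text shaped like "dig +short" output for a name behind two CNAMEs.
printf '%s\n' 'alias.five.example.' 'edge.cdn.example.' '203.0.113.25' | last_ipv4
```

When no address comes back, last_ipv4 prints nothing, so the domain would be listed with an empty second field.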
In response to my complaint about the limitations of mboxgrep:
while read pattern
do mboxgrep "$pattern" /media/george/523ff5d3-64ea-486d-ba82-58721680b667/george/Georgesbasement.com.A/Thumb256E/GeorgesBasement.com/AAspam/1998-2021.Newest
done < Unresolved-two-dozen-Domains.txt > Emails-02182022.txt

Takes a little longer to analyze a 200+ MB email collection ... bear in mind that the Email collection
remains fixed, but the unresolved domains vary from day to day.
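
When a loop of this kind collects everything into a single file, the position of the redirection matters: a > inside the loop body truncates the file on every pass, whereas one placed after done opens it once. A minimal sketch, with echo standing in for mboxgrep and hypothetical file names:

```shell
workdir=$(mktemp -d)
printf '%s\n' first second third > "$workdir/patterns.txt"

# ">" inside the loop: the file is truncated on every pass,
# so only the output for the last pattern survives.
while IFS= read -r pattern
do
    echo "match for $pattern" > "$workdir/inside.txt"
done < "$workdir/patterns.txt"

# ">" after "done": the file is opened once and collects every pass.
while IFS= read -r pattern
do
    echo "match for $pattern"
done < "$workdir/patterns.txt" > "$workdir/after.txt"

wc -l < "$workdir/inside.txt"   # 1
wc -l < "$workdir/after.txt"    # 3
```

Using >> inside the loop also keeps every pass, but then the file has to be emptied by hand before each new run.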

Thanks to Magic Banana for making me think about applying his suggestions.