Submitted by amenex on Fri, 02/18/2022 - 16:05

Distinguish the end of a list of IPv4 addresses from the following alphanumeric strings


Fri, 02/18/2022 - 16:05

amenex

Joined: 01/03/2015

The attached Mixed-Types.txt file is copied from a 2000-row list of resolved domain names.
The original list had been rearranged with sort -Vk 2,2, leaving about two dozen partially
resolved domains at the end. Those partially resolved domains have to be processed again
with dig, but by grep-ing six additional lines after the ANSWER SECTION because the IPv4
address is in the last line of the grep-ed output.
Selection of those last two dozen lines requires that they be counted somehow; for example:
awk '{print $1}' Mixed-Types.txt > PartAA.txt
grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' <(awk '{print $2}' Mixed-Types.txt) > PartBB.txt
paste -d ' ' PartAA.txt PartBB.txt > PartCC.txt
wc -l PartBB.txt | sed 's/PartBB.txt//g'
wc -l PartCC.txt | sed 's/PartCC.txt//g'
Then the main file can be separated into two parts with:
awk '{print $1,$2}' PartCC.txt | head -n 35 '-' > Resolved-2k-Domains.txt
awk '{print $1}' PartAA.txt | tail -n -25 '-' > Unresolved-two-dozen-Domains.txt
Perform a dig on any one of those last 25 domains and you're likely to get different answers
every day ...
Two questions remain:
(1) Is there a less cumbersome way of counting the resolved domains ?
(2) How does one move the wc -l [filename] counts into those last two scripts ?

Attachment	Size
Mixed-Types.txt	1.84 KB

Fri, 02/18/2022 - 16:47

Magic Banana

Joined: 07/24/2010

As far as I understand, you want:

the whole lines whose last field is an IPv4 address:
$ grep -E ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt > Resolved-2k-Domains.txt
the first field of the remaining lines:
$ grep -vE ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt | cut -d ' ' -f 1 > Unresolved-two-dozen-Domains.txt

Both commands can run in parallel and Mixed-Types.txt need not be sorted.
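
To see what the two commands keep and drop, here is a minimal sketch on made-up data. The file name sample.txt, the scratch directory, and the domains and addresses (taken from the reserved documentation ranges) are all assumptions for illustration; nothing here comes from Mixed-Types.txt.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: resolved lines end with an IPv4 address,
# unresolved lines end with some other string (e.g. a CNAME target).
# The lines are deliberately not sorted.
cat > sample.txt <<'EOF'
example.org 192.0.2.10
example.com alias.example.com.
example.net 198.51.100.7
example.info ns1.example.info.
example.edu 203.0.113.99
EOF

# Same pattern as in the post: a space, four dotted groups of digits, end of line.
ipv4=' ([0-9]{1,3}\.){3}[0-9]{1,3}$'

# Whole lines whose last field looks like an IPv4 address.
grep -E "$ipv4" sample.txt > resolved.txt

# First field of every other line.
grep -vE "$ipv4" sample.txt | cut -d ' ' -f 1 > unresolved.txt

cat resolved.txt
cat unresolved.txt
```

Note that the pattern only checks the shape of the last field, so a string such as 999.999.999.999 would also count as resolved.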

Answering your questions anyway:

(1) Is there a less cumbersome way of counting the resolved domains ?

Grep has option -c for that:
$ grep -Ec ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt
35

(2) How does one move the wc -l [filename] counts into those last two scripts ?

With $(...):
$ head -$(wc -l < PartBB.txt) PartCC.txt > Resolved-2k-Domains.txt
$ tail -n +$(( $(wc -l < PartBB.txt) + 1 )) PartAA.txt > Unresolved-two-dozen-Domains.txt
Notice also the removal of the useless uses of awk (which only copies its input here!) and of sed (redirecting wc's input keeps it from printing the file name).
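
As a minimal sketch of command substitution, the count can be stored in a variable once and reused. This assumes, as in the original question, that the file is sorted so that all unresolved lines sit at the end; the file names and sample data below are made up for illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sorted input: resolved lines first, unresolved lines last.
cat > sorted.txt <<'EOF'
example.org 192.0.2.10
example.net 198.51.100.7
example.edu 203.0.113.99
example.com alias.example.com.
example.info ns1.example.info.
EOF

# $(...) is replaced by the command's output, here the number of resolved lines.
resolved=$(grep -Ec ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' sorted.txt)

# The first $resolved lines are the resolved ones ...
head -n "$resolved" sorted.txt > resolved.txt

# ... and everything from line resolved+1 onward is unresolved.
tail -n +"$((resolved + 1))" sorted.txt | cut -d ' ' -f 1 > unresolved.txt

echo "resolved: $resolved"
cat unresolved.txt
```

Using an offset (tail -n +N) rather than a count means the number of unresolved lines never has to be computed separately.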

Fri, 02/18/2022 - 18:50

amenex

Joined: 01/03/2015

The lesson for today was to learn the use of "$" in sed and in head as well as the use of cut,
not to mention counting with grep.

Another related question comes up:
mboxgrep has no provision for a pattern file like that in grep, nor can one pipe in a pattern
one-at-a-time, as from awk. My workaround has been to create a huge script file with Leafpad,
which takes only a few keystrokes. Is there a better way of listing all the emails containing
the pattern(s)? My big scripts are held up by the dig searches, not by processing their outputs.

Fri, 02/18/2022 - 20:54

Magic Banana

Joined: 07/24/2010

For the shell to read the lines on the standard input one by one:
while read pattern
do
    (... "$pattern" is the line ...)
done
As always, < can redirect the standard input so that the lines are read from a file ("$1" below, i.e., the first argument of the shell script):
while read pattern
do
    (... "$pattern" is the line ...)
done < "$1"
Of course, the variable containing the line need not be named "pattern".
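
A minimal, runnable sketch of that loop follows. The function name count_matches, the file names, and the sample data are hypothetical; IFS= and -r are additions not in the post, used here so that leading blanks and backslashes in a line survive unchanged.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical pattern file (one pattern per line) and text to search.
printf '%s\n' 'example.com' 'example.info' > patterns.txt
cat > haystack.txt <<'EOF'
From: someone@example.com
From: other@example.net
Reply-To: list@example.com
EOF

# count_matches FILE: read one pattern per line from standard input and
# print "pattern count", counting lines of FILE that contain the pattern
# as a fixed string (grep -F).
count_matches() {
    while IFS= read -r pattern
    do
        printf '%s %s\n' "$pattern" "$(grep -Fc -- "$pattern" "$1")"
    done
}

# < redirects the loop's standard input, so the lines come from the file.
count_matches haystack.txt < patterns.txt
```

Any command can stand in for grep inside the loop, as long as it does not itself consume the loop's standard input.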

Fri, 02/18/2022 - 22:31

amenex

Joined: 01/03/2015

Magic Banana's assigned homework problem:
while read pattern
do
    (... "$pattern" is the line ...)
done < "$1"
Solutions proposed by amenex:
while read pattern
do
    dig "$pattern" | grep -A6 ";; ANSWER SECTION:" | sed 's/;;\ ANSWER SECTION://g' | awk '{print $1,$5} NR==6{exit}' >> Resolved-XX-Domains-MB.txt
done < Unresolved-two-dozen-Domains.txt
plus some sed coding to clean up Resolved-XX-Domains-MB.txt:
sed 's/\.\ /\ /g' Resolved-XX-Domains-MB.txt | sed -r 's/\.$//' | sed 's/;;\ msec//g' | sed 's/;;//g' | grep "\S" '-' > Resolved-YY-Domains-MB.txt
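
The same clean-up can be sketched as a single sed call with one expression per step. The function name clean_dig_pairs and the sample lines are hypothetical; the sample only imitates what awk '{print $1,$5}' tends to leave behind (trailing dots, ";; msec" from the query-time line, whitespace-only lines).

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# clean_dig_pairs: filter standard input with the same steps as the pipeline,
# in the same order, ending by deleting lines with no non-blank character.
clean_dig_pairs() {
    sed -e 's/\. / /g' \
        -e 's/\.$//' \
        -e 's/;; msec//g' \
        -e 's/;;//g' \
        -e '/[^[:space:]]/!d'
}

# Hypothetical raw lines, written with printf so trailing blanks are explicit.
printf '%s\n' \
    ' ' \
    'example.com. 192.0.2.10' \
    'www.example.net. example.net.' \
    ';; msec' \
    ';; ' > raw.txt

clean_dig_pairs < raw.txt > clean.txt
cat clean.txt
```

One process instead of five makes little difference for two dozen lines, but the steps are easier to read and reorder this way.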
In response to my complaint about the limitations of mboxgrep:
while read pattern
do
    mboxgrep "$pattern" /media/george/523ff5d3-64ea-486d-ba82-58721680b667/george/Georgesbasement.com.A/Thumb256E/GeorgesBasement.com/AAspam/1998-2021.Newest
done < Unresolved-two-dozen-Domains.txt > Emails-02182022.txt
Takes a little longer to analyze a 200+ MB email collection ... bear in mind that the Email collection
remains fixed, but the unresolved domains vary from day to day.
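
One possible way to scan a large collection once, rather than once per pattern, is to join the patterns into a single alternation. The sketch below only builds the expression and tries it with grep -E on made-up data; whether mboxgrep accepts the same expression is an assumption to check against its manual and is not tested here. File names and sample lines are hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical pattern file and stand-in for the mail collection.
printf '%s\n' 'example.com' 'example.info' > patterns.txt
cat > mail.txt <<'EOF'
From: someone@example.com
From: other@example.net
Reply-To: list@example.info
From: odd@exampleXcom
EOF

# Escape the dots so they match literally, then join the lines with "|".
regex=$(sed 's/\./\\./g' patterns.txt | paste -sd '|' -)
echo "$regex"

# One pass over the file finds lines matching any of the patterns.
grep -E "$regex" mail.txt
```

Only dots are escaped here, which is enough for plain domain names; patterns containing other regular-expression characters would need more care.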

Thanks to Magic Banana for making me think about applying his suggestions ..
