Submitted by amenex on Fri, 02/18/2022 - 16:05

Distinguish the end of a list of IPv4 addresses from the following alphanumeric strings

4 replies

Fri, 02/18/2022 - 16:05

amenex

Joined: 01/03/2015

The attached Mixed-Types.txt file is copied from a 2000-row list of resolved domain names.
The original list had been rearranged with sort -Vk 2,2, leaving about two dozen partially
resolved domains at the end. Those partially resolved domains have to be processed again
with dig, but by grep-ing six additional lines after the ANSWER SECTION because the IPv4
address is in the last line of the grep-ed output.
Selection of those last two dozen lines requires that they be counted somehow; for example:
awk '{print $1}' Mixed-Types.txt > PartAA.txt
grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' <(awk '{print $2}' Mixed-Types.txt) > PartBB.txt
paste -d ' ' PartAA.txt PartBB.txt > PartCC.txt
wc -l PartBB.txt | sed 's/PartBB.txt//g'
wc -l PartCC.txt | sed 's/PartCC.txt//g'
Then the main file can be separated into two parts with:
awk '{print $1,$2}' PartCC.txt | head -n 35 '-' > Resolved-2k-Domains.txt
awk '{print $1}' PartAA.txt | tail -n -25 '-' > Unresolved-two-dozen-Domains.txt
Perform a dig on any one of those last 25 domains and you're likely to get different answers
every day ...
Two questions remain:
(1) Is there a less cumbersome way of counting the resolved domains?
(2) How does one move the wc -l [filename] counts into those last two scripts?

Attachment	Size
Mixed-Types.txt	1.84 KB

Fri, 02/18/2022 - 16:47

Magic Banana

Joined: 07/24/2010

As far as I understand, you want:

the whole lines whose last field is an IPv4 address:
$ grep -E ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt > Resolved-2k-Domains.txt
the first field of the remaining lines:
$ grep -vE ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt | cut -d ' ' -f 1 > Unresolved-two-dozen-Domains.txt

Both commands can run in parallel and Mixed-Types.txt need not be sorted.
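
To try the two grep commands without the real attachment, here is a minimal sketch on a made-up five-line sample. The file contents, the temporary directory, and the output names Resolved.txt and Unresolved.txt are assumptions for illustration only; the regular expression is the one used above.

```shell
# Hypothetical sample standing in for Mixed-Types.txt
# (made-up domains, documentation-range addresses).
tmp=$(mktemp -d)
printf '%s\n' \
  'alpha.example 192.0.2.10' \
  'beta.example 198.51.100.7' \
  'gamma.example ns1.gamma.example.' \
  'delta.example 203.0.113.25' \
  'epsilon.example mail.epsilon.example.' > "$tmp/Mixed-Types.txt"

# A space, four dot-separated groups of 1 to 3 digits, then end of line.
ipv4=' ([0-9]{1,3}\.){3}[0-9]{1,3}$'

# Lines ending in an IPv4 address are kept whole.
grep -E "$ipv4" "$tmp/Mixed-Types.txt" > "$tmp/Resolved.txt"

# For every other line, keep only the domain (first field).
grep -vE "$ipv4" "$tmp/Mixed-Types.txt" | cut -d ' ' -f 1 > "$tmp/Unresolved.txt"

cat "$tmp/Resolved.txt"
cat "$tmp/Unresolved.txt"
```

Because every line either matches or does not, the two output files always account for all the input lines, whatever their order.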

Answering your questions anyway:

(1) Is there a less cumbersome way of counting the resolved domains?

Grep has option -c for that:
$ grep -Ec ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt
35

(2) How does one move the wc -l [filename] counts into those last two scripts?

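With $(...):
$ head -n $(wc -l < PartBB.txt) PartCC.txt > Resolved-2k-Domains.txt
$ tail -n +$(($(wc -l < PartBB.txt) + 1)) PartAA.txt > Unresolved-two-dozen-Domains.txt
The second command starts at the line following the last resolved one (tail -n +K prints from line K onward), which assumes the resolved lines come first, as in your sorted file.
Notice also the removal of the useless uses of awk (which only copies its input here) and of sed (by redirecting wc's input so that it does not print the file name).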
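
The counts can also be stored in variables once and reused. The following is only a sketch under assumptions: the sample lines and the variable names resolved, total and unresolved are invented here, and, as in the sorted original, the resolved lines are assumed to come first.

```shell
# Hypothetical sorted sample: resolved lines first, unresolved lines last.
tmp=$(mktemp -d)
printf '%s\n' \
  'a.example 192.0.2.1' \
  'b.example 192.0.2.2' \
  'c.example 192.0.2.3' \
  'd.example ns.d.example.' \
  'e.example ns.e.example.' > "$tmp/Mixed-Types.txt"

# Command substitution stores each count in a variable.
resolved=$(grep -Ec ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' "$tmp/Mixed-Types.txt")
total=$(wc -l < "$tmp/Mixed-Types.txt")

# Arithmetic expansion derives the number of unresolved lines.
unresolved=$((total - resolved))

head -n "$resolved" "$tmp/Mixed-Types.txt" > "$tmp/Resolved.txt"
tail -n "$unresolved" "$tmp/Mixed-Types.txt" | cut -d ' ' -f 1 > "$tmp/Unresolved.txt"

echo "$resolved resolved, $unresolved unresolved"
```

Nothing is hard-coded, so the same lines keep working when the number of unresolved domains changes from day to day.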

Fri, 02/18/2022 - 18:50

amenex

Joined: 01/03/2015

The lesson for today was to learn the use of "$" in sed and in head as well as the use of cut,
not to mention counting with grep.

Another related question comes up:
mboxgrep has no provision for a pattern file like that in grep, nor can one pipe in a pattern
one-at-a-time, as from awk. My workaround has been to create a huge script file with Leafpad,
which takes only a few keystrokes. Is there a better way of listing all the emails containing
the pattern(s)? My big scripts are held up by the dig searches, not by processing their outputs.

Fri, 02/18/2022 - 20:54

Magic Banana

Joined: 07/24/2010

For the shell to read the lines on the standard input one by one:
while read pattern
do
  (... "$pattern" is the line ...)
done
As always, < can redirect the standard input so that the lines are read from a file ("$1" below, i.e., the first argument of the shell script):
while read pattern
do
  (... "$pattern" is the line ...)
done < "$1"
Of course, the variable containing the line need not be named "pattern".
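
As a concrete instance of that loop, here is a minimal sketch. The file patterns.txt and its contents are made up for the example, and IFS= together with read -r is a defensive habit added here (it keeps leading blanks and backslashes intact), not something the loop strictly requires.

```shell
# Hypothetical pattern file: one domain per line, with a stray empty line.
tmp=$(mktemp -d)
printf '%s\n' 'gamma.example' '' 'epsilon.example' > "$tmp/patterns.txt"

count=0
while IFS= read -r pattern
do
  # Skip empty lines so that no command runs with an empty argument.
  [ -n "$pattern" ] || continue
  count=$((count + 1))
  printf 'pattern %d: %s\n' "$count" "$pattern"
done < "$tmp/patterns.txt"

echo "$count patterns read"
```

Since the input comes from a redirection and not from a pipe, the loop runs in the current shell and count is still set after done.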

Fri, 02/18/2022 - 22:31

amenex

Joined: 01/03/2015

Magic Banana's assigned homework problem:
while read pattern
do
  (... "$pattern" is the line ...)
done < "$1"
Solutions proposed by amenex:
while read pattern
do
  dig "$pattern" | grep -A6 ";; ANSWER SECTION:" | sed 's/;;\ ANSWER SECTION://g' | awk '{print $1,$5} NR==6{exit}' >> Resolved-XX-Domains-MB.txt
done < Unresolved-two-dozen-Domains.txt
plus some sed coding to clean up Resolved-XX-Domains-MB.txt:
sed 's/\.\ /\ /g' Resolved-XX-Domains-MB.txt | sed -r 's/\.$//' | sed 's/;;\ msec//g' | sed 's/;;//g' | grep "\S" '-' > Resolved-YY-Domains-MB.txt
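
A possible alternative to cleaning up afterwards is to keep only the A records in the first place. The sketch below runs on canned text imitating a dig ANSWER SECTION, so it needs no network; the domain, the CNAME target and the address are invented, and real dig output may contain other record types or several A records (this version keeps the last one).

```shell
# Canned text imitating part of a dig reply (hypothetical names and address).
answer=$(printf '%s\n' \
  ';; ANSWER SECTION:' \
  'www.gamma.example. 300 IN CNAME edge.cdn.example.' \
  'edge.cdn.example. 60 IN A 192.0.2.44' \
  '' \
  ';; Query time: 12 msec')

domain='www.gamma.example'

# Remember the address of the last A record; print it next to the queried domain.
resolved=$(printf '%s\n' "$answer" |
  awk -v d="$domain" '$3 == "IN" && $4 == "A" { ip = $5 }
                      END { if (ip != "") print d, ip }')

echo "$resolved"
```

In the real loop, the canned text would be replaced with the output of dig "$pattern", and nothing is printed for a domain that returns no A record.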
In response to my complaint about the limitations of mboxgrep:
while read pattern
do
  mboxgrep "$pattern" /media/george/523ff5d3-64ea-486d-ba82-58721680b667/george/Georgesbasement.com.A/Thumb256E/GeorgesBasement.com/AAspam/1998-2021.Newest
done < Unresolved-two-dozen-Domains.txt > Emails-02182022.txt
Takes a little longer to analyze a 200+ MB email collection ... bear in mind that the Email collection
remains fixed, but the unresolved domains vary from day to day.
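
Since the loop reads the whole mailbox once per pattern, one idea is to join all the patterns into a single extended regular expression so that the collection is scanned only once. The sketch below only builds that expression and checks it with grep -E on made-up lines. Passing it to mboxgrep (its manual lists an -E option for extended regular expressions) is an untested assumption, and the file names are hypothetical.

```shell
# Hypothetical pattern file and a tiny stand-in for the mailbox.
tmp=$(mktemp -d)
printf '%s\n' 'gamma.example' 'epsilon.example' > "$tmp/patterns.txt"
printf '%s\n' \
  'From: a@gamma.example' \
  'From: b@gammaXexample' \
  'From: c@epsilon.example' \
  'From: d@other.example' > "$tmp/mail.txt"

# Drop empty lines, escape the dots so they match literally, join with "|".
pattern=$(grep -v '^$' "$tmp/patterns.txt" | sed 's/\./\\./g' | paste -sd '|' -)
echo "$pattern"

# One pass over the text with the combined expression.
matches=$(grep -Ec "$pattern" "$tmp/mail.txt")
echo "$matches matching lines"

# Untested assumption: mboxgrep -E "$pattern" /path/to/mailbox > Emails.txt
```

Escaping the dots matters here: unescaped, gamma.example would also match gammaXexample.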

Thanks to Magic Banana for making me think about applying his suggestions.
