Distinguish the end of a list of IPv4 addresses from the following alphanumeric strings
The attached Mixed-Types.txt file is copied from a 2000-row list of resolved domain names.
The original list had been rearranged with sort -Vk 2,2, leaving about two dozen partially
resolved domains at the end. Those partially resolved domains have to be processed again
with dig, but by grep-ing six additional lines after the ANSWER SECTION because the IPv4
address is in the last line of the grep-ed output.
Selection of those last two dozen lines requires that they be counted somehow; for example:
awk '{print $1}' Mixed-Types.txt > PartAA.txt ;
grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' <(awk '{print $2}' Mixed-Types.txt) > PartBB.txt ;
paste -d ' ' PartAA.txt PartBB.txt > PartCC.txt ;
wc -l PartBB.txt | sed 's/PartBB.txt//g' ;
wc -l PartCC.txt | sed 's/PartCC.txt//g' ;
Then the main file can be separated into two parts with:
awk '{print $1,$2}' PartCC.txt | head -n 35 '-' > Resolved-2k-Domains.txt
awk '{print $1}' PartAA.txt | tail -n -25 '-' > Unresolved-two-dozen-Domains.txt
Perform a dig on any one of those last 25 domains and you're likely to get different answers
every day ...
Two questions remain:
(1) Is there a less cumbersome way of counting the resolved domains ?
(2) How does one move the wc -l [filename] counts into those last two scripts ?
Attachment | Size |
---|---|
Mixed-Types.txt | 1.84 KB |
As far as I understand, you want:
- the whole lines whose last field is an IPv4 address:
$ grep -E ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt > Resolved-2k-Domains.txt
- the first field of the remaining lines:
$ grep -vE ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt | cut -d ' ' -f 1 > Unresolved-two-dozen-Domains.txt
Both commands can run in parallel, and Mixed-Types.txt need not be sorted.
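The same split can also be sketched as a single awk pass that writes both files at once. This is only an illustration: the sample data, the temporary directory and the output names Resolved.txt and Unresolved.txt are made up, and the regex only checks for four dot-separated runs of digits (it does not validate the 0-255 range).

```shell
# Made-up sample standing in for Mixed-Types.txt (documentation IP ranges).
workdir=$(mktemp -d)
cat > "$workdir/Mixed-Types.txt" <<'EOF'
example.com 192.0.2.10
example.net 198.51.100.7
example.org alias.example.org.
example.edu ns1.example.edu.
EOF

# One pass over the file:
#  - lines whose last field looks like an IPv4 address are copied whole;
#  - for every other line, only the first field (the domain) is kept.
awk -v resolved="$workdir/Resolved.txt" -v unresolved="$workdir/Unresolved.txt" '
    $NF ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print > resolved; next }
    { print $1 > unresolved }
' "$workdir/Mixed-Types.txt"
```

Like the two grep commands, this does not depend on the input being sorted.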
Answering your questions anyway:
(1) Is there a less cumbersome way of counting the resolved domains ?
Grep has option -c for that:
$ grep -Ec ' ([0-9]{1,3}\.){3}[0-9]{1,3}$' Mixed-Types.txt
35
(2) How does one move the wc -l [filename] counts into those last two scripts ?
With $(...):
$ head -$(wc -l < PartBB.txt) PartCC.txt > Resolved-2k-Domains.txt
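$ tail -n +$(( $(wc -l < PartBB.txt) + 1 )) PartAA.txt > Unresolved-two-dozen-Domains.txt
(tail -n +K starts printing at line K, so this keeps only the lines after the resolved ones. It assumes the resolved lines come first, as in the sorted list; giving tail the full line count of PartCC.txt would return the whole file.)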
Notice also the removal of the useless uses of awk (which only copies its input here) and of sed (redirecting wc's input keeps it from printing the file name).
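A small sketch of how $(...) and $((...)) can carry the counts into head and tail. The files below are made-up stand-ins for PartAA.txt and PartBB.txt (five domains, the first three resolved); wrapping wc in $((...)) is only there to strip any padding some wc implementations print.

```shell
workdir=$(mktemp -d)
# Hypothetical stand-ins: 5 domains, of which the first 3 resolved.
printf '%s\n' a.example b.example c.example d.example e.example > "$workdir/PartAA.txt"
printf '%s\n' 192.0.2.1 192.0.2.2 192.0.2.3 > "$workdir/PartBB.txt"

# Command substitution captures the counts; no file name is printed
# because wc reads from redirected standard input.
total=$(( $(wc -l < "$workdir/PartAA.txt") ))
resolved=$(( $(wc -l < "$workdir/PartBB.txt") ))
unresolved=$(( total - resolved ))

# The first $resolved lines are the resolved domains, the rest are not.
head -n "$resolved" "$workdir/PartAA.txt" > "$workdir/head.txt"
tail -n "$unresolved" "$workdir/PartAA.txt" > "$workdir/tail.txt"
```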
The lesson for today was to learn the use of "$" in sed and in head as well as the use of cut,
not to mention counting with grep.
Another related question comes up:
mboxgrep has no provision for a pattern file like that in grep, nor can one pipe in a pattern
one-at-a-time, as from awk. My workaround has been to create a huge script file with Leafpad,
which takes only a few keystrokes. Is there a better way of listing all the emails containing
the pattern(s) ? My big scripts are held up by the dig searches, not by processing their outputs.
For the shell to read the lines on the standard input one by one:
while read pattern
do
(... "$pattern" is the line ...)
done
As always, < can redirect the standard input so that the lines are read from a file ("$1" below, i.e., the first argument of the shell script):
while read pattern
do
(... "$pattern" is the line ...)
done < "$1"
Of course, the variable containing the line need not be named "pattern".
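A runnable sketch of that loop, with a made-up pattern file in a temporary directory. The additions IFS= and -r are not part of the loop above: they keep read from trimming leading whitespace and from interpreting backslashes, which seems a safer default when the lines are later used as patterns.

```shell
workdir=$(mktemp -d)
# Hypothetical pattern file, one pattern per line.
printf '%s\n' 'example.org' 'example.edu' > "$workdir/patterns.txt"

count=0
while IFS= read -r pattern
do
    # Skip empty lines rather than running a command with an empty pattern.
    [ -n "$pattern" ] || continue
    count=$((count + 1))
    printf 'pattern %d: %s\n' "$count" "$pattern"
done < "$workdir/patterns.txt" > "$workdir/out.txt"
```

Because the loop reads from a redirection and not from a pipe, the variable count is still set after done.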
Magic Banana's assigned homework problem:
while read pattern
do
(... "$pattern" is the line ...)
done < "$1"
Solutions proposed by amenex:
while read pattern
do dig $pattern | grep -A6 ";; ANSWER SECTION:" | sed 's/;;\ ANSWER SECTION://g' |awk '{print $1,$5} NR==6{exit}' >> Resolved-XX-Domains-MB.txt
done < Unresolved-two-dozen-Domains.txt
plus some sed coding to clean up Resolved-XX-Domains-MB.txt:
sed 's/\.\ /\ /g' Resolved-XX-Domains-MB.txt | sed -r 's/\.$//' | sed 's/;;\ msec//g' | sed 's/;;//g' | grep "\S" '-' > Resolved-YY-Domains-MB.txt
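The same clean-up can probably be done in one sed process with several -e expressions, the last of which drops blank lines in place of grep "\S". The input below is a made-up guess at what Resolved-XX-Domains-MB.txt looks like after the awk step; raw.txt and clean.txt are hypothetical names.

```shell
workdir=$(mktemp -d)
# Made-up sample: blank-ish lines, trailing dots, and a ";; msec" leftover.
printf '%s\n' ' ' \
    'example.org. alias.example.org.' \
    'alias.example.org. 192.0.2.5' \
    ' ' \
    ';; msec' > "$workdir/raw.txt"

# Expressions run in order on every line:
#  1. drop the dot that ends the first field,
#  2. drop a dot at the end of the line,
#  3. and 4. remove the ";; msec" and ";;" leftovers,
#  5. delete lines that are empty or contain only whitespace.
sed -e 's/\. / /g' \
    -e 's/\.$//' \
    -e 's/;; msec//g' \
    -e 's/;;//g' \
    -e '/^[[:space:]]*$/d' "$workdir/raw.txt" > "$workdir/clean.txt"
```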
In response to my complaint about the limitations of mboxgrep:
while read pattern
do mboxgrep "$pattern" /media/george/523ff5d3-64ea-486d-ba82-58721680b667/george/Georgesbasement.com.A/Thumb256E/GeorgesBasement.com/AAspam/1998-2021.Newest
done < Unresolved-two-dozen-Domains.txt > Emails-02182022.txt
Takes a little longer to analyze a 200+ MB email collection ... bear in mind that the Email collection
remains fixed, but the unresolved domains vary from day to day.
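One detail of such loops is where the output redirection goes: a > inside the loop body truncates the file on every pass, while a > after done opens it once for the whole loop. A minimal sketch, with grep -F standing in for mboxgrep and a made-up three-line "mailbox"; all file names here are hypothetical.

```shell
workdir=$(mktemp -d)
# Stand-in mailbox and pattern list (made-up data).
printf '%s\n' 'From: a@example.org' 'From: b@example.edu' 'From: c@example.org' > "$workdir/mail.txt"
printf '%s\n' 'example.org' 'example.edu' > "$workdir/patterns.txt"

while IFS= read -r pattern
do
    # grep -cF counts the lines containing the literal pattern;
    # it is only a placeholder for the real search command.
    printf '%s %s\n' "$pattern" "$(grep -cF -- "$pattern" "$workdir/mail.txt")"
done < "$workdir/patterns.txt" > "$workdir/report.txt"
```

The report then holds one line per pattern instead of only the result for the last one.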
Thanks to Magic Banana for making me think about applying his suggestions ..