Are there better hostname resolvers than dig or nslookup?
Here are three hostnames that do not have embedded addresses:
dbbvxj3yl0rc53l1vf8ct-4.rev.dnainternet.fi
node-1w7jr9wnqcus27otj0ldvrzd4.ipv6.telus.net
ptr-gooru9pqyo5unun4j40.18120a2.ip6.access.telenet.be
Dig will return authoritative nameservers for the first and third
of the above three hostnames, but further requests either time out
or return failure responses.
Are there more sophisticated name resolvers available?
George Langford
Continuing this thread:
Long pointer records like the following aren't easily resolvable
by dig but their embedded IPv6 addresses are plainly visible:
2001-1c05-0001-e6e5-99c1-3191-b0e7-aca8.cable.dynamic.v6.ziggo.nl
2806-1000-0001-ae95-7768-0cea-3f9e-7ba0.ipv6.infinitum.net.mx
2603-6000-0001-60a5-1f58-fd79-7344-53fc.res6.spectrum.com
2a02-8388-57ba-fb8e-f2f7-e78c-7468-2b2c.cable.dynamic.v6.surfer.at
2001-b011-0001-4a39-6db8-b711-8a27-3281.dynamic-ip6.hinet.net
dynamic-2a00-1028-7f41-b8de-5e49-2809-729d-0736.ipv6.broadband.iol.cz
These addresses can be recovered with an awk script and some obvious editing:
awk '{print $1}' 'LongNames.txt' | sed 's/\-/\:/g' > LongNamesIPv6.txt
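The one-liner above also turns the hyphens in labels such as dynamic-ip6 into colons and leaves the domain suffix attached, which is presumably what the "obvious editing" cleans up. A minimal sketch that isolates only the eight hyphen-separated groups, assuming every group is exactly four hexadecimal digits (the function name hyphens_to_ipv6 is made up for illustration; lines without such a run are dropped):

```shell
# Hypothetical helper: keep only the run of eight 4-digit hex groups
# joined by hyphens, then turn those hyphens into colons.
hyphens_to_ipv6() {
  sed -n 's/.*\([[:xdigit:]]\{4\}\(-[[:xdigit:]]\{4\}\)\{7\}\).*/\1/p' "$@" |
    sed 's/-/:/g'
}

printf '%s\n' \
  '2001-b011-0001-4a39-6db8-b711-8a27-3281.dynamic-ip6.hinet.net' \
  'dynamic-2a00-1028-7f41-b8de-5e49-2809-729d-0736.ipv6.broadband.iol.cz' |
  hyphens_to_ipv6
# prints:
# 2001:b011:0001:4a39:6db8:b711:8a27:3281
# 2a00:1028:7f41:b8de:5e49:2809:729d:0736
```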
Here's one from which the separators (:) have been removed:
2a01cb0c000110f46cbc7ba4c95fcb23.ipv6.abo.wanadoo.fr
That's a bit more challenging and can be solved by trial-and-error
relocation of the separators until dig -x returns the original hostname:
dig -x 2a01:cb0c:0001:10f4:6cbc:7ba4:c95f:cb23 ==> IN PTR 2a01cb0c000110f46cbc7ba4c95fcb23.ipv6.abo.wanadoo.fr
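When the separators are simply missing and the hostname starts with exactly 32 hexadecimal digits, trial and error should not be needed: a colon goes after every fourth digit. A minimal sketch under that assumption (nosep_to_ipv6 is a hypothetical name; hostnames that do not start with 32 hexadecimal digits followed by a dot are silently skipped):

```shell
# Hypothetical helper: take the leading 32 hex digits of each hostname
# and insert a colon after every group of four.
nosep_to_ipv6() {
  sed -n 's/^\([[:xdigit:]]\{32\}\)\..*$/\1/p' "$@" |
    sed 's/[[:xdigit:]]\{4\}/&:/g; s/:$//'
}

echo '2a01cb0c000110f46cbc7ba4c95fcb23.ipv6.abo.wanadoo.fr' | nosep_to_ipv6
# prints 2a01:cb0c:0001:10f4:6cbc:7ba4:c95f:cb23
```

The result can then be handed to dig -x to confirm that the PTR record points back to the original hostname.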
Assuming the eight four-digit fields are always separated by either no character or one single character, which must always be the same, this horrible sed substitution does the work:
s/.*\([[:xdigit:]]\{4\}\)\([^[:xdigit:]]\{0,1\}\)\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\).*/\1:\3:\4:\5:\6:\7:\8:\9/
If you still want the separator to be always the same but want it to be any string of at most n characters, replace \{0,1\} with \{0,n\}. If you want the separator to be any string of at most n characters but possibly a different string between different fields, replace every \2 with [^[:xdigit:]]\{0,n\}.
The latter generalization makes it more likely that a neighboring label containing four consecutive hexadecimal digits will be wrongly interpreted as a field of the address. Imagine for instance a domain that would be 0123-4567-89ab-cdef-0123-4567-89ab-cdef.face.berkeley.edu. The label "face" consists of hexadecimal digits too. Because the leading .* is greedy, sed prefers the rightmost run of eight fields: if any single non-hexadecimal character is accepted as a separator, even when it differs from one field to the next, "4567" becomes the first field and "face" the last. If the separator must always be the same, the "." before "face" does not match the "-" used elsewhere and sed extracts the proper address.
Magic Banana's Horrible-Sed script has potential:
awk '{print $0}' 'MB-HorribleSed-Set01.txt' | sed 's/.*\([[:xdigit:]]\{4\}\)\([^[:xdigit:]]\{0,1\}\)\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\).*/\1:\3:\4:\5:\6:\7:\8:\9/g' > Set01-output.txt
Where MB-HorribleSed-Set01.txt is:
a7d74f79-4640-4158-b5c2-3097778fe363.fr-par-2.baremetal.scw.cloud
jobqueue-listener.jobqueue.netcraft.com-u840912b2930611eab47d156d838d6ab1u-digitalocean
jobqueue-listener.jobqueue.netcraft.com-u8af4d1e48e2711ea94c96760838d6ab1u-digitalocean-2gb
jobqueue-listener.jobqueue.netcraft.com-ubd544f468e2411ea94c96760838d6ab1u-digitalocean-2gb
The first hostname is a mess of hexadecimal characters, but the script deciphers the other three nicely.
Alas, dig -x doesn't make any headway with the resulting pretty IPv6 addresses. Here they are:
8409:12b2:9306:11ea:b47d:156d:838d:6ab1
8af4:d1e4:8e27:11ea:94c9:6760:838d:6ab1
bd54:4f46:8e24:11ea:94c9:6760:838d:6ab1
Note that the first hextet is a bald-faced lie; IPv6 hasn't gotten there yet. There's further obfuscation at play.
There's no 'gotcha' here. These hostnames were plucked from the wild; I may know in a week
or so if my ongoing nMap scans come up with some similar PTR records ...
George Langford
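One possible explanation, offered as a guess rather than a finding: the 32-digit strings above may not be IPv6 addresses at all. Globally routable unicast IPv6 addresses are currently allocated from 2000::/3, so their first hexadecimal digit is 2 or 3, whereas a UUID carries its version number in its 13th digit. The sketch below applies both checks; the function names are hypothetical, and a match is only a hint, not proof.

```shell
# First hex digit 2 or 3 means the string could sit inside 2000::/3.
in_global_unicast() {
  case "$1" in
    [23]*) echo yes ;;
    *)     echo no ;;
  esac
}

# 13th hex digit once separators are stripped; in a UUID this is the
# version field (1 = time-based, 4 = random).
thirteenth_digit() {
  printf '%s\n' "$1" | tr -d ':-' | cut -c13
}

for h in 840912b2930611eab47d156d838d6ab1 \
         a7d74f79-4640-4158-b5c2-3097778fe363 \
         2A0011100253910B5CA1EB1EC7F99AE1; do
  echo "$h global=$(in_global_unicast "$h") digit13=$(thirteenth_digit "$h")"
done
# prints:
# 840912b2930611eab47d156d838d6ab1 global=no digit13=1
# a7d74f79-4640-4158-b5c2-3097778fe363 global=no digit13=4
# 2A0011100253910B5CA1EB1EC7F99AE1 global=yes digit13=9
```

All three netcraft strings share 11ea in digits 13 to 16 and the same final eight digits, which is at least consistent with time-based UUIDs generated on one machine.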
"awk '{print $0}' 'MB-HorribleSed-Set01.txt' | sed ..."
Just give MB-HorribleSed-Set01.txt as an argument to sed!
"The first hostname is a mess of hexadecimal characters, but the script deciphers the other three nicely."
I wrote in my last post: "If you want the separator to be any string of at most n characters but possibly a different string between different fields, replace every \2 with [^[:xdigit:]]\{0,n\}". With n = 1, which is enough for the example you gave, and additionally removing the now useless second pair of parentheses, the substitution becomes:
s/.*\([[:xdigit:]]\{4\}\)[^[:xdigit:]]\{0,1\}\([[:xdigit:]]\{4\}\)[^[:xdigit:]]\{0,1\}\([[:xdigit:]]\{4\}\)[^[:xdigit:]]\{0,1\}\([[:xdigit:]]\{4\}\)[^[:xdigit:]]\{0,1\}\([[:xdigit:]]\{4\}\)[^[:xdigit:]]\{0,1\}\([[:xdigit:]]\{4\}\)[^[:xdigit:]]\{0,1\}\([[:xdigit:]]\{4\}\)[^[:xdigit:]]\{0,1\}\([[:xdigit:]]\{4\}\).*/\1:\2:\3:\4:\5:\6:\7:\8/
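Long substitutions of this kind are easier to check when they are assembled from named pieces. A sketch, assuming a POSIX shell and the same rule as the substitution above (at most one non-hexadecimal character between fields, not necessarily the same one each time); the variable names h and s and the function extract_ipv6 are invented for illustration:

```shell
h='\([[:xdigit:]]\{4\}\)'    # one captured field of four hex digits
s='[^[:xdigit:]]\{0,1\}'     # optional single-character separator
pattern=".*$h$s$h$s$h$s$h$s$h$s$h$s$h$s$h.*"

# Hypothetical wrapper: rewrite each line as eight colon-separated fields.
# Lines that do not match are passed through unchanged, as with plain sed.
extract_ipv6() {
  sed "s/$pattern/\1:\2:\3:\4:\5:\6:\7:\8/" "$@"
}

echo 'jobqueue-listener.jobqueue.netcraft.com-u840912b2930611eab47d156d838d6ab1u-digitalocean' |
  extract_ipv6
# prints 8409:12b2:9306:11ea:b47d:156d:838d:6ab1
```

Changing the separator rule then means editing s in one place instead of seven.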
Actually, a few came along a little sooner, along with their addresses:
2a00:1110:253:910b:5ca1:eb1e:c7f9:9ae1 2A0011100253910B5CA1EB1EC7F99AE1.mobile.pool.telekom.hu
2a00:1110:656:55f8:f243:e02e:c1dc:21d1 2A001110065655F8F243E02EC1DC21D1.mobile.pool.telekom.hu
2a00:1110:22a:ff9b:5fc4:7808:faa2:51a3 2A001110022AFF9B5FC47808FAA251A3.mobile.pool.telekom.hu
awk '{print $2}' 'SetA.txt' | sed 's/.*\([[:xdigit:]]\{4\}\)\([^[:xdigit:]]\{0,1\}\)\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\)\2\([[:xdigit:]]\{4\}\).*/\1:\3:\4:\5:\6:\7:\8:\9/g' > SetA-output.txt
The answers are:
2A00:1110:0253:910B:5CA1:EB1E:C7F9:9AE1
2A00:1110:0656:55F8:F243:E02E:C1DC:21D1
2A00:1110:022A:FF9B:5FC4:7808:FAA2:51A3
And dig -x is happy now.
George Langford
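The extracted addresses are written with upper-case digits and leading zeros, while the first column of SetA.txt uses lower case and drops the zeros, so a plain string comparison between the two would report mismatches. A sketch of a normalizer for fully written-out addresses (eight groups, no :: shorthand); normalize_ipv6 is a hypothetical name, and sed -E is assumed to be available, as it is in GNU and BSD sed:

```shell
# Hypothetical helper: lower-case the hex digits and strip up to three
# leading zeros from each group so both spellings compare equal.
normalize_ipv6() {
  tr 'A-F' 'a-f' | sed -E 's/(^|:)0{1,3}/\1/g'
}

echo '2A00:1110:0253:910B:5CA1:EB1E:C7F9:9AE1' | normalize_ipv6
# prints 2a00:1110:253:910b:5ca1:eb1e:c7f9:9ae1
```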
In preparation for applying Horrible sed's in their various forms, I set about the task of extracting
the IPv6 and IPv4 addresses from my list of gratuitously looked up hostnames gleaned from nearly two
hundred sets of publicly available recent visitor data. It's a long list, way too big for my LibreOffice
Calc crutches.
Separate the IPv6's from the GLU hostnames from which the IPv4's have just been removed:
grep ":" SourceFile.txt | sort | uniq -c | awk '{print $2}' '-' > IPv6-List.txt
It's difficult to use comm; I used the invert-match option in grep instead:
grep -v ":" SourceFile.txt | sort | uniq -c | awk '{print $2}' '-' > NoIPv6-List.txt
The NoIPv6-List.txt still has a lot of not-looked-up IPv4 addresses.
Ref.: https://superuser.com/questions/202818/what-regular-expression-can-i-use-to-match-an-ip-address
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /etc/hosts
Applied here:
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' NoIPv6-List.txt > IPv4-List.txt
Alas, this IPv4-List file includes many addresses extracted from the hostnames where '.' is used as
the separator. I'll be doing that with a sed script later, but now I want the IPv4's that were not
gratuitously looked up (because they couldn't be ?). Reversing grep would erase those legitimate PTR's.
Is there a sed way of doing this separation?
George Langford
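If the aim is to pick out the lines that consist of nothing but an IPv4 address, while leaving hostnames with embedded dotted digits alone, anchoring the pattern to the whole line may be enough. A sketch using grep's -x option, assuming one entry per line; the function names are hypothetical and host-203.0.113.7.example.net is an invented hostname:

```shell
quad='[0-9]{1,3}(\.[0-9]{1,3}){3}'

# Lines that are a dotted quad and nothing else.
bare_ipv4() { grep -Ex "$quad" "$@"; }

# Everything else, including hostnames that merely embed dotted digits.
not_bare_ipv4() { grep -Evx "$quad" "$@"; }

printf '%s\n' '2.63.83.182' '17715510250.tvnsul.com.br' \
  'host-203.0.113.7.example.net' | bare_ipv4
# prints 2.63.83.182
```

Because the same pattern drives both functions, every input line lands in exactly one of the two outputs.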
I do not understand the question. Please, give an example of input (showing all cases) and the expected output.
Magic Banana suggests that I cite a few ferinstances:
I've changed a couple of the filenames to separate them in my tenuous logic:
First script:
grep ":" IPv6-SourceList.txt | sort | uniq -c | awk '{print $2}' '-' > IPv6-List.txt
Second script:
grep -v ":" IPv6-SourceList.txt | sort | uniq -c | awk '{print $2}' '-' > NoIPv6-List.txt
#note: IPv6-SourceList.txt originally had only one IPv4: 2.63.83.182
Fourth script:
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' NoIPv6-List.txt > IPv4-List.txt
Ref: https://superuser.com/questions/202818/what-regular-expression-can-i-use-to-match-an-ip-address
#note: Now there are quite a few (53) additional IPv4's that have crept in because of the weakness of the script.
I also tried a "two-minute drill" by splitting all the addresses on the dots "." and reassembling the first four octets
with a $ as the separator between the fourth octet and the hostname-remnant in the fifth column:
awk '{print $1}' 'IPv4-SourceList.txt' | sed 's/\./\t/g' '-' | awk '{print $1"."$2"."$3"."$4"$"$5}' '-' > IPv4-List.txt
At this writing I'm stumped by the task of sorting the $-separated file to capture just the rows containing
proper IPv4 addresses. This script at least shouldn't snatch any IPv4 prefixes from the dot-separated PTR's.
Note: IPv6-SourceList.txt and IPv4-SourceList.txt were each extracted from the same original multi-megabyte
source file.
George Langford
Attachment | Size |
---|---|
IPv6-SourceList.txt | 18.67 KB |
IPv4-SourceList.txt | 11.88 KB |
You attached an input but still no expected output. Assuming you again want to extract the IPV4 addresses, the sed substitution can be adapted (here with the four fields separated by 1 to 5 characters, not necessarily always the same):
s/[^0-9]*\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\).*/\1.\2.\3.\4/
Some of the lines in IPv4-SourceList.txt do not contain IP addresses or may contain addresses that cannot be uniquely determined (is the address in 17715510250.tvnsul.com.br 177.155.10.250 or 177.155.102.50?).
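The ambiguity can be made explicit by listing every way an unseparated digit string splits into four octets between 0 and 255. A sketch in portable awk; split_octets is a hypothetical name, and rejecting octets with a leading zero is an assumption that may not hold for every naming scheme:

```shell
# Hypothetical helper: print every valid dotted-quad reading of a digit string.
split_octets() {
  awk -v s="$1" '
    # An octet is 1 to 3 digits, at most 255, with no leading zero.
    function ok(o) {
      return o ~ /^[0-9]+$/ && length(o) <= 3 && o + 0 <= 255 &&
             (length(o) == 1 || substr(o, 1, 1) != "0")
    }
    BEGIN {
      n = length(s)
      for (i = 1; i <= 3; i++)
        for (j = 1; j <= 3; j++)
          for (k = 1; k <= 3; k++) {
            l = n - i - j - k          # length left for the fourth octet
            if (l < 1 || l > 3) continue
            a = substr(s, 1, i);         b = substr(s, i + 1, j)
            c = substr(s, i + j + 1, k); d = substr(s, i + j + k + 1, l)
            if (ok(a) && ok(b) && ok(c) && ok(d)) print a "." b "." c "." d
          }
    }'
}

split_octets 17715510250
# prints:
# 177.155.10.250
# 177.155.102.50
```

More than one line of output means the hostname alone cannot settle the address; a reverse lookup of each candidate would be needed.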
Magic Banana lamented, "You attached an input but still no expected output."
... and then jumped to the conclusion, " Assuming you again want to extract the IPV4 addresses..."
While that's the long-term (end of next week ...) goal, the immediate concern is to find a reliable script to
separate the painfully obvious IPv4 addresses, not from the bodies of the PTR's, but from the list of them. My
subterfuge of switching from the dot '.' separator to the $ separator for the fifth column at least cleans up
the visual aspect of the sorting problem. The output of that script flags the fourth octets of the proper IPv4
addresses with a trailing dollar sign.
I assumed that the 2nd script would provide its own answer, as it's laden with 53 unintended IPv4 consequences.
The 1st script's output is 100% IPv6 addresses, with no stowaways. The NoIPv6-List.txt file has PTR's with
glaringly obvious IPv6 origins, but no actual IPv6 addresses that you could resolve with dig -x, plus one IPv4
leftover.
Those IPv4 and IPv6 addresses that I'm trying to cut out of the herd are special, because they weren't looked
up by that apache hostname-lookup option (which to apache's credit is deprecated in their instructions), because
the associated server was unavailable or misconfigured. Now I want to see what else is going on at those servers, a
few weeks or months later.
Once the heifers have been separated from the bulls, then the ensuing tasks are a little easier. Analyzing the
contents of the near-infinite address spaces of IPv6 CIDR blocks is best addressed by Magic Banana's random-
selection of IPv6 addresses to be searched with nMap scripts, whereas the very cramped space of IPv4 CIDR
blocks can be addressed by inquiring with more direct scripts.
Finding those multi-addressed hostnames in the outputs of scripts that provide answers in CPU time scales is a
huge step forward compared to nosing around, one hostname at a time, for the finite data gathered by a few
Internet watchdogs that is agglomerated in the near-infinite data hoarded by Google. The geek definitely is on
a better track than the uneducated plodder.
George Langford
Attachment | Size |
---|---|
NoIPv6-List.txt | 5.96 KB |
IPv6-List.txt | 12.71 KB |
"While that's the long-term (end of next week ...) goal, the immediate concern is to find a reliable script to separate the painfully obvious IPv4 addresses, not from the bodies of the PTR's, but from the list of them."
Still no expected output...
If the so-called "painfully obvious IPv4 addresses" are those that my last sed's substitution extract, then you can just 'grep' and/or 'grep -v' the output, using the regular expression you found on https://superuser.com/questions/202818/what-regular-expression-can-i-use-to-match-an-ip-address
If you want one single AWK program to do everything (maybe faster, maybe not):
$ awk '{ $0 = gensub(/[^0-9]*([0-9]{1,3})[^0-9]{1,5}([0-9]{1,3})[^0-9]{1,5}([0-9]{1,3})[^0-9]{1,5}([0-9]{1,3}).*/, "\\1.\\2.\\3.\\4", "1"); if ($0 ~ /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/) print >> "IPv4_found"; else print >> "IPv4_not_found" }' IPv4-SourceList.txt
I suspect you do not consider the substitution I wrote "painfully obvious". Maybe you want the separator to always be one single character (but not necessarily always the same). If so, you can modify the substitution accordingly, as I explained in my first post in this thread (for IPv6 addresses). The single-AWK-program solution becomes:
$ awk '{ $0 = gensub(/[^0-9]*([0-9]{1,3})[^0-9]([0-9]{1,3})[^0-9]([0-9]{1,3})[^0-9]([0-9]{1,3}).*/, "\\1.\\2.\\3.\\4", "1"); if ($0 ~ /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/) print >> "IPv4_found"; else print >> "IPv4_not_found" }' IPv4-SourceList.txt
Fewer addresses end up in IPv4_found. On the positive side, it becomes less likely that it contains wrong addresses: the output is correct when the PTR contains digits before the actual address, but these digits are separated by more than one character (against five in the previous solution).
If you want to keep the association with the PTR, then do not overwrite $0, and print it along the address when found. For the first AWK program above:
$ awk '{ addr = gensub(/[^0-9]*([0-9]{1,3})[^0-9]{1,5}([0-9]{1,3})[^0-9]{1,5}([0-9]{1,3})[^0-9]{1,5}([0-9]{1,3}).*/, "\\1.\\2.\\3.\\4", "1") } { if (addr ~ /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/) print $0, addr >> "IPv4_found"; else print >> "IPv4_not_found" }' IPv4-SourceList.txt
With sed (whose output can be grepped, I repeat), you can just paste the input file with its output:
$ sed 's/[^0-9]*\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\).*/\1.\2.\3.\4/' IPv4-SourceList.txt | paste IPv4-SourceList.txt -
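The pattern [0-9]{1,3} also accepts values such as 999, so candidates extracted this way may deserve a range check before they are looked up. A sketch in portable awk (no gensub needed); valid_ipv4 is a hypothetical name and the input is assumed to hold one candidate address per line:

```shell
# Hypothetical filter: keep only lines made of four dot-separated
# numbers, each between 0 and 255.
valid_ipv4() {
  awk -F. 'NF == 4 {
    for (i = 1; i <= 4; i++)
      if ($i !~ /^[0-9]+$/ || $i + 0 > 255) next
    print
  }' "$@"
}

printf '%s\n' '177.155.10.250' '999.1.1.1' '1.2.3' | valid_ipv4
# prints 177.155.10.250
```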
After a good night's sleep, and while still waiting for some very slow nMap scans to finish, I found
a roundabout way of separating those pesky IPv4 addresses from the source list without accidentally
dismembering any PTR's that deserve to be treated differently later.
Start here with the fourth script from my June 22 posting, applied globally, but with a larger original
source file, one-tenth of the Recent Visitors collection:
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' SourceList.0pt1.txt > IPv4-ListB.txt
Continue with my two-minute drill, adding a modified fourth script from June 22:
awk '{print $1}' 'SourceList.0pt1.txt' | sed 's/\./\t/g' '-' | awk '{print $1"."$2"."$3"."$4"$"$5}' '-' > IPv4-ListC.txt;
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\$' IPv4-ListC.txt | tr -d '$' | sort -nrk 1 > IPv4-Only-List.txt
Use comm to match IPv4-Only-List.txt with SourceList.0pt1.txt
sort -k 1 SourceList.0pt1.txt > Temp0623B.txt; sort -k 1 IPv4-Only-List.txt > Temp0623C.txt;
comm -12 Temp0623B.txt Temp0623C.txt > Clean.IPv4-List.txt ; rm Temp0623B.txt Temp0623C.txt
The IPv4 addresses found in Clean.IPv4-List.txt look OK to me; however, there should be an effort to come up with
a verification script.
Try grep:
grep -hf Clean.IPv4-List.txt SourceList.0pt1.txt |more
This script starts out OK, with a nice list of IPv4 addresses, but soon turns into a memory hog
that has to be stopped.
George Langford
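A likely reason for the slowdown, stated as an assumption: grep -f treats each of the thousands of lines as a regular expression in which every dot matches any character, and without -x a short address also matches inside longer entries. Fixed-string, whole-line matching is usually much lighter. A sketch with small stand-in files; verify_whole_lines is a hypothetical name:

```shell
# Hypothetical check: list the source lines that are exactly equal to
# one of the candidate addresses (-F fixed strings, -x whole line).
verify_whole_lines() {
  # $1: file of candidate addresses, $2: source list
  grep -Fxf "$1" "$2" | sort -u
}

tmp=$(mktemp -d)
printf '%s\n' '1.2.3.4' > "$tmp/clean.txt"
printf '%s\n' '1.2.3.4' '11.2.3.45' '1x2y3z4' 'host-1.2.3.4.example.net' > "$tmp/source.txt"
verify_whole_lines "$tmp/clean.txt" "$tmp/source.txt"
# prints 1.2.3.4
rm -r "$tmp"
```

With the original grep -hf, the other three stand-in lines would all have matched as well.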
Attachment | Size |
---|---|
SourceList.0pt1.txt | 581.38 KB |
IPv4-ListB.txt | 294.91 KB |
Clean.IPv4-List.txt | 253.03 KB |
IPv4-Only-List.txt | 294.46 KB |
Magic Banana grumbled:
"Still no expected output..."
Maybe our postings passed each other by like ships in the night ... I really did add several example files
attached to their relevant postings or referenced to a preceding posting. Maybe a list of what appears to
be missing would help. I was careful to test all the scripts with their source files.
"If the so-called "painfully obvious IPv4 addresses" are those that my last sed's substitution extract, ..."
Not so. Too much is being read into my hyperbolic remark. Those pesky IPv4 addresses are listed, thoroughly
intermixed, with the looked-up PTR data. They're difficult to unmix, because sorting doesn't discriminate
between numerical parts of PTR's and real IP addresses. IPv6 addresses are easy, because they can be found
with grep by searching on the colon ':' separator.
Repeating: those IPv4 and IPv6 addresses that weren't looked up by the website's apache server had no DNS
service, and so are in classes by themselves. The addresses that Magic Banana's excellent scripts extract
can often be tested by applying reverse DNS, but I'm finding that some IPv6 addresses that came to my
attention by means of nMap scans can be rather refractory, sometimes because their server has been powered
down or because new PTR records have been applied to them. I even found one set of addresses that came back
online between a Sunday nMap scan and a Tuesday re-scan.
Magic Banana's fourth script in his June 23 (17:00) works convincingly:
sed 's/[^0-9]*\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\)[^0-9]\{1,5\}\([0-9]\{1,3\}\).*/\1.\2.\3.\4/' IPv4-SourceList.txt | paste IPv4-SourceList.txt - > FourthScriptOutput.txt
Bear in mind that various obfuscation schemes may have been applied to these PTR's, as Magic Banana addressed
some time ago with more sophisticated scripts, so the IP addresses have to be checked by reverse DNS against
their PTR records as recorded in the Recent Visitor data. I'll check these with:
awk '{print $2}' 'FourthScriptOutput.txt' > Temp0623E.txt ; dig -x -f Temp0623E.txt | grep -A 1 "ANSWER SECTION:" '-' | awk '{print $5}' '-' |more
The output of this ad hoc script is simply, 92.242.140.21, which is the catch-all for unresolvable addresses.
That's the same as the apache server reported for those IPv4 addresses in May 2020. Their CIDR/24 scans may be
more revealing. That said, my check does not discover any IPv4 addresses that inadvertently came from any of
the PTR's. Incidentally, the IPv6-List.txt file gives the identical result, which is as expected, as there are
no PTR's with colon ':' separators.
In the meantime, collecting multi-addressed PTR's lurking in CIDR blocks has been proving very fruitful,
especially when applied to the CIDR/32 portions (i.e., the left-most two hextets of the not-looked-up
IPv6 addresses in the Recent Visitor data, as well as the right-most octets of the not-looked-up IPv4
addresses in the same Recent Visitor data). There remain a great many simple PTR records with no
embedded IP data at all or with inscrutable obfuscations that can only be resolved by the nMap searching
scripts which are filling up my storage media.
Thank you for your continuing constructive analyses.
George Langford
Attachment | Size |
---|---|
IPv4-SourceList.txt | 11.88 KB |
FourthScriptOutput.txt | 17.97 KB |
"Magic Banana's fourth script in his June 23 (17:00) works convincingly"
OK. However I realized I was lying when I wrote that "the output is correct when the PTR contains digits before the actual address (...)". Everything up to the last digit before the address was not removed. Here is a similar solution that does not suffer from that problem and that is nicer, thanks to AWK's FPAT variable:
awk -v FPAT='[0-9]{1,3}[^0-9]{1,5}[0-9]{1,3}[^0-9]{1,5}[0-9]{1,3}[^0-9]{1,5}[0-9]{1,3}' '$1 { printf "%s ", $0; gsub(/[^0-9]+/, ".", $1) } { print }' IPv4-SourceList.txt
No second column means no address was found in the PTR, in the first column. As I have already explained, the "5" can be tuned: it controls a trade-off between precision and recall.