Scripting the random replacement of fields in an IPv6 address
In an effort to estimate the degree to which a block of Internet addresses has been assigned the same PTR record,
I'm attempting to reassign the contents of randomly selected fields in the retrieved addresses of the block.
I've found a script which generates a random integer from 4 through 7: shuf -i 4-7 -n 1
Reference: https://stackoverflow.com/questions/2556190/random-number-from-a-range-in-a-bash-script
Also another script to create a random four-digit hexadecimal number, suitably modified: echo "#$(openssl rand -hex 2)" | tr -d '\#'
https://stackoverflow.com/questions/40277918/shell-script-to-generate-random-hex-numbers/40278205
Both produce the desired outputs, but I have been unable to write a script which takes the field number generated
by the first command and replaces that field of the address with the output of the second command.
This technique is based on my training in metallurgy, where averaging over randomly selected fields of a microscopic
view can be proven mathematically to represent the properties of the entire view.
Why I want to do this: The number of addresses in a block such as field:field::/32 is too large to look up over several
lifetimes.
I've written a script which replaces the last field in the IPv6 address with :0/112 so that the script which looks up the
PTR records has just 64K addresses & PTR's in its output. Repeating the script for a hundred or so found IPv6 addresses
takes several hours, which is tolerably quick for my purposes. Repeating that task for my suggested random changes in the
source IPv6 addresses within just the 4th through 7th fields will not usually cause the search to stray outside the original
CIDR blocks of the source addresses. That would randomly sample the originating CIDR block, all the more so, the more times
the proposed script is run.
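The last-field substitution described above can be sketched like this (my reconstruction; the author's actual script is not shown in the thread):

```shell
# Replace the final field of an IPv6 address with "0/112" so that a
# follow-up lookup covers only the 65,536 addresses of that /112.
truncated=$(echo 2a02:2788:1000:0:6037:fc9a:27ac:10c7 | sed 's/:[0-9a-f]*$/:0\/112/')
echo "$truncated"   # 2a02:2788:1000:0:6037:fc9a:27ac:0/112
```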
I've done something like this by running my basic nmap search script on two data sets for the same PTR record, one gleaned
from the Internet with a search on the hostname/PTR record, and the other from a database of publicly available recent-visitor
data gathered without first applying hostname-lookup to the original visitor addresses. Each address set was different from
the other, both had around a hundred addresses, and the outputs of each nmap search script list over six million identical
PTR records, making twelve million ... how many more are there ?
George Langford
Also another script to create a random four-digit hexadecimal number, suitably modified
I would not call that modification "suitable". At least, it is not "reasonable" to call an additional program (tr) to remove a character that was purposefully concatenated. Just don't concatenate it in the first place: simply write 'openssl rand -hex 2'.
Summing up (could you try to be brief and clear?), you want to replace, in an IPv6 address, the n-th group of four hexadecimal digits (with n a random integer from 4 to 7) with a random one. Right? You did the hardest, finding ways to generate the random numbers. The rest is sed's classical s command to substitute:
$ echo $ipv6_address | sed s/'[0-9a-f]\{4\}'/$(openssl rand -hex 2)/$(shuf -i 4-7 -n 1)
Magic Banana translated my wordy English:
> Summing up (could you try to be brief and clear?), you want to replace, in an IPv6 address,
> the n-th group of four hexadecimal digits (with n a random integer from 4 to 7) with a random one. Right?
Restated:
... replacing the [randomly selected] n-th field [of eight, skipping the first three which likely define
the CIDR block of the IPv6 address] to be replaced with another [randomly generated] group of four
hexadecimal digits ...
Example:
Original IPv6 address: 2a02:2788:1000:0:6037:fc9a:27ac:10c7
Select field number with shuf -i 4-7 -n 1 ==> 5
Generate a new field with [suitably simplified] echo "$(openssl rand -hex 2)" ==> 83bb
Place the new field: 2a02:2788:1000:0:83bb:fc9a:27ac:10c7 (83bb is the new 5th field)
Demonstrating Magic Banana's elegant & correctly interpreted solution:
echo 2a02:2788:1000:0:6037:fc9a:27ac:10c7 | sed s/'[0-9a-f]\{4\}'/$(openssl rand -hex 2)/$(shuf -i 4-7 -n 1)
with the result: 2a02:2788:1000:0:6037:fc9a:2b1e:10c7 (2b1e is the new 7th field)
Imagine the task of writing the PTR's of the 79,228,162,514,264,337,593,543,950,336 addresses in 2a02:2788::/32
Ref: https://www.ultratools.com/tools/netMaskResult?ipAddress=2a02%3A2788%3A%3A%2F32
Now imagine the task of looking up all those 79,228,162,514,264,337,593,543,950,336 addresses ...
For those of you at home: 2a02:2788:5d3a:f8e2:83bb:198c:4a68:b1be gives the same nslookup
result as all the original and modified IPv6 addresses here.
Could it be that nslookup is being hijacked ?
George Langford
While investigating Magic Banana's elegant & correctly interpreted solution:
echo 2a02:2788:1000:0:6037:fc9a:27ac:10c7 | sed s/'[0-9a-f]\{4\}'/$(openssl rand -hex 2)/$(shuf -i 4-7 -n 1)
I discovered that this script occasionally places the newly generated four-digit hex number in the eighth field.
Examples:
2a02:2788:1000:0:6037:fc9a:27ac:5467 and 2a02:2788:1000:0:6037:fc9a:27ac:5558
However, {shuf -i 4-7 -n 1} stays within its allotted boundaries, producing 4's, 5's, 6's & 7's randomly; never 8's.
Also, {openssl rand -hex 2} consistently cranks out four-digit hex numbers.
Later I realized that the fourth field has only one digit, which should have been padded to four characters, so
the script skips over that field, leading to the unexpected entries in field eight.
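The skip is easy to verify with a fixed replacement string instead of a random one (my check, not from the thread):

```shell
# The single-digit group "0" is never matched by '[0-9a-f]\{4\}', so the
# seventh occurrence of the pattern is the eighth field, not the seventh.
skipped=$(echo 2a02:2788:1000:0:6037:fc9a:27ac:10c7 | sed s/'[0-9a-f]\{4\}'/XXXX/7)
echo "$skipped"   # 2a02:2788:1000:0:6037:fc9a:27ac:XXXX
```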
What I'm actually trying to do is to replace all of the last six fields of the initiating IPv6 address with new
four-digit hex numbers, run the script 65,536 times, and then use a simplified version of my nmap script to capture
the registered PTR for each new IPv6, a far quicker approach than my earlier method of capturing the PTR records of
a hundred, 64K groups of IPv6 addresses, which does the hostname lookups 6,553,600 times. Much more probative and
a hundred times quicker.
So far, I've tried replacing (shuf -i 4-7 -n 1) with $i in a six-step "do" loop and nesting that inside a 65,536-step
outer loop (as in Basic), but I'm wrestling with the syntax of that approach.
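For what it's worth, the nested loop described above might look like this in Bash (my guess at the intended syntax, with a tiny sample count and the fourth field padded to 0000 so the sed occurrence count lines up with the field number):

```shell
# Inner loop: replace fields 3 through 8, one after the other, each with a
# fresh random four-digit group. Outer loop: one address per sample (the
# post aims at 65,536 samples; 3 are generated here for illustration).
addr=2a02:2788:1000:0000:6037:fc9a:27ac:10c7
for sample in $(seq 3); do
  out=$addr
  for i in $(seq 3 8); do
    out=$(echo "$out" | sed s/'[0-9a-f]\{4\}'/$(openssl rand -hex 2)/$i)
  done
  echo "$out"
done
```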
Back to the prospect that nslookup (and dig -x) are getting hijacked: Doesn't matter, as the server's gratuitous
hostname lookups would be hijacked the same way. And none of them can be resolved to any usable numerical address.
George Langford
Later I realized that the fourth field has only one digit, which should have been padded to four characters, so the script skips over that field, leading to the unexpected entries in field eight.
Indeed. You can change the regular expression so that it catches the character ":" and what precedes it down to the previous ":" (excluded):
$ echo $ipv6_address | sed s/'[^:]*:'/$(openssl rand -hex 2):/$(shuf -i 4-7 -n 1)
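Checked against the problematic address (my test, with a fixed string instead of a random group), occurrence 5 of '[^:]*:' now really is the fifth field, single-digit group and all:

```shell
# '[^:]*:' matches every "field:" regardless of width, so the occurrence
# number equals the field number even past the bare "0" fourth field.
fixed=$(echo 2a02:2788:1000:0:6037:fc9a:27ac:10c7 | sed s/'[^:]*:'/XXXX:/5)
echo "$fixed"   # 2a02:2788:1000:0:XXXX:fc9a:27ac:10c7
```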
What I'm actually trying to do is to replace all of the last six fields of the initiating IPv6 address with new four-digit hex numbers, run the script 65,536 times (...)
If I properly understood, here is a one-line solution:
$ prefix=0123:4567; od -A n -N 786432 -xw12 /dev/urandom | tr ' ' : | sed s/^/$prefix/
One call of 'od' to read 786,432 bytes from /dev/urandom should be orders of magnitude faster than 393,216 calls of 'openssl' to generate 2 bytes each time. I'll let you read 'info od' if you want to understand the options I use.
After doing the homework suggested by Magic Banana, I tried out his well thought out one-line solution, substituting
the prefix for a representative multi-addressed PTR record (2a02:120b):
$ prefix=2a02:120b ; od -A n -N 786432 -xw12 /dev/urandom | tr ' ' : | sed s/^/$prefix/
The time to run this script, which produces 65,536 complete IPv6 addresses ready for analysis,
is 0.000 second CPU time & 0.000 second real time, even when run four times at once.
It appears not to produce any "000x", "00xx" or "0xxx" fields, which do occur in real IPv6 addresses.
"-A n" suppresses the offset column; "-N 786432" limits the number of bytes read; and "-xw12" prints
twelve bytes per line as six four-character hexadecimal fields. Together, these arguments gather
sufficient bytes of random data from /dev/urandom to fill the 3rd through 8th fields of 65,536 IPv6 addresses.
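A deterministic way to inspect that od format is to read from /dev/zero instead of /dev/urandom (my illustration; note that -v must be added here, because od abbreviates repeated output lines to "*"):

```shell
# 24 bytes of zeros yield two lines of six four-digit hex fields each; tr
# turns od's separating blanks into the colons of an IPv6 suffix.
od -A n -v -N 24 -xw12 /dev/zero | tr ' ' :
```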
Putting it all together with our nmap script and an enumeration script:
prefix=2a02:120b ; od -A n -N 786432 -xw12 /dev/urandom | tr ' ' : | sed s/^/$prefix/ > TempMB4320A.txt ;
nmap -6 -sn -T4 -sL -iL TempMB4320A.txt | grep "Nmap scan report for " - | tr -d '()' | sort -k5 | awk 'NR >= 1 { print $5, $6 }' | awk 'NR >= 1 { print $2, $1 }' | uniq -Df 1 | sed '/^\s\s*/d' | awk '{ print $2 "\t" $1 }' > TempMB4320B.txt ;
awk '{print $2,$1}' 'TempMB4320B.txt' | sort -k 2 | uniq -cdf 1 | awk '{print $3"\t"$1}' '-' > Multi-IPv6-MB-dynamic.wline.6rd.res.cust.swisscom.ch-Tally.txt ;
rm TempMB4320A.txt TempMB4320B.txt
Final output file: dynamic.wline.6rd.res.cust.swisscom.ch 65536
Giving a similar answer in about six minutes that my other scripts took 58 hours to complete when run on
508 IPv6 addresses: All the IPv6 addresses have the same dig -x result, even when randomly generated rather
than hunted down with Internet searches.
It appears not to produce any "000x", "00xx" or "0xxx" fields, which do occur in real IPv6 addresses.
Indeed. The probability distribution is uniform. To sample an hexadecimal number (with at most four digits) according to a Zipf distribution (probably what you want), you should abandon simple commands and find an implementation in your favorite programming language. For instance, I know C++ and searching the Web for "sample Zipf C++" gave https://stackoverflow.com/questions/9983239/how-to-generate-zipf-distributed-numbers-efficiently as the first result.
Magic Banana suggested
> ... you should abandon simple commands and find an implementation in your favorite programming language. <
Alas, over a sixty-year career I have had to get a rudimentary knowledge of MAD (Michigan Algorithm Decoder)
Fortran (with decks of cards punched on an O29 console), Basic & True Basic (a Kemeny product - my favorite).
I reached a brick wall with True Basic when I could not get any form of scrolling to work on the PC-AT that
I was using at the time (ca. 1998). The PC-AT had none of the graphics capabilities of an Apple product.
There is a slim chance I can retrieve my original media for True Basic (intended for MS-DOS) and install that
with the aid of Wine. Trisquel flidas has Basic-256, an educational form of Basic, so I'll start with that.
I'm more than fifty years rusty in any form of programming.
What currently stumps me is that I'd like to post-process the output of your all-in-one-step script by using
its output file of IPv6 addresses (a list reduced to just 256 lines by substituting "-N 3072" for "-N 786432"),
but I cannot get past that echo command with something like "cat IPv6-file.txt ..." That appears to put two- and
three-digit hexadecimals into the fields.
George Langford
Trisquel flidas has Basic-256, an educational form of Basic, so I'll start with that.
I doubt anybody wrote in Basic-256 an efficient code to sample a Zipf-distributed random variable. If you do not care about efficiency that much, that AWK program (which computes the CDF for each realization!) seems to work:
$ prefix=0123:4567; sample_size=10; od -A n -N $(expr $sample_size \* 24) -dw4 /dev/urandom | awk -v prefix=$prefix 'NR % 6 == 1 { printf prefix } { n = ($1 + $2 * 65536) * 2.71656973828463501e-09; cdf = 1; for (i = 2; cdf < n; ++i) cdf += 1 / i; printf ":%04x", i - 2 } NR % 6 == 0 { print "" }'
0123:4567:0027:0015:1b8f:1411:0046:698c
0123:4567:295a:03e9:0024:0000:0049:8246
0123:4567:e026:0e89:1b79:0000:0878:0002
0123:4567:0089:0001:0cac:0034:0026:03e9
0123:4567:000b:6341:0a6e:06e8:913b:0000
0123:4567:0000:0008:1125:0001:8bae:039b
0123:4567:02ba:454d:0000:004f:0000:0718
0123:4567:0000:0000:0023:01e5:0001:085b
0123:4567:0001:852b:0004:0050:0000:00f6
0123:4567:ea85:003d:1051:0000:0039:2064
On my system, this command produces ~270 addresses per second. The exponent of the Zipfian distribution is here 1: except for the first two groups of four hexadecimal digits (the "prefix" shell variable fixes them), 0000 occurs twice as often as 0001, three times as often as 0002, ..., 65536 times as often as ffff.
I do not understand the problem you tried to explain at the end of your post: please show the command line you executed, its output, and describe the expected one.
EDIT: I reduced the rather absurd precision I initially went with, which multiplied the throughput by 6.
Magic Banana suggested a useful script to provide IPv6 addresses with :000x, :00xx, and :0xxx fields:
$ prefix=0123:4567; sample_size=10; od -A n -N $(expr $sample_size \* 48) -dw8 /dev/urandom | awk -Mv prefix=$prefix -v PREC=64 'NR % 6 == 1 { printf prefix } { n = 0; for (p = 0; p != 4; ++p) n += $(p + 1) * 65536^p; n *= 6.3250068069543573221e-19; cdf = 1; for (i = 2; n > cdf; ++i) cdf += 1 / i; printf ":%04x", i - 2 } NR % 6 == 0 { print "" }'
As a test, I applied the Magic Banana script to a specific CIDR block's prefix:
prefix=2a02:2788 ; sample_size=4096; od -A n -N 196608 -dw8 /dev/urandom | awk -Mv prefix=$prefix -v PREC=64 'NR % 6 == 1 { printf prefix } { n = 0; for (p = 0; p != 4; ++p) n += $(p + 1) * 65536^p; n *= 6.3250068069543573221e-19; cdf = 1; for (i = 2; n > cdf; ++i) cdf += 1 / i; printf ":%04x", i - 2 } NR % 6 == 0 { print "" }' > IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt
That script generates a 164KB file with 4096 entries in about five minutes real time.
Let's count the :0xxx, :00xx and :000x occurrences.
See: https://www.tecmint.com/count-word-occurrences-in-linux-text-file/
Where it's said: grep -o -i mauris example.txt | wc -l
grep -c -o -i :0 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 4095
grep -c -o -i :00 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 4053
grep -c -o -i :000 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 3599
Extending Magic Banana's reasoning about the relative frequency of occurrences of :0001, :0002 and :0003, the
relative frequencies of the occurrences of :0xxx, :00xx, and :000x in a 4096-row list of IPv6 addresses ought
to be 256/4096, 16/4096, and 1/4096, respectively. In a 65,536-address list, prefix::0/128 may happen just once.
Then I used nmap to evaluate those addresses:
nmap -6 -sn -T4 -sL -iL IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt | grep "Nmap scan report for " - | tr -d '()' | sort -k5 | awk 'NR >= 1 { print $5, $6 }' | awk 'NR >= 1 { print $2, $1 }' | uniq -Df 1 | sed '/^\s\s*/d' | awk '{ print $2 "\t" $1 }' >> Multi-IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt
This script resolves 4064 of the 4096 addresses as host.dynamic.voo.be in fifteen seconds real time.
Enumerating the output file from the nmap script:
awk '{print $2,$1}' 'Multi-IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt' | sort -k 2 | uniq -cdf 1 | awk '{print $3"\t"$1}' '-' > Multi-IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.Tally.txt
The output file reads: "host.dynamic.voo.be 4064"
That's because the first 32 of the 4096 addresses return NXDOMAIN.
CIDR blocks with less intensely multi-addressed PTR's will reveal lists of all the different multi-addressed PTR's with
these scripts. However, the more addresses that are included in the randomized search, the more (and different !)
multi-addressed PTR's will be found.
It would appear that one needs to concatenate the variously randomized lists of addresses, eliminate duplicates, and
then apply the last pair of scripts to achieve a relatively accurate evaluation of the target CIDR block. Could it be
that the 79,228,162,514,264,337,593,543,950,336 addresses in 2a02:2788::/32 are dynamically generated on demand ?
George Langford
That script generates a 164KB file with 4096 entries in about five minutes real time.
One hour before your post, I edited the AWK program in my previous post to reduce the unnecessarily high precision. Now, on my system, generating 4096 addresses takes ~15s.
To easily get better performance, take Fortran (not Basic): store the first 65,536 harmonic numbers (in double precision) in an array, repeatedly read four bytes from /dev/urandom, multiply the related integer by 2.71656973828463501e-09, and find the position where it would be inserted in the array (to keep it sorted), using a binary search. That position is the number to return.
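The same idea can be sketched without leaving AWK (my illustration, not Magic Banana's code): precompute the 65,536 harmonic numbers once in BEGIN, then binary-search them for each 32-bit random integer; the constant 2.71656973828463501e-09 is H_65536 / 2^32.

```shell
# Ten Zipf-distributed four-digit hex groups, via one precomputed table of
# harmonic numbers and a binary search per 4-byte draw from /dev/urandom.
groups=$(od -A n -N 40 -dw4 /dev/urandom | awk '
  BEGIN { h[1] = 1; for (i = 2; i <= 65536; ++i) h[i] = h[i-1] + 1 / i }
  { n = ($1 + $2 * 65536) * 2.71656973828463501e-09
    lo = 1; hi = 65536            # smallest lo such that h[lo] >= n
    while (lo < hi) { m = int((lo + hi) / 2); if (h[m] < n) lo = m + 1; else hi = m }
    printf "%04x\n", lo - 1 }')
echo "$groups"
```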
grep -c -o -i :0 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 4095
grep -c -o -i :00 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 4053
grep -c -o -i :000 IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt ==> 3599
Options -o and -i are useless here. You may believe that using -o would make 'grep -c' count all occurrences on a line. It does not. It still counts the number of lines with at least one occurrence among the six random groups of four hexadecimal digits. Those outputs therefore mean that, in IPv6-SS.IPv6-NLU-2a02.2788.MB4420-4096.txt, all addresses but one have at least one group that starts with "0", 99.0% have at least one group that starts with "00", and 87.9% have at least one group that starts with "000".
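A two-line file makes the difference concrete (my illustration, hypothetical data):

```shell
# grep -c counts matching LINES; grep -o | wc -l counts every occurrence.
tmp=$(mktemp)
printf '0000:0000\nffff:ffff\n' > "$tmp"
lines=$(grep -c 00 "$tmp")                 # 1: only the first line matches
occurrences=$(grep -o 00 "$tmp" | wc -l)   # 4: two per group, two groups
echo "$lines $occurrences"
rm "$tmp"
```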
Extending Magic Banana's reasoning about the relative frequency of occurrences of :0001, :0002 and :0003
It is not a reasoning but a choice of distribution to sample from. I believe groups of four hexadecimal digits chosen by local network administrators approximately follow a Zipfian distribution. The exponent may not be 1 though. A more realistic exponent could be fitted from real-world addresses, by regression. More importantly, the ordering, from most common to least common, is certainly not that of increasing integers (ffff is certainly more common than 21a6, for instance). In fact, I believe that more realistic IPv6 addresses could be generated by shuffling the groups actually occurring in each field of real-world addresses. Something like that (which will never give a group that never occurs among the real-world addresses in "addresses"):
$ addresses="addresses"; sample_size=4096; for i in $(seq $sample_size); do for f in $(seq 3 8); do cut -d : -f $f "$addresses" | shuf > group$f; done; cut -d : -f -2 "$addresses" | paste -d : - group$(seq -s ' group' 3 8); done; rm group$(seq -s ' group' 3 8)
the relative frequencies of the occurrences of :0xxx, :00xx, and :000x in a 4096-row list of IPv6 addresses ought to be 256/4096, 16/4096, and 1/4096, respectively. In a 65,536-address list, prefix::0/128 may happen just once.
Your math looks wrong. The probabilities to sample 0000, 000x, 00xx or 0xxx are:
$ awk 'BEGIN { i = 1; for (p = 0; p != 5; ++p) { for (; i < 16^p + 1; ++i) cdf += 1 / i; partial[p] = cdf }; for (p = 0; p != 4; ++p) print partial[p] / cdf }'
0.0857076
0.289754
0.524903
0.762378
Technically, generating 0000 or not is the realization of a Bernoulli variable of parameter 0.0857076, generating 000x or not is that of a Bernoulli variable of parameter 0.289754, etc. Complementing the above program, here are, among 4096 addresses, the expected numbers of addresses with at least one 0000, at least one 000x, at least one 00x and at least one 0xxx:
$ awk 'BEGIN { i = 1; for (p = 0; p != 5; ++p) { for (; i < 16^p + 1; ++i) cdf += 1 / i; partial[p] = cdf }; for (p = 0; p != 4; ++p) print 4096 - 4096 * (1 - partial[p] / cdf)^6 }'
1703.4
3570.21
4048.9
4095.26
That looks compatible with the counts of your 'grep -c'. A p-value could be computed.... but I will stop here with the statistics.
awk 'NR >= 1 { print $5, $6 }' | awk 'NR >= 1 { print $2, $1 }'
All that is the same as awk '{ print $6, $5 }'!
It would appear that one needs to concatenate the variously randomized lists of addresses, eliminate duplicates, and then apply the last pair of scripts to achieve a relatively accurate evaluation of the target CIDR block.
Instead of concatenating, you can give a list of prefixes (one per line) to the AWK program augmented with one single condition-action to set the prefix:
$ prefixes=my_prefixes; sample_size=4096; od -A n -N $(expr $(wc -l < "$prefixes") \* $sample_size \* 24) -dw4 /dev/urandom | awk -v prefixes="$prefixes" -v s=$(expr 6 \* $sample_size) 'NR % s == 1 { getline prefix < prefixes } NR % 6 == 1 { printf prefix } { n = ($1 + $2 * 65536) * 2.71656973828463501e-09; cdf = 1; for (i = 2; cdf < n; ++i) cdf += 1 / i; printf ":%04x", i - 2 } NR % 6 == 0 { print "" }'
Duplicates are unlikely. I will not do the math to compute the probability of any duplicate. Notice however that the probability to get the most likely address, ending with 0000:0000:0000:0000:0000:0000, is 0.0857076^6 = 0.000000396384. That is about 4 in 10 millions. As a consequence, getting it twice or more among 4096 addresses is extremely unlikely.
Could it be that the 79,228,162,514,264,337,593,543,950,336 addresses in 2a02:2788::/32 are dynamically generated on demand ?
If you could generate one billion addresses per second, it would take 79,228,162,514,264,337,594 seconds to generate them all. That is more than 2510 billion years. By the way, I do not understand the point of all this: you most probably will not find the address of a machine by randomly generating even realistic IPv6 addresses.
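The arithmetic is easy to check (my computation, using a Julian year of 365.25 days):

```shell
# 2^96 addresses at one billion per second, converted to billions of years.
awk 'BEGIN { seconds = 2^96 / 1e9; printf "%d billion years\n", seconds / (365.25 * 24 * 3600) / 1e9 }'
```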
Magic Banana asserted:
" ... I believe groups of four hexadecimal digits chosen by local network administrators ..."
Waitaminnit. The IPv6 addresses start at 1 and then climb one-by-one to an astronomical number like
the 79,228,162,514,264,337,593,543,950,336 addresses in 2a02:2788::/32 which are boiled down to the
more workable hexadecimal notation 2a02:2788:ffff:ffff:ffff:ffff:ffff:ffff. Zeroes are just part of
the originating consecutively realized decimal number, but they come less often than every 10th count
as in ...7,8,9,a,b,c,d,e,f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20, i.e., every 16th count.
In a dynamic addressing scheme, the local network administrator might just be creating
the appearance of that astronomical number. Just patch together an array of four-digit hexadecimal
numbers with scripts like the ones with which we have been experimenting. That could be a large
but not astronomical number of arbitrary addresses. One of your scripts generates these arbitrary
addresses immeasurably quickly. Those addresses need only conform to the astronomical address space
which he owns (rents ? ...bought at an ill-advertised public auction ? ... got it for free ?).
A script can be written that would call up a series of text-file addresses, create new addresses,
populate each one with .html code or a script, upload the data to the new addresses on the appropriate
server, delete the former addresses' code, store the new addresses on the local HDD, delete the
replaced addresses, and move on without ever closing the cover of his laptop or running out of storage.
Magic Banana's efficient script:
prefix=2a02:2788; od -A n -N 12582912 -xw12 /dev/urandom | tr ' ' : | sed s/^/$prefix/ > SS.IPv6-NLU-MB4520A.txt
produces 1,048,576 IPv6 addresses in three (a thousand one, a thousand two, a thousand three ...) seconds real time.
Grep does the :0, :00, :000, & :0000 counting for us:
grep -c :0 SS.IPv6-NLU-MB4520A.txt = 336,443
grep -c :00 SS.IPv6-NLU-MB4520A.txt = 24,358
grep -c :000 SS.IPv6-NLU-MB4520A.txt = 1,500
grep -c :0000 SS.IPv6-NLU-MB4520A.txt = 93
which seems rather realistic, despite the original script's neglecting actually to count from the beginning.
Evaluating our progress ...
nmap -6 -sn -T4 -sL -iL SS.IPv6-NLU-MB4520A.txt | grep "Nmap scan report for " - | tr -d '()' | sort -k5 | awk 'NR >= 1 { print $5, $6 }' | awk 'NR >= 1 { print $2, $1 }' | uniq -Df 1 | sed '/^\s\s*/d' | awk '{ print $2 "\t" $1 }' >> Multi-SS.IPv6-NLU-MB4520A.txt
Alas, 2a02:2788::/32 is gone; WhoIs returns "no one found at this address"... "Failed to resolve ..." is all that nmap gets.
Try mobile.tre.se's 2a02:aa1::/32 instead:
prefix=2a02:aa1; od -A n -N 12582912 -xw12 /dev/urandom | tr ' ' : | sed s/^/$prefix/ > SS.IPv6-NLU-January2020-mobile.tre.se.txt
Forgot to count ... I thought something was wrong ... but there it was: 41MB with 1,048,576 addresses.
nmap -6 -sn -T4 -sL -iL SS.IPv6-NLU-January2020-mobile.tre.se.txt | grep "Nmap scan report for " - | tr -d '()' | sort -k5 | awk 'NR >= 1 { print $5, $6 }' | awk 'NR >= 1 { print $2, $1 }' | uniq -Df 1 | sed '/^\s\s*/d' | awk '{ print $2 "\t" $1 }' >> Multi-SS.IPv6-NLU-January2020-mobile.tre.se.txt
Network initially running ten times as fast as for the first nmap script ... took 63 minutes (55.2MB) ==> 1,048,534 mobile.tre.se's.
Enumerating:
awk '{print $2,$1}' 'Multi-SS.IPv6-NLU-January2020-mobile.tre.se.txt' | sort -k 2 | uniq -cdf 1 | awk '{print $3"\t"$1}' '-' > Multi-SS.IPv6-NLU-January2020-mobile.tre.se.Tally.txt
Result: mobile.tre.se 1048534
Apply grep again:
grep -c :0 SS.IPv6-NLU-January2020-mobile.tre.se.txt = 336,811
grep -c :00 SS.IPv6-NLU-January2020-mobile.tre.se.txt = 24,452
grep -c :000 SS.IPv6-NLU-January2020-mobile.tre.se.txt = 1,553
grep -c :0000 SS.IPv6-NLU-January2020-mobile.tre.se.txt = 84
A million makes a pretty good sample size, and grep's actual counts stand in roughly 4096:256:16:1 proportions (i.e., ~16 times per step).
Those randomly generated four-character fields appear not to exclude the normal numbers of zeros as I first thought.
In graduate school, I was at first trying to attain my Sc.D. in Materials Engineering, but my advisor wanted me to take
one more course: Statistical Mechanics, which does thermodynamics from the beginning. I contacted the department's
registration officer, and he said that I had spent enough time in graduate school (six years) and I had sufficient
course credits for a degree in Metallurgy. That's how I escaped mathematical statistics. I got through my metallurgical
consulting career with plain old experimentally based thermodynamics, which is what the elements do when left to their
own devices. Never missed those statistical calculations until now.
George Langford
which seems rather realistic
I thought you were saying groups of four hexadecimal digits in real-world IPv6 addresses more often start with 0 than not. I extrapolated that interpretation: "yet another distribution obeying Benford's law", I thought: https://en.wikipedia.org/wiki/Benford%27s_law
Magic Banana's efficient script
Writing the prefixes with simple commands (rather than sed), may save some CPU cycles:
$ prefix=2a02:2788; sample_size=1048576; od -A n -N $(expr 12 \* $sample_size) -xw12 /dev/urandom | tr ' ' : | paste -d '' <(yes $prefix | head -$sample_size) -
awk 'NR >= 1 { print $5, $6 }' | awk 'NR >= 1 { print $2, $1 }'
Read again my previous post.
Those randomly generated four-character fields appear not to exclude the normal numbers of zeros as I first thought.
/dev/urandom provides uniformly random bytes. 0 is included, of course.
Magic Banana continued our previous discussion:
>> which seems rather realistic
> I thought you were saying groups of four hexadecimal digits in real-world IPv6 addresses more often start
> with 0 than not. I extrapolated that interpretation: "yet another distribution obeying Benford's law", I
> thought: https://en.wikipedia.org/wiki/Benford%27s_law
Quite a few very tall buildings have heights (in feet) which start with the numeral 1. Far fewer 2's.
Buildings with heights starting in 3's have physical limitations.
On the other hand, IPv6 addresses very often start with 2's; rarely 3's ... the address space starts with
zero within the governing base, but that was arbitrary and had (to my knowledge) no physical limitation
such as that which occurs within the radio spectrum.
An address range such as 2a02:2788::/32 starts with zero:
Therefore, 2a02:2788:0000:0000:0000:0000:0000:0001 is the second address.
2a02:2789::/32 starts the same: 2a02:2789:0000:0000:0000:0000:0000:0001 and may belong to another party.
2a02:2788::/31 has two cycles that look exactly like 2a02:2788::/32 and 2a02:2789::/32 even though they
are in consecutive positions along the address line of the unbounded IPv6 address line that started at
zero only once. There is no overlap and no sale of one address on the infinite IPv6 line more than once.
The zeros in the hexadecimal version of the decimal 79,228,162,514,264,337,593,543,950,336 are fewer in
number but are all in the same places in the six fields of four hexadecimal digits in 2a02:2788::/32 as
they are in 2a02:2789::/32. Think of the successive rings on a dart board, the one labelled :000h being
the smallest and the one labelled :0hhh the largest. Our randomized looks at the hhhh;hhhh::/32 dart board
ought to have similar relative numbers of :0000, :000h, :00hh, and :0hhh every time.
Magic Banana also said:
> Writing the prefixes with simple commands (rather than sed), may save some CPU cycles:
$ prefix=2a02:aa1; sample_size=1048576; od -A n -N $(expr 12 \* $sample_size) -xw12 /dev/urandom | tr ' ' : | paste -d '' <(yes $prefix | head -$sample_size) - > SS.IPv6-NLU-January2020-mobile.tre.se.txt
Applied to another prefix, it finishes in three seconds; the grep's come out :0 ==> 336,650, :00 ==> 24,393, :000 ==>
1,537 & :0000 ==> 98. Those steps come out ~15 to one. The advantage of this script is that I can scale it readily.
The nmap script remains the rate limiting step in this exercise. I'm gathering prefixes for a marathon nmap session,
but, with the randomized method of condensing, the impossible-to-scan verbatim prefix=2a02:aa1::/32 can be visualized in an
hour. My laptop was formerly enduring CIDR/12's that took on the order of a thousand hours to return a result. Internet
searches on PTR addresses gleaned from email databases remain a tedious roadblock to the evaluation of the gratuitously
resolved addresses in the other two-thirds of published recent visitor data.
Magic Banana, grading my homework, said:
[quoting] awk 'NR >= 1 { print $5, $6 }' | awk 'NR >= 1 { print $2, $1 }'
> Read again my previous post.
Don't need to; that expedient script helped me navigate my logic, but it could have been written efficiently as
awk '{print $6,$5}' had I proofread it before publishing.
I corrected it and am randomizing the 2a02:2788::/32 block, which has come back to life after being closed over the weekend:
prefix=2a02:2788; sample_size=1048576; od -A n -N $(expr 12 \* $sample_size) -xw12 /dev/urandom | tr ' ' : | paste -d '' <(yes $prefix | head -$sample_size) - > SS.IPv6-NLU-January2020-host.dynamic.voo.beS.txt
nmap -6 -sn -T4 -sL -iL SS.IPv6-NLU-January2020-host.dynamic.voo.beS.txt | grep "Nmap scan report for " - | tr -d '()' | sort -k5 | awk '{ print $6, $5 }' | uniq -Df 1 | sed '/^\s\s*/d' | awk '{ print $2 "\t" $1 }' >> Multi-SS.IPv6-NLU-January2020-host.dynamic.voo.beS.txt
awk '{print $2,$1}' 'Multi-SS.IPv6-NLU-January2020-host.dynamic.voo.beS.txt' | sort -k 2 | uniq -cdf 1 | awk '{print $3"\t"$1}' '-' > Multi-SS.IPv6-NLU-January2020-host.dynamic.voo.beS.Tally.txt
The count of "host.dynamic.voo.be" came to 983,391 out of a possible 1,048,576 (93.8%)
George Langford
Some small amount of progress to report:
The pieces, suitably combined with my simplified Magic Banana script:
for (( sample = 1; sample <= 3; sample++ ))
do
    for (( hexxxx = 3; hexxxx <= 8; hexxxx++ ))
    do
        echo 2a02:2788:1000:0:6037:fc9a:27ac:10c7 | sed s/'[0-9a-f]\{4\}'/$(openssl rand -hex 2)/$hexxxx
    done
done
With the following output (three iterations, each with six lines of output; the unchanged sixth line of each group
stems from the unpadded :0 field, since corrected by padding it to :0000):
2a02:2788:f140:0:6037:fc9a:27ac:10c7
2a02:2788:1000:0:e6d6:fc9a:27ac:10c7
2a02:2788:1000:0:6037:a0fe:27ac:10c7
2a02:2788:1000:0:6037:fc9a:de12:10c7
2a02:2788:1000:0:6037:fc9a:27ac:78b6
2a02:2788:1000:0:6037:fc9a:27ac:10c7
2a02:2788:4def:0:6037:fc9a:27ac:10c7
2a02:2788:1000:0:6ada:fc9a:27ac:10c7
2a02:2788:1000:0:6037:d74a:27ac:10c7
2a02:2788:1000:0:6037:fc9a:05f7:10c7
2a02:2788:1000:0:6037:fc9a:27ac:048c
2a02:2788:1000:0:6037:fc9a:27ac:10c7
2a02:2788:766f:0:6037:fc9a:27ac:10c7
2a02:2788:1000:0:4f2e:fc9a:27ac:10c7
2a02:2788:1000:0:6037:f79b:27ac:10c7
2a02:2788:1000:0:6037:fc9a:9c0e:10c7
2a02:2788:1000:0:6037:fc9a:27ac:b677
2a02:2788:1000:0:6037:fc9a:27ac:10c7
Which is what I was originally intending to do.
However, now I'd like to modify six of the eight fields in the IPv6 address within each iteration
of the inner loop so that only three IPv6 addresses will be output (in the abbreviated range of the
above script). That would scatter the sampling far more widely within the original CIDR/32 block.
I would rather start with a seed IPv6 of field01:field02:0000:0000:0000:0000:0000:0000 where
field01 and field02 (and sometimes field03) are part of the field01:field02::/32 CIDR block.
George Langford
Some more progress:
What with all the gracious and informed help of Magic Banana, the extent of multi-addressed PTR records
is becoming ever clearer. I collected all the PTR records from the never-looked-up IPv6 addresses which
I found for January 2020, that had two or more addresses resolving to the same PTR. Then I applied these
randomization techniques to the upper level fields of those IPv6 addresses, truncated thusly:
field01:field02::/32, field01:field02:field03::/48, field01:field02:field03:field04::/64, or sometimes
field01:field02::0000/112 or even field01:field02::field07:0000/112, and then adjusting the parameters
of the randomization scripts to process a number of similar scripts as a batch so as to limit the number
of IPv6 addresses to be resolved to about 10 million and the address file to less than 150MB.
The end result: All of the multi-addressed PTR records thus evaluated had more than a thousand additional
addresses in the same CIDR blocks as those recorded for their IPv6 addresses, with some extending to
a million or more addresses, all for the same-named PTR record.
The next step in this procedure is to bring together the many additional singly addressed PTR records in
the published recent visitor data so as to find out which among them have been similarly obfuscated.
One CIDR block which appeared to be essentially _all_ one PTR name last week was shut off and unavailable
over the weekend, back in service on Monday, and further populated on Wednesday, indicating that it is
being used dynamically to obfuscate sensitive payloads stored behind that unresolvably recorded PTR.
George Langford
P.S. You can do this analysis at home without setting foot outdoors. There are other months from which
to choose, extending back into ancient history, before the 2016 U.S. election, or even to the present, i.e.,
through March 2020.