Translating a hex-coded Internet address to the usual four-octet IPv4 address

12 replies [Last post]
amenex
Offline
Joined: 01/03/2015

In the process of extracting spammer addresses from collected spams, I came across 766 entries that look like this:
127.0.0.1 00531084.d9t0h.ketobreaddessert.buzz
127.0.0.1 00531096.jc877t.resurgeplus.buzz
127.0.0.1 0053109a.hz850.southbeachskinlab.buzz
127.0.0.1 0053109c.58p1r.resurgesuppliment.buzz
127.0.0.1 005310a2.ubnl5gabo.mendthemarriage.buzz
127.0.0.1 005310a6.k7w3vlv.lostways.buzz
127.0.0.1 005310b0.pkzx8hb8c.newsynapsextplus.buzz

After which I applied the following script:
awk '{print $2}' Temp-08312021G01.txt | awk -F. '{print $1"\t"$2"\t"$3"\t"$4"."$5"."$6}' '-'
whose first column is this list of obviously hex-coded IPv4 addresses:
00531084
00531096
0053109a
0053109c
005310a2
005310a6
005310b0

which I would like to re-write as four-octet IPv4 addresses, starting with:
00.53.10.84
00.53.10.96
00.53.10.9a
00.53.10.9c
00.53.10.a2
00.53.10.a6
00.53.10.b0

and then doing the math:
000.3+5*16.10*16.4+8*16 ==> 0.83.160.132
000.3+5*16.10*16.6+9*16 ==> 0.83.160.150
000.3+5*16.10*16.10+9*16 ==> 0.83.160.154
000.3+5*16.10*16.12+9*16 ==> 0.83.160.156
000.3+5*16.10*16.2+10*16 ==> 0.83.160.162
000.3+5*16.10*16.6+10*16 ==> 0.83.160.166
000.3+5*16.10*16.0+11*16 ==> 0.83.160.176

Alas, dig -x elicits nothing from the calculated IPv4 addresses, but dig'ing the combined $4"."$5 columns does:
dig ketobreaddessert.buzz ==> 50.3.179.131
dig resurgeplus.buzz ==> 50.2.77.207
dig southbeachskinlab.buzz ==> 185.121.123.113
dig resurgesuppliment.buzz ==> 50.3.179.143
dig mendthemarriage.buzz ==> 185.121.123.121
dig lostways.buzz ==> 50.3.179.148
dig newsynapsextplus.buzz ==> 50.3.179.153

The sticking point is separating those eight-character strings into four two-character strings
separated by dots, so that I can study the other 759 obfuscated addresses. FS = "" will
accomplish the separation, but the syntax escapes me.
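
One possible way to do the split without FS = "" is a single sed substitution. This is only a sketch: the function name split_pairs is made up for illustration, and it assumes every input line is exactly eight characters long (other lines pass through unchanged).

```sh
# split_pairs: hypothetical helper, not part of the original post.
# Reads eight-character strings on standard input, one per line,
# and writes them as four dot-separated pairs.
split_pairs() {
    sed 's/^\(..\)\(..\)\(..\)\(..\)$/\1.\2.\3.\4/'
}

printf '%s\n' 00531084 0053109a | split_pairs
# 00.53.10.84
# 00.53.10.9a
```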

amenex
Offline
Joined: 01/03/2015

Some progress, following this link:
https://ubuntuforums.org/showthread.php?t=624630
awk '{print $1}' Temp-08312021G02.txt | awk 'BEGIN{FS="";OFS="\t"}{$1=$1}1' '-' | awk '{print $1$2"."$3$4"."$5$6"."$7$8}'
which re-writes the hex-coded eight-character addresses as four hex pairs, one per octet:
00.53.10.84
00.53.10.96
00.53.10.9a
00.53.10.9c
00.53.10.a2
00.53.10.a6
00.53.10.b0

as done by hand previously.
Still puzzling about scripting the math.

Attachment            Size
Temp-08312021G02.txt  62 bytes

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

With GNU AWK's strtonum function:
$ cut -d ' ' -f 2 Temp-08312021G01.txt | gawk -v FPAT=.. '{ for (i = 1; i != 4; ++i) printf strtonum("0x" $i) "."; print strtonum("0x" $4) }'
0.83.16.132
0.83.16.150
0.83.16.154
0.83.16.156
0.83.16.162
0.83.16.166
0.83.16.176

0x10 is 16 (not 160) in decimal.
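
Pair by pair, the arithmetic is 0x53 = 5*16 + 3 = 83, 0x10 = 1*16 + 0 = 16 and 0x84 = 8*16 + 4 = 132. A minimal sketch of the same check in plain shell, assuming a printf that accepts C-style 0x constants for %d (which is what POSIX specifies); the name hex_to_ipv4 is hypothetical:

```sh
# hex_to_ipv4: hypothetical helper; takes one eight-character hex string.
hex_to_ipv4() {
    # Prefix each pair with 0x: 00531084 -> 0x00 0x53 0x10 0x84
    # The unquoted $(...) is split on whitespace on purpose.
    set -- $(printf '%s\n' "$1" | sed 's/../0x& /g')
    printf '%d.%d.%d.%d\n' "$1" "$2" "$3" "$4"
}

hex_to_ipv4 00531084
# 0.83.16.132
```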

Platoxia
Offline
Joined: 05/30/2018

Man, I really want to spend some time learning Gawk now. This is amazingly simple and succinct. The all-Bash equivalent requires so much more work that it isn't even comparable (and is much slower).

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

AWK is very easy to learn. I spend a little over three hours teaching almost all of AWK to undergraduate students. Here are my slides (the data for the exercises are still online, with answers): https://dcc.ufmg.br/~lcerf/slides/mda6.pdf

Those slides do not present extensions that may only be present in GNU's AWK. They are sometimes very useful, though. The one-liner I gave above actually uses two GNU extensions: the FPAT variable (to specify the field, here any two consecutive characters, rather than the separation between fields) and strtonum, to convert a (possibly octal or, here, hexadecimal) string to a number.
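
To make the roles of the two extensions easier to see, here is the same idea spread over several lines with comments. It is a sketch that only runs with GNU AWK (gawk); the wrapper name hex_to_ipv4_gawk is an assumption for illustration, and the input is assumed to be one eight-character hex string per line.

```sh
# hex_to_ipv4_gawk: hypothetical wrapper; requires gawk, not plain POSIX awk.
hex_to_ipv4_gawk() {
    gawk -v FPAT='..' '{
        # FPAT=".." makes every two consecutive characters a field,
        # so "00531084" yields $1="00", $2="53", $3="10", $4="84".
        # strtonum() reads a string with a leading "0x" as hexadecimal.
        printf "%d.%d.%d.%d\n",
            strtonum("0x" $1), strtonum("0x" $2),
            strtonum("0x" $3), strtonum("0x" $4)
    }'
}

# Usage: echo 0053109a | hex_to_ipv4_gawk
```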

amenex
Offline
Joined: 01/03/2015

It's not all that awful. Read on:
See: https://ubuntuforums.org/showthread.php?t=624630 where it's said (I replaced \n with \t) ==>
awk '{print $1}' Temp-08312021G02.txt | awk 'BEGIN{FS="";OFS="\t"}{$1=$1}1' '-' | awk '{print $2"+"$1"*"16"."$4"+"$3"*"16"."$6"+"$5"*"16"."$8"+"$7"*"16}' '-' > Temp-08312021G05.txt
Then execute this series of six scripts which put the data into decimal notation:
awk '{print $0}' Temp-08312021G05.txt | sed 's/a/10/g' | awk '{print $0}' '-' > Temp-08312021G0601.txt ;
awk '{print $0}' Temp-08312021G0601.txt | sed 's/b/11/g' | awk '{print $0}' '-' > Temp-08312021G0602.txt ;
awk '{print $0}' Temp-08312021G0602.txt | sed 's/c/12/g' | awk '{print $0}' '-' > Temp-08312021G0603.txt ;
awk '{print $0}' Temp-08312021G0603.txt | sed 's/d/13/g' | awk '{print $0}' '-' > Temp-08312021G0604.txt ;
awk '{print $0}' Temp-08312021G0604.txt | sed 's/e/14/g' | awk '{print $0}' '-' > Temp-08312021G0605.txt ;
awk '{print $0}' Temp-08312021G0605.txt | sed 's/f/15/g' | awk '{print $0}' '-' > Temp-08312021G0606.txt ;

Now it's necessary to put the four math statements in compound form understood by bc:
awk '{print "{"$0"}"}' Temp-08312021G0606.txt | sed 's/\./;/g' | bc > Temp-08312021G0607.txt
The output file looks terrible because bc places a newline character after each result,
producing a list of single results, which can be put in the proper IPv4 organization with Leafpad:
precede each line with a space, then replace "space [0] newline" with [.];
precede each line with a space, then replace "space [2] newline" with [.];
repeat as necessary with each new leftmost octet (of which there are none in this data).
Then the output file looks just like the result of Magic Banana's ever-so-succinct script.
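
The manual regrouping in a text editor could probably be avoided: since bc prints one result per line, every four consecutive lines form one address. A sketch with the standard paste utility follows; the function name join_octets is an illustrative assumption, and the input is assumed to contain exactly four numbers per address.

```sh
# join_octets: hypothetical helper, not from the thread.
join_octets() {
    # Each "-" reads the next line of standard input;
    # -d. joins the four lines with dots.
    paste -d. - - - -
}

printf '%s\n' 0 83 16 132 0 83 16 150 | join_octets
# 0.83.16.132
# 0.83.16.150
```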

Attachment            Size
Temp-08312021G02.txt  27.37 KB

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

It's not all that awful.

You definitely have a weird conception of what is clear code (even including manual substitutions in a text editor!). Even if one wants to stick to POSIX AWK, she can write:
$ sed 's/./& /g' Temp-08312021G02_0.txt | awk -v OFS=. 'function d(i) { h = "123456789abcdef"; return 16 * index(h, $i) + index(h, $++i) } { print d(1), d(3), d(5), d(7) }'
0.40.174.119
0.40.188.129
0.82.229.123
0.82.229.235
(...)

Platoxia
Offline
Joined: 05/30/2018

Yeah, I mean... just for comparison's sake: to get the same capability in Bash as the Gawk version you posted (with the GNU extensions), you would first have to write a script that converts a number in ANY common base into decimal (and probably another that converts any decimal number back to any other base), to be used as filters.

Keeping in mind that I am a complete amateur at programming and I'm sure it can be done better, here is what an equivalent Bash version looks like:


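while read -r line; do
    # Cut the eight-character line into four two-character hex pairs
    set1=${line:0:2}
    set2=${line:2:2}
    set3=${line:4:2}
    set4=${line:6:2}
    # Convert each pair from base 16 and join the results with dots
    echo -n "$(./any2dec.sh "$set1" 16)."
    echo -n "$(./any2dec.sh "$set2" 16)."
    echo -n "$(./any2dec.sh "$set3" 16)."
    echo "$(./any2dec.sh "$set4" 16)"
done < Temp-08312021G02.txt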

Script 'any2dec.sh' is attached as a text file.

Keep in mind, this script could be much simpler if it were designed only for the current problem, but to compare it with the GNU extension it should be able to convert other bases to decimal as well. At any rate, the difference between your Gawk example with GNU extensions and this Bash example is night and day.

The clarity and succinctness of your version blows it away... it's quite a beautiful little piece of code, IMHO.

Also, thanks for the links. I'll definitely check them out.

Attachment   Size
any2dec.txt  2.77 KB

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

The clarity and succinctness of your version blows it away... it's quite a beautiful little piece of code, IMHO.

Thank you. I quite like the POSIX version too. There is a little trick in it: 0 is not in the string h but the built-in index function returns 0 if the substring (here "0") does not occur (otherwise the position, which starts at 1 in AWK).
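
The index trick can also be tried without the sed preprocessing, by picking characters with substr. This is a sketch rather than the version posted above: the names hex_lines_to_ipv4 and hexval are made up, tolower is added so that uppercase hex digits work too, and, as with the original trick, any character missing from the digit string silently counts as 0.

```sh
# hex_lines_to_ipv4: hypothetical helper using only POSIX awk features.
hex_lines_to_ipv4() {
    awk '
    # hexval: value of a hex string; i and n are local variables.
    function hexval(s,    i, n) {
        n = 0
        for (i = 1; i <= length(s); ++i)
            # index() gives 1..15 for "1".."f" and 0 for "0" (not found)
            n = 16 * n + index("123456789abcdef", tolower(substr(s, i, 1)))
        return n
    }
    {
        printf "%d.%d.%d.%d\n",
            hexval(substr($0, 1, 2)), hexval(substr($0, 3, 2)),
            hexval(substr($0, 5, 2)), hexval(substr($0, 7, 2))
    }'
}

echo 0053109a | hex_lines_to_ipv4
# 0.83.16.154
```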

If sed is accepted (but awk is not), one can use the attached substitutions and write:
#!/bin/sh
sed -f hex2dec.txt "$1" | while read a b c d e f g h rest
do
    echo $((16 * $a + $b)).$((16 * $c + $d)).$((16 * $e + $f)).$((16 * $g + $h))
done
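
The content of the attached hex2dec.txt is not shown in the thread. The sketch below is a guess at an equivalent set of substitutions, passed inline with -e: one rule that puts a space after every character, then one rule per hex letter. Seven such one-line rules of nine bytes each would match the 63 bytes reported for the attachment, but that remains an assumption. The function name hex2dec_sed is hypothetical, and lowercase eight-character input is assumed.

```sh
# hex2dec_sed: hypothetical stand-in for "sed -f hex2dec.txt ... | while read".
hex2dec_sed() {
    sed -e 's/./& /g' \
        -e 's/a/10/g' -e 's/b/11/g' -e 's/c/12/g' \
        -e 's/d/13/g' -e 's/e/14/g' -e 's/f/15/g' |
    while read -r a b c d e f g h rest
    do
        # Each variable now holds one hex digit as a decimal number (0 to 15)
        echo "$((16 * a + b)).$((16 * c + d)).$((16 * e + f)).$((16 * g + h))"
    done
}

echo 0053109a | hex2dec_sed
# 0.83.16.154
```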

Attachment   Size
hex2dec.txt  63 bytes

Platoxia
Offline
Joined: 05/30/2018

I can't stop laughing at how much better that is, lol.

I had 'any2dec.sh' lying around, as I had made it when reading an ASM book, so the logic in the script was intended to test my understanding of how the conversions work. It is just soooo slow in comparison.

Your "fit for purpose" examples are definitely better than the generalized example I gave. The difference in speed really is hilarious to me.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

For speed, it is important to do as little shell as possible. The shell is slow.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

I quite like the POSIX version too.

I like it even more now that I realize that having the digits after zero in a string allows one to easily implement an any2dec that is really *any*2dec:
#!/bin/sh
if [ -z "$1" -o "$1" = "-h" -o "$1" = "--help" ]
then
printf "Usage: $0 digits-after-zero
One number per line on the standard input.
Examples:
\$ echo 1b | $0 123456789abcdef
27
\$ echo xooxox | $0 x
37
"
exit
fi
awk -F '' -v d="$1" '{
n = 0
for (p = 0; NF; --NF)
n += index(d, $NF) * (length(d) + 1)^p++
print n }'

-F '' (to define FS as "") is undefined in the POSIX standard, which however mentions splitting the record into individual characters as a possible interpretation. That is what gawk and mawk (the two most common implementations of awk) do. To be POSIX-compliant, one can write sed 's/./& /g' | awk -v d="$1" '{ (...) }'.
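
A sketch of that POSIX-compliant variant, written as a shell function so it can be tried directly. Two things differ from the script above and are my assumptions, not the original code: the function name any2dec, and the loop, which runs left to right (multiply the running value by the base, then add the digit) instead of decrementing NF and raising the base to a power.

```sh
# any2dec: hypothetical function form of the script above.
# $1: the digits after zero, in increasing order (e.g. 123456789abcdef).
# Standard input: one number per line.
any2dec() {
    sed 's/./& /g' | awk -v d="$1" '{
        base = length(d) + 1
        n = 0
        # index() returns 0 for any character not in d, which serves as zero
        for (i = 1; i <= NF; ++i)
            n = n * base + index(d, $i)
        print n
    }'
}

echo 1b | any2dec 123456789abcdef
# 27
echo xooxox | any2dec x
# 37
```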

amenex
Offline
Joined: 01/03/2015

More than half of the original 766 addresses computed by MB's script respond positively to dig -x.
Somehow that 0x10 error escaped my glassy-eyed stare, even though my notes have a lookup table
for a=10 through f=15 and then ... 10=16.

Thank you yet again.