Sort and Uniq fail to remove all duplicates from a list of hostnames and their IPv4 addresses

6 replies
amenex
Offline
Joined: 01/04/2015

As Apache server software on shared servers routinely performs hostname lookups
on the data requests made to the domains it hosts, I'm compiling a database of
the thousands of example.com hostnames that are on the Internet.

I've reached an impasse: LibreOffice's Calc spreadsheet will filter _most_ of
the many duplicated lines in my lists, but a great many pairs, triplets, and
quadruplets of lines still remain. There are enough of them that removing them
by hand is tedious.

I've tried uniq -d to print one copy of each duplicated line, followed by
uniq -u to print only the unique lines, but both outputs still retained these
duplicated lines.

Here's a sample of my predicament:

jaholper1.example.com 95.182.79.24
jaholper1.example.com 95.182.79.24
jaholper1.example.com 95.182.79.33
jaholper1.example.com 95.182.79.33
jaholper4.example.com 109.248.200.4
jaholper7.example.com 109.248.203.131
jaholper7.example.com 109.248.203.188
jaholper7.example.com 109.248.203.189
jaholper7.example.com 109.248.203.191
jaholper7.example.com 109.248.203.198
jaholper7.example.com 185.186.141.79
jaholper7.example.com 185.186.142.10
jaholper7.example.com 185.186.142.10
jaholper7.example.com 185.186.142.100
jaholper7.example.com 185.186.142.100
jaholper7.example.com 185.186.142.101
jaholper7.example.com 185.186.142.101

uniq -d returns only one line: jaholper7.example.com 185.186.142.101
uniq -u keeps everything _but_ the last two lines.

Reversing the positions of the two columns in LibreOffice only makes matters
worse: I get either single-line output or complete erasure of the file.

It's been suggested that each IPv4 address could be represented as a single
decimal number, but the thought of doing that for my thousands of IPv4
addresses makes manual editing look pretty good.

George Langford

chaosmonk

Offline
Joined: 07/07/2017

> jaholper1.example.com 95.182.79.24
> jaholper1.example.com 95.182.79.24
> jaholper1.example.com 95.182.79.33
> jaholper1.example.com 95.182.79.33
> jaholper4.example.com 109.248.200.4
> jaholper7.example.com 109.248.203.131
> jaholper7.example.com 109.248.203.188
> jaholper7.example.com 109.248.203.189
> jaholper7.example.com 109.248.203.191
> jaholper7.example.com 109.248.203.198
> jaholper7.example.com 185.186.141.79
> jaholper7.example.com 185.186.142.10
> jaholper7.example.com 185.186.142.10
> jaholper7.example.com 185.186.142.100
> jaholper7.example.com 185.186.142.100
> jaholper7.example.com 185.186.142.101
> jaholper7.example.com 185.186.142.101

I pasted that into a file called "test":

$ uniq test
jaholper1.example.com 95.182.79.24
jaholper1.example.com 95.182.79.33
jaholper4.example.com 109.248.200.4
jaholper7.example.com 109.248.203.131
jaholper7.example.com 109.248.203.188
jaholper7.example.com 109.248.203.189
jaholper7.example.com 109.248.203.191
jaholper7.example.com 109.248.203.198
jaholper7.example.com 185.186.141.79
jaholper7.example.com 185.186.142.10
jaholper7.example.com 185.186.142.100
jaholper7.example.com 185.186.142.101

Every item seems to be represented with no duplicates.

Magic Banana

Offline
Joined: 07/24/2010

You want 'sort -u', which is faster than 'sort | uniq' in the presence of many duplicates.
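
For instance, assuming the sample above is still in the file called "test" from the
previous reply, the whole clean-up collapses to:

$ sort -u test

which should print the same twelve lines as the 'uniq test' output above, since that
sample happens to be sorted already (plain 'uniq' only removes _adjacent_ duplicates,
which is why it is normally combined with 'sort').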

amenex
Offline
Joined: 01/04/2015

At Magic Banana's suggestion, I applied sort -u to the test file in my original
posting, which eliminated the extra lines perfectly, but when I tried the same
approach on another set of data that I had first sorted with LibreOffice Calc,
sort -u reduced the number of rows from 67 to 46, and four of those were still
duplicated, in pairs and in threes.

In the meantime I used the hands-on approach on the original body of the test
file (2,000+ rows), first cutting it down to about half that number of rows
with the duplicate-removal function supplied by LibreOffice Calc, which left
me with about 500 pairs of duplicate rows, with a few singles mixed in. It
took about an hour to select and delete the rows, one at a time.

Still, sort -u is a big improvement if it gets rid of most of the duplicates.

I had thought that Leafpad would cleanse a text file of LibreOffice Calc
baggage, but there may be some stuff still in there. The test file got
additional cleansing, what with its trip to Spain and back amid much
commotion, which would have shed all the invisible stuff, compared with the
text file that I was working with, fresh from a LibreOffice Calc file.

I tried sort -u on the main body of the test file, and the output was still
full of duplicates. There's something about LibreOffice Calc that leaves
residues which fool the sorting software ...

Thanks for thinking about this.

BTW, my original example.com raw data was about 18,000 rows, which I put into
LibreOffice Calc, alphabetized, and then divided into about thirty smaller and
more manageable files, each starting with one of the letters of the alphabet.
The four extra files hold the hostnames beginning with numbers, plus three
files of macro-multiples of example.com like the test file. There will be many
thousands of unresolvable hostnames represented in the final result.

George Langford

Magic Banana

Offline
Joined: 07/24/2010

'sort -u' does remove all duplicates. What you consider duplicates in its output must differ in some subtle way, e.g., an additional space at the end of the line. Those differences can be ironed out. For instance, the following command "squeezes" the spaces in "file" (only leaving one wherever there are several), removes those at the end of the lines (if any), sorts, and removes duplicates:
tr -s ' ' < file | sed 's/ $//' | sort -u
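
As a small illustration of why an invisible trailing space defeats 'sort -u', here are two fabricated lines that differ only in such a space; counting the surviving lines shows that only the squeeze-and-strip pipeline treats them as duplicates:

$ printf 'a.example.com 1.2.3.4\na.example.com 1.2.3.4 \n' | sort -u | wc -l
2
$ printf 'a.example.com 1.2.3.4\na.example.com 1.2.3.4 \n' | tr -s ' ' | sed 's/ $//' | sort -u | wc -l
1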
LibreOffice Calc is for making calculations. It is not a database management system.

amenex
Offline
Joined: 01/04/2015

Magic Banana does it again ... with "tr -s ' ' < file | sed 's/ $//' | sort -u"

I plugged in the filename thusly:

> tr -s ' ' < HolkitsorN.txt | sed 's/ $//' | sort -u > HolkitsorN-MBd.txt

The result: the original 13.1 kB shrank to 8.8 kB, without any residual duplicate lines.

The LibreOffice Calc spreadsheet shrank from 332 rows to 224 rows; before sorting and trying
to remove duplicates, just with LibreOffice Calc, there were 2461 rows. Clearly, this has to
be a multi-step process. Sort -u by itself had no effect on the file size ... even after I
"laundered" the file by emailing it to myself and copying it from the email text into Leafpad.

I tried to parse MB's expression ... I'm OK until sed 's/ $//', because in man sed it
looks like it might be sed 's///' instead. Am I missing something?

A residual task remains: the sorting process only works for me if I pad the one- and
two-digit octets with leading zeroes ... which have to be stripped out afterwards. Is
there a "sort" in any freedom-compatible application which can put numbers in
numerical order?

Thanks & best regards,
George Langford

Magic Banana

Offline
Joined: 07/24/2010

> Sort -u by itself had no effect on the file size.

If you give 'sort' a text file as an argument, that file will not be modified, if that is what you mean. That is how all text-processing commands work (well, 'sed' actually has an --in-place option): you must redirect their output to a file (or use the --output option, if the command has one) and you can then move that file onto the original one, if you are certain you do not need the original anymore.
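
For instance, reusing the file name from earlier in the thread (GNU sort's -o/--output option may safely name the input file itself, because 'sort' reads all of its input before it starts writing):

$ sort -u HolkitsorN.txt > HolkitsorN-sorted.txt
$ sort -u -o HolkitsorN.txt HolkitsorN.txt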

Since 'sort -u' removes duplicates, its output is always at most as large as its input, but it can be much smaller: one single line if the input only contains this line repeated many times.

> I'm OK until sed 's/ $//' because in man sed it looks like
> it might be sed 's///' instead. Am I missing something?

What is between the first and the second / is the regular expression that the substitution command ('s') matches. "$" at the end of a regular expression means "end of the line", so " $" matches a single space at the end of the line (in the command line I gave, there cannot be more than one such space, because tr -s ' ' squeezes the sequences of spaces). It is substituted by what is between the second and the third /, i.e., by nothing: the trailing space is removed.
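
A tiny illustration, using cat -A (which marks the end of every line with a '$') to make the trailing space visible before and after the substitution:

$ printf 'jaholper1.example.com 95.182.79.24 \n' | cat -A
jaholper1.example.com 95.182.79.24 $
$ printf 'jaholper1.example.com 95.182.79.24 \n' | sed 's/ $//' | cat -A
jaholper1.example.com 95.182.79.24$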

If you want to learn text-processing commands, I have slides: https://dcc.ufmg.br/~lcerf/en/mda.html#slides (5 to 9).

> Is there a "sort" in any freedom-compatible application which can put numbers in numerical order?

'sort' can do that (and much more). I assume a hostname can contain any number of dots. That is why, below, I use 'awk' rather than 'tr' and 'sed': it not only removes the supernumerary spaces but also writes the IPv4 address before the hostname (in this way, the dot-separated columns 1, 2, 3 and 4 contain the four numbers of the IPv4 address):
$ awk '{ print $2 "." $1 }' file | sort -ut . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5 | sed 's/\./ /4'

If you want to swap back the two columns, just append "| awk '{ print $2, $1 }'" to the command line.
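
Assuming the sample is still in the file "test", the first lines of the output might look like this (note that 95 now sorts before 109, which a plain lexical sort would not do):

$ awk '{ print $2 "." $1 }' test | sort -ut . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5 | sed 's/\./ /4' | head -n 3
95.182.79.24 jaholper1.example.com
95.182.79.33 jaholper1.example.com
109.248.200.4 jaholper4.example.com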

If you plan to reuse the command:

  1. Write that in a new file in a directory listed in your PATH variable, e.g., in /usr/local/bin/sort-n-remove-duplicates:
    #!/bin/sh
    awk '{ print $2 "." $1 }' | sort -ut . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5 | sed 's/\./ /4'
  2. Save.
  3. Make the file executable (e.g., using 'chmod +x' or using your file browser).

You can then execute 'sort-n-remove-duplicates < my-file > my-file.sorted', where "my-file" is whatever file you want to process and "my-file.sorted" is wherever you want to redirect the output.
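
A minimal sketch of those three steps, assuming the two-line script above has been saved as "sort-n-remove-duplicates" in the current directory and that you may write to /usr/local/bin, again with HolkitsorN.txt from earlier in the thread as the input:

$ sudo cp sort-n-remove-duplicates /usr/local/bin/
$ sudo chmod +x /usr/local/bin/sort-n-remove-duplicates
$ sort-n-remove-duplicates < HolkitsorN.txt > HolkitsorN.sorted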