Another uniq -u feature emerges

8 replies
amenex
Offline
Joined: 01/04/2015

Over a year ago, I lamented that sort followed by uniq -u wasn't removing duplicates from a list:
https://trisquel.info/en/forum/sort-and-uniq-fail-remove-all-duplicates-list-hostnames-and-their-ipv4-addresses

Recently I've been faced with the results of grep searches in other files that overlap because
they contain the same string on which grep was searching. After sorting the grep outputs, then
cutting & pasting, I ended up with pairs of files that contain many duplicates because the strings
were caught twice.

grep -h lns03.v6.018.net.il *Rev.oGnMap.txt >> PTR.IPv6-Data/IPv6-lns03.v6.018.net.il.txt ;
grep -h cable-lns03.v6.018.net.il *Rev.oGnMap.txt >> PTR.IPv6-Data/IPv6-cable-lns03.v6.018.net.il.txt

The grep outputs were expected to list the PTR record in the first column and the corresponding
IPv6 address in the second column, because I reversed the order of those columns in the outputs
of the original nMap -oG searches as well as removing the parentheses enclosing the IPv6 addresses.
In the sorting scripts below, $1 is the PTR and $2 is the IPv6 address, except for the uniq -c
script where I printed $2 and $3 to skip the counts column produced by uniq -c.

Here are the three pairs of scripts intended to consolidate the files:

sort IPv6-lns03.v6.018.net.il.txt | uniq -u > IPv6-uniq.lns03.v6.018.net.il.txt ;
sort IPv6-cable-lns03.v6.018.net.il.txt | uniq -u > IPv6-uniq.cable-lns03.v6.018.net.il.txt

sort -k 2 IPv6-lns03.v6.018.net.il.txt | uniq -c | awk '{print $2"\t"$3}' '-' > IPv6-uniq.lns03.v6.018.net.il.txt ;
sort -k 2 IPv6-cable-lns03.v6.018.net.il.txt | uniq -c | awk '{print $2"\t"$3}' '-' > IPv6-uniq.cable-lns03.v6.018.net.il.txt

sort -u IPv6-lns03.v6.018.net.il.txt > IPv6-uniqB.lns03.v6.018.net.il.txt
sort -u IPv6-cable-lns03.v6.018.net.il.txt > IPv6-uniqB.cable-lns03.v6.018.net.il.txt

The first pair produced zero bytes output for both scripts; the original files were not zero.

The second pair reduced both files by half as expected.

Then I remembered to check this forum, wherein Magic Banana had suggested using sort -u
instead of the first pair's combination of sort and uniq -u. This third pair produced the
exact same halving of the original file sizes as my less efficient use of uniq -c and awk
to eliminate the counts column. Thank you again, Magic Banana !

I had tried to "fix" the uniq -u debacle of the first pair of sorting scripts by copying the
affected file names directly from the File manager into the script text, as that has been a
useful workaround in the past, but this time the first pair of sorting scripts produced zero
bytes output again, same as did my first attempt.

What is it about uniq -u of which I should be wary ?

George Langford

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

According to 'info uniq':
‘-u’
‘--unique’
Discard the last line that would be output for a repeated input
group. When used by itself, this option causes ‘uniq’ to print
unique lines, and nothing else.

'sort | uniq' does the same as 'sort -u' but is slower (and longer to type).
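
A minimal sketch of why the first pair of scripts can produce empty files, assuming every line of the grep output appears exactly twice. The four sample lines and the variable names are made up for illustration; they are not taken from the real data.

```shell
# Hypothetical sample: each PTR/address line appears twice,
# as it would after two overlapping grep runs.
sample=$(printf '%s\n' \
  'host-b.example.net 2001:db8::2' \
  'host-a.example.net 2001:db8::1' \
  'host-b.example.net 2001:db8::2' \
  'host-a.example.net 2001:db8::1')

# uniq -u keeps only the lines that occur exactly once: nothing survives here.
only_once=$(printf '%s\n' "$sample" | sort | uniq -u)

# Plain uniq and sort -u both keep one copy of every distinct line.
one_copy_uniq=$(printf '%s\n' "$sample" | sort | uniq)
one_copy_sort=$(printf '%s\n' "$sample" | sort -u)

printf 'sort | uniq -u: %s line(s)\n' "$(printf '%s' "$only_once" | grep -c '')"
printf 'sort -u:        %s line(s)\n' "$(printf '%s\n' "$one_copy_sort" | grep -c '')"
```

If some lines of the real input occur only once, 'sort | uniq -u' prints just those lines, which would explain a small but non-empty output.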

amenex
Offline
Joined: 01/04/2015

Following up, I noticed a pattern among the outputs of | sort | uniq -u versus | sort -u:

The three files that I evaluated were 26.1MB, 12MB, and 2.0MB, respectively:
1. The original file, the result of grepping about 10GB of nMap output files, with many duplicates;
2. The | sort -u file; and
3. The | sort | uniq -u file, the smallest of the three.

I applied comm (with no arguments):
comm IPv6-uniq.lns01.v6.018.net.il.txt IPv6-uniqB.lns01.v6.018.net.il.txt > IPv6-commAll.lns01.v6.018.net.il.txt

An excerpt from this last script's output is attached; it has no Column $2 (lines unique to
the second (smaller) file); Column $3 (the less well represented among the two files) has nothing
obviously different from the entries above & below.

Not to contradict man uniq's description of uniq -u, but I'm suspicious. I'll be using sort -u
from now on.

Attachment        Size
Example0702.txt   1.96 KB

Magic Banana

Offline
Joined: 07/24/2010

Again: 'sort -u' equates to 'sort | uniq'; it does not equate to 'sort | uniq -u', because "used by itself, -u causes ‘uniq’ to print unique lines, and nothing else".

There is no need to reverse engineer 'uniq'. Just read its documentation:
$ info uniq
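
The relation between the three outputs can be checked on a toy input. This is only a sketch: the five-line input and the file names under the temporary directory are hypothetical, and LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
# Hypothetical input: "a" and "c" occur once, "b" occurs three times.
tmp=$(mktemp -d)
printf '%s\n' b a b c b > "$tmp/all.txt"

LC_ALL=C sort "$tmp/all.txt" | uniq -u > "$tmp/once.txt"      # lines occurring exactly once
LC_ALL=C sort "$tmp/all.txt" | uniq -d > "$tmp/repeated.txt"  # one copy of each repeated line
LC_ALL=C sort -u "$tmp/all.txt" > "$tmp/distinct.txt"         # one copy of every distinct line

# comm -13 hides columns 1 and 3, leaving the lines found only in the second file:
# here, what sort -u kept but uniq -u discarded.
missing=$(LC_ALL=C comm -13 "$tmp/once.txt" "$tmp/distinct.txt")
repeated=$(cat "$tmp/repeated.txt")

printf 'dropped by uniq -u: %s\n' "$missing"
printf 'reported by uniq -d: %s\n' "$repeated"
rm -r "$tmp"
```

On this input both printed values are "b": the lines missing from the 'uniq -u' output are exactly the ones that were repeated.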

amenex
Offline
Joined: 01/04/2015

Magic Banana quoted the magic words:
"used by itself, -u causes 'uniq' to print unique lines, and nothing else"

Put another way: Uniq -u skips duplicated lines altogether.

Sort -u needs to be mentioned earlier on the man sort page, as it will handle
unpredictable outputs reliably.

Thanks are due to the always-conscientious teacher !

Magic Banana

Offline
Joined: 07/24/2010

The options are usually categorized and listed in alphabetical order. It would be especially weird if 'info sort' did not respect that order! Notice that the introduction of 'info uniq' includes this sentence:

If you want to discard non-adjacent duplicate lines, perhaps you want to use ‘sort -u’.
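
A small sketch of what "non-adjacent" means in that sentence, using a made-up three-line input:

```shell
# Made-up input with a non-adjacent duplicate ("a" on lines 1 and 3).
input=$(printf '%s\n' a b a)

# uniq alone only collapses neighbouring repeats, so both "a" lines remain.
unsorted_uniq=$(printf '%s\n' "$input" | uniq)

# sort -u sorts first, so the duplicate is removed.
sorted_unique=$(printf '%s\n' "$input" | sort -u)

printf 'uniq alone:\n%s\n' "$unsorted_uniq"
printf 'sort -u:\n%s\n' "$sorted_unique"
```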

amenex
Offline
Joined: 01/04/2015

Within 'info sort' there are sections:

Ordering options: -b through -V
Other options: --batch-size=NMERGE through -u and on to --version

The lesson for me: when advised to read, read all of it ...

George Langford, suffering the consequences of failing to appreciate the
subtleties of uniq -u and sort -u until very recently.

amenex
Offline
Joined: 01/04/2015

Sort has a problem, as illustrated with the attached pair of files,
wherein the first application of sort below fails to separate the five sets of PTR's:
sort -k 1 Temp0706AR.txt > Temp0706AS.txt

My workaround does manage the separation OK without changing the file otherwise:
awk '{print $1,$2}' 'Temp0706AR.txt' | sed 's/\./'\ '/g' | sort -k 3 | awk '{print $1,$2,$3,$4,$5"\t"$6}' '-' |
sed 's/'\ '/\./g' | awk '{print $1"\t"$2}' '-' > Temp0706AS.txt

The second script is working and follows my straightforward logic.

The first script sometimes does work; why should that be so ?

My source files contain IPv6 addresses for each PTR, but none of these PTR's can be otherwise
resolved; even when dig returns an appropriate nameserver, that nameserver nearly always turns
out to be unavailable. At times, a nonauthoritative nameserver will reply with an IPv4 address
corresponding to ...barefruit.co.uk, a catchall site.

George Langford

Attachment       Size
Temp0706AR.txt   291.28 KB
Temp0706AS.txt   291.29 KB

Magic Banana

Offline
Joined: 07/24/2010

sort -k 1 Temp0706AR.txt > Temp0706AS.txt

As I have already told you many times, '-k 1' does not make any difference. It means "sort the lines considering *everything starting with* the first field". You certainly mean '-k 1,1', which means "sort the lines considering *only* the first field".

That command behaves as documented. Properly using 'sort' is your problem. It is not sort that "has a problem": it cannot magically guess what you want. You have to tell it, using the proper options.
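
The difference between '-k 1' and '-k 1,1' can be made visible with a toy input. This sketch assumes a sort that supports -s (stable sort, as GNU sort does); -s is added here only for the demonstration, because without it sort breaks ties by comparing the whole line, which tends to hide the difference. The three input lines are made up.

```shell
# Made-up three-line input; the two "a" lines tie on field 1.
input=$(printf '%s\n' 'b 2' 'a 9' 'a 1')

# -k 1: the key runs from field 1 to the end of the line,
# so "a 1" sorts before "a 9".
whole_line_key=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 1)

# -k 1,1: the key is field 1 only; with -s, tied lines keep
# their input order, so "a 9" stays before "a 1".
first_field_key=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 1,1)

printf -- '-k 1:\n%s\n' "$whole_line_key"
printf -- '-k 1,1:\n%s\n' "$first_field_key"
```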

awk '{print $1,$2}' 'Temp0706AR.txt' | ...

That AWK program is useless: just give Temp0706AR.txt as an argument to the subsequent command.

... | sed 's/\./'\ '/g' | sort -k 3 | ...

Just specify '.' as sort's field separator. 'info sort' explains:
‘-t SEPARATOR’
‘--field-separator=SEPARATOR’
Use character SEPARATOR as the field separator when finding the sort keys in each line. By default, fields are separated by the empty string between a non-blank character and a blank character. By default a blank is a space or a tab, but the ‘LC_CTYPE’ locale can change this.
That is, given the input line ‘ foo bar’, ‘sort’ breaks it into fields ‘ foo’ and ‘ bar’. The field separator is not considered to be part of either the field preceding or the field following, so with ‘sort -t " "’ the same input line has three fields: an empty field, ‘foo’, and ‘bar’. However, fields that extend to the end of the line, as ‘-k 2’, or fields consisting of a range, as ‘-k 2,3’, retain the field separators present between the endpoints of the range.

And, again, you probably mean '-k 3,3' rather than '-k 3'. Besides the specification of option -k, 'info sort' includes good examples such as:
Sort numerically on the second field and resolve ties by sorting alphabetically on the third and fourth characters of field five. Use ‘:’ as the field delimiter.
sort -t : -k 2,2n -k 5.3,5.4
Note that if you had written ‘-k 2n’ instead of ‘-k 2,2n’ ‘sort’ would have used all characters beginning in the second field and extending to the end of the line as the primary _numeric_ key. For the large majority of applications, treating keys spanning more than one field as numeric will not do what you expect.
Also note that the ‘n’ modifier was applied to the field-end specifier for the first key. It would have been equivalent to specify ‘-k 2n,2’ or ‘-k 2n,2n’. All modifiers except ‘b’ apply to the associated _field_, regardless of whether the modifier character is attached to the field-start and/or the field-end part of the key specifier.

So in the end, I believe you only want:
$ sort -t . -k 3,3 Temp0706AR.txt > Temp0706AS.txt
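
A sketch of that command on made-up data. The host names below are hypothetical and only assume that the label to group on is the third dot-separated field; the real files may need a different field number.

```shell
# Hypothetical PTR<TAB>IPv6 lines; the third dot-separated field is the group label.
sample=$(printf '%s\t%s\n' \
  'x9.dyn.lns03.example.net' '2001:db8::9' \
  'x1.dyn.cable.example.net' '2001:db8::1' \
  'x5.dyn.lns01.example.net' '2001:db8::5' \
  'x2.dyn.lns03.example.net' '2001:db8::2')

# Use '.' as the field separator and sort on the third field only.
sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort -t . -k 3,3)

# Show the grouping label of each sorted line.
groups=$(printf '%s\n' "$sorted" | cut -d . -f 3)

printf '%s\n' "$sorted"
```

The lines come out grouped as cable, lns01, lns03, lns03, and each line is otherwise unchanged, so no sed or awk pass is needed to restore the dots.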

... | awk '{print $1,$2,$3,$4,$5"\t"$6}' '-' | sed 's/'\ '/\./g' | awk '{print $1"\t"$2}' '-'

The two AWK programs are useless and there is unnecessary quoting and escaping in sed's substitution. In other words, all that is equivalent to:
sed 's/ /./g'
But there is no good reason to use sed when tr applies:
tr ' ' '.'
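
A quick check that the two forms agree, on a made-up line:

```shell
# Made-up line with spaces where the dots used to be.
line='x1 dyn lns03 example net'

with_sed=$(printf '%s\n' "$line" | sed 's/ /./g')
with_tr=$(printf '%s\n' "$line" | tr ' ' '.')

printf 'sed: %s\ntr:  %s\n' "$with_sed" "$with_tr"
```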