The join command is missing the IPv4 addresses in long mixed lists of strings

6 replies [Last post]
amenex
Offline
Joined: 01/03/2015

When I look for matching strings in a pair of one-column files, join has been ignoring the entries that are IPv4 addresses.

When I try the join command with short files, admittedly seeded with inserted data to complement the existing matches, the
result includes all the matching IPv4 addresses.

When I apply the same command to a pair of much longer files, all the IPv4 addresses are ignored.

Here are the two commands:

First: join -1 1 -2 1 file01.txt file02.txt &> Join-01-02.txt [It doesn't matter whether or not I add --nocheck-order]

Second: join -1 1 -2 1 /pathtofileA/fileA.txt /pathtofileB/fileB.txt &> Join-A-B.txt [Sorting the files twice doesn't help]

The joined output in the first instance includes the matching IPv4 addresses; in the second, no IPv4 addresses are listed,
but I'm sure there are many matching fields.

When I visually picked sequences in each original file that encompass several confirmed matches, including both alphanumeric
and plain unencumbered IPv4 addresses in four-octet format, the joined output file includes both types of strings.

I checked whether the paths interfere: it doesn't matter whether the input files are in the same folder or in different
ones. But for the larger files (6 MB and 2 MB) the join command skips all the IPv4 entries; when the files are in the same
directory, the command takes twice as long (0.008 sec.) as when they are in different directories (0.004 sec.). Adding the
--nocheck-order argument doesn't change anything except silencing join's complaints about the sorting.

I even tried viewing the System Monitor during the large-file sorting. Nothing ...

Then I realized that it might be better to put the smaller of the files to be joined first ... nope.

Lastly, I split each file at a common matching IPv4 address near the halfway point; the sorting places most of the
numerical IPv4 addresses in the top half of each file, with nearly all alphanumeric entries from there to the end of the
file. I had >||< this much success: the joined front-half pair contains no IPv4 addresses, not even the known one, and
misses some known matches. The joined back-half pair contains a few matched IPv4 addresses, but other known matches
are not represented in the joined output. The matching IPv4 address at the break point is the same for all four files.

One of the longer files is short enough for testing: GBsmt-front.txt

My ThinkPad T420 has 8GB of memory, running Trisquel's flidas operating system.

George Langford

Attachment: GBsmt-front.txt (547.31 KB)
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

'join' takes as input two (paths to) text files. It does not matter what that text is. 'join' does not know what an IP address is, because it does not need to. Attach the two text files you are joining, if you want us to take a look.

Forget about the size of the input data: 'join', like any GNU text-processing command, can process arbitrarily long inputs. The problem is not there. The problem is almost certainly that the files are not ordered (with the 'sort' command and the same locale) w.r.t. their join fields. Have you read 'info join'?

Additional remarks:

  • Writing "-1 1 -2 1" is like writing nothing, because 1 is the default value for both options.
  • You certainly do *not* want to use --nocheck-order, to be warned if the files are not ordered w.r.t. their join fields.
  • You probably want to redirect the sole standard output (with ">") instead of both the standard and the error output (with "&>"), to have errors/warnings (such as one about a misordered file) appearing in the terminal.
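Putting those remarks together, here is a minimal sketch of what I am suggesting (the file names are your own; forcing the C locale is just one way to guarantee that 'sort' and 'join' agree on the ordering, and for your one-column files a plain sort on the whole line is enough):

    LC_ALL=C sort file01.txt > file01.sorted.txt
    LC_ALL=C sort file02.txt > file02.sorted.txt
    LC_ALL=C join file01.sorted.txt file02.sorted.txt > Join-01-02.txt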
amenex
Offline
Joined: 01/03/2015

In the process of proving myself wrong, I did the following experiment:

1. Make a shuffled version of the file GBsmt-front.txt.
2. Make its number of lines divisible by four (by temporarily deleting one line).
3. Divide GBsmt-front-shuf into four parts (A, B, C, D); restore the deleted line to part D.
4. Sort each of the four parts with the console (previously I had been trusting the sorted output of LibreOffice Calc).
5. Sort GBsmt-front.txt (again, just to be sure, with the console).
6. Run the join command four times, joining each of the A, B, C, & D portions of GBsmt-front-shuf against GBsmt-front.txt.
7. Interim reality check: The sum in kB of the four output files equals the size of the original GBsmt-front.txt file.
8. For-sure reality check: Concatenate the four outputs of the above join command, sort, and compare to the original GBsmt-front.txt list.

After all this manipulation, the two files (GBsmt-front.txt and GBsmt-front-shuf-(A,B,C,D)-join-concatenate-sort.txt) are identical.
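For the record, the whole experiment boils down to a handful of console commands; this is only a rough, untested sketch with invented names for the intermediate files (GNU split's "-n l/4" option sidesteps steps 2 and 3 because it never cuts a line in half):

    shuf GBsmt-front.txt > GBsmt-front-shuf.txt       # step 1
    split -n l/4 GBsmt-front-shuf.txt part-           # steps 2-3: makes part-aa ... part-ad
    sort GBsmt-front.txt > GBsmt-front-sorted.txt     # step 5
    for p in part-a?                                  # steps 4 and 6
    do
        sort "$p" | join - GBsmt-front-sorted.txt > "joined-$p"
    done
    sort joined-part-a? > recombined.txt              # step 8: concatenate and sort the four join outputs
    cmp recombined.txt GBsmt-front-sorted.txt         # no output means the two files are identical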

Then I tried the original task, and now the IPv4 addresses appear in the joined output. As long as I sort each of the files
to be joined right before the join operation, the command doesn't complain ... and the IPv4 data appear in droves.

Thanks to Magic Banana for confirming that join doesn't have undisclosed limitations.

Another tidbit: sorting a file with "sort [file]" alone sends the sorted output to the console, and "sort [file] > [file]"
(redirecting back onto the same file) leaves a 0-byte file. My two-step "solution", "sort [file] > [file-newname]" followed by
"mv [file-newname] [file]", preserves the original file name and leaves no residue. The cleaner way to do this in line with
the join command is "join <(sort [file01]) <(sort [file02]) > Joined-file0102.txt" (see https://shapeshed.com/unix-join/).
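(For what it's worth, GNU sort can also rewrite a file in place with its -o option, because the output file named there is allowed to be the same as the input; that collapses my two-step rename into a single command:)

    sort -o [file] [file]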

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

About the end of your post:

  • You cannot both read from and write to the same file; your "two-step solution" is OK.
  • I did not know it was OK to redirect the standard input twice; to avoid touching the disk I would have created named pipes, as in this short (untested) script:
    #!/bin/sh
    mkfifo file1.sorted file2.sorted
    sort -k 1b,1 file01 > file1.sorted &
    sort -k 1b,1 file02 > file2.sorted &
    join file1.sorted file2.sorted > Joined-file0102.txt
    rm file1.sorted file2.sorted
amenex
Offline
Joined: 01/03/2015

Magic Banana wrote:

> I did not know it was OK to redirect twice the standard input; to avoid touching the disk I would have created named pipes,
> as in this short (untested) script ...

After spending the intervening time making about 150 joins, combining eighteen sets of four-month Webalizer data every which
way, I see that your suspicions may be well founded. The smaller output files (30 to several hundred kB) are clean-looking,
but the largest ones (1500 kB down to ~700 kB) have duplicated rows, nearly exclusively. They'll have to have the duplicates
removed during post-processing ... and be checked for errors.

Before I start that processing, I'll see if I can try out your script; the extra steps won't be any drag on the joining,
as the longest times for any joins were still in the blink-of-an-eye category (0.044 sec. system time).

I've been pairing up the most recent data with all of the prior data, one pair at a time, and that's getting tedious. The
"info join" page says that one of the target fields (but not both !) can be read from standard input. In these repetitive
joins that I'm doing now, can one of the target fields be read from a file that lists the other target files ? I've got
fifteen more sets of data, so this file list can grow ... up to thirty-two now, but almost without end if one looks at the
number of Webalizer data sets that are available.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

> I see that your suspicions may be well founded. The smaller output files (30 to several hundred kB) are clean-looking, but the largest ones (1500 kB down to ~700 kB) have duplicated rows, nearly exclusively.

That was not suspicion. I was saying that you taught me something. I thought the shell would spit out an error. Since it does not, it is probably valid syntax, doing what the user expects it to do. However, I suspect it may be a "bashism", i.e., a syntax that not all shells accept.

join's output certainly has duplicates because the input files have duplicates (is that normal?). Just add the option --unique (or simply -u) to the sort commands.

> They'll have to have the duplicates removed during post-processing ... and be checked for errors.

Do not do that as a post-process. As I have just written: just add the option --unique (or simply -u) to the sort commands. It actually makes their execution faster, and that of 'join' too (smaller inputs and output).

> Before I start that processing, I'll see if I can try out your script; the extra steps won't be any drag on the joining, as the longest times for any joins were still in the blink-of-an-eye category (0.044 sec. system time).

Using two named pipes may actually be faster than using two subshells (which is what happens when you put commands between parentheses)... by a constant amount of time you should not care about. Only optimize at the end, if necessary, after ensuring the whole process is correct and after identifying the bottleneck (usually one single command).

> I've been pairing up the most recent data with all of the prior data, one pair at a time, and that's getting tedious.

You can use a Shell loop (or two). For instance, if what you call "prior data" and "most recent data" are files in two separate directories and you want all pairs, then you can pass these two directories as the two arguments of a script like this one:
#!/bin/sh
mkfifo old.sorted
for old in "$1"/*
do
    for new in "$2"/*
    do
        out=joined-$(basename "$old")-$(basename "$new")
        sort -uk 1b,1 "$old" > old.sorted &
        sort -uk 1b,1 "$new" | join old.sorted - > "$out"
    done
done
rm old.sorted
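Assuming that script is saved as, say, joinall.sh (an invented name), with the prior data in a directory called old and the most recent data in a directory called new, the invocation would simply be:

    sh joinall.sh old new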

"info join" page says that one of the target fields (but not both !) can be read from standard input.

One of the two input files (not "target fields": there is no such thing), yes. I did it above, to give you an example.

> In these repetitive joins that I'm doing now, can one of the target fields be read from a file that lists the other target files ?

You can do such a thing in a Shell script, using 'while read line; do ...; done < file'. Don't you prefer to organize files in directories and specify these directories, as I suggested above?
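For completeness, here is a rough, untested sketch of that 'while read' variant, assuming an invented list.txt with one prior file per line and a single most recent file called newest.txt:

    #!/bin/sh
    sort -uk 1b,1 newest.txt > newest.sorted
    while read -r old
    do
        sort -uk 1b,1 "$old" | join newest.sorted - > "joined-$(basename "$old")"
    done < list.txt
    rm newest.sorted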

Remark: do you join files whose join fields are the whole lines (there are no additional fields)? In other words, are you searching for equal lines in two files? If so, then you actually want to use 'comm -12' instead of 'join'. 'comm' is a simpler command, to compare the (sorted) lines of two files.
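For instance (again with invented names), the following prints only the lines that appear in both files:

    sort -u file01.txt > file01.sorted
    sort -u file02.txt > file02.sorted
    comm -12 file01.sorted file02.sorted > common-01-02.txt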

amenex
Offline
Joined: 01/03/2015

Considering that at present there are only eighteen source files to be joined, I applied my two-step approach to the task of
first checking the sort status of my input files, then sorting them, and finally re-checking them, for example:

> $ sort -c HBgky.txt
> sort: HBgky.txt:2: disorder: 4532218122
> $ sort HBgky.txt > HBgkyA.txt
> $ mv HBgkyA.txt HBgky.txt
> $ sort -c HBgky.txt [null response]

All eighteen had sorting flaw(s) ... took about half an hour to perform these fixes; but read on ... LVgky.txt had duplicates !

Then I set about the task of removing duplicates from my 154 joined pairs, starting with the largest ones first, on the theory that
the largest ones are the ones with duplicates and the smaller ones won't have any:

> tr -s ' ' < Join-LVgky-NWrkr.txt | sed 's/ $//' | sort -u > LVgky-NWrkr.txt ==> from 1400 kB to 698 kB, properly sorted ... OK so far
> join LVgky-NWrkr.txt NWrkr.txt > LVgky-NWrkr-NWrkr.txt ==> LVgky-NWrkr-NWrkr.txt file length matches LVgky-NWrkr.txt file length ... Great !

At this point, I rejoiced: the duplicates were all still common to both files ... being suspicious, I continued:

> join LVgky-NWrkr.txt LVgky.txt > LVgky-NWrkr-LVgky.txt ==> LVgky-NWrkr-LVgky.txt file length is doubled to 1400 kB ... WaitaMinnit !

Aha ! LVgky.txt was the culprit ... it was full of duplicates.

> tr -s ' ' < LVgky-NWrkr-LVgky.txt | sed 's/ $//' | sort -u > LVgky-NWrkr-LVgky-test.txt ==> File length halved as for the first no-dupes script.

Magic Banana suspected that:

> join's output certainly has duplicates because the input files have duplicates (is that normal?). Just add the option --unique (or simply -u)
> to the sort commands.

My repair task was therefore reduced to the removal of duplicates from all the joined pairs of LVgky.txt and the other seventeen input files,
vastly easier than redoing the 154 join pairs that I had created ... just eighteen repairs in all, including the repair to LVgky.txt.
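A small loop along these lines (untested; the .tmp suffix is just an invented temporary name, and it assumes that every joined pair involving LVgky.txt matches the pattern Join-LVgky-*.txt) would apply the same repair to all of them in place:

    for f in Join-LVgky-*.txt
    do
        tr -s ' ' < "$f" | sed 's/ $//' | sort -u > "$f.tmp" && mv "$f.tmp" "$f"
    done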

Re-joining the repaired output pair against the repaired LVgky.txt
(remembering that both files have been checked for duplicates and their sort status):

> join LVgky-NWrkr.txt LVgky.txt > LVgky-NWrkr-LVgky-test.txt ==> 698 kB as for NWrkr.txt. OK !

All this hassle would have been avoided with the modified join command in spite of the dire warning in info join:

>> ‘join’ writes to standard output a line for each pair of input lines that have identical join fields. Synopsis:
>>
>> join [OPTION]... FILE1 FILE2
>>
>> Either FILE1 or FILE2 (but not both) can be ‘-’, meaning standard input. FILE1 and FILE2 should be sorted on the join fields.

Here are my processing choices:

Sort each input file with my two-step routine, then apply the simple join command (forgetting to check the input files for duplicates)...
or
Apply Magic Banana's "sort --unique" concurrently to the modified join command:

> join <(sort -u [file01]) <(sort -u [file02]) > Joined-file0102.txt

For our Trisquel (flidas) the <( ... ) form works fine; as it turns out, each sorted list is handed to join through its own file descriptor, so neither input actually arrives on standard input and info join's "but not both" restriction is not being bent after all.

Regarding the task of pairing up each newly acquired input data file with all the prior ones (an ever-increasing list), a script would
indeed be a welcome addition to our set of tools as the number of online Webalizer data sets increases into the thousands ...