Unforeseen feature of the join command
Here's a task that has been defeating me for a couple of weeks:
Start with a list of items in a database file: ListA.txt
Combine with the original database: ListB.txt
The steps that I've been using:
awk '{print $1}' 'ListA.txt' | sort -k1 | uniq -c > ListA-Counts.txt
Note: there are no duplicates, for the sake of simplicity.
Then:
join -a 2 -1 1 -2 2 <(sort -k1 ListB.txt) <(sort -k2 ListA-Counts.txt) > TempX.txt ;
Followed by:
awk '{print $1"\t"$2"\t"$4"\t"$3}' 'TempX.txt' | sort -k1 > Test-DV.txt ; rm TempX.txt
The output file has twenty-five incomplete rows in which the second column is missing,
leaving Columns $1, $3, and $4 shifted into positions $1, $2, $3. See Test-DVS.txt (sorted on $4).
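One quick way to list those incomplete rows (a sketch: print every output line with fewer than four fields):
$ awk 'NF < 4' Test-DV.txt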
George Langford, surrounded by Covid19 in SE PA
Attachment | Size |
---|---|
Test-DVS.txt | 4.66 KB |
ListA.txt | 1.89 KB |
ListA-Counts.txt | 3.88 KB |
ListB.txt | 4.36 KB |
Test-DV.txt | 4.66 KB |
This has occurred before; see:
https://www.researchgate.net/post/The_join_command_seems_to_be_messing_up_when_using_numbers_as_the_comparing_fields_Any_ideas
Alas, my actual data are IPv4 addresses, by the millions; so that workaround may be unworkable.
Here's what I found that works; I simply append a letter to each "proxy-IP address"
for the duration of the processing and then remove it afterwards.
Workaround:
awk '{print$1"\t""w"}' 'ListA.txt' | tr -d "\t" > ListA-w.txt ... somewhat redundant
awk '{print $1"w",$2, $3, $4}' ListB.txt > ListB-w.txt .......... this works also
Then
awk '{print $1}' 'ListA-w.txt' | sort -k1 | uniq -c > ListA-w-Counts.txt
Followed by:
join -a 2 -1 1 -2 2 <(sort -k1 ListB-w.txt) <(sort -k2 ListA-w-Counts.txt) > TempX-w.txt ;
And on the same line:
awk '{print $1"\t"$2"\t"$4"\t"$3}' 'TempX-w.txt' | sort -k1| tr -d "w" > Test-DV-w.txt ; rm TempX-w.txt
The end result is attached; the other files can be generated from the original ListA.txt and ListB.txt.
Watch out for any other entries with "w" in the name! Purely IPv4 data are safe, of course.
George Langford
Attachment | Size |
---|---|
Test-DV-w.txt | 4.85 KB |
'sort -k 1' is the same as 'sort'. Indeed, giving one single integer N (here 1) to option -k means "*from* the Nth column to the last one". Here is the relevant excerpt from 'info sort':
‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
Specify a sort field that consists of the part of the line between
POS1 and POS2 (or the end of the line, if POS2 is omitted),
_inclusive_.
Writing "-k N" when you actually want "-k N,N" is the root of your problem here. The join fields of ListB-w.txt and ListA-w-Counts.txt end up not sorted in the same way. They have to for 'join' to properly work.
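One way to catch this kind of mismatch is to let the tools verify the ordering themselves; as a sketch with GNU options (silent when the order is fine, complaining otherwise):
$ sort -k 1 ListB.txt | sort -c -k 1,1           # is the output of 'sort -k 1' really ordered on field 1 alone?
$ sort -k 2 ListA-Counts.txt | sort -c -k 2,2    # same question for the counts file
$ join --check-order -1 1 -2 2 <(sort -k 1 ListB.txt) <(sort -k 2 ListA-Counts.txt) > /dev/null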
Also, simplifying the sequence of commands (dropping apparently useless steps and options), you get:
$ join -2 2 <(sort -k 1,1 ListB.txt) <(sort ListA.txt | uniq -c) | awk '{ print $1 "\t" $2 "\t" $4 "\t" $3 }'
An alternative:
$ awk 'FILENAME == ARGV[1] { ++count[$0] } FILENAME == ARGV[2] { print $1 "\t" $2 "\t" count[$1] "\t" $3 }' ListA.txt ListB.txt | sort -k 1,1
I do not know which one is faster: tests are needed.
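A rough way to compare them is to wrap each pipeline in bash's time keyword and throw the output away:
$ time (join -2 2 <(sort -k 1,1 ListB.txt) <(sort ListA.txt | uniq -c) | awk '{ print $1 "\t" $2 "\t" $4 "\t" $3 }' > /dev/null)
$ time (awk 'FILENAME == ARGV[1] { ++count[$0] } FILENAME == ARGV[2] { print $1 "\t" $2 "\t" count[$1] "\t" $3 }' ListA.txt ListB.txt | sort -k 1,1 > /dev/null)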
Magic Banana comes to the rescue yet again!
I tried the two suggested scripts: The second takes half as much CPU time as the first.
I applied MB's first suggestion to a couple of multi-megabyte files, obfuscated to
protect the innocent:
awk '{print $1, $3, $2}' 'DataX-VpDCts.txt' > Temp01.txt ;
awk '{print $1, $3, $2}' 'DataY-DVCts.txt' > Temp02.txt ;
join -2 1 <(sort Temp01.txt) <(sort Temp02.txt) > Temp03.txt ;
awk '{print $1"\t"$3"\t"$5"\t"$2}' 'Temp03.txt' | sort -nrk 2,2 > DataZ-VpDCts-DVCts-Keyed.txt ;
rm Temp01.txt Temp02.txt Temp03.txt
Forgive me for using crutches ... but the result has ~300,000 rows, four columns, no
missing cells, and the script(s) took less than a second of real time for processing.
Here's the story behind the two subject files:
Each has three columns: IPv4, Count 1 or Count 2, Textfile name.
The join command is meant to produce IPv4, Count 1, Count 2, Textfile name in four columns,
and it does so OK.
A few of the IPv4's visited more than one Textfile name; and some IPv4's visited
a single Textfile name multiple times. Therefore, some IPv4's appear up to twenty
or more times in successive rows when the resulting file is sorted on the IPv4 column.
A total of thirty-two sets of domain visitor data were examined; some IPv4's visited
twenty-nine of the thirty-two domains. Over a thousand IPv4's visited a single domain
over a thousand times. The maximum was over 100,000 visits in the month examined.
I note on looking at the resulting files that my earlier generated domain visit
counts are inaccurate, but that it is not the result of an inaccurate join.
More to come, it appears.
George Langford
Each has three columns: IPv4, Count 1 or Count 2, Textfile name.
Above, I assumed ListA.txt had one single column (because the attached file is like that). If there are more columns, then ++count[$0] must be replaced with ++count[$1].
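With that substitution, the alternative one-liner presumably becomes:
$ awk 'FILENAME == ARGV[1] { ++count[$1] } FILENAME == ARGV[2] { print $1 "\t" $2 "\t" count[$1] "\t" $3 }' ListA.txt ListB.txt | sort -k 1,1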
For performance, use pipes instead of temporary files and avoid useless commands. As far as I understand the beginning of your last post (very little), you do not need to reorder the input fields and can get rid of the last one in DataY-DVCts.txt (your awk program does not print it). In the end, it looks like you could write:
$ join <(sort -k 1,1 DataX-VpDCts.txt) <(cut -d ' ' -f -2 DataY-DVCts.txt | sort -k 1,1) | awk '{ print $1 "\t" $2 "\t" $4 "\t" $3 }' | sort -nrk 2,2
Yesterday in some despair, I wrote:
> I note on looking at the resulting files that my earlier generated domain visit
> counts are inaccurate, but that it is not the result of an inaccurate join.
With a better grasp of the scripting & join process, I re-started from the
collected original data, cleaned out the invalid material and (a great many)
IPv6 data and repeated the appropriate steps. Now, the domains-visited counts
are correct and the visits per domain data are carried through intact; there
are no empty cells.
I had been guilty of not reading (i.e., unaware of !) the "info sort" material;
even if I had, Magic Banana would still have had to interpret it for me. "cut"
is not yet in my scripting vocabulary.
I looked into that so as to learn what "cut -d ' ' -f -2" means in the present
context. I've used "tr ':' '\t'" to separate the IPv6 data into fields so that I
could use LibreOffice Calc to sort the fourth column so as to collect all the
IPv6 data into one series of lines in the spreadsheet. Now I'll use it again
to reconstruct the IPv6 files in their own domain.txt, IPv6, Visits-per-domain
spreadsheet file. As long as I'm not making entries one-at-a-time, I'm happy
when my inefficient script does the same task in less than an hour.
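As a minimal illustration of that tr trick (on a made-up, uncompressed IPv6 address), every colon becomes a tab, so the eight groups land in eight spreadsheet columns:
$ echo '2001:0db8:0000:0000:0000:0000:0000:0001' | tr ':' '\t'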
I can look up the PTR records that go with the IPv4 data, but in my parallel
study of the gratuitously resolved IP address data (now about 90% PTR's or
hostnames) the reverse is extraordinarily tedious. I've been using Google to
gather the server CIDR data (with nmap's ASN-query function) for the multi-
addressed PTR's, and Google does not like scripts ... I am challenged every
few minutes while at that task.
I adjusted Magic Banana's suggested script for my actual data:
From:
join <(sort -k 1,1 DataX-VpDCts.txt) <(cut -d ' ' -f -2 DataY-DVCts.txt |
sort -k 1,1) | awk '{ print $1 "\t" $2 "\t" $4 "\t" $3 }' | sort -nrk 2,2
To:
join <(sort -k 1,1 DataX-VpDCts.txt) <(cut -d ' ' -f -2 DataY-DVCts.txt |
sort -k 1,1) > Temp03.txt ; awk '{ print $1 "\t" $2 "\t" $4 "\t" $3 }' 'Temp03.txt' | sort -nrk 2,2 > Output.txt ; rm Temp03.txt
With much the same result as mine for those incomprehensible steps; alas, it shows
that there are some errors in that data (not in the scripts), which I've corrected as
described at the top of this post by doing the analysis in a more efficient sequence.
Thanks again,
George Langford
I had been guilty of not reading (i.e., unaware of !) the "info sort" material
'man sort' specifies the expected argument of option -k, although in a much more arid way (as always for GNU commands: 'info' provides the full documentation):
-k, --key=KEYDEF
sort via a key; KEYDEF gives location and type
(...)
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
"cut" is not yet in my scripting vocabulary.
'cut' can select fields, specified after -f as you specify pages to print (so "-2" means "every field up to the second one"; I could have written "1,2" for the same effect), delimited by one single character (the tabulation is the default, but a different one can be specified after -d). See 'info cut' for the full documentation.
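For instance, on a made-up three-field line delimited by single spaces:
$ echo '203.0.113.7 42 somedomain.txt' | cut -d ' ' -f -2
203.0.113.7 42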
Regarding "sort" ...
After going back to my collection of gratuitously looked-up hostnames, and after organizing
the potential spreadsheet so as to contain the essential information: hostname, visits-per-domain,
domains visited, and target domain, I realized that LibreOffice Calc. cannot handle more than
a million rows ... my file containing 1.7 million ... I decided to sort the file, first on
Column 3, then Column 1, and then on Column 2.
Accordingly, I rearranged the columns thusly: $3, $1, $2, $4 and sorted with: "sort -nrk 1,4"
where "nr" puts the biggest numbers at the top of the column, but sort evidently did not reach
to the third column, resulting in an ordering of only hostname and visits-per-domain.
That was sufficient, because it allowed me to split the file at the 1,000,000 mark, whereupon
I completed the sort with Calc. Rows from 1,000,001 onward (about 700,000 of them) have about one domain visit
per hostname; the actual numbering of the secondary portion is 1 through 700,001, of course.
Good thing I'm doing it this way: sorting the secondary list thusly: $2, $1, $3, $4 reveals
some eye-popping numbers: The uppermost ~1500 of those rows have visitors making one request
per hour up to over three visits per second (upwards of 900,000 in a month) ... the uppermost
visits-per-domain were to a meteorological domain, but there were other visitors with less
understandable motives making about a hundred visits per hour.
George Langford
I realized that LibreOffice Calc. cannot handle more than a million rows ...
Spreadsheets are only meant for computations on small amounts of data. To store large amounts of data, use text files or a database management system.
I decided to sort the file, first on Column 3, then Column 1, and then on Column 2.
Accordingly, I rearranged the columns thusly: $3, $1, $2, $4 and sorted with: "sort -nrk 1,4" where "nr" puts the biggest numbers at the top of the column, but sort evidently did not reach to the third column, resulting in an ordering of only hostname and visits-per-domain.
'sort -k 1,4' uses the part of the line up to column 4 (one single string) to sort. It is not what you want, and neither is 'sort -k 1,3'. What you want ("sort the file, first on Column 3, then Column 1, and then on Column 2") is achieved using three times option -k (where the order matters): 'sort -k 3,3 -k 1,1 -k 2,2'.
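Applied to the four-column file described above, and assuming (my guess) that columns 2 and 3 hold the counts and should come out largest first, that could look like the line below, with no need to rearrange the columns beforehand (hostname-visits.txt stands in for the actual file):
$ sort -k 3,3nr -k 1,1 -k 2,2nr hostname-visits.txt > hostname-visits-sorted.txt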
Again: read 'info sort'. At the end of it, there are even well-explained examples with multiple -k options, starting with this one:
Sort numerically on the second field and resolve ties by sorting
alphabetically on the third and fourth characters of field five.
Use ‘:’ as the field delimiter.
sort -t : -k 2,2n -k 5.3,5.4
Note that if you had written ‘-k 2n’ instead of ‘-k 2,2n’ ‘sort’
would have used all characters beginning in the second field and
extending to the end of the line as the primary _numeric_ key. For
the large majority of applications, treating keys spanning more
than one field as numeric will not do what you expect.
Also note that the ‘n’ modifier was applied to the field-end
specifier for the first key. It would have been equivalent to
specify ‘-k 2n,2’ or ‘-k 2n,2n’. All modifiers except ‘b’ apply to
the associated _field_, regardless of whether the modifier
character is attached to the field-start and/or the field-end part
of the key specifier.
We wrote:
> I realized that LibreOffice Calc. cannot handle more than a million rows ...
> Spreadsheets are only meant for computations on small amounts of data.
> To store large amounts of data, use text files or a database management system.
Starting with 134 sets of recent visitor data, the spreadsheet comes to 1.7
million rows, eventually expanding to about fifteen columns. It's not
unwieldy yet at that point, even with an adults' table and a kids' table.
Your interest & expertise in data mining can best be brought to bear after
the spreadsheet is populated to the extent that I envision. The multi-
address PTR records can have millions of cells for each such name. There
will be patterns in the visits to the domains and their varied subject
matter. Some CIDR address spaces are filled with large numbers of different
multi-address PTR records, which demand the database treatment. Not to
mention the IPv6 data, which have ballooned to about a third of the entire
data set.
The number of ISP's that choose to publish the recent visitor data will
grow exponentially (I hope), and that will make my approach burdensome;
it's already doubled between June 2019 and January 2020.
It's great that sort can be "trained" to do the same kind of multi-stage
sorting task that appears to be built into LibreOffice Calc. By June I'll
be forced to face up to that homework assignment.
George Langford
It's great that sort can be "trained" to do the same kind of multi-stage sorting task that appears to be built into LibreOffice Calc.
It is built in, not trained. And GNU sort is certainly much faster than what LibreOffice achieves on spreadsheets, which are also limited in the number of rows they can have.
What I wrote yesterday was tongue-in-cheek:
>It's great that sort can be "trained" to do the same kind of multi-stage sorting task that appears to be built into LibreOffice Calc.
What I meant was that its user [me] can be trained ...
Once the sort is complete on domains-visited and PTR's, the remaining ordering is within each range of domains visited, no more than
128 in the current spreadsheet. When that number stretches into the thousands, the agility of sort becomes even more essential.
The other day, I tried to accomplish a too large task with grep, which was well on its way to consuming all the RAM & swap when I
stopped it. After dividing the target into about ten pieces, the next try completed in a few minutes.
George Langford
Thousands of lines are nothing. GNU sort can process millions if not billions. It automatically uses temporary files so as not to run short of RAM.
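If need be, the amount of main memory and the directory for those temporary files can even be chosen explicitly, for instance (size and path made up):
$ sort -S 1G -T /var/tmp -k 1,1 big-file.txt > big-file-sorted.txt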
Your experience with grep is weird. As far as I understand, grep's memory requirements do not depend on the number of lines (which could be infinite). It processes one single line, outputs it if and only if it matches the regular expression, then forgets about it to process the subsequent line and so on.
Magic Banana is perplexed by my experience with grep's memory usage:
> Your experience with grep is weird. As far as I understand, grep's memory requirements
> do not depend on the number of lines (which could be infinite). It processes one single
> line, outputs it if and only if it matches the regular expression, then forgets about
> it to process the subsequent line and so on.
My example had a pattern file of 7700 rows and a target of 2200 rows; the grep script
searching that combination saturated both 7.7 GB of RAM and 18 GB of swap.
I divided the 7700 rows into eight pieces of 1000 rows each (except the last one of 700
rows). Those eight grep scripts took about 20 seconds each to finish without taxing
either form of memory.
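For what it is worth, if those 7700 patterns are literal IPv4 addresses rather than regular expressions, telling grep to treat them as fixed strings (option -F) usually keeps the memory use modest and avoids splitting the pattern file (file names below are placeholders):
$ grep -F -f patterns.txt target.txt > matches.txt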
George Langford