Script needed to compare one two-column file with another two-column file

4 replies [Last post]
amenex
Offline
Joined: 01/04/2015

Here's my task in plain language:

Compare two files of different sizes, each with two columns:
For each element in the first column of the larger file,
find all matches to that element in the first column of the smaller file, and
then print the matching row(s) of the larger file, followed by the matching rows of the smaller file.

Sometimes there will be more than one element in the larger file that is duplicated. For those,
print the duplicated rows from the larger file before printing the matching rows of the smaller file.

The output file should be two columns, with the first column containing the pertinent
(i.e., duplicated) elements, and the second column listing all the associated instances.

There aren't any duplications in the second column of either file;
and nothing in either second column matches anything in the other second column.

This seems easy enough to explain and to perform by visual inspection,
but only a script will work for the main files.
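
To make the desired mapping concrete, here is a made-up miniature example (hypothetical entries, not taken from the attached files):

$ cat LargerFile.txt        # hypothetical sample
host.example.org 10.0.0.1
host.example.org 10.0.0.2
other.example.net 10.0.0.3
$ cat SmallerFile.txt       # hypothetical sample
host.example.org NameA
unmatched.example.com NameB
$ cat OutputFile.txt        # desired result: larger-file rows for the shared hostname, then its smaller-file rows
host.example.org 10.0.0.1
host.example.org 10.0.0.2
host.example.org NameA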

I'm attaching three files: LargerFile.txt; SmallerFile.txt; and OutputFile.txt.
They've been obtained by visually matching the Real-LargerFile.txt (500,000 rows)
with the Real-SmallerFile.txt (200,000 rows) and then doing some obfuscation.

All files are sorted on their first columns.

George Langford

Attachment	Size
LargerFile.txt	752 bytes
SmallerFile.txt	408 bytes
OutputFile.txt	869 bytes
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

First of all:

  • SmallerFile_0.txt is not sorted (conceptcable.com would be first): below, I sort the files;
  • I do not understand why OutputFile_0.txt does not associate pool.mirgiga.net with Uhnagty, Yjnmase, and Bnhjyht: below, I assume it should.

The format you use is redundant. Moreover, in the output, it becomes hard (if not impossible) to tell apart what comes from the "larger file" and what comes from the "smaller file". I suggest transforming the two input files so that there are no duplicates in the first columns and the second columns hold comma-separated lists of values (if commas can appear in the files, substitute another character), using the same command line twice:
$ sort -k 1,1 LargerFile.txt | awk '{ if ($1 == key) printf ",%s", $2; else { if (NR > 1) printf "\n"; printf "%s", $0; key = $1 } }' > LargerFile.csv
$ sort -k 1,1 SmallerFile.txt | awk '{ if ($1 == key) printf ",%s", $2; else { if (NR > 1) printf "\n"; printf "%s", $0; key = $1 } }' > SmallerFile.csv
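
After this transformation, each hostname appears only once per file, with its values comma-separated. For example, judging from the join output shown further below, the pool.sevtele.com rows become:

$ grep '^pool\.sevtele\.com ' LargerFile.csv
pool.sevtele.com 46.172.203.8,46.172.203.80,46.172.203.83,46.172.203.85,46.172.203.87,46.172.203.88
$ grep '^pool\.sevtele\.com ' SmallerFile.csv
pool.sevtele.com Ghbfght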

You then only need to "join" the two files (see https://en.wikipedia.org/wiki/Relational_algebra#Natural_join_(%E2%8B%88) for the theory):
$ join LargerFile.csv SmallerFile.csv
pool.giga.net.ru 91.210.179.94,91.210.179.95,91.210.179.96,91.210.179.97,91.210.179.98,91.210.179.99 Evgbhan,Ghbfght,Kmnslet,Loasfrt,Wnhmahy
pool.mirgiga.net 78.158.193.1,78.158.193.10,78.158.193.104,78.158.193.105,78.158.193.106,78.158.193.107,78.158.193.11,78.158.193.110,78.158.193.111,78.158.193.112,78.158.193.113 Bnhjyht,Uhnagty,Yjnmase
pool.sevtele.com 46.172.203.8,46.172.203.80,46.172.203.83,46.172.203.85,46.172.203.87,46.172.203.88 Ghbfght

As a script taking the two files as arguments and running everything in parallel:
#!/bin/sh
# Joins two two-column files on their first columns.
if [ -z "$2" ]
then
    printf 'Usage: %s file1 file2\n' "$0"
    exit 1
fi
TMP=$(mktemp)
trap "rm $TMP* 2>/dev/null" 0
# Named pipes let both sort|awk pipelines feed join in parallel.
mkfifo $TMP.1 $TMP.2
sort -k 1,1 "$1" | awk '{ if ($1 == key) printf ",%s", $2; else { if (NR > 1) printf "\n"; printf "%s", $0; key = $1 } }' > $TMP.1 &
sort -k 1,1 "$2" | awk '{ if ($1 == key) printf ",%s", $2; else { if (NR > 1) printf "\n"; printf "%s", $0; key = $1 } }' > $TMP.2 &
join $TMP.1 $TMP.2 # | tr , ' ' | awk '{ for (i = 1; ++i <= NF; ) print $1, $i }'

Remove the character "#" on the last line if you really want the format you asked for.
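
For example, with the "#" removed, the pool.sevtele.com line of the join output above expands into one two-column row per associated value:

pool.sevtele.com 46.172.203.8
pool.sevtele.com 46.172.203.80
pool.sevtele.com 46.172.203.83
pool.sevtele.com 46.172.203.85
pool.sevtele.com 46.172.203.87
pool.sevtele.com 46.172.203.88
pool.sevtele.com Ghbfght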

valessio
Offline
Joined: 04/02/2019

Maybe I did not quite understand what is needed, but to merge the two files
and eliminate duplicate occurrences, I would do this:

$ cat SmallerFile.txt LargerFile.txt | sort | uniq > OutputFile.txt

sort (orders a-z) and uniq (removes duplicates). My result is attached.

Regards,
Valessio Brito


Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

As far as I understand, amenex wants a "natural join". Also, you could write 'sort -u SmallerFile.txt LargerFile.txt' instead of 'cat SmallerFile.txt LargerFile.txt | sort | uniq', which is not only uselessly long but also much slower in the presence of many duplicates.
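
That is, something like:

$ sort -u SmallerFile.txt LargerFile.txt > OutputFile.txt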

amenex
Offline
Joined: 01/04/2015

Thank you for the simple explanations. My brain was so fried from gathering all the nmap data
and Google search data and then combining it that the straightforward approach never dawned on me.

My obfuscations of the test files upset their original sorting.

My quick & dirty approach was to load the larger file into LibreOffice Calc, then append the
smaller file, sort the combined files, copy & paste into Leafpad, and then apply Magic Banana's
duplicate-finding script, which performed the last step in 0.064 seconds, leaving me with ca. 85,000
rows of duplicated hostnames. Some refinements are still needed: more data from Google searches, inclusion
of data from earlier analyses (i.e., more ASNs), etc. Not a task for just one person.
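
A pure command-line sketch of that combine-and-filter step (assuming the real files keep the same two-column layout; Combined.txt and Duplicated.txt are names chosen here for illustration):

$ sort -k 1,1 Real-LargerFile.txt Real-SmallerFile.txt > Combined.txt
$ # first pass counts each hostname; second pass keeps rows whose hostname occurs more than once
$ awk 'NR == FNR { count[$1]++; next } count[$1] > 1' Combined.txt Combined.txt > Duplicated.txt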

The Trisquel forum continues to make invaluable contributions. Thank you!

George Langford