Joining a pair of files

12 replies [Last post]
amenex
Offline
Joined: 01/04/2015

Here are two files that I want to join. The first has a list of email addresses and their IPv4 addresses; some of the lines also carry the receipt date in human-readable format. The second file is a list of those human-readable dates and their epochal equivalents (seconds since the start of January 1, 1970). The files are attached. In each file the human-readable dates are packed with % signs to reduce the labor of dealing with the embedded spaces. The aim is to eliminate the human-readable dates by replacing them with the easier-to-process epochal dates. join and sed have not been my friends in this endeavor. The order of the two- and three-column sets in the first file is important. At present the orders of the human-readable dates match; of course it's also important that the epochal dates maintain that order so the receipt dates stay properly associated with the addresses of the emails.

Attachment  Size
Temp-12022022-Part02.txt  2.27 KB
Temp-12022022-Part01.txt  619 bytes
prospero
Offline
Joined: 05/20/2022

I used to do similar things in a previous life with the VLOOKUP function in LibreOffice Calc, but I have always looked up to Magic Banana for more elegant ways to do that kind of stuff.

https://help.libreoffice.org/6.2/en-US/text/scalc/01/04060109.html?DbPAR=CALC

In other "Trisquel users" news, did you manage to defeat Suzanne in story mode, expert level, in the SuperTuxKart version that comes with Aramo?

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

$ awk -v OFS="\t" '{ if (FILENAME == ARGV[1]) date[$1] = $2; else { $3 = date[$3]; print } }' Temp-12022022-Part01.txt Temp-12022022-Part02.txt
name at domain 114.58.71.100
name at domain 114.58.96.173
name at domain 124.81.125.2
name at domain 124.81.186.58
name at domain 124.81.210.178
name at domain 124.81.255.18
name at domain 202.155.115.82
name at domain 202.155.125.71
name at domain 202.155.82.36
name at domain 209.162.182.125
name at domain 216.128.11.22
name at domain 216.128.25.170
name at domain 216.128.30.43
name at domain 66.29.203.224
name at domain 24.113.123.181
name at domain 24.113.219.179
name at domain 24.113.72.6
name at domain 76.14.122.19
name at domain 76.14.166.123
name at domain 76.14.186.170
name at domain 76.14.191.0
name at domain 76.14.214.53
name at domain 76.14.217.79
me-whn8ye0nmp5k.redact-296962503.info 66.237.50.126 1239343739
76-14-214-53.or.redact-296962503.info 76.14.214.53 1218500281
76-14-122-19.rk.redact-296962503.info 76.14.122.19 1234694923
76-14-166-123.wsac.redact-296962503.info 76.14.166.123 1213609457
24-113-145-220.redact-296962503.info 24.113.145.220 1242443586
static-76-14-252-167.or.redact-296962503.info 76.14.252.167 1248040888
24-113-207-143.redact-296962503.info 24.113.207.143 1272730282
76-14-198-2.or.redact-296962503.info 76.14.198.2 1240733695
76-14-246-255.or.redact-296962503.info 76.14.246.255 1243400056
76-14-246-255.or.redact-296962503.info 76.14.246.255 1243574383
76-14-217-79.or.redact-296962503.info 76.14.217.79 1236377626
24-113-170-18.redact-296962503.info 24.113.170.18 1239208769
24-113-123-181.redact-296962503.info 24.113.123.181 1204938891
mx0.redact-297281989.info 201.255.222.21 1306416944
redact-254057415.info 194.106.221.130 1299625102
mx0.redact-297281989.info 83.228.43.217 1303311703

The order of the two- and three-column sets in the first file is important.

The solution above keeps that order, and the order of the other file is irrelevant. To use the join command, both files would have to be sorted on the join fields.

prospero
Offline
Joined: 05/20/2022

> To use the join command, both files would have to be sorted on the join fields.

Were they not? I thought the problem was that one of the input files has a bunch of extra lines without a value in the join field.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Were they not?

They are, apparently. join works:

$ join -a 1 -1 3 Temp-12022022-Part02.txt Temp-12022022-Part01.txt | cut -d ' ' -f 2- | tr ' ' '\t'
name at domain 114.58.71.100
name at domain 114.58.96.173
name at domain 124.81.125.2
name at domain 124.81.186.58
name at domain 124.81.210.178
name at domain 124.81.255.18
name at domain 202.155.115.82
name at domain 202.155.125.71
name at domain 202.155.82.36
name at domain 209.162.182.125
name at domain 216.128.11.22
name at domain 216.128.25.170
name at domain 216.128.30.43
name at domain 66.29.203.224
name at domain 24.113.123.181
name at domain 24.113.219.179
name at domain 24.113.72.6
name at domain 76.14.122.19
name at domain 76.14.166.123
name at domain 76.14.186.170
name at domain 76.14.191.0
name at domain 76.14.214.53
name at domain 76.14.217.79
me-whn8ye0nmp5k.redact-296962503.info 66.237.50.126 1239343739
76-14-214-53.or.redact-296962503.info 76.14.214.53 1218500281
76-14-122-19.rk.redact-296962503.info 76.14.122.19 1234694923
76-14-166-123.wsac.redact-296962503.info 76.14.166.123 1213609457
24-113-145-220.redact-296962503.info 24.113.145.220 1242443586
static-76-14-252-167.or.redact-296962503.info 76.14.252.167 1248040888
24-113-207-143.redact-296962503.info 24.113.207.143 1272730282
76-14-198-2.or.redact-296962503.info 76.14.198.2 1240733695
76-14-246-255.or.redact-296962503.info 76.14.246.255 1243400056
76-14-246-255.or.redact-296962503.info 76.14.246.255 1243574383
76-14-217-79.or.redact-296962503.info 76.14.217.79 1236377626
24-113-170-18.redact-296962503.info 24.113.170.18 1239208769
24-113-123-181.redact-296962503.info 24.113.123.181 1204938891
mx0.redact-297281989.info 201.255.222.21 1306416944
redact-254057415.info 194.106.221.130 1299625102
mx0.redact-297281989.info 83.228.43.217 1303311703

That is certainly a faster solution, if both files are always sorted on the join fields.

I thought the problem was that one of the input files has a bunch of extra lines without a value in the join field.

The -a option I used above takes care of that.
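A minimal illustration of what -a 1 does, on two invented, already-sorted files (the names and values below are made up for the example, not from the real data):

```shell
# mini-left.txt has two lines; only the first has a match in mini-right.txt
printf 'k1 a\nk2 b\n' > mini-left.txt
printf 'k1 X\n' > mini-right.txt
# -a 1 also prints the unpairable lines of the first file
join -a 1 mini-left.txt mini-right.txt
# → k1 a X
# → k2 b
```

The matched line gets the fields of the second file appended; the unmatched line is passed through as-is, which is exactly what happens to the date-less lines above.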

amenex
Offline
Joined: 01/04/2015

They were already sorted before Magic Banana started his script.

amenex
Offline
Joined: 01/04/2015

Here's my explanation of Magic Banana's script ==>
awk -v OFS="\t" '{ if (FILENAME == ARGV[1]) date[$1] = $2; else { $3 = date[$3]; print } }' Temp-12022022-Part01.txt Temp-12022022-Part02.txt
where
-v ==> var=value assigns value to the program variable var, which in this script is OFS="\t" (the output field separator, a tab)
ARGV ==> the array of command-line arguments, 0..ARGC-1, where ARGC is the number of such arguments ... in this case, the two input files, Temp-12022022-Part01.txt and Temp-12022022-Part02.txt
Magic Banana's print statement, '{ if (FILENAME == ARGV[1]) date[$1] = $2; else { $3 = date[$3]; print } }', says "if Temp-12022022-Part01.txt has date[$1] (the human-readable date) which equals the epochal date [$2], then the human-readable date [$3] in the second input file is to be the epochal date of the first input file."
I piped that script to my de-redaction script, and the combined script printed the identical output list, but with the redacted domains restored ... in the same blink of an eye as the original Magic Banana script.

Thank you, Magic Banana !

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Magic Banana's print statement, '{ if (FILENAME == ARGV[1]) date[$1] = $2; else { $3 = date[$3]; print } }', says "if Temp-12022022-Part01.txt has date[$1] (the human-readable date) which equals the epochal date [$2]

It is actually: if the record is in the first file (the files are processed in the order they are given) then have the array named "date" associate its first field (the date) with its second field (the number of seconds since the epoch), otherwise (it is not the first file that is processed) have the third field (the date) overwritten by the associated value in date (the number of seconds since the epoch) and output (aka "print") the record.

As you can see, it is pretty straightforward.
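To see the mechanics on something tiny, here is a self-contained run of the same program on invented stand-in files (the names, addresses and file names are made up here; the real field layout may differ slightly):

```shell
# Invented miniature stand-ins: mini-part01.txt maps packed date -> epoch;
# mini-part02.txt has address, IPv4 and (sometimes) a packed date in field 3.
printf '10%%Apr%%2009%%06:08:59%%-0000\t1239343739\n' > mini-part01.txt
printf 'a@b.example\t66.237.50.126\t10%%Apr%%2009%%06:08:59%%-0000\nc@d.example\t1.2.3.4\n' > mini-part02.txt
# The first file fills the array; each line of the second file has its
# date (field 3) overwritten with the matching epoch before printing.
awk -v OFS='\t' '{ if (FILENAME == ARGV[1]) date[$1] = $2; else { $3 = date[$3]; print } }' \
  mini-part01.txt mini-part02.txt
```

The first output line carries 1239343739 in place of the packed date; the date-less second line passes through with an empty third field (assigning $3 rebuilds the record, hence a trailing tab), which matches the two-column lines in the real output above.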

amenex
Offline
Joined: 01/04/2015

Here's what's probably the slowest and most ungainly solution, using sed:
sed -e 's/10%Apr%2009%06:08:59%-0000/1239343739/' Temp-12022022-Part02.txt | sed -e 's/12%Aug%2008%00:18:01%-0000/1218500281/' | sed -e 's/15%Feb%2009%10:48:43%-0000/1234694923/' | sed -e 's/16%Jun%2008%09:44:17%-0000/1213609457/' | sed -e 's/16%May%2009%03:13:06%-0000/1242443586/' | sed -e 's/19%Jul%2009%22:01:28%-0000/1248040888/' | sed -e 's/1%May%2010%16:11:22%-0000/1272730282/' | sed -e 's/26%Apr%2009%08:14:55%-0000/1240733695/' | sed -e 's/27%May%2009%04:54:16%-0000/1243400056/' | sed -e 's/29%May%2009%05:19:43%-0000/1243574383/' | sed -e 's/6%Mar%2009%22:13:46%-0000/1236377626/' | sed -e 's/8%Apr%2009%16:39:29%-0000/1239208769/' | sed -e 's/8%Mar%2008%01:14:51%-0000/1204938891/' | sed -e 's/Thu,%26%May%2011%10:35:44%-0300/1306416944/' | sed -e 's/Tue,%08%Mar%2011%23:58:22%+0100/1299625102/' | sed -e 's/Wed,%20%Apr%2011%15:01:43%+0000/1303311703/'
where I've simply taken Temp-12022022-Part01.txt apart and patched the sed commands around its two columns, one line at a time, then changed the newline characters into tabs and piped them all together into one long line. The % characters made each human-readable date into an unbroken string, but they don't have to be fixed afterwards ... and so the race goes to awk (0.004 s real time), then join (0.008 s), and last sed (0.015 s).
This one is slow because I make each sed command process the entirety of Temp-12022022-Part02.txt, which naturally takes sixteen times as long as it would if each substitution were targeted at the one line to which it applies.
Accuracy? diff says that all three solutions produce the same output:
diff -y --suppress-common-lines --width=100 <(awk -v OFS="\t" '{ if (FILENAME == ARGV[1]) date[$1] = $2; else { $3 = date[$3]; print } }' Temp-12022022-Part01.txt Temp-12022022-Part02.txt) <(join -a 1 -1 3 Temp-12022022-Part02.txt Temp-12022022-Part01.txt | cut -d ' ' -f 2- | tr ' ' '\t') > Test.awk.vs.join.txt
because the output of diff is empty.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Here's what's probably the slowest and most ungainly solution, using sed

You do not want to call sed that many times. One call is enough: each -e introduces an additional editing command to apply. In other words, you can write:
$ sed -e 's/10%Apr%2009%06:08:59%-0000/1239343739/' -e 's/12%Aug%2008%00:18:01%-0000/1218500281/' (...) -e 's/Wed,%20%Apr%2011%15:01:43%+0000/1303311703/' Temp-12022022-Part02.txt
But you do not want to write such a long sed program on the command line. You want to pass it through option -f:
$ sed -e 's:^:s/:' -e 's:\t:/:' -e 's:$:/:' Temp-12022022-Part01.txt | sed -f - Temp-12022022-Part02.txt
name at domain 114.58.71.100
name at domain 114.58.96.173
name at domain 124.81.125.2
name at domain 124.81.186.58
name at domain 124.81.210.178
name at domain 124.81.255.18
name at domain 202.155.115.82
name at domain 202.155.125.71
name at domain 202.155.82.36
name at domain 209.162.182.125
name at domain 216.128.11.22
name at domain 216.128.25.170
name at domain 216.128.30.43
name at domain 66.29.203.224
name at domain 24.113.123.181
name at domain 24.113.219.179
name at domain 24.113.72.6
name at domain 76.14.122.19
name at domain 76.14.166.123
name at domain 76.14.186.170
name at domain 76.14.191.0
name at domain 76.14.214.53
name at domain 76.14.217.79
me-whn8ye0nmp5k.redact-296962503.info 66.237.50.126 1239343739
76-14-214-53.or.redact-296962503.info 76.14.214.53 1218500281
76-14-122-19.rk.redact-296962503.info 76.14.122.19 1234694923
76-14-166-123.wsac.redact-296962503.info 76.14.166.123 1213609457
24-113-145-220.redact-296962503.info 24.113.145.220 1242443586
static-76-14-252-167.or.redact-296962503.info 76.14.252.167 1248040888
24-113-207-143.redact-296962503.info 24.113.207.143 1272730282
76-14-198-2.or.redact-296962503.info 76.14.198.2 1240733695
76-14-246-255.or.redact-296962503.info 76.14.246.255 1243400056
76-14-246-255.or.redact-296962503.info 76.14.246.255 1243574383
76-14-217-79.or.redact-296962503.info 76.14.217.79 1236377626
24-113-170-18.redact-296962503.info 24.113.170.18 1239208769
24-113-123-181.redact-296962503.info 24.113.123.181 1204938891
mx0.redact-297281989.info 201.255.222.21 1306416944
redact-254057415.info 194.106.221.130 1299625102
mx0.redact-297281989.info 83.228.43.217 1303311703

The first sed program above creates the sed program given to the second call of sed ("-" being the standard input, /dev/stdin).
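The same trick on a one-line invented example (the file contents below are made up; "-f -" and "\t" in a sed regex are GNU sed features):

```shell
# date<TAB>epoch table and a log line mentioning that date (invented data)
printf 'OLD%%DATE\t12345\n' > mini-dates.txt
printf 'host 1.2.3.4 OLD%%DATE\n' > mini-log.txt
# Turn each 'date<TAB>epoch' line into 's/date/epoch/' and feed the
# resulting program to a second sed via its standard input
sed -e 's:^:s/:' -e 's:\t:/:' -e 's:$:/:' mini-dates.txt | sed -f - mini-log.txt
# → host 1.2.3.4 12345
```

The first sed emits the one-line program `s/OLD%DATE/12345/`, which the second sed then applies to the log file.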

This one is slow because I make each sed command process the entirety of Temp-12022022-Part02.txt, which is naturally sixteen times as long as it would take when targeted at the one line to which it applies.

Even without the multiple calls to sed, every substitution (one per line in Temp-12022022-Part01.txt) is applied to every line in Temp-12022022-Part02.txt. As a consequence, the execution time is approximately proportional to the *product* of the numbers of lines in the two files.

Using an AWK array (the first solution I proposed), the execution time is approximately proportional to the *sum* of the numbers of lines in the two files, because AWK's arrays are hash maps, with amortized constant-time access to a (key, value) pair.

join's execution time is proportional to the *sum* of the numbers of lines in the two files too. Nevertheless, the proportionality constant is smaller: join only tests whether the join fields, strings of characters, are equal, whereas AWK hashes those strings to access the related address in memory and then tests for equality, possibly against several strings because, with hashes, there may be collisions (several keys with the same hash). However, beforehand, with join (and not with AWK), both files have to be sorted on the join fields. The time to do so is, at least for large files, greater than the time to join. Indeed, sorting n lines takes time proportional to n*log(n), which is (a little) greater than n.

In summary, denoting m the number of lines in the first file and n the number of lines in the second file, the execution times of the three solutions are:

  • AWK: proportional to m + n;
  • sort + join: proportional to m log(m) + n log(n) (substitute m for m log(m) if the first file is already sorted; n for n log(n) if the second file is already sorted);
  • sed: proportional to nm.

... but that assumes m and n tend to infinity (asymptotic analysis). With smaller inputs, the best approach is to measure the times (but, even with small inputs, I do not think the sed solution can be faster).

amenex
Offline
Joined: 01/04/2015

This morning I became aware of grepcidr, which has a scary reputation for user confusion, but which I was able to put to use. Here's what's working for me.
(1) grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])" 1998-2022.12032022 | sed 's/\./\ /g' | awk '{print $1"."$2"."$3".0/24"}' | sort -Vk 1,1 | sort -u > CIDR24s-19980-2022.12032022.txt
and
grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])" SpamCop_Quick_Reports.txt | sed 's/\./\ /g' | awk '{print $1"."$2"."$3".0/24"}' | sort -Vk 1,1 | sort -u > CIDR24s-SpamCop_Quick_Reports.txt
followed by
(2) comm -12 <(sort -k 1,1 CIDR24s-19980-2022.12032022.txt) <(sort -k 1,1 CIDR24s-SpamCop_Quick_Reports.txt) > Carryover-CIDR24s-SpamCop-QRs-to-1998-2022.12032022.txt
Finds 224 matching CIDR/24 blocks.

(3) There's a need for Magic Banana's scripting genius here, because otherwise I'd have 448 rows of CIDR/24-specific commands to plow through, but a few examples might be informative:
grepcidr -e 172.17.2.0/24 1998-2022.12032022
versus
grepcidr -e 172.17.2.0/24 SpamCop_Quick_Reports.txt
Within the same CIDR/24 block, but different server & CC.

grepcidr -e 178.76.211.0/24 1998-2022.12032022
versus
grepcidr -e 178.76.211.0/24 SpamCop_Quick_Reports.txt
Within the same CIDR/24 block, but different server & CC.

grepcidr -e 94.100.28.0/24 1998-2022.12032022
versus
grepcidr -e 94.100.28.0/24 SpamCop_Quick_Reports.txt
Within the same CIDR/24 block, but different server & CC.

grepcidr -e 41.224.246.0/24 1998-2022.12032022
versus
grepcidr -e 41.224.246.0/24 SpamCop_Quick_Reports.txt
Within the same CIDR/24 block, but no server or CC.

(4) Look a little deeper with mboxgrep, which captures the associated emails' source code:
mboxgrep -e 41.224.246.5 1998-2022.12032022
versus
mboxgrep -e 41.224.246.9 SpamCop_Quick_Reports.txt
Lots of information to be processed.

The pattern (based on too few examples) is for over two hundred CIDR/24 blocks to keep on spamming for a decade or so, but with ever-changing server (a.k.a. domain) names and host countries. I could have cranked out those 448 rows of scripts with the substitution feature of Featherpad, but what I'd really like to see is the degree to which the CIDR/24 blocks are changing hands or being controlled by third parties. If only the ISPs would stop their gratuitous hostname lookups.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

There's a need for Magic Banana's scripting genius here, because otherwise I'd have 448 rows of CIDR/24-specific commands to plow through

I do not know grepcidr, but reading http://www.pc-tools.net/unix/grepcidr/ I see you can search for them all at once by giving the file with every row to the option -f (as with grep).

amenex
Offline
Joined: 01/04/2015

Not quite that easy: I couldn't make -f work, so I built the 224-row script (easy & quick with Featherpad) for grepcidr, which recognizes the /24 format for CIDR blocks, making 224 output files keyed to their CIDR/24 blocks. That script took less than half an hour and found data in the vast majority of CIDR/24 blocks. Most but not all of that data contains the IPv4 address and email name with domain. The next step will involve extracting the IPv4s and piping them one at a time with mboxgrep's pattern search to construct a lineup of pertinent emails for Magic Banana's "Hairball" script.
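For what it's worth, those 224 rows could also be generated by a loop instead of an editor; a sketch with an invented two-block list (grepcidr itself is not run here, only the command lines are emitted):

```shell
# Invented miniature stand-in for the 224-block carryover file
printf '172.17.2.0/24\n41.224.246.0/24\n' > mini-blocks.txt
# Emit one grepcidr command per block, naming each output file after its
# block; redirect the whole loop (e.g. > run-grepcidr.sh) to get the script
while read -r block; do
  printf 'grepcidr -e %s 1998-2022.12032022 > hits-%s.txt\n' \
    "$block" "$(printf '%s' "$block" | tr './' '--')"
done < mini-blocks.txt
```

The tr call just makes the block name filesystem-safe (dots and the slash become dashes).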