Maintaining the association between two variables (one of them multiply comma-delimited)

amenex

Here's my best effort:
tail -n +2 Hairball-file.txt | cut -f 1,6 | awk '{print $1"\t"$2}' | sed 's/,/\n/g' | awk -F '[/:]' '{print $1,$4}'
whose output looks like this:

UID25098-1465998636 http www.askrty.casa
http www.askrty.casa
http www.askrty.casa
http www.askrty.casa
http www.askrty.casa
http www.askrty.casa
http www.askrty.casa
UID25099-1465998636 http api.whatsapp.com
http api.whatsapp.com
http imagizer.imageshack.com
http imagizer.imageshack.com
http imagizer.imageshack.com
http imagizer.imageshack.com
http imagizer.imageshack.com
http imagizer.imageshack.com
http imagizer.imageshack.com
http www.instagram.com ...

which in months past I would have filled in with LibreOffice Calc before removing the duplicate rows:
sort -u Calc.file.txt
UID25098-1465998636 www.askrty.casa
UID25099-1465998636 api.whatsapp.com
UID25099-1465998636 imagizer.imageshack.com
UID25099-1465998636 www.instagram.com

Surely there's a better way.

Attachments:
Hairball-file.txt (24.08 KB)
Calc-file.txt (672 bytes)
Magic Banana


Here is one way, if I have properly guessed what you want, since you do not clearly explain it (a space-separated two-column output: each UID together with every distinct domain appearing in the URLs listed in the sixth column of the input):
$ cut -f 1,6 Hairball-file_0.txt | awk -F '[,\t]' 'gsub(/[^,\t]*:\/\//, "") { for (i = 2; i <= NF; ++i) print $1, $i }' | cut -d / -f 1 | sort -u
I can explain.
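In the meantime, here is the same pipeline spread over several lines with a comment per stage (an illustrative restatement only; the behaviour is unchanged):
cut -f 1,6 Hairball-file_0.txt |   # keep the UID and URL columns
  awk -F '[,\t]' '
    gsub(/[^,\t]*:\/\//, "") {     # strip every scheme; if at least one was found...
      for (i = 2; i <= NF; ++i)
        print $1, $i               # ...pair the UID with each comma-separated URL
    }' |
  cut -d / -f 1 |                  # drop everything from the first slash (the path)
  sort -u                          # one line per distinct (UID, domain) pair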

amenex

Magic Banana's solution:
$ cut -f 1,6 Hairball-file_0.txt | awk -F '[,\t]' 'gsub(/[^,\t]*:\/\//, "") { for (i = 2; i <= NF; ++i) print $1, $i }' | cut -d / -f 1 | sort -u
works perfectly. But for the 1,7 combination (UIDs & email domains) I've only gotten this far:
cut -f 1,7 <(tail -n +2 Hairball-file_0.txt) | sed 's/,/\n/g' | sed 's/\t/\n/g' | grep "\S" | more
which lists the originating UIDs, each followed by the associated emails' domains.
Further progress requires more actual understanding and not guessing.
Thanks are due Magic Banana for this valuable contribution.

amenex

This problem comes up in several places in the outputs from your script's parsing of the Hairball:
The email addresses and (separately) the IPv4 addresses in the headers;
The emails in the contents of the emails;
The URLs in the present contents.
While I was reconnoitering, I just collected the domains on the assumption that I could simply grep them later. The latest version of the project is to keep up to date, with a mind to running the analyses (plural) daily, so it has become important to keep track of the UIDs.
The script you're suggesting handles pairs of the columns; the number of records will be variable for the ones I've listed, so just pasting columns created separately isn't workable. We're on the right track.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

So you want the distinct domains in the sixth column as an additional column, don't you? If so, this should do it:
$ cut -f 6 Hairball-file.txt | awk -F , 'gsub(/[^,]*:\/\//, "") { gsub(/\/[^,]*/, ""); for (i = 1; i <= NF; ++i) if (!($i in a)) { a[$i]; s = s "," $i }; delete(a) } { print substr(s, 2); s = "" }' | paste Hairball-file.txt -
Please properly specify what you want if that is still not it.
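On a made-up one-line file (hypothetical sample data, only to show the shape of the result), the command appends the distinct domains as an extra column:
$ printf 'UID1\tb\tc\td\te\thttp://a.com/p,https://b.org/q,http://a.com/r\n' > sample.txt
$ cut -f 6 sample.txt | awk -F , 'gsub(/[^,]*:\/\//, "") { gsub(/\/[^,]*/, ""); for (i = 1; i <= NF; ++i) if (!($i in a)) { a[$i]; s = s "," $i }; delete(a) } { print substr(s, 2); s = "" }' | paste sample.txt -
UID1	b	c	d	e	http://a.com/p,https://b.org/q,http://a.com/r	a.com,b.org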

amenex

Magic Banana's first solution (Wed, 02/23/2022 - 18:56) does exactly what I want.
What I've been doing is placing zeros between back-to-back tabs to mark empty fields with sed:
sed 's/\t\t/\t0\t/g'
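(An aside on that substitution: sed matches cannot overlap, so with three or more tabs in a row a single pass leaves some empty fields unmarked; running the substitution twice, shown here on made-up input, catches the leftovers.)
$ printf 'a\t\t\t\tb\n' | sed 's/\t\t/\t0\t/g'
a	0		0	b
$ printf 'a\t\t\t\tb\n' | sed 's/\t\t/\t0\t/g; s/\t\t/\t0\t/g'
a	0	0	0	b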
The non-empty fields of email addresses require additional coding to produce a column of UIDs followed by a column of the domains, separated from their user names by making "@" a field separator. That requires that we first list each UID followed by the first email, then another UID followed by the second email associated with that UID, etc., and then separate the domain from each email to obtain a two-column file of UIDs and domains (from Col. 7), just like the UIDs and domains from the many URLs in Col. 6 of Hairball-file.txt.
Opening the URLs in Col. 6 of Hairball-file.txt is for more venturesome investigators than me.
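A minimal sketch of that two-step idea (made-up input standing in for the output of cut -f 1,7; untested against the real Hairball):
$ printf 'UID1\tjoe@a.com,sue@b.org\n' | awk -F '[\t,]' '{ for (i = 2; i <= NF; ++i) { sub(/[^@]*@/, "", $i); print $1, $i } }'
UID1 a.com
UID1 b.org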

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Magic Banana's first solution (Wed, 02/23/2022 - 18:56) does exactly what I want.

I actually believe the second solution is better: it complements the existing table instead of creating a new incomplete one.

To get the same format as that of my earlier command but using the completed table:
$ cut -f 1,8 completed_file | awk -F '[,\t]' '$2 { for (i = 2; i <= NF; ++i) print $1, $i }'
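For instance, on a made-up completed row (hypothetical data, the eighth field already holding the distinct domains):
$ printf 'UID1\tb\tc\td\te\tf\tg\ta.com,b.org\n' | cut -f 1,8 | awk -F '[,\t]' '$2 { for (i = 2; i <= NF; ++i) print $1, $i }'
UID1 a.com
UID1 b.org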
That should actually be in a separate executable script, maybe named unflatten-pairs, expecting a field number (8 in the command line above) and optionally files, that you would save in a directory listed in your PATH variable:
#!/bin/sh
if [ "$1" -gt 1 ]
then
    field_pos=$1
    shift
    cut -f 1,$field_pos "$@" | awk -F '[,\t]' '$2 { for (i = 2; i <= NF; ++i) print $1, $i }'
    exit
fi
printf "Usage: $0 field_pos [file]...
Every output line is a value in the first input column associated with
every comma-separated value in the field_pos-th input field.
"

In this way, to get Hairball-file.txt's URLs (for instance) in the other format, you would simply write:
unflatten-pairs 6 Hairball-file.txt
That is what people writing minimally-reusable code would do at least. I wrote "minimally-reusable" because the script could be made more general, not hard-coding the tabulation and the comma as the separator, and performing a Cartesian product of an arbitrary number of columns.

What I've been doing is placing zeros between back-to-back tabs to mark empty fields with sed

That makes further processing harder. If a field is empty, why not keep it empty?

That requires that we first list each UID followed by the first email, then another UID followed by the second email associated with that UID, etc., and then separate the domain from each email to obtain a two-column file of UIDs and domains (from Col. 7), just like the UIDs and domains from the many URLs in Col. 6 of Hairball-file.txt.

If it is "just like the UIDs and domains from the many URLs" (it does not look like it, from what you wrote before), then the solution is simpler: there is nothing to remove after the domain. To complement Hairball-file.txt (or Hairball-file.txt already completed with the distinct domains in the URLs), you can write:
$ cut -f 7 Hairball-file.txt | awk -F , 'gsub(/[^,]*@/, "") { for (i = 1; i <= NF; ++i) if (!($i in a)) { a[$i]; s = s "," $i }; delete(a) } { print substr(s, 2); s = "" }' | paste Hairball-file.txt -
If you then want the other format, you can post-process the above command line with unflatten-pairs (see the beginning of this post).
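Again on a made-up row (hypothetical data), to show the appended column:
$ printf 'UID1\tb\tc\td\te\tf\tjoe@a.com,sue@b.org\n' > sample.txt
$ cut -f 7 sample.txt | awk -F , 'gsub(/[^,]*@/, "") { for (i = 1; i <= NF; ++i) if (!($i in a)) { a[$i]; s = s "," $i }; delete(a) } { print substr(s, 2); s = "" }' | paste sample.txt -
UID1	b	c	d	e	f	joe@a.com,sue@b.org	a.com,b.org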

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

I wrote "minimally-reusable" because the script could be made more general, not hard-coding the tabulation and the comma as the separator, and performing a Cartesian product of an arbitrary number of columns.

Also, it looks better to actually keep all the input fields (one can pipe the output into cut, if she wants to):
#!/bin/sh
if [ "$1" -gt 0 ]
then
    field_pos=$1
    shift
    awk -F \\t -v OFS=\\t -v field_pos=$field_pos '{
        n = split($field_pos, a, ",")
        for (i = 1; i <= n; ++i) {
            $field_pos = a[i]
            print } }
    !n {
        print }' "$@"
    exit
fi
printf "Usage: $0 field_pos [file]...
The output rows are the tab-delimited input rows with the field_pos-th
field substituted by each comma-separated value in it.
"