Re-order groups of identical strings separated by blank lines so that the most frequently repeated strings are at the top

3 replies [Last post]
amenex
Offline
Joined: 01/04/2015

Starting with a 109,000-row LibreOffice Calc file containing many groups of identical strings separated by blank rows,
how can I re-order this file so that the largest groups are at the top, still separated by blank rows?

Here's how that original file was created:

First, sort the original 159,000-row file on the third (hostname) column:
>> sort -k3 HNs.bst.lt.visitors03.txt > HNs.bst.lt.sorted03.txt

The original file had two fields separated by a colon (:) that I converted to tabs with Leafpad in ~30 minutes.
It now has the Current Visitor filename in the first column, the occurrence count in the second column, and
the master file's hostnames in the third column, the one on which this file has to be sorted.

Second, gather the duplicates into bunches separated by blank lines, retaining the original general sorted sequence:
>> time awk 'NR >= 2 { print $1, $2, $3 }' 'HNs.bst.lt.sorted03.txt' | uniq -f 2 --all-repeated=separate | awk '{ print $1 "\t" $2 "\t" $3 }' > HNs.bst.lt.Duplicates03.txt

This makes a very pretty .ods file that's slightly larger than the original file, but there's so much there that it
would be tedious to rank the most-often-repeated Recent Visitors' hostnames by hand.

How can I re-sort the .ods file so that the longest identical-hostname lists are at the top? I want to retain the associations
with the original domains' Recent Visitor data. This is a task that has to be repeated many times ...

BTW, the longest identical-hostname list that I found visually in the .ods file was 42 out of 45 original Recent Visitor lists.
All the original data is freely available on the Internet.

George Langford

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

The original file had two fields separated by a colon (:) that I converted to tabs with Leafpad in ~30 minutes.

I bet 'tr : \\t' takes at most seconds...
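For example (file names made up):
$ tr : '\t' < colon-separated-original > tab-separated-copy
tr rewrites every ':' into a tab in a single pass over the whole file.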

All the original data is freely available on the Internet.

Where is HNs.bst.lt.visitors03.txt?

I believe a small AWK program, with RS = "\n\n" and FS = "\n", can rewrite every "cluster" on a single line and add the number of repetitions (NF) that 'sort -n' can sort. Another AWK program can then restore the original format. But I would like an input to be sure...
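Something along those lines, as an untested sketch (GNU Awk; it assumes the BEL character \a never occurs in the data, and the output file name is made up):
$ awk 'BEGIN { RS = ""; FS = "\n" } { n = NF; gsub(/\n/, "\a"); print n "\t" $0 }' HNs.bst.lt.Duplicates03.txt \
  | sort -rn \
  | awk '{ sub(/^[0-9]+\t/, ""); gsub(/\a/, "\n"); print $0 "\n" }' > HNs.bst.lt.Ranked03.txt
With RS set to the empty string, AWK treats each blank-line-separated cluster as one record; the first program joins the cluster's lines with \a and prefixes the number of lines (NF), 'sort -rn' puts the largest clusters first, and the second program drops that count, restores the newlines and puts back the blank line after each cluster.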

amenex
Offline
Joined: 01/04/2015

Magic Banana wonders:

> The original file had two fields separated by a colon (:) that I converted to tabs with Leafpad in ~30 minutes.

>> I bet 'tr : \\t' takes at most seconds...

Alas, there are IPv6 fields in the data; when I asked LibreOffice Calc to treat the colon (:) as a field separator,
chaos ensued. That succinct script might do the same ...

Also:

> All the original data is freely available on the Internet.

>> Where is HNs.bst.lt.visitors03.txt?

It's 10.3 MB, so you'll have to grow your own ...

That starts here: http://www.bst.lt/webalizer/site_201906.html
Then take out the middle seven columns from the text file, separate the IPv4 entries from that data, run a grep
search against a number of other webalizer data sets (discoverable with a Google search on webalizer and "view
all sites" followed by the removal of those middle seven columns each time) and then remove the pesky colons
that appear after the HNs.domain.txt file names. I don't know how those IPv6s crept in ...
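In shell terms that column removal is roughly the following, where the column positions and the file names are only illustrative (here the occurrence count and the hostname are assumed to be the second and the last whitespace-separated fields):
>> awk '{ print $2 "\t" $NF }' site_201906.txt > HNs.bst.lt.txt
which keeps just those two fields and drops everything else.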

My processing of the mixed data that is online, to separate each data set into IPv4 and hostname fractions, looks
tedious because of the number of steps, as it involves swapping the positions of two columns of data several times,
but each step is in the blink-of-an-eye category. I use one of Magic Banana's scripts to identify the first four
octets of each entry, followed by a join operation to match the results of that script to the actual original
data, giving me a list of pure IPv4 addresses. Separating the hostnames uses another join script to retrieve
the hostnames from the original data set by listing the entries that are not IPv4's.
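An alternative to the join steps, at least as a cross-check, is to test each entry directly against an IPv4 pattern; this sketch assumes the address or hostname is the whole first whitespace-separated field of a per-site file such as HNs.domain.txt, and the output names are made up:
>> awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/' HNs.domain.txt > IPv4.domain.txt
>> awk '$1 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/' HNs.domain.txt > hostnames.domain.txt
The first command keeps only the lines whose entry looks like four dot-separated numbers; the second keeps the rest, i.e. the hostnames.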

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Alas, there are IPv6 fields in the data; when I asked LibreOffice Calc to treat the colon (:) as a field separator,
chaos ensued. That succinct script might do the same ...

Certainly. If it is always the first ":" (on the line) that must be substituted, then a simple solution is:
$ cut -d : -f 1 file > tmp1
$ cut -d : -f 2- file > tmp2
$ paste tmp1 tmp2
where "file" is the colon-separated original; paste joins the two halves back together with a tab between them.

If there can be IPv6 addresses before or after the ":" to substitute then a regular expression and sed or awk should probably do the job.
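If, as the description of the HNs.domain.txt file names above suggests, the colon to substitute always directly follows ".txt", then a single substitution such as
$ sed 's/\.txt:/.txt\t/' colon-separated-original > tab-separated-copy
(GNU sed; file names made up) should leave the IPv6 colons alone, since those never follow ".txt".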

It's 10.3 MB, so you'll have to grow your own ...

Well, I will not do that. If I have correctly understood the input format and the task, the last paragraph of https://trisquel.info/forum/re-order-groups-identical-strings-separated-blank-lines-so-most-frequently-repeated-strings-ar#comment-142452 should be useful.