Grep consumes all my RAM and swap doing a big job

18 replies [Last post]
amenex

Here's the task at hand:

There are forty-five sets of Recent Visitor Webalizer files that I've collected via Google, and I'm trying
to find which hostnames appear most frequently across those sets.

I've managed to do this once:

First, create a single-column hostname list from the original two-column list:
>> time awk < HNs.bst.lt.txt '{print $2}' > HNusage/HNs.bst.lt/temp
(the first column is Webalizer's count of the occurrence of each hostname)

Second, find all the instances where the hostnames from the temp file appear in the forty-five Recent Visitor lists:
>> time grep -f HNusage/HNs.bst.lt/temp *.txt > HNusage/HNs.bst.lt/HNs.bst.lt.visitors.txt

This script managed to create a 159,100 row file without saturating memory and swap, but other combinations can't finish.
I'll address the analysis of that 159,100 row file in a separate posting.

It appears that grep is storing all its output in memory without posting intermediate results.

How do I rewrite my script so that grep takes just one hostname at a time for use as its pattern to search the 45 other lists?
When I do this by hand, each such single search takes ~0.015 seconds; the input file has ~2000 hostnames, so the total search
time would be ~30 seconds. My failed grep script, on the other hand, took ~23 seconds of processor time and 15 minutes real time,
but ran out of memory (7.7 GB RAM, 7.8 GB swap, both 100% at the end).
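
For reference, the one-hostname-at-a-time search that I do by hand amounts to a loop like this (just a sketch; the output filename is only a placeholder):
>> while read -r hn; do grep "$hn" *.txt; done < HNusage/HNs.bst.lt/temp > HNusage/HNs.bst.lt/HNs.bst.lt.visitors.loop.txt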

Here's my best effort, which starts OK, but shows no progress ...
>> time awk '-f HNusage/HNs.bst.lt/temp { for (i = 0; ++i <= NF; )'{grep $1 *.txt}'} > HNusage/HNs.bst.lt/HNs.bst.lt.visitors08.txt
(the path statements are to keep the output file away from the Recent Visitor files that are in the current directory)

Attached are the master file (HNs.bst.lt.txt) and two target files out of the 45-file list.

George Langford

Attachment (size):
HNs.bst_.lt_.txt (53.85 KB)
HNs.www_.barcodeus.com_.txt (136.71 KB)
HNs.www_.outwardbound.net_.txt (330.34 KB)
Magic Banana


Again: you have no idea how much time you would save by stopping for ~10 hours and actually learning the commands you use (e.g., as I have already told you, the "system" function must be used in AWK to call system commands... but you do not need that here), regular expressions (unless escaped, a dot means "any character"; 'grep -F' must be used to interpret the patterns as fixed strings), simpler commands (your first AWK program just does 'cut -f 2'; 'grep' is rarely needed with structured files), etc.

Here, the simplest solution is to use 'join', which was also the solution in the previous thread you created on this forum...

However, your input looks wrong: on line 674 of HNs.bst_.lt_.txt, the second column only contains the character 0... and your 'grep' selects (among others) all the lines that include this character. I assume you want whole domain matches.
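
For instance (with a purely illustrative hostname), compare:
$ printf 'ns1-examplexcom\n' | grep 'ns1.example.com'
$ printf 'ns1-examplexcom\n' | grep -F 'ns1.example.com'
The first command prints the line, because each unescaped dot matches any character; the second prints nothing, because the pattern is taken as a fixed string.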

The solution with 'join':
$ cut -f 2 HNs.bst_.lt_.txt | sort > HNusage/HNs.bst.lt/temp
$ sort -k 2 *.txt | join -1 2 - HNusage/HNs.bst.lt/temp

If you want the output formatted like the inputs, append "| awk '{ print $2 "\t" $1 }'" to the last command.

If HNs.bst_.lt_.txt may be much larger and if you care about improved execution time, you can have all four commands run in parallel using a named pipe, created with 'mkfifo', instead of a temporary file. Doing that in a Shell script taking as input first the file with the domain names to search and then all other files:
#!/bin/sh
if [ -z "$2" ]
then
printf "Usage: $0 searched-domain-file file1 [file2 ...]
"
exit
fi
searched="$1"
shift
TMP=$(mktemp -u)
trap "rm $TMP 2>/dev/null" 0
mkfifo $TMP
cut -f 2 "$searched" | sort > $TMP &
sort -k 2 "$@" | join -1 2 - $TMP
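
Saved to an executable file (the name "search-domains" below is arbitrary), the script is run with the searched file first, e.g. on the three files attached to the original post:
$ ./search-domains HNs.bst_.lt_.txt HNs.www_.barcodeus.com_.txt HNs.www_.outwardbound.net_.txt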

amenex
Hors ligne
A rejoint: 01/03/2015

Magic Banana wonders:

> However, your input looks wrong: on line 674 of HNs.bst_.lt_.txt, the second column only contains the character 0...
> and your 'grep' selects (among others) all the lines that include this character. I assume you want whole domain matches.

I checked the original webalizer data. Right after the "ws-68.oscsbras.ru" entry there is a hostname "." that also
appears in my own Recent Visitor data because my shared server gratuitously performs hostname lookups on every IPv4
address that appears on its doorstep. I have complained to my ISP, and a couple of times they turned off that "feature."
... but the server quickly reverts to the hostname lookup, which Apache actually deprecates. Something in Leafpad or in
LibreOffice Calc is converting that "." to a zero character. Maybe it's a warning to do the processing of the webalizer
data sets without resorting to LibreOffice Calc. Some day I'll try searching my voluminous nMap results to see if I can
put IPv4 address(es) with those "." hostnames.

Regarding the second half of the second part of the proposed solution:

>> join -1 2 - HNusage/HNs.bst.lt/temp

Grep would have included the file names of the data sets encompassed by *.txt in the current directory and by "-" in this
script, but this syntax of join does not. Is there a way of maintaining the association between the "hits" in the joined
file and the data set wherein they reside?

Magic Banana


Simply add the filename in a new field:
$ cut -f 2 HNs.bst_.lt_.txt | sort > HNusage/HNs.bst.lt/temp
$ awk '{ print $2, $1, FILENAME }' *.txt | sort -k 1,1 | join - HNusage/HNs.bst.lt/temp

In the script:
#!/bin/sh
if [ -z "$2" ]
then
printf "Usage: $0 searched-domain-file file1 [file2 ...]
"
exit
fi
searched="$1"
shift
TMP=$(mktemp -u)
trap "rm $TMP 2>/dev/null" 0
mkfifo $TMP
cut -f 2 "$searched" | sort > $TMP &
awk '{ print $2, $1, FILENAME }' "$@" | sort -k 1,1 | join - $TMP

amenex

Followup:

Starting with the setup script:

> time awk < HNs.bst.lt.txt '{print $2}' > HNusage/HNs.bst.lt/temp

When the grep command acts on a group of files that have been sorted, the script runs much more quickly and
uses much less RAM, with no need of swap support:

> time grep -f HNusage/HNs.bst.lt/temp <( sort -k 2 *.txt) > HNusage/HNs.bst.lt/HNs.bst.lt.visitors18.txt
The names of the searched files are not recorded, however.

My attempt to force inclusion of the searched files' filenames:

> time grep -H -f HNusage/HNs.bst.lt/temp <(sort -k 2 *.txt) > HNusage/HNs.bst.lt/HNs.bst.lt.visitors18.txt
Lists the name of a binary file (/dev/fd/63:) not the names of the data sets associated with each match.

After some opposition from bash, I combined the setup and grep scripts:

> time awk < HNs.bst.lt.txt '{print $2}' > HNusage/HNs.bst.lt/temp | time grep -H -f HNusage/HNs.bst.lt/temp <( sort -k 2 *.txt) > HNusage/HNs.bst.lt/HNs.bst.lt.visitors18.txt

But the output still includes that /dev/fd/63 filename, which I suspect includes all the target filenames ...

However, man grep states:

>> Output Line Prefix Control
>> -H, --with-filename
>> Print the file name for each match. This is the default when there is more than one file to search.

With my initial grep script, the filenames were included, apparently at the expense of RAM and swap ... All I'm doing
now is sorting the data sets that are being searched. I suspect that I'll have to sort each of the forty-five data sets,
one at a time, before starting the grep script:

> time awk < HNs.bst.lt.txt '{print $2}' > HNusage/HNs.bst.lt/temp | sort -k 2 *.txt | time grep -H -f HNusage/HNs.bst.lt/temp *.txt > HNusage/HNs.bst.lt/HNs.bst.lt.visitors20.txt

With a little feedback from bash:
>> 702.25user 1.98system 11:48.88elapsed 99%CPU (0avgtext+0avgdata 2960936maxresident)k
>> 0inputs+20104outputs (0major+740271minor)pagefaults 0swaps

But this very long script manages to retain the domains' filenames along with the two columns of matches ... 10.3 MB worth,
but in eleven minutes and without taxing RAM or even using swap. Those 10.3 MB match my earlier and more RAM-extravagant results.

Magic Banana


Your new 'grep' takes one single input: the output of sort -k 2 *.txt, which is textual, not binary (and happened to have the file descriptor 63 in your execution). That 'grep' never sees the files that match .txt. So, of course, it cannot write their names.

Reading and writing the same file (HNusage/HNs.bst.lt/temp) in the same command line is still nondeterministic. I explained it to you three or four times. I will not do it again: learn pipes and redirections.

Learning regular expressions is not optional before using 'grep'. Again: a dot (all over HNusage/HNs.bst.lt/temp) means "any character". Not necessarily a dot. And again: 'grep' selects lines containing the pattern. Not necessarily as a whole "field" (a concept that does not exist in 'grep'). For instance, if the pattern "0" is still in HNusage/HNs.bst.lt/temp, all the lines that include "0" are selected. And I see little reason why a sorted input would speed up 'grep' (maybe branch prediction... but most probably your new command line does not do the same thing as the previous one).

Apart from being correct, the solution with 'join' is probably faster. Out of curiosity: what is its run time? It is possible that reading/writing the (long) file names dominates that run time. If so, the files can be identified with numbers. To do so, the tiny AWK program can become 'FNR == 1 { ++f } { print $2, $1, f }'.
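
Plugged into the command line from my earlier post, that variant would give something like this (the output then refers to the searched files by their position among the arguments instead of by name):
$ awk 'FNR == 1 { ++f } { print $2, $1, f }' *.txt | sort -k 1,1 | join - HNusage/HNs.bst.lt/temp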

amenex

Taking Magic Banana's cue, I applied the join command in round-robin fashion:

> HNs.Ed.tropic.ssec.wisc.edu.txt <== Start with this one; then join it in turn with each of the others in the current directory

> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.genetherapynet.com.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/genetherapynet.com.txt
> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.www.cincynature.org.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/www.cincynature.org.txt
> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.www.ashevilletheatre.org.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/www.ashevilletheatre.org.txt
...
> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.Ed.radio.at-agri.com.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/Ed.radio.at-agri.com.txt
> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.www.pyrok.com.hk.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/www.pyrok.com.hk.txt

Now a script is needed to concatenate all 44 files [along with each filename]:

> time cat `ls /home/george/Desktop/June2019/DataSets/HNs-TwoColumn/HNusage/HNs.Ed.tropic.ssec.wisc.edu` > /home/george/Desktop/June2019/DataSets/HNs-TwoColumn/HNusage/HNs.Ed.tropic.ssec.wisc.edu.visitors.txt

Produces 1.4 MB output, but without any filenames; this can be remedied by adding the filename to each file in turn
with a print statement or (shudder) in Leafpad, one file at a time, 44 times over ... but there's another way:

See this link: https://unix.stackexchange.com/questions/117568/adding-a-column-of-values-in-a-tab-delimited-file

>> awk '{print $0, FILENAME}' file1 file2 file3 ...

> awk '{print FILENAME,"\t",$0}' 01.txt 02.txt 03.txt ... which works because I renamed the files to fit this format.

I kept the roster list ...

Here's the whole awk script, which concatenates all 44 files and inserts the file name associated with the data as desired:
> time awk '{print FILENAME,"\t",$0}' 01.txt 02.txt 03.txt 04.txt 05.txt 06.txt 07.txt 08.txt 09.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt 21.txt 22.txt 23.txt 24.txt 25.txt 26.txt 27.txt 28.txt 29.txt 30.txt 31.txt 32.txt 33.txt 34.txt 35.txt 36.txt 37.txt 38.txt 39.txt 40.txt 41.txt 42.txt 43.txt 44.txt > Backup/ProcessedVisitorLists/FILENAME.txt

It's 1.8 MB and in a pretty format, not yet processed as mentioned elsewhere ... a luncheon date beckons.

Timing? The join commands took about an hour to set up, ca. 1 or 2 seconds real time for each one (after copy
and paste into the console, about 15 seconds for each of the 44 join commands ==> 11-1/2 minutes), and this last
monstrosity took 0.05 seconds real time, not to mention all morning struggling with a prettier method of reading
what's been in the current directory all along. Repeating it for the other 43 combinations should now be
a breeze, as I can switch the file names around with Leafpad.

All because I haven't yet spent those ten hours that've been mentioned every so often ...

Here is the extra processing that I promised:

Step (1) > sort -k2 HNs.Ed.tropic.ssec.wisc.edu.txt > HNs.Ed.tropic.ssec.wisc.edu.Sorted.txt

Step (2) > time awk 'NR >= 2 { print $1, $2, $3, $4 }' 'HNs.Ed.tropic.ssec.wisc.edu.Sorted.txt' | uniq --skip-fields=1 --all-repeated=none | awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 }' > HNs.Ed.tropic.ssec.wisc.edu.Duplicates.txt

The duplicates are now grouped, but without intervening blank lines. What remains is to count the number of
duplicates in each group, and then place the groups with the most numerously duplicated hostnames at the top
of the list. There are four columns, starting with the name of the associated hostname file, then the
duplicated hostnames from the other 43 joined hostname files, then the number of instances of the duplicitous
hostname in the hostname file common to all the joins, and then the number of instances of that hostname in
each associated hostname file.

This method catches much-duplicated hostnames that are used infrequently in the course of a month's traffic in
one domain, but which are frequently applied to other domains.
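
A rough sketch of that remaining counting step (taking the duplicated hostname from the second column of the Duplicates file described above; the output filename is only a placeholder):
> awk '{ print $2 }' HNs.Ed.tropic.ssec.wisc.edu.Duplicates.txt | sort | uniq -c | sort -rn > HNs.Ed.tropic.ssec.wisc.edu.DuplicateCounts.txt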

Attached is a snippet (about a quarter) of the duplicates file, which needs processing to place the most numerous
hostname duplicate groups at the top of the list.

Attachment (size):
OutputFileSnippet.txt (344.51 KB)
Magic Banana


> Taking Magic Banana's cue, I applied the join command in round-robin fashion

Here is what I actually proposed (where filenames are appended *before* the join): https://trisquel.info/forum/grep-consumes-all-my-ram-and-swap-ding-big-job#comment-142474

> time awk '{print FILENAME,"\t",$0}' 01.txt 02.txt 03.txt 04.txt 05.txt 06.txt 07.txt 08.txt 09.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt 21.txt 22.txt 23.txt 24.txt 25.txt 26.txt 27.txt 28.txt 29.txt 30.txt 31.txt 32.txt 33.txt 34.txt 35.txt 36.txt 37.txt 38.txt 39.txt 40.txt 41.txt 42.txt 43.txt 44.txt > Backup/ProcessedVisitorLists/FILENAME.txt

Can't you write *.txt?!

> Repeating it for the other 43 combinations should now be a breeze, as I can switch the file names around with Leafpad.

I am not sure I understand what you want to do (join every file with the union of all other files?) but Leafpad is certainly not the best solution.

amenex

Magic Banana asked:

>> Here is what I actually proposed (where filenames are appended *before* the join):
>> https://trisquel.info/forum/grep-consumes-all-my-ram-and-swap-ding-big-job#comment-142474

When I collapsed the script into a one-liner, it would start ... but nothing ensued for about ten minutes.
At the point of desperation, it dawned on me what you proposed, so I put it in more readable terms:

> time awk '{print FILENAME,"\t",$0}' 01.txt 02.txt 03.txt 04.txt 05.txt 06.txt 07.txt 08.txt 09.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt 21.txt 22.txt 23.txt 24.txt 25.txt 26.txt 27.txt 28.txt 29.txt 30.txt 31.txt 32.txt 33.txt 34.txt 35.txt 36.txt 37.txt 38.txt 39.txt 40.txt 41.txt 42.txt 43.txt 44.txt > Backup/ProcessedVisitorLists/FILENAME.txt

This script is expandable for the foreseeable future, because rather few webmasters publish their Webalizer data online.
I wrote it for the joined files, but it's as easily applied to the pre-joined files as well. I've since changed my naming
protocol to make the status of the files more clear.

>> Can't you write *.txt?!
FILENAME has specific meaning in awk, so I was sure that I would be getting that for which I was asking. Besides: it works.

Magic Banana


> When I collapsed the script into a one-liner

Why would you do that? Copy-paste the script into a file with a meaningful name. Then make that file executable (e.g., using your file manager or with 'chmod +x').

> it would start ... but nothing ensued for about ten minutes.

It works in less than 0.1 s on the three files you gave in the original post.

> awk '{print FILENAME,"\t",$0}'

That prints the file name followed by a space, a tab, another space and the original line. I doubt it is what you want. Also, the tiny AWK program I proposed changes the order of the fields so that the join field is first and no option needs to alter the default behavior of 'join'.
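
If a single tab between the file name and the original line is what is wanted, the comma-free form would be:
awk '{ print FILENAME "\t" $0 }' *.txt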

>> Can't you write *.txt?!
> FILENAME has specific meaning in awk, so I was sure that I would be getting that for which I was asking.

I am talking about the arguments given to awk: 01.txt 02.txt 03.txt 04.txt 05.txt 06.txt 07.txt 08.txt 09.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt 21.txt 22.txt 23.txt 24.txt 25.txt 26.txt 27.txt 28.txt 29.txt 30.txt 31.txt 32.txt 33.txt 34.txt 35.txt 36.txt 37.txt 38.txt 39.txt 40.txt 41.txt 42.txt 43.txt 44.txt. Typing that is a waste of time and is prone to error.

Also, if, in my previous post, I have understood what you wanted to do with Leafpad and 43 manual executions (i.e., "join every file with the union of all other files"), here is a slightly modified script that does everything in one single execution:

#!/bin/sh
if [ -z "$2" ]
then
printf "Usage: $0 file1 file2 [file3 ...]
"
exit
fi
TMP=$(mktemp -u)
trap "rm $TMP 2>/dev/null" 0
mkfifo $TMP
for searched in "$@"
do
shift  # drop "$searched" from the front of the argument list: "$@" now holds the other files
cut -f 2 "$searched" | sort > $TMP &
awk '{ print $2, $1, FILENAME }' "$@" | sort -k 1,1 | join - $TMP > "$searched-joins"
set "$@" "$searched"  # put "$searched" back at the end so that every file gets its turn
done

Again: if you want to reorder the fields and change the spaces into tabs, you can pipe the output of the join, in the for loop, to something like:
awk '{ print $2 "\t" $3 "\t" $1 }'

Notice the absence of comma (which would insert OFS, a space by default).

amenex

Magic Banana may be re-stating my objective differently than I have been stating it:

>> Also, if, in my previous post, I have understood what you wanted to do with Leafpad and 43 manual executions
>> (i.e., "join every file with the union of all other files"), here is a slightly modified script that does
>> everything in one single execution ...

That would distill things down until very few multiple hostnames were left, because there aren't many hostnames
found in even most of the domain data that I've been able to collect. Instead, there are groups of 40 or so
identical hostnames, then groups of smaller numbers, indicating sharper focus, and so on. I hesitate to expose
the results until the more frequently used hostnames can be resolved to singular or multiple IPv4 addresses.

No; all my joins are pairwise, as in 01.txt joined with 02.txt. I skip 02.txt joined with 01.txt, which gives
the identical result, and 01.txt joined with 01.txt, which simply duplicates the (often very long) initial list
of hostnames accessing domain 01. My output files are very numerous ~ [(45*44/2 -45) = 945]; once created they
are not joined but merely concatenated and then grouped to yield bunches of identical hostnames, each one traceable
to one of the original hostnames in the Recent Visitor lists.

Note that in each successive list of scripts, one more of the scripts at the top is commented out
so as to eliminate duplications, until in the last script list there is just one script left uncommented.

This as-yet unworkable script is meant to concatenate the first set of those 945 files:
> time awk --file Joins/Script-Joins-sorted-07272019.txt
... and it has to be repeated 44 more times, as reflected in the ever-shortening list of script lists in 142549.

After that comes the task of looking up, with nMap, the CIDR report, and Google, as many other identical hostnames
as can reasonably be found.

Magic Banana


You write a lot but the problem is still unclear to me. My last script is a solution to the following problem:

"Given text files where each line is a number followed by a tabulation and a hostname (unique in a given file), take the files one by one and list, for each of its hostnames, how many other files have it, what are these other files and the numbers they relate the hostname to. The hostnames are reordered in decreasing order of the number of other files listing it. Any hostname in one single file (no other file has it) is unlisted."

Indeed, if you copy-paste the script of https://trisquel.info/forum/grep-consumes-all-my-ram-and-swap-ding-big-job#comment-142554 in "/usr/local/bin/multi-join" (for instance) and make that file executable ('sudo chmod +x /usr/local/bin/multi-join'), then you can execute that script on all the files ('multi-join *.txt', if the files are all those with txt suffix in the working directory) and will get as many new files bearing the same names suffixed with "-joins". For instance, giving as input the three files in the original post, you get, within 0.1s, three new files, "HNs.bst_.lt_.txt-joins", "HNs.www_.barcodeus.com_.txt-joins" and "HNs.www_.outwardbound.net_.txt-joins". "HNs.bst_.lt_.txt-joins", for instance, contains:

2 webislab40.medien.uni-weimar.de 1 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt
2 vmi214246.contaboserver.net 2 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt
(...)
1 107-173-204-16-host.colocrossing.com 15 HNs.www_.outwardbound.net_.txt
1 104-117-158-51.rev.cloud.scaleway.com 1 HNs.www_.barcodeus.com_.txt

The first line means:

2 files, besides "HNs.bst_.lt_.txt", have "webislab40.medien.uni-weimar.de": "HNs.www_.barcodeus.com_.txt", which relates this hostname with the number 1, and "HNs.www_.outwardbound.net_.txt", which relates this hostname with the number 2.

Since, here, there are three files, 2 is the maximal number of other files (all files have the hostname) and 1 is the minimum (because any hostname in one single file is unlisted): in this example, all *-join files list, first, lines starting with 2, second, lines starting with 1.

If that is not the problem you face, please express the actual problem clearly. As I did above. Not with ten paragraphs. Not with digressions.

amenex

Magic Banana politely requested:

>> please express the actual problem clearly. [paraphrasing] in less than ten paragraphs

The following scheme worked OK with a list of about a million visitors' hostnames.

(1) Collect Recent Visitor data with a Google search on {stats "view all sites"}. Pick a recent month (such as 201906); then copy and save the usually very long list of hostnames (last column) and the number of occurrences (first column). Discard the middle columns and retain the files 01.txt ... NN.txt

(2) Concatenate all the resulting two-column files into one multi-megabyte text file and add the filenames of the component files in the first column:
>> awk '{print FILENAME "\t" $0}' 01.txt 02.txt ... NN.txt > Joins/FILENAME.txt

(3) Sort the resulting multi-megabyte text file on its third column:
>> time sort -k3 FILENAME.txt > Sorted.txt

(4) Collect all the many duplicate hostnames in the sorted multi-megabyte text file. A workable script is:
>> awk 'NR >= 2 { print $1, $2, $3 }' 'Sorted.txt' | uniq --skip-fields=2 --all-repeated=none | awk '{ print $1 "\t" $2 "\t" $3}' > Duplicates.txt

This is a good file to keep for research purposes, as the domains subjected to the repeated visits are still associated with the hostnames of the visitors.

(5) Strip all but the third column from Duplicates.txt with
>> awk '{ print $3 }' 'Duplicates02.txt' > CountsOrder.txt

(6) Count the occurrences of each hostname with this script:
>> uniq -c CountsOrder.txt > OrderCounts.txt

(7) Sort this file to place the hostnames with the most numerous counts at the top:
>> sort -rg OrderCounts.txt > SortedByFrequency.txt

(8) Truncate the megabyte-plus SortedByFrequency.txt file to include only duplicates of three or more: SortedByFrequencyGT2.txt:
>> https://trisquel.info/files/SortedByFrequencyGT2.txt
... the scholar will add a suitable script for this task when NN increases to an unmanageably large number (a sketch follows this list).

(9) Find all the IPv4 addresses which resolve to the most numerously repeated visitor hostnames. That's another <= ten paragraphs' worth.
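
For step (8), a minimal filter would do the truncation (assuming, as above, that the count is the first column of SortedByFrequency.txt):
>> awk '$1 > 2' SortedByFrequency.txt > SortedByFrequencyGT2.txt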

Magic Banana


That is not a problem. That is an algorithm whose first step is not even clear: a Google search on {stats "view all sites"} returns a "normal response", a list of 224,000 pages from different websites.

Still assuming that the three files you posted in the original post are a sample of your input, I actually wonder if all you want is not simply:
$ awk '{ print FILENAME, $0 }' *.txt | sort -k 3 | awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" } { ++c; r = r " " $1 " " $2 }' | sort -nrk 1,1 > out

If *.txt catches the three files, "out" is (with the same semantics as explained in my previous post, except that the file name is now before the number):
3 xhsjs.preferdrive.net HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt 6
3 webislab40.medien.uni-weimar.de HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 1 HNs.www_.outwardbound.net_.txt 2
(...)
1 027a74fd.bb.sky.com HNs.www_.outwardbound.net_.txt 188
1 014199116180.ctinets.com HNs.www_.outwardbound.net_.txt 3

The input files can then be removed: all the information is in "out". You can query it with 'grep' and 'awk'. For instance:

  • To only get the lines with hostnames in "HNs.bst_.lt_.txt" (hence a selection with as many lines as "HNs.bst_.lt_.txt"):
    $ grep -F ' HNs.bst_.lt_.txt ' out
  • To additionally impose that the selected hostnames are in at least another file (as in the problem I stated):
    $ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1'
  • To only keep the hostnames of the previous output:
    $ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1 { print $2 }'
amenex

The scholar, focusing on the mathematics, admonishes the "pragmatic idealist":

>> That is not a problem. That is an algorithm whose first step is not even clear:
>> a Google search on {stats "view all sites"} returns a "normal response", a list of 224,000 pages from different websites.

OK: ["usage statistics" "view all sites"] returns nearly 100% viable results. Open them one at a time; change
the URL from ".../stats/usage_201904.html" to ".../stats/site_201906.html" from which the first and last columns
should be selected and saved as a two-column table.

We've developed this to a stage at which one could probably repeat the exercise as soon as each month's data
becomes available ... makes a good homework problem. Getting all the IPv4 addresses makes it a dissertation.

Here's another stab at stating the problem:

Too many visitors to our Internet domains are up to no good, as evidenced by the many unresolvable hostnames
appearing in the lists of Recent Visitors that we find in our own domains' logs. Internet traffic is identified
by a numerical protocol, either IPv4 (currently) or IPv6 (future, to gain more address space). Our ISPs
gratuitously allow these numerical addresses to be translated into visually more easily recalled hostnames.
Unfortunately, operators of Internet servers are allowed to assign arbitrary names to each numerical address in
the server's address space, known as "pointers", a.k.a. PTRs. Each server address can host any number of
subdomains, whose names must be registered; these are known as A records. Sadly, during the process of
transmitting information on the Internet, expedient nameservers are used to cache repeatedly accessed requests
for the IP addresses of hostnames and thereby accumulate both A records and PTR records which are treated
equivalently. If a PTR is named the same as more than one other PTR, the A records at either one's IP address
become unresolvable unless one knows the IP address of its official nameserver. Therefore, we can protect our own
Internet domains by blocking all the offending IP addresses that we can identify from the repetitive status of
the hostnames that appear to be most indiscriminately accessing similar domains to our own.

amenex

Magic Banana requested clarification of my future plans:

In reply to what I said:
> Repeating it for the other 43 combinations should now be a breeze, as I can switch the file names around with Leafpad.

>> I am not sure I understand what you want to do (join every file with the union of all other files?) ...

Yes: Join each file in turn with all the others, so all the repeatedly used hostnames are represented. Once the 45
results are in hand, I'll concatenate them _without_ using one final join, thereby keeping the babies from getting
thrown out with the bathwater. I have a script that does a nice job of grouping the duplicated hostnames, but it
won't separate them with blank lines ... (yet). See 142510.

>> ... but Leafpad is certainly not the best solution.

Understood, but it's visual, and I can use "undo" nearly without end. Also, I want to guard against double-counting,
as with 01j01.txt or 01j02.txt vs 02j01.txt, and that requires some heavy-duty concentration. My non-geek work will
not take more than an hour or so, ... actually, it took an hour and twenty minutes, and avoiding the double-counting
is visually quite clear in Leafpad ... and the next stage will take just 44 blinks of an eye. I've retained all the
stages of the processing script: > time awk --file Joins/Script-Joins-sorted-07272019.txt (see 142538 and the attached
examples) so they won't have to be re-created when (or if) more Webalizer data comes to light; the new data can then just be
appended.

At the end, there's still the task of identifying the often-proliferating IPv4 address(es) that go with each hostname.

I've separated all (?) the untranslated IPv4 addresses beforehand, but the nMap scans are taking too long. I'll need
to perform our "join" magic on those 45 data sets also to reduce the sheer quantity of addresses to be nMap'ed.

Attachment (size):
Script-Joins-01.txt (3.21 KB)
Script-Joins-02.txt (3.21 KB)
Script-Joins-03.txt (3.21 KB)
Script-Joins-43.txt (3.29 KB)
Script-Joins-44.txt (3.29 KB)
Magic Banana


> I want to guard against double-counting, as with 01j01.txt or 01j02.txt vs 02j01.txt, and that requires some heavy-duty concentration.

"My" solution (since my first post in this thread) joins one file with all the other files. Not pairwise. There is nothing to concatenate at the end.

> I have a script that does a nice job of grouping the duplicated hostnames, but it won't separate them with blank lines ... (yet).

"My" solution (since my first post in this thread) outputs the hostnames in order. They are already grouped. To prepend them with blank lines, the output of every join can be piped to:
awk '$1 != p { p = $1; print "" } { print }'

However, I believe I have finally understood the whole task and I do not see much point in having the repetitions on several lines (uselessly repeating the hostname). AWK can count the number of other files where the hostname is found, print that count, the hostname (once) and the rest (the number and the file name). 'sort' can then sort in decreasing order of count. The whole solution is:

#!/bin/sh
if [ -z "$2" ]
then
printf "Usage: $0 file1 file2 [file3 ...]
"
exit
fi
TMP=$(mktemp -u)
trap "rm $TMP 2>/dev/null" 0
mkfifo $TMP
for searched in "$@"
do
shift
cut -f 2 "$searched" | sort > $TMP &
awk '{ print $2, $1, FILENAME }' "$@" | sort -k 1,1 | join - $TMP | awk 'p != $1 { if (p != "") print c, p r; p = $1; c = 0; r = "" } { ++c; r = r " " $2 " " $3 }' | sort -nrk 1,1 > "$searched-joins"
set "$@" "$searched"
done

If, instead of the number of other files you want the sum of their numbers, substitute "++c" with "c += $2".

amenex

Hmmm. We seem both to be writing at once ...

Magic Banana is saying:

Quoting amenex:
> I want to guard against double-counting, as with 01j01.txt or 01j02.txt vs 02j01.txt, and that requires
> some heavy-duty concentration.

>> "My" solution (since my first post in this thread) joins one file with all the other files. Not pairwise.
>> There is nothing to concatenate at the end.

amenex again:
> I have a script that does a nice job of grouping the duplicated hostnames, but it won't separate them with
> blank lines ... (yet).

>> "My" solution (since my first post in this thread) outputs the hostnames in order. They are already grouped.
>> To prepend them with blank lines, the output of every join can be piped to:
>>> awk '$1 != p { p = $1; print "" } { print }'

>> However, I believe I have finally understood the whole task and I do not see much point in having the
>> repetitions on several lines (uselessly repeating the hostname). AWK can count the number of other files
>> where the hostname is found, print that count, the hostname (once) and the rest (the number and the file
>> name). 'sort' can then sort in decreasing order of count. The whole solution is:

amenex:
I'll try that later ... right now I'm thinking that the problem may be analyzed another way, simply by
concatenating all the Recent Visitors into one [huge] file while retaining each hostname's association with the
domains' Webalizer data, then grouping the Recent Visitor hostnames according to the quantities of their
occurrences, and thereafter discarding the smallest numbers of duplicate hostnames. The data total 39 MB.

With the current directory set to the one in which the numerically coded two-column hostname lists reside:
time awk '{print FILENAME"\t"$0}' 01.txt 02.txt 03.txt 04.txt 05.txt 06.txt 07.txt 08.txt 09.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt 21.txt 22.txt 23.txt 24.txt 25.txt 26.txt 27.txt 28.txt 29.txt 30.txt 31.txt 32.txt 33.txt 34.txt 35.txt 36.txt 37.txt 38.txt 39.txt 40.txt 41.txt 42.txt 43.txt 44.txt 45.txt > Joins/ProcessedVisitorLists/FILENAME.txt ... 46.2 MB; 1,038,048 rows (0.067 sec.)

> time sort -k3 FILENAME.txt > Sorted.txt (0.112 sec.)
> time awk 'NR >= 2 { print $1, $2, $3 }' 'Sorted.txt' | uniq --skip-fields=2 --all-repeated=none | awk '{ print $1 "\t" $2 "\t" $3}' > Duplicates.txt ... 7.0 MB; 168,976 rows (0.093 sec.)

Forgive me for my use of the unsophisticated script ... the groups are in the appropriate bunches, but the bunches are in
alphabetical order; on the other hand, all the original domain data are still present. Sure beats grep, though ...

Print only the hostname column: > time awk '{ print $3 }' 'Duplicates02.txt' > CountsOrder.txt (now 5.5 MB; 0.016 sec.)

Now do the counting step: > time uniq -c CountsOrder.txt > OrderCounts.txt ... back up to 1.1 MB; 0.009 sec

Finally, sort them according to count frequency: > time sort -rg OrderCounts.txt > SortedByFrequency.txt ... still 1.1 MB; 0.003 sec.

Truncate to include only counts greater than 2: > SortedByFrequencyGT2.txt 714 KB (attached)

There are a lot of high-count repetitions.

Attachment (size):
SortedByFrequencyGT2.txt (697.2 KB)
amenex

Using awk to execute the series of join commands described above produces only syntax errors.

Here are those commands:

> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.genetherapynet.com.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/genetherapynet.com.txt
> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.www.cincynature.org.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/www.cincynature.org.txt
> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.www.ashevilletheatre.org.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/www.ashevilletheatre.org.txt
...
> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.Ed.radio.at-agri.com.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/Ed.radio.at-agri.com.txt
> join -1 2 -2 2 <(sort -k 2 HNs.Ed.tropic.ssec.wisc.edu.txt) <(sort -k 2 HNs.www.pyrok.com.hk.txt) > HNusage/HNs.Ed.tropic.ssec.wisc.edu/www.pyrok.com.hk.txt

I attempted to simplify these commands by substituting sequential numeric names (44 executions in all):

> # join -1 2 -2 2 <(sort -k 2 01.txt) <(sort -k 2 01.txt) > Joins/01j01.txt
> join -1 2 -2 2 <(sort -k 2 01.txt) <(sort -k 2 02.txt) > Joins/02j01.txt
> join -1 2 -2 2 <(sort -k 2 01.txt) <(sort -k 2 03.txt) > Joins/03j01.txt
> join -1 2 -2 2 <(sort -k 2 01.txt) <(sort -k 2 04.txt) > Joins/04j01.txt
...
> join -1 2 -2 2 <(sort -k 2 01.txt) <(sort -k 2 43.txt) > Joins/43j01.txt
> join -1 2 -2 2 <(sort -k 2 01.txt) <(sort -k 2 44.txt) > Joins/44j01.txt
> join -1 2 -2 2 <(sort -k 2 01.txt) <(sort -k 2 45.txt) > Joins/45j01.txt

Joins that copy a file onto itself are commented out; later, identical joins such as 01j02.txt
and 02j01.txt will have to have the second of the pair skipped in the concatenation step to avoid causing further
duplications. I will eventually be making every possible pairwise join that isn't a duplication.
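
(In hindsight, a bash loop along these lines could generate and run the 44 joins for 01.txt without typing each command; this is only a sketch, and bash is needed for the process substitutions:)
for f in [0-4][0-9].txt
do
[ "$f" = 01.txt ] && continue  # skip the self-join
join -1 2 -2 2 <(sort -k 2 01.txt) <(sort -k 2 "$f") > "Joins/${f%.txt}j01.txt"
done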

My awk command is meant to execute the above commands listed in the file, Joins/Script-Joins-sorted-07272019.txt:

> time awk --file Joins/Script-Joins-sorted-07272019.txt

Running those scripts one at a time is successful, but tedious. The awk command generates numerous syntax errors,
mainly at the dots in the hostnames contained in the files 01.txt and 45.txt

I tried enclosing the filenames (01.txt through 45.txt) in single quotes, but that just made matters worse ...

Once the awk command works, these join commands' results can be put together lickety-split as described in my
earlier posting above.
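
(Presumably the underlying issue is that "awk --file" expects an AWK program, not a list of shell commands; running the same file with the shell instead (bash rather than sh, because of the process substitutions) should be closer to what is needed:)
> time bash Joins/Script-Joins-sorted-07272019.txt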