Need to compare numbers - Help with script

10 replies [Latest post]
GNUser
Offline
Joined: 07/17/2013

So, I have a file with dozens of lines like this:

12 13 59 2
102 12 32 2
99 13 102 19

I need to check whether any of the numbers in the first line are present in the second line. Then I need to check whether any of the numbers in the second line are in the third. And so on. I don't want to count a number that appears in both the first and third lines (13, for example, is not to be counted in my example above).
Position is not important (number 12 is in the first position in the first line, but in the second position in the second line; I still want to count it).

I have been looking at ways to do loops but not sure how to build an array for this (not even sure if an array IS the best option).

Any help is greatly appreciated, I usually do well with bash but this time around it's proving more difficult. Thanks!

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

You can use that AWK program:
{ for (i = 1; i <= NF; ++i) if ($i in a) print FILENAME ":" FNR - 1 "-" FNR ": " $i " repeated"; delete a; for (i = 1; i <= NF; ++i) a[$i] }
Of course, you can turn the print into whatever you need to do with those repetitions. I can help if you do not know how.

GNUser
Offline
Joined: 07/17/2013

First, thanks for the reply.

Second, I have never used awk much, if at all, so I have little experience with it (unlike grep, which I have used on a few occasions), which is why I didn't think of using it.

Third, I will need your help to, first, understand the code you wrote. It seems to me that it is comparing each number to the entire document; am I correct? I ask because the intention was to compare each number only to the next line.
Basically, find common numbers between the first and second lines, then the second and third, then the third and fourth, etc.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

I have never used awk much, if at all, so I have little experience with it (unlike grep, which I have used on a few occasions), which is why I didn't think of using it.

AWK is a very simple programming language. Very simple to learn too, especially if you know the basics of the languages that inspired it: C (Boolean operators, conditional statements, loops, numerical and string-processing functions, etc.), sed (mostly for the program structure, conditions followed by actions, but sed-like substitutions are possible too), and the shell (pipes, redirections, etc.). If you often need to process structured text files, it is really worth learning AWK's basics: it will take you a day of work, and you will save much time in the long run.

I will need your help to, first, understand the code you wrote.

First of all, I wrote the program on one single line. For clarity, you should break the line. Any good text editor can auto-indent the program and highlight its syntax (as long as the language is recognized: the first line should be "#!/usr/bin/awk -f").
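For instance, the same program could be laid out over several lines and run on the sample lines from the original post (the file name "numbers" is just an example):

```shell
# The one-liner above, broken out for readability, run on the sample data.
printf '12 13 59 2\n102 12 32 2\n99 13 102 19\n' > numbers
awk '{
    for (i = 1; i <= NF; ++i)
        if ($i in a)
            print FILENAME ":" FNR - 1 "-" FNR ": " $i " repeated"
    delete a
    for (i = 1; i <= NF; ++i)
        a[$i]
}' numbers
# Output:
# numbers:1-2: 12 repeated
# numbers:1-2: 2 repeated
# numbers:2-3: 102 repeated
```

Notice that 13, which appears in the first and third lines but not in consecutive lines, is not reported.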

The program I wrote consists of one single action, between braces (which more generally delimit "blocks" of instructions, as in C). That action is unconditional, because no condition precedes "{". That means the action is executed on every record. By default every line is a record (the record separator, RS, can be redefined). It is the case here.

The program uses two "for" loops, "for (i = 1; i <= NF; ++i)". Those loops follow the classical C syntax. However, no variable needs to be declared or even initialized (with "=", as in C) in AWK: a variable is created the first time it is used, as "" for a string, as 0 for a number, and as an empty array for an array.

Those loops enumerate the field numbers. Indeed, NF is the number of fields the record contains; it may vary from record to record. Whenever a record is read, AWK automatically defines many variables besides NF: $0 is the record, NR its number, FNR the same but reinitialized whenever a new file is processed (the program can take several files as arguments), ..., and, most importantly, $1, $2, ..., $NF contain the different fields of the record. By default, the field separator is any number of consecutive spaces and tabs (the field separator, FS, can be redefined). It is the case here.
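A tiny illustration of those automatic variables (the sample input is arbitrary):

```shell
# NR is the record number, NF the number of fields, $NF the last field.
printf 'a b\nc d e\n' | awk '{ print "record " NR " has " NF " fields; last is " $NF }'
# record 1 has 2 fields; last is b
# record 2 has 3 fields; last is e
```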

The i-th iteration of the first loop tests, with the keyword "in", whether the i-th field, $i, is a key of the array a. The conditional statement follows the classical C syntax. If the test passes, the next line is executed (a block could be used for several instructions, as in C): it prints the concatenation (a space between constants/variables concatenates them) of the name of the file that is currently processed (FILENAME, which is automatically defined), the previous record number in that file (FNR - 1), a hyphen ("-"), the current record number in that file (FNR), the string ": ", the i-th field ($i) and the string " repeated". Because the print instruction is used (printf exists as well), the output record separator is printed at the end. By default, it is the newline character (ORS can be redefined). It is the case here. Also, although that "print" uses no comma, it is worth noticing that a comma would be replaced with the output field separator, which is a space by default (OFS can be redefined).

Once out of the first loop, the array a is deleted, using the keyword "delete". In this way, in the previous loop, the array a contains, as keys, the fields of the previous record only (finally answering your question!). The second loop (re)defines that array: it simply accesses the values at the keys $1, $2, ..., $NF. Unless there are repetitions within the same record, none of those keys exists: they are created when they are accessed and associated with "". You may wonder why I would use an associative array if there are no values to associate with the keys. The answer is: because AWK only has that structure.

You may start learning more about AWK with the slides I use to teach (the archive with the data for the exercises is still online; it contains answers too): https://dcc.ufmg.br/~lcerf/slides/mda6.pdf

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Rereading the original post, I notice the task deals with "counting".

Assuming no repeated field on the same line, counting the total number of repetitions (3, for the three lines you gave: 2, 12 and 102 are each repeated once) shows the extremely useful END condition, satisfied once the whole input has been read:
{ for (i = 1; i <= NF; ++i) if ($i in a) ++count; delete a; for (i = 1; i <= NF; ++i) a[$i] }
END { print count }
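Run on the sample lines from the original post, that program prints 3:

```shell
# Total number of repetitions between consecutive lines: 2, 12 and 102.
printf '12 13 59 2\n102 12 32 2\n99 13 102 19\n' |
awk '{ for (i = 1; i <= NF; ++i) if ($i in a) ++count; delete a; for (i = 1; i <= NF; ++i) a[$i] }
END { print count }'
# 3
```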

Counting how many pairs of subsequent lines have at least one repetition (the result is 2 for the three lines you gave: 1-2 and 2-3) additionally illustrates the use of the break instruction, which works as in C:
{ for (i = 1; i <= NF; ++i) if ($i in a) { ++count; break }; delete a; for (i = 1; i <= NF; ++i) a[$i] }
END { print count }
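On the same sample lines, this variant prints 2 (the pairs 1-2 and 2-3 each have at least one repetition):

```shell
# Number of pairs of subsequent lines with at least one repetition.
printf '12 13 59 2\n102 12 32 2\n99 13 102 19\n' |
awk '{ for (i = 1; i <= NF; ++i) if ($i in a) { ++count; break }; delete a; for (i = 1; i <= NF; ++i) a[$i] }
END { print count }'
# 2
```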

Counting the number of repetitions per value (still assuming no repetition on the same line) really uses the associative nature of AWK's arrays:
{ for (i = 1; i <= NF; ++i) if ($i in a) ++count[$i]; delete a; for (i = 1; i <= NF; ++i) a[$i] }
END { for (i in count) print i, count[i] }

Notice that the keyword "in" in a for loop allows enumerating the keys of the array.
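On the sample lines, each repeated value (2, 12 and 102) is counted once. The order in which "for (i in count)" enumerates the keys is unspecified, hence the sort below:

```shell
# Repetition count per value, sorted for a deterministic output order.
printf '12 13 59 2\n102 12 32 2\n99 13 102 19\n' |
awk '{ for (i = 1; i <= NF; ++i) if ($i in a) ++count[$i]; delete a; for (i = 1; i <= NF; ++i) a[$i]
END { for (i in count) print i, count[i] }' | sort
```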

GNUser
Offline
Joined: 07/17/2013

First of all, thank you so much for all the detailed explanation about AWK. I cannot say I understood everything (I am used to writing in bash, and from some scripts I have shared here in the forum you will notice I am not as good at making things simple and straightforward as you are), but your instructions were certainly clear and well written. I hope others will benefit from reading them too.

Second, the first example is indeed what I needed (the result being 3). I ran that piece of code and was able to get the results I needed. The script is also fast, which I was worried about, given that my files are very large. Works great!

AWK seems very useful and I hope I will have the time to put into learning it; right now it wasn't possible, so I am deeply grateful for your help. Thank you once again!

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

The script is also fast, which I was worried about given my files were very large.

Notice that the shell is slow. That is a reason to replace loops in shell scripts (like "while read line; do ... ; done < file") with calls to commands that process the whole input line by line. Essentially all text-processing commands do that, AWK programs included. If possible, those commands had better communicate through pipes, so that they run in parallel.
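As a small illustration of that advice (the file name "nums" is just an example), both commands below sum the numbers from 1 to 1000, but the AWK program does it in a single process, whereas the shell loop re-evaluates code for every line:

```shell
seq 1 1000 > nums
# Fast: one awk process reads the whole file.
awk '{ s += $1 } END { print s }' nums
# 500500
# Slow: the shell loop iterates once per line.
s=0; while read -r n; do s=$((s + n)); done < nums; echo "$s"
# 500500
```

On a few thousand lines the difference is negligible; on very large files it becomes dramatic.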

Notice also that Perl, which is often used to do text-processing, is not only harder to learn than AWK but slower too.

GNUser
Offline
Joined: 07/17/2013

Btw, I was convinced AWK was, like grep and sed, a tool to process text files. But you talk about it and show it as being a language of its own. Are the two the same thing?

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Well, a (formal) language is defined as a set of words. Using that definition, all valid regular expressions (as given to grep) are a language, all valid AWK programs are a language, etc.

AWK is a simple programming language. Contrary to grep (but like sed, which is surprising), it is Turing-complete: it can compute anything that can theoretically be computed. But it is specialized in text processing: it processes its input record by record (line by line, by default), automatically splits each record into fields, etc. Also, it shares with grep and sed the heavy use of regular expressions (the variables RS and FS I wrote about can be regular expressions, conditions can be regular expressions, etc.).
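For instance, an AWK condition can be a regular expression, as in sed or grep: the action then only runs on the records that match (the sample input is arbitrary):

```shell
# Print the second field of every record matching /^alpha/.
printf 'alpha 1\nbeta 2\nalphabet 3\n' | awk '/^alpha/ { print $2 }'
# 1
# 3
```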

loldier
Offline
Joined: 02/17/2016

Here's a book on Sed & Awk.

https://doc.lagout.org/operating%20system%20/linux/Sed%20%26%20Awk.pdf

https://stackoverflow.com/questions/7727640/what-are-the-differences-among-grep-awk-sed

"Now awk and sed are completely different than grep. awk and sed are text processors. Not only do they have the ability to find what you are looking for in text, they have the ability to remove, add and modify the text as well (and much more).

awk is mostly used for data extraction and reporting. sed is a stream editor."

loldier
Offline
Joined: 02/17/2016

Brian Kernighan explains where 'grep' came from. (Computerphile, July 6, 2018).

https://yewtu.be/watch?v=NTfOnGZUZDk