Separating hostnames having multiple IPv4 addresses from a long two-column list
Back in February I was rescued by Magic Banana from a predicament wherein I was failing to remove duplicate lines from a two-column list of hostnames and their IPv4 addresses:
https://trisquel.info/en/forum/sort-and-uniq-fail-remove-all-duplicates-list-hostnames-and-their-ipv4-addresses
Now I'm faced with a similar, longer (~80,000 rows) list of all the hostnames in a collection of address blocks, from which I'm trying to _remove_ all the lines whose hostnames are _not_ repeated. That should yield a list of hostnames that are repeated within the same address block or even across different address blocks.
This would be easy if I could flag repeated strings in the first column and then sort them (along with their associated IPv4 addresses in the second column, which contains no duplicates) so that I can make a two-column list of just the hostnames that have multiple IPv4 addresses.
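In principle, that flag-and-sort idea can also be sketched on the command line with a two-pass awk (list.txt being a hypothetical name for the two-column file, headers removed):

$ awk 'NR == FNR { count[$1]++; next } count[$1] > 1' list.txt list.txt

The first pass over list.txt counts the hostnames; the second pass prints only the rows whose hostname was counted more than once, keeping the rows intact and in their original order, with no sorting required.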
OpenOffice Calc seems to have a way of flagging duplicate values in adjacent cells in a column, but mine freezes so often that the experimentation would take too long.
Thanks,
George Langford
That shouldn't be the case - Calc should easily be able to handle 80,000+ rows at a time. Which version of Calc are you using? The latest, 6.2.2, is very stable. Also, how are you installing it? I prefer to download from the LibreOffice website and install from there, as many distributions only supply outdated or poorly configured versions in their repositories.
Also, what do you mean by freezing? I've dealt with 650,000 rows of data in Calc at a time, and if I apply a formula to the entire spreadsheet Calc will appear to "freeze" for a few minutes, but it is just taking the time it needs to calculate the changes. Do you mean "freezing" in terms of it completely locking up? Or is it simply unresponsive for a while as it calculates the changes?
Let us know, I'll bet we can help find a solution.
My OpenOffice Calc is actually the LibreOffice Calc that comes with Trisquel (flidas). The link that I'm attempting to follow is here:
https://stackoverflow.com/questions/35822359/highlight-duplicate-value-in-openoffice-calc
At the bottom of that page is the suggested solution:
"If you want to format any value in a cell that is a duplicate of the one immediately above it, select the entire column and use the following formula in Conditional Formatting:
"(INDIRECT(CELL("address"))=OFFSET(INDIRECT(CELL("address"));-1;0))"
But when I select Conditional Formatting, the drop-down menu is uncooperative, and I've no idea how to proceed ... anything I do just freezes the program.
I'm presuming that after I've color-coded the duplicates, I can then separate the color-coded ones from the rest.
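An alternative that avoids conditional formatting altogether, assuming the hostnames sit in column A: put a formula such as

=IF(COUNTIF(A$1:A$80000;A1)>1;1;0)

in B1 and fill it down; the duplicated hostnames are the rows flagged with 1. (The range bound is a guess; adjust it to the real row count.)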
What's your Calc version number?
If you attach a small excerpt of the text input(s) (save the spreadsheet in CSV, if that is all you have), covering all "cases", and the expected output, I may better understand what you want to achieve and write you a command line. 'uniq' has an option --unique (-u) to only display unique lines. But it looks like you want 'comm', which compares two (sorted) files and shows the lines appearing in both.
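A minimal sketch of the two tools just named (input.txt, a.txt and b.txt are hypothetical file names):

$ sort input.txt | uniq -u
$ comm -12 a.txt b.txt

The first command prints only the lines that occur exactly once; the second prints only the lines common to both sorted files (options -1 and -2 suppress the lines unique to the first and to the second file).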
My LibreOffice Calc is version 5.
Attached is a selection from the file from which I'd like to select the duplicate hostnames and their IPv4 addresses.
The first two columns are the product of Nmap scans on about 350 address blocks. I've added a column which contains the results of nslookup on the addresses in the second column. "92.242.140.21" resolves to Barefruit error handling, a catchall; one occurrence of 92.242.140.21 comes from a single hostname which is probably duplicated elsewhere on the Internet. When both duplicated hostnames are on the same /24 address block, it appears that nslookup returns only the first one's address, masking the second. The nslookup result for alolika.su surprised me, as it is the correct IPv4 address for one instance of that hostname, leaving the second instance completely hidden.
I've already removed all truly duplicated lines from the spreadsheet. The second column therefore should contain no duplicated IPv4 addresses. Any search for unique hostnames must therefore be applied only to the first column.
Attachment | Size |
---|---|
selection.ods | 16.73 KB |
Can you download the 6.2.2 version from the LibreOffice website? Unpack it, then go into the DEBS folder and run "sudo dpkg -i *.deb". That should hopefully install a copy beside your version 5, or at least that's how it's worked for me. The latest versions have been much more stable and functional.
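Roughly these steps; the archive name is a guess at what the LibreOffice website currently serves, so adjust the version and architecture to what you actually download:

$ tar xzf LibreOffice_6.2.2_Linux_x86-64_deb.tar.gz   # hypothetical archive name
$ cd LibreOffice_6.2.2*/DEBS
$ sudo dpkg -i *.deb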
Here's the text version of the selection (.CSV and HTML aren't accepted here...)
Attachment | Size |
---|---|
selection.txt | 540 bytes |
I am not entirely sure I understood the task (showing the expected output would help). I basically understood "only output the lines if the value in its first column is different in the line before or after". Is that it? If so, here is a horrible solution (as usual with sed):
$ tr -s '\t' ' ' < selection.txt | tail -n +3 | sed 's/\([^ ]*\) \([^ ]* \)\(.*\)/\2\3 \1/' | uniq -uf 2
188.120.236.138 188.120.236.138 alneo.ru
91.240.86.50 NXDOMAIN alnitcorp.fvds.ru
82.146.42.228 92.242.140.21 alod.fvds.ru
62.109.12.147 62.109.12.147 aloe-slings.com
82.146.60.203 82.146.60.203 alohomora.ru
62.109.17.21 185.9.147.4 aloshop.ru
82.146.34.229 92.242.140.21 alosvlad.fvds.ru
212.109.193.28 212.109.193.28 alovera.ru
I assume that:
- there is no space in the input file (tr -s '\t' ' ' changes the tabulations to spaces because tabulations cannot be reliably used in 'sed'; tell me what character never occurs in the input if spaces can);
- the input file is sorted w.r.t. the first column (if not, add " | sort" after "tail -n +3", which only aims to remove the first two lines);
- you do not mind the first column becoming the last column in the output (if you do, tell me: a sed program similar to the one already in the command line can move it back);
- you do not mind the space being the delimiter in the output (if you want a tabulation, pipe, at the end of the command, to "tr ' ' '\t'"); the last two modifications are sketched right below.
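A hedged sketch combining those two modifications (untested beyond the excerpt):

$ tr -s '\t' ' ' < selection.txt | tail -n +3 | sed 's/\([^ ]*\) \([^ ]* \)\(.*\)/\2\3 \1/' | uniq -uf 2 | sed 's/\(.*\) \([^ ]*\)$/\2 \1/' | tr ' ' '\t'

The second sed moves the last field (the hostname) back to the front, and the final tr turns the spaces back into tabulations.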
Making the same assumptions, here is a far more readable command line, thanks to awk:
$ awk 'NR > 2 { print $2, $3, $1 }' selection.txt | uniq -uf 2
Out of curiosity, I would like to know the time each of the two solutions takes on the large input: just prefix the commands with "time ". I would bet on awk. If it is indeed faster and you want some of the modifications listed above, tell me.
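For instance (the redirections to /dev/null keep terminal printing from dominating the measurement):

$ time (tr -s '\t' ' ' < selection.txt | tail -n +3 | sed 's/\([^ ]*\) \([^ ]* \)\(.*\)/\2\3 \1/' | uniq -uf 2 > /dev/null)
$ time (awk 'NR > 2 { print $2, $3, $1 }' selection.txt | uniq -uf 2 > /dev/null)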
Magic Banana is on the right track, but we're not getting the hoped-for output.
Magic Banana's understanding of my goal is:
> I am not entirely sure I understood the task (showing the expected output would help). I basically
> understood "only output the lines if the value in its first column is different in the line before or after"
Should have been "only output the lines if the value in its first column is [the same] in the line before or after"
Here's my version of what that understanding should be:
Only output the rows' data if the hostnames in the first column are the same.
The minimum output is two rows of data; none of a group of rows should be dropped.
Here's the original data, without the third column (which was only presented to indicate the magnitude of the issue: duplicated hostnames are used to obfuscate their IPv4 addresses):
hostname IPv4 address
,,
alneo.ru 188.120.236.138
alnikino.ru 212.109.192.55
alnikino.ru 212.109.197.49
alnitcorp.fvds.ru 91.240.86.50
alod.fvds.ru 82.146.42.228
aloe-slings.com 62.109.12.147
alohomora.ru 82.146.60.203
alolika.su 82.202.160.112
alolika.su 92.63.110.215
aloshop.ru 62.109.17.21
alosvlad.fvds.ru 82.146.34.229
alovera.ru 212.109.193.28
The hoped-for output should be:
duplicated hostnames IPv4 address
alnikino.ru 212.109.192.55
alnikino.ru 212.109.197.49
alolika.su 82.202.160.112
alolika.su 92.63.110.215
Yes, the first column was already sorted; and the second column is free from duplicate IPv4 addresses (which originally got in there because I accidentally scanned some of the same IPv4 addresses more than once, covering with /24 spans some CIDR ranges that had already been covered with larger CIDR spans).
Attached is a longer selection which includes hostnames that begin with numerals, followed by the "redacted" version, containing only duplicated hostnames and their IPv4 addresses (not yet rid of their enclosing parentheses, artifacts of the Nmap output files). My "redaction" is manual, of course.
There's no need to retain the headers, of course; we are unlikely to get the two columns confused. On the other hand, it is essential that the sorting process correctly maintain the pairs from each row of the table.
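Since 'sort' compares whole lines by default, the hostname and its address cannot be separated; a sketch on the headerless data:

$ tail -n +3 selection.txt | sort

Each output line is an intact row of the table, merely reordered.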
I started to process the ~80,000-row ODS file and got through marking about 11,000 rows with 0's and 1's in about half a day's effort. The race between geek & grunt has begun !
Attachment | Size |
---|---|
selection-longer.txt | 16.52 KB |
selection-longer-dups.txt | 2.85 KB |
OK. So that is basically the opposite of what I understood in the first place: the -D option of 'uniq' must be used instead of its -u option. That gives:
$ awk 'NR > 2 { print $2, $1 }' selection.txt | uniq -Df 1 | awk '{ print $2 "\t" $1 }'
alnikino.ru 212.109.192.55
alnikino.ru 212.109.197.49
alolika.su 82.202.160.112
alolika.su 92.63.110.215
Using the same command on selection-longer.txt and redirecting the output to a file named out, I can then compare with your manual work (after removing the headers and "normalizing" the number of tabs to one). I find three differences:
$ tail -n +3 selection-longer-dups.txt | tr -s '\t' | diff - out
11a12
> a214947.fvds.ru (149.154.69.180)
76a78,79
> a920222.fvds.ru (188.120.251.198)
> a920222.fvds.ru (188.120.251.50)
So, well, either I still have not understood the task or you forgot those three lines. In any case, you should never mechanically process long files by hand.
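A quick way to check such an output mechanically instead (assuming the hostname is the first field of the file out):

$ awk '{ print $1 }' out | sort | uniq -c | awk '$1 < 2'

Empty output means every hostname in out does appear at least twice, as expected.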
Geek wins !
I applied Magic Banana's script to the ~2.5MB master file (sans parentheses) and obtained a ~600KB file of duplicates in ~0.2 seconds. Thank you !
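In case it helps others, the enclosing parentheses can be stripped mechanically too; a one-liner sketch with hypothetical file names:

$ tr -d '()' < master-raw.txt > master.txt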
These hostnames came from the combined address ranges of the autonomous systems 34300, 29182, 59729, 15626, 56630, 44493, and 48666. I skipped a few CIDR blocks because those were addressed in previous studies that I did on the derivatives of example.com. The resulting spreadsheet has ~19,000 rows, as many as all the permutations of example.com, and they are just as difficult to resolve with nslookup or dig. It's roughly a fifth of the number of rows in the master file.
Bear in mind that this is a moving target and would have to be repeated on almost a daily basis to maintain an accurate list of unresolvable hostnames.
There is apparently no oversight of this issue. Registrars in the Western World will not let us register a domain name exactly like another one, and they even deprecate typographic lookalikes. Think of the issues of universal inoculations, seat belts, inspection of pressure vessels, and driver licenses ...
I am indebted to Magic Banana for his contribution to this effort. He's right about manual processing; I had even proofread the longer list.
George Langford
> Geek wins !
That is precisely what my fifth slide on https://dcc.ufmg.br/~lcerf/slides/mda3.pdf says. :-)
The half day spent mechanically processing the large file was more than enough to read those slides, practice with the related exercises (the data are still online, together with the solutions) and, in the end (on slide 20), know about 'uniq', among many other simple text-processing commands. The subsequent slides on https://dcc.ufmg.br/~lcerf/pt/mda.html#slides, up to "A few words about efficiency", are about 'grep' (and regular expressions), 'sed' (almost only the substitution command, actually), 'awk' (far more in-depth: awk is great!) and, well, "A few words about efficiency".
I assume the student knows the fundamentals of the interactive use of the shell. If that is not your case, then read, before any of the above slides, http://en.flossmanuals.net/command-line/ :
- The whole "Basics" section";
- The subsections "Basic commands", "Cut down on typing" and "Redirection" of the "Commands" section;
- The subsection "Piping" of the" "Advanced-ish" section.
Magic Banana (the teacher !) is right to admonish me for my ad hoc approach to script writing ...
He makes a compelling rationale for supporting higher education as well as continuing education. The substitutes can be exhausting and time consuming. The tutorial is superb, but no lasting knowledge is gained without doing one's homework. I learned that in my High School math classes.

At this writing I'm halfway through the task of converting the ~20,000-row spreadsheet of duplicate hostnames and their IPv4 addresses into an ~80-cell HTML table for the Internet, bearing in mind that my HTML editor doesn't like more than a thousand rows in its tables. That has to be readable by folks tracking down those unresolvable hostnames appearing in their raw access files.
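A possible shortcut for the thousand-row limit, sketched with a hypothetical file name: let 'split' cut the duplicates file into 1,000-line pieces first,

$ split -l 1000 -d dups.txt part_

which yields part_00, part_01, ... ready to be pasted into the editor one table at a time.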
George Langford
HTML is plain text... that awk can write for you (no need for an editor):
$ awk 'NR > 2 { print $2, $1 }' selection.txt | uniq -Df 1 | awk 'BEGIN { print "<table>\n<thead><tr><th>hostname</th><th>IPv4 address</th></tr></thead>\n<tbody>" } { print "<tr><td>" $2 "</td><td>" $1 "</td></tr>" } END { print "</tbody>\n</table>" }'
<table>
<thead><tr><th>hostname</th><th>IPv4 address</th></tr></thead>
<tbody>
<tr><td>alnikino.ru</td><td>212.109.192.55</td></tr>
<tr><td>alnikino.ru</td><td>212.109.197.49</td></tr>
<tr><td>alolika.su</td><td>82.202.160.112</td></tr>
<tr><td>alolika.su</td><td>92.63.110.215</td></tr>
</tbody>
</table>
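Not Magic Banana's command, but the same pipeline could even honor the thousand-row limit by closing and reopening a table every 1,000 data rows; a hedged extension of the above:

$ awk 'NR > 2 { print $2, $1 }' selection.txt | uniq -Df 1 | awk 'NR % 1000 == 1 { if (NR > 1) print "</tbody>\n</table>"; print "<table>\n<thead><tr><th>hostname</th><th>IPv4 address</th></tr></thead>\n<tbody>" } { print "<tr><td>" $2 "</td><td>" $1 "</td></tr>" } END { if (NR > 0) print "</tbody>\n</table>" }'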