How can awk serve as a substitute for paste

12 replies [Last post]
amenex
Offline
Joined: 01/03/2015

Some time ago I noticed that an awk script piped into a long-running nmap script caused the overall
script to save the output file relatively frequently, providing some assurance that the script
was making progress and leaving a record of its accomplishments. That occasionally allowed me to restart
the script (as in the case of a loss of mains power) and pick up where it had stopped, by truncating
its source file and thereby avoiding having to rerun the entire source file from the beginning.

Here's such a script:
awk '{print $1}' 'B-May2020-L.768-secondary.txt' | sudo nmap -6 -Pn -sn -T4 --max-retries 16 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' | uniq -u > May2020-L.768-secondary.oGnMap.txt

There is no other need for
awk '{print $1}' 'B-May2020-L.768-secondary.txt'
because it does not actually change the file, but its use has the extremely helpful and
protective effect of causing the overall script to make frequent saves to the HDD.

Now I am using a paste script immediately before the nmap script, but this similar use of the pipe
is not causing any periodic saves to take place. With fifteen such scripts running at once,
a lot of data can disappear if the mains power is lost in a windstorm.

The source file in each example is simply a list of IPv6 or IPv4 addresses made up in parts of real
and randomly generated blocks with nothing that requires rearrangement or sorting.

A crude way of accomplishing this would be to insert a similar redundant awk script between the
paste and nmap sections of the main script (sketched below), but is there a more geek-like way of forcing the
script to make periodic saves? That method need not be especially efficient, as my network's
sending/receiving speeds appear to be the rate-limiting factor, presently about 30 kB/second.
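
For illustration (the file names here are placeholders, since the paste stage itself is not shown), that crude workaround would look something like:
paste list-A.txt list-B.txt | awk '{print}' | sudo nmap -6 -Pn -sn -T4 --max-retries 16 -iL - -oG - | grep "Host:" | awk '{print $2,$3}' | sed 's/()/(No_DNS)/g' | tr -d '()' | uniq -u > output.txt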

George Langford

Magic Banana

Offline
Joined: 07/24/2010

'awk' does not aim to synchronize cached writes to persistent storage. 'sync' does. To call it every 5 seconds:
while sleep 5
do
    sync
done

But you do not want a hard-coded value. A default is good, though. Also, for better performance, you should be able to give the list of files to sync. That gives a script such as:
#!/bin/sh
period=5
if [ -n "$1" ]
then
    period=$1
    shift
fi
if [ -n "$1" ]
then
    option=-d
fi
while sleep $period
do
    sync $option "$@"
done

Type Ctrl+C to terminate.

EDIT: arguably simpler script.
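
For illustration, assuming the script is saved under a hypothetical name such as sync-loop.sh in the working directory:
$ chmod +x sync-loop.sh
$ ./sync-loop.sh                          # sync everything every 5 seconds (the default)
$ ./sync-loop.sh 5m out-1.txt out-2.txt   # sync only these two (hypothetical) files, every five minutes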

amenex
Offline
Joined: 01/03/2015

While all this coding was going on, I discovered a flaw in my logic, whereby my method of
separating not-looked-up IPv4 addresses from the Recent Visitor data was extracting some
lookalike IPv4 data from the hostnames, including impossible addresses. I used comm to
select only those addresses that were in both the calculated list and the Recent Visitor
data, reducing the list of extracted two-octet prefixes from over 50,000 to around 27,000.

Naturally, now that Magic Banana tells us that awk won't do what I expect, sure enough, it
stopped causing those user-friendly intermediate writes.

The man sync and info sync pages are not helping me; as yet I haven't the slightest clue
where sync stores the data in persistent storage; if I knew, I could watch it develop.

There are now 38 files to be created with long-running nmap scripts; if I could start one
sync script beforehand and not stop it until the last of those 38 files finishes, and
still be able to watch the occasional disk writes so I could evaluate the progress of the scripts,
that would be ideal. If I could list the 38 output files in advance, that would save a lot
of scrambling while the nmap searches are ongoing, about ten at a time.

The good news is that when nmap started complaining about 942.103.*.* addresses, I used
grep to search all the Recent Visitor files and found (in a few seconds) three instances of:
as45942.103.28.157.43.lucknow.sikkanet.com

dig -x reveals that the actual IP address is 103.28.157.43, not 942.103.28.157, neither of
which ought to contribute to the two-octet prefix list (but 103.28 got in there anyway!)
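
For reference, that reverse lookup can be reproduced with:
dig -x 103.28.157.43 +short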

If I create a subfolder called Sync under the May2020/IPv4 folder and store there a text file
listing the expected output filenames to which sync is to be applied, can sync be made to
activate and start saving cached data to the appropriate filename as soon as the nmap script
is started?

George Langford

Magic Banana

Offline
Joined: 07/24/2010

I haven't the slightest clue where sync stores the data in persistent storage; if I knew, I could watch it develop.

'sync' just tells the kernel to write the buffered data to disk. To the files where they would have eventually been written: watch your output files.
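
For instance, a periodically refreshed directory listing shows the sizes and time stamps of the output files changing as data is written (the path here is only an example):
watch -n 60 ls -l SyncIPv4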

amenex
Offline
Joined: 01/03/2015

Here's the result from my first thoughtful guess:
while sleep 300; do sync -f /home/george/Desktop/May2020/nMapScans/ScoreCards-IPv4/SyncIPv4/; done | time sudo nmap -Pn -sn -T4 --max-retries 8 -iL Addresses.IPv4.May2020.37.txt -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' | uniq -c | awk '{print $3"\t"$2}' '-' | sort -k 1 > SyncIPv4/Random.IPv4.May2020.37.nMap.primary.oG.txt

Nothing appeared until the script finished, whereupon time added the following:
195.00user 99.48system 4:15:16elapsed 1%CPU
followed by sync's responses:
(0avgtext+0avgdata 13032maxresident)k 0inputs+0outputs (0major+2486minor)pagefaults 0swaps
without any actual conclusion of the script; the prompt hasn't reappeared yet, even though
there is no network activity. I used Ctrl+C to regain the prompt.

Where was sync saving the cached output during the 4-1/4 hours the script was running?

There were many singular PTR records, a modest number of No_DNS responses, and few, if any, multi-address PTR's.
1,684,804 IPv4 addresses were examined, with 1,684,780 responses for 649 CIDR/16 blocks.
There are 77.7MB of data received for 22.9MB of input addresses.

Magic Banana

Offline
Joined: 07/24/2010

Save the script I wrote in a file, make it executable and execute it in a terminal. Without any argument, it asks the kernel to write the buffered data to disk every five seconds (you can change this default by modifying the number 5 at the beginning of the script). With one argument, you can specify a different period. It must be a valid argument to 'sleep': 2 or 2s for two seconds, 5m for five minutes, etc. Any number of files can be given as additional arguments: if so, only those files will be sync'ed.

Your "thoughtful guess" makes little sense: no data flows in the first pipe, all your '-' are useless (most commands process the standard input by default; also, the quotes are here useless), the sed substitution writes parentheses that tr removed immediately after, the counts that uniq -c outputs are removed immediately after and, as I have already told you many many times (the last time was one week ago: https://trisquel.info/forum/another-uniq-u-feature-emerges#comment-150310), sort -k 1 is the same as sort alone (you probably want sort -k 1,1).

followed by sync's responses

sync does not write anything (except errors). time outputs what you show.

without any actual conclusion of the script; the prompt hasn't reappeared yet, even though there is no network activity. I used Ctrl+C to regain the prompt.

The while loop is infinite. That is why I told you "Type Ctrl+C to terminate" in https://trisquel.info/forum/how-can-awk-substitute-paste#comment-150532

Where was sync saving the cached output during the 4-1/4 hours the script was running ?

Please read https://trisquel.info/forum/how-can-awk-substitute-paste#comment-150543 again. I suspect you do not understand the concept of a data buffer. Without such a buffer, whenever something is to be written to a file, it would be written directly to the disk. However, writing small amounts of data to the disk many times is slow. It is more efficient to accumulate the data in RAM and make one big write when a lot has accumulated or when necessary. sync tells the kernel (which manages the buffer) "it is necessary".
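
As a minimal illustration of that last point (the file name here is hypothetical):
printf 'partial result\n' >> some-output.txt   # the data first lands in the kernel's buffer, not necessarily on disk
sync some-output.txt   # asks the kernel to flush that file's buffered data to disk now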

amenex
Offline
Joined: 01/03/2015

Regarding the non-scholarly:
| sed 's/()/(No_DNS)/g' | tr -d '()'
Nmap writes parentheses with nothing between them when it receives no PTR data for an IP address;
when a PTR has been received, it writes that response in parentheses after the IP address.
My expression puts the IP addresses all in one column and the PTR responses (including No_DNS) in
a different column.

Yes, it is redundant; I really should have written:
| sed 's/()/No_DNS/g' | tr -d '()'
There are still two expressions, and only two fewer characters among them.

Those sort -k 1's are indeed redundant, but writing the extra characters reminds me to check where
the pertinent column actually is, and I don't perceive any harm in the result.

Data buffers: I actually do understand, because I watch the System Monitor while the nmap scans are
ongoing, and there's a dip in the network traffic about every 100 seconds, presumably while those
buffers are being updated. The RAM usage climbs steadily as a number of simultaneously running
scripts gather their multi-megabyte data. Also, losing network connectivity does not cause errors
to appear in the accumulated data.

"the counts that uniq -c outputs are removed immediately after"
I have had trouble with sort -u and uniq -u, whereas uniq -c works every time; and discarding
the counts doesn't tax the HDD because that write isn't otherwise saved.

"Save the script I wrote in a file ... "
Alas, the script puzzles me, partly because I can find no man page for "fi" but mainly because it is
composed of shorthand expressions which I don't comprehend. As I rarely mix nmap script executions
with other tasks, mainly because everything else is dramatically slowed down when there are a dozen
scripts running, can I be assured that this sync script is a generalized one that I don't have to
modify at all? There won't be very frequent updates (think five minutes, i.e. 5m), because that would
make filename updates every 25 seconds while a dozen scripts are running; greater frequency would have
the list of files changing too frequently.

After changing Magic Banana's sync program to increase the sleep interval to five minutes (period=5m)
I saved it as filename.bin, made it executable with chmod +x filename.bin, and tried to start it with
sudo .filename.bin ... alas, Terminal's response was "command not found." I couldn't find that file
either ... ".filename.bin" is lost. I actually called it SyncIPv4.bin.

George Langford

Magic Banana

Offline
Joined: 07/24/2010

I have had trouble with sort -u and uniq -u, whereas uniq -c works every time

All those commands work all the time.

Running 'uniq -c' and deleting the counts immediately after is the same as using 'uniq' without option -c.
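
A two-line illustration of that equivalence:
printf 'a\na\nb\n' | uniq -c | awk '{print $2}'   # counts are computed, then immediately thrown away
printf 'a\na\nb\n' | uniq   # same output, without the wasted work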

I can find no man page for "fi"

It is "end if" in shell language. 'man sh' says:

The syntax of the if command is
if list
then list
[ elif list
then list ] ...
[ else list ]
fi

it is composed with shorthand expressions which I don't comprehend.

It is quite basic shell scripting: [ -n tests whether the subsequent string is nonempty; "$1" is the first argument given to the script; period= defines the variable named period; $period is the value of the variable period; shift shifts the arguments of the script (it sets the value of $1 to the value of $2, the value of $2 to the value of $3, etc.); "$@" is all these arguments.
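
For reference, here is the same script again with those constructs annotated (comments only; the logic is unchanged):
#!/bin/sh
period=5                # default period, used if no argument is given
if [ -n "$1" ]          # is the first argument a nonempty string?
then
    period=$1           # if so, it is the period
    shift               # drop it: $2 becomes $1, $3 becomes $2, etc.
fi
if [ -n "$1" ]          # any argument left? then the remaining arguments are files to sync
then
    option=-d           # -d: sync only the files' data, for better performance
fi
while sleep $period     # wait one period between flushes
do
    sync $option "$@"   # "$@" expands to all the remaining arguments (the files, if any)
done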

People usually comprehend a short script made of short instructions better than a long command line full of useless options, with calls whose effects are reverted later on the same command line, etc.

can I be assured that this sync script is a generalized one that I don't have to modify at all ?

It does what I described in detail in the first paragraph of my previous post. Is it what you want?

that would make filename updates every 25 seconds

The file contents are sync'ed, not the file names.

greater frequency would have the list of files changing too frequently.

The smaller the period, the less data you may lose in case of crash, but the more writes to the disk and the worse the performance.

sudo .filename.bin ... alas, Terminal's response was "command not found."

There is no executable named .filename.bin in any directory listed in $PATH. If there is an executable named filename.bin in the working directory and if it accepts 5m as a single argument, you may execute it in this way:
$ ./filename.bin 5m

There is here no need for administrator privileges, as granted by sudo.
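
In other words (still assuming the executable sits in the working directory):
$ filename.bin 5m     # searched for only in the directories listed in $PATH: "command not found"
$ ./filename.bin 5m   # explicit path to the working directory: it runs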

I actually called it SyncIPv4.bin

You can call the script whatever you want. Notice anyway that it does not deal with IPv4 and that it is not a binary!

amenex
Offline
Joined: 01/03/2015

Magic Banana wrote:

The file contents are sync'ed, not the file names.

Should I expect the time stamp to change?

amenex wrote:
sudo .filename.bin ... alas, Terminal's response was "command not found."

To which Magic Banana responded:
There is no executable named .filename.bin in any directory listed in $PATH.
That reflects the truth: .filename.bin doesn't actually reside anywhere; it was invoked from the
command line in the working directory (where the files to be sync'ed are located), but it
resides only in live memory (RAM).

And then Magic Banana continued:
If there is an executable named filename.bin in the working directory and if it accepts 5m
as a single argument,

Magic Banana also said that [one] may execute it in this way:
$ ./filename.bin 5m
Adding, there is here no need for administrator privileges, as granted by sudo.

Amenex exclaimed:
Progress ! Terminal responded as it would when a live script is in progress ...

The output of my long & tortuous nmap script is meant to go into the directory from which
filename.bin was executed, as recommended by Magic Banana. It appears with a time stamp corresponding
to its starting time and has had zero bytes from the get-go (as in my usual experience), but several
five-minute sleep periods have gone by and there's no sign of any changes in the output file.

George Langford

Magic Banana

Offline
Joined: 07/24/2010

I have just realized that, contrary to the command line in the original post, the newer one ends with 'sort'. As a consequence, everything must run to completion before the output is written. Indeed the first line in the output can possibly be the last processed line. You probably want to modify the end of the newer command line to save what sort receives:
(...) | tee SyncIPv4/Random.IPv4.May2020.37.nMap.primary.oG_unsorted.txt | sort > SyncIPv4/Random.IPv4.May2020.37.nMap.primary.oG.txt
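
The unsorted copy is written progressively, so its growth can be followed while the scan runs, for instance with:
tail -f SyncIPv4/Random.IPv4.May2020.37.nMap.primary.oG_unsorted.txt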

amenex
Offline
Joined: 01/03/2015

Magic Banana astutely proposed:
(...) | tee SyncIPv4/Random.IPv4.May2020.37.nMap.primary.oG_unsorted.txt | sort > SyncIPv4/Random.IPv4.May2020.37.nMap.primary.oG.txt

Which works right away and without modification; and in concert with the preceding awk statement that I had not yet removed:
awk '{print $0}' 'Addresses.IPv4.May2020.35.txt' |

The ... unsorted.txt file and the ....nMap.primary.oG.txt file both appeared: first ....nMap.primary.oG.txt with
zero bytes, followed (in a few seconds, not five minutes as with "./SyncIPv4.bin") by the ... unsorted.txt file,
which began filling up frequently (as the main file used to do with that awk script, for reasons unbeknownst to either of us).

Thank you !

George Langford

Magic Banana

Offline
Joined: 07/24/2010

the main file used to do with that awk script for reasons unbeknownst to either of us

There is no way the output of sort can be written before its entire input has been generated. As I wrote in my last post: "the first line in the output can possibly be the last processed line". Excerpt from 'info sort':
‘-o OUTPUT-FILE’
‘--output=OUTPUT-FILE’
Write output to OUTPUT-FILE instead of standard output. Normally,
‘sort’ reads all input before opening OUTPUT-FILE, so you can
safely sort a file in place by using commands like ‘sort -o F F’
and ‘cat F | sort -o F’. However, ‘sort’ with ‘--merge’ (‘-m’) can
open the output file before reading all input, so a command like
‘cat F | sort -m -o F - G’ is not safe as ‘sort’ might start
writing ‘F’ before ‘cat’ is done reading it.

amenex
Offline
Joined: 01/03/2015

Here's one of the old scripts which was causing the output file to be updated:
awk '{print $1}' 'B-May2020-L.768-secondary.txt' | sudo nmap -6 -Pn -sn -T4 --max-retries 16 -iL '-' -oG - | grep "Host:" '-' | awk '{print $2,$3}' '-' | sed 's/()/(No_DNS)/g' | tr -d '()' | uniq -u > May2020-L.768-secondary.oGnMap.txt

There's no sort there, just uniq, which I was expecting only to remove adjacent duplicates. So we aren't at odds.
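
For reference, plain uniq and the uniq -u actually used in that script behave differently:
printf 'a\na\nb\n' | uniq      # collapses adjacent duplicates: prints a and b
printf 'a\na\nb\n' | uniq -u   # keeps only lines that are never repeated: prints b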

There are sixteen instances of the script running right now, which keeps the network I/O nice & steady.
Some dips happen when there's other traffic, but not because the responses that nmap receives are sporadic;
those dips are ironed out while so many other nmap scans are ongoing.

George Langford