Can a path statement be too long or the file too big?

amenex

Faced with the task of using grep to find related strings in a target file on a USB
flash drive, with the search strings in a pattern file in the local directory, I had
success with the first two scripts, generalized here to protect the innocent:
grep -f Pattern-File-A.txt /home/george/FlashDrive/...4 steps.../Target-File-A.txt > Output-File-A.txt
grep -f Pattern-File-B.txt /home/george/FlashDrive/...4 steps.../Target-File-B.txt > Output-File-B.txt

These worked fine, with a satisfying number of matches.

Continuing, I used a similar script with the same path length, from a third pattern file
against the second target file, but the output file turned out to be identical to the
input target file: in effect zero matches, just the whole file echoed back.

My non-geek workaround was simply to move the target file into the working directory,
whereupon I met with success:
grep -f Pattern-File-C.txt Target-File-B.txt > Output-File-C.txt
The output was clean, with nothing irrelevant.

Target file sizes were 32.2MB, 63.5MB, and 63.5MB (the last two being the same file).
The Pattern files contained 385, 155, and 232 strings, respectively.
The successful output files included about 5500, 4000, and 600 matches, respectively.

There were no other scripts running during this exercise. Each script took a small
fraction of a second of CPU time, even the unsuccessful one.
Hardware: ThinkPad T420, 8GB RAM, 18GB swap.

George Langford

Magic Banana


PATH_MAX is 4096 on Trisquel 8, according to getconf. The maximal file size depends on the filesystem. Trisquel 8's installer chooses XFS for /home, which supports files up to 8 exbibytes.
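
For reference, that limit can be queried for any mount point:
$ getconf PATH_MAX /home
4096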

Did you really think that a few tens of characters would exceed the limit for a path? And a few tens of MB for a file?!

amenex

After thinking about my observation last night, it dawned on me that the script has to reach down
through the directory structure of the host computer before negotiating the directory structure of
the flash drive. Let's articulate that structure in the scripts as though they were being run from
arbitrary working directories:
grep -f /home/george/Desktop/May2020/nMapScans/Pattern-File-A.txt /home/george/FlashDrive/...4 steps.../Target-File-A.txt > Output-File-A.txt
versus the unsuccessful script:
grep -f /home/george/Desktop/January2020/DataSets/MB/Multi-addressed-HNs/Counts/Pattern-File-C.txt /home/george/FlashDrive/...4 steps.../Target-File-B.txt > Output-File-B.txt
and after moving the target file:
cd /home/george/Desktop/January2020/DataSets/MB/Multi-addressed-HNs/Counts ;
grep -f Pattern-File-C.txt Target-File-B.txt > Output-File-C.txt

Consider the analogy of moving a multi-gigabyte file:
mv hugefileA.txt /home/george/someplaceelse/folderB/folderC/hugefileA.txt
This takes just the blink of an eye so long as the move takes place within the file
structure of the storage medium, so all that really seems to matter is the number of
characters in the path statement.
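
(As far as I can tell, a move within one filesystem only rewrites a directory entry,
which is why it is nearly instant regardless of file size; a move onto another medium,
for example
mv hugefileA.txt /home/george/FlashDrive/hugefileA.txt
would have to copy all of the data first.)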

The multi-megabyte target file moved in just a few blinks of an eye and can be moved back to its
flash drive just as quickly ... I actually copied the file, so I can simply delete it and then
empty the trash.

In the present case it was painfully obvious that the unsuccessful script wasn't merely inaccurate
or noisy; it was wholly unsuccessful, as the output file was exactly the same size as the target
file. This is not a bug; it's a feature that is necessary for some historic or practical reason,
whatever that may be.

It would appear that the default in this instance is for grep to spit the target file back out,
rather than cause chaos, when its internal limit is exceeded. Wouldn't it be better to state that
result with a simple "don't tickle me there" message instead of appearing to go through the
motions?

Magic Banana


grep has no "internal limit": as long as the currently processed line (not the whole file) fits into main memory, it works.

amenex

Pressing on to delve into another part of the project ...

Again, another grep script in the same vein worked very hard for about twenty minutes and then
disgorged the original target file as its output, consuming an additional 4.7GB of RAM.
Even with the target file moved into the working directory ... so that workaround failed.

Faced with no alternative, this lazy semi-geek condensed the target file to a simple one-column
list (which he ought to have done at the outset) and re-ran the grep script on the condensed
target, with the happy response of a slowly growing output file (growing in kB increments rather
than multi-MB), the same RAM usage (the additional 4.7GB), and 520kB of swap, which it had not
touched before.
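
(For instance, assuming whitespace-separated rows with the PTR in the first column, something like
awk '{print $1}' Target-File-B.txt > Condensed-Target-B.txt
produces such a one-column list; the output name is arbitrary.)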

Grep again took about twenty minutes to accomplish all this.

Grep was clearly going through the same motions each time, but somehow its actual output got
overwritten in the first instance and was preserved in the second run. The target file was 63.5MB
in the first run and, even condensed, still 26.5MB the second time; but now the output file held
real grepped data without superfluous material: about 3MB and 71,000 rows of non-duplicated
matching PTRs.

One difference between yesterday and today is that the pattern file now has nearly 4,000 hostnames.

What should I be seeing in the log files?

Magic Banana


Even with the target file moved into the working directory ... so that workaround failed.

That is not a workaround. Instead of assuming you are facing limitations or bugs and trying random things, you had better read the documentation and understand what you are doing wrong.

Without option -F, grep interprets every line of the pattern file (the argument of -f) as a regular expression. For instance, '.' means "any single character". Is that what you want?
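
A minimal illustration, with a hypothetical one-line pattern file:
$ printf 'a.b\n' > patterns.txt
$ printf 'a.b\naXb\n' | grep -f patterns.txt
a.b
aXb
$ printf 'a.b\naXb\n' | grep -Ff patterns.txt
a.b
With -F, the dot is taken literally and only the exact string matches.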

As usual, it is very hard to understand your problem and I got tired of deciphering your vague text with no example of the input and of the related desired output. Anyway, I doubt grep is the proper command to use here. It looks like a task for awk or, if the order does not matter, for join (or even comm).
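
For example (hypothetical file names, the key being the first field):
$ comm -12 <(sort patterns.txt) <(sort target.txt)
$ join <(sort patterns.txt) <(sort -k1,1 target.txt)
comm -12 prints the whole lines common to both files; join prints each matching key together with the remaining columns of target.txt.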

amenex

Grep worries me because it selects a lot of names that aren't in the pattern file, even while
the operation remains orderly and manages to select just a fraction of the target file's entries.
I had thought that grep had the advantage of letting me identify long PTR records based on
permutations of their IPv6 addresses, but no such comparisons occurred in the present set of
patterns, which are based on IPv4 addresses: there were no examples of permuted IPv4 addresses
among the target file's PTRs.

Join selected just 141 matches, which were easy to recognize because those matches alone included
the data in the pattern file's second column. Comm also selected those 141 matches, and I used
join to restore their counts column.
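
For the record, the restoration step was along these lines (matches.txt stands for the one-column list of matched names):
join <(sort matches.txt) <(sort -k1,1 Pattern-File-C.txt) > Matches-With-Counts.txt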

The join-, sort-, and comm-based scripts all ran orders of magnitude faster than the grep script.

The original pattern file and a randomized (sort -R), reduced-length (from over one million rows
down to 300,000) target file are attached; join as well as comm find 35 matches in them in short order.

Attachments:
SS.HN-GLU-MB-January2020-PTRs-Rndm.txt.gz (1.55 MB)
SS.IPv4-NLU-Joined-HN-GLU-January2020-slash24.PTRs_.Tally_.txt (93.66 KB)

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Grep worries me because it selects a lot of names that aren't in the pattern file

No, it does not. You do not understand what it does and/or do not use it properly. Besides every dot in the pattern file that should be \. (so as not to match any single character, as I explained in my previous post), a caret and a tab should respectively start and end every pattern.

Consider line 297123 of SS.HN-GLU-MB-January2020-PTRs-Rndm.txt, for instance: "union". Because of it, 'grep -f SS.HN-GLU-MB-January2020-PTRs-Rndm.txt' outputs every line containing "union". In SS.IPv4-NLU-Joined-HN-GLU-January2020-slash24.PTRs_.Tally_.txt:
$ grep union SS.IPv4-NLU-Joined-HN-GLU-January2020-slash24.PTRs_.Tally_.txt
unallocated.unioncom.net.ua 249
r-r-resale-dba-once-upon-a-child-9148-union.static.fuse.net 4
chaco-credit-union-10m-fuse.static.fuse.net 4
mail.unionbankph.com 3
gw.interunion.ru 2

That is not what you want. But grep cannot guess it: it does what you ask it to do.

Including line 297123, 128 lines in SS.HN-GLU-MB-January2020-PTRs-Rndm.txt contain "union":
$ zgrep -c union SS.HN-GLU-MB-January2020-PTRs-Rndm.txt_0.gz
128

The presence of the 127 other lines makes no difference whatsoever in the output of 'grep -f SS.HN-GLU-MB-January2020-PTRs-Rndm.txt'.
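
Such an anchored pattern file can be built mechanically, e.g. with GNU sed (file names here are placeholders; \t in the replacement is a GNU extension):
$ sed 's/\./\\./g; s/^/^/; s/$/\t/' patterns.txt > anchored-patterns.txt
$ grep -f anchored-patterns.txt target.txt
Every pattern then only matches a complete first field.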

The join-, sort-, and comm-based scripts all ran orders of magnitude faster than the grep script.

The overall run time is dominated by sort's, which is linearithmic in the number of lines of the larger of the two files. Because grep must output the lines in the order of the (potentially infinite) target file, its run time grows with the product of the number of lines in that file and the number of patterns (for each processed line, every pattern is tested): that is much worse when the smaller file is large.

Also, without -F, grep interprets the patterns as regular expressions: it is obviously more expensive to match a regular expression than a fixed string. Finally, grep searches for the patterns in the whole line, not only in one specific field as join does.
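
And a sketch of the awk approach, which tests first-field membership with a hash table, needs no sorting, and preserves the target's line order (file names are placeholders):
$ awk 'NR==FNR { keys[$1]; next } $1 in keys' patterns.txt target.txt
While it reads the first file (NR==FNR), it stores each first field as a key; it then prints every line of the second file whose first field was stored.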