finding particular pages within PDFs

41 replies [Last post]
muhammed
Offline
Joined: 04/13/2013

Is there a way to search PDF files for keywords, and then create new PDFs that contain only the pages that contain those keywords?

I'd like to search the PDFs for words with various combinations of "and" and "or". Is this possible?

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

You could first split the PDFs into individual pages ('pdfjam' can do that) that you could put in a "pages" directory, enumerate those pages with a Shell 'for file in pages/*' loop, test 'if pdftotext "$file" - | grep -f regexps' (where "regexps" is a file with one regexp per line) and, if the test passes, append the file to a Shell variable so that, outside the loop, you 'pdfjoin' them all. That is for an "or" query. For an "and" query, you need to pipe several 'grep's. Words that are hyphenated would not be found.
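Here is a minimal sketch of that loop, assuming the single pages already sit in a "pages" directory, that "regexps" contains one regexp per line, and that the page files have no spaces in their names ("to_join" and "matches.pdf" are arbitrary names for this sketch):

to_join=
for file in pages/*
do
    # grep -q prints nothing; it only sets the exit status,
    # which is what the "if" tests
    if pdftotext "$file" - | grep -qf regexps
    then
        to_join="$to_join $file"
    fi
done
# $to_join is deliberately left unquoted so that the Shell splits it
# into one argument per matching page; this naive version fails
# when no page matches at all
pdfjoin $to_join --outfile matches.pdf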

But do you really want PDF pages at the output? Don't you want to work on the sole text in the documents? Inside the directory with the PDFs, you can get the text with:
for file in *.pdf
do
pdftotext "$file" -
done

If the PDFs are structured in columns (tables, for instance), try adding the -layout option to 'pdftotext'.
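For instance, to keep one text file per document instead of printing everything to the terminal (the "${file%.pdf}.txt" naming is only a suggestion):

for file in *.pdf
do
pdftotext -layout "$file" "${file%.pdf}.txt"
done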

muhammed
Offline
Joined: 04/13/2013

MB, having the text would be way more useful than the PDF pages! Thanks for recommending pdftotext and the -layout option.

I have some questions -- could you help me break this process down into smaller steps?

I looked up pdfjam's split command online -- I think that it may be a little time-consuming (my PDFs are a few thousand pages long):

http://0x2a.at/blog/2011/02/pdf_manipulation_on_the_cli/

http://tex.stackexchange.com/questions/79623/quickly-extracting-individual-pages-from-a-document

I looked at PDF Shuffler (the GUI one) and that can only split files one-by-one. Are there other options?

Once I split the files into single pages, I'll need the Shell 'for file in pages/*' loop. I don't understand what this step will do. Could you please explain this step too?

About this step: 'if pdftotext "$file" - | grep -f regexps' -- does this copy all the PDF text to one text file? And then search (grep) the text file? Does this command take text from many single PDFs? Or only after the "hit" pages are joined up into one document?

What does it mean to "append the file to a Shell variable"? What is the goal of this step? Could you please explain how I can do this step too?

Edit: will pdftotext's output be in HTML? I plan to use the -layout option.

Legimet
Offline
Joined: 12/10/2013

"for file in pages/*" is a for loop. That means that it will execute the body of the loop for each file in the directory pages/*, setting the variable file to the filename each time.

'if pdftotext "$file" - | grep -f regexps': the 'pdftotext "$file" -' part outputs the text of the PDF to standard output. However, this is piped to grep. When you pipe, the standard output of the first command becomes the standard input of the second command. So the text will be searched for your regexps, and the "if" will check whether there were any matches.

You can append the filename to a variable using something like 'foo="$foo $file"' (note: no "$" on the left-hand side of an assignment).
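For instance, here is a tiny self-contained demo of both points (the word "GNU" and the file names are arbitrary):

$ printf 'GNU is Not UNIX\n' | grep -q GNU && echo found
found
$ foo="first.pdf"
$ foo="$foo second.pdf"
$ echo "$foo"
first.pdf second.pdf

'grep -q' prints nothing; it only sets the exit status, which the "if" (or, here, the "&&") tests.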

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

See the attachment. You owe me a beer. ;-)

An example of use:
$ pdf-page-grep *.pdf
regexp: GNU
OR regexp (empty to stop): [fF]ree
OR regexp (empty to stop):

matching pages in "Agglomerating Local Patterns Hierarchically with ALPHA.pdf":
matching pages in "A Parameter-Free Associative Classification Method.pdf": 1 3 5 7 9 11 12
matching pages in "Artificial Regulatory Networks Evolution.pdf": 2
matching pages in "Closed and Noise-Tolerant Patterns in n-ary Relations.pdf": 8 9 11 20
matching pages in "Closed Patterns Meet n-ary Relations.pdf": 16 21 23 27
matching pages in "Complete Discovery of High-Quality Patterns in Large Numerical Tensors.pdf": 7
matching pages in "Constraint-Based Search of Different Kinds of Discriminative Patterns.pdf": 4 6
matching pages in "Constraint-Based Search of Straddling Biclusters and Discriminative Patterns.pdf": 6 10
matching pages in "Data-Peeler: Constraint-Based Closed Pattern Mining in n-ary Relations.pdf": 7 9
matching pages in "Descoberta de n-Conjuntos Fechados Eficiente e Restrita a Grupos de Interesse.pdf": 8
matching pages in "Discovering Descriptive Rules in Relational Dynamic Graphs.pdf": 12 19
matching pages in "Discovering Inter-Dimensional Rules in Dynamic Graphs.pdf": 8 12
matching pages in "Discovering Relevant Cross-Graph Cliques in Dynamic Networks.pdf": 7
matching pages in "Distributed Skycube Computation with Anthill.pdf": 3 5
matching pages in "Exploiting Temporal Locality to Determine User Bias in Microblogging Platforms.pdf":
matching pages in "Extraction de motifs fermés dans des relations n-aires bruitées.pdf":
matching pages in "Mining Constrained Cross-Graph Cliques in Dynamic Networks.pdf": 8 20
matching pages in "Multidimensional Association Rules in Boolean Tensors.pdf": 7 12
matching pages in "Parameter-free classification in multi-class imbalanced data sets.pdf": 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
matching pages in "Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach.pdf": 4
matching pages in "Sémantiques et Calculs de Règles Descriptives dans une Relation n-aire.pdf": 2 14 15
matching pages in "Tackling Closed Pattern Relevancy In n-ary Relations.pdf": 5
matching pages in "Un nouveau cadre de travail pour la classification associative dans les données aux classes disproportionnées.pdf": 4
matching pages in "Watch me Playing, I am a Professional: A First Study on Video Game Live Streaming.pdf":

Output written to "Un nouveau cadre de travail pour la classification associative dans les données aux classes disproportionnées-matches.pdf"

It actually is pretty fast: a little bit more than 10s for the example above that processes 24 documents (341 pages in total) and generates a 62-page PDF.

As you can see, I decided to name the output with the "basename" of the last matching PDF followed by "-matches.pdf". I initially wanted to concatenate the names of all matching PDFs but you may then reach the size limit for file names!

I do not know if you really wanted regular expressions (instead of simple strings). Maybe you wanted whole-word matches and/or to ignore the case. Those things are simple options to add to the 'grep' command.

You were talking about combining AND and OR queries. The current command takes care of the OR part... but, to get your AND, you can then process the output PDF with the same command. You therefore need to rewrite your query in conjunctive normal form, that is, as an AND of ORs (there is a good chance it already is in this form): https://en.wikipedia.org/wiki/Conjunctive_normal_form

If you want to take the regexps from a file (in particular, if you want to use 'pdf-page-grep' non-interactively inside another script), you just need to redirect the standard input. The regexps must be on adjacent lines and the file must end with *two* empty lines. For example, here I create the file with 'cat':
$ cat > regexps
GNU
[fF]ree

^C
$ pdf-page-grep *.pdf < regexps
# same output as above

AttachmentSize
pdf-page-grep.gz 502 bytes
muhammed
Offline
Joined: 04/13/2013

It took me a little while to figure out what I was looking at ... thank you so much! I'm running the script now, and it's finding pages! This is so cool. I'm going to PM you about that beer.

Also, thanks a lot Legimet. You guys are the best.

muhammed
Offline
Joined: 04/13/2013

In case someone with a similar situation finds this page -- here's how to run a script:

1. Open the terminal, and type:

cd [directory where your script is]

Example:

cd /home/username/Desktop/research/

Put the PDF files in the same directory

2. Then type the following, to give yourself permission to run the script (is that what it does?)

chmod +x [script name]

3. Run the script

./[script name]

4. For Magic Banana's script -- the terminal will tell you to use the script this way:

./[script name] [pdf-name].pdf [pdf-name].pdf

OR you can do all the PDFs in that folder in one go with an asterisk, like this:

./[script name] *.pdf

Edit: I had to look it up, so I thought I'd share.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

I made a typo in the last test (the script crashes if there is no match). The corrected script is attached.

It is true that you have to make the script executable. You can do that with 'chmod +x' or from a graphical file browser (in Nautilus: right-click, "Properties", "Permissions" tab, a box to check).

If you plan to frequently use the script, you had better move it to a directory listed in your PATH variable. To /usr/local/bin/ for example:
$ sudo mv pdf-page-grep /usr/local/bin/

In this way, you can run it from anywhere by just typing its name.

The file arguments can be specified in any "Shell way" (explicit names, globs such as *.pdf, and so on): the script just sees a list of file names.

Again: if you want slightly different ways of selecting text, it is only about adding options to the 'grep' command. I am thinking of the following options:
`-i'
`-y'
`--ignore-case'
Ignore case distinctions in both the pattern and the input files.

`-F'
`--fixed-strings'
Interpret the pattern as a list of fixed strings, separated by
newlines, any of which is to be matched.

`-w'
`--word-regexp'
Select only those lines containing matches that form whole words.
The test is that the matching substring must either be at the
beginning of the line, or preceded by a non-word constituent
character. Similarly, it must be either at the end of the line or
followed by a non-word constituent character. Word-constituent
characters are letters, digits, and the underscore.

`-r'
`-R'
`--recursive'
For each directory mentioned on the command line, read and process
all files in that directory, recursively.

AttachmentSize
pdf-page-grep.gz 494 bytes
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

The script now considers that the arguments that start with "-" (e.g., "-F" or "--ignore-case") are options for 'grep'. I put the script on my website: http://dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

There actually was a problem with the input PDFs: if they were not in the working directory, the script was crashing. Also, the output pages were not in the correct order (the order in which the user gave the PDFs). Finally, the script was returning 0 even if no page matched the patterns (the exit status is useful when using the command inside another script).

I made all those fixes and heavily commented the script (for those who want to learn Shell scripting): http://dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep

muhammed
Offline
Joined: 04/13/2013

I want to learn Shell scripting now and will read those comments carefully -- thanks

ssdclickofdeath
Offline
Joined: 05/18/2013

Where's the license for the script?

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Oops! I added those lines:
# Distributed under the terms of the GNU General Public License v3
# AUTHOR: Magic Banana
# e-mail: name at domain

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

I simplified the script: http://dcc.ufmg.br/~lcerf/utilities/pdf-page-grep

It now is closer to my original proposal since it extracts the individual pages with matches and, in the end, joins them all.

Besides basic POSIX commands (such as 'grep' and 'awk'), the script now only relies on commands that the "poppler-utils" package provides. This package is installed by default in the GNOME and Mini editions of Trisquel 6 and 7. 'pdfjam' is not required anymore.

On the downside, the script actually is slower when many pages per PDF match the patterns. However, I guess the script mainly is useful in "needle in a haystack" contexts (otherwise, just use Ctrl+F in the PDF viewer!). In such contexts, the performance difference is insignificant.

muhammed
Offline
Joined: 04/13/2013

I tried running the script again today, but am having trouble.

~/Desktop/research$ pdf-page-grep
bash: pdf-page-grep: command not found

Does anyone know why this might be happening?

lembas
Offline
Joined: 05/13/2010

That means there's no such command in your $PATH. If the script is in the research folder and is executable, you need to run it with ./scriptname
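For instance, the standard 'command -v' tells you whether (and where) the shell finds a command; the path shown here assumes the script was moved to /usr/local/bin as suggested earlier in the thread:

$ command -v pdf-page-grep || echo "not in PATH"
/usr/local/bin/pdf-page-grep
$ ./pdf-page-grep *.pdf

The second command runs the copy in the current directory, whatever the PATH contains.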

muhammed
Offline
Joined: 04/13/2013

Right! Thank you!

I moved the script to a directory in my PATH variable like MB suggested. But I'm not sure that it worked properly. I'll do a little research about that and try again more carefully.

muhammed
Offline
Joined: 04/13/2013

Is it possible to search for pages that contain words -- at least one word from each of two groups? For example:

First group of "ORs": car, truck, bus, bicycle, or motorcycle

"and"

Second group of "ORs": blue, red, green, purple, or beige

So a good hit could have the words "green" and "truck" on the same page (but not necessarily near each other on the page).

A good hit may look like "He had a green truck." A good hit could also look like "He didn't drive the truck near the green trees." What matters is that at least one word from each group appears on the same page.

Could I implement such a feature if I learn Shell scripting?

muhammed
Offline
Joined: 04/13/2013

I just thought of something. I could use pdf-page-grep to do a first pass with my first group of ORs.

Then I could split the "matches" file into single-page PDFs.

And then use a new set of ORs on those single-page PDFs.

This would be like having an "and" in the search. Is there an automatic way to split the matches PDF into single-page PDFs?

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

I do not understand your need to split the output PDF into single pages. As far as I understand, you can:

  1. execute 'pdf-page-grep' with the original PDFs in argument and specify the elements in the first group as "patterns"; the output is a PDF whose name is written at the end of the execution;
  2. execute 'pdf-page-grep' with that single PDF in argument and specify the elements in the second group as "patterns"; the output is what you want.
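In a session, the two passes would look roughly like this (hypothetical: the output name printed by the first pass depends on your last matching PDF, so "something-matches.pdf" stands for whatever name it actually prints):

$ pdf-page-grep *.pdf
regexp: car
OR regexp (empty to stop): truck
OR regexp (empty to stop):
...
Output written to "something-matches.pdf"
$ pdf-page-grep something-matches.pdf
regexp: blue
OR regexp (empty to stop): red
OR regexp (empty to stop):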

As I have already tried to explain, your query needs to be written as a conjunction of disjunctions (and this is what you did!):

You were talking about combining AND and OR queries. The current command takes care of the OR part... but, to get your AND, you can then process the output PDF with the same command. You therefore need to rewrite your query in conjunctive normal form, that is, as an AND of ORs (there is a good chance it already is in this form): https://en.wikipedia.org/wiki/Conjunctive_normal_form

Anyway, and even if it looks unnecessary in your case, here is the answer to your question: 'pdfseparate' is the command to split a PDF into single pages. 'pdf-page-grep' actually uses it (with the options -f and -l to specify an interval of pages to extract).
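For instance, with a hypothetical "big.pdf", the first command below writes every page to "page-1.pdf", "page-2.pdf", and so on, while the second one only extracts pages 4 to 7:

$ pdfseparate big.pdf page-%d.pdf
$ pdfseparate -f 4 -l 7 big.pdf page-%d.pdf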

Since you seem to be interested in whole-word searches, you may want to use the option "-w". With that option, when searching for "car", a page that contains the word "careful" (but not "car") would *not* match. But think the list of words through: you may want to find not only "car" but also "cars"!

Another option you probably want to use is "-F", so that the characters you type in the pattern are always exactly the characters that are searched for. Without the option "-F", some characters (such as ".", "*" or "$") have a special meaning (they define regular expressions rather than simple strings of characters).

Finally, when searching for "car", you may actually want "Car" or "CAR" as well. The option "-i" tells the script (actually, the 'grep' that the script calls) that the case is to be ignored.

You can group the one-letter options (such as the three options I presented above). For instance, to process one PDF named "mypdf.pdf" using all three options, you can fire:
$ pdf-page-grep -wFi mypdf.pdf

muhammed
Offline
Joined: 04/13/2013

Oh, I understand now -- when I read "pipe grep" earlier, I thought that "pipe" referred to a script instruction or terminal command that I didn't know yet.

You're right; no need to separate out the pages. I just have to run a second pass, with a second set of words, to achieve the "and".

muhammed
Offline
Joined: 04/13/2013

The script writes the "basename-matches.pdf" file to the same folder where the script and PDFs are, right?

I can't find that matches file. Is it a problem if my PDFs have spaces in their names? (Particularly the last PDF, which the script uses to create the "matches.pdf" file name.)

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

The script writes the output PDF in the "working directory" (i.e., the directory where you "are"). Notice that you can have, in argument, PDFs that are in different directories. For instance to process all PDFs in the sub-directories "dir1" and "dir2" of the working directory:
$ pdf-page-grep dir1/*.pdf dir2/*.pdf

The names of the PDFs can contain spaces. Just know how to write such file names in the terminal (use auto-completion, with [Tab], to get it right).

muhammed
Offline
Joined: 04/13/2013

Cool -- thanks

I still don't get a basename-matches.pdf output, but I have an idea to try tomorrow. I'll use the script on ~50 MB of PDFs at a time and then merge the resulting matches files.

The way I'm doing it now (500+ MB of PDFs at a time), I get an error message at the end of the run, and there doesn't seem to be a matches file in my working folder.

I/O Error: Couldn't open file '/tmp/pdf-page-grep.PHgWDa-1022': Too many open files.
Syntax Error: Could not merge damaged documents ('/tmp/pdf-page-grep.PHgWDa-1022')
29008 matching pages written to "basename-matches.pdf"

Maybe 500+ MB of PDFs is too much. I did a few test runs with a few 10-15 page PDFs, and that worked well.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

The problem was not the size of the PDFs but the number of matching pages (29008 in your case). The script was actually creating a 1-page PDF for each of them and then passing them all to 'pdfunite'. The kernel limits the number of files a process can open, so 'pdfunite' could not do its job.
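For the curious, that per-process limit can be queried with 'ulimit -n'. The value is system-dependent; 1024 is a common default and, minus the three standard streams, it leaves 1021 files a process can still open, which matches the "less than 1022" figure below:

$ ulimit -n
1024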

Given your usage, I was wrong when I wrote that the script mainly is useful in "needle in a haystack" contexts. That is why I went back to using 'pdfjam'. You should notice a performance gain. Using 'pdfjam' solves the problem of the number of files to "unite" (as long as there are fewer than 1022 PDFs with matching pages).

The new script is there: http://dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep

muhammed
Offline
Joined: 04/13/2013

Cool -- 1021 pages at a time will work great. Thanks for all the updates and help MB.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

With the newest version, there should be no restriction on the number of matching pages, only on the number of matching PDFs. Are you processing thousands of PDFs?

muhammed
Offline
Joined: 04/13/2013

My largest set of PDFs is 80 files. In that set, some PDFs are as big as ~20 mb, some are only ~500 kb.

In the newest version of pdf-page-grep, the number of matching pages is restricted to 1021 right? I can search my PDFs in smaller groups if this is the case.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Again: no, the newest version has no restriction on the number of matching pages. I switched back to the faster solution with 'pdfjam' (the eponymous package must be installed), and that also solves the problem you were facing: http://dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep

Before that latest version, the problem was that, at the end of the script, every matching page was in a single 1-page document and 'pdfunite' was joining them all. To do so, 'pdfunite' was receiving that many files as arguments. However, the kernel limits the number of files a process (here 'pdfunite') can open.

Now, in the latest version, the matching pages of a PDF document are extracted all together into one single PDF document, using 'pdfjam'. As a consequence, at the end of the script, 'pdfunite' works on as many files as there were matching *PDFs* (and *not* matching *pages*: every PDF can have many matching pages).
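For instance (hypothetical name and page numbers), extracting pages 2, 5 and 9 of one document takes a single 'pdfjam' call:

$ pdfjam document.pdf '2,5,9' --outfile document-matches.pdf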

muhammed
Offline
Joined: 04/13/2013

I think that I understand -- pdfjam lets the computer group the matches without first creating an individual PDF for each matching page.

I will read the new script to spot the differences and to try to understand how you did it.

I'll try a run on the whole set -- with the "narrowest searches first" in mind.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Something else: if you actually have an AND in your query, then you had better start with the most selective part (the one with the smallest number of expected matches). That would be faster (fewer pages to extract and then fewer pages to process).

muhammed
Offline
Joined: 04/13/2013

Thanks for the tip; I hadn't thought about how to structure the search in this way. I will review how I'm doing my searches with this in mind.

muhammed
Offline
Joined: 04/13/2013

I imagine that there's a lot that people can do with tools like this. I opened a documentation page for it here:

https://trisquel.info/en/wiki/information-processing

I don't know whether "Information Processing" is the most appropriate name. If there's a better name, please open a new page, and I can ask SirGrant to delete the Information Processing page.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

I complemented the documentation page a little.

muhammed
Offline
Joined: 04/13/2013

Haha, "a little"; my only contribution was linking to your website.

muhammed
Offline
Joined: 04/13/2013

My first run of the script was successful, and I now have a "basename-matches" PDF. It's almost 30000 pages long. I renamed it from basename-matches.pdf to word-word-#ofpagehits.pdf

I ran the script a second time, on word-word-#ofpagehits.pdf, to achieve the "and" functionality. Unfortunately, on this second run, I don't seem to get a new match file at the end.

I tried twice. On the first "and" run, I got thousands of hits and no match file at the end. The second time, I searched for something rarer, got 180 hits, and again, no match file in the end.

Any ideas on why this could be happening?

This is the message I see at the end:

pdfunite version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfunite [options] <PDF-sourcefile-1>..<PDF-sourcefile-n> <PDF-destfile>
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
1 matching PDFs
Output written to "basename-matches.pdf"

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Here is what was happening: 'pdfunite' expects *several* PDFs to unite, not just one.

I corrected the problem (when one single PDF matches, the script now uses 'mv'): http://dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep
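The logic of the fix is roughly the following (the variable names are made up for this sketch; they are not those of the actual script):

# $nb_matching counts the PDFs with at least one matching page
if [ "$nb_matching" -eq 1 ]
then
mv $matching_pdfs "$output"       # a single PDF: renaming is enough
else
pdfunite $matching_pdfs "$output" # several PDFs: unite them
fi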

Thank you for your feedback.

muhammed
Offline
Joined: 04/13/2013

Thanks again, again

muhammed
Offline
Joined: 04/13/2013

Trisquel comes with programs like mv and pdfunite, right? Do most GNU users (like me until very recently) have them on their computers and not use them? Or do popular GUI programs depend on these kinds of programs to do things (like "export to PDF" in LibreOffice)?

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

'mv' really is a basic command to "move" files (notice that moving to the same directory but under a different name actually is "renaming"). It has been present in every UNIX or UNIX-inspired system since the seventies.

The commands provided by the Poppler project (the "poppler-utils" package in Trisquel) are present in Trisquel by default. The reason is that CUPS, the printing system, depends on those commands. I therefore guess it uses them (probably to convert PDF to PostScript for printers that do not support direct PDF printing).

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Lucas Westermann did me the honor of writing a pedagogical article about 'pdf-page-grep'. This article was published in issue 89 (pages 10–11) of the Full Circle Magazine: http://dl.fullcirclemagazine.org/issue89_en.pdf

muhammed
Offline
Joined: 04/13/2013

Oh man, that is so awesome!