Information Processing

(this page is a work-in-progress)

Text-processing commands

The Unix operating system came with several text-processing commands that are still very useful today: head, tail, cat, tr, wc, cut, paste, comm, join, sort, uniq, grep, etc.

Because each of them is specialized for a single task, these commands are very efficient. The GNU project has improved them a great deal (e.g., with additional options). The original commands are specified in the POSIX standards.
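As an illustration of how these specialized commands combine, here is a minimal sketch of a pipeline that lists the three most frequent words of a text. The sample text and the variable names are assumptions made for the example.

```shell
# Hypothetical sample text, used only for illustration.
text='the cat and the dog
the bird and the cat'

# tr puts one word per line, the first sort groups identical words,
# uniq -c counts them, the second sort orders by decreasing count
# (ties broken by word) and head keeps the first three lines.
top=$(printf '%s\n' "$text" \
    | tr -s ' ' '\n' \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr -k2 \
    | head -n 3)

printf '%s\n' "$top"
```

Each command does one small job; the pipe connects them without any temporary file.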

sed and awk are less specialized than the commands listed above, but they are extremely powerful when it comes to processing text.
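A short sketch may give an idea of what sed and awk look like in practice. The comma-separated data below is hypothetical: sed rewrites a string on every line, whereas awk sums a field over the lines satisfying a condition.

```shell
# Hypothetical records: name,age,city
data='alice,30,Paris
bob,25,Lyon
carol,35,Paris'

# sed: replace the first occurrence of "Paris" on each line.
replaced=$(printf '%s\n' "$data" | sed 's/Paris/PARIS/')

# awk: with "," as the field separator, sum the second field
# of the lines whose third field is "Paris".
total=$(printf '%s\n' "$data" \
    | awk -F, '$3 == "Paris" { sum += $2 } END { print sum }')

printf '%s\n' "$replaced"
printf 'Total age in Paris: %s\n' "$total"
```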

Besides their 'info' manuals, introductory material on all these commands can be found all over the Web. For instance, the sets of slides numbered 3 to 7 on http://dcc.ufmg.br/~lcerf/en/mda.html#slides present these commands (with exercises) and make it possible to learn their basics within a few hours.

Commands to process PDFs

The packages "poppler-utils" and "pdfjam" provide several commands to process PDFs, e.g., to concatenate several PDFs into a single document, to extract specific pages, to see the metadata or to get the content as plain text.
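The tasks above can be sketched as small shell functions. The function names are hypothetical wrappers chosen for this example; pdfunite, pdfseparate, pdfinfo and pdftotext come from "poppler-utils" and pdfjam from the package of the same name. Defining the functions does nothing by itself: the commands only run when a function is called on actual files.

```shell
# Concatenate PDFs: pdf_concat in1.pdf in2.pdf ... out.pdf
# (the last argument is the output file).
pdf_concat() { pdfunite "$@"; }

# Write pages $2 to $3 of $1 as separate files named after the
# pattern $4, e.g.: pdf_split in.pdf 2 5 page-%d.pdf
pdf_split() { pdfseparate -f "$2" -l "$3" "$1" "$4"; }

# Extract a page selection into one new PDF,
# e.g.: pdf_select in.pdf 2-5 out.pdf
pdf_select() { pdfjam "$1" "$2" --outfile "$3"; }

# Show the metadata (title, author, number of pages, etc.).
pdf_meta() { pdfinfo "$1"; }

# Get the content as plain text, roughly preserving the layout.
pdf_text() { pdftotext -layout "$1" "$2"; }
```

For instance, assuming two files a.pdf and b.pdf exist, `pdf_concat a.pdf b.pdf all.pdf` would create all.pdf.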

Like any command, they can be used inside scripts. Following a thread on the forum, a script was written to extract, from PDF documents, the pages matching some regular expressions (simple strings, for example): http://dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep
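The idea behind such a script can be sketched in a few lines. This is not the 'pdf-page-grep' script linked above, only a simplified illustration relying on the fact that pdftotext ends every page with a form feed character. The function names are hypothetical.

```shell
# Read pdftotext-style text on the standard input (pages separated
# by form feeds) and print the numbers of the pages matching the
# extended regular expression given as first argument.
matching_pages() {
    awk -v re="$1" 'BEGIN { RS = "\f" } $0 ~ re { print NR }'
}

# Hypothetical wrapper: page_grep_sketch REGEX in.pdf out.pdf
# Writes to out.pdf the pages of in.pdf that match REGEX.
page_grep_sketch() {
    pages=$(pdftotext "$2" - | matching_pages "$1" | paste -sd, -)
    if [ -n "$pages" ]; then
        pdfjam "$2" "$pages" --outfile "$3"
    else
        echo "No page matches." >&2
        return 1
    fi
}
```

The matching step can be tried without any PDF, by feeding it text in which `\f` marks the end of each page: `printf 'intro\fgrep and sed\fnothing\fmore grep\f' | matching_pages grep` prints 2 and 4, one number per line.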

Lucas Westermann (Full Circle Magazine) wrote a pedagogical article about Professor Loic Cerf's 'pdf-page-grep'. This article appeared in issue 89 (pages 10–11) of the magazine: http://dl.fullcirclemagazine.org/issue89_en.pdf

Revisions

09/04/2014 - 04:09
muhammed
09/07/2014 - 20:08
Magic Banana