(this page is a work-in-progress)
Text-processing commands
The Unix operating system came with several text-processing commands that are still very useful today: head, tail, cat, tr, wc, cut, paste, comm, join, sort, uniq, grep, etc.
Because each of these commands does one specific job, they are very efficient. The GNU project has improved them a great deal (e.g., with additional options). The original commands are part of the POSIX standard.
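Those commands compose well in pipelines. As a small illustration (the file name and its contents are made up for the example), the classic idiom below ranks the most frequent lines of a file:

```shell
# Sample input, created inline for the sake of the example.
printf 'apple\nbanana\napple\ncherry\napple\nbanana\n' > /tmp/fruits.txt

# sort groups identical lines, uniq -c counts each group,
# sort -rn ranks the counts, and head keeps the top two.
sort /tmp/fruits.txt | uniq -c | sort -rn | head -n 2
```

Each command reads its standard input and writes its standard output, which is what makes such chains possible.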
sed and awk are less specialized than the commands listed above, but they are extremely powerful when it comes to processing text.
Besides their 'info' manuals, introductory material on all those commands can be found all over the Web. For instance, the sets of slides numbered 3 to 7 on http://dcc.ufmg.br/~lcerf/en/mda.html#slides present those commands (including exercises) and make it possible to learn their basics within a few hours.
Commands to process PDFs
The packages "poppler-utils" and "pdfjam" provide several commands to process PDFs (e.g., to concatenate several PDFs into a single document, to extract some specific pages, to see the meta-data, to get the content as plain text, etc.).
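A few representative invocations are sketched below; the file names are hypothetical, and the exact option sets may vary with the installed versions of poppler-utils and pdfjam:

```shell
# See the meta-data of a PDF (title, number of pages, etc.).
pdfinfo report.pdf

# Get the content as plain text.
pdftotext report.pdf report.txt

# Extract pages 2 to 4, one output file per page (%d is the page number).
pdfseparate -f 2 -l 4 report.pdf page-%d.pdf

# Concatenate several PDFs into a single document.
pdfunite page-2.pdf page-3.pdf page-4.pdf pages2to4.pdf

# pdfjam can extract a page range in one step.
pdfjam report.pdf '2-4' --outfile pages2to4.pdf
```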
Those commands can be used inside scripts (like any command). Following a thread on the forum, a script was written to extract, from PDF documents, the pages matching some regular expressions (simple strings, for example): http://dcc.ufmg.br/~lcerf/en/utilities.html#pdf-page-grep
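The published script is at the URL above; as a rough sketch of the general idea only (not the actual pdf-page-grep), one could test each page's text with grep and reassemble the matching pages with pdfjam:

```shell
#!/bin/sh
# Sketch only: keep the pages of "$2" whose text matches the pattern "$1".
pattern=$1
in=$2

# pdfinfo reports the page count on a line such as "Pages:          12".
pages=$(pdfinfo "$in" | awk '/^Pages:/ { print $2 }')

keep=
for p in $(seq 1 "$pages"); do
    # pdftotext -f N -l N restricts extraction to page N; "-" writes to stdout.
    if pdftotext -f "$p" -l "$p" "$in" - | grep -q "$pattern"; then
        keep="$keep,$p"
    fi
done

# Assemble the matching pages (page list without the leading comma).
[ -n "$keep" ] && pdfjam "$in" "${keep#,}" --outfile matching-pages.pdf
```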
Lucas Westermann (Full Circle Magazine) wrote a pedagogical article about Professor Loic Cerf's 'pdf-page-grep'. This article appeared in issue 89 (pages 10–11) of the magazine: http://dl.fullcirclemagazine.org/issue89_en.pdf