Languages

User login
Create new account

Navigation

Recent donations

Monika05PL
donated PLN 7.80

Jorge Garcia Go...
donated € 10.00

Jesus Tobajas ...
donated € 10.00

arielenter
donated MXN 500.00

Gary Rookard
donated $ 10.00

Donate now!

bc1q3t3vxjhd3dmvg3cfn24k4l7n4mf750utpp75hn

Submitted by amrprznu on Wed, 12/10/2014 - 00:14

Revision of Searchable PDF from DjVu OCR from non-searchable PDF from Wed, 12/10/2014 - 20:20

The revisions let you track differences between multiple versions of a post.

Why a DjVu? Because it's so much easier to do, and it's completely libre.

Here is a script which will help you to convert a PDF into a searchable DjVu file. The only thing that I miss in DjVu is the annotation capabilities within a GUI (you can also try with http://elpa.gnu.org/packages/djvu.html http://sourceforge.net/p/djvu/discussion/103285/thread/ad176492/).

If you really want to get a PDF, you can do two things: look at the bottom of this page or get ocrfeeder to work for you.

Here is how you get the searchable DjVu from an image-based PDF: copy the code below into a new file or download the attachment (you can call it djvuocr.sh). Then, make it executable (open your file manager, do a right-click on the file, go to properties, go to permissions and change from read only to read and execute). You will have to run the program in a terminal (you can open one by doing CTRL + t or by going to the Accessories section of your Programs Menu).

If the name of your PDF is MyFile.pdf and it's located in a folder called Docs which lies on your personal folder (HOME), the full name of your file is ~/Docs/MyFile.pdf. Let us assume that you saved your program to another folder (~/Progs/djvuocr.sh, for example). In that case, you have to run the program by typing this on your terminal:

 ~/djvuocr.sh ~/Docs/MyFile.pdf

This will take a while. It converts the PDF into TIFF which you will have to cleanse with scantailor (optionally, for better results); gets the OCR out of the TIFF, and puts both the OCR and the TIFF into a DjVu file.

 
    !/bin/bash
    
        djvuocr.sh is free software. You can redistribute it and/or modify
        it under the terms of the GNU General Public License as published by
        the Free Software Foundation, either version 3 of the License, or
        (at your option) any later version.
    
        djvuocr.sh is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
        GNU General Public License for more details.
    
        You should have received a copy of the GNU General Public License
        along with djvuocr.sh.  If not, see <http://www.gnu.org/licenses/>.
    
    
     This is script is meant to automate the process of converting
     image-based PDF files into searchable DJVU files by means of OCR.
    
     It was inspired by:
     http://superuser.com/questions/641899/how-to-automatically-find-non-searchable-pdfs
     http://jwilk.net/software/ocrodjvu
    
     It is required to have the following installed:
       ocrodjvu (tested with 0.7.18 from Debian packages.debian.org)
       pdftoppm, pdfinfo and pdffonts (from poppler-utils 0.24.5)
       scantailor (tested with 0.9.11.1)
       djvm and cjb2 (from djvulibre; djvulibre-bin 3.5.25.4-3)
    
     Save this script to a file and make it executable:
       chmod +x djvuocr.sh
    
     Then run it on a terminal with the name of the PDF that you want to
     OCR:
       djvuocr.sh /directory/MyPdfFile.pdf
    
     Good luck!




if  ! "$; then
   printf "Usage: $0 /path/to/PDF\n";
   exit 1;
fi;

 -z $(which "scantailor") && inst="$inst scantailor";  -z $(which "pdf2djvu") && inst="$inst pdf2djvu";  -z $(which "ocrodjvu") && inst="$inst ocrodjvu"; | -z $(which "pdfinfo") || -z $(which "pdffonts")  && inst="$inst poppler-utils"; | -z $(which "cjb2")  && inst="$inst djvulibre-bin"; if  -n $inst; then
   printf "The following utilities need to be installed:\n$inst\n\n"
   printf "Do you want to continue? [Y/n]"
   read bol_inst;
   case $bol_inst in
   [yY] | [yY][eE][sS] | "")
       sudo aptitude install "$inst";
       ;;
   [nN] | [Nn][Oo])
       printf "Aborting. Have a nice day."
       exit 1;
       ;;
   esac;
fi;

arch_pdf="$1"; num_pag=$(pdfinfo "$arch_pdf" | grep '^Pages' | awk '{print $2}');

printf "\nChecking if $(basename "$arch_pdf") is a readable PDF\n"; for ((i=1; i<=$num_pag; i++)); do
   font=$(pdffonts -f "$i" -l "$i" "$arch_pdf" | tail -n +3 | wc -l);
   if [[ "$font" == "0" ]]; then
   pags="$pags $i";
   fi;
done; printf "Pages $pags are not searchable\n"

if  tr -d ' ')" ; then
   [[ -z $nomTmp ]] && nomTmp="$(uuidgen)";
   OCRdir="/tmp/OCR/$nomTmp"
   mkdir -p "$OCRdir"
   printf "\nCreated working directory $OCRdir"
   printf "\nConverting pages to TIFF: ";
   for i in $pags; do
   printf "$i ";
   pdftoppm -tiffcompression none -r 300 -tiff -f "$i" -l "$i" "$arch_pdf" "$OCRdir/$nomTmp";
   done;
   printf "\n";
   printf "Do you want to use scantailor to improve the OCR? [y/N] "
   read bol_tailor;
   case $bol_tailor in
   [yY] | [yY][eE][sS])
       printf "When scantailor is opened,
1) click \"New Project\"; 2) choose $OCRdir as \"Input Directory\"; 3) don't change the \"Output Directory\"; 4) fullfil the tasks to the left; 5) in the last task select black and white (not mixed nor grayscale); 6) when done, click on the play symbol (an arrow within a circle), and 7) close scantailor. You don't need to save the project if you don't want to.

You'll be brought back to this terminal when finished. Hit ENTER now."
       read;
       scantailor;
       out_dir="out";
       ;;
   [nN] | [Nn][Oo] | "")
       printf "Skipping scantailor";
   esac;
   out_dir="$OCRdir/$out_dir";
   printf "\nConverting pages to DJVU: \n";
   djvu_dir="$OCRdir/djvu";
   mkdir -p "$djvu_dir"
   # Check if tesseract is working
   if [[ "$(ocrodjvu --list-engines)" =~ "tesseract" ]]; then
   printf "Using tesseract OCR\n";
   ocr="tesseract";
   else
   printf "Not using tesseract. The results will be sub-optimal\n";
   fi;
   for i in "$out_dir"/*.tif; do
   if [[ -f "$i" ]]; then
       # http://jwilk.net/software/ocrodjvu
       djvu_arch="$djvu_dir/$(basename -s .tif "$i")".djvu;
       cjb2 "$i" "$djvu_arch";
       if [[ -n "$ocr" ]]; then
       ocrodjvu --in-place -e "$ocr" "$djvu_arch";
       else
       ocrodjvu --in-place "$djvu_arch";
       fi;
   fi;
   done;
   printf "\nExtracting remaining pages from PDF: \n"
   for (( i=1; i<=$num_pag; i++ )); do
   if [[ ! -f "$djvu_dir/$nomTmp-$i.djvu" ]]; then
       # pdftk "$arch_pdf" cat "$i" output "$out_dir/$nomTmp-$i".pdf;
       pdf2djvu -j0 --lines -p "$i" -o "$djvu_dir/$nomTmp-$i".djvu "$arch_pdf";
   fi;
#    pdftk "$arch_pdf" burst output "$(basename -s .pdf "$nomTmp")"-"%d.pdf";   done;
   printf "\nJoining to DJVU: \n";
   djvm -c "$OCRdir/$(basename -s .pdf "$arch_pdf")"-OCR.djvu "$djvu_dir/$nomTmp-"*.djvu;
   printf "\n***** Success! *****\n";
   printf "Your new file is located at $OCRdir";
   printf "\nDo you want to clean the temporary files? [Y/n] ";
   read fin;
   case "$fin" in
   [yY] | [yY][eE][sS] | "")
       rm -fr "$djvu_dir" "$out_dir" "$OCRdir/$nomTmp*.tif";
       ;;
   [nN] | [nN][oO])
       printf "\n"
       du -h "$OCRdir";
       ;;
   esac;
fi;

Then, if you really want to have a PDF out of this process, you will have to check the location of your new DjVu (/tmp/OCR/MyFile-OCR.djvu, for instance) and run the following code in the terminal. It may take a seriously long time, and you should check the difference in size between the DjVu file and the PDF. Then, you can decide if you really want a PDF or DjVu!

 djvups /tmp/OCR/MyFile-OCR.djvu /tmp/OCR/temp.ps

 ps2pdf /tmp/OCR/temp.ps /tmp/OCR/MyFile-OCR.pdf

The file multiocr.txt is a file with a bash script which scans a folder recursively and does the whole PDF → DjVu process with all PDF files (with a PDF extension).

Attachment	Size
djvuocr.txt	5.41 KB
multiocr.txt	6.67 KB

Revisions

12/10/2014 - 00:14

amrprznu

09/21/2024 - 14:09

knife