
My OS is Ubuntu.

I found some applications that can OCR a PDF or DjVu file and generate a separate text file.

But I was wondering how to add the OCRed text to the original PDF or DjVu file, so that the text becomes selectable in the original file, as Adobe Acrobat does on Windows?

2 Answers

3

For PDF there is pdfsandwich:

pdfsandwich generates "sandwich" OCR pdf files, i.e. pdf files which contain only images (no text) will be processed by optical character recognition (OCR) and the text will be added to each page invisibly "behind" the images.

It's a two-step process:

  1. Add the OCR text to a new PDF (here I use the Tesseract OCR engine with French as the language):

    pdfsandwich -sloppy_text -tesseract /path/to/tesseractbin -tesso -l fra ./original.pdf -o ./ocr.pdf

  2. Then convert the OCRed PDF to DjVu:

    pdf2djvu -o ./ocr.djvu ./ocr.pdf
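
The two steps above can be chained in a small shell function. This is only a minimal sketch: the names `run`, `ocr_to_pdf_and_djvu` and `DRY_RUN` are made up for illustration, the output file names are an arbitrary choice, and it assumes your installed pdfsandwich accepts `-lang` for the Tesseract language (check `pdfsandwich -help` on your system).

```shell
#!/usr/bin/env bash
# Hypothetical helper names; not part of pdfsandwich or pdf2djvu.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: ocr_to_pdf_and_djvu INPUT.pdf [LANGUAGE]
# LANGUAGE is a Tesseract language code; it defaults to "fra" as in the answer.
ocr_to_pdf_and_djvu() {
    local input="$1"
    local ocr_lang="${2:-fra}"
    local base="${input%.pdf}"          # strip the .pdf suffix
    local ocr_pdf="${base}_ocr.pdf"
    local ocr_djvu="${base}_ocr.djvu"

    # Step 1: add the invisible OCR text layer to a new PDF.
    # Assumption: -lang is supported by the installed pdfsandwich.
    run pdfsandwich -lang "$ocr_lang" "$input" -o "$ocr_pdf" || return 1

    # Step 2: convert the OCRed PDF to DjVu.
    run pdf2djvu -o "$ocr_djvu" "$ocr_pdf" || return 1
}
```

Setting `DRY_RUN=1` before calling `ocr_to_pdf_and_djvu ./original.pdf fra` prints the two commands instead of running them, which is a cheap way to check the file names before starting a long OCR run.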

3

I started a Bash project on GitHub to help convert a PDF to PDF+OCR and DjVu+OCR. It's based on the reply by @meda-beda, with some edits of my own.

It is a wrapper around pdfsandwich and pdf2djvu.

It was developed and tested under Ubuntu 12.10. There is still work to do on the options for tweaking the resulting file, which is sometimes bigger than the original.
