OCR that adds generated text to the original pdf and djvu files?

Question

My OS is Ubuntu.

I found some applications that can OCR a pdf or djvu file and generate a separate text file.

But how can I add the OCRed text to the original pdf or djvu file, so that the text becomes selectable in that file, as Adobe Acrobat does on Windows?

Community · Accepted Answer · 2020-06-12 13:48:39Z

For PDF, there is pdfsandwich:

pdfsandwich generates "sandwich" OCR pdf files, i.e. pdf files which contain only images (no text) will be processed by optical character recognition (OCR) and the text will be added to each page invisibly "behind" the images.

It's a two-step process:

First, add the OCR text to a new PDF (here I use the tesseract OCR engine with the French language):

pdfsandwich -sloppy_text -tesseract /path/to/tesseractbin -tesso -l fra ./original.pdf -o ./ocr.pdf

Then convert the OCRed PDF to DjVu with:

pdf2djvu -o ./ocr.djvu ./ocr.pdf
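
The two steps can be chained in a small shell function. This is only a minimal sketch, not the answerer's script: the function name, the `DRY_RUN` variable and the `_ocr` output naming are made up for illustration. It also assumes that pdfsandwich, tesseract (with the requested language data) and pdf2djvu are on the PATH, and it uses pdfsandwich's `-lang` option instead of the `-tesso -l fra` form shown above; check `pdfsandwich -help` on your version before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: OCR a scanned PDF with pdfsandwich, then derive a DjVu file from it.
# All names here (run, ocr_to_pdf_and_djvu, DRY_RUN, the _ocr suffix) are
# hypothetical and chosen for illustration only.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: ocr_to_pdf_and_djvu INPUT.pdf [LANGUAGE]
# LANGUAGE is a tesseract language code; it defaults to fra as in the answer.
ocr_to_pdf_and_djvu() {
    local input="$1"
    local lang="${2:-fra}"
    local base="${input%.pdf}"   # strip the .pdf suffix to build output names

    # Step 1: write a new PDF with an invisible OCR text layer.
    # Step 2: convert that PDF to DjVu, only if step 1 succeeded.
    run pdfsandwich -lang "$lang" "$input" -o "${base}_ocr.pdf" &&
        run pdf2djvu -o "${base}_ocr.djvu" "${base}_ocr.pdf"
}
```

Calling `DRY_RUN=1 ocr_to_pdf_and_djvu ./original.pdf fra` prints the two commands without running them, which is a cheap way to check the file names first. After a real run, `pdftotext ./original_ocr.pdf -` (from poppler-utils) should print the recognized text if the text layer was added.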

Édouard Lopez · Accepted Answer · 2013-02-12 10:18:14Z

I started a Bash project on GitHub to help convert from PDF to PDF+OCR and DjVu+OCR. It's based on the reply by @meda-beda, with some edits I added.

It is a wrapper around pdfsandwich and pdf2djvu.

It was developed and tested under Ubuntu 12.10. There is still work to do on the options for tweaking the resulting file, which is sometimes bigger than the original.
