What exactly is "tesseract"?

Question

Like so many software companies that provide a free/open source version and also sell a "commercial" version, they make it as cryptic and unfriendly as possible to actually download and use the free one. Here is a typical example: https://mupdf.com/downloads/

There are two different files for download for Windows:

mupdf-1.18.0-windows.zip
mupdf-1.18.0-windows-tesseract.zip

What is "-tesseract"? No idea. I've looked all over that page, other pages, searched online, etc. No clue. Not so much as a single word explaining what the difference is or what "tesseract" means. Wikipedia's disambiguation page also gives no hint as to what it may refer to.

What is "tesseract"? And more importantly: what does it have to do with PDF viewing and why is it a separate file?

Welcome to Superuser. You may want to revisit this post and focus on the actual question while removing the whine/rant. ( I nearly voted to close the question due to the first paragraph and also second last one) — davidgo, Commented Oct 19, 2020 at 19:03

Mokubai · Accepted Answer · 2020-10-19 18:02:37Z

Tesseract is an open-source OCR (optical character recognition) engine that can be freely integrated into other programs.

Searching the MuPDF site gives some indication of what the package is:

api: Optional use of Tesseract to use OCR to extract text.

MuPDF is an ebook reader, and presumably some of those ebooks are image-based PDFs or just plain images, so an OCR engine is needed to extract text from them. In that case it uses Tesseract.

Without Tesseract, text extraction presumably will not work on image-based books, and you will be limited to grabbing text from ebooks that already contain real text.

If you know you are never going to need to extract text from images then you can save time downloading and reduce the program footprint by not downloading the -tesseract version. If you need OCR then you want to download the -tesseract version.

End Antisemitic Hate · Accepted Answer · 2021-01-31 01:33:10Z

According to Wikipedia, MuPDF is a software framework with a rudimentary viewer. It is no surprise that the documentation is rudimentary as well.
I have dug out some information about the new OCR feature of MuPDF.

Extract TESSERACT.txt from the download mupdf-1.18.0-windows-tesseract.zip for setup instructions.

Display the command-line help:

mutool draw
-F - output format (default inferred from output file name)
...
ocr'd text: ocr.txt, ocr.html, ocr.xhtml, ocr.stext

Example invocation:

mutool draw -F ocr.txt -o x.txt x.pdf
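If you want to run that invocation over several files, it can be wrapped in a small script. The following is only a minimal Python sketch, assuming the mutool executable from the -tesseract download is on your PATH and already set up as described in TESSERACT.txt. The function names build_ocr_command and ocr_pdf are made up for this illustration and are not part of MuPDF; only the -F and -o options shown above are used.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(pdf_path, txt_path, mutool="mutool"):
    """Build the argument list for: mutool draw -F ocr.txt -o <txt> <pdf>.

    Hypothetical helper; it only mirrors the example invocation above.
    """
    return [mutool, "draw", "-F", "ocr.txt", "-o", str(txt_path), str(pdf_path)]


def ocr_pdf(pdf_path, mutool="mutool"):
    """OCR one PDF and write the text to a .txt file next to it.

    Assumes a Tesseract-enabled mutool build; returns the output path.
    """
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # Fail early with a clear message if the executable cannot be found.
    if shutil.which(mutool) is None:
        raise FileNotFoundError(f"{mutool!r} not found on PATH")
    # check=True raises CalledProcessError if mutool exits with an error.
    subprocess.run(build_ocr_command(pdf_path, txt_path, mutool), check=True)
    return txt_path
```

Calling ocr_pdf("x.pdf") would then run the same command as the example invocation and return the path x.txt.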

The OCR result was poor; consider using OCRmyPDF instead, which is also open source and based on Tesseract.
