5

Like so many software companies that provide a free/open source version and also sell a "commercial" version, the makers of MuPDF make it as cryptic and unfriendly as possible to actually download and use the free one. Here is a typical example: https://mupdf.com/downloads/

There are two different files available for download for Windows:

mupdf-1.18.0-windows.zip
mupdf-1.18.0-windows-tesseract.zip

What is "-tesseract"? No idea. I've looked all over that page and other pages, and searched online. No clue. Not so much as a single word explains what the difference is or what "tesseract" means. Wikipedia's disambiguation page also gives no hint as to what it may refer to.

What is "tesseract"? And more importantly: what does it have to do with PDF viewing and why is it a separate file?

1
  • 1
    Welcome to Super User. You may want to revisit this post and focus on the actual question while removing the whine/rant. (I nearly voted to close the question because of the first paragraph and the second-to-last one.)
    – davidgo
    Commented Oct 19, 2020 at 19:03

2 Answers

5

Tesseract is an open source OCR (optical character recognition) engine that can be freely integrated into other programs.

Searching the MuPDF site gives some indication of what the package is for:

api: Optional use of Tesseract to use OCR to extract text.

MuPDF is an ebook reader, and presumably some of those ebooks are either image-based PDFs or just plain images, so an OCR engine is needed to extract text from them. In that case it uses Tesseract.

Without Tesseract, text extraction presumably will not work on image-based books, and you will be limited to grabbing text from ebooks that contain an actual text layer.

If you know you are never going to need to extract text from images, you can save download time and reduce the program's footprint by skipping the -tesseract version. If you need OCR, download the -tesseract version.
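If you are unsure whether your files are image-based, one rough way to find out is to run a plain (non-OCR) text extraction and see whether anything comes back. The sketch below assumes the `mutool` command-line tool from the MuPDF download is on your PATH and uses its `txt` output format. The helper names and the 20-character threshold are made up for illustration, not part of MuPDF.

```python
import subprocess
import tempfile
from pathlib import Path


def plain_text_command(pdf, out):
    """Build the mutool call for ordinary (non-OCR) text extraction."""
    return ["mutool", "draw", "-F", "txt", "-o", str(out), str(pdf)]


def looks_image_based(text, min_chars=20):
    """Heuristic: almost no non-whitespace characters suggests scanned pages.

    min_chars is an arbitrary assumption; tune it for your documents.
    """
    return len("".join(text.split())) < min_chars


def needs_ocr(pdf, min_chars=20):
    """Extract text without OCR and report whether the result is near-empty."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "plain.txt"
        subprocess.run(plain_text_command(pdf, out), check=True)
        text = out.read_text(encoding="utf-8", errors="replace")
    return looks_image_based(text, min_chars)
```

Calling `needs_ocr("book.pdf")` on a scanned book should return `True`, in which case the -tesseract download is probably the one you want. A document that mixes scanned and text pages can fool this check, so treat it as a hint only.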

2

According to Wikipedia, MuPDF is a software framework with a rudimentary viewer, so it is no surprise that the documentation is rudimentary as well.
I have dug out some information about the new OCR feature of MuPDF:

  • Extract TESSERACT.txt from the download mupdf-1.18.0-windows-tesseract.zip; it contains the setup instructions.

  • Display the command-line help:

    mutool draw
    -F - output format (default inferred from output file name)
    ...
    ocr'd text: ocr.txt, ocr.html, ocr.xhtml, ocr.stext

  • Example invocation:

    mutool draw -F ocr.txt -o x.txt x.pdf

In my test the OCR result was poor. Consider using OCRmyPDF instead, which is also open source and based on Tesseract.
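If you want to run the invocation above over a whole folder, a small wrapper script can build the same command for each file. This is only a sketch: it assumes `mutool` from the -tesseract download is on your PATH and that the setup described in TESSERACT.txt has been done. The function names `ocr_command` and `ocr_folder` are hypothetical helpers, not part of MuPDF.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(pdf, out, fmt="ocr.txt"):
    """Build the argument list for: mutool draw -F <fmt> -o <out> <pdf>"""
    return ["mutool", "draw", "-F", fmt, "-o", str(out), str(pdf)]


def ocr_folder(folder):
    """Run OCR text extraction on every *.pdf in folder.

    Each result is written next to its source file with a .txt suffix.
    Returns the list of output paths.
    """
    if shutil.which("mutool") is None:
        raise FileNotFoundError("mutool was not found on PATH")
    outputs = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        out = pdf.with_suffix(".txt")
        # check=True raises CalledProcessError if mutool exits with an error
        subprocess.run(ocr_command(pdf, out), check=True)
        outputs.append(out)
    return outputs
```

For example, `ocr_folder("scans")` would write `scans/x.txt` for `scans/x.pdf`. Passing `fmt="ocr.html"` to `ocr_command` selects one of the other OCR output formats listed in the help text.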
