
I am using Tesseract to convert printed text documents captured by my cell phone camera into text. The results are not great. The quality of the image is very good, far clearer than a fax, but Tesseract seems to have a very difficult time identifying characters.

I've also tried mimicking one of these documents in a text editor, taking a screenshot of the window, and running that through Tesseract, but the results are only marginally better.

This leads me to believe there's probably an optimal font for Tesseract. I Googled a bit and came across OCR-A, but it apparently requires a license. I then stumbled upon a free OCR-A alternative on SourceForge, but it doesn't appear to fare much better than Arial or Courier New.

Is there a font that works best with Tesseract or do I need to do something else to increase the accuracy of the character recognition?

4
  • You do have the correct dictionary loaded, right?
    – Daniel B
    Commented Jul 3, 2016 at 16:20
  • @DanielB Good point. I am actually using this as a means to convert relatively small data files to base64 and then printing them on paper for backup. It's sort of the same idea behind Paperback. Any idea how to create my own custom dictionary? I could try creating a dictionary of every possible base64 string and see if that helps with the accuracy.
    – user613051
    Commented Jul 3, 2016 at 17:58
  • Why not also print QR codes next to the text?
    Commented Jul 3, 2016 at 18:32
  • @MátéJuhász I've considered generating QR codes because of the amount of data they can hold, but haven't gotten around to looking for QR code reader apps that don't require every permission known to humankind
    – user613051
    Commented Jul 3, 2016 at 18:57

3 Answers

4

I've done an experiment to answer this question.

  • Generate a document with 6000 random characters from the base64 character set (basically all upper- and lower-case letters plus digits).
  • For each font on my system (a Linux box), generate an image with the same content
  • Give it to Tesseract
  • Measure the error rate / accuracy
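
The steps above could be sketched roughly as follows. This is not the script used for the experiment, only a minimal illustration under some assumptions: the error rate is taken to be a character-level edit distance divided by the expected length, images are rendered with Pillow (a third-party library, imported lazily), and the `tesseract` binary is on the `PATH`. The function names, font size, and line length are all hypothetical.

```python
import random
import string
import subprocess

# Assumption: letters and digits only, as described above.
# Full base64 also uses '+' and '/'.
ALPHABET = string.ascii_letters + string.digits


def random_text(n_chars=6000, line_len=60, seed=0):
    """Return n_chars random characters, wrapped into lines of line_len."""
    rng = random.Random(seed)
    chars = [rng.choice(ALPHABET) for _ in range(n_chars)]
    rows = ["".join(chars[i:i + line_len]) for i in range(0, n_chars, line_len)]
    return "\n".join(rows)


def edit_distance(a, b):
    """Levenshtein distance. Pure Python and quadratic, so slow on long texts."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def error_rate(expected, recognized):
    """Character error rate, ignoring all whitespace on both sides."""
    e = "".join(expected.split())
    r = "".join(recognized.split())
    return edit_distance(e, r) / max(len(e), 1)


def render(text, font_path, out_path, font_size=32):
    """Draw black text on a white image. Needs Pillow 8.0 or newer."""
    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.truetype(font_path, font_size)
    probe = ImageDraw.Draw(Image.new("L", (1, 1)))
    box = probe.multiline_textbbox((0, 0), text, font=font)
    img = Image.new("L", (box[2] + 40, box[3] + 40), 255)
    ImageDraw.Draw(img).multiline_text((20, 20), text, font=font, fill=0)
    img.save(out_path)


def ocr(image_path):
    """Run the tesseract CLI and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "6"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For each font file on the system, one would call `render(text, font_path, "page.png")`, then `error_rate(text, ocr("page.png"))`, and sort the fonts by the resulting rate.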

Here are the results for Tesseract v4.1.1. These are the top-performing fonts:

  • mitra
  • TeX_Gyre_Bonum
  • DejaVu_Serif
  • Roboto
  • Cantarell

See also this wrap-up: https://www.monperrus.net/martin/perfect-ocr-digital-data

0
1

I use tesseract-ocr a lot, and in my experience only two things improve its performance: the source image being in TIFF format, and the physical size of the text in the image. Consequently, I run it against the original image and against copies resized to 200%, 400%, and 800%. For each of the resulting texts, I count the number of words flagged as misspelled and choose accordingly.
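
A rough sketch of that resize-and-compare loop, under some assumptions: Pillow does the resizing (imported lazily), the `tesseract` binary is on the `PATH`, "choose accordingly" means keeping the output with the fewest unknown words, and the spell check is a plain lookup in a word set that you supply. The function names and the temporary file naming are hypothetical.

```python
import re
import subprocess

SCALES = (1, 2, 4, 8)  # 100%, 200%, 400%, 800%


def count_misspelled(text, known_words):
    """Count alphabetic words that are not in known_words (lower-case set)."""
    words = re.findall(r"[A-Za-z]+", text)
    return sum(1 for w in words if w.lower() not in known_words)


def pick_best(candidates, known_words):
    """Return the (scale, text) pair with the fewest unknown words."""
    return min(
        candidates.items(),
        key=lambda item: count_misspelled(item[1], known_words),
    )


def ocr_at_scales(image_path, scales=SCALES):
    """OCR the image at each scale; returns {scale: recognized text}."""
    from PIL import Image  # third-party: Pillow

    results = {}
    with Image.open(image_path) as img:
        for scale in scales:
            # Large photos at 8x can need a lot of memory.
            size = (img.width * scale, img.height * scale)
            tiff_path = f"{image_path}.x{scale}.tif"
            img.resize(size, Image.LANCZOS).save(tiff_path)
            proc = subprocess.run(
                ["tesseract", tiff_path, "stdout"],
                capture_output=True, text=True, check=True,
            )
            results[scale] = proc.stdout
    return results
```

With a word set loaded from any word list you trust, `pick_best(ocr_at_scales("scan.png"), known_words)` returns the scale and text that produced the fewest unknown words. This only helps for natural-language text; random base64 data would need a different score.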

Certainly the font affects Tesseract's performance, but I don't see its relevance to your situation. Aren't you stuck with whatever font was used to produce the text document you photograph?

0

Your best choice is to train it for whatever font you are using.

I don't want to pretend this is an easy process; it isn't, but it should work better. Also, most OCR programs favor 300 dpi or 600 dpi, so upscaling may be necessary.

The Tesseract GitHub wiki has some good resources on training Tesseract.
