Optimal font for Tesseract? (specifically the .NET wrapper)

Question

I am using Tesseract to convert printed text documents, photographed with my cell phone camera, into text. The results are not great. The image quality is very good, far clearer than a fax, but Tesseract still has a very difficult time identifying characters.

I've also tried mimicking one of these documents in a text editor, taking a screenshot of the window, and running that through Tesseract, but the results are only marginally better.

This leads me to believe there's probably an optimal font for Tesseract. I Googled a bit and came across OCR-A, but it apparently requires a license. I then stumbled upon a free OCR-A alternative on SourceForge, but it doesn't appear to fare much better than Arial or Courier New.

Is there a font that works best with Tesseract or do I need to do something else to increase the accuracy of the character recognition?

@DanielB Good point. I am actually using this as a means to convert relatively small data files to base64 and then printing them on paper for backup. It's sort of the same idea behind Paperback. Any idea how to create my own custom dictionary? I could try creating a dictionary of every possible base64 string and see if that helps with the accuracy. — user613051, Commented Jul 3, 2016 at 17:58
@MátéJuhász I've considered generating QR codes because of the amount of data they can hold, but haven't gotten around to looking for QR code reader apps that don't require every permission known to humankind — user613051, Commented Jul 3, 2016 at 18:57

Martin Monperrus · Accepted Answer · 2020-11-16 16:41:22Z

4

I've done an experiment to answer this question.

1. Generate a document of 6000 random characters drawn from the base64 character set (basically all upper- and lower-case letters plus digits).
2. For each font on my system (a Linux box), generate an image with the same content.
3. Give each image to Tesseract.
4. Measure the error rate / accuracy.
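
The generation and measurement steps above can be sketched in Python as follows. This is a minimal sketch, not the code used for the reported results. The function names are hypothetical, the alphabet is assumed to be letters plus digits as described, and the error rate is assumed to be the character-level edit distance divided by the reference length. Rendering each font to an image and calling Tesseract are left out, so `ocr_output` stands in for whatever text Tesseract returns.

```python
import random
import string

# Assumption: letters and digits only, as described in the answer
# (the full base64 alphabet also contains '+' and '/').
ALPHABET = string.ascii_letters + string.digits


def random_text(n_chars, seed=0, line_length=60):
    """Return n_chars random characters, wrapped into fixed-length lines."""
    rng = random.Random(seed)
    chars = [rng.choice(ALPHABET) for _ in range(n_chars)]
    lines = ["".join(chars[i:i + line_length])
             for i in range(0, n_chars, line_length)]
    return "\n".join(lines)


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference, ocr_output):
    """Edit distance divided by reference length, ignoring whitespace."""
    ref = "".join(reference.split())
    out = "".join(ocr_output.split())
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, out) / len(ref)
```

Calling `char_error_rate(reference, ocr_output)` once per font gives one number per font, and sorting the fonts by that number produces a ranking. The pure-Python edit distance is quadratic in the text length, so it is slow on a 6000-character document.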

Here are the top-performing fonts for Tesseract v4.1.1:

mitra
TeX_Gyre_Bonum
DejaVu_Serif
Roboto
Cantarell

See also this wrap-up: https://www.monperrus.net/martin/perfect-ocr-digital-data

edited Nov 16, 2020 at 16:41

answered Apr 19, 2020 at 8:07

Martin Monperrus


user985675 · Accepted Answer · 2021-09-28 19:49:57Z

I use tesseract-ocr a lot, and in my experience only two things improve its performance: supplying the source image in TIFF format, and increasing the physical size of the text in the image. Consequently, I run it against the original image and against copies resized to 200%, 400% and 800%. For each of the resulting texts I count the number of words flagged as misspelled and keep the one with the fewest.
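
The selection step can be sketched as follows, assuming the OCR text for each scale has already been produced. The function names and the `known_words` set are hypothetical stand-ins: the answer does not say which spell checker is used, so a plain word list plays that role here.

```python
import re


def count_misspelled(text, known_words):
    """Count words in text that are not in the known_words set."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return sum(1 for word in words if word not in known_words)


def pick_best(candidates, known_words):
    """Return (scale, text) for the candidate with the fewest misspelled words.

    candidates maps a scale in percent (100, 200, ...) to the OCR text
    produced at that scale. Ties go to the smallest scale.
    """
    best_scale = min(
        sorted(candidates),
        key=lambda scale: count_misspelled(candidates[scale], known_words),
    )
    return best_scale, candidates[best_scale]
```

Note that this heuristic only works for natural-language text. For random base64 data nearly every "word" would be flagged as misspelled at every scale.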

Certainly the font affects tesseract's performance, but I don't see its relevance to your situation. Aren't you stuck with whatever font was used to produce the text document you photograph?

Martin Monperrus · Accepted Answer · 2020-04-19 11:31:54Z

0

Your best choice is to train it for whatever font you are using.

I don't want to pretend this is an easy process; it isn't, but it should work better. Also, most OCR programs favor 300 dpi or 600 dpi, so upscaling may be necessary.
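
As a rough illustration of the upscaling arithmetic, the sketch below estimates the effective resolution of a capture and the resize factor needed to reach a target. The function names are hypothetical, and it assumes the page fills the image width, so that the pixel width divided by the page width in inches approximates the dpi.

```python
def effective_dpi(pixels_across, inches_across):
    """Resolution of a capture: pixels spanning the page / page width in inches."""
    if inches_across <= 0:
        raise ValueError("inches_across must be positive")
    return pixels_across / inches_across


def upscale_factor(pixels_across, inches_across, target_dpi=300):
    """Factor by which to resize the image to reach target_dpi (at least 1.0)."""
    return max(1.0, target_dpi / effective_dpi(pixels_across, inches_across))
```

For example, a photo 1700 pixels wide of an 8.5-inch-wide page is about 200 dpi, so reaching 300 dpi means resizing by a factor of 1.5.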

The Tesseract GitHub wiki has some good resources on training Tesseract.

edited Apr 19, 2020 at 11:31

Martin Monperrus


answered Jul 3, 2016 at 18:02

cybernard
