Training Tesseract-OCR for English language fonts

Question

I have about 3000 small images of single words that I am trying to convert to text. I have installed Tesseract on my Windows 7 machine using the installer and successfully managed to OCR images through cmd and PowerShell.

 tesseract.exe imagename.png imagename

produces a text file with the converted text.
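
For the full set of about 3000 images, the same call can be scripted instead of typed by hand. The following is only a sketch: the helper names `build_command` and `ocr_folder` are made up for illustration, and it assumes the images are `.png` files in one folder and that `tesseract` is on the PATH.

```python
import subprocess
from pathlib import Path


def build_command(image_path, exe="tesseract"):
    """Build the same call as above: tesseract imagename.png imagename."""
    image_path = Path(image_path)
    # Tesseract appends ".txt" to the output base itself, so drop the suffix.
    output_base = image_path.with_suffix("")
    return [exe, str(image_path), str(output_base)]


def ocr_folder(folder, exe="tesseract"):
    """Run Tesseract once per .png file in the folder (hypothetical layout)."""
    for image_path in sorted(Path(folder).glob("*.png")):
        # check=True raises CalledProcessError if Tesseract exits with an error.
        subprocess.run(build_command(image_path, exe), check=True)
```

Calling `ocr_folder("words")` would then leave one text file next to each image in a folder named `words`.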

The results were terrible: only about 40% of the characters were converted correctly. I would like to improve on this.

Does anyone know what optional configurations can be given in this command? The documented usage is:

tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

Also, could someone describe the training procedure? I am finding the documentation hard to understand. I know that my text is in Times New Roman. Do I need to train Tesseract for Times New Roman, or is that already built in? Alternatively, is it possible to download files that allow Tesseract to recognize it?

I found some docs for training code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 — andrew, Commented Jan 19, 2011 at 20:44
After reading the instructions that @andrew (you) found, what part are you not understanding? How far have you gotten in that process? — Everett, Commented Aug 19, 2012 at 9:20

Pranaysharma · Accepted Answer · 2013-12-08 19:08:34Z

One way to improve the results is to preprocess the images before OCR, for example by removing any skew and thresholding them. You can use OpenCV for this. After that, you can train Tesseract on your text.
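
To make the thresholding step concrete, here is a minimal sketch of Otsu's method, which is one common way to pick a threshold automatically. It is written in plain Python so it runs without dependencies. The function names `otsu_threshold` and `binarize` are my own, and an 8-bit greyscale image held as a list of rows is an assumption. In practice OpenCV does the same job with `cv2.threshold` and the `cv2.THRESH_OTSU` flag. Deskewing is not shown here.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit grey values (0..255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximises it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(rows, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if p > threshold else 0 for p in row] for row in rows]
```

For a tiny image such as `[[200, 210, 30], [205, 25, 35]]`, `otsu_threshold` returns 35, and `binarize` then yields `[[255, 255, 0], [255, 0, 0]]`: dark text pixels become 0 and the light background becomes 255.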

