Tesseract OCR: why does a PNG image containing computer digits return garbage when I OCR it to a text file?

Question

I've installed Tesseract OCR 5.3.0 on Debian 12.

I want to scan and ocr this png file:

When I execute:

tesseract cp1.png cp1

the output cp1.txt contains unexpected garbage:

y seeseeggegegegenagesseagegs

feésidaedsdcsdasaredadacd

sgsessesesssesagess

B isgsddsadsdecansas

geverdcdessaguce sses

SERRRERRRRSRSRSERRRERSEsesR
an

Why?

Picture's too small & fuzzy. Readiris can't read it either… though it does a slightly better job, some of it [maybe half] comes out as actual numbers, though not necessarily the right ones. — Tetsujin, Commented Jan 31 at 8:05
@Tetsujin Thanks! I tried another one of better quality and it worked better. — Marc Le Bihan, Commented Jan 31 at 8:46
@DrMoishePippik I had no clue, it was my very first OCR. I didn't figure the image itself could be too bad — Marc Le Bihan, Commented Jan 31 at 18:06
@MarcLeBihan, sorry, I wasn't criticizing your effort! The image with which you're working was causing the issue, and I wrote an answer on how an image can be improved for OCR, if one has no control of the source. — DrMoishe Pippik, Commented Jan 31 at 18:29
I’m voting to close this question because the image needs to be better. There is no other answer required. — Rohit Gupta, Commented Feb 15 at 7:48

DrMoishe Pippik · Accepted Answer · 2024-01-31 18:38:11Z

OCR depends on clear images. If details are a bit unclear to a human reader, OCR would have even more difficulty discerning characters.

Ideally, when scanning or photographing text, the image should be optimized so that there is clear contrast between text and background. Wrinkles and folds should be minimized, e.g., by perpendicular lighting in photography, or moderate pressure in scanning. If there are colored stains, images can be adjusted to remove blotches of that color.

Images can also be improved afterwards for use in OCR. About three minutes were spent using the free IrfanView to produce the image below from the one in the question. It was processed "by inspection" to decrease gamma, increase contrast, and increase sharpness, but this processing could be refined by testing each variant with the OCR tool to optimize accuracy.
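The same three adjustments can also be scripted rather than done by hand. The following is only a rough sketch in Python, not what was done in IrfanView: the names `build_lut` and `preprocess_for_ocr`, the default values, and the upscaling step are all assumptions chosen for illustration, and the file-handling part assumes the Pillow library is installed. The gamma convention used here (values below 1 darken midtones) is also an assumption and may not match IrfanView's slider exactly.

```python
def build_lut(gamma=1.0, contrast=1.0):
    """Build a 256-entry lookup table for 8-bit grayscale pixels.

    gamma < 1 darkens midtones (assumed convention: out = in ** (1 / gamma)).
    contrast > 1 stretches values away from mid-gray (128).
    """
    lut = []
    for v in range(256):
        g = 255.0 * (v / 255.0) ** (1.0 / gamma)   # gamma correction
        c = 128.0 + contrast * (g - 128.0)         # contrast around mid-gray
        lut.append(max(0, min(255, int(round(c)))))  # clamp to 0..255
    return lut


def preprocess_for_ocr(src, dst, gamma=0.7, contrast=1.5, scale=2):
    """Hypothetical helper: grayscale, enlarge, adjust, sharpen, save."""
    # Pillow is imported here so build_lut stays usable without it.
    from PIL import Image, ImageFilter

    img = Image.open(src).convert("L")             # 8-bit grayscale
    img = img.resize((img.width * scale, img.height * scale), Image.LANCZOS)
    img = img.point(build_lut(gamma, contrast))    # apply gamma + contrast
    img = img.filter(ImageFilter.SHARPEN)          # mild sharpening
    img.save(dst)
```

A call such as `preprocess_for_ocr("cp1.png", "cp1_clean.png")` would write an adjusted copy that can then be passed to `tesseract`. The default numbers are starting points only; they should be tuned by checking the OCR output after each change.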

In addition, if one is using Tesseract extensively on similar data, it is possible to train the tool to recognize specific fonts and specific characters. If one is dealing just with numerical data, for example, Tesseract can be trained to recognize only digits, punctuation and spaces, increasing accuracy. That training takes some effort, and might be worthwhile only for a long-term project with voluminous data (e.g., digitizing many back-issues of a newspaper that used only a few fonts).
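Short of full training, a lighter-weight way to bias Tesseract toward numerical data is to restrict the characters it may output. This is a different technique from the training described above, and the sketch below is only an illustration: the helper names and the extra punctuation characters are assumptions, and it relies on the `tessedit_char_whitelist` configuration variable and the `--psm` option being honored by the installed Tesseract version and engine.

```python
import subprocess


def digits_only_command(image_path, output_base, extra_chars=".,-"):
    """Build a tesseract command line limited to digits and some punctuation.

    extra_chars is an illustrative assumption; adjust it to the data.
    """
    whitelist = "0123456789" + extra_chars
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",  # treat the image as one uniform block of text
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_digits_only(image_path, output_base):
    """Run the command; raises CalledProcessError if tesseract fails."""
    subprocess.run(digits_only_command(image_path, output_base), check=True)
```

For the file in the question, `run_digits_only("cp1.png", "cp1")` would write `cp1.txt` as before. A whitelist cannot rescue an image that is too small or fuzzy; it only prevents letters from appearing where digits are expected.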
