I've installed Tesseract OCR 5.3.0 on Debian 12.

I want to run OCR on this PNG file:

[Image: cp1.png]

When I run:

tesseract cp1.png cp1

the output cp1.txt contains unexpected garbage:

y seeseeggegegegenagesseagegs

feésidaedsdcsdasaredadacd

sgsessesesssesagess

B isgsddsadsdecansas

geverdcdessaguce sses

SERRRERRRRSRSRSERRRERSEsesR
an

Why?

  • Picture's too small & fuzzy. Readiris can't read it either… though it does a slightly better job, some of it [maybe half] comes out as actual numbers, though not necessarily the right ones. – Tetsujin Commented Jan 31 at 8:05
  • @Tetsujin Thanks! I tried another one of better quality and it worked better. Commented Jan 31 at 8:46
  • @DrMoishePippik I had no clue, it was my very first OCR. I didn't realize the image itself could be too bad. Commented Jan 31 at 18:06
  • @MarcLeBihan, sorry, I wasn't criticizing your effort! The image with which you're working was causing the issue, and I wrote an answer on how an image can be improved for OCR, if one has no control of the source. Commented Jan 31 at 18:29
  • I’m voting to close this question because the image needs to be better. There is no other answer required. Commented Feb 15 at 7:48

1 Answer

OCR depends on clear images. If details are even slightly unclear to a human reader, OCR will have more difficulty still in discerning the characters.

Ideally, when scanning or photographing text, the image should be optimized so that there is clear contrast between text and background. Wrinkles and folds should be minimized, e.g., by perpendicular lighting in photography, or moderate pressure in scanning. If there are colored stains, images can be adjusted to remove blotches of that color.

Images can also be improved afterwards for use in OCR. About three minutes were spent in the free IrfanView to produce the image below from the one in the question. It was processed "by inspection" (by eye) to decrease gamma, increase contrast, and increase sharpness. This processing could be improved further by running each variant through the OCR tool and keeping the settings that give the most accurate output.

Processed image
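
For anyone who would rather script these adjustments than do them by hand, here is a minimal sketch of two of the steps (gamma and contrast) in plain Python. It is not what IrfanView does internally. It assumes the image has already been loaded as rows of grayscale values from 0 (black) to 255 (white), and the function names adjust_gamma and stretch_contrast are made up for this illustration. It also assumes the common convention out = 255 * (in / 255) ** (1 / gamma), under which a gamma below 1 darkens the midtones. Sharpening is left out.

```python
# Illustrative sketch only: grayscale adjustments on a list of pixel rows.
# Each pixel is an int in 0..255 (0 = black, 255 = white).


def adjust_gamma(pixels, gamma):
    """Apply out = 255 * (in / 255) ** (1 / gamma) to every pixel.

    With this (assumed) convention, gamma < 1 darkens midtones,
    which can make faint dark text stand out from a light background.
    """
    # Precompute the mapping once for all 256 possible input values.
    table = [round(255 * (v / 255) ** (1.0 / gamma)) for v in range(256)]
    return [[table[v] for v in row] for row in pixels]


def stretch_contrast(pixels):
    """Linearly rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in pixels]


if __name__ == "__main__":
    # Tiny made-up "image": washed-out grays, as in a low-contrast scan.
    sample = [[100, 120], [200, 180]]
    processed = stretch_contrast(adjust_gamma(sample, 0.5))
    print(processed)
```

In practice an imaging library would do the loading, saving, and sharpening. The point of the sketch is only that each step is a simple per-pixel mapping whose parameters can be tuned against the OCR result.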

In addition, if one is using Tesseract extensively on similar data, it is possible to train the tool to recognize specific fonts and specific characters. If one is dealing just with numerical data, for example, Tesseract can be trained to recognize only digits, punctuation and spaces, increasing accuracy. That training takes some effort, and might be worthwhile only for a long-term project with voluminous data (e.g., digitizing many back-issues of a newspaper that used only a few fonts).
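
Short of full training, a lighter alternative (a different technique from the training described above) is to restrict the characters Tesseract may output through its tessedit_char_whitelist configuration variable. The sketch below only builds and runs the command line. The helper names, the file names, and the choice of page segmentation mode 6 (a single uniform block of text) are assumptions for illustration, and whether a whitelist actually helps will depend on the image.

```python
# Illustrative sketch only: call the tesseract CLI with a character whitelist.
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, whitelist="0123456789"):
    """Return the argument list for a digits-only tesseract run."""
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",  # assumption: the page is one uniform block of text
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path, output_base, whitelist="0123456789"):
    """Run tesseract; the text ends up in <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    command = build_tesseract_command(image_path, output_base, whitelist)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run it.
    print(" ".join(build_tesseract_command("cp1_processed.png", "cp1")))
```

Calling run_ocr("cp1_processed.png", "cp1") would then write cp1.txt, provided tesseract is installed and the (hypothetical) processed image exists.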
