I need to extract text from images like the one below:

example image

As you can see, the text is typed, not handwritten. Moreover, the background is colorful.

I've tried Tesseract OCR, and while it works some of the time, it fails miserably on some inputs. For the above example, it produces "Due CoN aicomrBi em Cela RTL".

Which command-line OCR software would you recommend? If Tesseract is my best bet, can I transform these images to make it easier for Tesseract to recognize the characters?

EDIT: Based on @MarcusMüller's suggestion, I used convert -threshold 55% to better separate the foreground text from the background. The resulting images are much better!

binarized image

Alas, Tesseract is still useless. On this new image, it produces: "Bim KM ioes Bm Meme e Cera".

As such, the question remains open.

  • I'm sure there's some foreground-selecting preprocessing you could do. Commented Nov 15, 2022 at 19:41
  • @MarcusMüller Thanks for responding. I'm completely new to OCR. What does foreground-selecting mean in this context? How can I do it? I have tried binarizing the image, but that makes it worse due to the colorful background.
    – user549392
    Commented Nov 15, 2022 at 19:43
  • Exactly what you do with binarizing, but more smartly: an algorithm that detects which of the pixels constitute background and which are foreground, to make foreground black and all the background white (or vice versa). Commented Nov 15, 2022 at 19:48
  • @MarcusMüller That was extremely helpful. Unfortunately, Tesseract still doesn't produce the right result. I've updated the question with the new results after applying a better algorithm.
    – user549392
    Commented Nov 15, 2022 at 20:32
  • Try inverting the image. I mean, a threshold is not the most clever foreground extraction method. There are a few things in OpenCV, as well. I bet someone also has nice pretrained models that were meant for this kind of image segmentation. In your example, you might even be able to train one yourself… but that would maybe really be overkill (and by far exceed any reasonable scope here). Commented Nov 15, 2022 at 20:38

1 Answer

Poor OCR performance on an uneven background can probably be improved by preprocessing the image to extract the foreground before running OCR.

There are many techniques for image segmentation / foreground extraction. It seems you have had good results with a threshold! Maybe play around with that, or use more advanced extractors (e.g., from OpenCV), or train a neural network to do the segmentation for you.
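As one example of a "smarter" threshold than a fixed 55%, here is a minimal sketch of Otsu's method, which picks the cut-off from the image's own histogram. It is written in plain Python on a list of rows of grayscale values (0 to 255) so it runs without extra libraries; the function names are mine, not from any library. In practice you would more likely use OpenCV's built-in version (cv2.threshold with the cv2.THRESH_OTSU flag). Whether this helps on your particular images is something you would have to try.

```python
def otsu_threshold(gray):
    """Return the threshold t (0..255) that maximizes between-class variance.

    gray is a list of rows, each a list of ints in 0..255 (hypothetical input
    format chosen for this sketch). Pixels <= t form one class, pixels > t the
    other.
    """
    # Build the 256-bin histogram of gray levels.
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break               # upper class is empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, t):
    """Map pixels above t to white (255) and all others to black (0)."""
    return [[255 if v > t else 0 for v in row] for row in gray]


if __name__ == "__main__":
    # Tiny made-up image: three dark pixels and three bright ones.
    sample = [[10, 12, 200],
              [11, 210, 205]]
    t = otsu_threshold(sample)
    print(t, binarize(sample, t))
```

A single global threshold like this can still fail when the background brightness varies a lot across the image; in that case a locally adaptive threshold (OpenCV has cv2.adaptiveThreshold) may be worth trying instead.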

Also note that OCR might work better with dark text on bright ground, so inversion might be necessary.
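A rough way to decide automatically whether to invert: in a binarized image, text usually covers fewer pixels than the background, so if black pixels are the majority, the text is probably the white part and the image should be flipped. The sketch below assumes exactly that (text is the minority class) and works on a list of rows of 0/255 values; the function name is made up for illustration. With ImageMagick, the inversion itself is the -negate option.

```python
def ensure_dark_on_light(binary):
    """Return the image with dark text on a light background.

    binary is a list of rows containing only 0 (black) and 255 (white).
    Assumption for this sketch: text occupies less than half of the image,
    so the majority color is taken to be the background.
    """
    pixels = [v for row in binary for v in row]
    dark = sum(1 for v in pixels if v == 0)
    if dark * 2 > len(pixels):
        # Background is black, so invert every pixel.
        return [[255 - v for v in row] for row in binary]
    # Background is already white; leave the image unchanged.
    return binary


if __name__ == "__main__":
    # Mostly black image with one white "text" pixel gets inverted.
    print(ensure_dark_on_light([[0, 0, 0], [0, 255, 0]]))
```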
