
I am working on a project where I have to read text from a document image. In the initial stage I will read machine-printed documents and then eventually move on to images of handwritten documents. However, I am doing this for learning purposes, so I don't intend to use APIs like Tesseract. I intend to do it in steps:

  1. Preprocessing (Blurring, Thresholding, Erosion & Dilation)

  2. Character Segmentation

  3. OCR (or ICR in later stages)

So I am working on character segmentation right now. I recently did it using horizontal and vertical histograms (projection profiles), but I was not able to get very good results for some fonts, such as in the image shown below.

Document's Image

Is there any other method or algorithm to do the same? Any help will be appreciated!
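For reference, the horizontal/vertical histogram (projection-profile) approach mentioned above can be sketched as follows. This is a minimal NumPy-only illustration on a synthetic binary image (text pixels = 1), not the actual code used; the function names are just illustrative, and a real document would first need binarization and deskewing:

```python
import numpy as np

def segment_lines(binary):
    """Split a binary image (text = 1, background = 0) into line bands
    using the horizontal projection (sum of each row)."""
    rows = binary.sum(axis=1)
    bands, start = [], None
    for y, v in enumerate(rows):
        if v > 0 and start is None:
            start = y
        elif v == 0 and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, binary.shape[0]))
    return bands

def segment_chars(band):
    """Split one line band into character columns using the
    vertical projection (sum of each column)."""
    cols = band.sum(axis=0)
    chars, start = [], None
    for x, v in enumerate(cols):
        if v > 0 and start is None:
            start = x
        elif v == 0 and start is not None:
            chars.append((start, x))
            start = None
    if start is not None:
        chars.append((start, band.shape[1]))
    return chars

# Tiny synthetic page: one text line containing two "characters"
img = np.zeros((5, 10), dtype=int)
img[1:4, 1:3] = 1   # first character blob
img[1:4, 5:8] = 1   # second character blob
(y0, y1), = segment_lines(img)
print(segment_chars(img[y0:y1]))  # → [(1, 3), (5, 8)]
```

This is exactly the approach that breaks down on skewed or tightly-kerned fonts, since touching or overlapping characters leave no zero-valued gap in the projection.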

Edit 1:

The result I got after detecting blobs using cv2.SimpleBlobDetector. Results

The result I got after using cv2.findContours.

  • Check out THIS and THIS for character segmentation.
    – Jeru Luke
    Commented Jan 16, 2017 at 9:06

1 Answer


A first option is deskewing, i.e. measuring the skew angle and correcting it. You can achieve this, for instance, by Gaussian filtering or erosion in the horizontal direction, so that the characters widen and come into contact. Then binarize and thin the result, or find the lower edges of the blobs (or directly the directions of the blobs). You will get slightly oblique line segments which give you the skew direction.


When you know the skew direction, you can counter-rotate to perform deskewing. The vertical histogram will then reliably separate the lines, and you can use a horizontal histogram in each of them.
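The lower-edge idea can be sketched as below: a minimal NumPy-only illustration that fits a least-squares line to the bottom-most text pixel of each column and returns its angle. The function name is hypothetical; in a real pipeline you would first merge the characters by horizontal filtering/erosion as described, and then counter-rotate by the measured angle (e.g. with cv2.getRotationMatrix2D and cv2.warpAffine):

```python
import numpy as np

def estimate_skew_deg(binary):
    """Fit a least-squares line to the bottom-most text pixel of each
    column (the 'lower edge' of the merged text blob) and return its
    angle in degrees; positive means the line slopes downward."""
    xs, ys = [], []
    for x in range(binary.shape[1]):
        col = np.flatnonzero(binary[:, x])
        if col.size:
            xs.append(x)
            ys.append(int(col.max()))     # lowest foreground pixel
    slope, _ = np.polyfit(xs, ys, 1)      # least-squares line fit
    return float(np.degrees(np.arctan(slope)))

# Synthetic skewed text line drawn with slope 0.5
img = np.zeros((30, 40), dtype=int)
for x in range(40):
    img[10 + x // 2, x] = 1
print(estimate_skew_deg(img))  # close to degrees(arctan(0.5)) ≈ 26.6
```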

A second option, IMO much better, is to binarize the characters and perform blob detection. Proximity analysis of the bounding boxes will then allow you to determine chains of characters: they tell you the lines, and where the spacing is larger, delimit the words.
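This second option can be sketched as follows, with a NumPy-only connected-component labelling (BFS flood fill) followed by gap-based grouping of the bounding boxes. The function names and the `max_gap` threshold are illustrative assumptions; in practice cv2.findContours with cv2.boundingRect (or cv2.connectedComponentsWithStats) handles the labelling step:

```python
import numpy as np
from collections import deque

def blob_boxes(binary):
    """4-connected component labelling by BFS; returns one bounding
    box (x0, y0, x1, y1) per blob, sorted left to right."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                ys, xs = [sy], [sx]
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            ys.append(ny); xs.append(nx)
                            q.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)

def group_words(boxes, max_gap):
    """Chain boxes whose horizontal gap is <= max_gap: small gaps
    join characters of one word, larger gaps start a new word."""
    words, current = [], [boxes[0]]
    for box in boxes[1:]:
        if box[0] - current[-1][2] <= max_gap:
            current.append(box)
        else:
            words.append(current)
            current = [box]
    words.append(current)
    return words

# Three blobs: two close together (one word), one farther away
img = np.zeros((6, 20), dtype=int)
img[1:4, 1:3] = 1
img[1:4, 4:6] = 1
img[1:4, 12:15] = 1
words = group_words(blob_boxes(img), max_gap=3)
print([len(w) for w in words])  # → [2, 1]
```

A sensible `max_gap` is usually derived from the median character width or inter-box gap of the page rather than hard-coded.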


  • Thank you so much for the answer. I also think the second way is better. But if I plan to make a transition to handwriting recognition, the first way seems better, right? Commented Jan 17, 2017 at 5:16
  • @nishant.neo: not necessarily. Handwriting recognition is much, much harder. And if the lines are close together, it turns into a nightmare.
    – user1196549
    Commented Jan 17, 2017 at 6:59
  • I used SimpleBlobDetector on the binarized image, but didn't get the same results as yours. I obtained better results with findContours, but even those weren't as good as yours. Commented Jan 17, 2017 at 18:45
  • @YvesDaoust Hi, can you please post the code for your second example? Thanks
    – BlueTrack
    Commented Dec 21, 2017 at 8:32
  • @BlueTrack: this was made with proprietary software.
    – user1196549
    Commented Dec 21, 2017 at 8:33
