
I am working on a project where I have to read text from a document image. In the initial stage I will read machine-printed documents and then eventually move on to images of handwritten documents. However, I am doing this for learning purposes, so I don't intend to use APIs like Tesseract. I intend to do it in steps:

  1. Preprocessing (Blurring, Thresholding, Erosion & Dilation)

  2. Character Segmentation

  3. OCR (or ICR in later stages)

So I am working on character segmentation right now. I recently did it using horizontal and vertical histograms (projection profiles), but I was not able to get very good results for some fonts, such as in the image shown below.

Document's Image

Is there any other method or algorithm to do the same? Any help will be appreciated!
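For reference, the horizontal/vertical histogram (projection-profile) approach mentioned above can be sketched as follows. This is a minimal NumPy-only illustration on a synthetic binary image (text pixels = 1), not the actual code used; the function names are just illustrative, and a real document would first need binarization and deskewing:

```python
import numpy as np

def segment_lines(binary):
    """Split a binary image (text = 1, background = 0) into line bands
    using the horizontal projection (sum of each row)."""
    rows = binary.sum(axis=1)
    bands, start = [], None
    for y, v in enumerate(rows):
        if v > 0 and start is None:
            start = y
        elif v == 0 and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, binary.shape[0]))
    return bands

def segment_chars(band):
    """Split one line band into character columns using the
    vertical projection (sum of each column)."""
    cols = band.sum(axis=0)
    chars, start = [], None
    for x, v in enumerate(cols):
        if v > 0 and start is None:
            start = x
        elif v == 0 and start is not None:
            chars.append((start, x))
            start = None
    if start is not None:
        chars.append((start, band.shape[1]))
    return chars

# Tiny synthetic page: one text line containing two "characters"
img = np.zeros((5, 10), dtype=int)
img[1:4, 1:3] = 1   # first character blob
img[1:4, 5:8] = 1   # second character blob
(y0, y1), = segment_lines(img)
print(segment_chars(img[y0:y1]))  # → [(1, 3), (5, 8)]
```

This is exactly the approach that breaks down on skewed or tightly-kerned fonts, since touching or overlapping characters leave no zero-valued gap in the projection.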

Edit 1:

The result I got after detecting blobs using cv2.SimpleBlobDetector. Results

The result I got after using cv2.findContours.

  • Check out THIS and THIS for character segmentation.
    – Jeru Luke
    Commented Jan 16, 2017 at 9:06

1 Answer


A first option is deskewing, i.e. measuring the skew angle and correcting it. You can achieve this, for instance, by Gaussian filtering or erosion in the horizontal direction, so that the characters widen and come into contact. Then binarize and thin the result, or find the lower edges of the blobs (or directly the directions of the blobs). You will get slightly oblique line segments which give you the skew direction.


When you know the skew direction, you can counter-rotate to perform deskewing. The vertical histogram will then reliably separate the lines, and you can use a horizontal histogram in each of them.
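The lower-edge idea can be sketched as below: a minimal NumPy-only illustration that fits a least-squares line to the bottom-most text pixel of each column and returns its angle. The function name is hypothetical; in a real pipeline you would first merge the characters by horizontal filtering/erosion as described, and then counter-rotate by the measured angle (e.g. with cv2.getRotationMatrix2D and cv2.warpAffine):

```python
import numpy as np

def estimate_skew_deg(binary):
    """Fit a least-squares line to the bottom-most text pixel of each
    column (the 'lower edge' of the merged text blob) and return its
    angle in degrees; positive means the line slopes downward."""
    xs, ys = [], []
    for x in range(binary.shape[1]):
        col = np.flatnonzero(binary[:, x])
        if col.size:
            xs.append(x)
            ys.append(int(col.max()))     # lowest foreground pixel
    slope, _ = np.polyfit(xs, ys, 1)      # least-squares line fit
    return float(np.degrees(np.arctan(slope)))

# Synthetic skewed text line drawn with slope 0.5
img = np.zeros((30, 40), dtype=int)
for x in range(40):
    img[10 + x // 2, x] = 1
print(estimate_skew_deg(img))  # close to degrees(arctan(0.5)) ≈ 26.6
```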

A second option, IMO much better, is to binarize the characters and perform blob detection. Proximity analysis of the bounding boxes will then allow you to determine chains of characters: they tell you the lines, and where the spacing is larger, delimit the words.
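This second option can be sketched as follows, with a NumPy-only connected-component labelling (BFS flood fill) followed by gap-based grouping of the bounding boxes. The function names and the `max_gap` threshold are illustrative assumptions; in practice cv2.findContours with cv2.boundingRect (or cv2.connectedComponentsWithStats) handles the labelling step:

```python
import numpy as np
from collections import deque

def blob_boxes(binary):
    """4-connected component labelling by BFS; returns one bounding
    box (x0, y0, x1, y1) per blob, sorted left to right."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                ys, xs = [sy], [sx]
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            ys.append(ny); xs.append(nx)
                            q.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)

def group_words(boxes, max_gap):
    """Chain boxes whose horizontal gap is <= max_gap: small gaps
    join characters of one word, larger gaps start a new word."""
    words, current = [], [boxes[0]]
    for box in boxes[1:]:
        if box[0] - current[-1][2] <= max_gap:
            current.append(box)
        else:
            words.append(current)
            current = [box]
    words.append(current)
    return words

# Three blobs: two close together (one word), one farther away
img = np.zeros((6, 20), dtype=int)
img[1:4, 1:3] = 1
img[1:4, 4:6] = 1
img[1:4, 12:15] = 1
words = group_words(blob_boxes(img), max_gap=3)
print([len(w) for w in words])  # → [2, 1]
```

A sensible `max_gap` is usually derived from the median character width or inter-box gap of the page rather than hard-coded.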


  • Thank you so much for the answer. I also think the second way is better. But if I plan to make a transition to handwriting recognition, the first way seems better, right? Commented Jan 17, 2017 at 5:16
  • @nishant.neo: not necessarily. Handwriting recognition is much, much harder. And if the lines are close together, it turns into a nightmare.
    – user1196549
    Commented Jan 17, 2017 at 6:59
  • I used SimpleBlobDetector on the binarized image, but didn't get the same results as yours. I obtained better results with findContours, but even those weren't as good as yours. Commented Jan 17, 2017 at 18:45
  • @YvesDaoust Hi, can you please post the code for your second example? Thanks
    – BlueTrack
    Commented Dec 21, 2017 at 8:32
  • @BlueTrack: this was made with proprietary software.
    – user1196549
    Commented Dec 21, 2017 at 8:33
