5

I am interested in using OCR to recognize text from a document that doesn't contain words. Rather, it is a document with a long string of "random" printed characters. I have been trying to use tesseract to scan the text, but it seems to be looking for words. Is there a way to tell tesseract to just do plain character recognition?

3
  • I have updated the question to fix the complaint.
    – Daniel
    Commented Aug 28, 2013 at 15:33
  • The old Presto! PageManager that came with the scanner did not spell-check by default (Windows); it has a spell checker, but it runs after OCR. I wonder whether you can make the dictionary disappear in any software that does auto-correction, so that it cannot apply it. OCR does not look at whole words by default, except maybe for alignment.
    – Psycogeek
    Commented Aug 28, 2013 at 17:04
  • 1
    @Daniel - Now it's a question that can actually be answered.
    – Ramhound
    Commented Aug 28, 2013 at 17:08

2 Answers

4

Yes, you can disable the dictionaries by defining a configuration file containing:

load_system_dawg F
load_freq_dawg F

Then pass that file's name as the last argument of the tesseract command, after the input image and the output base name.
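As a rough illustration of the above, here is a minimal Python sketch that writes such a configuration file and builds the matching command line. The file names `nodict.cfg`, `scan.png` and `out` are hypothetical placeholders, not anything from the question, and the sketch assumes a `tesseract` binary on the PATH that accepts a config file path as a trailing argument.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Hypothetical file name, chosen only for this sketch.
CONFIG_PATH = Path("nodict.cfg")


def write_nodict_config(path: Path = CONFIG_PATH) -> Path:
    """Write the two dictionary-disabling settings to a config file."""
    path.write_text("load_system_dawg F\nload_freq_dawg F\n")
    return path


def build_command(image: str, output_base: str, config: Path) -> List[str]:
    """Build the argument list: image, output base, then the config file."""
    return ["tesseract", image, output_base, str(config)]


if __name__ == "__main__":
    config = write_nodict_config()
    # "scan.png" and "out" are placeholder names (assumptions).
    cmd = build_command("scan.png", "out", config)
    print(" ".join(cmd))
    # Only run OCR if tesseract is installed and the image exists.
    if shutil.which("tesseract") and Path("scan.png").exists():
        subprocess.run(cmd, check=True)
```

If it runs, Tesseract should write the recognized text to `out.txt`. Whether disabling the dictionaries actually improves accuracy depends on the input.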

1
  • This does appear to do what I wanted. Sadly, the results aren't much better for the text that I was working with, but it does answer the question. Thanks!
    – Daniel
    Commented Oct 8, 2013 at 17:46
1

Tesseract does not work well here because it expects words and natural language.

For your use case, I've had success with gocr.

I can decode 15k of random characters with 100% accuracy; see https://www.monperrus.net/martin/store-data-paper

1
  • Your post assumes that you printed the text yourself and can influence the alphabet used. The question makes no such assumption.
    Commented Apr 25, 2020 at 11:32
