Using tesseract for character recognition, result is much worse than expected. How to get better results?

Asked 2 years, 6 months ago

Modified 2 years, 6 months ago

Viewed 395 times

I wanted to add the output of a Linux boot to my question and decided to try optical character recognition, thinking that in 2022 there surely should be decent open source options (I have not tried OCR for a long time). Links found via Web search "praise" tesseract: it is second best on the chart at https://www.linuxlinks.com/ocrtools/, and https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution says:

Tesseract is probably the most accurate open source OCR engine available.

I've installed it from the distro via apt-get and run it. The out-of-the-box result is IMO awful. Why? Maybe it can be easily fixed? Or please advise another package that does the job. The page I've tried to recognize has no pictures, so as I see it, it is a rather easy task. See the result below:

Edit: in fact, the result when only a small part is processed was much better, but when the whole image is processed the results are not OK. I understand that making the lines more horizontal and less skewed might help a lot; still, I was hoping software had become good at recognizing imperfectly aligned text.

oon usb 1-@: |
“3792661 usb 1-8: New USB device found, idVendor=1343, idProduct:

7.983163] usb 1-8: New USB dev bs P luct=5662, bedDevice=16.6?

re eh peeled haibbetaia a

: new high-speed USB device number 5 PhS |
i

Per Samm SCR Can)
t pela ee rcpt PP cay
: 2.998668) usb 1-8: er
t
Ct

When only a small part is processed:

2.837811) usb 1-8: new high-speed USB device number 5 using xhei_hed

2.979266] usb 1-8: New USB device ECU CREME Cnt ttc cain Tt teen Td
7.983163] usb 1-8: New USB device strings: Mfr=1, Product=2, SerialNumbers@

?.9869291 usb 1-8: Product: Integrated Camera

Added 1:

I tried again with a smaller and less skewed picture. I guess the software treats the timestamps as a separate column; I have not seen options on the man page to tweak that:

f a eg
| 7.849264]
Device= 6.44
f 7 .6492961
| 7.849355]
f 7.849415]
[ 7.849492]
| Van eos
fl 7.861846]
if Va ACB
| 7.864776]
if eel Be
Ha Bs) bs 4
if be A be ge
C ie BD LB
ce B)
te] Bs]
rage
lb eae
8.962076)
ie Ke Lb
9.600567)
9.696957)
9 .6970371

YS SF SS Se

usb 1-8: new high-speed USB device number 4 using xhci_hcd
usb 1-8: New USB device found, idVendor=04f2, idProduct=b449, bed

usb 1-8: New USB device strings: Mfr=3, Product=1, SerialNumber=2
usb 1-8: Product: Integrated Camera

usb 1-8: Manufacturer: Chicony Electronics Co.,Ltd.
usb 1-8: SerialNumber: 6x0001

usb-storage 1-1:1.6: USB Mass Storage device detected

scsi host3:

usb-storage 1-1:1.6

usbcore: registered new interface driver usb-storage
usbcore: registered new interface driver uas

scsi 3:0:6:@: Direct-fAccess General UDisk eg
sd 3:0:0:0: Attached scsi generic sgi type @

eM Pee PM eA PA ed) te) ae
Py Me ee dd

Py ee ee eee dm

sd 3:0:0:0: [sdb] Assuming drive cache: write through

sdb: sdbi sdb2 sdb3

sd 3:0:0:0: [sdb] Attached SCSI removable disk

squashfs: version 4.6 (2609/01/31) Phillip Lougher

Copying live image to RAM...
Ca ewe te Mae
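
On the timestamp-column problem above: Tesseract's command line has a page segmentation mode option, `--psm`, and mode 6 tells it to assume a single uniform block of text instead of detecting columns on its own. Whether that helps with this particular photo is untested. A minimal Python sketch, assuming Tesseract 4 or later (older 3.x releases spell the option `-psm`); `tesseract_command` and `ocr_image` are hypothetical helper names, and any image file name passed in is a placeholder:

```python
import shutil
import subprocess


def tesseract_command(image_path, psm=6, lang="eng"):
    """Build a tesseract command line that prints recognized text to stdout."""
    return [
        "tesseract",
        image_path,
        "stdout",            # special output base: write the text to standard output
        "-l", lang,
        "--psm", str(psm),   # 6 = assume a single uniform block of text
    ]


def ocr_image(image_path, psm=6, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr_image("boot.png")` would return the text for a file of that name; other `psm` values (for example 4, a single column of variable-sized text) can be tried the same way.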

edited Jan 10, 2022 at 7:13

asked Jan 10, 2022 at 6:35

Martian2020

I’m voting to close this question because it is about fine-tuning settings for analysis software. This is impossible to answer without the sample dataset and knowledge of the specific criteria. It would be better placed in a discussion forum on Tesseract or other general optical recognition tools.
– AdminBee
Commented Jan 10, 2022 at 8:09
Tesseract isn't great with fuzzy or skewed images (which is why ocrmypdf has a -d, --deskew option to help with PDFs made from poor scans). Fortunately, boot failure images are one of the few kinds of images in U&L questions that aren't likely to trigger complaints.
– cas
Commented Jan 10, 2022 at 8:41
If you want to try deskewing the images yourself, you could try converting your photo to a PDF and then using ocrmypdf, or save the image as a tiff and use tiff_findskew from pageutils. Not guaranteed to work, but I've had some great results with ocrmypdf's -d option on some PDFs made from abysmally bad scanned images. BTW ocrmypdf and pageutils are available as packages for Debian and derivatives, and probably for other distros too.
– cas
Commented Jan 10, 2022 at 8:42
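
The photo-to-PDF-then-ocrmypdf route suggested in the comment above could look roughly like this. It is only a sketch: using `img2pdf` for the conversion step is an assumption (the comment does not name a tool), the function names are hypothetical, and the file names are placeholders.

```python
import subprocess


def photo_to_pdf_command(image_path, pdf_path):
    """Command that wraps a single image in a one-page PDF (assumes img2pdf)."""
    return ["img2pdf", image_path, "-o", pdf_path]


def deskew_ocr_command(input_pdf, output_pdf):
    """Command that deskews the page and adds an OCR text layer."""
    # --deskew is the long form of the -d option mentioned in the comments
    return ["ocrmypdf", "--deskew", input_pdf, output_pdf]


def run_pipeline(image_path, scratch_pdf, output_pdf):
    """Run both steps in order, stopping at the first failure."""
    for cmd in (
        photo_to_pdf_command(image_path, scratch_pdf),
        deskew_ocr_command(scratch_pdf, output_pdf),
    ):
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline("boot.jpg", "boot-raw.pdf", "boot-ocr.pdf")` would leave the searchable PDF in `boot-ocr.pdf`, provided both tools are installed.
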
@cas, thank you. I've tried ocrmypdf. It produces text overlaid on the picture; when the text is copied, it is pasted as separate short objects / many short lines, so I do not get the long log lines as single lines. Summary: not OK.
– Martian2020
Commented Jan 10, 2022 at 12:14
Yeah, well, garbage in = garbage out. There's only so much that can be done to fix poor quality images. Try pdftotext -layout on the PDF generated by ocrmypdf - the -layout option often produces better results, especially on PDFs with multiple columns of text. Also BTW, if you have less configured to use lesspipe, it will automatically run pdftotext -layout if you use less to "view" a pdf. NOTE: pdftotext does not do OCR, it just extracts the text layer (if one exists) from a PDF.
– cas
Commented Jan 10, 2022 at 12:32
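
To pull the text layer out of the ocrmypdf result with `pdftotext -layout`, as suggested in the last comment, something like the following should do. This is a sketch with hypothetical function names; it assumes `pdftotext` (from poppler-utils) is installed and that the PDF already has a text layer, since `pdftotext` does no OCR itself.

```python
import subprocess


def layout_text_command(pdf_path):
    """Command that extracts text while keeping the physical page layout."""
    # "-" as the output file makes pdftotext write to standard output
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text(pdf_path):
    """Return the text layer of a PDF as a single string."""
    result = subprocess.run(
        layout_text_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Because `-layout` keeps text that sits on the same visual line together, it may bring each timestamp back onto the same line as its log message, which is the problem reported with plain copy and paste.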
