I wanted to add output of Linux boot to my question and decided to try to use optical character recognition thinking now in 2022 surely there should be decent open source options (have not tried OCR for a long time). Links found via Web search "praise" tesseract
. https://www.linuxlinks.com/ocrtools/ second best on chart. https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution
Tesseract is probably the most accurate open source OCR engine available.
I've installed it from distro via apt-get and run. Result with out-of-the-box is IMO awful. Why? Maybe it can be ealily fixed? Or advice another package that does the job. The page I've tried to recognize lacks pictures, as I see it it is rather easy task. See below the result:
Edit: in fact result when that small part is processed were much better, but when whole is processed than results are not ok. I understand making lines more horizontal and not skewed might help a lot, still I was hoping software got good at recognizing non-perfectly aligned text.
oon usb 1-@: |
“3792661 usb 1-8: New USB device found, idVendor=1343, idProduct:
7.983163] usb 1-8: New USB dev bs P luct=5662, bedDevice=16.6?
re eh peeled haibbetaia a
: new high-speed USB device number 5 PhS |
i
Per Samm SCR Can)
t pela ee rcpt PP cay
: 2.998668) usb 1-8: er
t
Ct
When only small part is processed:
2.837811) usb 1-8: new high-speed USB device number 5 using xhei_hed
2.979266] usb 1-8: New USB device ECU CREME Cnt ttc cain Tt teen Td
7.983163] usb 1-8: New USB device strings: Mfr=1, Product=2, SerialNumbers@
?.9869291 usb 1-8: Product: Integrated Camera
Added 1:
Tried again smaller and less skewed picture, I guess software considers time stamps as separate column, I have not seen on man page options to tweak that:
f a eg
| 7.849264]
Device= 6.44
f 7 .6492961
| 7.849355]
f 7.849415]
[ 7.849492]
| Van eos
fl 7.861846]
if Va ACB
| 7.864776]
if eel Be
Ha Bs) bs 4
if be A be ge
C ie BD LB
ce B)
te] Bs]
rage
lb eae
8.962076)
ie Ke Lb
9.600567)
9.696957)
9 .6970371
YS SF SS Se
usb 1-8: new high-speed USB device number 4 using xhci_hcd
usb 1-8: New USB device found, idVendor=04f2, idProduct=b449, bed
usb 1-8: New USB device strings: Mfr=3, Product=1, SerialNumber=2
usb 1-8: Product: Integrated Camera
usb 1-8: Manufacturer: Chicony Electronics Co.,Ltd.
usb 1-8: SerialNumber: 6x0001
usb-storage 1-1:1.6: USB Mass Storage device detected
scsi host3:
usb-storage 1-1:1.6
usbcore: registered new interface driver usb-storage
usbcore: registered new interface driver uas
scsi 3:0:6:@: Direct-fAccess General UDisk eg
sd 3:0:0:0: Attached scsi generic sgi type @
eM Pee PM eA PA ed) te) ae
Py Me ee dd
Py ee ee eee dm
sd 3:0:0:0: [sdb] Assuming drive cache: write through
sdb: sdbi sdb2 sdb3
sd 3:0:0:0: [sdb] Attached SCSI removable disk
squashfs: version 4.6 (2609/01/31) Phillip Lougher
Copying live image to RAM...
Ca ewe te Mae
-d
,--deskew
option to help with PDFs made from poor scans). Fortunately, boot failure images are one of the few kinds of images in U&L questions that aren't likely to trigger complaints.tiff_findskew
from pageutils. Not guaranteed to work, but I've had some great results with ocrmypdf's-d
option on some PDFs made from abysmally bad scanned images. BTWocrmypdf
andpageutils
are available as packages for Debian and derivatives, and probably for other distros too.ocrmypdf
. It produces text overlayed on picture, when text is copied it is pasted as separate short objects / many short lines, I do not get long lines from log as single lines. Summary: not ok.pdftotext -layout
on the PDF generated by ocrmypdf - the -layout option often produces better results, especially on PDFs with multiple columns of text. Also BTW, if you have less configured to uselesspipe
, it will automatically runpdftotext -layout
if you use less to "view" a pdf. NOTE: pdftotext does not do OCR, it just extracts the text layer (if one exists) from a PDF.