Questions tagged [ocr]
OCR (optical character recognition) is the conversion of an image of characters into machine-readable encoded text. Use this tag for questions involving this type of conversion or software that performs OCR. When possible, indicate the software you use and the source and target of the conversion.
39
questions
0
votes
0
answers
52
views
What happened to Tesseract's "Math / equation detection module"?
I was able to get Tesseract to run via a Python script on my Windows machine to turn non-searchable PDFs into searchable ones. When downloading Tesseract onto Windows, it asked me which languages I ...
0
votes
0
answers
92
views
Making badly scanned public domain books legible with OCR
I've obtained soft copies of some very old public domain books.
The illustrations are clear enough, but the text is somewhat blurry.
I've experimented with Tesseract OCR and it can recognize a ...
2
votes
0
answers
41
views
OCR high res images & combine OCR data later, after image compression?
I have a large number of .tif's coming out of ScanTailor. Is there a way that I might OCR those .tif's with tesseract, holding the OCR data separate from the images; then compress the images, and ...
1
vote
2
answers
378
views
How to scan with ocr bash script
To streamline the scan process I intend to create a script that scans and applies OCR in one step. However my bash skills are rather poor, so I would be very thankful for a bit of help. Here my ...
2
votes
0
answers
222
views
macOS-like OCR for Linux?
How can one set up the same ubiquitous OCR capabilities on Linux, in a manner similar to how one can copy text from any image in any software on macOS and iOS?
I am using EndeavourOS with the GNOME desktop environment.
0
votes
1
answer
145
views
Make (`ocrmypdf`) command run in terminal AND include input name in that of the output
I have this line inside a Dolphin service-menu file that contains many other commands for PDF processing:
Exec=bash -c 'f="%u"; ocrmypdf "$f" "${f%.pdf}_ocr.pdf";'
It ...
0
votes
1
answer
408
views
Best command-line OCR software for recognizing typed text over colorful background
I need to extract text from images like the one below:
As you can see, the text is typed, not handwritten. Moreover, the background is colorful.
I've tried Tesseract OCR, and while it works some of ...
0
votes
0
answers
47
views
How do I format texts that were processed by OCR?
Let's say that I want to connect all the paragraphs that are broken by the citations that start with (1), (2), (3), (4), (5). How would I express/automate this in bash? Keep in mind there are at most ...
1
vote
0
answers
34
views
Can I transform colors of scanned pdf files and reduce the scan resolution to save memory keeping an existing text layer from OCR?
I have a pile of pdf files which have been scanned long ago and which are already searchable (i.e. they went through OCR).
However the light level and contrast settings were not optimal.
Is it ...
1
vote
0
answers
395
views
Using Tesseract for character recognition, result is not as expected (much worse). How to get better?
I wanted to add output of Linux boot to my question and decided to try to use optical character recognition thinking now in 2022 surely there should be decent open source options (have not tried OCR ...
0
votes
0
answers
91
views
NormCap OCR via Awesome Window Manager
One of the coolest programs I've come across recently is an Optical Character Recognition (OCR) program called NormCap. I have it tied to a hot key, and anytime I want to copy un-highlightable text ...
2
votes
0
answers
96
views
Is there software to manually OCR / teach OCR for handwritten (non-English) texts?
I have a problem that Tesseract, ABBYY FineReader, etc. can't solve: they can't recognize handwritten Russian, for example.
So I am looking for
OCR software for such things
or a way to manually OCR my pdfs (...
0
votes
0
answers
592
views
How to specify multiple input files for Tesseract when using the output PDF option (only works with 'parallel' on the command line)
I am trying to tesseract all files in a directory to a pdf:
This command works fine:
ls * | parallel -j 4 tesseract {} {.} pdf
And produces a pdf for each input file.
However, I am unable to get it ...
5
votes
1
answer
274
views
Find PDFs that don't have text
I have many folders with lots of PDFs and I want to run optical character recognition on those that do not have a text layer. So first, I want to find them. I thought that maybe a pipe with pdfgrep would do ...
1
vote
0
answers
155
views
Where is ocrmypdf executable after Cygwin installation?
I followed this page to install OCRmyPDF on Cygwin. I did so from a non-administrator account, so the process ended up creating ~/.local/ for the required files. The following commands, however, do ...
0
votes
1
answer
1k
views
Methods of PDF compression
The Problem
I have a lot of old books that I want to scan and digitize. For this, I use some flatbed scanner, xsane and GImageReader, which works great.
Back a few years ago, when I was still using ...
0
votes
1
answer
146
views
How to find a word in picture and put another word in desired position?
I am an IT specialist, but I do a lot of financial clerk work! I have to put cost centers on invoices (of the IT department) by hand!
Is there perhaps a technology or solution in Linux to automate ...
2
votes
0
answers
264
views
Convert scanned PDF to PDF with text and images
Is it possible to convert a scanned PDF to a normal PDF (i.e. the same PDF as if it had been created from a document, with formatted text and images)?
I tried many OCR solutions online/offline
but they tend ...
10
votes
2
answers
13k
views
Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel
Problem
pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in a shell (on the same server and ...
4
votes
0
answers
191
views
Replace Scanned Text with OCRed Text in PDF
I have a scanned book as a PDF.
When viewed in Evince, the book appears as it did when scanned, with old-fashioned fonts that appear as they were scanned.
However, Evince recognises the letters as ...
1
vote
2
answers
285
views
How do I update this recursive directory file search for input and name outputs to handle the below case?
I am updating a script that recursively goes through a directory, runs OCR on each PDF, and updates the PDF.
In its simple version, it works.
ocrmypdf -l vie --deskew --clean --force-ocr --sidecar ...
2
votes
0
answers
718
views
Extract hardcoded subtitles
I wanted to know if there is a way to extract hardcoded subtitles via OCR. Should I do some image processing after extracting the frames in order to use Tesseract afterwards?
I have tried to extract ...
50
votes
4
answers
44k
views
How to use OCR from the command line in Linux?
I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations.
I need to ...
4
votes
1
answer
2k
views
Delete OCR from PDF
I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and I got this file with an automatically generated OCR. How can I remove the ...
0
votes
3
answers
1k
views
OCR software for handwritten equations to get LaTeX file
First of all, I apologize if this is not the right place to ask this, but I couldn't think of anywhere else (maybe Stack Overflow?).
Anyway, I'm looking for an Optical Character Recognition program (...
5
votes
1
answer
2k
views
Tesseract: is it possible to change font output in OCRed PDF?
Following up on "How to OCR a PDF file and get the text stored within the PDF?", I have successfully produced OCRed PDF pages.
In Evince, however, the letters are not shown; by this I mean that I cannot see ...
94
votes
4
answers
71k
views
How to OCR a PDF file and get the text stored within the PDF?
First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support.
I am interested in a solution for Fedora to OCR a multipage non-...
8
votes
4
answers
5k
views
How can I rasterize all of the text in a PDF?
You know when you have a pdf, which is a scan of a document and it's a really huge file, because it just stores the picture of the scanned document?
And there are OCR tools which can help you to ...
2
votes
1
answer
699
views
Where can I get Tesseract binaries for Debian 6 64-bit?
I used apt-get to install Tesseract but it's not really working. Maybe I could just download binaries somewhere, put them in a directory, and use them that way?
What's wrong with my Tesseract now:
tesseract --help
...
2
votes
0
answers
78
views
OCR that outputs probability data
I would like to convert printed books I own into audio by scanning them with OCR and then running the text through a TTS engine. These titles are not available as ebooks.
Since OCR can make small ...