Questions tagged [ocr]
OCR (optical character recognition) is the conversion of an image of characters into machine-readable encoded text. Use this tag for questions involving this type of conversion or software that performs OCR. When possible, indicate the software you use and the source and target of the conversion.
39
questions
94
votes
4
answers
71k
views
How to OCR a PDF file and get the text stored within the PDF?
First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support.
I am interested in a solution for Fedora to OCR a multipage non-...
50
votes
4
answers
44k
views
How to use OCR from the command line in Linux?
I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations.
I need to ...
49
votes
6
answers
35k
views
Is there some sort of PDF-to-text converter?
I need PDF files as text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OBSD or a similar distro?
Possibly related post: OCR with Ubuntu.
15
votes
5
answers
7k
views
OCR on Linux systems [closed]
I have always found OCR technology to be behind on open source systems. I've also watched the Ocropus project since its infancy. I've tried what I've heard is the best OCR engine available for Linux,...
10
votes
2
answers
13k
views
Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel
Problem
pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in shell (on the same server and ...
8
votes
4
answers
5k
views
How can I rasterize all of the text in a PDF?
You know when you have a PDF that is a scan of a document, and it's a really huge file because it just stores the picture of the scanned document?
And there are OCR tools which can help you to ...
7
votes
2
answers
3k
views
How to find all images containing any text?
I have a lot of images, and I need to find which of them contain any text in English (to delete them). Is it possible to do this automatically?
5
votes
1
answer
274
views
Find PDFs that don't have text
I have many folders with lots of PDFs and I want to run OCR on those that do not have a text layer. So first, I want to find them. I thought that maybe a pipe with pdfgrep would do ...
5
votes
1
answer
2k
views
tesseract: is it possible to change font output in OCRed pdf?
Following up on "How to OCR a PDF file and get the text stored within the PDF?", I have successfully produced OCRed PDF pages.
In Evince, however, the letters are not shown; by this I mean that I cannot see ...
4
votes
3
answers
340
views
sed one-liner to replace word-medial capitals
I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi', which are commonly joined in some fonts, got read in as capital W's. Now I need to replace all the W's with 'fi', and ...
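The replacement described in this question can be sketched in a few lines. This is a minimal illustration in Python rather than the sed one-liner the question asks for, and it assumes that "word-medial" means a capital W with a lowercase letter on both sides; the helper name `fix_medial_w` and that matching rule are hypothetical, not taken from the question or its answer.

```python
import re

# Assumption: a misread 'fi' ligature shows up as a capital W that sits
# between two lowercase letters. A W at the start of a word is left alone,
# since it may be a genuine capital.
_MEDIAL_W = re.compile(r"(?<=[a-z])W(?=[a-z])")


def fix_medial_w(text: str) -> str:
    """Replace every word-medial capital W with the letters 'fi'."""
    return _MEDIAL_W.sub("fi", text)


if __name__ == "__main__":
    sample = "The ofWce will deWne the term. Water and Wind stay unchanged."
    print(fix_medial_w(sample))
```

Under this rule, words that begin or end with a misread 'fi' are not repaired, because a W in those positions cannot be told apart from a real capital without a dictionary check.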
4
votes
1
answer
2k
views
Delete OCR from PDF
I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and I got this file with an automatically generated OCR. How can I remove the ...
4
votes
1
answer
194
views
De-obfuscate a picture with statistical information?
How can I get this kind of information into numbers?
Perhaps related
https://dsp.stackexchange.com/questions/1054/how-do-i-recover-the-signal-from-an-ecg-image
https://dsp.stackexchange.com/...
4
votes
0
answers
191
views
Replace Scanned Text with OCRed Text in PDF
I have a scanned book as a PDF.
When viewed in Evince, the book appears as it did when scanned, with old-fashioned fonts that appear as they were scanned.
However, Evince recognises the letters as ...
3
votes
1
answer
1k
views
Linux equivalent of GraphClick?
Is there a piece of Linux software that does what GraphClick does in Mac OS X?
That is, is there Linux software that "is a graph digitizer software which allows to automatically retrieve the ...
2
votes
2
answers
1k
views
Create custom wordlist
I want to create a custom list of (scientific) words for purposes like spell checking and OCR, based on my collection of scientific papers in PDF format. Using pdftotext I can easily create a text file ...