Questions tagged [ocr]

OCR (optical character recognition) is the conversion of an image of characters into machine-readable encoded text. Use this tag for questions involving this type of conversion or software that performs OCR. When possible, indicate the software you use and the source and target of the conversion.

94 votes
4 answers
71k views

How to OCR a PDF file and get the text stored within the PDF?

First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support. I am interested in a solution for Fedora to OCR a multipage non-...
ingli
  • 1,889
50 votes
4 answers
44k views

How to use OCR from the command line in Linux?

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations. I need to ...
Village
  • 3,717
49 votes
6 answers
35k views

Is there some sort of PDF-to-text converter?

I need PDF files in text so I can search over them in bulk from commandline. Is there some converter for Ubuntu, OBSD or similar distro? Perhaps related post, OCR with Ubuntu here.
otto
  • 591
15 votes
5 answers
7k views

OCR on Linux systems [closed]

I have always found OCR technology to be behind on open source systems. I've also watched the Ocropus project since its infancy. I've tried what I've heard is the best OCR engine available for Linux,...
jjclarkson
  • 2,147
10 votes
2 answers
13k views

Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel

Problem: pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in the shell (on the same server and ...
Ashish
  • 270
8 votes
4 answers
5k views

How can I rasterize all of the text in a PDF?

You know when you have a pdf, which is a scan of a document and it's a really huge file, because it just stores the picture of the scanned document? And there are OCR tools which can help you to ...
Dimitri Schachmann
7 votes
2 answers
3k views

How to find all images containing any text?

I have a lot of images, and I need to find which of them contain any text in English (so I can delete them). Is it possible to do this automatically?
Andrey Chetverikov
5 votes
1 answer
274 views

Find PDFs that don't have text

I have many folders with lots of PDFs and I want to run OCR on those that do not have a text layer. So first, I want to find them. I thought that maybe a pipe with pdfgrep would do ...
fich
  • 330
5 votes
1 answer
2k views

tesseract: is it possible to change font output in OCRed pdf?

Following up on how to OCR a pdf file and get the text stored within pdf? I have successfully produced OCRed pdf pages. In Evince, however, the letters are not shown; by this I mean that I cannot see ...
ingli
  • 1,889
4 votes
3 answers
340 views

sed one-liner to replace word-medial capitals

I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi', which are commonly joined in some fonts, got read in as capital W's. Now I need to replace all the W's with 'fi', and ...
ixtmixilix
  • 13.3k
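
A minimal sketch of the kind of substitution this question describes, written in Python rather than sed. The excerpt is truncated, so the exact rule is an assumption: here a "word-medial" W is taken to mean a capital W with a lowercase letter on both sides, and the helper name `fix_medial_w` is hypothetical.

```python
import re

# Assumption: a misread 'fi' ligature shows up as a capital W that has a
# lowercase letter immediately before and after it. Word-initial W's
# (e.g. proper names) are deliberately left alone.
MEDIAL_W = re.compile(r"(?<=[a-z])W(?=[a-z])")


def fix_medial_w(text: str) -> str:
    """Replace word-medial capital W with 'fi' (hypothetical helper)."""
    return MEDIAL_W.sub("fi", text)


if __name__ == "__main__":
    sample = "A scientiWc paper by Walter"
    print(fix_medial_w(sample))
```

Under this assumption, a word that begins with the ligature (for example "first" misread as "Wrst") is not corrected, because it cannot be told apart from a genuine capital W without a dictionary check.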
4 votes
1 answer
2k views

Delete OCR from PDF

I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and I got this file with an automatically generated OCR. How can I remove the ...
Seninha
  • 1,045
4 votes
1 answer
194 views

De-obfuscate a picture with statistical information?

I need to get this kind of information into numbers, how? Perhaps related https://dsp.stackexchange.com/questions/1054/how-do-i-recover-the-signal-from-an-ecg-image https://dsp.stackexchange.com/...
4 votes
0 answers
191 views

Replace Scanned Text with OCRed Text in PDF

I have a scanned book as a PDF. When viewed in Evince, the book appears as it did when scanned, with old fashioned fonts that appear as they were scanned. However, Evince recognises the letters as ...
zhanmusi
  • 141
3 votes
1 answer
1k views

Linux equivalent of GraphClick?

Is there a piece of Linux software that does what GraphClick does in Mac OS X? That is, is there Linux software that "is a graph digitizer software which allows to automatically retrieve the ...
hpy
  • 4,587
2 votes
2 answers
1k views

Create custom wordlist

I want to create a custom list of (scientific) words for purposes like spell checking and OCR based on my collection of scientific papers in pdf format. Using pdftotext I can easily create a text file ...
highsciguy
  • 2,574