Questions tagged [ocr]

OCR (optical character recognition) is the conversion of an image of characters into machine-readable encoded text. Use this tag for questions involving this type of conversion or software that performs OCR. When possible, indicate the software you use and the source and target formats of the conversion.

39 questions

94 votes

4 answers

71k views

How to OCR a PDF file and get the text stored within the PDF?

First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support. I am interested in a solution for Fedora to OCR a multipage non-...

ingli

1,889

asked Aug 4, 2016 at 15:39

50 votes

4 answers

44k views

How to use OCR from the command line in Linux?

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations. I need to ...

Village

3,717

asked Jul 9, 2017 at 21:22

49 votes

6 answers

35k views

Is there some sort of PDF-to-text converter?

I need PDF files in text so I can search over them in bulk from commandline. Is there some converter for Ubuntu, OBSD or similar distro? Perhaps related post, OCR with Ubuntu here.

otto

asked Dec 11, 2010 at 14:46

15 votes

5 answers

7k views

OCR on Linux systems [closed]

I have always found OCR technology to be behind on open source systems. I've also watched the Ocropus project since its infancy. I've tried what I've heard is the best OCR engine available for Linux,...

jjclarkson

2,147

asked Aug 16, 2010 at 22:27

10 votes

2 answers

13k views

Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel

Problem: pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in shell (on the same server and ...

Ashish

asked Jul 18, 2019 at 8:29

8 votes

4 answers

5k views

How can I rasterize all of the text in a PDF?

You know when you have a pdf, which is a scan of a document and it's a really huge file, because it just stores the picture of the scanned document? And there are OCR tools which can help you to ...

Dimitri Schachmann

asked Apr 26, 2015 at 14:09

7 votes

2 answers

3k views

How to find all images containing any text?

I have a lot of images, and I need to find which of them contain any text in English (to delete them). Is it possible to do it automatically?

Andrey Chetverikov

asked Oct 17, 2012 at 9:59

5 votes

1 answer

274 views

Find PDFs that don't have text

I have many folders with lots of PDFs and I want to Optical Character Recognise those that do not have a text layer. So first, I want to find them. I thought that maybe a pipe with pdfgrep would do ...

fich

asked Jan 15, 2021 at 4:11

5 votes

1 answer

2k views

tesseract: is it possible to change font output in OCRed pdf?

Following up on how to OCR a pdf file and get the text stored within pdf? I have successfully produced OCRed pdf pages. In Evince, however, the letters are not shown; by this I mean that I cannot see ...

ingli

1,889

asked Aug 27, 2016 at 8:14

4 votes

3 answers

340 views

sed one-liner to replace word-medial capitals

I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi' which are commonly joined in some fonts, got read in as capital W's. Now I need to replace all the W's with 'fi', and ...

ixtmixilix

13.3k

asked May 26, 2011 at 23:47

4 votes

1 answer

2k views

Delete OCR from PDF

I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and I got this file with an automatically generated OCR. How can I remove the ...

Seninha

1,045

asked Jun 11, 2017 at 22:46

4 votes

1 answer

194 views

De-obfuscate a picture with statistical information?

I need to get this kind of information into numbers, how? Perhaps related https://dsp.stackexchange.com/questions/1054/how-do-i-recover-the-signal-from-an-ecg-image https://dsp.stackexchange.com/...

user2362

asked Feb 4, 2012 at 18:01

4 votes

0 answers

191 views

Replace Scanned Text with OCRed Text in PDF

I have a scanned book as a PDF. When viewed in Evince, the book appears as it did when scanned, with old fashioned fonts that appear as they were scanned. However, Evince recognises the letters as ...

zhanmusi

asked Feb 24, 2019 at 0:39

3 votes

1 answer

1k views

Linux equivalent of GraphClick?

Is there a piece of Linux software that does what GraphClick does in Mac OS X? That is, is there a Linux software that "is a graph digitizer software which allows to automatically retrieve the ...

hpy

4,587

asked Apr 29, 2011 at 15:30

2 votes

2 answers

1k views

Create custom wordlist

I want to create a custom list of (scientific) words for purposes like spell checking and OCR based on my collection of scientific papers in pdf format. Using pdftotext I can easily create a text file ...

highsciguy

2,574

asked May 18, 2013 at 20:25
