Newest 'ocr' Questions - Unix & Linux Stack Exchange

0 votes

0 answers

52 views

What happened to Tesseract's "Math / equation detection module"?

I was able to get Tesseract to run via a Python script on my Windows machine to turn non-searchable PDFs into searchable ones. When I installed Tesseract on Windows, it asked me which languages I ...

Curious Layman

101

asked May 16 at 16:17

0 votes

0 answers

92 views

Making badly scanned public domain books legible with OCR

I've obtained soft copies of some very old public domain books. The illustrations are clear enough, but the text is somewhat blurry. I've experimented with Tesseract OCR and it can recognize a ...

YQ002lc2

45

asked Jul 24, 2023 at 6:11

2 votes

0 answers

41 views

OCR high res images & combine OCR data later, after image compression?

I have a large number of .tif's coming out of ScanTailor. Is there a way that I might OCR those .tif's with tesseract, holding the OCR data separate from the images; then compress the images, and ...

Diagon

680

asked Jul 7, 2023 at 22:50

1 vote

2 answers

378 views

How to scan with ocr bash script

To streamline the scanning process, I intend to create a script that scans and applies OCR in one step. However, my Bash skills are rather poor, so I would be very thankful for a bit of help. Here my ...

alex

993

asked Apr 26, 2023 at 2:16

2 votes

0 answers

222 views

MacOS-like OCR for Linux?

How can one set up the same ubiquitous OCR capabilities on Linux, similar to how one can copy text from any image in any software on macOS and iOS? I am using EndeavourOS with the GNOME desktop environment.

Pushp Vashisht

121

asked Apr 13, 2023 at 18:04

0 votes

1 answer

145 views

Make (`ocrmypdf`) command run in terminal AND include input name in that of the output

I have this line inside a Dolphin service-menu file that contains many other commands for PDF processing: `Exec=bash -c 'f="%u"; ocrmypdf "$f" "${f%.pdf}_ocr.pdf";'` It ...

cipricus

1,629

asked Nov 30, 2022 at 14:43
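
The `${f%.pdf}_ocr.pdf` expansion in the excerpt above strips a trailing `.pdf` from the input name, if present, and appends `_ocr.pdf`. A minimal Python sketch of that naming rule follows; the function name `ocr_output_name` is hypothetical and is not part of ocrmypdf or Dolphin.

```python
def ocr_output_name(path: str) -> str:
    """Mimic the shell expansion "${f%.pdf}_ocr.pdf" (illustrative helper)."""
    suffix = ".pdf"
    # ${f%.pdf} removes the suffix only when the name ends with it,
    # and the match is case-sensitive, so "x.PDF" is left untouched.
    stem = path[: -len(suffix)] if path.endswith(suffix) else path
    return stem + "_ocr.pdf"
```

For example, `ocr_output_name("/tmp/scan.pdf")` gives `/tmp/scan_ocr.pdf`, so the input name is carried into the output name.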

0 votes

1 answer

408 views

Best command-line OCR software for recognizing typed text over colorful background

I need to extract text from images like the one below: As you can see, the text is typed not handwritten. Moreover, the background is colorful. I've tried Tesseract OCR, and while it works some of ...

user549392

asked Nov 15, 2022 at 19:35

0 votes

0 answers

47 views

How do I format texts that were processed by OCR?

Let's say that I want to connect all the paragraphs that are broken by the citations that start with (1), (2), (3), (4), (5). How would I express/automate this in bash? Keep in mind there are at most ...

Jean

1

asked Oct 1, 2022 at 12:48

1 vote

0 answers

34 views

Can I transform colors of scanned pdf files and reduce the scan resolution to save memory keeping an existing text layer from OCR?

I have a pile of pdf files which have been scanned long ago and which are already searchable (i.e. they went through OCR). However the light level and contrast settings were not optimal. Is it ...

Adalbert Hanßen

253

asked Sep 14, 2022 at 19:19

1 vote

0 answers

395 views

Using Tesseract for character recognition, result is not as expected (much worse). How to get better?

I wanted to add the output of a Linux boot to my question and decided to try optical character recognition, thinking that now, in 2022, surely there should be decent open source options (have not tried OCR ...

Martian2020

1,219

asked Jan 10, 2022 at 6:35

0 votes

0 answers

91 views

NormCap OCR via Awesome Window Manager

One of the coolest programs I've come across recently, is an Optical Character Recognition (OCR) program called NormCap. I have it tied to a hot key, and anytime I want to copy un-highlightable text ...

Lonnie Best

5,185

asked Dec 25, 2021 at 23:46

2 votes

0 answers

96 views

Is there software to manually OCR / teach OCR for handwritten (non-English) texts?

I have a problem that Tesseract, ABBYY FineReader, etc. can't solve: they can't recognize handwritten Russian, for example. So I am looking for OCR software for such things, or a way to manually OCR my PDFs (...

PDD

21

asked Oct 15, 2021 at 4:19

0 votes

0 answers

592 views

How to specify multiple input files for Tesseract when using the output PDF option (only works with 'parallel' on the command line)

I am trying to run Tesseract on all files in a directory to produce a PDF. This command works fine: `ls * | parallel -j 4 tesseract {} {.} pdf` and produces a PDF for each input file. However, I am unable to get it ...

Michael

asked May 4, 2021 at 13:29

5 votes

1 answer

274 views

Find PDFs that don't have text

I have many folders with lots of PDFs and I want to Optical Character Recognise those that do not have a text layer. So first, I want to find them. I thought that maybe a pipe with pdfgrep would do ...

fich

330

asked Jan 15, 2021 at 4:11

1 vote

0 answers

155 views

Where is ocrmypdf executable after Cygwin installation?

I followed this page to install OCRmyPDF on Cygwin. I did so from a non-administrator account, so the process ended up creating ~/.local/ for the required files. The following commands, however, do ...

user36800

111

asked Jan 10, 2021 at 20:01

0 votes

1 answer

1k views

methods of PDF compression

The Problem: I have a lot of old books that I want to scan and digitize. For this, I use a flatbed scanner, xsane and GImageReader, which works great. Back a few years ago, when I was still using ...

carsten

355

asked Jan 3, 2021 at 10:05

0 votes

1 answer

146 views

How to find a word in picture and put another word in desired position?

I am an IT specialist, but I do a lot of financial clerk work! I have to put cost centers on invoices (of the IT department) by hand! Is there perhaps a technology or solution in Linux to automate ...

Юля

1

asked May 29, 2020 at 11:15

2 votes

0 answers

264 views

Convert scanned pdf to pdf with text and images

Is it possible to convert a scanned PDF to a normal PDF (i.e. the same PDF as if it had been created from a document, with formatted text and images)? I tried many OCR solutions online/offline but they tend ...

Jean Molinier

63

asked Dec 3, 2019 at 8:31

10 votes

2 answers

13k views

Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel

Problem: pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in shell (on the same server and ...

Ashish

270

asked Jul 18, 2019 at 8:29

4 votes

0 answers

191 views

Replace Scanned Text with OCRed Text in PDF

I have a scanned book as a PDF. When viewed in Evince, the book appears as it did when scanned, with old fashioned fonts that appear as they were scanned. However, Evince recognises the letters as ...

zhanmusi

141

asked Feb 24, 2019 at 0:39

1 vote

2 answers

285 views

How do I update this recursive directory file search for input and name outputs to handle the below case

I am updating a script that recursively goes through a directory and ocrs the pdf and updates the pdf. In its simple version, it works. ocrmypdf -l vie --deskew --clean --force-ocr --sidecar ...

pleasemarkdarkly

11

asked Sep 26, 2018 at 2:45

2 votes

0 answers

718 views

Extract hardcoded subtitles

I wanted to know if there is a way to extract hardcoded subtitles via OCR. Should I do some image processing after extracting the frames in order to use tesseract afterwards? I have tried to extract ...

SkyBeast MC

21

asked Jul 17, 2018 at 23:38

50 votes

4 answers

44k views

How to use OCR from the command line in Linux?

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations. I need to ...

Village

3,717

asked Jul 9, 2017 at 21:22

4 votes

1 answer

2k views

Delete OCR from PDF

I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and I got this file with an automatically generated OCR. How can I remove the ...

Seninha

1,045

asked Jun 11, 2017 at 22:46

0 votes

3 answers

1k views

OCR software for handwritten equations to get LaTeX file

First of all, I apologize if this is not the right place to ask this, but I couldn't think of anywhere else (maybe Stack Overflow?). Anyway, I'm looking for Optical Character Recognition software (...

TomCho

529

asked Dec 18, 2016 at 17:59

5 votes

1 answer

2k views

tesseract: is it possible to change font output in OCRed pdf?

Following up on "How to OCR a PDF file and get the text stored within the PDF?", I have successfully produced OCRed PDF pages. In Evince, however, the letters are not shown; by this I mean that I cannot see ...

ingli

1,889

asked Aug 27, 2016 at 8:14

94 votes

4 answers

71k views

How to OCR a PDF file and get the text stored within the PDF?

First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support. I am interested in a solution for Fedora to OCR a multipage non-...

ingli

1,889

asked Aug 4, 2016 at 15:39

8 votes

4 answers

5k views

How can I rasterize all of the text in a PDF?

You know when you have a pdf, which is a scan of a document and it's a really huge file, because it just stores the picture of the scanned document? And there are OCR tools which can help you to ...

Dimitri Schachmann

183

asked Apr 26, 2015 at 14:09

2 votes

1 answer

699 views

Where can I get Tesseract binaries for Debian 6 64-bit?

I used apt-get to install Tesseract but it's not really working. Maybe I could just download binaries somewhere, put in a dir and use this way? What's wrong with my Tesseract now: tesseract --help ...

buikoto

21

asked Jan 23, 2015 at 22:05

2 votes

0 answers

78 views

OCR that outputs probability data

I would like to convert printed books I own into audio by scanning them with OCR and then running the text through a TTS engine. These titles are not available as ebooks. Since OCR can make small ...

themirror

7,038

asked Sep 27, 2013 at 16:17

2 votes

2 answers

1k views

Create custom wordlist

I want to create a custom list of (scientific) words for purposes like spell checking and OCR based on my collection of scientific papers in pdf format. Using pdftotext I can easily create a text file ...

highsciguy

2,574

asked May 18, 2013 at 20:25

7 votes

2 answers

3k views

How to find all images containing any text?

I have a lot of images, and I need to find which of them contain any English text (so that I can delete them). Is it possible to do this automatically?

Andrey Chetverikov

173

asked Oct 17, 2012 at 9:59

4 votes

1 answer

194 views

De-obfuscate a picture with statistical information?

I need to get this kind of information into numbers, how? Perhaps related https://dsp.stackexchange.com/questions/1054/how-do-i-recover-the-signal-from-an-ecg-image https://dsp.stackexchange.com/...

user2362

asked Feb 4, 2012 at 18:01

4 votes

3 answers

340 views

sed one-liner to replace word-medial capitals

I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi', which are commonly joined in some fonts, got read in as capital W's. Now I need to replace all the W's with 'fi', and ...

ixtmixilix

13.3k

asked May 26, 2011 at 23:47
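
The substitution described in the excerpt above can be sketched in Python rather than sed. The excerpt is truncated, so the exact rule is an assumption here: this sketch replaces a capital `W` with `fi` only when a lowercase letter comes directly before it, which leaves word-initial capitals such as `Water` alone. The function name `fix_fi_ligature` is hypothetical.

```python
import re

# Assumption: a misread "fi" ligature shows up as "W" right after a
# lowercase letter; a "W" at the start of a word is treated as genuine.
_MISREAD_W = re.compile(r"(?<=[a-z])W")


def fix_fi_ligature(text: str) -> str:
    """Replace capital W with 'fi' when it follows a lowercase letter."""
    return _MISREAD_W.sub("fi", text)
```

For instance, `fix_fi_ligature("Water is magniWcent")` leaves `Water` intact and repairs `magniWcent`. A misread `fi` at the start of a word would not be caught by this rule and would need separate handling.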

0 votes

1 answer

363 views

Image (having text-and-numbers) to text-file matching [:alnum:] nicely with some Unix -tool?

Suppose a photograph with text and numbers. I want to manage it in my editor with tools such as grep, standard text-processing things such as Vim's block-highlighting and also more advanced things ...

user2362

asked May 25, 2011 at 23:57

3 votes

1 answer

1k views

Linux equivalent of GraphClick?

Is there a piece of Linux software that does what GraphClick does in Mac OS X? That is, is there a Linux software that "is a graph digitizer software which allows to automatically retrieve the ...

hpy

4,587

asked Apr 29, 2011 at 15:30

0 votes

1 answer

67 views

Writing to picture which is scanned document

I have a scanned contract and I need to change only a few names and dates in it. It's easy to scan the document, but impossible to OCR it and open it in *.doc format. Is there an ...

xralf

15.2k

asked Apr 19, 2011 at 9:29

49 votes

6 answers

35k views

Is there some sort of PDF-to-text converter?

I need PDF files as text so I can search over them in bulk from the command line. Is there some converter for Ubuntu, OBSD or a similar distro? Perhaps related: the post "OCR with Ubuntu" here.

otto

591

asked Dec 11, 2010 at 14:46

15 votes

5 answers

7k views

OCR on Linux systems [closed]

I have always found OCR technology to be behind on open source systems. I've also watched the Ocropus project since its infancy. I've tried what I've heard is the best OCR engine available for Linux,...

jjclarkson

2,147

asked Aug 16, 2010 at 22:27
