
I am updating a script that recursively goes through a directory, OCRs each PDF it finds, and updates that PDF.

In its simple version, it works.

ocrmypdf -l vie --deskew --clean --force-ocr --sidecar vietnamese_website.txt Vietnamese\ Website.jpg Vietnamese\ Website.pdf --verbose 1

I would like to make it recursively go through a folder and consume all sorts of file types so I am expanding find to:

find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \)

The example for batch and parallel processing is below:

find .  -name '*.pdf' | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --verbose 1 '{}' '{}'

My question is in two parts:

'languages' is an alias for the full list of supported Tesseract training data. Typed into the shell on Mac OS X, it expands to: alias languages='eng+rus+vie+ukr+fra+spa+afr+amh+ara+asm+aze+aze_cyrl+bel+ben+bod+bos+bre+bul+cat+ceb+ces+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert+chr+cos+cym+dan+dan_frak+deu+deu_frak+div+dzo+ell+eng+enm+epo+equ+est+eus+fao+fas+fil+fin+fra+frk+frm+fry+gla+gle+glg+grc+guj+hat+heb+hin+hrv+hun+hye+ik...and so on. ocrmypdf receives the literal word languages rather than the expanded list, so that isn't working. I'd also like to write out a text file with --sidecar, but '{}.txt' complains that there is no such file. Here is where I am:

find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --sidecar '{}.txt' '{}' '{}' --verbose 1

find gets what I need, but --sidecar is unhappy. So how do I deal with the alias and '$1.txt'?

2 Answers

1

I think there are two points.

  • Alias expansion works only on the first word, not on an option.
  • You need some modification to the names provided by find.
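The first point can be checked with a small sketch. The shortened list and the variable name langs are illustrative assumptions, not taken from the question; printf stands in for ocrmypdf so that the arguments it receives are visible.

```shell
#!/bin/bash
# Non-interactive bash does not expand aliases unless this option is set.
shopt -s expand_aliases

# Shortened, illustrative language list (assumption, not the full alias).
alias languages='eng+rus+vie'
langs='eng+rus+vie'   # hypothetical variable holding the same list

# An alias is only expanded in command position (the first word),
# so here the literal word "languages" is passed as the argument.
printf '%s %s\n' -l languages

# A variable is expanded wherever it appears.
printf '%s %s\n' -l "$langs"
```

The first printf receives the word languages unchanged, which matches what ocrmypdf saw in the question; the second receives the expanded list.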

While it is possible to do everything in the find command line, I think it is easier to create a script for this purpose, let's call it ocrmypdf.sh:

#!/bin/bash

languages='eng+rus+vie+...'
# Strip the last extension, so ./scan.jpg becomes ./scan
base="${1%.*}"
ocrmypdf -l "$languages" --deskew --clean --force-ocr --sidecar "$base.txt" "$1" "$base.pdf" --verbose 1
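To see what the `${1%.*}` expansion in the script does to the names coming from find, here is a minimal sketch. The function name strip_ext and the second sample path are hypothetical and only serve as illustration.

```shell
#!/bin/bash
# strip_ext is a hypothetical helper: it prints its argument with the
# shortest trailing ".something" removed, exactly like ${1%.*} in the script.
strip_ext() {
    printf '%s\n' "${1%.*}"
}

# Spaces are preserved as long as the expansion stays inside double quotes.
strip_ext './Vietnamese Website.jpg'   # ./Vietnamese Website

# Only the last extension is removed.
strip_ext './scans/page.01.tiff'       # ./scans/page.01
```

One caveat: the pattern cuts at the last dot anywhere in the path, so a file without an extension inside a dotted directory would be cut at the directory name. The find expression above only matches names that end in an extension, so this should not come up here.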

Then you can run it with

find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf.sh '{}'
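Before involving parallel or ocrmypdf, it may help to check which files find selects and which output and sidecar names would be derived from them. The following is a dry-run sketch under that assumption; show_plan is a hypothetical helper, not part of ocrmypdf or parallel, and it only prints names.

```shell
#!/bin/bash
# show_plan DIR: list every matching file under DIR together with the
# output and sidecar names that the ${f%.*} expansion would produce.
show_plan() {
    local dir=$1 f base
    find "$dir" -type f \( -name '*.pdf' -o -name '*.jpg' -o -name '*.jpeg' \
        -o -name '*.tif' -o -name '*.tiff' -o -name '*.png' \) -print0 |
    while IFS= read -r -d '' f; do
        # NUL-delimited reading keeps names with spaces or newlines intact.
        base=${f%.*}
        printf 'input=%s output=%s sidecar=%s\n' "$f" "$base.pdf" "$base.txt"
    done
}
```

Running `show_plan .` in the directory to be processed prints one line per file. When switching to the real script, note that it has to be executable (chmod +x) and has to be called by a path that exists, or be on the PATH.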
  • I was also told to add -printf '%P\n' to the find command, but that didn't work, or isn't even supported on Mac OS X. :( Commented Sep 26, 2018 at 7:10
  • find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 /User/markphillips/.ocrmypdf.sh '{}' gives: ./Vietnamese Website.jpg /bin/bash: /User/markphillips/.ocrmypdf.sh: No such file or directory. At first I forgot to chmod the script, but? Commented Sep 26, 2018 at 7:13
  • Is base assignment 2nd quote and ocrmypdf last argument base quote on purpose? Commented Sep 26, 2018 at 7:18
  • Thanks for your direction. I was able to get it working and will publish the details tomorrow! Commented Sep 26, 2018 at 8:17
  • Hi, I ran basically what I had updated from your suggestions, and every file was reported as skipped by the OCR analysis despite the work it was doing. I should also have stated that I hand-executed the command using the 4.0 engine, and then with the engine I rebuilt from source as 3.05, which allowed me to pull in the 102-language list; that gives the annoyingly long lang list but is very powerful. So my point is that I have a solution to publish for this shell question, which you answered, and I am grateful. Commented Sep 26, 2018 at 21:11
0

With direction from user-ralfiedl, the following works with the newest LSTM-based Tesseract 4.0 on Mac OS X.

Updated: I was able to figure out how to put all of this into .profile or .bashrc, which is where I wanted it in the first place. The following doesn't need variables for the txt file.

function do_ocr () {
    #find . -name '*.pdf' -o -name '*.jpg' -o -name '*.tif' -o -name '*.png' -o -name '*.jpeg' -o -name '*.tiff'
    find_all_formats | parallel --tag -j 2 \
    ocrmypdf -l ori+por+srp+hin+chi_sim+spa+uzb_cyrl+mar+swa+ces+urd+nep+cat+mya+lit+dan+mlt+enm+bod+tir+tgl+tha+fas+hrv+ukr+lao+ben+eus+eng+dzo+nld+vie+ita+kir+pus+msa+heb+slv+kaz+rus+eng+vie+ukr+spa \
    --clean --deskew --rotate-pages --image-dpi 300 --jpeg-quality 75 --png-quality 75 \
    -i -f -O 2 --sidecar - --force-ocr '{}' '{}' --verbose 1

}
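The function above calls find_all_formats, whose definition is not shown. A minimal sketch of such a helper, assuming it simply wraps the commented-out find line with the alternatives grouped in parentheses, could look like this. The optional directory argument is an addition for illustration.

```shell
#!/bin/bash
# Hypothetical definition; the original answer does not show this helper.
# Prints every regular file under DIR (default: current directory) whose
# name ends in one of the image or PDF extensions.
find_all_formats() {
    find "${1:-.}" -type f \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tif' \
        -o -name '*.png' -o -name '*.jpeg' -o -name '*.tiff' \)
}
```

Defined in the same .profile or .bashrc, it can be piped into parallel as in do_ocr. The parentheses matter as soon as another test such as -type f is added, because -o binds more loosely than the implicit "and" between find tests.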

Note: You have to rebuild each of the training sets by hand for 4.0, as with the brew install of Tesseract 4.0 (GitHub link to instructions to install 4.0 traineddata).

Update: There is a Docker file for Tesseract 4.0 to which you have to add the language data, and there are step-by-step Mac OS X instructions for the installation; make sure you have Java 8 co-installed and in your environment for the ScrollViewer.jar. With that in place, the above function lets you use all the languages ("auto-detect"), OCR images where possible, convert them to PDF, and produce a sidecar txt file of the contents (in the original language).

My next effort will be to make something that takes Office documents in other languages, translates them, and uses machine learning by adding more data to the text files from OCRing the images.
