I am updating a script that recursively walks a directory, OCRs each PDF it finds, and writes the result back to the PDF.
The simple, single-file version works:
ocrmypdf -l vie --deskew --clean --force-ocr --sidecar vietnamese_website.txt Vietnamese\ Website.jpg Vietnamese\ Website.pdf --verbose 1
I would like it to recurse through a folder and accept several file types, so I am expanding find to:
find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \)
The example for batch and parallel processing is below:
find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --verbose 1 '{}' '{}'
My question is in two parts:
'languages' is an alias for the full list of supported tesseract training data. On macOS it is defined as:
alias languages='eng+rus+vie+ukr+fra+spa+afr+amh+ara+asm+aze+aze_cyrl+bel+ben+bod+bos+bre+bul+cat+ceb+ces+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert+chr+cos+cym+dan+dan_frak+deu+deu_frak+div+dzo+ell+eng+enm+epo+equ+est+eus+fao+fas+fil+fin+fra+frk+frm+fry+gla+gle+glg+grc+guj+hat+heb+hin+hrv+hun+hye+ik...and so on
ocrmypdf is passed the literal word 'languages' rather than the expanded list, so that isn't working. I'd also like to write out a text file with --sidecar, but '{}.txt' complains that there is no such file. Here is where I am at:
find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --sidecar '{}.txt' '{}' '{}' --verbose 1
find gets what I need, but --sidecar is unhappy. So how do I deal with the alias and '$1.txt'?
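For reference, here is a minimal sketch of the direction I am considering, not a confirmed fix. It relies on two assumptions: that the alias problem is because the shell expands aliases only in command position and never as an argument (so a plain variable would be needed), and that the sidecar and output names should drop the input extension. The helper names with_ext and ocr_one are made up for illustration, the language list is shortened, and the command is only printed, not run.

```shell
# Assumption: a plain variable replaces the alias (list shortened here).
languages='eng+rus+vie'

# Hypothetical helper: strip the last extension and append a new one,
# e.g. "./scans/page 1.jpg" with "txt" gives "./scans/page 1.txt".
with_ext() {
    printf '%s.%s\n' "${1%.*}" "$2"
}

# Dry run: print the ocrmypdf command for one input file.
# Remove the leading "echo" to actually run it.
ocr_one() {
    echo ocrmypdf -l "$languages" --deskew --clean --force-ocr \
        --sidecar "$(with_ext "$1" txt)" "$1" "$(with_ext "$1" pdf)"
}
```

This could be driven by piping the find command into a loop such as: while IFS= read -r f; do ocr_one "$f"; done. With GNU parallel, the replacement string {.} is the input with its extension removed, so --sidecar '{.}.txt' '{}' '{.}.pdf' together with -l "$languages" (double-quoted so the calling shell expands it) would be the equivalent form.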