Find PDFs that don't have text

Question

I have many folders with lots of PDFs, and I want to run optical character recognition (OCR) on those that do not have a text layer. So first, I need to find them. I thought a pipeline with pdfgrep might do the job, but I'm lost.

How can I find PDFs that do not have text?

Stéphane Chazelas · Accepted Answer · 2021-01-15 09:45:41Z

Yes, using pdfgrep sounds like a good idea. Something like:

find . -name '*.[Pp][Dd][Ff]' -type f \
  ! -exec pdfgrep -q '\w' {} ';' -print

That reports the PDF files in which pdfgrep can't find any word character (alphanumeric or underscore).

(With some find implementations, you can use -iname '*.pdf' instead of -name '*.[Pp][Dd][Ff]' above. Beware that this assumes file names are valid text in the current locale.)
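The ! -exec ... ';' -print construct above works because find treats the exit status of the executed command as a test: -exec ... ';' is true when the command exits with 0, and ! negates it. A minimal sketch of the same idiom, using plain text files and grep as stand-ins for PDFs and pdfgrep (the temporary directory and the file names are made up for the demonstration):

```shell
# Stand-in demo: text files and grep replace PDFs and pdfgrep here.
demo_dir=$(mktemp -d)
printf 'some words\n' > "$demo_dir/has_text.txt"
printf '   \n'        > "$demo_dir/blank.txt"

# -exec ... ';' is true when grep exits 0 (a match was found).
# "!" negates that, so -print only runs for files without any
# alphanumeric or underscore character.
no_text=$(find "$demo_dir" -name '*.txt' -type f \
  ! -exec grep -q '[[:alnum:]_]' {} ';' -print)

printf '%s\n' "$no_text"
rm -rf "$demo_dir"
```

Only blank.txt should be listed, since grep -q succeeds on has_text.txt and the negated test then stops find from reaching -print for it.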

To look for files with fewer than 1000 word characters:

find . -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '
  for file do
    [ "$(pdfgrep -c "\w" "$file")" -lt 1000 ] &&
      printf "%s\n" "$file"
  done' sh {} +
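The 1000 cut-off is a judgment call, so it can help to make it a parameter. The following is a sketch only: pdfs_below is a hypothetical helper name, not something pdfgrep provides, and treating an empty count as 0 (so that files pdfgrep could not read are listed too) is an assumption you may want to change.

```shell
# pdfs_below THRESHOLD [DIR]: hypothetical helper, not part of pdfgrep.
# Prints the PDFs under DIR (default: .) for which pdfgrep -c reports
# fewer than THRESHOLD matches of \w.
pdfs_below() {
  find "${2:-.}" -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '
    threshold=$1; shift
    for file do
      count=$(pdfgrep -c "\w" "$file" 2>/dev/null)
      # Assumption: no output from pdfgrep is treated as a count of 0.
      if [ "${count:-0}" -lt "$threshold" ]; then
        printf "%s\n" "$file"
      fi
    done' sh "$1" {} +
}
```

For example, pdfs_below 200 ~/scans would list the files under ~/scans with fewer than 200 word characters. Raising the threshold catches more scans that carry a text watermark, at the cost of also listing short documents that do have a real text layer.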

edited Jan 15, 2021 at 9:45

answered Jan 15, 2021 at 7:16

Stéphane Chazelas


That will get a lot of false positives, since many scanned pdfs include notices / watermarks as text.
– user313992
Commented Jan 15, 2021 at 9:33
@user414777, ITYM false negatives, as it would fail to report those files. I've added a variant that counts the number of word characters (and which could have false positives in addition to false negatives).
– Stéphane Chazelas
Commented Jan 15, 2021 at 9:49
