
I have many folders with lots of PDFs, and I want to run OCR (optical character recognition) on those that do not have a text layer. So first, I need to find them. I thought that maybe a pipe with pdfgrep would do the job, but I'm lost.

How can I find PDFs that do not have text?

1 Answer


Yes, using pdfgrep sounds like a good idea. Something like:

find . -name '*.[Pp][Dd][Ff]' -type f \
  ! -exec pdfgrep -q '\w' {} ';' -print

That would report the PDF files in which pdfgrep can't find any word character (alphanumerics or underscore).

(With some find implementations, you can use -iname '*.pdf' instead of -name '*.[Pp][Dd][Ff]' above. Beware that it assumes file names are valid text in the current locale.)
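For what it's worth, the -iname variant could be wrapped in a small function so it can be pointed at any directory. This is only a sketch: list_textless_pdfs is a made-up name, and it assumes a find that supports -iname (GNU, BSD and busybox find do, but it is not in POSIX).

```sh
# list_textless_pdfs is a hypothetical helper name.
# $1 is the directory to search (defaults to the current directory).
# Prints the PDFs in which pdfgrep finds no word character at all.
list_textless_pdfs() {
  find "${1:-.}" -iname '*.pdf' -type f \
    ! -exec pdfgrep -q '\w' {} ';' -print
}
```

Called as list_textless_pdfs ~/scans, it prints one path per line, so it shares the usual limitation with file names that contain newlines.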

To look for files with fewer than 1000 word characters:

find . -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '
  for file do
    [ "$(pdfgrep -c "\w" "$file")" -lt 1000 ] &&
      printf "%s\n" "$file"
  done' sh {} +
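One thing the loop above does not handle gracefully: if pdfgrep prints nothing for a file (for instance because it could not open it), the [ test gets an empty string and complains, without saying which file was involved. Here is a sketch of a slightly more defensive version, written as a standalone script. The names below_limit and report_sparse and the LIMIT variable are mine, not part of pdfgrep, and the 1000 default is just the same threshold as above.

```sh
#!/bin/sh
# Hypothetical script: print the PDFs given as arguments whose text has
# fewer than LIMIT word characters. LIMIT is an assumed environment variable.
LIMIT=${LIMIT:-1000}

# Return 0 if $1 is a non-negative integer smaller than $2,
# 1 if it is not smaller, 2 if $1 is empty or not a number.
below_limit() {
  case $1 in
    '' | *[!0-9]*) return 2 ;;
  esac
  [ "$1" -lt "$2" ]
}

# Print each argument whose pdfgrep match count is below LIMIT and
# name the files for which no usable count was obtained on stderr.
report_sparse() {
  for file do
    # pdfgrep exits non-zero when nothing matches, hence the "|| :".
    count=$(pdfgrep -c '\w' "$file") || :
    status=0
    below_limit "$count" "$LIMIT" || status=$?
    case $status in
      0) printf '%s\n' "$file" ;;
      2) printf 'skipped (no usable count): %s\n' "$file" >&2 ;;
    esac
  done
}

report_sparse "$@"
```

Saved under a name of your choosing, say sparse-pdfs.sh, it could be run as find . -name '*.[Pp][Dd][Ff]' -type f -exec sh sparse-pdfs.sh {} + and the threshold changed by setting LIMIT in the environment.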
  • That will get a lot of false positives, since many scanned pdfs include notices / watermarks as text.
    – user313992
    Commented Jan 15, 2021 at 9:33
  • @user414777, I think you mean false negatives, as it would fail to report those files. I've added a variant that counts the number of word characters (and which could have false positives in addition to false negatives). Commented Jan 15, 2021 at 9:49
