Yes, using pdfgrep sounds like a good idea. Something like:
    find . -name '*.[Pp][Dd][Ff]' -type f \
      ! -exec pdfgrep -q '\w' {} ';' -print
would report the list of PDF files in which pdfgrep can't find any word character (alphanumerics or underscore).
(With some find implementations, you can use -iname '*.pdf' instead of -name '*.[Pp][Dd][Ff]' above. Beware that it assumes file names are valid text in the current locale.)
To look for files with fewer than 1000 word characters:
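    find . -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '
      # pdfgrep -c counts each match individually, so this is
      # the number of word characters found in the file
      for file do
        [ "$(pdfgrep -c "\w" "$file")" -lt 1000 ] &&
          printf "%s\n" "$file"
      done' sh {} +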
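One caveat with the loop above: if pdfgrep prints nothing (for instance because it could not open the file), the [ command is given an empty string instead of a number and complains. A minimal sketch of a more defensive variant follows. The function names below_limit and report_sparse are made up for illustration, and silently skipping files with an unknown count is an assumption about what you want, not something pdfgrep does for you.

```shell
# Hypothetical helper: succeeds if $1 is a non-negative integer below $2.
# Returns 2 when $1 is empty or not a plain number (count unknown).
below_limit() {
  case $1 in
    '' | *[!0-9]*) return 2 ;;
  esac
  [ "$1" -lt "$2" ]
}

# Hypothetical wrapper: $1 is the limit, remaining arguments are PDF files.
# Prints the files in which pdfgrep counts fewer than $1 word characters.
report_sparse() {
  limit=$1
  shift
  for file do
    # Errors from pdfgrep are discarded; an empty count is skipped below.
    count=$(pdfgrep -c '\w' "$file" 2>/dev/null)
    if below_limit "$count" "$limit"; then
      printf '%s\n' "$file"
    fi
  done
}
```

Assuming the functions are saved in a file called sparse-pdf.sh in the current directory (the name is arbitrary), they could be called as:

    find . -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '. ./sparse-pdf.sh; report_sparse 1000 "$@"' sh {} +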