I'm using openSUSE 10.3 and would like to know of command-line tools for searching for phrases in a large number of PDF files inside a directory. In Windows XP, the Explorer search can do this, but it is too slow. Are there any grep tips for this?


5 Answers

# extracting text from pdf
pdftotext "file.pdf" "file.txt"

# connecting with grep
pdftotext "file.pdf" /dev/stdout |grep -H --label="file.pdf" -- "$SEARCH_STRING"

# to print only the names of the PDF files that match, add --files-with-matches
pdftotext "file.pdf" /dev/stdout |grep -H --label="file.pdf" --files-with-matches -- "$SEARCH_STRING"

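# build the list of candidate PDF files to search
find "$SEARCH_DIR" -type f -name '*.pdf' > list-of-pdf.txt
# glue everything together with awk and send the generated commands to bash
# a double quote is written as \x22 inside awk
# caution: file names containing " $ or ` are interpreted by bash when the generated commands run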
find "$SEARCH_DIR" -type f -name '*.pdf' |awk -v SEARCH_STRING="$SEARCH_STRING" '{
print "pdftotext \x22"$0"\x22 /dev/stdout | grep -H  --label=\x22"$0"\x22 -- \x22"SEARCH_STRING"\x22"
}' |bash

# Without piping to bash: awk runs each command itself and reads its output, so you can post-process the matches
find "$SEARCH_DIR" -type f -name '*.pdf' |awk -v SEARCH_STRING="$SEARCH_STRING" '{
EXEC="pdftotext \x22"$0"\x22 /dev/stdout | grep -H  --label=\x22"$0"\x22 -- \x22"SEARCH_STRING"\x22";
while((EXEC|getline ret) > 0){
 print "For file ["$0"] we have match ["ret"]";
 # do whatever you like.
}
close(EXEC);
}'
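
A variation on the find/awk pipeline in this answer that never turns file names into shell source text, so names containing quotes or $ are taken literally. Treat it as a sketch under assumptions: the function name pdf_grep_dir is made up for illustration, and it assumes GNU grep (for --label) and a pdftotext on the PATH.

```shell
# Hypothetical helper: search every PDF under directory $1 for the
# extended regular expression $2 and print "file:matching line".
pdf_grep_dir() {
    find "$1" -type f -name '*.pdf' -exec sh -c '
        pattern=$1
        shift
        for pdf in "$@"; do
            # "-" makes pdftotext write the extracted text to stdout.
            pdftotext "$pdf" - | grep -E -H --label="$pdf" -- "$pattern"
        done
    ' sh "$2" {} +
}

# Usage (same variables as in the snippets above):
#   pdf_grep_dir "$SEARCH_DIR" "$SEARCH_STRING"
```

The pattern and the file names travel as positional parameters of a fixed script, so nothing has to be escaped with \x22 and find can hand many files to one sh process.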
  • I guess you did not notice the part of the question that mentioned “Windows XP” or the windows-search tag. I know the question (confusingly) started with “openSUSE”, but there are more Windows references than Linux references, especially when you count his subsequent comment as well.
    – Synetech
    Commented Jul 8, 2012 at 2:42
  • @Synetech: He rejected an answer with "Wingrep is only under Windows", which suggests he wants a Linux solution. Commented Jul 8, 2012 at 4:16
  • @Mechanicalsnail, he rejected it because it is a GUI tool, whereas he asked for a command-line tool.
    – Synetech
    Commented Jul 8, 2012 at 4:21

Under both Linux and Windows, you can use Acrobat Reader, which has a command to search multiple files.

Under Linux, there is Recoll, which will build an index of your PDF files (and more) the first time you run it. After the index is built, word searches should be very fast; phrase searches should be reasonable. Make sure the pdftotext command is installed before you start Recoll; under Debian and Ubuntu, it's in the poppler-utils package. I don't know about SUSE.

Or you could directly convert the files to text and use grep on the text files with the commands below.

find . -name '*.pdf' -exec pdftotext {} \;
grep -r --include '*.txt' -l -F "exact phrase to search" .
grep -r --include '*.txt' -l -E "regular expression to search" .
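
If you search the same directory repeatedly, re-running pdftotext on every PDF each time is wasted work. Here is a rough sketch of an incremental conversion; the function name pdf_refresh_text is invented for this example, and it assumes the .txt files live next to the PDFs, as pdftotext does by default.

```shell
# Hypothetical helper: (re)extract text only for PDFs under directory $1
# whose .txt sibling is missing or older than the PDF.
pdf_refresh_text() {
    find "$1" -type f -name '*.pdf' -exec sh -c '
        for pdf in "$@"; do
            txt="${pdf%.pdf}.txt"
            # find -newer prints the PDF only if it is newer than the text file.
            if [ ! -e "$txt" ] || [ -n "$(find "$pdf" -newer "$txt")" ]; then
                pdftotext "$pdf" "$txt"
            fi
        done
    ' sh {} +
}

# Usage: refresh the text files, then grep them as shown above.
#   pdf_refresh_text . && grep -r --include '*.txt' -l -F "exact phrase to search" .
```

This is a poor man's version of what an indexer such as Recoll does for you; it only avoids repeated extraction, it does not build an index.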
  • Adobe would not allow searching a whole directory; it only searches inside a single file. I want to know the command-line tools first, and if there are GUI tools, that would be nice too.
    – iceman
    Commented Jul 13, 2010 at 19:33
  • Adobe Reader 9 under Linux has an "Edit | Search" menu entry which does allow you to search in all the PDF files in a directory. On the command line, all the methods I'm aware of involve a step of pdftotext (which tools such as Recoll will do automatically). Commented Jul 13, 2010 at 20:03
  • +1 for Recoll. Indexing the files will save time if you have a lot and you search them frequently. Commented Jul 8, 2012 at 8:32

Adobe Reader X does the job: it can search a whole directory and its subdirectories, not just a single file. It is not a command-line program, though.

  • is that in the latest version of Acrobat X? which release?
    – iceman
    Commented Jun 13, 2012 at 2:28
  • I tried the Acrobat indexing tool, and calling it primitive is a compliment. Recoll installed on Debian handily; now I'm trying to make it usable for my Windows-based employees.
    – Krista K
    Commented Aug 26, 2014 at 1:11

To recursively list all files in your home directory that have the PDF file extension and that contain a line that matches the regex ‘[iI]n Haskell’ for example, you can issue:

find ~/ -regextype posix-extended -regex '.*\.pdf' -execdir sh -c 'pdftotext "$0" - | grep -El --label="$PWD${0#?}" "$1"' {} '[iI]n Haskell' \;


  • Though it's not particularly necessary for this example, I've constructed this avoiding the use of -exec or xargs because, for security reasons, I think that it's good practice to get into the habit of doing so. Changing ‘-execdir’ to ‘-exec’ and ‘$PWD${0#?}’ to ‘$0’ should achieve the same result in this instance.
  • Instead of using globs for pattern-matching the filenames, it can be useful to use the greater expressive power of regular expressions and to pattern-match over the whole path. I included the practice here to show how it can be done. Note that the path that is pattern-matched against is the path that would normally be printed. Whether it is relative or absolute depends on the given path argument(s), which if omitted defaults to the current working directory (‘./’). In this example, the paths matched against are all absolute (i.e. begin with ‘/’) because ‘~/’ is expanded to the absolute path of the current user's home directory, and it is the only path argument.
  • The ‘$0’ and ‘$1’ are positional parameters, used so that the filename and the regex reach the inner shell as correctly quoted arguments. If this is not done properly, a filename containing quotes or other shell syntax can break the command or be executed as code.
  • ‘${0#?}’ strips the first character of $0, i.e. the ‘.’.
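
To make the point about positional parameters concrete, here is a small sketch that does not need pdftotext at all. The function name print_pdf_names is made up, and unlike the command above it passes a placeholder ‘sh’ as $0 so that the filename arrives as $1; that is a common idiom, not something the command above does.

```shell
# Hypothetical helper: print the base name of every *.pdf under directory $1,
# handing each path to the inner shell as a positional parameter.
print_pdf_names() {
    # "sh" fills $0 of the inner shell; the path found by find arrives as $1.
    # The script text is a fixed, single-quoted string, so nothing in a
    # filename is ever parsed as shell syntax.
    find "$1" -type f -name '*.pdf' -exec sh -c 'printf "%s\n" "${1##*/}"' sh {} \;
}

# Fragile pattern, shown only for contrast (do not use): with GNU find, {} is
# substituted inside the script text, so quotes or $(...) in a filename would
# be interpreted by the inner shell.
#   find . -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep foo' \;
```

A file called it's "odd" $(echo x).pdf is printed unchanged by the helper, because its name is only ever data.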

To print each matching line preceded by the filename:

find ~/ -regextype posix-extended -regex '.*\.pdf' -execdir bash -c 'pdftotext "$0" - | grep -EH --label="${0:2}" "$1"' {} '[iI]n Haskell' \;

This variant uses ‘-H’ instead of ‘-l’, and labels with the filename rather than the file path. ‘${0:2}’ strips the first two characters of $0, i.e. the ‘./’; it is a bash substring expansion rather than POSIX shell syntax, which is why this variant invokes bash instead of sh.
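
If you would rather stay with plain sh, the POSIX prefix-removal expansion ‘${0#./}’ gives the same result as ‘${0:2}’ for paths that start with ‘./’. A minimal sketch follows; label_for is a name invented for this example.

```shell
# Hypothetical helper: strip a leading "./" using POSIX parameter expansion.
# As in the find commands above, the path is passed to the inner shell as $0.
label_for() {
    sh -c 'printf "%s\n" "${0#./}"' "$1"
}

label_for ./report.pdf    # prints: report.pdf
```

Unlike ‘${0:2}’, this leaves a path untouched when it does not begin with ‘./’.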

Of course, tweak to your needs.


Simplest way I found

Joining all the above answers, this is the simplest way I found:

find . -iname "*.pdf" -exec pdftotext {} - \; | grep -i "what you search"


  1. pdftotext takes only one input file per invocation, so we let find run one pdftotext per file.
  2. Setting the output filename to - makes pdftotext write to stdout, so a) the text of all the PDFs is concatenated, one after the other, and b) we can pipe it into grep. Because the texts are concatenated, grep shows the matching lines but not which PDF they came from.
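
One thing to watch when searching for phrases rather than single words: pdftotext keeps the line breaks of the page layout, so a phrase can be split across two lines and a line-based grep will miss it. A possible workaround, sketched under the assumption that you only need a yes/no answer per file (pdf_has_phrase is a made-up name):

```shell
# Hypothetical helper: succeed if PDF $1 contains the fixed phrase $2,
# ignoring case and treating any run of whitespace (including the line
# breaks pdftotext emits) as a single space.
pdf_has_phrase() {
    pdftotext "$1" - | tr -s '[:space:]' ' ' | grep -qiF -- "$2"
}

# Usage: list the PDFs in the current directory that contain the phrase.
#   for f in ./*.pdf; do
#       pdf_has_phrase "$f" "what you search" && printf '%s\n' "$f"
#   done
```

Write the phrase with single spaces between words, since that is what the text is normalised to before grep sees it.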

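If you don't have pdftotext, install it first. On Debian/Ubuntu-like distros:

apt-get install xpdf

If that does not provide the pdftotext binary on your release, install poppler-utils instead, which is the package an earlier answer names for it.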
