
I am looking for a tool that will be faster than grep, maybe a multi-threaded grep, or something similar... I have been looking at a bunch of indexers, but I am not sold that I need an index...

I have about 100 million text files that I need to grep for exact string matches. Upon finding a match, I need the filename where it was found.

i.e.: grep -r 'exact match' > filepaths.log

It's about 4 TB of data, and I started my first search 6 days ago; grep is still running. I have another dozen searches to go, and I can't wait 2 months to retrieve all these filenames =]

I've reviewed the following; however, I don't think I need all the bells and whistles these indexers come with. I just need the filename where the match occurred...

  • dtSearch
  • Terrier
  • Lucene
  • Xapian
  • Recoll
  • Sphinx

and after spending hours reading about all those engines, my head is spinning, and I wish I just had a multi-threaded grep lol. Any ideas and/or suggestions are greatly appreciated!

PS: I am running CentOS 6.5

EDIT: Searching for multi-threaded grep returns several results. My question is: is a multi-threaded grep the best option for what I am doing?

EDIT2: After some tweaking, this is what I have come up with, and it is going much faster than regular grep. I still wish it was faster, though... I am watching my disk I/O wait and it's not building up yet, so I may do some more tweaking, and I'm def still interested in any suggestions =]

find . -type f -print0 | xargs -0 -n10 -P4 grep -m 1 -H -l 'search string'
  • So you actually plan to search for more than one string, right? Multithreading won’t help because you’re limited by disk throughput and (more importantly) seek performance.
    – Daniel B
    Commented Dec 20, 2014 at 10:41
  • Ya, the disks are def the bottleneck here
    Commented Dec 20, 2014 at 11:28
  • "100 million text files"... really? And the approach to take really all depends on whether this is a one-time thing or whether the data really needs to be indexed for future use.
    – Tyson
    Commented Dec 20, 2014 at 14:54
  • Ya... really. lol =] It's more or less a one-time thing, for about 2 dozen searches in total.
    Commented Dec 20, 2014 at 16:14

2 Answers


grep is I/O bound, meaning its speed is dominated by how fast it can read the files it is searching. Multiple searches in parallel can compete with each other for disk I/O, so you may not see much speedup.
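
If you want to confirm that the disks are the limiting factor on your system, watching them while one of your searches runs is usually enough. A minimal check, assuming the sysstat package is installed, might be:

# Print extended per-device statistics every 5 seconds; a %util close to
# 100 and a high await mean the disks, not the CPU, are the bottleneck.
iostat -x 5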

If you just need the matching filenames, and not the actual matches found in the files, then you should run grep with the -l flag. This flag causes grep to print only the names of files that contain a match, not the matching lines. The value here is that it permits grep to stop searching a file once it has found a match, so it can reduce the amount of work grep has to do.
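
For the kind of search described in the question, that might look something like this (the pattern and path are placeholders):

# -l prints only the names of matching files and lets grep stop reading
# each file after its first match
grep -r -l 'exact match' /path/to/files > filepaths.log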

If you're searching for fixed strings rather than regular expressions, then you could try fgrep (equivalent to grep -F) rather than grep. fgrep is a variant of grep that treats the pattern as a fixed string, and a fixed-string search can be faster than a regular-expression search. You may or may not see any improvement from this, because modern versions of grep are probably smart enough to optimize fixed-string searches anyway.
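
A minimal sketch of the same search as a fixed-string match (again with a placeholder pattern and path); note that no regex metacharacters in the pattern need escaping here:

# fgrep/grep -F take the pattern literally rather than as a regex
fgrep -r -l 'exact match' /path/to/files > filepaths.log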

If you want to try running multiple searches in parallel, you could do it using shell utilities. One way would be to build a list of filenames, split it into parts, and run grep separately for each list:

find /path/to/files -type f -print | split -l 10000000 - list.
for file in list.*; do
    xargs -d '\n' grep -l 'some text' < "${file}" > "${file}.out" &
done
wait
cat list.*.out > filepaths.log
rm list.*

This uses find to list the files, splits the list of filenames into groups of ten million, and starts a background job for each group that feeds its filenames to grep via xargs. The outputs of the greps are all joined together at the end. This ought to work for files with typical names, but it would fail for files with newlines in their names, for example.

Another approach uses xargs. First, you'd have to write a simple shell script that runs grep in the background:

#!/bin/bash
grep -l 'search text' "$@" >> grep.$$.out &

This will run grep on the list of files specified as arguments to the script, writing the results to a file named after the script's PID. The grep itself runs in the background, so the script returns immediately and xargs can move on to the next batch.

Then you'd run the script like this:

find /path/to/files -type f -print0 | xargs -0 -r /my/grep/script
[ wait for those to finish ]
cat grep.*.out > filepaths.log
rm grep.*.out

In this case, xargs will bundle the filenames into groups and run the script once for each group; each script invocation launches one background grep. Once all of the grep instances have finished, you can combine their outputs. Unfortunately, I couldn't think of a clever way to automatically wait for the grep instances to finish here, so you might have to do that manually.
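
If your xargs is the GNU version (it is on CentOS), a simpler alternative is to let xargs itself manage the parallelism with -P, which also removes the need to wait by hand. A rough sketch, with the pattern, batch size, and process count as placeholders to tune (this is essentially the command from the question's second edit):

# -0 matches find's -print0, -n 1000 passes up to 1000 files per grep,
# and -P 4 keeps four greps running at a time; the pipeline does not
# finish until every grep has exited.
find /path/to/files -type f -print0 \
    | xargs -0 -r -n 1000 -P 4 grep -F -l 'exact match' > filepaths.log

One caveat: with several greps writing to the same output file at once, lines can occasionally interleave; the per-group output files used above avoid that.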

  • If you don't need regex, one benefit of fgrep is you don't have to worry about escaping reserved characters, e.g. fgrep '..' instead of grep '\.\.'.
    – thdoan
    Commented Jul 5, 2016 at 5:16
  • 1
    Grep isn't always I/O bound. I'm currently running a CPU-bound grep.
    – iAdjunct
    Commented Apr 20, 2019 at 2:22
  • @Kenster In the era of Gb/s SSDs, not being able to process a single 30 TiB file with multiple threads is really annoying.
    Commented Mar 11, 2021 at 19:27

Sounds like you need a script or small program that will run multiple instances of grep in parallel (e.g. 8 greps on a modern i7 with 4 cores / 8 threads) and concatenate or merge the output, more than you need a faster grep.

How to make such a script is a whole other question, but that's the way I would attack your problem.
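
As a rough sketch of what that could look like, if GNU parallel happens to be available (it is not in the base CentOS repositories, so it may need to be installed separately), something along these lines would spread the work across eight greps and merge their output:

# -0 reads the NUL-separated filenames from find, -j8 keeps up to eight
# grep processes running, and -X packs as many filenames as possible
# into each grep call; parallel groups each job's output so lines from
# different greps don't interleave.
find /path/to/files -type f -print0 \
    | parallel -0 -j8 -X grep -F -l 'exact match' > filepaths.log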

  • If the files are on several drives, maybe, but this is I/O bound, not CPU bound.
    Commented Mar 29, 2018 at 11:02
