
I am trying to find the fastest way to list all files whose content matches any of several strings. I am using xargs to iterate over the strings.


$ cat ../Identifiers.list | xargs -i grep -l "{}" .

This took around 8 minutes to print all the file names. Is there a faster way?


Contents of Identifiers.list:

287434
383460
633491
717255
827734
253735
635373
553888
910366

Number of files in the directory: 36000

$ ls -l *.xml | wc -l
36000
  • As Benjamin suggests, grep -f is what you're looking for. Commented Feb 1, 2019 at 20:33

2 Answers

5

I'd do it the other way around:

printf '%s\0' *.xml | xargs -0 grep -lFf ../Identifiers.list

This reads each file only once, checking it against all identifiers in a single pass, and -l makes grep stop reading a file as soon as it finds a match. -F uses fixed-string matching instead of regular expressions, which should speed things up further.

I think your approach implicitly uses -L 1 (because of -i), so xargs runs a separate grep for each line of Identifiers.list, and each of those runs goes through all files.
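To see the difference on a small scale, here is a minimal sketch on made-up scratch data. The temporary directory, the three tiny XML files and the two-line identifier list are assumptions for illustration, not taken from the question. Both variants list the same files; the first starts one grep per identifier, the second starts one grep in total.

```shell
# Hypothetical scratch data: three small XML files and a two-line identifier list.
tmp=$(mktemp -d)
mkdir "$tmp/xml"
printf '287434\n383460\n' > "$tmp/Identifiers.list"
printf '<id>287434</id>\n' > "$tmp/xml/a.xml"
printf '<id>383460</id>\n' > "$tmp/xml/b.xml"
printf '<id>000000</id>\n' > "$tmp/xml/c.xml"
cd "$tmp/xml" || exit 1

# One grep per identifier: every file is read once per line of the list.
# sort -u removes duplicates in case a file matches several identifiers.
per_id=$(xargs -I{} grep -l -- "{}" *.xml < ../Identifiers.list | sort -u)

# One pass: each file is read once and tested against all identifiers.
one_pass=$(grep -lFf ../Identifiers.list *.xml | sort)

printf '%s\n' "$one_pass"
```

With 9 identifiers and 36000 files, the per-identifier variant reads the whole directory 9 times, which is where most of the time goes.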

Potentially even faster with parallelization, for example with four parallel processes:

printf '%s\0' *.xml | xargs -0 -P 4 grep -lFf ../Identifiers.list

For even more speedup, if your files are ASCII, you could use LC_ALL=C:

printf '%s\0' *.xml | LC_ALL=C xargs -0 -P 4 grep -lFf ../Identifiers.list
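One caveat with -P: xargs packs as many file names as fit into each grep invocation, so a short file list may end up in a single batch with nothing left to run in parallel. The sketch below caps the batch size with -n, which is my addition and not part of the commands above. The scratch files and the batch size of 2 are made up for illustration; for 36000 files a value in the hundreds or thousands is probably more sensible. Output order is not deterministic with -P, hence the sort.

```shell
# Hypothetical scratch data: eight small files, two of which contain the identifier.
tmp=$(mktemp -d)
printf '287434\n' > "$tmp/Identifiers.list"
mkdir "$tmp/xml"
cd "$tmp/xml" || exit 1
for i in 1 2 3 4 5 6 7 8; do
    printf '<id>000000</id>\n' > "f$i.xml"
done
printf '<id>287434</id>\n' > f3.xml
printf '<id>287434</id>\n' > f7.xml

# -n 2 : at most two file names per grep invocation (assumed value for the demo)
# -P 4 : run up to four grep processes at the same time
matches=$(printf '%s\0' *.xml |
    LC_ALL=C xargs -0 -n 2 -P 4 grep -lFf ../Identifiers.list | sort)

printf '%s\n' "$matches"
```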

Using xargs is a good idea, though, even without parallelization: using grep directly, as in

grep -lFf ../Identifiers.list *.xml

might fail with an "Argument list too long" error because *.xml expands to a command line that exceeds the system limit.
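If you want a rough idea of how close you are to that limit, something like the following may help. It is only an estimate: the kernel also counts the environment and some per-argument overhead, so treat the comparison as a sketch. The scratch directory and its three file names are made up for the demo. Note that printf is a shell builtin in most shells, so the glob is expanded without starting a new process, which is why piping it into xargs avoids the limit.

```shell
# Hypothetical scratch directory with three files, for illustration only.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch a.xml b.xml c.xml

# System limit, in bytes, on arguments plus environment for a new process.
arg_max=$(getconf ARG_MAX)

# Approximate bytes the expanded glob would occupy (each name plus a NUL byte).
list_bytes=$(( $(printf '%s\0' *.xml | wc -c) ))

echo "ARG_MAX: $arg_max bytes, file list: about $list_bytes bytes"
```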

  • Would using find and its -exec variant with + be an improvement? It would pass as many arguments as possible to each invocation of grep, reducing the number of invocations.
    – cody
    Commented Feb 1, 2019 at 20:24
  • Thanks for the suggestion. Will try that and post results here.
    – bharath
    Commented Feb 1, 2019 at 20:33
  • @Benjamin - Thanks to your wisdom. Using 4 parallel processes, it returned the results 3.3x faster.
    – bharath
    Commented Feb 1, 2019 at 21:18
  • I'm just curious - what is the final time? And what is the total data size (like cat *.xml | wc -c)?
    – liborm
    Commented Feb 1, 2019 at 21:23
  • @cody find -exec {} + should get the same behaviour as | xargs in terms of minimizing the number of calls to grep. I usually prefer find -exec {} + over xargs, but the one thing xargs can do that find can't is parallel execution. There's also GNU parallel, but I don't know it well. Commented Feb 1, 2019 at 22:17
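For reference, the find variant discussed in these comments could look roughly like this. It is a sketch on made-up scratch files, not a command anyone in the thread posted. -maxdepth 1 is a widely supported extension rather than POSIX, and it is there only to keep find from descending into subdirectories.

```shell
# Hypothetical scratch data: one matching and one non-matching file.
tmp=$(mktemp -d)
printf '287434\n' > "$tmp/Identifiers.list"
mkdir "$tmp/xml"
printf '<id>287434</id>\n' > "$tmp/xml/a.xml"
printf '<id>111111</id>\n' > "$tmp/xml/b.xml"
cd "$tmp/xml" || exit 1

# "{} +" makes find append as many file names as fit to each grep invocation,
# much like xargs does, but without parallel execution.
found=$(find . -maxdepth 1 -type f -name '*.xml' \
    -exec grep -lFf ../Identifiers.list {} +)

printf '%s\n' "$found"
```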
0

Put the strings into one regular expression:

(?:287434|383460|633491|717255|827734|253735|635373|553888|910366)

and then grep:

grep -lP '(?:287434|383460|633491|717255|827734|253735|635373|553888|910366)' *
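The alternation does not have to be typed by hand. A possible sketch, assuming one identifier per line and no regex metacharacters in the identifiers (true for the digits shown in the question): join the lines with paste and hand the result to grep. I use -E here instead of -P, since a plain alternation does not need PCRE, and -l so that only file names are printed. The scratch files are made up for the demo.

```shell
# Hypothetical scratch data: two identifiers, three small files.
tmp=$(mktemp -d)
printf '287434\n383460\n' > "$tmp/Identifiers.list"
mkdir "$tmp/xml"
printf '<id>287434</id>\n' > "$tmp/xml/a.xml"
printf '<id>999999</id>\n' > "$tmp/xml/b.xml"
printf '<id>383460</id>\n' > "$tmp/xml/c.xml"
cd "$tmp/xml" || exit 1

# Join all lines of the list with "|" to form a single alternation.
pattern=$(paste -s -d '|' ../Identifiers.list)

# -l prints only the names of matching files; -E enables the "|" alternation.
files=$(grep -lE -- "$pattern" *.xml)

printf '%s\n' "$files"
```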
