
I used to have a script like the following:

for i in $(cat list.txt)
do
  grep $i sales.txt
done

Where cat list.txt gives:

tomatoes
peppers
onions

And cat sales.txt gives:

Price Products
$8.88 bread
$6.75 tomatoes
$3.34 fish
$5.57 peppers
$0.95 beans
$4.56 onions

I am a beginner in Bash/shell scripting, and after reading posts like "Why is using a shell loop to process text considered bad practice?" I changed the previous script to the following:

grep -f list.txt sales.txt

Is this last way of doing it really better than using a for loop? At first I thought it was, but then I realized it is probably the same, since grep has to read the query file each time it greps a different line in the target file. Does anyone know if it's actually better, and why? If it's better somehow, I'm probably missing something about how grep processes this task, but I can't figure it out.

  • It's easier to read, the loop logic is written into the program itself so it's almost definitely faster, only a single program has to be called... I can't think of a scenario where grep -f list.txt sales.txt wouldn't be considered the "better" option. I suppose if you needed some intermediate processing between switching through your patterns in list.txt and grepping, then maybe a loop depending on what that was... maybe...
    – JNevill
    Commented Nov 23, 2018 at 20:55

2 Answers


Expanding on my comment...

You can download the source for grep via git with:

 git clone https://git.savannah.gnu.org/git/grep.git

You can see at line 96 of src/grep.c a comment:

/* A list of lineno,filename pairs corresponding to -f FILENAME
   arguments. Since we store the concatenation of all patterns in
   a single array, KEYS, be they from the command line via "-e PAT"
   or read from one or more -f-specified FILENAMES.  Given this
   invocation, grep -f <(seq 5) -f <(seq 2) -f <(seq 3) FILE, there
   will be three entries in LF_PAIR: {1, x} {6, y} {8, z}, where
   x, y and z are just place-holders for shell-generated names.  */

Which is about all the clue we need: the patterns being searched, whether they come in through -e on the command line or from a file via -f, are concatenated into a single array. That array is then the source of the search, and moving through it in C is going to be faster than your shell looping through a file. So this alone will win the speed race.
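To see that single-pass behavior with the question's own data, here is a quick sketch (file names and contents are copied from the question; the output lines are what GNU grep prints for these plain-string patterns):

```shell
# Reproduce the files from the question.
printf '%s\n' tomatoes peppers onions > list.txt
printf '%s\n' 'Price Products' '$8.88 bread' '$6.75 tomatoes' \
  '$3.34 fish' '$5.57 peppers' '$0.95 beans' '$4.56 onions' > sales.txt

# One process, one pass over sales.txt: every pattern in the in-memory
# array is tried against each input line in turn.
grep -f list.txt sales.txt
# prints:
#   $6.75 tomatoes
#   $5.57 peppers
#   $4.56 onions
```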

Also, as I mentioned in my comment, the grep -f list.txt sales.txt is easier to read, easier to maintain, and only a single program (grep) has to be invoked.
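One informal way to convince yourself the pattern file is read only once (my own check, not from grep's documentation; requires bash for the process substitution): hand grep its patterns through <(...), which is a pipe and cannot be rewound.

```shell
# Same files as in the question.
printf '%s\n' tomatoes peppers onions > list.txt
printf '%s\n' 'Price Products' '$8.88 bread' '$6.75 tomatoes' \
  '$3.34 fish' '$5.57 peppers' '$0.95 beans' '$4.56 onions' > sales.txt

# <(cat list.txt) can only be read a single time. If grep re-read the
# patterns for every input line, matching would break after the first
# line -- instead, all three matches appear, so the patterns were
# loaded up front.
grep -f <(cat list.txt) sales.txt
```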

  • The time saving is more likely to come from doing a single execution with a single file pass, not from C iterating a small array faster than Bash.
    Commented Nov 23, 2018 at 21:24
  • This is pretty much the explanation I was looking for. I didn't consider that grep works in C and that this would give it an edge over a pure bash search through a file. This makes sense, thank you.
    – nsa
    Commented Nov 23, 2018 at 23:14

Your second version is better because:

  1. It only requires a single pass over the file (it does not need multiple passes over the pattern file, as you assumed)
  2. It has no globbing and spacing bugs (your first attempt behaves poorly for green beans or /*/*/*/*)

It's totally fine to read files purely in shell code when 1. you do it correctly and 2. the overhead is negligible, but neither really applies to your first example (except for the fact that the files are currently small).
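To make the second point concrete, here is a sketch of the word-splitting bug and a corrected shell loop (the green beans data is hypothetical, in the question's format):

```shell
# Hypothetical data with a space inside one pattern.
printf 'green beans\n' > list.txt
printf '%s\n' '$1.25 green beans' '$2.00 beans' > sales.txt

# Buggy: $(cat list.txt) is word-split, so this runs `grep green sales.txt`
# and then `grep beans sales.txt`, wrongly matching the plain beans line.
for i in $(cat list.txt); do grep $i sales.txt; done

# A correct shell loop: read whole lines, quote the expansion, and use -e
# so a pattern starting with a dash isn't mistaken for an option.
while IFS= read -r pattern; do
  grep -e "$pattern" sales.txt
done < list.txt
# prints only: $1.25 green beans
```

(Even this corrected loop still invokes grep once per pattern and rescans sales.txt each time, which is why grep -f list.txt sales.txt remains the better tool here.)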
