
I am writing a script and, for readability reasons, I am thinking about replacing the ';' separators in my sed expression with pipes.

For example

sed 's/.*@@//;s/[[:space:]].*//;s/\(.*\\\).*/\1LATEST/'

Would become

sed 's/.*@@//' | sed 's/[[:space:]].*//' | sed 's/\(.*\\\).*/\1LATEST/'

I know a pipe has a cost, but I guess the ';' in sed also has a cost.

Could they be equivalent? If not, how bad could it be in a loop of thousands of iterations?
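For context, the three substitutions strip everything up to `@@`, truncate at the first whitespace, and replace everything after the last backslash with `LATEST`. The sample line below is hypothetical (the real input format is not shown in the question), but it illustrates each step:

```shell
# Hypothetical input line; the actual data format is an assumption.
printf '%s\n' 'file.c@@\main\7 checkedout' |
  sed 's/.*@@//;s/[[:space:]].*//;s/\(.*\\\).*/\1LATEST/'
# prints: \main\LATEST
```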

  • Test it and find out; that's the only way to know. Once you do, you can answer your own question here to let everyone know which is faster with your particular input data. Commented May 8, 2018 at 12:41
  • If you're writing it in a script, can't you just leave the quote open and continue on the next line if it's only for readability? Or how about using -e to add the commands individually?
    – daniu
    Commented May 9, 2018 at 5:38
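The suggestion in the comment above can be sketched like this: one sed process, one expression per `-e` flag, laid out for readability (functionally the same as the ';'-separated version):

```shell
# One sed process; each -e adds one command to the same script
sed -e 's/.*@@//' \
    -e 's/[[:space:]].*//' \
    -e 's/\(.*\\\).*/\1LATEST/'
```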

1 Answer


This is actually an interesting question: using extra pipelines costs more CPU processing time, but it can also finish faster for large inputs on multicore CPUs, thanks to parallelization.

Case #1: large inputs

I used the following command to construct input and time your commands:

time echo N | awk '{ for(i=0;i<$0;i++) print i"@@\n "i"\n"i"\\" }' | COMMAND > /dev/null

where N is an integer and tells AWK how long the test input should be, and COMMAND is the command (or a pipeline) you want to time.
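For example, with N = 2 the generator emits three lines per iteration, matching the shape the sed script expects:

```shell
echo 2 | awk '{ for(i=0;i<$0;i++) print i"@@\n "i"\n"i"\\" }'
# 0@@
#  0
# 0\
# 1@@
#  1
# 1\
```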

I ran the tests for N = 10,000,000 on a 2-core machine:

Single sed version:

time echo 10000000 | awk '{ for(i=0;i<$0;i++) print i"@@\n "i"\n"i"\\" }' | sed 's/.*@@//;s/[[:space:]].*//;s/\(.*\\\).*/\1LATEST/' > /dev/null

Result:

real    1m26.714s
user    1m35.196s
sys     0m1.212s

Pipelined sed version:

time echo 10000000 | awk '{ for(i=0;i<$0;i++) print i"@@\n "i"\n"i"\\" }' | sed 's/.*@@//' | sed 's/[[:space:]].*//' | sed 's/\(.*\\\).*/\1LATEST/' > /dev/null

Result:

real    0m56.280s
user    1m46.404s
sys     0m0.972s

As you can see, even though the extra pipelines add about 11 seconds of processing time (user+sys), the command actually takes about 30 seconds less real time to finish, because the output of each of the three sed commands is processed by the next one while the previous is still working. On my machine the real time is almost precisely half the CPU time, which indicates efficient use of both CPU cores.

For single-core machines, however, extra pipelining will only add unnecessary overhead, slowing down processing.


Case #2: line-by-line processing

On the other hand, if you are writing a bash script and using your sed commands to process individual lines (which you should not do), the output is probably too small for the above parallelization effect to be observable, and the single sed version will be much more efficient.

Here are the timings for just 10,000 lines processed one-by-one:

time for ((i=1;i<=10000;i++)); do printf "$i@@\n $i\n$i\\ \n" | sed 's/.*@@//;s/[[:space:]].*//;s/\(.*\\\).*/\1LATEST/'; done > /dev/null

Result:

real    0m27.430s
user    0m2.772s
sys     0m4.224s

Pipelined sed:

time for ((i=1;i<=10000;i++)); do printf "$i@@\n $i\n$i\\ \n" | sed 's/.*@@//' | sed 's/[[:space:]].*//' | sed 's/\(.*\\\).*/\1LATEST/'; done > /dev/null

Result:

real    0m57.274s
user    0m3.704s
sys     0m7.776s

As you can see, the pipelined sed is more than twice as slow as the single sed command.

Note that using a single sed pipeline on a large input (as in Case #1) works at least 1000 times faster than processing similar input line-by-line (as in Case #2).
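If the per-line loop comes from a real script, the usual fix is to move sed outside the loop, so one process handles all the lines. A sketch, assuming the loop body only generates text (note the use of `printf` with a fixed format string rather than interpolating `$i` into the format, which is safer):

```shell
# One sed invocation for the whole loop's output,
# instead of one sed per iteration
for ((i=1;i<=10000;i++)); do
  printf '%s@@\n %s\n%s\\ \n' "$i" "$i" "$i"
done | sed 's/.*@@//;s/[[:space:]].*//;s/\(.*\\\).*/\1LATEST/'
```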

  • Thank you very much for your response, I learned a lot today! You nailed it with your shell loop recommendation, by the way. Commented May 15, 2018 at 7:35
