
I have this simple bash script:

#!/bin/bash

for i in {1..1000000}
do
    echo "hello ${i}"
done

That prints a message a million times.

I'm trying to compare the performance of dumping all the output into a single file vs. using tee to split the output into two files.

$ time ./run.sh > out1.txt

real    0m9.535s
user    0m6.678s
sys     0m2.803s
$ 
$ time ./run.sh | tee out2.txt > out1.txt

real    0m6.705s
user    0m6.895s
sys     0m5.214s

It turns out that writing to two files simultaneously is faster than only writing to one (these results were consistent across multiple runs).

Can anyone explain how this is possible? Also, how should I interpret the user and sys outputs of time?  Why is sys time higher when using tee?

  • Depending on what run.sh does, it might be an effect of caching. You should run both variants several times, alternating between them. Writing two files could use more system time.
    – Bodo
    Commented Aug 11, 2022 at 16:21
  • 3
    This is very relevant: How can I time a pipe? But still doesn't explain why tee is faster, it just clarifies that you're not really measuring the time taken right.
    – terdon
    Commented Aug 11, 2022 at 16:45
  • 4
    I'm guessing this is due to buffering. The tee variant may be able to write larger chunks (possibly whatever size the pipe buffer is), whereas the first variant may write line by line. Benchmarking with hyperfine on my FreeBSD system shows that the tee variant is about twice as fast. This is not an answer because I'm just waving my arms in the air.
    – Kusalananda
    Commented Aug 11, 2022 at 16:59
  • 3
    See also bash run.sh | cat >file. It would be as fast as the tee variant.
    – Kusalananda
    Commented Aug 11, 2022 at 17:18
  • 1
    @t0w0i7ne I don't get that. Why would the shell flush every line when stdout is redirected to a block device, yet buffer when stdout is redirected to a pipe? I can see the difference between a 4K file buffer and a 64K pipe buffer (although the file would be cached with write-behind), but surely a block device would not be line-buffered?
    Commented Aug 12, 2022 at 6:30

1 Answer


First, let's clear up a couple of things.

  • Flushing

The shell flushes its output after every built-in command, such as echo in this case.

It has to: it must mimic the behavior of an external command. (Where possible, as here, echo is a shell built-in rather than the external /bin/echo purely for performance reasons.) External commands obviously have to flush (or discard for good) their output when they quit, so the built-in echo has to behave the same way.

It's the expected behavior that if an echo is followed by a long wait or computation in your script, the echo-ed data is already available on the output.

This behavior is independent of what the standard output is connected to. It could be the terminal, a pipe, or a file; it doesn't matter, each echo is flushed independently. (Not to be confused with the default behavior of libc, where buffering depends on whether the standard output goes to a terminal or not, as you can see with most of the standard utilities, like cat, grep, head, and so on, and of course tee as well. The shell flushes explicitly after every built-in command rather than relying on libc's default buffering.)
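
If you want to see libc's default buffering for yourself, here's a small sketch (assuming GNU grep and coreutils' stdbuf; the exact commands are just an illustration):

$ { echo one; sleep 2; echo two; } | grep .
$ { echo one; sleep 2; echo two; } | grep . | cat

In the first pipeline, grep's stdout is the terminal, so libc line-buffers and "one" shows up immediately. In the second, grep's stdout is a pipe, so libc fully buffers and both lines only appear together after about two seconds (stdbuf -oL grep . would restore line buffering there).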

Use strace ./run.sh > out1.txt to see a million write() calls performed by your shell.
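
If you'd rather see just a count than a million log lines, strace's summary mode should do; a sketch (the summary is printed to stderr, so it won't end up in out1.txt):

$ strace -c -e trace=write ./run.sh > out1.txt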

  • Multi core

I assume that you have multiple CPU cores in your system, and no significant load due to other processes. In this setup the kernel assigns bash run.sh to one of the cores, and tee to another. This way no heavy-weight process switching takes place, and if they both can actually run then they both do run simultaneously.

Presumably if you confine the two processes to a single core (I believe you can do this with the taskset command; I'll let you experiment with it, a rough sketch follows below), then you'd get vastly different results, with tee significantly slowing things down. It's not just an extra process that needs to run serially, interleaved with run.sh; the kernel would also need to switch between the two processes many times, and this switching is itself quite costly.
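
A rough sketch of that experiment, assuming Linux's taskset and picking core 0 arbitrarily (the CPU affinity is inherited by both sides of the pipe):

$ time taskset -c 0 sh -c './run.sh | tee out2.txt > out1.txt'

For a fair comparison, time the single-file variant pinned to the same core as well, e.g. time taskset -c 0 ./run.sh > out1.txt.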

  • The time command

time measures the entire pipeline, that is, run.sh and tee combined. If you're interested in measuring one of the commands only, invoke time in a subshell, like:

$ ( time ./run.sh ) | tee out2.txt > out1.txt

$ ./run.sh | ( time tee out2.txt ) > out1.txt

For the real time, the wall clock time elapsed is printed. That is, as if you literally printed the timestamp before and after the pipeline and computed the difference, or used an external stopwatch. If two processes, each spinning one CPU core for 10 seconds, are running in a pipeline while both of them can run all the time, fully in parallel, the real time will be 10 seconds.

user and sys time, however, add up across CPU cores. If two parallel processes, each on their own CPU core, spin the CPU to its maximum for 10 seconds (the real time being 10 seconds as we've just seen), the user time will be 20 seconds.
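
A quick way to convince yourself of this, assuming coreutils' timeout and yes and at least two idle cores, is to time two CPU hogs running in parallel:

$ time ( timeout 3 yes > /dev/null & timeout 3 yes > /dev/null & wait )

real should stay close to 3 seconds, while user plus sys should add up to roughly 6 seconds of CPU time.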


Now, let's put these all together:

There's only one question left to answer: Why is it faster to write a tiny chunk of data to a pipe than to a file?

I don't have a direct answer for this; I'm just working backwards from the timing results you measured, which imply that it must be faster to write to a pipe than to a file. What follows is somewhat speculative, but hopefully reasonable.

Pipes have a fixed size (64 kB by default on Linux, I believe). It could easily be that this entire buffer is allocated when the pipe is created, so no further dynamic allocation takes place in the kernel. (If the pipe fills up, the writer blocks until some space is freed by the reader.) For files, however, there's no such limit. Whatever is passed from user space to the kernel has to be copied there (it's not feasible to block the writer until the data is actually written to disk). Therefore I find it likely that some sort of dynamic memory allocation takes place when writing to a file, making this part of the story more expensive.

In the case of pipes, the only additional thing the kernel might need to do is wake up any process that has just become able to run, i.e. was waiting for data to appear in the pipe. In the case of files, the kernel needs to update the file's in-memory metadata (size, modification time) and start a timer (or update an existing one) to schedule eventually writing the data out to disk.

There's no strict rule that writing to a file has to be more expensive than writing to a pipe; it just apparently happens to be, as demonstrated by the numbers you measured.

By adding tee, you happen to reduce the work required by run.sh, since its million write()s now happen to be somewhat cheaper. This lets run.sh as a whole run faster, and thus results in a smaller wall clock time.

You add a second process which runs mostly in parallel to it, and presumably does less work. It uses buffered writes for both of its output files, i.e. only a few syscalls compared to the unbuffered case. For its input, it might perform a million tiny read()s, but I'd guess that due to randomness in timing and whatnot, many of bash's write()s will be combined and arrive in a single read(), so tee probably needs noticeably fewer than a million read()s. (It would be really cool to see how many read()s it performs. strace'ing is not an option, since the measuring itself would significantly alter the timings. I'd go with patching tee to increment a counter on each read() and dump that number at the end. I'll leave it to the dear reader as an exercise.)
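
If you just want a ballpark count and can live with some perturbation, Linux's perf can count the syscall tracepoint with far less overhead than strace; a sketch, assuming perf is installed and you're allowed to use the syscalls tracepoints (the report goes to stderr):

$ ./run.sh | perf stat -e syscalls:sys_enter_read tee out2.txt > out1.txt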

So, tee is faster than run.sh and as such doesn't delay the completion of the pipeline. However, it adds its own share to the user and sys times, making them larger than they were before.


Update:

I was curious so I patched tee to see how many times it read()s.

With just one terminal on the desktop, it's around 660 000 - 670 000. With a browser open in the background with a page or two, it's around 500 000 - 600 000. With the browser just starting up (heavier work), it's around 400 000. Makes sense: the busier the system, the more likely that tee doesn't read its data immediately and some of bash's write()s can accumulate. You get the idea, and now the rough numbers as well, which of course can vary a lot across computers.
