[Edit: Read accepted answer first. The long investigation below stems from a subtle blunder in the timing measurement.]
I often need to process extremely large (100GB+) text/CSV-like files containing highly redundant data that cannot practically be stored on disk uncompressed. I rely heavily on external compressors like lz4 and zstd, which produce stdout streams approaching 1GB/s.
As such, I care a lot about the performance of Unix shell pipelines. But large shell scripts are difficult to maintain, so I tend to construct pipelines in Python, stitching commands together with careful use of `shlex.quote()`.

This process is tedious and error-prone, so I'd like a "Pythonic" way to achieve the same end, managing the stdin/stdout file descriptors in Python without offloading to `/bin/sh`. However, I've never found a method of doing this without greatly sacrificing performance.
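For concreteness, the string-building this involves looks roughly like the following sketch (the filenames and patterns here are made up for illustration):

```python
from shlex import quote

# Hypothetical pipeline: decompress a zstd stream and filter it.
# Each argv is built by hand, then quoted token-by-token before
# being handed to /bin/sh as a single string.
commands = [
    ["zstd", "-dc", "huge file.csv.zst"],
    ["grep", "pattern with spaces"],
]
pipeline = " | ".join(
    " ".join(quote(token) for token in cmd) for cmd in commands
)
print(pipeline)
# zstd -dc 'huge file.csv.zst' | grep 'pattern with spaces'
```

Every token has to pass through `quote()`, and the result is still ultimately a single string interpreted by the shell.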
Python 3's documentation recommends replacing shell pipelines with the `communicate()` method on `subprocess.Popen`. I've adapted this example to create the following test script, which pipes 3GB of `/dev/zero` into a useless `grep`, which outputs nothing:
```python
#!/usr/bin/env python3
from shlex import quote
from subprocess import Popen, PIPE
from time import perf_counter

BYTE_COUNT = 3_000_000_000

UNQUOTED_HEAD_CMD = ["head", "-c", str(BYTE_COUNT), "/dev/zero"]
UNQUOTED_GREP_CMD = ["grep", "Arbitrary string which will not be found."]

QUOTED_SHELL_PIPELINE = " | ".join(
    " ".join(quote(s) for s in cmd)
    for cmd in [UNQUOTED_HEAD_CMD, UNQUOTED_GREP_CMD]
)

perf_counter()
proc = Popen(QUOTED_SHELL_PIPELINE, shell=True)
proc.wait()
print(f"Time to run using shell pipeline: {perf_counter()} seconds")

perf_counter()
p1 = Popen(UNQUOTED_HEAD_CMD, stdout=PIPE)
p2 = Popen(UNQUOTED_GREP_CMD, stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()
p2.communicate()
print(f"Time to run using subprocess.PIPE: {perf_counter()} seconds")
```
Output:

```
Time to run using shell pipeline: 2.412427189 seconds
Time to run using subprocess.PIPE: 4.862174164 seconds
```
The `subprocess.PIPE` approach is more than twice as slow as `/bin/sh`. If we raise the input size to 90GB (`BYTE_COUNT = 90_000_000_000`), we confirm this is not a constant-time overhead:
```
Time to run using shell pipeline: 88.796322932 seconds
Time to run using subprocess.PIPE: 183.734968687 seconds
```
My assumption up to now was that `subprocess.PIPE` is simply a high-level abstraction for connecting file descriptors, and that data is never copied into the Python process itself. As expected, when running the above test, `head` uses 100% CPU but `subproc_test.py` uses near-zero CPU and RAM.
Given that, why is my pipeline so slow? Is this an intrinsic limitation of Python's `subprocess`? If so, what does `/bin/sh` do differently under the hood that makes it twice as fast?

More generally, are there better methods for building large, high-performance subprocess pipelines in Python?
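For reference, the two-process wiring in the script above generalizes to N stages. This is only a sketch of that generalization (the helper name `run_pipeline` is mine, not from any library):

```python
from subprocess import DEVNULL, PIPE, Popen

def run_pipeline(cmds, stdout=None):
    """Chain argv lists stdin->stdout without invoking /bin/sh.

    Sketch only: no error handling, and the exit status returned is
    just the last stage's, unlike the shell's `pipefail` behavior.
    """
    procs = []
    prev_stdout = None
    for i, cmd in enumerate(cmds):
        last = i == len(cmds) - 1
        p = Popen(cmd, stdin=prev_stdout,
                  stdout=stdout if last else PIPE)
        if prev_stdout is not None:
            prev_stdout.close()  # parent drops its copy of the read end
        prev_stdout = p.stdout
        procs.append(p)
    for p in procs:
        p.wait()
    return procs[-1].returncode

rc = run_pipeline([["echo", "hello"], ["grep", "hello"]], stdout=DEVNULL)
# rc == 0: grep found a match
```

Closing the parent's copy of each read end is what lets a downstream process see EOF when its upstream exits, exactly as in the two-stage example.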
`shell=True` here is... unfortunate. If your `substring_which_will_never_be_found` contained `$(rm -rf ~)` in it, or -- worse -- `$(rm -rf ~)'$(rm -rf ~)'`, you'd have a very bad day. (Relying on `shlex.split()` isn't good form either -- if you have a name with a space, you want to keep it as one name; populate an array or tuple by hand, and you don't need to worry about your content being munged.)

`subprocess.PIPE` is a high-level abstraction for connecting file descriptors; no, the data isn't copied into the Python process's namespace. Why you're seeing a difference here is a good question -- I'd need to dig in; wouldn't be surprised if it were related to buffering settings on the file descriptors.