
I am trying to read a huge gzipped csv file and process each line.

I tried two different implementations (both shown below).

It turns out that the usually recommended implementation is 100x slower than the alternative. Am I doing something wrong, or is the implementation of Popen().stdout really that bad? (It seems to read the file character by character.)

from time import time
from subprocess import Popen, PIPE

# We generate a csv file with 1M lines of 3D coordinates
from random import random
import os

N = 1000000
PATH = 'test'
GZIP_PATH = 'test.gz'

with open(PATH, 'w') as datafile:
    for i in xrange(N):
        datafile.write('{0}, {1}, {2}\n'.format(random(), random(), random()))

try:
    os.remove(GZIP_PATH)
except OSError:
    pass  # the gzipped file may not exist yet

Popen(['gzip', PATH]).wait()

# We want to process the file line by line

# We start with a textbook implementation

def simple_generator(file):
    line = file.readline()
    while line:
        yield line[:-1]  # strip the trailing newline
        line = file.readline()

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE).stdout as datafile:
    t = time()
    i = 0
    for line in simple_generator(datafile):
        i+=1 # process the line
    print time()-t
    print i

# Now we try a lower-level implementation

BLOCK_SIZE = 1<<16
def fast_generator(file):
    rem = ''
    block = file.read(BLOCK_SIZE)
    while block:
        lines = block.split('\n')
        lines[0] = rem+lines[0]
        for i in xrange(0,len(lines)-1):
            yield lines[i]
        rem = lines[-1]  # carry the partial last line over to the next block
        block = file.read(BLOCK_SIZE)

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE).stdout as datafile:
    t = time()
    i = 0
    for line in fast_generator(datafile):
        i+=1 # process the line
    print time()-t
    print i

# Output:
#
# 34.0195429325
# 1000000
# 0.232397794724
# 1000000
#
# The second implementation is 100x faster!
  • Textbook implementation? I've never seen file.readline() recommended for a case like this. The usual idiom is to use the file directly as an iterator, for line in file, which performs at basically the same speed as your low-level implementation (sketched after these comments).
    – Lukas Graf
    Commented Jul 11, 2014 at 20:21
  • You should probably use the Python gzip module instead of Popen: docs.python.org/3/library/gzip.html
    – Christian Thieme
    Commented Jul 11, 2014 at 20:22
  • @ngrislain Yes, it reads byte by byte because the default behaviour is unbuffered, as documented for Popen. Set a buffer size and it should be faster.
    – BlackJack
    Commented Jul 11, 2014 at 20:39
  • @ChristianThieme: No, the gzip module does not allow dealing with very large files, as it loads the whole file into memory.
    – ngrislain
    Commented Jul 11, 2014 at 22:30
  • @LukasGraf: for line in file does not work for PIPE files in Python 2.5: bugs.python.org/issue3907
    – ngrislain
    Commented Jul 11, 2014 at 22:42
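
For reference, a minimal sketch of the iterator idiom from the first comment, assuming the buffered pipe from the answer below (unlike simple_generator, each line here keeps its trailing newline):

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE, bufsize=-1).stdout as datafile:
    t = time()
    i = 0
    for line in datafile:  # the file object is its own line iterator
        i += 1             # process the line (trailing '\n' still attached)
    print time() - t
    print i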

1 Answer


The proper implementation is to call Popen with bufsize=-1:

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE, bufsize=-1).stdout as datafile:
    t = time()
    i = 0
    for line in simple_generator(datafile):
        i+=1 # process the line
    print time()-t
    print i

I am a bit surprised that the default behaviour is bufsize=0 (unbuffered) though.
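
An explicit buffer size works as well. A minimal variant, assuming any sufficiently large positive bufsize behaves like -1 here, using the same 64 KiB block size as fast_generator:

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE, bufsize=1<<16).stdout as datafile:
    for line in simple_generator(datafile):
        pass  # process the line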

  • This definitely fixed the performance gap but, interestingly, fast_generator is still 30-40% faster than simple_generator when I run these tests.
    – beetea
    Commented Jul 11, 2014 at 23:35
