
I am trying to read a huge gzipped csv file and process each line.

I tried two different implementations (both shown below).

It turns out that the usually recommended implementation is 100x slower than the alternative. Am I doing something wrong, or is the implementation of Popen().stdout really that bad? (It seems to read the file character by character.)

from time import time
from subprocess import Popen, PIPE

# We generate a csv file with 1M lines of 3D coordinates
from random import random
import os

N = 1000000
PATH = 'test'
GZIP_PATH = 'test.gz'

with open(PATH, 'w') as datafile:
    for i in xrange(N):
        datafile.write('{0}, {1}, {2}\n'.format(random(), random(), random()))

try:
    os.remove(GZIP_PATH)
except OSError:
    pass  # the gzipped file may not exist yet

Popen(['gzip', PATH]).wait()

# We want to process the file line by line

# We start with a textbook implementation

def simple_generator(file):
    line = file.readline()
    while line:
        yield line[:-1]  # strip the trailing newline
        line = file.readline()

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE).stdout as datafile:
    t = time()
    i = 0
    for line in simple_generator(datafile):
        i+=1 # process the line
    print time()-t
    print i

# Now we try a lower-level implementation

BLOCK_SIZE = 1<<16
def fast_generator(file):
    rem = ''
    block = file.read(BLOCK_SIZE)
    while block:
        lines = block.split('\n')
        lines[0] = rem+lines[0]
        for i in xrange(0,len(lines)-1):
            yield lines[i]
        rem = lines[-1]  # carry the partial last line over to the next block
        block = file.read(BLOCK_SIZE)

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE).stdout as datafile:
    t = time()
    i = 0
    for line in fast_generator(datafile):
        i+=1 # process the line
    print time()-t
    print i

# Output:
#
# 34.0195429325
# 1000000
# 0.232397794724
# 1000000
#
# The second implementation is 100x faster!
  • Textbook implementation? I've never seen file.readline() recommended for a case like this. The usual idiom is to use the file directly as an iterator, for line in file, which performs at basically the same speed as your low-level implementation (sketched after these comments).
    – Lukas Graf
    Commented Jul 11, 2014 at 20:21
  • You should probably use the Python gzip module instead of Popen: docs.python.org/3/library/gzip.html
    – Christian Thieme
    Commented Jul 11, 2014 at 20:22
  • @ngrislain Yes, it reads byte by byte because the default behaviour is unbuffered, as documented for Popen. Set a buffer size and it should be faster.
    – BlackJack
    Commented Jul 11, 2014 at 20:39
  • @ChristianThieme: No, the gzip module does not allow dealing with very large files, as it loads the whole file into memory.
    – ngrislain
    Commented Jul 11, 2014 at 22:30
  • @LukasGraf: for line in file does not work for PIPE files in Python 2.5: bugs.python.org/issue3907
    – ngrislain
    Commented Jul 11, 2014 at 22:42
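
For reference, a minimal sketch of the iterator idiom from the first comment, assuming the buffered pipe from the answer below (unlike simple_generator, each line here keeps its trailing newline):

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE, bufsize=-1).stdout as datafile:
    t = time()
    i = 0
    for line in datafile:  # the file object is its own line iterator
        i += 1             # process the line (trailing '\n' still attached)
    print time() - t
    print i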

1 Answer


The proper implementation is to call Popen with bufsize=-1:

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE, bufsize=-1).stdout as datafile:
    t = time()
    i = 0
    for line in simple_generator(datafile):
        i+=1 # process the line
    print time()-t
    print i

I am a bit surprised that the default behaviour is bufsize=0 (unbuffered) though.
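
An explicit buffer size works as well. A minimal variant, assuming any sufficiently large positive bufsize behaves like -1 here, using the same 64 KiB block size as fast_generator:

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE, bufsize=1<<16).stdout as datafile:
    for line in simple_generator(datafile):
        pass  # process the line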

  • This definitely fixed the performance gap but, interestingly, fast_generator is still 30-40% faster than simple_generator when I run these tests.
    – beetea
    Commented Jul 11, 2014 at 23:35
