I am trying to read a huge gzipped CSV file and process each line.
I tried two different implementations.
It turns out that the usually recommended implementation is 100x slower than the alternative. Am I doing something wrong, or is the implementation of Popen().stdout
really bad? (It seems to read the file character by character.)
from time import time
from subprocess import Popen, PIPE

# We generate a csv file with 1M lines of 3D coordinates
from random import random
import os

N = 1000000
PATH = 'test'
GZIP_PATH = 'test.gz'

with open(PATH, 'w') as datafile:
    for i in xrange(N):
        datafile.write('{0}, {1}, {2}\n'.format(random(), random(), random()))

try:
    os.remove(GZIP_PATH)
except OSError:
    pass  # no stale archive to remove

Popen(['gzip', PATH]).wait()
# We want to process the file line by line

# We start with a textbook implementation
def simple_generator(file):
    line = file.readline()
    while line:
        yield line[:-1]
        line = file.readline()

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE).stdout as datafile:
    t = time()
    i = 0
    for line in simple_generator(datafile):
        i += 1  # process the line
    print time() - t
    print i
# We continue with a lower-level implementation
BLOCK_SIZE = 1 << 16

def fast_generator(file):
    rem = ''
    block = file.read(BLOCK_SIZE)
    while block:
        lines = block.split('\n')
        lines[0] = rem + lines[0]
        # every piece except the last is a complete line
        for line in lines[:-1]:
            yield line
        # the last piece may be cut off; carry it over to the next block
        rem = lines[-1]
        block = file.read(BLOCK_SIZE)
    if rem:
        yield rem  # final line when the file does not end with a newline

with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE).stdout as datafile:
    t = time()
    i = 0
    for line in fast_generator(datafile):
        i += 1  # process the line
    print time() - t
    print i
# Output:
#
# 34.0195429325
# 1000000
# 0.232397794724
# 1000000
#
# The second implementation is 100x faster!
`file.readline()` is not recommended for a case like this. The usual idiom is to use the file directly as an iterator, `for line in file`, which performs at basically the same speed as your low-level implementation.

`Popen`: set a buffer size and it should be faster.

The `gzip` module does not allow dealing with very large files, as it loads the whole file in memory.

`for line in file` does not work for PIPE files in Python 2.5: bugs.python.org/issue3907
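
The buffer-size suggestion above could look roughly like the sketch below. It assumes the slowdown comes from the pipe being unbuffered: Python 2's `Popen` defaults to `bufsize=0`, whereas Python 3 buffers by default. The names `iter_lines` and `BUFFER_SIZE` are hypothetical and not from the question. The code is written to run on both Python 2 and Python 3, so on Python 3 the lines come back as byte strings.

```python
import sys
from subprocess import Popen, PIPE

# Assumption: 64 KiB, the same size as BLOCK_SIZE in the question.
BUFFER_SIZE = 1 << 16


def iter_lines(command, bufsize=BUFFER_SIZE):
    """Yield the stdout lines of `command` without their line terminator."""
    # bufsize > 0 asks for a buffered pipe, so each line no longer needs
    # its own tiny read from the child process.
    proc = Popen(command, stdout=PIPE, bufsize=bufsize)
    try:
        # Iterate over the pipe directly instead of calling readline() by hand.
        for line in proc.stdout:
            yield line.rstrip(b'\r\n')
    finally:
        # Close the pipe and reap the child even if the caller stops early.
        proc.stdout.close()
        proc.wait()


if __name__ == '__main__':
    # Stand-in child process so the sketch runs without test.gz.
    demo = [sys.executable, '-c', 'for i in range(5): print(i)']
    count = sum(1 for _ in iter_lines(demo))
    print(count)
```

For the data in the question, the call would be `iter_lines(['gunzip', '-c', 'test.gz'])`. If iterating over the pipe is a problem on an old interpreter (see the issue linked above), `iter(proc.stdout.readline, b'')` is a drop-in alternative for the loop and still benefits from the buffer.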