
This is related to In Python, is read() or readlines() faster?, but it is not exactly the same question. I have a small file that I need to read many, many times. I found that reading it with readlines() and joining the lines is faster than reading it with read(). I could not find a good explanation for this, and it puzzles me.

In [34]: cat test.txt
ATOM      1  N   MET A   1      -1.112 -18.674 -30.756  1.00 16.53           N  
ATOM      2  CA  MET A   1       0.327 -18.325 -30.772  1.00 16.53           C  
ATOM      3  C   MET A   1       0.513 -16.897 -31.160  1.00 16.53           C  
ATOM      4  O   MET A   1      -0.063 -15.998 -30.552  1.00 16.53           O  
ATOM      5  CB  MET A   1       1.083 -19.211 -31.777  1.00 16.53           C  
ATOM      6  CG  MET A   1       1.101 -20.691 -31.391  1.00 16.53           C  
ATOM      7  SD  MET A   1       1.989 -21.764 -32.559  1.00 16.53           S  
ATOM      8  CE  MET A   1       3.635 -21.109 -32.159  1.00 16.53           C  
ATOM      9  N   LYS A   2       1.333 -16.657 -32.199  1.00146.35           N  
ATOM     10  CA  LYS A   2       1.595 -15.313 -32.613  1.00146.35           C  

In [35]: timeit open("test.txt").read()
10000 loops, best of 3: 58.7 µs per loop

In [36]: timeit "\n".join(open("test.txt").readlines())
10000 loops, best of 3: 56.4 µs per loop

The result is pretty consistent.
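Note that the two calls being timed are not byte-for-byte equivalent: readlines() keeps each line's trailing newline, and join() inserts another one between the lines, so the joined string comes out longer. A quick check:

with open("test.txt") as f:
    via_read = f.read()
with open("test.txt") as f:
    via_join = "\n".join(f.readlines())

print(via_read == via_join)          # False
print(len(via_read), len(via_join))  # via_join gains one extra "\n" between
                                     # every pair of lines, so it is 9 bytes
                                     # longer for this 10-line file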

  • The difference of 2.3 µs is not relevant.
    – user9455968
    Commented Jul 11, 2018 at 8:41
  • Why don't you read this small file only once and keep it in memory?
    – Jongware
    Commented Jul 11, 2018 at 8:43
  • It is a status file (the file in this example is not the real one). It has to be read from disk every time because other processes could modify it (one caching approach is sketched just after these comments).
    – guma44
    Commented Jul 11, 2018 at 9:09
  • @LutzHorn It might not be relevant if you do it once, but if you do it millions of times it adds up. For me it is just counterintuitive. We wanted to change the code to just use read(), but we thought: let's measure it :D.
    – guma44
    Commented Jul 11, 2018 at 9:12
  • @guma44 You already do it 10,000 times using timeit. Do you plan to read such a file millions of times?
    – user9455968
    Commented Jul 11, 2018 at 9:13
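If re-reading is only needed when the file actually changes, one possible pattern is to cache the contents and re-read only when os.stat reports a new modification time. This is a sketch of my own, not from the thread; read_status and the _cached dict are made-up names:

import os

_cached = {"mtime": None, "text": None}  # hypothetical module-level cache

def read_status(path):
    # Re-read only when another process has modified the file;
    # os.stat() is much cheaper than opening and reading it again.
    # Caveat: st_mtime granularity varies by filesystem, so very
    # rapid successive writes may be missed.
    mtime = os.stat(path).st_mtime
    if mtime != _cached["mtime"]:
        with open(path) as f:
            _cached["text"] = f.read()
        _cached["mtime"] = mtime
    return _cached["text"]

print(read_status("test.txt"))  # first call reads from disk; later calls
                                # return the cached string until the file changes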

1 Answer


For a file that small, it doesn't make a difference.

For a larger file...

import timeit

data = '''
ATOM      1  N   MET A   1      -1.112 -18.674 -30.756  1.00 16.53           N  
ATOM      2  CA  MET A   1       0.327 -18.325 -30.772  1.00 16.53           C  
ATOM      3  C   MET A   1       0.513 -16.897 -31.160  1.00 16.53           C  
ATOM      4  O   MET A   1      -0.063 -15.998 -30.552  1.00 16.53           O  
ATOM      5  CB  MET A   1       1.083 -19.211 -31.777  1.00 16.53           C  
ATOM      6  CG  MET A   1       1.101 -20.691 -31.391  1.00 16.53           C  
ATOM      7  SD  MET A   1       1.989 -21.764 -32.559  1.00 16.53           S  
ATOM      8  CE  MET A   1       3.635 -21.109 -32.159  1.00 16.53           C  
ATOM      9  N   LYS A   2       1.333 -16.657 -32.199  1.00146.35           N  
ATOM     10  CA  LYS A   2       1.595 -15.313 -32.613  1.00146.35           C  
'''.lstrip()

names_and_sizes = []

# Write test files of roughly doubling size by repeating the sample data.
for x in range(1, 10):
    reps = 1 + 2 ** (x + 2)
    with open('test_{}.txt'.format(x), 'w') as outf:
        for _ in range(reps):  # don't shadow the outer loop variable
            outf.write(data)
        names_and_sizes.append((outf.name, outf.tell()))

# Time 1000 repetitions of each strategy against each file.
for filename, size in names_and_sizes:
    a = timeit.timeit(lambda: open(filename).read(), number=1000)
    b = timeit.timeit(lambda: "\n".join(open(filename).readlines()), number=1000)
    print(filename, size, a, b)

the output is

test_1.txt 7290 0.07285173307172954 0.09389211190864444
test_2.txt 13770 0.08125667599961162 0.1290126950480044
test_3.txt 26730 0.08221574104391038 0.17529957089573145
test_4.txt 52650 0.0865904720267281 0.2977212209952995
test_5.txt 104490 0.1046126070432365 0.5687746809562668
test_6.txt 208170 0.1773586180061102 1.1868972890079021
test_7.txt 415530 0.26339677802752703 2.0290830068988726
test_8.txt 830250 0.31897587003186345 4.381448873900808
test_9.txt 1659690 0.6923789769643918 9.483053435920738

or more intuitively

[chart: time per 1000 reads vs. file size, read() vs. readlines()+join]

[the same chart with both axes logarithmic]
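A chart like this can be rebuilt from the numbers above; the following sketch assumes matplotlib, which the answer itself does not mention, and uses rounded values from the output table:

import matplotlib.pyplot as plt

# (file size in bytes, read() time, readlines()+join time), rounded.
rows = [(7290, 0.0729, 0.0939), (13770, 0.0813, 0.1290),
        (26730, 0.0822, 0.1753), (52650, 0.0866, 0.2977),
        (104490, 0.1046, 0.5688), (208170, 0.1774, 1.1869),
        (415530, 0.2634, 2.0291), (830250, 0.3190, 4.3814),
        (1659690, 0.6924, 9.4831)]
sizes, reads, joins = zip(*rows)

plt.plot(sizes, reads, marker="o", label="read()")
plt.plot(sizes, joins, marker="o", label='"\\n".join(readlines())')
plt.xscale("log"); plt.yscale("log")   # log-log makes the scaling gap obvious
plt.xlabel("file size (bytes)"); plt.ylabel("time for 1000 reads (s)")
plt.legend()
plt.show()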

  • Thanks, that's intuitive, but it does not explain why read() is slower at small file sizes.
    – guma44
    Commented Jul 11, 2018 at 9:10
  • If you look at the numbers in my example, split-and-join is actually always slower.
    – AKX
    Commented Jul 11, 2018 at 9:50
  • (Only on Python 2, and with the original 810-byte file, is split-and-join 3% faster.)
    – AKX
    Commented Jul 11, 2018 at 9:53
  • And that's my setup.
    – guma44
    Commented Jul 11, 2018 at 9:58
  • You can compare the implementations of file.read() and file.readlines() over at GitHub: github.com/python/cpython/blob/2.7/Objects/… and github.com/python/cpython/blob/2.7/Objects/… respectively. I imagine it might have to do with the fact that .read() has to dynamically (re)allocate the buffer it is reading into, as it does not know the size of what is to be read beforehand (a rough sketch of this difference follows these comments).
    – AKX
    Commented Jul 11, 2018 at 10:02
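To make that last point concrete, here is a rough illustration of my own, not from the comments, of the intermediate objects each approach creates:

import sys

with open("test.txt") as f:
    lines = f.readlines()
# readlines() builds one str object per line plus a list to hold them,
# and "\n".join(lines) then copies every byte again into the result.
intermediate = sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)

with open("test.txt") as f:
    text = f.read()
# read() produces a single str, but may have to reallocate its internal
# buffer while reading, since the final size is not known up front.
print(intermediate, "bytes of intermediate objects vs",
      sys.getsizeof(text), "bytes for one string")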
