
This is related to In Python, is read() or readlines() faster?, but it is not exactly the same question. I have a small file that I need to read many, many times. I found that reading it with readlines() and joining the lines is faster than reading it with read(). I could not find a good explanation for this, and it puzzles me.

In [34]: cat test.txt
ATOM      1  N   MET A   1      -1.112 -18.674 -30.756  1.00 16.53           N  
ATOM      2  CA  MET A   1       0.327 -18.325 -30.772  1.00 16.53           C  
ATOM      3  C   MET A   1       0.513 -16.897 -31.160  1.00 16.53           C  
ATOM      4  O   MET A   1      -0.063 -15.998 -30.552  1.00 16.53           O  
ATOM      5  CB  MET A   1       1.083 -19.211 -31.777  1.00 16.53           C  
ATOM      6  CG  MET A   1       1.101 -20.691 -31.391  1.00 16.53           C  
ATOM      7  SD  MET A   1       1.989 -21.764 -32.559  1.00 16.53           S  
ATOM      8  CE  MET A   1       3.635 -21.109 -32.159  1.00 16.53           C  
ATOM      9  N   LYS A   2       1.333 -16.657 -32.199  1.00146.35           N  
ATOM     10  CA  LYS A   2       1.595 -15.313 -32.613  1.00146.35           C  

In [35]: timeit open("test.txt").read()
10000 loops, best of 3: 58.7 µs per loop

In [36]: timeit "\n".join(open("test.txt").readlines())
10000 loops, best of 3: 56.4 µs per loop

The result is pretty consistent.
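Note that the two calls being timed are not byte-for-byte equivalent: readlines() keeps each line's trailing newline, and join() inserts another one between the lines, so the joined string comes out longer. A quick check:

with open("test.txt") as f:
    via_read = f.read()
with open("test.txt") as f:
    via_join = "\n".join(f.readlines())

print(via_read == via_join)          # False
print(len(via_read), len(via_join))  # via_join gains one extra "\n" between
                                     # every pair of lines, so it is 9 bytes
                                     # longer for this 10-line file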

  • The difference of 2.3 µs is not relevant.
    – user9455968
    Commented Jul 11, 2018 at 8:41
  • Why don't you read this small file only once and keep it in memory?
    – Jongware
    Commented Jul 11, 2018 at 8:43
  • It is a status file (the file in this example is not the real one). It has to be read from disk every time because other processes could modify it (one caching approach is sketched just after these comments).
    – guma44
    Commented Jul 11, 2018 at 9:09
  • @LutzHorn It might not be relevant if you do it once, but if you do it millions of times it adds up. For me it is just counterintuitive. We wanted to change the code to just use read(), but we thought: let's measure it :D.
    – guma44
    Commented Jul 11, 2018 at 9:12
  • @guma44 You already do it 10,000 times using timeit. Do you plan to read such a file millions of times?
    – user9455968
    Commented Jul 11, 2018 at 9:13
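If re-reading is only needed when the file actually changes, one possible pattern is to cache the contents and re-read only when os.stat reports a new modification time. This is a sketch of my own, not from the thread; read_status and the _cached dict are made-up names:

import os

_cached = {"mtime": None, "text": None}  # hypothetical module-level cache

def read_status(path):
    # Re-read only when another process has modified the file;
    # os.stat() is much cheaper than opening and reading it again.
    # Caveat: st_mtime granularity varies by filesystem, so very
    # rapid successive writes may be missed.
    mtime = os.stat(path).st_mtime
    if mtime != _cached["mtime"]:
        with open(path) as f:
            _cached["text"] = f.read()
        _cached["mtime"] = mtime
    return _cached["text"]

print(read_status("test.txt"))  # first call reads from disk; later calls
                                # return the cached string until the file changes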

1 Answer


For a file that small, it doesn't make a difference.

For a larger file...

import timeit

data = '''
ATOM      1  N   MET A   1      -1.112 -18.674 -30.756  1.00 16.53           N  
ATOM      2  CA  MET A   1       0.327 -18.325 -30.772  1.00 16.53           C  
ATOM      3  C   MET A   1       0.513 -16.897 -31.160  1.00 16.53           C  
ATOM      4  O   MET A   1      -0.063 -15.998 -30.552  1.00 16.53           O  
ATOM      5  CB  MET A   1       1.083 -19.211 -31.777  1.00 16.53           C  
ATOM      6  CG  MET A   1       1.101 -20.691 -31.391  1.00 16.53           C  
ATOM      7  SD  MET A   1       1.989 -21.764 -32.559  1.00 16.53           S  
ATOM      8  CE  MET A   1       3.635 -21.109 -32.159  1.00 16.53           C  
ATOM      9  N   LYS A   2       1.333 -16.657 -32.199  1.00146.35           N  
ATOM     10  CA  LYS A   2       1.595 -15.313 -32.613  1.00146.35           C  
'''.lstrip()

names_and_sizes = []

# Write test files of roughly doubling size by repeating the sample data.
for x in range(1, 10):
    reps = 1 + 2 ** (x + 2)
    with open('test_{}.txt'.format(x), 'w') as outf:
        for _ in range(reps):  # don't shadow the outer loop variable
            outf.write(data)
        names_and_sizes.append((outf.name, outf.tell()))

# Time 1000 repetitions of each strategy against each file.
for filename, size in names_and_sizes:
    a = timeit.timeit(lambda: open(filename).read(), number=1000)
    b = timeit.timeit(lambda: "\n".join(open(filename).readlines()), number=1000)
    print(filename, size, a, b)

the output is

test_1.txt 7290 0.07285173307172954 0.09389211190864444
test_2.txt 13770 0.08125667599961162 0.1290126950480044
test_3.txt 26730 0.08221574104391038 0.17529957089573145
test_4.txt 52650 0.0865904720267281 0.2977212209952995
test_5.txt 104490 0.1046126070432365 0.5687746809562668
test_6.txt 208170 0.1773586180061102 1.1868972890079021
test_7.txt 415530 0.26339677802752703 2.0290830068988726
test_8.txt 830250 0.31897587003186345 4.381448873900808
test_9.txt 1659690 0.6923789769643918 9.483053435920738

or more intuitively

[chart: time per 1000 reads vs. file size, read() vs. readlines()+join]

[the same chart with both axes logarithmic]
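A chart like this can be rebuilt from the numbers above; the following sketch assumes matplotlib, which the answer itself does not mention, and uses rounded values from the output table:

import matplotlib.pyplot as plt

# (file size in bytes, read() time, readlines()+join time), rounded.
rows = [(7290, 0.0729, 0.0939), (13770, 0.0813, 0.1290),
        (26730, 0.0822, 0.1753), (52650, 0.0866, 0.2977),
        (104490, 0.1046, 0.5688), (208170, 0.1774, 1.1869),
        (415530, 0.2634, 2.0291), (830250, 0.3190, 4.3814),
        (1659690, 0.6924, 9.4831)]
sizes, reads, joins = zip(*rows)

plt.plot(sizes, reads, marker="o", label="read()")
plt.plot(sizes, joins, marker="o", label='"\\n".join(readlines())')
plt.xscale("log"); plt.yscale("log")   # log-log makes the scaling gap obvious
plt.xlabel("file size (bytes)"); plt.ylabel("time for 1000 reads (s)")
plt.legend()
plt.show()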

  • Thanks, that's intuitive, but it does not explain why read() is slower at small file sizes.
    – guma44
    Commented Jul 11, 2018 at 9:10
  • If you look at the numbers in my example, split-and-join is actually always slower.
    – AKX
    Commented Jul 11, 2018 at 9:50
  • (Only on Python 2, and with the original 810-byte file, is split-and-join 3% faster.)
    – AKX
    Commented Jul 11, 2018 at 9:53
  • And that's my setup.
    – guma44
    Commented Jul 11, 2018 at 9:58
  • You can compare the implementations of file.read() and file.readlines() over at GitHub: github.com/python/cpython/blob/2.7/Objects/… and github.com/python/cpython/blob/2.7/Objects/… respectively. I imagine it might have to do with the fact that .read() has to dynamically (re)allocate the buffer it is reading into, as it does not know the size of what is to be read beforehand (a rough sketch of this difference follows these comments).
    – AKX
    Commented Jul 11, 2018 at 10:02
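To make that last point concrete, here is a rough illustration of my own, not from the comments, of the intermediate objects each approach creates:

import sys

with open("test.txt") as f:
    lines = f.readlines()
# readlines() builds one str object per line plus a list to hold them,
# and "\n".join(lines) then copies every byte again into the result.
intermediate = sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)

with open("test.txt") as f:
    text = f.read()
# read() produces a single str, but may have to reallocate its internal
# buffer while reading, since the final size is not known up front.
print(intermediate, "bytes of intermediate objects vs",
      sys.getsizeof(text), "bytes for one string")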
