
If RAM isn't a concern (I have close to 200GB on the server), is reading line by line faster, or is reading everything into RAM and accessing it from there faster? Each line will be a string of around 200-500 Unicode characters. There are close to 2 million lines per file.

Line-by-line

import codecs
for i in codecs.open('unicodefile','r','utf8'):
  print i

Reading into RAM

import codecs
for i in codecs.open('unicodefile','r','utf8').readlines():
  print i
  • import timeit; timeit.timeit('''for i in codecs.open('unicodefile','r','utf8'): print i''', 'import codecs'), then do the same for the second case.
    – kojiro
  • If RAM isn't a concern (you know that you can fit the contents into RAM), then put all the content in RAM. RAM is an order of magnitude faster to read from than a spinning disk. Memory hierarchies are a basic principle of system architecture. Take advantage of them.
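
A minimal timing harness along the lines of the timeit suggestion above might look like this (a sketch only; the filename is the one from the question, and number=3 is arbitrary):

import timeit

setup = "import codecs"

line_by_line = timeit.timeit(
    "for i in codecs.open('unicodefile', 'r', 'utf8'): pass",
    setup, number=3)

all_at_once = timeit.timeit(
    "for i in codecs.open('unicodefile', 'r', 'utf8').readlines(): pass",
    setup, number=3)

print 'line by line: %.3fs   readlines(): %.3fs' % (line_by_line, all_at_once)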

4 Answers


I used cProfile on a ~1MB dictionary words file. I read the same file three times. The first read pulls in the whole file just to level the playing field in terms of it being stored in the cache. Here is the simple code:

import codecs
import cProfile

file = 'words'  # placeholder path to the ~1MB dictionary words file

def first_read():
    # Warm-up pass so the file is already cached for both timed runs
    codecs.open(file, 'r', 'utf8').readlines()

def line_by_line():
    for i in codecs.open(file, 'r', 'utf8'):
        pass

def at_once():
    for i in codecs.open(file, 'r', 'utf8').readlines():
        pass

first_read()
cProfile.run('line_by_line()')
cProfile.run('at_once()')

And here are the results:

Line by line:

         366959 function calls in 1.762 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.762    1.762 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:322(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:395(__init__)
    14093    0.087    0.000    0.131    0.000 codecs.py:424(read)
    57448    0.285    0.000    0.566    0.000 codecs.py:503(readline)
    57448    0.444    0.000    1.010    0.000 codecs.py:612(next)
        1    0.000    0.000    0.000    0.000 codecs.py:651(__init__)
    57448    0.381    0.000    1.390    0.000 codecs.py:681(next)
        1    0.000    0.000    0.000    0.000 codecs.py:686(__iter__)
        1    0.000    0.000    0.000    0.000 codecs.py:841(open)
        1    0.372    0.372    1.762    1.762 test.py:9(line_by_line)
    13316    0.011    0.000    0.023    0.000 utf_8.py:15(decode)
        1    0.000    0.000    0.000    0.000 {_codecs.lookup}
    27385    0.027    0.000    0.027    0.000 {_codecs.utf_8_decode}
    98895    0.011    0.000    0.011    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    13316    0.099    0.000    0.122    0.000 {method 'endswith' of 'unicode' objects}
       27    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
    14069    0.027    0.000    0.027    0.000 {method 'read' of 'file' objects}
    13504    0.020    0.000    0.020    0.000 {method 'splitlines' of 'unicode' objects}
        1    0.000    0.000    0.000    0.000 {open}

All at once:

         15 function calls in 0.023 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.023    0.023 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:322(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:395(__init__)
        1    0.000    0.000    0.003    0.003 codecs.py:424(read)
        1    0.000    0.000    0.014    0.014 codecs.py:576(readlines)
        1    0.000    0.000    0.000    0.000 codecs.py:651(__init__)
        1    0.000    0.000    0.014    0.014 codecs.py:677(readlines)
        1    0.000    0.000    0.000    0.000 codecs.py:841(open)
        1    0.009    0.009    0.023    0.023 test.py:13(at_once)
        1    0.000    0.000    0.000    0.000 {_codecs.lookup}
        1    0.003    0.003    0.003    0.003 {_codecs.utf_8_decode}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.001    0.001    0.001    0.001 {method 'read' of 'file' objects}
        1    0.010    0.010    0.010    0.010 {method 'splitlines' of 'unicode' objects}
        1    0.000    0.000    0.000    0.000 {open}

As you can see from the results, reading the whole file in at once is much faster, but you run the risk of a MemoryError being thrown if the file is too large.

  • Do some reading about mmap. Usually a good idea, even if memory was a constraint.
    – Michal M
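
For reference, a minimal sketch of the mmap approach mentioned above (the manual utf-8 decode is an assumption on my part, since mmap hands back raw bytes):

import mmap

f = open('unicodefile', 'rb')
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
try:
    # The OS pages the file in lazily; nothing is copied into Python
    # objects until a line is actually read.
    for line in iter(mm.readline, ''):
        text = line.decode('utf8')  # mmap yields raw bytes, so decode manually
        # ... process text here ...
finally:
    mm.close()
    f.close()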

Nothing stops you from testing this on your machine. I created a file with 1M lines, and the results, timed as

time python something.py > /dev/null

were:

Line-by-Line:

real    0m4.878s
user    0m4.860s
sys     0m0.008s

Reading into RAM:

real    0m0.981s
user    0m0.828s
sys     0m0.148s

I got a MemoryError when trying with 2M lines of 300 characters each, but the results above suggest that reading into RAM would be faster.
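
For anyone reproducing this, a test file of roughly that shape can be generated with a few lines like the following (a sketch only; the 300-character line length is an assumption taken from the question, not the exact file used here):

import codecs

# Write 1M lines of ~300 unicode characters each
out = codecs.open('unicodefile', 'w', 'utf8')
for n in xrange(1000000):
    out.write(u'\u00e4' * 300 + u'\n')
out.close()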


Looking at the example code the OP posted, I think there is a misunderstanding of what Python is doing.

I.e.:

"Reading in line by line"

import codecs
for i in codecs.open('unicodefile','r','utf8'):
  print i

The above looks like it is reading the file in line by line. However, Python interprets it as "read as much of the file into memory, then process it as lines". So in effect, the above for loop reads everything into memory.

"Reading into RAM"

import codecs
for i in codecs.open('unicodefile','r','utf8').readlines():
  print i

I believe that the above is practically the same as the "line by line" example above. I.e., Python is reading it all into memory.

If you wanted to test line-by-line performance, you would need readline(), not readlines() or the bare for loop (which may imply readlines()). This is noted elsewhere on Stack Overflow.
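
A genuinely incremental read with readline() would look something like this (a sketch only):

import codecs

f = codecs.open('unicodefile', 'r', 'utf8')
while True:
    line = f.readline()
    if not line:   # readline() returns an empty string at end of file
        break
    pass           # process one line here; only this line is held in memory
f.close()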

Another aspect to consider is filesystem buffering. If you run the same bit of code against the same file, you run the risk of the filesystem cache polluting the results. As you say, you have 200GB of RAM, which is more than enough to cache the whole file and affect the run results.

You would need to do the following to ensure clean test results:

1. Copy the large file from a known source to a new filename (the filesystem must not be a copy-on-write filesystem).
2. Flush the filesystem cache.
3. Run the first test against the file.
4. Delete the file.
5. Re-copy the file from the source to another new filename.
6. Flush the filesystem cache.
7. Run the second test against the new file.

That will give you a more accurate test of file load times.
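
On Linux, the cache-flush steps (2 and 6) can be done by writing to /proc/sys/vm/drop_caches as root; a rough sketch of that, as one possible approach:

import os

def drop_filesystem_caches():
    # Linux-only, requires root: flush dirty pages to disk, then ask the
    # kernel to drop the page cache, dentries and inodes so the next read
    # actually hits the disk.
    os.system('sync')
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3\n')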

If you want to load the whole file into memory at once, wouldn't filehandle.read(bytes_to_read) potentially provide a faster means of block-reading the file contents?
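
A sketch of that kind of block reading, with a carry-over for the partial line at the end of each chunk (the chunk size is arbitrary):

import codecs

f = codecs.open('unicodefile', 'r', 'utf8')
leftover = u''
while True:
    chunk = f.read(1024 * 1024)   # decode roughly 1MB of the file per call
    if not chunk:
        break
    lines = (leftover + chunk).split(u'\n')
    leftover = lines.pop()        # the last piece may be an incomplete line
    for line in lines:
        pass                      # process complete lines here
if leftover:
    pass                          # process the final line (no trailing newline)
f.close()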

In either case, for reference:

http://docs.python.org/2/tutorial/inputoutput.html


It's better to build your program using streaming processing (line by line); that way you can process big volumes of data. In general, it's better to implement reading that pulls in, say, 100 lines at a time, processes them, then loads the next 100. At a low level you are just using a big buffer and reading the original file in big chunks. If you load everything into memory, you could get a memory error, as @oseiskar wrote.
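
A sketch of that batched approach, using itertools.islice to pull 100 lines at a time (process_batch is a hypothetical placeholder for the real work):

import codecs
from itertools import islice

def process_batch(lines):
    pass   # placeholder for the real per-batch work

f = codecs.open('unicodefile', 'r', 'utf8')
while True:
    batch = list(islice(f, 100))   # pull at most 100 lines from the stream
    if not batch:                  # an empty batch means end of file
        break
    process_batch(batch)
f.close()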
