
(This isn't my program, but I'll try to provide all the relevant information to the best of my knowledge.)

There is a program that reads binary files roughly 300 MB in size, processes them, and outputs some information. The program uses ifstream for file input, and the streams are correctly initialized and closed for each read.
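For reference, here is a minimal sketch of the kind of per-pass read described above (an assumption on my part; the actual code isn't available):

    #include <fstream>
    #include <string>
    #include <vector>

    // Read a whole binary file into memory; the stream is opened and
    // closed on every call, as described above.
    std::vector<char> read_whole_file(const std::string &path)
    {
        std::ifstream in(path, std::ios::binary | std::ios::ate); // open at end
        if (!in)
            return {};
        std::vector<char> buf(static_cast<std::size_t>(in.tellg()));
        in.seekg(0);                                              // rewind
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        return buf;  // the ifstream is closed when it goes out of scope
    }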

The program has to read each file multiple times. Reading a file for the first time takes about 3 seconds, and each subsequent read takes about 0.1 seconds. If several files are processed, going back to the first file still yields fast read speeds, but after enough other files have been read, re-reading an earlier file becomes slow again.

Additionally, if a file is copied to another location, the first read of the new copy takes roughly 0.1 seconds.

If you do the math, the speed of the initial read is roughly the advertised read speed of the hard drive (300 MB in 3 seconds is about 100 MB/s), while the 0.1-second re-reads are far faster than any hard drive can deliver.

All this looks like file locations are being cached by either the OS or the hard drive, so that consecutive reads don't have to seek out the file's location.

Does anyone know what exactly is causing the slowdown on the initial read, and whether it can be prevented? Three seconds may not seem like a lot, but they add about 5 hours to the total time needed to process every file.

Also, the program runs on Fedora 14 and Scientific Linux, with both OSes using their default file systems.

Any ideas would be appreciated.

  • 3 seconds to read a 300 MB file is about right for hitting the disk: that's 100 MB/s, which is at the high end of what you can expect from a modern, fast hard disk. 0.1 seconds to read a 300 MB file is not coming off a disk; that's coming out of a cache.
    – caf
    Commented Nov 20, 2011 at 10:42

4 Answers


Linux will try to copy the file into RAM to make the next read faster; I am guessing this is what is happening. The initial read actually comes off the disk, and subsequent reads come out of the file cache because the entire file has been copied into RAM.

  • I monitored the RAM while the program was iterating through the files. Considering the original amount of free RAM, the size of the files, and the number of files it iterated through before the old ones were "forgotten", this seems to be the correct answer.
    – Morglor
    Commented Nov 22, 2011 at 5:21
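
One way to verify this in code (a hypothetical helper, not from the thread): on Linux you can mmap the file and ask mincore(2) which of its pages are currently resident in the page cache.

    #include <fcntl.h>     // open
    #include <sys/mman.h>  // mmap, mincore, munmap
    #include <sys/stat.h>  // fstat
    #include <unistd.h>    // close, sysconf
    #include <vector>

    // Returns the fraction of the file's pages that are in the page cache,
    // or -1.0 on error.
    double resident_fraction(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1.0;

        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) {
            close(fd);
            return -1.0;
        }

        void *map = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  // the mapping stays valid after the fd is closed
        if (map == MAP_FAILED)
            return -1.0;

        long page = sysconf(_SC_PAGESIZE);
        size_t pages = (st.st_size + page - 1) / page;
        std::vector<unsigned char> vec(pages);

        size_t resident = 0;
        if (mincore(map, st.st_size, vec.data()) == 0)
            for (unsigned char v : vec)
                resident += v & 1;  // bit 0 set => page is resident

        munmap(map, st.st_size);
        return static_cast<double>(resident) / pages;
    }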

The OS (Linux) has a disk cache. After you read the file once, it's in the cache.


My guess would be that the first read of the file takes longer because its data is being loaded into the cache.

After the first time, reads just use the data already in the cache.


Yes, the data becomes cached. You might force that caching with the readahead syscall (or simply by having another process read the file beforehand). If you are using mmap, you could also use madvise.
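
A minimal sketch of both hints (illustrative only; prefetch_file is a made-up name and error handling is trimmed). posix_fadvise with POSIX_FADV_WILLNEED is the portable way to request the same behavior as readahead(2):

    #include <fcntl.h>     // open, posix_fadvise
    #include <sys/mman.h>  // mmap, madvise, munmap
    #include <sys/stat.h>  // fstat
    #include <unistd.h>    // close

    // Hint the kernel to start pulling the whole file into the page cache
    // before the real read happens.
    void prefetch_file(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return;

        struct stat st;
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            // Asynchronous readahead of the whole file into the page cache.
            posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

            // If the file is accessed through mmap, madvise gives the
            // equivalent hint on the mapping itself.
            void *map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (map != MAP_FAILED) {
                madvise(map, st.st_size, MADV_WILLNEED);
                munmap(map, st.st_size);  // unmapping doesn't evict cached pages
            }
        }
        close(fd);
    }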
