No, a file is not automatically read into memory by opening it. That would be awfully inefficient. `sed`, for example, reads its input line by line, as do many other Unix tools. It seldom has to keep more than the current line in memory.
With `awk` it's the same: it reads a record at a time, which by default is a line. If you store parts of the input data in variables, that will be extra, of course¹.
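As a minimal sketch (the file name and the task, summing the first column, are just assumptions for illustration), an `awk` program like this keeps only a running total in memory, no matter how large the input is:

```
# Holds one line plus a single number in memory at any time,
# regardless of the size of data.txt.
awk '{ sum += $1 } END { print sum }' data.txt
```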
Some people have a habit of doing things like

```
for line in $(cat file); do ...; done
```

Since the shell has to expand the `$(cat file)` command substitution completely before running even the first iteration of the `for` loop, this will read the whole of `file` into memory (into the memory used by the shell executing the `for` loop). This is a bit silly and also inelegant. Instead, one should do

```
while IFS= read -r line; do ...; done <file
```

This will process `file` line by line (but do read Understanding "IFS= read -r line").
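As a small, concrete sketch (the task of numbering lines is just an assumption for illustration), such a loop only ever holds the current line and a counter in memory:

```
# Number each line of "file" while holding only the current line in memory.
n=0
while IFS= read -r line; do
    n=$((n + 1))
    printf '%d\t%s\n' "$n" "$line"
done <file
```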
Processing files line by line in the shell is seldom needed, though, as most utilities are line-oriented anyway (see Why is using a shell loop to process text considered bad practice?).
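That said, a task like the line numbering above is better handled by a single utility that already streams its input:

```
# Either of these numbers the lines of "file" without a shell loop,
# still reading it line by line.
nl -ba file
awk '{ print NR "\t" $0 }' file
```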
I'm working in bioinformatics, and when processing huge amounts of genomic data I would not be able to do much unless I kept only the bits of the data that are absolutely necessary in memory. For example, when I need to strip the bits of data that could be used to identify individuals from a 1-terabyte dataset containing DNA variants in a VCF file (because that type of data can't be made public), I do line-by-line processing with a simple `awk` program (this is possible since the VCF format is line-oriented). I do not read the file into memory, process it there, and write it back out again! If the file was compressed, I would feed it through `zcat` or `gzip -d -c`, which, since `gzip` does stream processing of data, would also not read the whole file into memory.
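A rough sketch of what such a pipeline can look like (the file names, and the choice of keeping only the first eight fixed VCF columns, i.e. dropping the per-sample genotype columns, are assumptions for the sake of illustration):

```
# Stream a compressed VCF, keep only the first eight fixed columns
# (dropping FORMAT and all per-sample columns), and recompress the result,
# all without ever holding the whole file in memory.
zcat variants.vcf.gz |
awk 'BEGIN { FS = OFS = "\t" }
     /^##/ { print; next }                        # meta-information lines pass through
     { print $1, $2, $3, $4, $5, $6, $7, $8 }     # header and data lines: first 8 fields
' |
gzip -c >variants.stripped.vcf.gz
```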
Even with file formats that are not line-oriented, like JSON or XML, there are stream parsers that make it possible to process huge files without storing them entirely in RAM.
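For JSON, for example, `jq` has a streaming mode (a sketch, assuming jq 1.5 or later and a hypothetical file name):

```
# --stream makes jq emit [path, value] pairs as it parses, so the whole
# document never has to fit in memory; here we just peek at the first leaves.
zcat huge.json.gz | jq -c --stream 'select(length == 2)' | head
```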
With executables, it's slightly more complicated since shared libraries may be loaded on demand, and/or be shared between processes (see Loading of shared libraries and RAM usage, for example).
Caching is something I haven't mentioned so far. This is the practice of using RAM to hold frequently accessed pieces of data. Smaller files (for example executables) may be cached by the OS in the hope that the user will make many references to them. Apart from the first reading of the file, subsequent accesses will then be made to RAM rather than to disk. Caching, like buffering of input and output, is usually largely transparent to the user, and the amount of memory used for caching may change dynamically depending on the amount of RAM allocated by applications etc.
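On Linux you can watch this happening (assuming the procps `free` utility is available):

```
# The "buff/cache" column shows RAM currently used by the kernel for caches
# and buffers; it is given back automatically when applications need it.
free -h
```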
¹ Technically, most programs probably read a chunk of the input data at a time, either using explicit buffering or implicitly through the buffering that the standard I/O libraries do, and then present that chunk line by line to the user's code. It's much more efficient to read a multiple of the disk's block size than, e.g., a character at a time. This chunk size will seldom be larger than a handful of kilobytes though.
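If you are curious, you can observe those chunked reads on Linux (assuming `strace` is installed; the file name is hypothetical, and the buffer size varies between tools, libraries and versions):

```
# Each read() call fetches a buffer-sized chunk, not a single line or character.
strace -e trace=read wc -l somefile
```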