No, a file is not automatically read into memory by opening it. That would be awfully inefficient. `sed`, for example, reads its input line by line, as do many other Unix tools. It seldom has to keep more than the current line in memory.
With `awk` it's the same: it reads a record at a time, which by default is a line. If you store parts of the input data in variables, that will be extra, of course¹.
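As a minimal sketch (the file name and the task, summing the first column, are just assumptions for illustration), an `awk` program like this keeps only a running total in memory, no matter how large the input is:

```
# Holds one line plus a single number in memory at any time,
# regardless of the size of data.txt.
awk '{ sum += $1 } END { print sum }' data.txt
```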
Some people have a habit of doing things like

```
for line in $(cat file); do ...; done
```

Since the shell has to expand the `$(cat file)` command substitution completely before running even the first iteration of the `for` loop, this will read the whole of `file` into memory (into the memory used by the shell executing the `for` loop). This is a bit silly and also inelegant. Instead, one should do

```
while IFS= read -r line; do ...; done <file
```

This will process `file` line by line (but do read Understanding "IFS= read -r line").
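As a small, concrete sketch (the task of numbering lines is just an assumption for illustration), such a loop only ever holds the current line and a counter in memory:

```
# Number each line of "file" while holding only the current line in memory.
n=0
while IFS= read -r line; do
    n=$((n + 1))
    printf '%d\t%s\n' "$n" "$line"
done <file
```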
Processing files line by line in the shell is seldom needed, though, as most utilities are line-oriented anyway (see Why is using a shell loop to process text considered bad practice?).
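That said, a task like the line numbering above is better handled by a single utility that already streams its input:

```
# Either of these numbers the lines of "file" without a shell loop,
# still reading it line by line.
nl -ba file
awk '{ print NR "\t" $0 }' file
```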
I'm working in bioinformatics, and when processing huge amounts of genomic data I would not be able to do much unless I kept only the bits of the data that are absolutely necessary in memory. For example, when I need to strip the bits of data that could be used to identify individuals from a 1-terabyte dataset containing DNA variants in a VCF file (because that type of data can't be made public), I do line-by-line processing with a simple `awk` program (this is possible since the VCF format is line-oriented). I do not read the file into memory, process it there, and write it back out again! If the file was compressed, I would feed it through `zcat` or `gzip -d -c`, which, since `gzip` does stream processing of data, would also not read the whole file into memory.
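A rough sketch of what such a pipeline can look like (the file names, and the choice of keeping only the first eight fixed VCF columns, i.e. dropping the per-sample genotype columns, are assumptions for the sake of illustration):

```
# Stream a compressed VCF, keep only the first eight fixed columns
# (dropping FORMAT and all per-sample columns), and recompress the result,
# all without ever holding the whole file in memory.
zcat variants.vcf.gz |
awk 'BEGIN { FS = OFS = "\t" }
     /^##/ { print; next }                        # meta-information lines pass through
     { print $1, $2, $3, $4, $5, $6, $7, $8 }     # header and data lines: first 8 fields
' |
gzip -c >variants.stripped.vcf.gz
```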
Even with file formats that are not line-oriented, like JSON or XML, there are stream parsers that make it possible to process huge files without storing them entirely in RAM.
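For JSON, for example, `jq` has a streaming mode (a sketch, assuming jq 1.5 or later and a hypothetical file name):

```
# --stream makes jq emit [path, value] pairs as it parses, so the whole
# document never has to fit in memory; here we just peek at the first leaves.
zcat huge.json.gz | jq -c --stream 'select(length == 2)' | head
```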
With executables, it's slightly more complicated since shared libraries may be loaded on demand, and/or be shared between processes (see Loading of shared libraries and RAM usage, for example).
Caching is something I haven't mentioned so far. This is the practice of using RAM to hold frequently accessed pieces of data. Smaller files (for example executables) may be cached by the OS in the hope that the user will make many references to them. Apart from the first reading of the file, subsequent accesses will then be made to RAM rather than to disk. Caching, like buffering of input and output, is usually largely transparent to the user, and the amount of memory used for caching may change dynamically depending on the amount of RAM allocated by applications etc.
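On Linux you can watch this happening (assuming the procps `free` utility is available):

```
# The "buff/cache" column shows RAM currently used by the kernel for caches
# and buffers; it is given back automatically when applications need it.
free -h
```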
¹ Technically, most programs probably read a chunk of the input data at a time, either using explicit buffering or implicitly through the buffering that the standard I/O libraries do, and then present that chunk line by line to the user's code. It's much more efficient to read a multiple of the disk's block size than, e.g., a character at a time. This chunk size will seldom be larger than a handful of kilobytes though.
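If you are curious, you can observe those chunked reads on Linux (assuming `strace` is installed; the file name is hypothetical, and the buffer size varies between tools, libraries and versions):

```
# Each read() call fetches a buffer-sized chunk, not a single line or character.
strace -e trace=read wc -l somefile
```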