There are multiple reasons why OOM events can occur before the available swap space is entirely used, and an OOM event may trigger the OOM-killer thread or, worse… nasty signals:
A/ Generalities regarding memory allocation and OOM events
Because kernel devs are aware that a lot of programs malloc() huge amounts of memory "just-in-case" without using much of it, and because it can statistically be expected that all the processes running on the system won't simultaneously need the memory they requested, the kernel does not actually reserve the memory at malloc() (or friends) time.
It will instead wait for the first write access into the memory (which necessarily triggers a page fault) to establish the real mapping.
If, at this point, there is no memory immediately available, the kernel will wait for better days (1), and if these better days don't come quickly enough it will fire an OOM event.
An OOM event that, depending on a sysctl setting (vm.panic_on_oom), will either trigger the OOM-killer or generate a kernel panic.
B/ Why OOM events might occur irrespective of the amount of free space in swap (2)
- B.1/ Because the swapping process is not fast enough at freeing space:
As seen in §A, the kernel will not wait long for some memory to become available. So if no running process happens to release some memory and the filesystem cache is already reduced to its strict minimum, so that swapping out is the only way to free memory pages… this just won't fit into the grace period. An OOM event will be fired even if gigabytes of memory could have been swapped.
Random accesses to disk are slow, and access to the swap area is even slower, since the swap space is likely to stand on the same disk as the filesystems used by the running processes.
There is nevertheless a way to try to keep the system from falling into that situation. Remember Achilles and the tortoise: start swapping out earlier, moving pages at a time when the system is not yet in need of physical memory.
This is what you actually, indirectly (3), managed to obtain by increasing swappiness. But, because this is just a side effect of that setting, the "best" value suffers from a high standard deviation and is highly workload dependent. Benchmarks needed. (4)(5)
- B.2/ Because the system has already swapped out everything that could be swapped
Processes using the mlock() system call can obtain pages that are guaranteed by design never to be swapped out. Worse? mlockall() (6), which can indeed make a fair amount of MB unswappable.
HugeTLB pages also cannot be swapped out under memory pressure; cat /proc/meminfo
will report the amount of memory reserved for serving their purpose.
C/ Why threads can terminate when memory pressure is high without the OOM-killer logging anything. (7)
- C.1/ Per application design
The decision to over-allocate is taken by the kernel at the time malloc()
is issued. And even though the kernel defaults to an "optimistic" strategy, it can always happen that the kernel refuses the reservation request, returning a NULL pointer to the thread calling malloc().
In which case, depending on how the calling process handles this, it will either wait for better times before renewing its request, or gracefully abort, or even… just ignore the NULL and segfault, terminating and possibly causing the premature death of its parents in cascade, which in turn releases a fair amount of memory without the OOM-killer needing to intervene (and, once again, irrespective of the space remaining in swap).
- C.2/ Because some thread caught some nasty signal
Because the system can also tolerate over-allocation of huge pages, if no huge page is available at page-fault time, the task is sent a SIGBUS and often dies an unhappy death.
1: Hmmm… better make that milliseconds, actually, since it will check up to six times at most with a couple of nanoseconds' wait in between. Note that these figures come from my memory of now-old kernels; they might have changed since.
2: Please note that, strictly speaking, Linux does not swap, since swapping refers to the transfer of an entire process address space to disk. Linux actually implements paging, since it in fact transfers individual pages. However, docs & discussions say swapping… so be it.
3: "Indirectly" because starting to swap earlier is only a side effect of that setting, which is primarily intended to express your preference between the filesystem cache and processes' pages.
Because filesystem I/O is costly, Linux will use as much physical memory as it can for caching.
The higher the value of swappiness, the more aggressively the system will swap out process pages, as early as process launch time, incidentally increasing the amount of cache pages that remain quickly reclaimable under memory pressure.
4: This, BTW, also explains the converse of your question: why is the system swapping when it has a lot of free RAM available?
5: While we can read major institutions (RHEL, Oracle…) advising setting swappiness to the strict minimum… (and buying more RAM…), Andrew Morton (a leading kernel developer) strongly advises a value of 100.
With the availability of technologies such as zswap, which can make the cost of swapping cheaper than filesystem I/O, values of swappiness greater than 100 would not even be absurd.
6: From the mlockall(2) man page: "mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked."
7: Keep in mind that even when launched, the OOM-killer is rather… lazy, preferring that the nasty tasks terminate by themselves. So if signals are pending for the culprit… the OOM-killer will wait for them to take effect… just in case…
As for whether vm.swappiness
affects whether or not the OOM Killer is triggered: vm.swappiness
was a red herring.