
I am using Linux 5.15 with Ubuntu 22.04.

I have a process that uses a lot of memory. It requires more memory than I have RAM in my machine. The first time that I ran it, it was killed by the OOM Killer. I understand this: the system ran out of memory, the OOM Killer was triggered, my process was killed. This makes sense. I am also certain that this is what happened: I took a look at dmesg and it's all there.

So I added some swap space. I don't mind if this process takes a long time to run: I won't run it often.

I ran the process again. This time it ran for longer than the first time. The whole system became very laggy, in that way that systems do when they are swapping a lot. It seemed to be working... and then it died. Not only did the process die, but the shell process that was its parent died too, and the Tmux process that was its parent, and the shell process that was the Tmux process' parent, and even the GNOME terminal process that was its parent! But then the process murder stopped: no more parents died.

At first, I thought the OOM Killer had been triggered again - even though there was plenty of swap space still available - and that it had chosen to kill the GNOME terminal process. But I checked dmesg and journalctl -k and there was nothing new there. There was no sign that the OOM Killer had been triggered.

So, first question: is there any circumstance in which the OOM Killer can be triggered without it logging anything to the kernel ring buffer?

It puzzled me that the Linux kernel seemed to have started swapping but somehow it hadn't swapped enough... or it hadn't swapped fast enough... or something.

So I increased vm.swappiness. This really shouldn't affect system stability: it's just a knob to turn for performance optimization. Even with vm.swappiness set to 0 the kernel should still start swapping when the free memory in a zone drops below a critical threshold.

But it kind of seemed like it had started swapping but hadn't swapped enough... so I increased vm.swappiness to 100 to encourage it to swap a bit more.

Then I ran the process again. The whole system became very laggy, in that way that systems do when they are swapping a lot... until the process ran successfully to completion.

So, second question: why did the kernel not use the available swap space, even when free memory had dropped below the critical threshold and there was certainly plenty of swap space available? Why did changing vm.swappiness make a difference?

Update:

Further testing revealed that setting vm.swappiness is not a reliable solution. I've had some failures even with vm.swappiness set to 100. It might improve the chances of the process completing successfully but I'm not sure.

  • "I have a process that uses a lot of memory" How ? (by repeatedly using malloc or friends ? / other ?)
    – MC68020
    Commented Dec 5, 2022 at 8:34
  • I don't know; I didn't write it and I haven't debugged it. Is it relevant to either of my questions? If so, that would be a great thing to explain in an answer! I wasn't aware that the OOM Killer behaved differently depending on how a process allocates memory.
    – c--
    Commented Dec 5, 2022 at 19:05
  • Yes, there are significant differences between a process whose image requires a given amount of memory at launch time, a process that at some point during its execution requests the same amount of memory, and a process that repeatedly and quickly requests small chunks whose sum eventually equals that same amount. This difference is definitely relevant to your question as worded in the title, though not to the "first question" as you labelled it. I'll try to provide an answer…
    – MC68020
    Commented Dec 5, 2022 at 22:10
  • Awesome, thank you. Presumably it doesn't require all the memory at startup, since it runs for a while before crashing when swap is added. But I don't know whether it makes lots of small allocations continuously or occasional big ones. Either way, I am surprised that vm.swappiness affects whether or not the OOM Killer is triggered. I look forward to your answer!
    – c--
    Commented Dec 6, 2022 at 8:09
  • I have added some new observations to my question. It seems that perhaps vm.swappiness was a red herring.
    – c--
    Commented Dec 7, 2022 at 21:34

3 Answers


There are multiple reasons why OOM events can occur before the available swap space has been used up entirely, and an OOM event may trigger the OOM-killer thread or, worse… nasty signals:

A/ Generalities regarding memory allocation and OOM events
Because kernel developers are aware that a lot of programs malloc() huge amounts of memory "just in case" and never use much of it, and can at the very least statistically expect that all the processes running on the system will not need the memory they requested simultaneously, the kernel does not actually reserve the memory at malloc() (or friends) time.
It will instead wait for the first write access to that memory (which necessarily causes a page fault) before making the real mapping.
If, at that point, no memory is immediately available, the kernel will wait for better days (1), and if those better days do not come quickly enough it will fire an OOM event. Depending on the vm.panic_on_oom sysctl setting, that OOM event will either trigger the OOM-killer or generate a kernel panic.
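
To make that concrete, here is a minimal sketch of the lazy-allocation behaviour (assuming a 64-bit Linux box with the default overcommit heuristic; the 8 GiB figure is arbitrary). The malloc() call itself normally returns successfully straight away, however large it is; physical memory is only consumed, one page fault at a time, as the pages are written to:

    /* overcommit_sketch.c - illustrate lazy allocation / overcommit.
     * Build: cc -o overcommit_sketch overcommit_sketch.c
     * Warning: touching all of the allocation can push a small machine
     * into heavy swapping or an OOM event - run with care.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = (size_t)8 * 1024 * 1024 * 1024;   /* 8 GiB, arbitrary */

        char *p = malloc(len);   /* usually returns at once: nothing is reserved yet */
        if (p == NULL) {
            perror("malloc");
            return 1;
        }
        puts("malloc() succeeded; hardly any physical memory is committed yet");

        /* Only the write accesses below force the kernel to find real pages,
         * one page fault at a time. This is where memory pressure builds up. */
        for (size_t i = 0; i < len; i += 4096)
            p[i] = 1;

        puts("all pages touched");
        free(p);
        return 0;
    }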

B/ Why OOM events might occur irrespective of the amount of free space in swap (2)

  • B.1/ Because the swapping process cannot free space fast enough:
    As seen in §A, the kernel will not wait long for memory to become available. So if no running process happens to release memory, and the filesystem cache is already reduced to its strict minimum so that swapping out is the only way left to free pages… that simply won't fit within the grace period. An OOM event will be fired even though gigabytes of memory could have been swapped out.
    Random disk accesses are slow, and access to the swap area is even slower, since the swap space is likely to sit on the same disk as the filesystems used by the running processes.
    There is nevertheless a way to try to keep the system from falling into that situation. Remember Achilles and the tortoise: start swapping out earlier. Start moving pages out while the system is not yet starved of physical memory.
    This is what you indirectly (3) obtained by increasing swappiness. But because it is only a side effect of that setting, the "best" value has a large spread and is highly workload dependent. Benchmarks are needed. (4)(5)
  • B.2/ Because the system has already swapped out everything that could be swapped:
    Processes that use the mlock() system call can obtain pages that are guaranteed by design never to be swapped out. Worse? mlockall(). (6)
    This can indeed leave a fair number of megabytes unswappable.
    HugeTLB pages also cannot be swapped out under memory pressure; cat /proc/meminfo will report the amount of memory reserved for them. (A small sketch for inspecting these counters follows this list.)
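
A minimal sketch for inspecting those counters (field names as they appear in /proc/meminfo on recent kernels; plain cat or grep on /proc/meminfo works just as well):

    /* meminfo_pinned.c - print the /proc/meminfo counters describing memory
     * that cannot be swapped out (locked, unevictable, huge pages).
     * Build: cc -o meminfo_pinned meminfo_pinned.c
     */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *keys[] = { "Mlocked:", "Unevictable:",
                               "HugePages_Total:", "Hugetlb:" };
        char line[256];

        FILE *f = fopen("/proc/meminfo", "r");
        if (f == NULL) {
            perror("/proc/meminfo");
            return 1;
        }
        while (fgets(line, sizeof line, f) != NULL)
            for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++)
                if (strncmp(line, keys[i], strlen(keys[i])) == 0)
                    fputs(line, stdout);
        fclose(f);
        return 0;
    }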

C/ Why threads can terminate when memory pressure is high without the OOM-killer logging anything. (7)

  • C.1/ Per application design:
    The decision to over-allocate is taken by the kernel at the time malloc() is issued. Even though the kernel defaults to an "optimistic" strategy, it can still refuse the reservation request, returning a NULL pointer to the calling thread.
    In that case, depending on how the calling process handles the failure, it will either wait for better times before renewing its request, abort gracefully, or even… just ignore the NULL pointer and segfault. That termination can cause the premature death of parents in cascade, which in turn releases a fair amount of memory without the OOM-killer ever needing to intervene (and, once again, irrespective of the space remaining in swap). A minimal sketch of both ways of handling a refused malloc() follows this list.
  • C.2/ Because some thread caught a nasty signal:
    The system also tolerates over-allocation of huge pages, so if no huge page is available at page-fault time, the task is sent a SIGBUS and often dies an unhappy death.
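
A minimal sketch of C.1 (the absurdly large request is chosen so that even the default optimistic overcommit policy is likely to refuse it; names are purely illustrative), comparing a caller that checks the malloc() return value with one that does not:

    /* malloc_refused.c - the two ways a process can react when the kernel
     * (via libc) refuses an allocation request.
     * Build: cc -o malloc_refused malloc_refused.c
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = (size_t)1 << 46;             /* 64 TiB: likely to be refused */

        /* Careless caller: the NULL return is ignored, so the first write
         * would be a plain segfault - no OOM-killer involved, nothing special
         * in the kernel log. */
        char *careless = malloc(len);
        /* careless[0] = 1;                          <- would crash here */
        (void)careless;

        /* Careful caller: the refusal is detected and the process can
         * retry later or abort gracefully. */
        char *careful = malloc(len);
        if (careful == NULL) {
            fprintf(stderr, "allocation of %zu bytes refused\n", len);
            return 1;                              /* graceful abort, no segfault */
        }
        careful[0] = 1;                            /* overcommit allowed it after all */
        free(careful);
        return 0;
    }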

1: Hmm, better milliseconds actually, since it will check up to six times at most with a couple of nanoseconds' wait in between. Note that these figures come from my memory of now-old kernels; they may have changed since.

2: Please note that, strictly speaking, Linux does not swap, since swapping refers to the transfer of an entire process address space to disk. Linux actually implements paging, since it in fact transfers individual pages. However, docs and discussions say swapping… so be it.

3 : "indirectly" because starting swapping earlier is only a side effect of that setting which is primarily intended to tell your preference filesystem cache vs process' pages.
Because filesystem's IO is costly, linux will use as much physical memory as it can for caching.
The higher the value of the swappiness, the more aggressive the system will be swapping process' pages as soon as process' launch time, this incidentally increasing the amount of cache pages quickly reclaimable under memory pressure.

4: This, by the way, also explains the converse of your question: why is the system swapping when it has plenty of free RAM available?

5: While major vendors (RHEL, Oracle…) can be found advising that swappiness be set to the strict minimum (and that you buy more RAM…), Andrew Morton (a leading kernel developer) strongly advises a value of 100.
With technologies such as zswap, which can make swapping cheaper than filesystem I/O, swappiness values greater than 100 would not even be absurd.

6:

  mlockall() locks all pages mapped into the address space of the
   calling process.  This includes the pages of the code, data, and
   stack segment, as well as shared libraries, user space kernel
   data, shared memory, and memory-mapped files.  All mapped pages
   are guaranteed to be resident in RAM when the call returns
   successfully; the pages are guaranteed to stay in RAM until later
   unlocked.
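
A minimal sketch of such a process (it needs CAP_IPC_LOCK or a sufficiently large RLIMIT_MEMLOCK to succeed; the 64 MiB figure is arbitrary):

    /* mlockall_sketch.c - pin every current and future page of this process
     * in RAM so that none of it can ever be swapped out.
     * Build: cc -o mlockall_sketch mlockall_sketch.c
     * Needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* Every page this process touches from now on stays resident: it adds
         * to memory pressure but can never be moved to the swap area. */
        size_t len = 64 * 1024 * 1024;             /* 64 MiB, arbitrary */
        char *p = malloc(len);
        if (p == NULL) {
            perror("malloc");
            return 1;
        }
        memset(p, 0, len);
        puts("64 MiB locked in RAM - compare Mlocked: in /proc/meminfo");

        free(p);
        munlockall();
        return 0;
    }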

7: Keep in mind that even when invoked, the OOM-killer is rather… lazy, preferring the offending tasks to terminate by themselves. So if signals are already pending for the culprit… the OOM-killer will wait for them to take effect… just in case…

  • I've never seen any mention of the OOM killer being triggered by slow swap, and I've certainly never witnessed it myself, even under thrashing where waits for swap appear to exceed 1000 ms. Do you have any citations for this behaviour? Commented Dec 9, 2022 at 13:09
  • @PhilipCouling: From linux-mm.org itself, under "What causes these OOM events?" / "The kernel is not using its swap space properly", second paragraph ("it is also possible for the system to find itself in a sort of deadlock…"): linux-mm.org/OOM
    – MC68020
    Commented Dec 9, 2022 at 13:19
  • Thank you, this is a great explanation. I think your B.1 answers my second question. I'm amazed that Linux works this way - it seems like a truly stupid design to trigger an OOM event when there's still plenty of memory - but if it really does work by just waiting a fixed period for swapping to free enough memory then that would explain the behaviour I saw. I'd have to do more tests to conclusively rule out B.2 but it seems very unlikely, given the successful runs.
    – c--
    Commented Dec 9, 2022 at 13:33
  • I'm not convinced by B.1 as worded. The reference given in the comment discusses an issue caused by deadlock, not timing. The only way this involves timing is that you can attempt to avoid the deadlock by trying to avoid exhausting available physical memory. To hit this situation, the storage driver would need to request memory to complete the I/O. In any case, it's the deadlock that causes it, not the timing. Commented Dec 9, 2022 at 13:36
  • I don't think either your C.1 or C.2 answers my first question though. I can't see why either malloc returning NULL or the process receiving SIGBUS could cause other processes to die. Also, I think unhandled signals are usually logged by the kernel too these days, aren't they?
    – c--
    Commented Dec 9, 2022 at 13:39

First of all, I'd like to thank MC68020 for taking the time to look into this for me. As it happens, their answer didn't include what was really happening in this situation - but they got the bounty anyway as it's a great answer and a helpful reference for the future.

I'd also like to thank Philip Couling for his answer, which also wasn't quite right but pointed me in the right direction.

The problem turned out to be systemd-oomd.

The problem and its solution are described here: How do I disable the systemd OOM process killer in Ubuntu 22.04?

In short:

systemctl disable --now systemd-oomd
systemctl mask systemd-oomd

And now I can reliably run my process to completion every time, without some systemd service killing the entire process tree with no warning.

  • Very happy you found the clue, especially since it's something I could not have written about, as my systems run OpenRC. So +1++. I remain committed anyway to providing insights from the kernel's code.
    – MC68020
    Commented Dec 12, 2022 at 20:43
  • I'm happy to find that Linux isn't as crazy as it was beginning to seem! If it had really been triggering the OOM Killer because it couldn't swap fast enough to a local SSD that would have been... an interesting design decision. Thanks for your help :)
    – c--
    Commented Dec 12, 2022 at 23:13

I'm not aware of any causes that are likely to result in the OOM killer killing processes but not logging the fact. There's an edge case where OOM Killer might take down the process responsible for writing kernel logs to disk. This seems unlikely from your description.

I would take two details from your description as important and related:

  • The lack of an OOM-Killer log
  • The fact that the whole process tree including the GUI window disappeared.

It's a bit of a guess, but this smells like the GUI itself is killing it.

It's quite possible that the thrashing made it look as if the program had crashed. I've seen cases where, for example, browsers were killed off during intensive thrashing: a crash detector sees no activity and assumes the program itself has gone wrong, not understanding that the program was simply waiting for the kernel to respond.

I would try switching console and running it from a command line without the GUI. This would at least rule out any interference from GNOME itself.

  • Thanks, it didn't occur to me that GNOME itself might kill a process. I will try running it from a text console.
    – c--
    Commented Dec 9, 2022 at 18:28
