I am getting unexpected behaviour when my machine runs out of memory.

I have an Intel i7-6700 with 32GB of RAM and I'm running Arch Linux with a vanilla 4.14.8 kernel. I have 32GB of swap on an encrypted LVM volume on an SSD.

During normal operation I run a couple of QEMU/KVM guests, along with other stuff (XFCE, Firefox etc.). The normal memory usage is about 20-30%, with almost no swap.

But when I run something memory-intensive (e.g. 7za a -md=29 to compress a large file), the system hangs/freezes when memory usage gets to 100%. The keyboard and mouse stop responding completely, the display freezes, disk activity stops, and any TCP connections to the machine hang in the SYN phase. The only way to recover from this situation is to power-cycle the machine.

In the moment just before the hang, one can see that virtually no swap space is being used. Of course, swap is enabled, and I am not using any particular sysctl settings related to memory (in particular, my vm.swappiness has the default value of 60).
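For anyone comparing against their own machine, here is a minimal sketch of how the memory-related tunables could be captured. It assumes the standard /proc/sys/vm layout; the selection of tunables and the function name `read_vm_sysctls` are my own choices for illustration, not anything mandated by the kernel.

```python
from pathlib import Path

# Assumption: the standard procfs location for VM tunables.
VM_SYSCTL_DIR = "/proc/sys/vm"
# Assumption: these are the tunables most relevant here; the selection is arbitrary.
NAMES = ["swappiness", "overcommit_memory", "overcommit_ratio", "min_free_kbytes"]


def read_vm_sysctls(base=VM_SYSCTL_DIR, names=NAMES):
    """Return {name: value as string}, with None for entries that cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_sysctls().items():
        print(f"vm.{name} = {value}")
```

Saving this output before and after a test run makes it easy to confirm that nothing changed the defaults in between.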

What I don't understand is this:

  • Why doesn't the kernel use the swap space?
  • Why does the oom-killer not kick in when memory is exhausted?

I am not a kernel expert, but as I understand it, the system is not supposed to freeze/hang when running out of memory. What I would expect to see is this:

  • When there is swap space available, no process should be killed until both memory and swap are consumed (in my case, 64G).
  • Even with no swap, the oom-killer is supposed to kill 7za when memory runs out.
  • Even without both of the above, any process trying to allocate more memory than is available should get an error and fail gracefully.

So there are in fact 3 independent mechanisms to prevent running out of memory, but all of them appear to fail. I realize there might be some subtle issues I don't know about (e.g. memory ballooning in guest VMs, locked memory etc.), but I really cannot think of anything that would explain the behaviour I am seeing.

Can somebody explain what is going on here and why? Am I just missing something? Can I do something to deterministically prevent hanging?

EDIT:

I ran some differential tests and I've found that:

  • Encrypted swap on LVM volume => machine freezes.
  • Encrypted swap on partition => everything OK (swap gets used as expected, machine does not freeze).

It would appear that the problem is somehow related to LVM. I used the same physical partition in both cases, so it's not disk-related either. During the tests, I left vm.swappiness at 60 (default).
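To make the two configurations easier to tell apart when reproducing this, here is a rough sketch that lists each active swap area and the device-mapper layers underneath it. It assumes that /proc/swaps and /sys/class/block are available, and that device-mapper UUIDs start with "LVM-" for LVM volumes and "CRYPT-" for dm-crypt mappings. All function names are hypothetical helpers of mine.

```python
from pathlib import Path


def parse_proc_swaps(text):
    """Return the device/file paths listed in /proc/swaps-style text."""
    lines = text.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]


def classify_dm_uuid(uuid):
    """Guess the device-mapper layer type from its UUID prefix (an assumption)."""
    if uuid.startswith("LVM-"):
        return "lvm"
    if uuid.startswith("CRYPT-"):
        return "dm-crypt"
    return "other"


def describe_stack(device, sys_block="/sys/class/block", depth=0):
    """Yield one indented line per layer, following 'slaves' downwards."""
    name = Path(device).resolve().name  # e.g. /dev/mapper/x -> dm-N
    node = Path(sys_block) / name
    uuid_file = node / "dm" / "uuid"
    if uuid_file.exists():
        kind = classify_dm_uuid(uuid_file.read_text().strip())
    else:
        kind = "not device-mapper"
    yield "  " * depth + f"{name} ({kind})"
    slaves = node / "slaves"
    if slaves.is_dir():
        for slave in sorted(slaves.iterdir()):
            yield from describe_stack("/dev/" + slave.name, sys_block, depth + 1)


if __name__ == "__main__":
    swaps = Path("/proc/swaps")
    if swaps.exists():
        for dev in parse_proc_swaps(swaps.read_text()):
            print("\n".join(describe_stack(dev)))
```

On the freezing setup I would expect an "lvm" layer somewhere in the printed stack, and none on the working one, but I have not verified the output on every layout.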

Just as a side note - during one particular test, I noticed that in htop, one "notch" appeared in the swap bar just before the machine froze. So the kernel actually started to use swap, but it only lasted for about 3 seconds.
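Since htop only shows this for a moment, a small logger that records memory and swap usage once per second might capture the last seconds before the freeze more reliably. The following is a sketch under assumptions: it reads /proc/meminfo, and the log path, interval, and function names are arbitrary choices of mine. Each line is fsynced so that it has a chance of surviving the power cycle, although the fsync itself may block if the storage stack is what hangs.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def format_sample(info, now):
    """One log line: timestamp, available RAM and used swap, in MiB."""
    avail = info.get("MemAvailable", 0) // 1024
    swap_used = (info.get("SwapTotal", 0) - info.get("SwapFree", 0)) // 1024
    return f"{now:.0f} avail={avail}MiB swap_used={swap_used}MiB"


def log_samples(path="memlog.txt", interval=1.0, count=3600):
    """Append `count` samples to `path`, one every `interval` seconds."""
    with open(path, "a") as log:
        for _ in range(count):
            with open("/proc/meminfo") as f:
                info = parse_meminfo(f.read())
            log.write(format_sample(info, time.time()) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before the next sample
            time.sleep(interval)
```

Calling `log_samples()` in a separate terminal before starting 7za, with the log on a file system that does not sit on the LVM volume, would show how far swap usage got before the hang.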

The problem should be easily reproducible.

UPDATE:

For anybody following up on this, I determined that the problem is specific to using swap space on top of LVM (encrypted or not). This was tested on 4.x kernels, and I was not able to avoid these hangs by tweaking sysctl parameters. I have no info about 5.x at the moment. It seems like a kernel bug to me.

  • How do you know disk activity stops? Do you have an HDD light that goes off? It almost seems to me like the only thing happening is that swap is being written/read, and the entire system appears frozen because I/O is extremely slow as a result - possibly to the point where commands showing I/O are locking up?
    – davidgo
    Commented Dec 31, 2017 at 3:59
  • 32 gigs of swap is an awful lot for an HDD-based system.
    – davidgo
    Commented Dec 31, 2017 at 4:00
  • I did some additional tests, see "EDIT" section above.
    – jurez
    Commented Dec 31, 2017 at 17:05
  • Anything of note in dmesg or syslogs?
    – Adam
    Commented May 13, 2018 at 3:54
  • Try github.com/rfjakob/earlyoom or github.com/hakavlad/nohang
    – Ben Creasy
    Commented Mar 9, 2020 at 6:57

1 Answer

I've seen a similar result happen - but the problem isn't lack of memory; it's a process that eats up space in the root partition/volume.

For example, this is commonly caused by excessive writing to /tmp or to another file system under /. The kernel will swap out anything it can (which isn't much) in an effort to hold the unwritten data in RAM buffers. Fairly quickly this fails and everything grinds to a halt.

Normally you would get warning messages issued - but you may not see them for an especially storage-greedy process.
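One way to check for this is to look at how full the relevant file systems are while the greedy process runs. Here is a minimal sketch using Python's `shutil.disk_usage`; the paths `/` and `/tmp` are the ones mentioned above, and the helper names are made up for illustration.

```python
import shutil


def percent_used(total, used):
    """Used space as a percentage of total, rounded to one decimal."""
    return round(100.0 * used / total, 1) if total else 0.0


def report(paths=("/", "/tmp")):
    """Return {path: percent used}, skipping paths that cannot be queried."""
    result = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            continue
        result[path] = percent_used(usage.total, usage.used)
    return result


if __name__ == "__main__":
    for path, pct in report().items():
        print(f"{path}: {pct}% used")
```

If the percentage for / or /tmp climbs steadily towards 100 during the compression run, that would point to storage rather than memory as the trigger.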

  • In particular, you can't see any logs if you are using a desktop environment and UI applications.
    – BAZTED
    Commented Mar 27, 2020 at 8:10
