
I am collecting numbers for monitoring HPC servers and am debating the policy for handing out memory (overcommit or not). I wanted to show users how much virtual memory their processes (summed over the whole machine) requested vs. how much was actually used.

I thought I'd get the interesting values from /proc/meminfo using the fields MemTotal, MemAvailable, and Committed_AS. The last one is supposed to show how much memory the kernel has committed to: a worst-case number for how much memory would really be needed to fulfill the running tasks.
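Roughly, the probe I have in mind looks like this (a minimal sketch, not the actual monitoring script):

# Minimal sketch: pull the three fields and compare "used without caches"
# (MemTotal - MemAvailable) against what the kernel has committed.
awk '/^(MemTotal|MemAvailable|Committed_AS):/ { v[$1] = $2 }
     END {
         printf "used: %.1f GiB  committed: %.1f GiB\n",
                (v["MemTotal:"] - v["MemAvailable:"]) / 1048576,
                v["Committed_AS:"] / 1048576
     }' /proc/meminfo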

But Committed_AS is obviously too small. It is smaller than the currently used memory! Observe two example systems, first an admin server:

# cat /proc/meminfo 
MemTotal:       16322624 kB
MemFree:          536520 kB
MemAvailable:   13853216 kB
Buffers:             156 kB
Cached:          9824132 kB
SwapCached:            0 kB
Active:          4854772 kB
Inactive:        5386896 kB
Active(anon):      33468 kB
Inactive(anon):   412616 kB
Active(file):    4821304 kB
Inactive(file):  4974280 kB
Unevictable:       10948 kB
Mlocked:           10948 kB
SwapTotal:      16777212 kB
SwapFree:       16777212 kB
Dirty:               884 kB
Writeback:             0 kB
AnonPages:        428460 kB
Mapped:            53236 kB
Shmem:             26336 kB
Slab:            4144888 kB
SReclaimable:    3863416 kB
SUnreclaim:       281472 kB
KernelStack:       12208 kB
PageTables:        38068 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    24938524 kB
Committed_AS:    1488188 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      317176 kB
VmallocChunk:   34358947836 kB
HardwareCorrupted:     0 kB
AnonHugePages:     90112 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      144924 kB
DirectMap2M:     4988928 kB
DirectMap1G:    13631488 kB

This is roughly 1.5G committed vs. 2.5G being in use without caches. A compute node:

ssh node390 cat /proc/meminfo
MemTotal:       264044768 kB
MemFree:        208603740 kB
MemAvailable:   215043512 kB
Buffers:           15500 kB
Cached:           756664 kB
SwapCached:            0 kB
Active:         44890644 kB
Inactive:         734820 kB
Active(anon):   44853608 kB
Inactive(anon):   645100 kB
Active(file):      37036 kB
Inactive(file):    89720 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      134216700 kB
SwapFree:       134216700 kB
Dirty:                 0 kB
Writeback:           140 kB
AnonPages:      44918876 kB
Mapped:            52664 kB
Shmem:            645408 kB
Slab:            7837028 kB
SReclaimable:    7147872 kB
SUnreclaim:       689156 kB
KernelStack:        8192 kB
PageTables:        91528 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    345452512 kB
Committed_AS:   46393904 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      797140 kB
VmallocChunk:   34224733184 kB
HardwareCorrupted:     0 kB
AnonHugePages:  41498624 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      312640 kB
DirectMap2M:     7966720 kB
DirectMap1G:    262144000 kB

This is around 47G used vs. 44G committed. The system in question is a CentOS 7 cluster:

uname -a
Linux adm1 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

On my Linux desktop using a vanilla kernel, I see more 'reasonable' numbers with 32G being committed compared to 15.5G being in use. On a Debian server I see 0.4G in use vs. 1.5G committed.

Can someone explain this to me? How do I get a correct number for the committed memory? Is this a bug in the CentOS/RHEL kernel that should be reported?

Update with more data and a comparison between systems

A listing of used/committed memory for various systems I could access, with a note about the kind of load:

  • SLES 11.4 (kernel 3.0.101-108.71-default)
    • 17.6G/17.4G, interactive multiuser HPC (e.g. MATLAB, GIS)
  • CentOS 7.4/7.5 (kernel 3.10.0-862.11.6.el7 or 3.10.0-862.14.4.el7)
    • 1.7G/1.3G, admin server, cluster mgmt, DHCP, TFTP, rsyslog, …
    • 8.6G/1.7G, SLURM batch system, 7.2G RSS for slurmdbd alone
    • 5.1G/0.6G, NFS server (400 clients)
    • 26.8G/32.6G, 16-core HPC node loaded with 328 (need to talk to the user) GNU R processes
    • 6.5G/8.1G, 16-core HPC node with 16 MPI processes
  • Ubuntu 16.04 (kernel 4.15.0-33-generic)
    • 1.3G/2.2G, 6-core HPC node, 6-threaded scientific application (1.1G RSS)
    • 19.9G/20.3G, 6-core HPC node, 6-threaded scientific application (19G RSS)
    • 1.0G/4.4G, 6-core login node with BeeGFS metadata/mgmt server
  • Ubuntu 14.04 (kernel 3.13.0-161-generic)
    • 0.7G/0.3G, HTTP server VM
  • Custom build (vanilla kernel 4.4.163)
    • 0.7G/0.04G, mostly idle Subversion server
  • Custom build (vanilla kernel 4.14.30)
    • 14.2G/31.4G, long-running desktop
  • Alpine (kernel 4.4.68-0-grsec)
    • 36.8M/16.4M, some (web) server
  • Ubuntu 12.04 (kernel 3.2.0-89-generic)
    • 1.0G/7.1G, some server
  • Ubuntu 16.04 (kernel 4.4.0-112-generic)
    • 0.9G/1.9G, some server
  • Debian 4.0 (kernel 2.6.18-6-686, 32 bit x86, obviously)
    • 1.0G/0.8G, some reliable server
  • Debian 9.5 (kernel 4.9.0-6)
    • 0.4G/1.5G, various web services, light load, obviously
  • Debian 9.6 (kernel 4.9.0-8-amd64)
    • 10.9G/17.7G, a desktop
  • Ubuntu 13.10 (kernel 3.11.0-26-generic)
    • 3.2G/5.4G, an old desktop
  • Ubuntu 18.04 (kernel 4.15.0-38-generic)
    • 6.4G/18.3G, a desktop

SUnreclaim is rather large for SLES and CentOS: 0.5G to 1G is not uncommon, more if the caches are not flushed from time to time. But that is not enough to explain the missing memory in Committed_AS. The Ubuntu machines typically have below 100M SUnreclaim, except for the 14.04 one, which has a small Committed_AS and 0.4G SUnreclaim. Putting the kernels in order is tricky, as the 3.10 kernel from CentOS has many features of 4.x kernels backported, but there seems to be a line between 4.4 and 4.9 that separates the strangely low values of Committed_AS from the rest. The servers added by some of my peers suggest that Committed_AS also delivers strange numbers for older kernels. Was this broken and fixed multiple times?

Can people confirm this? Is this just buggy/very inaccurate kernel behaviour in determining the values in /proc/meminfo, or is there a bug(fix) history?

Some of the entries in the list are really strange. Having one slurmdbd process with an RSS of four times Committed_AS cannot be right. I am tempted to test a vanilla kernel on these systems with the same workload, but I cannot take the most interesting machines out of production for such games.

I guess the answer to my question is a pointer to the fix in the kernel commit history that enabled good estimates in Committed_AS again. Otherwise, please enlighten me ;-)

Update about two processes having more RSS than Committed_AS

The batch server, which runs an instance of the Slurm database daemon slurmdbd along with slurmctld, is an illuminating example. It has been running for a very long time and shows a stable picture, with those two processes dominating resource use.

# free -k; for p in $(pgrep slurmctld) $(pgrep slurmdbd) ; do cat /proc/$p/smaps|grep Rss| awk '{ print $2}'; done | (sum=0; while read n; do sum=$((sum+n)); done; echo $sum ); cat /proc/meminfo
              total        used        free      shared  buff/cache   available
Mem:       16321148     5873792      380624      304180    10066732     9958140
Swap:      16777212        1024    16776188
4703676
MemTotal:       16321148 kB
MemFree:          379708 kB
MemAvailable:    9957224 kB
Buffers:               0 kB
Cached:          8865800 kB
SwapCached:          184 kB
Active:          7725080 kB
Inactive:        6475796 kB
Active(anon):    4634460 kB
Inactive(anon):  1007132 kB
Active(file):    3090620 kB
Inactive(file):  5468664 kB
Unevictable:       10952 kB
Mlocked:           10952 kB
SwapTotal:      16777212 kB
SwapFree:       16776188 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:       5345868 kB
Mapped:            79092 kB
Shmem:            304180 kB
Slab:            1287396 kB
SReclaimable:    1200932 kB
SUnreclaim:        86464 kB
KernelStack:        5252 kB
PageTables:        19852 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    24937784 kB
Committed_AS:    1964548 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     1814044 kB
DirectMap2M:    14854144 kB
DirectMap1G:     2097152 kB

Here you see the Rss of the two processes amounting to 4.5G (slurmdbd alone is 3.2G). The Rss kind of matches the active anon pages, but Committed_AS is less than 2G. Counting the Rss of all processes via /proc comes quite close to AnonPages+Shmem (note: Pss is only about 150M smaller). I don't get how Committed_AS can be smaller than the Rss (or summed Pss) of the active processes. Or, just in the context of meminfo:

How can Committed_AS (1964548 kB) be smaller than AnonPages (5345868 kB)? This is a fairly stable workload. These two extremely long-lived processes are about the only thing that happens on this machine, with rather constant churn (batch jobs on other nodes being managed).
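For reference, the system-wide Rss/Pss sums mentioned above were taken roughly like this (a sketch, not the exact command; unlike Rss, Pss splits shared pages between the processes mapping them, so it is the one to sum):

# Sketch: sum Pss over all processes (needs root to read every smaps).
for f in /proc/[0-9]*/smaps; do
    cat "$f" 2>/dev/null          # processes may vanish while we read
done | awk '/^Pss:/ { sum += $2 }
            END { printf "Pss total: %.1f GiB\n", sum / 1048576 }'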

  • I recently edited the Question with some more data (in two steps). The picture leaves me somewhat confirmed in my belief that there is something wrong with the Committed_AS estimate in, well, some ranges of kernel versions.
    – drhpc
    Commented Nov 30, 2018 at 13:41

3 Answers


Those boxes are not under significant memory pressure, and there is no paging to swap either (SwapFree is untouched). The second box is ~47 GB committed out of 250 GB total; 200 GB is a lot to play with.

In practice, keep increasing the size of the workload until one of these happens:

  • User (application) response time degrades
  • Page out rate is higher than you are comfortable with
  • OOM killer murders some processes

The relationships between the memory counters are unintuitive, vary greatly between workloads, and are probably only really understood by kernel developers. Don't worry about it too much; focus on measuring obvious memory pressure.
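A simple way to do that (just a sketch with the standard tools, nothing exotic) is to watch paging activity and OOM events directly:

# Sustained non-zero si/so columns (swap-in/out, kB/s) mean the box is really paging.
vmstat 5

# Check whether the OOM killer has already struck.
dmesg -T | grep -i 'out of memory'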


Other descriptions of Committed_AS, from the linux-mm list a while ago, emphasize that it is an estimate:

Committed_AS: An estimate of how much RAM you would need to make a
              99.99% guarantee that there never is OOM (out of memory)
              for this workload. Normally the kernel will overcommit
              memory. That means, say you do a 1GB malloc, nothing
              happens, really. Only when you start USING that malloc
              memory you will get real memory on demand, and just as
              much as you use. So you sort of take a mortgage and hope
              the bank doesn't go bust. Other cases might include when
              you mmap a file that's shared only when you write to it
              and you get a private copy of that data. While it normally
              is shared between processes. The Committed_AS is a
              guesstimate of how much RAM/swap you would need
              worst-case.
  • This is not about the machines being in trouble. This is for monitoring resource usage of scientific computing jobs occupying the whole node (hence there is not a single process to work with). The usage pattern differs a bit from a typical server. Since the memory is actually meant to be used by the user jobs, it can be normal that a job requests 300G or more and then starts filling that with data. I want to be able to tell the users that they allocated too much before OOM killing starts, or tell them that they might hit such a limit when scaling up the problem size.
    – drhpc
    Commented Nov 29, 2018 at 18:39
  • About Committed_AS being an estimate: I do not expect it to be exact. It can be off by quite some margin. But the fact that it can be about half of the used non-reclaimable memory, so obviously wrong, led me to assume that I am missing something. Any idea, apart from reading the kernel source code and trying to get an answer on LKML?
    – drhpc
    Commented Nov 29, 2018 at 18:41
  • Workload dependent; please add specifics to your question about whether the dataset is a file, shared memory (many DBMS systems use shared memory buffers), or private pages, and whether huge pages are in use... You can read or memory map a TB-sized file on such a box, and being file backed it isn't likely to use much memory. But if you slurp 1 TB into private pages it is going to OOM. Commented Nov 29, 2018 at 18:57
  • That's my point: I want to know if e.g. mappings of big files are counted in Committed_AS (I think they should). See my update of the question regarding more examples of servers and some workload indication.
    – drhpc
    Commented Nov 30, 2018 at 12:00
  • I fear we're talking past each other, obviously from differing backgrounds. Disregarding any operational considerations for server farms or HPC clusters, I would like to focus on my main point: Committed_AS is supposed to be an estimated upper bound of memory use given the current allocations, however unrealistic it is that it will ever be reached. I observe values that are actually lower than current memory use. This simply looks wrong, and I would like to know whether this is to be expected, whether there is a certain kernel config option that influences this, etc.
    – drhpc
    Commented Dec 4, 2018 at 12:53

Here's another answer purely about Committed_AS being lower than "expected":

The interesting lines from your /proc/meminfo are as follows:

Active:          4854772 kB
Inactive:        5386896 kB
Active(anon):      33468 kB
Inactive(anon):   412616 kB
Active(file):    4821304 kB
Inactive(file):  4974280 kB
Mlocked:           10948 kB
AnonPages:        428460 kB
Shmem:             26336 kB
Committed_AS:    1488188 kB

(Active and Inactive are just the sums of the (anon) and (file) details below them, and AnonPages roughly matches the sum of the (anon) lines. I only included those lines to make this easier to follow.)

As Active(file) is file backed, it does not raise Committed_AS at all, so practically the only things that actually raise your Committed_AS value are AnonPages + Shmem + Mlocked plus spikes in memory usage. Committed_AS is the amount of memory (RAM + swap combined) that the system must be able to provide to the currently running processes even if all caches and buffers are flushed to disk.
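As a rough cross-check (a sketch of mine, not an exact accounting; stacks and short spikes are not captured), you can compare Committed_AS against those sums directly:

# Rough cross-check: the non-file-backed part of current usage vs. Committed_AS.
# Normally Committed_AS should be the larger one, since it also covers
# allocated-but-untouched memory.
awk '/^(AnonPages|Shmem|Mlocked|Committed_AS):/ { v[$1] = $2 }
     END {
         printf "AnonPages+Shmem+Mlocked: %.1f GiB   Committed_AS: %.1f GiB\n",
                (v["AnonPages:"] + v["Shmem:"] + v["Mlocked:"]) / 1048576,
                v["Committed_AS:"] / 1048576
     }' /proc/meminfo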

If a process does a malloc() (which is usually implemented with sbrk() or brk() behind the scenes), the kernel will increase Committed_AS, but it will not show up in the other numbers because the kernel doesn't actually reserve any real RAM until the memory is actually used by the process. (Technically the kernel has set aside a virtual address space range for the process, but the virtual memory mapping for the CPU points to a zero-filled page with a flag saying that if the process tries to write anything, real memory must be allocated on the fly. This allows the process to read zeros from the virtual address space without any real memory being allocated; writing data to the virtually allocated memory area is what actually allocates the memory for real.) It's very common for programs to allocate more (virtual) memory than they actually use, so this is a good feature to have, but it obviously makes memory statistics harder to understand. It seems that your system mostly runs processes that do not acquire a lot of memory they never use, because your Committed_AS is pretty low compared to the other values.

For example, my own system is currently running like this:

MemTotal:       32570748 kB
Active:         12571828 kB
AnonPages:       7689584 kB
Mlocked:           19788 kB
Shmem:           4481940 kB
Committed_AS:   44949856 kB

Note the huge Committed_AS (~45 GB) on my system even though anonymous pages, locked memory and Shmem together only total about 12 GB. As I'm running a desktop environment on this system, I would assume that I have lots of processes that have executed fork() after acquiring/using lots of RAM. In that case the forked process can in theory modify all of that memory without doing any explicit memory allocations, and all of this forked memory is counted towards the Committed_AS value. As a result, Committed_AS may not reflect your real system memory usage at all.

TL;DR: Committed_AS is an estimate of the allocated virtual memory that is not backed by any filesystem, i.e. the maximum amount of memory that would in theory have to be backed by real storage (RAM + swap) to keep the currently running processes running if nothing in the whole system allocated any more memory.

However, if the system is communicating with the outside world, even incoming IP packets can cause more memory to be used, so you cannot make any guarantees about future system behavior based on this number. Also note that stack memory is always allocated on the fly, so even if none of your processes fork() or make explicit memory allocations, your memory usage (Committed_AS) may still increase when processes use more stack space.

In my experience Committed_AS is only really meaningful when compared to previous runs with similar workloads. However, if Committed_AS is less than your MemTotal, you can be pretty sure that the system is under very light memory pressure relative to the available hardware.

  • I'll have a closer look at the file-backed memory use. This may be it. About your explanations of malloc() vs. actual use: I am very much aware of that. My point, even back when we had overcommit disabled, was to be able to tell what amount of memory is actually available for an HPC job that will be the only non-system thing running on the machine (again: not about usual server workloads). I want to diagnose and tell the user: you actually used 28% of memory explicitly, but your program(s) allocated 80%. Cgroup statistics may be the solution, but not the simple system-wide meminfo.
    – drhpc
    Commented Sep 24, 2021 at 10:48
  • So, in my first example, you are saying that the actual memory use is Mlocked + AnonPages + Shmem = 0.5G and Committed_AS is 1.5G, right? So that is a normal picture that I wouldn't question: 1.5G allocated, but 33% of that actually used right now. So is the answer actually that Committed_AS is good, but MemTotal-MemAvailable is just wildly wrong as an estimate of what's being used? MemAvailable seems to err on the side of caution, not promising more than what is possibly there (said spikes, perhaps)?
    – drhpc
    Commented Sep 24, 2021 at 11:11
  • I already noted in my monitoring scripts that MemAvailable does not count things like SReclaimable as being available. I'm revisiting this topic now and maybe will update the question with things I learned in between. I do not really get the point of your answer, though. Are you saying that counting the file-backed pages causes me to think Committed_AS is too low? Most of your answer seems to explain why Committed_AS should be higher than expected from actual use, not lower.
    – drhpc
    Commented Sep 24, 2021 at 11:28
  • As I wrote earlier, you cannot sum the Rss fields in smaps over multiple processes, because Rss pages can be shared via the copy-on-write (COW) mechanism. The Pss field should be used instead if you're computing a sum over multiple processes. Commented Sep 27, 2021 at 9:28
  • (Btw.: With Linux 4.14.246, I just wrote 1G of zeros to /dev/shm and that got counted in Committed_AS right away.)
    – drhpc
    Commented Sep 28, 2021 at 10:29

In my experience, Committed_AS has been more accurate than MemAvailable. Especially with highly spiky workloads, MemAvailable seems to behave more like some kind of average than a true value over short time periods.

That said, I don't remember using data from Committed_AS with kernels older than version 4.15 so I don't know if historical behavior was different.

Both Committed_AS and MemAvailable are officially kernel-level heuristics, so neither should be trusted as hard fact.

For the workloads that I usually run, I typically start to experience performance problems when Committed_AS exceeds about 150% of the real amount of RAM. However, that obviously depends heavily on your workload. If you have lots of leaky processes and enough swap, your Committed_AS may keep climbing without performance issues as processes leak RAM and the kernel moves the leaked areas to swap. Note that in such cases Committed_AS can end up much higher than total RAM + swap without any problems.

I wouldn't disable memory overcommit unless you're running a hard realtime system, and such a system probably shouldn't use any swap either. I personally always run with /proc/sys/vm/overcommit_memory set to 1.

If you can provide enough swap, it usually makes sense to increase /proc/sys/vm/watermark_scale_factor and /proc/sys/vm/watermark_boost_factor to avoid latency caused by swapping. However, it's important to understand that Committed_AS is the currently committed memory (memory requested by user-mode processes but usually not fully used), and having RAM + swap cover that should handle all cases where no process allocates any new memory. Unless you're running some very exotic system, multiple processes are constantly allocating new memory, so you shouldn't make too strict an estimate about the future behavior of the system. And if your workload is highly spiky, the current numbers tell you very little about the future behavior of the system.
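For what it's worth, those settings are applied like this (a sketch; the numeric values are only examples to show the knobs, not tuned recommendations):

# Sketch of the tuning mentioned above; persist via /etc/sysctl.d/ if it helps.
sysctl -w vm.overcommit_memory=1          # always overcommit, never refuse an allocation
sysctl -w vm.watermark_scale_factor=200   # wake background reclaim earlier (kernel >= 4.6)
sysctl -w vm.watermark_boost_factor=30000 # reclaim more aggressively after fragmentation (kernel >= 5.0)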

With modern systems, I'd focus on statistics that capture the highest short-term peaks in memory use. I'd guess that a well-made statistics program would monitor kernel events via /sys/fs/cgroup/memory/cgroup.event_control and collect statistics at the moments of highest memory pressure. I don't know of any statistics application that actually supports that, though. Any statistics app that only collects data at wall-clock-defined sample intervals is going to miss the majority of the short-term spikes in RAM usage. For mathematically correct sample averages a wall-clock sample interval is a requirement, but understanding the spikes is more important than having accurate averages, because it's those spikes that kill your processes/performance, not the averages.
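As a poor man's substitute (my sketch; it uses the cgroup v1 peak counter memory.max_usage_in_bytes rather than cgroup.event_control, and the Slurm-style path below is only an assumed example), you can at least read the recorded peak of a job's memory cgroup instead of polling the instantaneous usage:

# Assumed cgroup v1 path for a job; adjust to whatever your jobs run under.
cg=/sys/fs/cgroup/memory/slurm/uid_1000/job_12345

# Peak usage (bytes) recorded by the kernel since the last reset; this
# catches spikes that wall-clock polling of usage_in_bytes would miss.
cat "$cg/memory.max_usage_in_bytes"

# Writing 0 resets the recorded peak (cgroup v1), so the next sample starts fresh.
echo 0 > "$cg/memory.max_usage_in_bytes"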

  • Thanks for your time … and, well, time has passed and things with newer kernels could be different. We run the jobs inside cgroups via the Slurm scheduler now, and will at some point beat it into giving us proper access to the cgroups for reporting memory usage (it deletes the cgroups before our custom reporting can access them; we will probably have to either patch Slurm or add a plugin that stores the cgroup stats). But apart from that, the original question still stands about how bad Committed_AS can be as an estimate. I mean, a lot smaller than the actual current usage?
    – drhpc
    Commented Sep 13, 2021 at 14:10
  • I think that Committed_AS can be used as a heuristic, but it seems to update with some delay, and short memory usage spikes may be missed even if you constantly poll the value. When memory load is high for long enough, it has appeared pretty accurate in my experience. Commented Sep 14, 2021 at 19:06
  • Sadly, not in my experience … see the examples I posted. Nothing spiky about the memory usage. I consistently saw clearly too low Committed_AS on differing systems also with fairly constant load. I don't see much interest (on the respective kernel list, even) in clearing up that picture. Hoping for proper stats from cgroups.
    – drhpc
    Commented Sep 22, 2021 at 15:29
  • How do you know you don't have memory usage spikes? Even if you sample /proc/meminfo every second, the spikes may last less than a second, and any such spike could even cause OOM. Commented Sep 23, 2021 at 9:26
  • Of course I don't really know that I don't have usage spikes. I just know that I checked various systems that either are mostly idle or have a rather constant load of HPC jobs. My main point is, though: how can the spikes in actual usage be higher than those in allocated/promised memory? The occupied memory should be the slower variable, as it really takes time to touch 10G of RAM. Even if these are heuristics, it is odd to produce them in an obviously inconsistent manner. I got a bucket of 10 litres and fill it with 16 litres of water. Rough estimate …
    – drhpc
    Commented Sep 24, 2021 at 10:39
