We build a system that's intended to be on all the time: it collects data and displays graphs of it. If we leave it running long enough without changing anything, we end up with an oom-killer event. That kills our main process (it has the highest OOM score) and our software gets restarted.

Basics: The system is CentOS 6, kernel is 2.6.32.26. The system has 2 GB of RAM and 4 GB of swap. The application is written in C++ with Qt 3.

I've set up a cron job to grab the contents of /proc/meminfo and /proc/slabinfo every minute. Here are the traces I find most interesting from the meminfo data (the most recent oom-killer event is on the right side of the graph):

[Figure: meminfo]

Note SUnreclaim grows until the oom-killer hits. The change in slope on SUnreclaim is where I switched displays.
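For anyone who wants to pull the same trace out of their own snapshots, here is a minimal sketch of the idea. It assumes each per-minute capture of /proc/meminfo is kept as text along with its timestamp; the function names and the `(timestamp, text)` sample layout are my own illustration, not what our cron job actually does.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def growth_per_hour(samples, field="SUnreclaim"):
    """Average growth of one field between the first and last sample.

    samples: list of (unix_time_seconds, meminfo_text), oldest first
    (a hypothetical layout chosen for this sketch).
    Returns the change per hour, in the field's own unit (kB for SUnreclaim).
    """
    (t0, first), (t1, last) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    delta = parse_meminfo(last)[field] - parse_meminfo(first)[field]
    return delta * 3600.0 / (t1 - t0)


if __name__ == "__main__":
    # Two made-up snapshots taken 30 minutes apart.
    samples = [
        (0, "MemTotal: 2000000 kB\nSUnreclaim: 1000 kB\n"),
        (1800, "MemTotal: 2000000 kB\nSUnreclaim: 1600 kB\n"),
    ]
    print(growth_per_hour(samples))  # 1200.0 (kB per hour)
```

A steady positive rate on SUnreclaim while the processes' RSS stays flat is what the graph above shows.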

Here are some interesting traces from the slabinfo data:

[Figure: slabinfo sizes]

What this looks like to me is that something's leaking or fragmenting. Whatever it is does seem to get cleaned up when my processes die, but I honestly have no idea what's going on here.
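One way to narrow this down is to compare two slabinfo snapshots and rank the caches by how much they grew. The sketch below assumes the slabinfo 2.x column layout (`name active_objs num_objs objsize ...`) and approximates each cache's footprint as `num_objs * objsize`, which ignores per-slab overhead; the function names are hypothetical.

```python
def parse_slabinfo(text):
    """Map slab cache name -> approximate size in bytes (num_objs * objsize).

    Assumes the slabinfo 2.x layout:
        name active_objs num_objs objsize objperslab pagesperslab : ...
    """
    sizes = {}
    for line in text.splitlines():
        if line.startswith("slabinfo") or line.startswith("#"):
            continue  # version banner and column header
        cols = line.split()
        if len(cols) < 4 or not (cols[2].isdigit() and cols[3].isdigit()):
            continue  # skip anything that does not look like a cache row
        sizes[cols[0]] = int(cols[2]) * int(cols[3])
    return sizes


def top_growers(old_text, new_text, n=5):
    """Return up to n (cache_name, growth_in_bytes) pairs, largest growth first."""
    old = parse_slabinfo(old_text)
    new = parse_slabinfo(new_text)
    # Caches missing from the old snapshot are treated as having had size 0.
    deltas = [(new[name] - old.get(name, 0), name) for name in new]
    deltas.sort(reverse=True)
    return [(name, delta) for delta, name in deltas[:n]]


if __name__ == "__main__":
    # Made-up rows for illustration only.
    header = "slabinfo - version: 2.1\n# name <active_objs> <num_objs> <objsize>\n"
    old = header + "dentry 100 100 192 20 1 : tunables 0 0 0 : slabdata 5 5 0\n" \
                   "size-64 50 50 64 59 1 : tunables 0 0 0 : slabdata 1 1 0\n"
    new = header + "dentry 100 110 192 20 1 : tunables 0 0 0 : slabdata 6 6 0\n" \
                   "size-64 50 1050 64 59 1 : tunables 0 0 0 : slabdata 18 18 0\n"
    for name, delta in top_growers(old, new):
        print(name, delta)
```

Whichever cache tops this list across many snapshots is the first place to look, although a generic cache name does not by itself say which kernel code path is allocating from it.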

How do I figure out what's leaking?

Updated: Early on in this process, I started with ps output (not shown here). All of our processes' RSS values ramp up quickly to their 'normal' level and then stay put. If this were a process running away with normal memory, I wouldn't need assistance. Instead, something we're doing is causing unswappable memory to be allocated.

As to the upgrade suggestion: The codebase has a lot of dependencies on old libraries, and I can't make a transition to even a 3 series kernel right now.

1 Answer


You've asked two questions.

1) If the OOM Killer runs + you see no swapping, this likely relates to your vm.swappiness setting. Try setting it to 1. On your antiquated + highly hackable kernel (shudder), setting it to 0 (as I recall) disables swapping completely, which likely isn't what you're after.

2) Determining your leaking program might be as easy as running ps auxww repeatedly looking for constantly increasing RSS values or some other metric.
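A rough sketch of that check, assuming you've saved the text of several `ps auxww` runs in order (oldest first) and that the output uses the usual column layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND). The function names are made up for illustration.

```python
def parse_ps_aux(text):
    """Map PID -> (rss_kb, command) from the text of one `ps auxww` run."""
    procs = {}
    for line in text.splitlines():
        cols = line.split(None, 10)  # keep the full command as the 11th column
        if len(cols) < 11 or not cols[1].isdigit() or not cols[5].isdigit():
            continue  # header line or anything that is not a process row
        procs[int(cols[1])] = (int(cols[5]), cols[10])
    return procs


def steadily_growing(snapshots):
    """Find processes whose RSS rose between every consecutive pair of snapshots.

    snapshots: list of `ps auxww` output strings, oldest first.
    Returns (pid, command, total_growth_kb) tuples, largest growth first.
    """
    parsed = [parse_ps_aux(s) for s in snapshots]
    if len(parsed) < 2:
        return []
    result = []
    for pid, (_, cmd) in parsed[-1].items():
        if not all(pid in p for p in parsed):
            continue  # process was not alive for the whole period
        values = [p[pid][0] for p in parsed]
        if all(a < b for a, b in zip(values, values[1:])):
            result.append((pid, cmd, values[-1] - values[0]))
    return sorted(result, key=lambda r: r[2], reverse=True)
```

Keep in mind this only sees per-process memory. Kernel slab growth (SUnreclaim in /proc/meminfo) is not counted in any process's RSS, so a clean result here doesn't rule out a kernel-side leak.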

All this said...

Your Kernel is very old. PHP is capped at 5.3 (highly hackable). OpenSSL is buggy. Many related libraries are old + may be the source of memory leaks.

It's likely best to upgrade to a recent distro. A simple upgrade may install more recent code which addresses your memory leakage.

  • The kernel IS old, but there's no web stuff installed or in play. The only traffic in and out is SSH and samba, and the SSL libs have been updated. (CentOS 6 is still getting updates). I will look at the latest 2.6 kernel from CentOS, but I can't transition to even a 3 series kernel, due to dependencies and re-test requirements. Commented Oct 26, 2017 at 13:27
  • If you continually use an old Kernel, you may be getting hit by a hacker exploiting a zero day too. I'd suggest you set up iptables rules to log + drop all outgoing SSH + SMTP + UDP 443, then allow SSH via exception rules. If you see questionable iptables logging, then you've been hacked. If you have been hacked, then leaving all the iptables rules in play may cause the BotNet to stop using your machine. Old Kernels being hacked can produce some very odd behavior. Commented Nov 1, 2017 at 13:05
  • The systems affected live only on our internal networks, and the failure is reproducible even if the system is disconnected from the network. Also, 'hackers did it' is up there in my list of things to never assume, along with 'the compiler has a bug'. Commented Nov 1, 2017 at 15:18
  • Switching to the latest CentOS 6 kernel DID in fact make the problem go away. So I have a fix, even if I don't yet understand the problem completely. Further tests on our code (to try to isolate what is triggering the fail) will hopefully give me some clues. Commented Nov 1, 2017 at 15:20
  • Running very old Kernels means you're running bugs which have been fixed for years. For example, running CentOS 6, means you've automatically started with a very old Kernel + very old software packages, so you're running code which, in many cases (like Apache + PHP), has had years of fixes, that you're missing, by installing an old OS. Better to start with latest code + nights when you have insomnia, you can scrape through years of Changelogs trying to figure out which change... to which package... might have fixed your problem. Commented Nov 7, 2017 at 14:35
