Timeline for What can cause ALL services on a server to go down, yet still responding to ping? and how to figure out
Current License: CC BY-SA 3.0
13 events
when | what | action | by | comment | license |
---|---|---|---|---|---|
Oct 22, 2012 at 15:42 | comment | added | matteo | My server is a dedicated server at OVH; they have a thing they call Manager on their site which remotely monitors your server. That's where I see the memory-usage graph I'm talking about (which, of course, stopped collecting data when the server stalled) and from where I hard-rebooted it. | |
Oct 22, 2012 at 15:36 | comment | added | matteo | @ewwhite if by "OOM issue" you mean "many processes like ssh, httpd and the like having been killed by OOM-killer", no, it is most surely not an OOM issue, but if by "OOM issue" you mean "my server ran out of memory", then I do have the absolute certainty that it was an OOM issue, because I saw the graph where there was a spike in RAM usage from normal levels to 100% and in swap from almost 0 to about 80% within a matter of minutes (an "instant" in the precision of the graph) | |
Oct 22, 2012 at 12:22 | comment | added | ewwhite | @matteo I see no indication that this is an OOM issue. Typically, the OOM-killer will pick specific processes that meet certain criteria, but it wouldn't always kill a daemon like ssh. This is definitely on the I/O side. You didn't explain your hardware situation/specs as I requested in my answer. | |
Oct 22, 2012 at 8:33 | comment | added | matteo | So, I guess OOM-killer never kicked in at all. On one side I'm curious about why (why didn't it kill any process when there was a sudden peak of ram usage to 100% and swap to 80%), but that's just curiosity. On the other hand, is there something I could do so that, next time this happens, I will be able (after the crash and reboot) to figure out WHAT process ate up so much memory? | |
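One low-tech answer to the question above (how to identify the culprit process after a crash and reboot) is to log the top memory consumers periodically, so the last entries written before the crash name the offender. A minimal sketch using GNU ps; the log path and the cron schedule shown are illustrative assumptions, not anything from the thread:

```shell
# Hypothetical cron entry: * * * * * /usr/local/bin/memtop.sh
# Appends a timestamped snapshot of the five biggest processes by RSS.
LOG=/tmp/memtop.log   # on a real server, something like /var/log/memtop.log
{
  date
  ps -eo pid,rss,comm --sort=-rss | head -n 6
  echo
} >> "$LOG"
```

After a crash, the tail of the log shows which processes were growing just before the server stalled. Tools like sar (sysstat) or atop provide the same history with less plumbing.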
Oct 22, 2012 at 6:25 | comment | added | matteo | @JonathanCallen: thank you, but "grep -i kill /var/messages*" returned nothing. | |
Oct 21, 2012 at 21:56 | comment | added | Coops | @matteo - more details of finding the log entry here: stackoverflow.com/questions/624857/… | |
Oct 21, 2012 at 18:16 | comment | added | Jonathan Callen | @matteo The log message would appear as "Out of Memory: Killed process [PID] [process name]", so grepping for "oom" or "killer" wouldn't find it. | |
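To illustrate Jonathan Callen's point concretely: the pattern below matches the kernel's wording, tried here against a hypothetical sample line (the PID and process name are made up). On the real server you would run the same grep against /var/log/messages* and the dmesg output.

```shell
# Hypothetical log line mimicking the kernel's OOM-killer message format:
line='Oct 21 13:00:01 host kernel: Out of memory: Killed process 1234 (httpd)'

# Case-insensitive search; note that grepping for "oom" or "killer"
# would NOT match this line, but "out of memory" or "killed process" does.
echo "$line" | grep -iE 'out of memory|killed process'
```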
Oct 21, 2012 at 16:34 | comment | added | matteo | No sign of oom-killer in either dmesg or messages, btw - at least I grepped (case-insensitively) for both "oom" and "killer". | |
Oct 21, 2012 at 14:13 | comment | added | DerfK | @matteo Linux has what it calls "overcommit": just because you malloc() 1 GB of RAM doesn't mean you're actually going to use it, so the memory manager keeps track of both how much memory your program thinks it has and how much it has actually used. This works well most of the time - at least until more than one program actually wants to use all of the 1 GB it thinks it has. | |
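The overcommit behaviour DerfK describes is tunable. Matteo's wish (in the comment below) that the kernel would simply refuse allocations it cannot back corresponds to strict accounting, mode 2. A sketch for inspecting the policy; the values noted in the comments are common defaults, not guaranteed on every distro:

```shell
# 0 = heuristic overcommit (typical default), 1 = always grant, 2 = strict
cat /proc/sys/vm/overcommit_memory
# Used only in mode 2: commit limit = swap + overcommit_ratio% of RAM
cat /proc/sys/vm/overcommit_ratio
# To enable strict accounting (root required; persist in /etc/sysctl.conf):
#   sysctl vm.overcommit_memory=2
```

With mode 2, malloc() starts failing before the system is exhausted, so a runaway process gets an error instead of triggering the OOM-killer, at the cost of some wasted memory headroom.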
Oct 21, 2012 at 13:53 | comment | added | matteo | refuse to allocate memory to programs asking for it when there's not enough ram for the system to keep working correctly... I mean a buggy or even malicious program should never be able to destroy the whole system... | |
Oct 21, 2012 at 13:53 | comment | added | matteo | Thanks a lot, I'm almost sure this is the problem, as both the RAM and the swap were full prior to the server failure (I can see this in OVH's Manager stats), and it's probably some of my crazy PHP scripts using a lot of memory. It does puzzle me, however, for a couple of reasons: (1) it looks like the memory eaten up by PHP is not freed afterwards, but that wouldn't make sense; (2) in any case, I wouldn't expect a proper operating system to die completely just because of one (or even a few) processes using too much memory... I would expect it to | |
Oct 21, 2012 at 13:45 | vote | accept | matteo | | |
Oct 21, 2012 at 13:06 | history | answered | Coops | | CC BY-SA 3.0 |