
After a few hours of working correctly, our ProLiant server stops computing, with System Health LED 12 blinking. According to the documentation ( http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c01706108-8.pdf ), this is the sign of a "Critical system failure detected (processor, memory, regulator, thermal event, fan, NMI)" (page 96).

SSH access is then lost. We can reboot and get SSH back (I am not on site), but I don't know what to check afterwards. Is there a log file where I can find some information?

I found this guide: http://denis.herve.free.fr/trsfrt/HProliant.pdf but it seems overwhelming to me.

My colleague suggests it could be a RAM + swap overload that makes the whole server crash. I don't really agree with him: as far as I know, a memory issue wouldn't lead to a critical system failure. Any ideas on this point?

I am wondering whether this could be related to my previous post: Linux server swapping before memory is completely full.

We are on Ubuntu 14.04.

PS: the server is in a basement; there may be a bit of condensation in the morning...

EDIT: Following @Hennes' remark, we moved the server back to the living room. But after a night of computation, the red light was blinking again :-(

Now I am trying to get my head around the log files. We rebooted the server this morning around 09:44. Here are the recently changed files: [image]

What should I search for, and where, to get some error information?

I tried :

romain@pl:/var/log$ cat syslog | grep error
Dec 27 12:00:23 pl kernel: [    1.053210] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 27 12:00:23 pl kernel: [    6.740763] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [    6.741967] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [    7.082169] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [    7.112776] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [    9.905224] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Dec 27 11:52:18 pl kernel: [    1.053048] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 27 11:52:18 pl kernel: [    6.364768] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [    6.365903] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [    6.684685] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [    6.686080] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [   11.211120] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Dec 28 09:46:55 pl kernel: [    1.051638] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 28 09:46:55 pl kernel: [    6.348693] ata3.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [    6.349786] ata3.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [    6.699099] ata4.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [    6.731027] ata4.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [    8.959211] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro

and :

romain@pl:/var/log$ cat dmesg | grep error
[    1.051638] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
[    6.348693] ata3.00: failed to enable AA (error_mask=0x1)
[    6.349786] ata3.00: failed to enable AA (error_mask=0x1)
[    6.699099] ata4.00: failed to enable AA (error_mask=0x1)
[    6.731027] ata4.00: failed to enable AA (error_mask=0x1)
[    8.959211] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro

-> Here I don't really understand what the values in the first column, like [ 6.731027], are: is it the number of seconds since boot?
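For reference, the bracketed prefix on kernel messages is the time in seconds since the kernel started booting. A minimal sketch of how such a line could be split and mapped to an approximate wall-clock time is shown here; the function names `split_kernel_line` and `wall_clock` and the boot time used in the demo are assumptions made for illustration only.

```python
import re
from datetime import datetime, timedelta

# Matches the bracketed prefix on a kernel message, e.g. "[    6.731027] ".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def split_kernel_line(line):
    """Return (seconds_since_boot, message), or None if there is no timestamp."""
    match = STAMP.match(line)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def wall_clock(boot_time, seconds_since_boot):
    """Approximate wall-clock time of a kernel message, given the boot time."""
    return boot_time + timedelta(seconds=seconds_since_boot)


if __name__ == "__main__":
    line = "[    6.731027] ata4.00: failed to enable AA (error_mask=0x1)"
    seconds, message = split_kernel_line(line)
    # Hypothetical boot time, chosen only for this demo.
    boot = datetime(2016, 12, 28, 9, 46, 48)
    print(wall_clock(boot, seconds), message)
```

The result is only approximate, since the real boot time has to come from elsewhere (for example the syslog timestamp of the first kernel line).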

I checked

romain@pl:/var/log$ cat syslog | grep memory
Dec 27 12:00:23 pl kernel: [    0.000000] Scanning 1 areas for low memory corruption
Dec 27 12:00:23 pl kernel: [    0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
[...]
Dec 27 12:00:23 pl kernel: [    0.000000] init_memory_mapping: [mem 0x100000000-0x61fffffff]
Dec 27 12:00:23 pl kernel: [    0.000000] Early memory node ranges
Dec 27 12:00:23 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 27 12:00:23 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 27 12:00:23 pl kernel: [    0.019764] Initializing cgroup subsys memory
Dec 27 12:00:23 pl kernel: [    0.019992] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 27 12:00:23 pl kernel: [    0.971501] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 27 12:00:23 pl kernel: [    0.972518] Scanning for low memory corruption every 60 seconds
Dec 27 12:00:23 pl kernel: [    6.154807] memory memory67: hash matches
Dec 27 12:00:23 pl kernel: [    6.205519] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 27 12:00:23 pl kernel: [    6.234958] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 27 12:00:23 pl kernel: [    6.254602] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 27 12:00:23 pl kernel: [    9.739558] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 27 12:00:32 pl kernel: [   20.152332] cgroup: docker-runc (2183) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 27 12:00:32 pl kernel: [   20.152335] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Dec 27 11:52:18 pl kernel: [    0.000000] Scanning 1 areas for low memory corruption
Dec 27 11:52:18 pl kernel: [    0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
Dec 27 11:52:18 pl kernel: [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[...]
Dec 27 11:52:18 pl kernel: [    0.000000] init_memory_mapping: [mem 0x100000000-0x61fffffff]
Dec 27 11:52:18 pl kernel: [    0.000000] Early memory node ranges
Dec 27 11:52:18 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 27 11:52:18 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 27 11:52:18 pl kernel: [    0.019779] Initializing cgroup subsys memory
Dec 27 11:52:18 pl kernel: [    0.020005] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 27 11:52:18 pl kernel: [    0.970708] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 27 11:52:18 pl kernel: [    0.971734] Scanning for low memory corruption every 60 seconds
Dec 27 11:52:18 pl kernel: [    5.854654] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 27 11:52:18 pl kernel: [    5.883624] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 27 11:52:18 pl kernel: [    5.902731] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 27 11:52:18 pl kernel: [   10.983190] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 27 11:52:25 pl kernel: [   19.933483] cgroup: docker-runc (2140) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 27 11:52:25 pl kernel: [   19.933486] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Dec 28 09:46:55 pl kernel: [    0.000000] Scanning 1 areas for low memory corruption
Dec 28 09:46:55 pl kernel: [    0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
Dec 28 09:46:55 pl kernel: [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[...]
Dec 28 09:46:55 pl kernel: [    0.000000] init_memory_mapping: [mem 0x100000000-0x51fffffff]
Dec 28 09:46:55 pl kernel: [    0.000000] Early memory node ranges
Dec 28 09:46:55 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 28 09:46:55 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 28 09:46:55 pl kernel: [    0.020007] Initializing cgroup subsys memory
Dec 28 09:46:55 pl kernel: [    0.020233] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 28 09:46:55 pl kernel: [    0.970821] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 28 09:46:55 pl kernel: [    0.971834] Scanning for low memory corruption every 60 seconds
Dec 28 09:46:55 pl kernel: [    5.824432] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 28 09:46:55 pl kernel: [    5.853109] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 28 09:46:55 pl kernel: [    5.871990] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 28 09:46:55 pl kernel: [    8.826997] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 28 09:47:04 pl kernel: [   19.154325] cgroup: docker-runc (2171) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 28 09:47:04 pl kernel: [   19.154328] cgroup: "memory" requires setting use_hierarchy to 1 on the root

I also checked for 'fan', 'nmi', 'critical' in the syslog file, without any output.
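Since `grep error` is case-sensitive, a message logged as "NMI" or "Thermal" would be missed by a lowercase search. The following is a rough sketch of a case-insensitive scan for several keywords at once; the keyword list and the function name `find_matches` are assumptions for illustration, not an official list of what the kernel logs for this hardware.

```python
import re

# Keywords loosely based on the LED description quoted in the question,
# plus a few generic ones; this list is an assumption and can be extended.
KEYWORDS = ["error", "fail", "critical", "nmi", "fan", "thermal", "mce", "edac"]


def find_matches(lines, keywords=KEYWORDS):
    """Return the lines containing any keyword, ignoring case."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__":
    # Path as used in the question; rotated logs (syslog.1, ...) may matter too.
    with open("/var/log/syslog", errors="replace") as log:
        for hit in find_matches(log):
            print(hit)
```

Keywords are matched as plain substrings, so short ones such as "fan" can produce false positives that need to be filtered by eye.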

I remember some Stack Overflow questions where people were copy/pasting whole files to an external log-hosting website (I can't remember the name). I am ready to put the files online if someone is interested.

Any hint on where to search, and for which keyword, is welcome.
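One angle that does not depend on guessing a keyword is to look at the last lines written before each reboot, since they are the closest thing to a record of the crash. Here is a sketch under the assumption that every boot starts with a kernel line stamped `[    0.000000]`, as in the excerpts above; the function name `lines_before_boots` is hypothetical.

```python
import re

# Kernel timestamp as it appears in syslog lines, e.g. "kernel: [   19.933486]".
KERNEL_STAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def lines_before_boots(lines, context=5):
    """Return, for each detected reboot, the `context` lines preceding it.

    A reboot is detected when the kernel timestamp drops back to 0.0 after
    having been positive. The first boot in the file is skipped because
    nothing useful precedes it.
    """
    results = []
    previous = None
    for index, line in enumerate(lines):
        match = KERNEL_STAMP.search(line)
        if match is None:
            continue
        value = float(match.group(1))
        if value == 0.0 and previous is not None and previous > 0.0:
            results.append(lines[max(0, index - context):index])
        previous = value
    return results


if __name__ == "__main__":
    with open("/var/log/syslog", errors="replace") as log:
        all_lines = [line.rstrip("\n") for line in log]
    for block in lines_before_boots(all_lines):
        print("--- last lines before a reboot ---")
        print("\n".join(block))
```

If the machine froze on a hardware fault, these lines may show nothing unusual, which would itself suggest that the OS never got the chance to log the event.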

We use the server with Docker and RStudio Server on top for ML computation. I really doubt that this kind of usage could be the source of the issue, but in IT you never know, so I mention it ;)

Thanks for any idea.

  • Heh, you had me. I did not realise this was a joke post until I read the last sentence.
    – Hennes
    Commented Dec 27, 2016 at 10:27
  • :-/ unfortunately it is not a joke... Or maybe it is indeed ... Commented Dec 27, 2016 at 10:40
  • Ubuntu keeps some log files in the /var/log folder. The dmesg log file has kernel output. You might find something in there about failing hardware. Also, you can ssh into the server and use the htop program to watch ram and swap to make sure it is not getting high.
    – Hydraxan14
    Commented Dec 27, 2016 at 19:16
  • I edited the question: even with the server in a safer room, it crashes again :( and I haven't yet found the right keyword to search for in the log files :-/ Commented Dec 28, 2016 at 9:49
  • Check the available disk space. Is the computation job disk-intensive?
    – harrymc
    Commented Dec 29, 2016 at 19:52

1 Answer


Assuming your system is the ML150 G6 that the documentation you linked mentions, let me strongly urge you to set up and use the Lights Out-100 management functionality on the system.

A basic how-to can be found here. Once you've gotten access into the Lights Out-100 management (I'd recommend using the web interface until you're more familiar with what LO100 offers and how you use it), then see especially pages 28-32 of that same doc; it shows how you can see in real-time sensors and event information for your system. Often, if a hardware issue is causing resets, it will be listed in the System Event Log, and finding it there will give you some insight into what's going on with your machine. The System Event Log should be capturing its data whether you've ever touched LO100 or not, so once you can get in there it should have something interesting to tell you.

Much of the same information can be fetched through your running OS, either via /var/log/messages (which you've already tried without much success) or through HP's Insight tools, which are available to install for some Linux flavors (see http://downloads.linux.hp.com/SDR/project/mcp/ for one good starting point to getting some of these tools). Unfortunately not all events are visible in the system logs, as they're hardware specific and the HP agents, not the kernel proper, are what instrument them.

Having said that, you may also see if you have mcelog installed and running; it can catch some hardware events, and usually logs something in the message log when it catches an event. It also usually either logs the event information to a separate log, or keeps it in memory so you can query it with the mcelog command. It's worth looking for mcelog in your messages log, or to look if you have a recently updated /var/log/mcelog file.
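As a rough illustration of the last check, the sketch below tests whether a log file exists and was modified recently. The helper name `modified_within` and the 24-hour window are assumptions; the path `/var/log/mcelog` is the one mentioned above and may not exist if mcelog is not installed or is configured differently.

```python
import os
import time


def modified_within(path, hours, now=None):
    """Return True if `path` exists and was modified in the last `hours` hours."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= hours * 3600


if __name__ == "__main__":
    path = "/var/log/mcelog"
    if not os.path.exists(path):
        print(path, "not found: mcelog may not be installed or may log elsewhere")
    elif modified_within(path, 24):
        print(path, "changed in the last 24 hours: worth reading")
    else:
        print(path, "exists but has not changed in the last 24 hours")
```

A recently updated file is only a hint that an event was recorded; the contents still need to be read to see what mcelog actually caught.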

  • Thanks for these hints. It is indeed an ML150 G6 server; I will have a look at all that. Commented Jan 2, 2017 at 11:47
  • You are more than welcome :-) even if I have to say I didn't really check the software side. The server is running fine now and I have no clue what happened. But we did put it back in a dry bedroom... maybe it just had to dry out :-/ Commented Jan 6, 2017 at 22:08
