After a few hours of working correctly, our ProLiant server stops computing and the System Health LED 12 starts blinking, which according to the documentation ( http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c01706108-8.pdf ) is the sign of a "Critical system failure detected (processor, memory, regulator, thermal event, fan, NMI)" (page 96).
SSH access is then lost. We can reboot and get SSH back (I am not on site), but I don't know what to check after that. Is there a log file where I could find some information?
I found this guide: http://denis.herve.free.fr/trsfrt/HProliant.pdf but it seems overwhelming to me.
My colleague suggests it could be a RAM + swap overload that makes the whole server crash. I don't really agree with him: as far as I know, a memory issue wouldn't lead to a critical system failure. Any thoughts on this point?
I am wondering whether this could be related to my previous post: Linux server swapping before memory is completely full.
We are on Ubuntu 14.04.
PS: the server is in a basement, and there may be a bit of water condensation in the morning...
EDIT: Following @Hennes's remark, we moved the server back to the living room. But after a night of computation, it was blinking with the red light again :-(
Now I am trying to get my head around the log files.
We rebooted the server this morning around 09:44.
Here are the files recently changed:
What should I search for, and where, to get some error information?
I tried:
romain@pl:/var/log$ cat syslog | grep error
Dec 27 12:00:23 pl kernel: [ 1.053210] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 27 12:00:23 pl kernel: [ 6.740763] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [ 6.741967] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [ 7.082169] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [ 7.112776] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [ 9.905224] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Dec 27 11:52:18 pl kernel: [ 1.053048] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 27 11:52:18 pl kernel: [ 6.364768] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [ 6.365903] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [ 6.684685] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [ 6.686080] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [ 11.211120] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Dec 28 09:46:55 pl kernel: [ 1.051638] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 28 09:46:55 pl kernel: [ 6.348693] ata3.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [ 6.349786] ata3.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [ 6.699099] ata4.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [ 6.731027] ata4.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [ 8.959211] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
and :
romain@pl:/var/log$ cat dmesg | grep error
[ 1.051638] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
[ 6.348693] ata3.00: failed to enable AA (error_mask=0x1)
[ 6.349786] ata3.00: failed to enable AA (error_mask=0x1)
[ 6.699099] ata4.00: failed to enable AA (error_mask=0x1)
[ 6.731027] ata4.00: failed to enable AA (error_mask=0x1)
[ 8.959211] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
-> Here I don't really understand what the values in the first column, like [ 6.731027], are: is it the number of seconds since boot?
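For reference, kernel log lines are normally stamped with the time elapsed since the kernel started, in seconds with a fractional part, so [ 6.731027] would be about 6.7 seconds after boot. Below is a minimal sketch for turning such a value into a wall-clock time, assuming that interpretation holds on this machine and that GNU date is available. The function name ts_to_epoch is made up for illustration.

```shell
# ts_to_epoch BOOT_EPOCH "[ 6.731027]"
# Hypothetical helper: adds the whole seconds of a kernel timestamp
# to the boot time (given as Unix epoch seconds) and prints the result.
ts_to_epoch() {
    boot_epoch=$1
    # Keep only digits and the decimal point, e.g. "[ 6.731027]" -> "6.731027"
    secs=$(printf '%s\n' "$2" | sed 's/[^0-9.]//g')
    # Drop the fractional part; shell arithmetic is integer-only
    echo $(( boot_epoch + ${secs%.*} ))
}

# Usage for the currently running boot (assumes /proc/uptime and GNU date):
#   boot_epoch=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
#   date -d "@$(ts_to_epoch "$boot_epoch" '[ 6.731027]')"
```

The boot time computed this way only applies to the current boot. For lines from earlier boots, the syslog prefix (for example "Dec 28 09:46:55") already gives the wall-clock time at which the line was written.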
I checked
romain@pl:/var/log$ cat syslog | grep memory
Dec 27 12:00:23 pl kernel: [ 0.000000] Scanning 1 areas for low memory corruption
Dec 27 12:00:23 pl kernel: [ 0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
[...]
Dec 27 12:00:23 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x100000000-0x61fffffff]
Dec 27 12:00:23 pl kernel: [ 0.000000] Early memory node ranges
Dec 27 12:00:23 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 27 12:00:23 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 27 12:00:23 pl kernel: [ 0.019764] Initializing cgroup subsys memory
Dec 27 12:00:23 pl kernel: [ 0.019992] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 27 12:00:23 pl kernel: [ 0.971501] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 27 12:00:23 pl kernel: [ 0.972518] Scanning for low memory corruption every 60 seconds
Dec 27 12:00:23 pl kernel: [ 6.154807] memory memory67: hash matches
Dec 27 12:00:23 pl kernel: [ 6.205519] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 27 12:00:23 pl kernel: [ 6.234958] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 27 12:00:23 pl kernel: [ 6.254602] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 27 12:00:23 pl kernel: [ 9.739558] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 27 12:00:32 pl kernel: [ 20.152332] cgroup: docker-runc (2183) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 27 12:00:32 pl kernel: [ 20.152335] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Dec 27 11:52:18 pl kernel: [ 0.000000] Scanning 1 areas for low memory corruption
Dec 27 11:52:18 pl kernel: [ 0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
Dec 27 11:52:18 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[...]
Dec 27 11:52:18 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x100000000-0x61fffffff]
Dec 27 11:52:18 pl kernel: [ 0.000000] Early memory node ranges
Dec 27 11:52:18 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 27 11:52:18 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 27 11:52:18 pl kernel: [ 0.019779] Initializing cgroup subsys memory
Dec 27 11:52:18 pl kernel: [ 0.020005] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 27 11:52:18 pl kernel: [ 0.970708] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 27 11:52:18 pl kernel: [ 0.971734] Scanning for low memory corruption every 60 seconds
Dec 27 11:52:18 pl kernel: [ 5.854654] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 27 11:52:18 pl kernel: [ 5.883624] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 27 11:52:18 pl kernel: [ 5.902731] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 27 11:52:18 pl kernel: [ 10.983190] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 27 11:52:25 pl kernel: [ 19.933483] cgroup: docker-runc (2140) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 27 11:52:25 pl kernel: [ 19.933486] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Dec 28 09:46:55 pl kernel: [ 0.000000] Scanning 1 areas for low memory corruption
Dec 28 09:46:55 pl kernel: [ 0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
Dec 28 09:46:55 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[...]
Dec 28 09:46:55 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x100000000-0x51fffffff]
Dec 28 09:46:55 pl kernel: [ 0.000000] Early memory node ranges
Dec 28 09:46:55 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 28 09:46:55 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 28 09:46:55 pl kernel: [ 0.020007] Initializing cgroup subsys memory
Dec 28 09:46:55 pl kernel: [ 0.020233] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 28 09:46:55 pl kernel: [ 0.970821] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 28 09:46:55 pl kernel: [ 0.971834] Scanning for low memory corruption every 60 seconds
Dec 28 09:46:55 pl kernel: [ 5.824432] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 28 09:46:55 pl kernel: [ 5.853109] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 28 09:46:55 pl kernel: [ 5.871990] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 28 09:46:55 pl kernel: [ 8.826997] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 28 09:47:04 pl kernel: [ 19.154325] cgroup: docker-runc (2171) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 28 09:47:04 pl kernel: [ 19.154328] cgroup: "memory" requires setting use_hierarchy to 1 on the root
I also searched the syslog file for 'fan', 'nmi' and 'critical', without any output.
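One caveat with these searches: grep is case-sensitive by default, so a search for 'nmi' or 'error' would miss lines containing 'NMI' or 'Error'. Below is a minimal sketch of a case-insensitive search for several keywords at once. The function name search_logs, the keyword list and the file names in the usage line are assumptions for illustration, not something taken from the HP guide.

```shell
# search_logs PATTERN FILE...
# Hypothetical helper: case-insensitive extended-regex search,
# printing the file name and line number of every match.
search_logs() {
    pattern=$1
    shift
    # -i ignore case, -E extended regex (allows a|b),
    # -H show file name, -n show line number
    grep -i -E -H -n -e "$pattern" "$@"
}

# Usage (file names are assumptions; syslog.1 would be the previous rotated log):
#   search_logs 'fan|nmi|critical|thermal|machine check' /var/log/syslog /var/log/syslog.1
```

Including the rotated file matters here because the messages written just before a crash may sit in the older log rather than in the one started after the reboot.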
I remember some Stack Overflow questions where people copied and pasted whole files to an external log-file website (I can't remember the name). I am ready to put the files online if someone is interested.
Any hint on which keywords to search for, and where, is welcome.
We use the server with Docker and RStudio Server on top for ML computations. I really doubt that this kind of usage is the source of the issue, but in IT you never know, so I mention it ;)
Thanks for any ideas.