
I'd like to debug an issue I'm having with a Linux (Debian stable) server, but I'm running out of ideas for how to confirm any diagnosis.

Some background: the servers are DL160-class machines with hardware RAID across two disks. They run a lot of services, mostly using the network interface and CPU. There are 8 CPUs, and the 7 "main" (most CPU-hungry) processes are each bound to their own core via CPU affinity; other random background scripts are not forced anywhere. The filesystem writes ~1.5k blocks/s the whole time (going above 2k/s at peak). Normal CPU usage for these servers is ~60% on 7 cores, with minimal usage on the last one (usually whatever's running in shells).
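
For reference, the per-core pinning is set up roughly like this (taskset here stands in for however it's actually wired into the init scripts; the service name, PID and core numbers are placeholders):

# Check which core each "main" process is currently allowed to run on
# ("main_service" is a placeholder name, not one of the real services):
for pid in $(pgrep -f main_service); do
    taskset -cp "$pid"          # prints "pid N's current affinity list: X"
done

# Pinning itself is done like one of these (core/PID values are examples):
taskset -c 3 /usr/local/bin/main_service     # start a process bound to core 3
taskset -cp 3 12345                          # or re-bind an already-running PID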

What actually happens is that at some point the "main" services start using 100% CPU, mostly stuck in kernel time. After a couple of seconds the load average goes over 400 and we lose any way to connect to the box (KVM is on its way, but not there yet). Sometimes we see the kernel report a hung task (but not always):

[118951.272884] INFO: task zsh:15911 blocked for more than 120 seconds.
[118951.272955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[118951.273037] zsh           D 0000000000000000     0 15911      1
[118951.273093]  ffff8101898c3c48 0000000000000046 0000000000000000 ffffffffa0155e0a
[118951.273183]  ffff8101a753a080 ffff81021f1c5570 ffff8101a753a308 000000051f0fd740
[118951.273274]  0000000000000246 0000000000000000 00000000ffffffbd 0000000000000001
[118951.273335] Call Trace:
[118951.273424]  [<ffffffffa0155e0a>] :ext3:__ext3_journal_dirty_metadata+0x1e/0x46
[118951.273510]  [<ffffffff804294f6>] schedule_timeout+0x1e/0xad
[118951.273563]  [<ffffffff8027577c>] __pagevec_free+0x21/0x2e
[118951.273613]  [<ffffffff80428b0b>] wait_for_common+0xcf/0x13a
[118951.273692]  [<ffffffff8022c168>] default_wake_function+0x0/0xe
....

This would point at a RAID/disk failure; however, sometimes the tasks are hung on the kernel's gettsc instead, which would suggest some more general weird hardware behaviour.

It's also running MySQL (almost read-only, 99% cache hit rate), which seems to spawn many more threads while the system is having problems. During the day it does ~200k queries/s (selects) and ~10 queries/s (writes).
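
(The thread counts and query rates can be sampled from MySQL's status counters; a quick sketch, assuming a local mysql client with credentials already configured:)

# Thread counters -- Threads_running spiking alongside the CPU jump is the interesting bit
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_%'"

# Rough overall queries/second: sample the Questions counter twice, 10 seconds apart
q1=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
sleep 10
q2=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
echo "approx qps: $(( (q2 - q1) / 10 ))"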

The host never runs out of memory or swaps, and no OOM reports have been spotted.
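
(For the record, the sort of checks behind that statement; the log path is the Debian default and the exact commands are illustrative:)

# No OOM killer hits in the kernel ring buffer or the kernel log
dmesg | grep -Ei 'oom|out of memory'
grep -Ei 'oom|out of memory' /var/log/kern.log

# si/so (swap-in/swap-out) columns stay at 0 across samples
vmstat 5 5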

We've got many boxes with similar/identical hardware and they all seem to behave this way, but I'm not sure which part is failing, so it's probably not a good idea to just grab something more powerful and hope the problem goes away.

Applications themselves don't really report anything wrong while they're running, and I can run anything safely on the same hardware in an isolated environment. What can I do to narrow down the problem? Where else should I look for an explanation?

  • Does it recover from this on its own, or does it require a reboot? Either way, if you have sysfs mounted and can read /sys/block/[drivedevice]/driver/ioerr_cnt, it will give you the number of errors talking to that drive since bootup (note: my driver throws a bunch of commands at the drive during boot to see what works, so I start with 0x8). If it's low, it's not a failing drive. If it's high, it could be a failing drive, or a failure somewhere else that's making the drive inaccessible.
    – DerfK
    Commented Jan 5, 2011 at 18:34
  • The server dies as far as I know and has to be restarted manually. ioerr_cnt prints out 0 now, but since the hosts died today, I'll keep checking it.
    – viraptor
    Commented Jan 5, 2011 at 19:51
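
For reference, a small sketch of sweeping that counter across all drives (on the kernels I've seen the attribute is exposed at /sys/block/*/device/ioerr_cnt rather than .../driver/; the value is reported in hex):

# Print the I/O error counter for every block device that exposes one
# (a freshly booted drive may start slightly above 0, as DerfK notes)
for f in /sys/block/*/device/ioerr_cnt; do
    [ -r "$f" ] || continue
    printf '%s %s\n' "$f:" "$(cat "$f")"
done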

1 Answer


DL160? Do you have iLO on the machine? From there you can remote-control the box and do restarts, power it up, or power it down. You might need the Advanced license, though. iLO runs on separate hardware from the main system board, so it should always be reachable as long as the server has a power cord plugged in. iLO also lets you trigger an NMI reset of the host, as well as capture the last fatal crash, allowing for limited study.
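
If you go the NMI route, it's worth telling the kernel to panic on an unknown NMI, so the injected NMI actually leaves something behind to analyse. A minimal sketch, using the standard x86 sysctls (names as in mainline; adjust for your kernel):

# Panic when an NMI with no known source arrives (e.g. one injected from iLO),
# instead of just logging an "unknown reason" message and carrying on
sysctl -w kernel.unknown_nmi_panic=1

# Optionally reboot automatically 30 seconds after a panic
sysctl -w kernel.panic=30

# To make both stick across reboots, add to /etc/sysctl.conf:
#   kernel.unknown_nmi_panic = 1
#   kernel.panic = 30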

Have you also tried "burning in" the server with a MemTest86+ run for about 8 hours (assuming you can afford that much downtime)? Memory errors on Linux sometimes manifest in really funny ways. That hung-task report you have references a memory function (__pagevec_free()), which might suggest a bad memory cell that is accessed very infrequently, hence the long gap between crashes.

Have you also checked that your BIOS is fully updated from HP?
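
Checking the currently flashed version without rebooting is quick; a small sketch, assuming dmidecode is installed:

# What the firmware reports about itself, to compare against HP's current release
dmidecode -s bios-version
dmidecode -s bios-release-date
dmidecode -t bios        # fuller detail (ROM size, characteristics, etc.)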

Beyond that, compile your own kernel with all the debugging symbols enabled, and look up a few of the HOWTOs on using KGDB to debug a kernel crash. There are some tricks you may be able to use to trap the kernel when it crashes, then use KGDB to look at the backtrace and perhaps hunt down an offending userland program or further identify your hardware fault.
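
A rough sketch of the pieces involved (config option names as in mainline; the exact set depends on your kernel version, and the serial parameters are just examples):

# Kernel config options to enable (under "Kernel hacking" in make menuconfig):
#   CONFIG_DEBUG_INFO=y            # build with debugging symbols
#   CONFIG_FRAME_POINTER=y         # more reliable backtraces
#   CONFIG_MAGIC_SYSRQ=y           # SysRq, used to break into the debugger
#   CONFIG_KGDB=y
#   CONFIG_KGDB_SERIAL_CONSOLE=y   # KGDB over a serial port

# Boot parameters to hook KGDB to the first serial port and wait for the debugger:
#   kgdboc=ttyS0,115200 kgdbwait

# Then, from another machine connected to that serial line:
#   gdb ./vmlinux
#   (gdb) target remote /dev/ttyS0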

  • Got iLO connected, but during the hang I get the same kernel errors as mentioned in the question. I've tried stress and iozone so far (at the same time, too) without triggering the problem. The BIOS should be up to date, but I'll double-check. Memtest might be a good idea... I'll start it up at some point.
    – viraptor
    Commented Jan 13, 2011 at 9:48
  • MemTest will typically turn up anything funny; the hard part is just waiting around for it to do so. Being a DL160, you're on an Intel chipset, so I don't think you need the memory inserted in pairs... though I might be wrong. The rule of thumb is: if you added any memory not stamped with an HP P/N, test that by itself first, then test the HP sticks one by one (or pair by pair), and let MemTest do at least one full pass through all of its tests. Also, use MemTest86+, not MemTest86; they're two separate programs, and I've had better luck with the plus version.
    – Kumba
    Commented Jan 13, 2011 at 22:53

