
I have a Debian Linux box (Debian Squeeze) that deadlocks every few hours when I run a Python script that sniffs an interface...

The stack trace is attached to the bottom of this question. Essentially, I have a Broadcom ethernet interface (bnx2 driver) that seems to die when I start a sniffing session and then it tries to transmit a frame out the same interface.

From what I can tell, a kernel watchdog timer is tripping...

NETDEV WATCHDOG: eth3 (bnx2): transmit queue 0 timed out

I think there is a way to control watchdog timers with ioctl (ref: EmbeddedFreak: How to use linux watchdog).

Questions (Original):

How can I find which watchdog timer(s) control eth3? Bonus points if you can tell me how to change the timer or even disable the watchdog...

Questions (Revised):

How can I prevent the ethernet watchdog timer from causing problems?


Stack trace

Apr 30 08:38:44 Hotcoffee kernel: [275460.837147] ------------[ cut here ]------------
Apr 30 08:38:44 Hotcoffee kernel: [275460.837166] WARNING: at /build/buildd-linux-2.6_2.6.32-41squeeze2-amd64-NDo8b7/linux-2.6-2.6.32/debian/build/source_amd64_none/net/sched/sch_generic.c:261 dev_watchdog+0xe2/0x194()
Apr 30 08:38:44 Hotcoffee kernel: [275460.837169] Hardware name: PowerEdge R710
Apr 30 08:38:44 Hotcoffee kernel: [275460.837171] NETDEV WATCHDOG: eth3 (bnx2): transmit queue 0 timed out
Apr 30 08:38:44 Hotcoffee kernel: [275460.837172] Modules linked in: 8021q garp stp parport_pc ppdev lp parport pci_stub vboxpci vboxnetadp vboxnetflt vboxdrv ext2 loop psmouse power_meter button dcdbas evdev pcspkr processor serio_raw ext4 mbcache jbd2 crc16 sg sr_mod cdrom ses ata_generic sd_mod usbhid hid crc_t10dif enclosure uhci_hcd ehci_hcd megaraid_sas ata_piix thermal libata usbcore nls_base scsi_mod bnx2 thermal_sys [last unloaded: scsi_wait_scan]
Apr 30 08:38:44 Hotcoffee kernel: [275460.837202] Pid: 0, comm: swapper Not tainted 2.6.32-5-amd64 #1
Apr 30 08:38:44 Hotcoffee kernel: [275460.837204] Call Trace:
Apr 30 08:38:44 Hotcoffee kernel: [275460.837206]  <IRQ>  [<ffffffff81263086>] ? dev_watchdog+0xe2/0x194
Apr 30 08:38:44 Hotcoffee kernel: [275460.837211]  [<ffffffff81263086>] ? dev_watchdog+0xe2/0x194
Apr 30 08:38:44 Hotcoffee kernel: [275460.837217]  [<ffffffff8104df9c>] ? warn_slowpath_common+0x77/0xa3
Apr 30 08:38:44 Hotcoffee kernel: [275460.837220]  [<ffffffff81262fa4>] ? dev_watchdog+0x0/0x194
Apr 30 08:38:44 Hotcoffee kernel: [275460.837223]  [<ffffffff8104e024>] ? warn_slowpath_fmt+0x51/0x59
Apr 30 08:38:44 Hotcoffee kernel: [275460.837228]  [<ffffffff8104a4ba>] ? try_to_wake_up+0x289/0x29b
Apr 30 08:38:44 Hotcoffee kernel: [275460.837231]  [<ffffffff81262f78>] ? netif_tx_lock+0x3d/0x69
Apr 30 08:38:44 Hotcoffee kernel: [275460.837237]  [<ffffffff8124dda3>] ? netdev_drivername+0x3b/0x40
Apr 30 08:38:44 Hotcoffee kernel: [275460.837240]  [<ffffffff81263086>] ? dev_watchdog+0xe2/0x194
Apr 30 08:38:44 Hotcoffee kernel: [275460.837242]  [<ffffffff8103fa2a>] ? __wake_up+0x30/0x44
Apr 30 08:38:44 Hotcoffee kernel: [275460.837249]  [<ffffffff8105a71b>] ? run_timer_softirq+0x1c9/0x268
Apr 30 08:38:44 Hotcoffee kernel: [275460.837252]  [<ffffffff81053dc7>] ? __do_softirq+0xdd/0x1a6
Apr 30 08:38:44 Hotcoffee kernel: [275460.837257]  [<ffffffff8102462a>] ? lapic_next_event+0x18/0x1d
Apr 30 08:38:44 Hotcoffee kernel: [275460.837262]  [<ffffffff81011cac>] ? call_softirq+0x1c/0x30
Apr 30 08:38:44 Hotcoffee kernel: [275460.837265]  [<ffffffff8101322b>] ? do_softirq+0x3f/0x7c
Apr 30 08:38:44 Hotcoffee kernel: [275460.837267]  [<ffffffff81053c37>] ? irq_exit+0x36/0x76
Apr 30 08:38:44 Hotcoffee kernel: [275460.837270]  [<ffffffff810250f8>] ? smp_apic_timer_interrupt+0x87/0x95
Apr 30 08:38:44 Hotcoffee kernel: [275460.837273]  [<ffffffff81011673>] ? apic_timer_interrupt+0x13/0x20
Apr 30 08:38:44 Hotcoffee kernel: [275460.837274]  <EOI>  [<ffffffffa01bc509>] ? acpi_idle_enter_bm+0x27d/0x2af [processor]
Apr 30 08:38:44 Hotcoffee kernel: [275460.837283]  [<ffffffffa01bc502>] ? acpi_idle_enter_bm+0x276/0x2af [processor]
Apr 30 08:38:44 Hotcoffee kernel: [275460.837289]  [<ffffffff8123a0ba>] ? cpuidle_idle_call+0x94/0xee
Apr 30 08:38:44 Hotcoffee kernel: [275460.837293]  [<ffffffff8100fe97>] ? cpu_idle+0xa2/0xda
Apr 30 08:38:44 Hotcoffee kernel: [275460.837297]  [<ffffffff8151c140>] ? early_idt_handler+0x0/0x71
Apr 30 08:38:44 Hotcoffee kernel: [275460.837301]  [<ffffffff8151ccdd>] ? start_kernel+0x3dc/0x3e8
Apr 30 08:38:44 Hotcoffee kernel: [275460.837304]  [<ffffffff8151c3b7>] ? x86_64_start_kernel+0xf9/0x106
Apr 30 08:38:44 Hotcoffee kernel: [275460.837306] ---[ end trace 92c65e52c9e327ec ]---
  • What is your MTU?
    – Nils
    Commented May 1, 2012 at 20:11
  • How did you know to ask? I manually set it to 9000 on this interface before running the sniff; just before the script finishes, I reset it to 1500. In fact, after disabling the sniffer function in the script, I saw another deadlock when I ran sudo ip link set mtu 1500 dev eth3 in the script (as it was finishing). Do you have some thoughts about changing MTU on the interface? Commented May 1, 2012 at 20:19
  • @Nils, it is very possible that this is a PAE kernel... the processor is a Dual-CPU Quad core x86-64 Commented May 1, 2012 at 20:25
  • Interesting. It seems Linux and OpenBSD have more in common than I thought.
    – Nils
    Commented May 1, 2012 at 20:31
  • BTW - why do you change the MTU - are you sniffing a portmirror in trunk mode?
    – Nils
    Commented May 1, 2012 at 20:34

2 Answers


I have read a similar story from GeNUA. Their workaround was to restart the network driver (OpenBSD). On Linux this would translate to: ifdown eth3 && rmmod bnx2 && modprobe bnx2 && ifup eth3.
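If you end up applying this workaround repeatedly (e.g. from cron, per the "every 30 minutes" suggestion in the comments), a small Python helper could wrap the reload sequence. This is just a sketch: `reload_nic_driver` and its `dry_run` flag are illustrative names I made up, not part of any existing tool, and the real commands need root.

```python
# Hypothetical helper to bounce an interface and reload its driver module.
# Uses the standard Debian ifdown/ifup tools plus rmmod/modprobe; must run
# as root for the commands to actually succeed.
from subprocess import call

def reload_nic_driver(iface='eth3', module='bnx2', dry_run=False):
    """Build (and optionally run) the driver-reload command sequence."""
    cmds = [
        ['ifdown', iface],
        ['rmmod', module],
        ['modprobe', module],
        ['ifup', iface],
    ]
    if not dry_run:
        for cmd in cmds:
            call(cmd)   # run each step in order; ignore per-step errors
    return cmds
```

With `dry_run=True` you can inspect the exact command sequence before wiring it into cron.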

The core problem was an internal coding problem with pointers on a PAE system in conjunction with the broadcom-driver.

  • When exactly are you suggesting that I run those commands? Only after I change the MTU? Commented May 1, 2012 at 20:23
  • @MikePennington I changed the link in my answer to the English version. Read it... I think you should change it every 30 minutes.
    – Nils
    Commented May 1, 2012 at 20:28
  • I need to run this in production for a few days before I can accept... if this works, I will award a bounty too. This has been kicking my butt for two weeks Commented May 1, 2012 at 20:29
  • Presumably I should not see this issue if my interface MTU is default (1500), right? I removed the code that modified my MTU, but I'm still seeing the deadlocks Commented May 2, 2012 at 15:55
  • Are all your interfaces of the same type? Look at them with ethtool -g; perhaps you can raise the receive or transmit buffers to avoid this problem.
    – Nils
    Commented May 2, 2012 at 20:58

Commenting out my code that called ethtool to modify the NIC buffers stopped watchdog timers from tripping on the bnx2 card.

I still want to find an answer to the original question about watchdog timers, but I will ask that in a separate question.

from subprocess import Popen, PIPE
import time

def _linux_buffer_alloc(iface=None, rx_ring_buffers=768,
        netdev_max_backlog=30000):

    default_rx = 255
    default_rx_jumbo = 0
    default_netdev_max_backlog = 1000
    ## Set linux rx ring buffers (to prevent tcpdump 'dropped by intf' msg)
    ## FIXME: removing for now due to systematic deadlocks with the bnx2 driver
    #    sample: ethtool -G eth3 rx 768
    #    cmd = 'ethtool -G %s rx %s' % (iface, rx_ring_buffers)
    #    p = Popen(cmd.split(' '), stdout=PIPE)
    #    p.communicate(); time.sleep(0.15)
    #    sample: ethtool -G eth3 rx-jumbo 0
    #    cmd = 'ethtool -G %s rx-jumbo %s' % (iface, default_rx_jumbo)
    #    p = Popen(cmd.split(' '), stdout=PIPE)
    #    p.communicate(); time.sleep(0.15)
    ## /FIXME
