I have a Debian Linux server with 4x 1Gb NICs, 2 onboard and 2 PCI. All NICs are configured into a single mode 4 (802.3ad/LACP) bonded interface: bond0. I have two switches, a Cisco SG300-28 and a Cisco SG300-10, referred to below as A and B respectively.
The two switches are connected to each other by an LACP LAG on 2 switch ports, both links listed as active.

All ports on both switches appear to be configured as trunk ports. This appears to be the default as I did a factory reset before testing this. Would that make any difference? There is only a single default VLAN at this stage.

The server has 2 NICs into switch A and 2 NICs into switch B (one onboard NIC and one PCI). The Linux bonding driver is clever in this respect as it works out an Aggregator ID for each interface and pairs them up by switch, so only the links into one switch are ever active even though all 4 might be up.
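
For anyone wanting to check this grouping on their own server, here is a minimal sketch of reading it out of the bonding driver's status file. It assumes the usual layout of /proc/net/bonding/bond0 in 802.3ad mode (an "Active Aggregator Info" section followed by one "Slave Interface" section per NIC). The SAMPLE text, interface names and IDs are made up for illustration and are not output from my server.

```python
from collections import defaultdict

# Illustrative excerpt only: interface names and aggregator IDs are invented.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Active Aggregator Info:
\tAggregator ID: 1
\tNumber of ports: 2

Slave Interface: eth0
MII Status: up
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Aggregator ID: 2

Slave Interface: eth2
MII Status: up
Aggregator ID: 1

Slave Interface: eth3
MII Status: up
Aggregator ID: 2
"""


def parse_aggregators(text):
    """Return (active_aggregator_id, {aggregator_id: [slave names]})."""
    active = None
    groups = defaultdict(list)
    current_slave = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current_slave = line.split(":", 1)[1].strip()
        elif line.startswith("Aggregator ID:"):
            agg_id = int(line.split(":", 1)[1])
            if current_slave is None:
                # Seen before any slave section: this is the active aggregator.
                active = agg_id
            else:
                groups[agg_id].append(current_slave)
    return active, dict(groups)


active, groups = parse_aggregators(SAMPLE)
for agg_id, slaves in sorted(groups.items()):
    state = "active" if agg_id == active else "standby"
    print(f"aggregator {agg_id} ({state}): {', '.join(slaves)}")
```

On a real host the same function could be fed the contents of /proc/net/bonding/bond0 instead of SAMPLE.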

I have a workstation that I am testing this from, currently connected to switch A.


                   ----------------
        ===========| Server bond0 |===========
        ||         ----------------         ||
        ||                                  ||
        ||                                  ||
  ----------------                  ----------------
  |   Switch A   |=======LAG========|   Switch B   |
  ----------------                  ----------------
         |
         |
         |
    workstation

Initially the server reports it's using the Aggregator ID associated with Switch A. I get a solid ping from the workstation.

When I disconnect the 2 cables from the server to Switch A, Linux switches the active Aggregator to the NICs connected to Switch B, and the ping continues and remains stable. The active path at this point is workstation -> Switch A -> inter-switch LAG -> Switch B -> Server.

When I reconnect the cables from the server into Switch A, Linux keeps the active Aggregator ID as it is, so it is still using Switch B.

Pings from workstation start being dropped as follows:

[screenshot of ping output; image not preserved]

As soon as I disconnect Switch A again, ping returns to being solid.

So it fails when the ping originates from a port on Switch A and goes to the server via Switch B, but only when the server's links to Switch A are up but not active in the OS.
This is repeatable.

I have run tcpdump on both the server and the workstation. I can see ALL pings being transmitted from the workstation but only some of them getting a reply, as in the ping output above. The tcpdump on the server suggests the missing pings never make it that far, so they are being dropped somewhere in the switching.

If I reverse this broken setup to the other switch, plugging the workstation into Switch B so that the traffic path is workstation -> Switch B -> inter-switch LAG -> Switch A -> Server, then it works fine.

I did think this might be some sort of STP issue with a port getting blocked, but the drops are too frequent for that, with almost every other pair of pings being dropped. I checked the logs on the switches and it didn't look like any ports were being blocked on either one.

I have also tried replacing the inter-switch LAG with a single connection, non-LAG/LACP.
I have confirmed that the LACP settings match on all sides.

As a full-time sysadmin and only a networking part-timer/amateur, to me this points to some sort of difference in configuration between the switches. But I don't know which parts of the config to check for differences. They are running different firmware versions, and note that these are the small business SG300 series, so they are not running full IOS but do have what looks like a reasonably featured CLI.

My limited networking knowledge tells me it's something like an ARP issue. The server should only be presenting the MAC address to the active pair/switch. The dropped pings are possibly trying to be routed to the non-active switch/pair.
But how could I prove that with these switches?
I would have expected longer runs of successful and failed pings though.
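
To put a number on "longer runs", a rough sketch like the following could be used: record for each ICMP sequence number whether a reply came back, then collapse that into run lengths. The list of results below is invented for illustration and is not my real trace.

```python
from itertools import groupby


def run_lengths(replied):
    """Collapse a list of booleans (reply seen or not) into (value, length) runs."""
    return [(ok, sum(1 for _ in grp)) for ok, grp in groupby(replied)]


# Hypothetical per-sequence results: True = reply received, False = dropped.
results = [True, True, False, False, True, True, False, False, True]

for ok, length in run_lengths(results):
    print(f"{'ok  ' if ok else 'drop'} x{length}")
```

Short alternating runs of a couple of packets each would fit what I am seeing. Long runs of tens of seconds would look more like a timer-driven event such as STP reconvergence.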

My next step is to do some tcpdumps to look at the ARPs and LACPDUs to see if there's a sort of "storm" going on causing the traffic to switch between switches every couple of seconds. Though from the Linux perspective, there's no change in active Aggregator ID corresponding with the failed pings.

Does anyone else have any suggestions of what else to look at here?

EDIT: Adding RSTP status for the port-channels...


SwitchA#sh spanning-tree
Spanning tree enabled mode RSTP
Default port cost method:  long

  Root ID    Priority    32768
             Address     0c:f5:a4:c2:0e:bf
             This switch is the root
             Hello Time  2 sec  Max Age 20 sec  Forward Delay 15 sec

  Number of topology changes 12 last change occurred 20:18:37 ago
  Times:  hold 1, topology change 35, notification 2
          hello 2, max age 20, forward delay 15

Interfaces
  Name     State   Prio.Nbr    Cost    Sts   Role PortFast       Type        
--------- -------- --------- -------- ------ ---- -------- -----------------
...
   Po1    enabled  128.1000   20000    Frw   Desg   Yes       P2P (RSTP)     
   Po2    enabled  128.1001   20000    Dsbl  Dsbl    No            -         
   Po3    enabled  128.1002   20000    Dsbl  Dsbl    No            -         
   Po4    enabled  128.1003   20000    Dsbl  Dsbl    No            -         
   Po5    enabled  128.1004   20000    Dsbl  Dsbl    No            -         
   Po6    enabled  128.1005   20000    Dsbl  Dsbl    No            -         
   Po7    enabled  128.1006   20000    Dsbl  Dsbl    No            -         
   Po8    enabled  128.1007   20000    Frw   Desg    No       P2P (RSTP)     


SwitchB#sh spanning-tree
Spanning tree enabled mode RSTP
Default port cost method:  long
Loopback guard:   Disabled

  Root ID    Priority    32768
             Address     0c:f5:a4:c2:0e:bf
             Cost        20000
             Port        Po8
             Hello Time  2 sec  Max Age 20 sec  Forward Delay 15 sec
  Bridge ID  Priority    32768
             Address     1c:de:a7:75:1a:4b
             Hello Time  2 sec  Max Age 20 sec  Forward Delay 15 sec

  Number of topology changes 5 last change occurred 20:26:02 ago
  Times:  hold 1, topology change 35, notification 2
          hello 2, max age 20, forward delay 15

Interfaces
  Name     State   Prio.Nbr    Cost    Sts   Role PortFast       Type        
--------- -------- --------- -------- ------ ---- -------- ----------------- 
...
   Po1    enabled  128.1000   20000    Frw   Desg   Yes       P2P (RSTP)     
   Po2    enabled  128.1001   20000    Dsbl  Dsbl    No            -         
   Po3    enabled  128.1002   20000    Dsbl  Dsbl    No            -         
   Po4    enabled  128.1003   20000    Dsbl  Dsbl    No            -         
   Po5    enabled  128.1004   20000    Dsbl  Dsbl    No            -         
   Po6    enabled  128.1005   20000    Dsbl  Dsbl    No            -         
   Po7    enabled  128.1006   20000    Dsbl  Dsbl    No            -         
   Po8    enabled  128.1007   20000    Frw   Root    No       P2P (RSTP)     

Po8 is the inter-switch LAG and Po1 is the server LAG, in both cases.
No topology changes recorded on either switch while I've got things in their broken state (pings dropping).

3 Answers

This problem happens because Linux bonding in 802.3ad mode sets all slave interfaces to the same hardware MAC address ("borrowed" from the first enslaved interface). LACP PDUs are then transmitted using this same MAC address, so each of the switches thinks it has a direct connection to the server. At the same time, bonding drops packets that are received on its inactive aggregator. Most probably it is this issue: https://sourceforge.net/p/bonding/discussion/77913/thread/520c70a8/#2b21

UPDATE: There's a workaround in current kernels: https://github.com/opencomputeproject/OpenNetworkLinux/blob/master/packages/base/any/kernels/3.2.65-1%2Bdeb7u2/patches/network-bonding-lacp-fix-incorrect-mux-state.patch

  • Just wanted to say thank you for the info - that completely makes sense and is what I suspected, but I was hoping the behaviour wasn't that. That workaround looks interesting. Do you have any info as to whether that's been incorporated upstream? Looks like this might do the job... github.com/torvalds/linux/commit/…
    – batfastad
    Commented Oct 5, 2016 at 20:27

This is not a supported configuration. The SG line does not support multi-chassis LACP. You might be able to simulate this using two LACP bonds (one to each switch) and then bonding those together in a simple bond.

  • Oh ok. Why would having a LAG from one switch to another be any different to having a LAG from a server to a switch? I'm not looking for the LACP from the server to span across multiple switches. The Linux bonding driver is deciding which side is active. It's odd that it works one way around but not the other. Could there be a difference between the SG300-28 and the SG300-10?
    – batfastad
    Commented Jan 12, 2016 at 22:46
  • That's not how LACP works. All member links must belong to the same switch aggregation group. The SG line does not support multi-chassis link aggregation. Your configuration is not valid.
    – Ricky
    Commented Jan 13, 2016 at 8:25
  • Hmm ok, thanks for the info. I was considering a setup similar to this... unix.stackexchange.com/a/172232/30008 The only difference being I have an additional port-channel between the switches. The Linux bond driver does not allow you to bond bonds, otherwise I could create an LACP bond on each pair of interfaces going into the separate switches, then do another bond using mode active-backup on that.
    – batfastad
    Commented Jan 13, 2016 at 9:47
  • I still don't understand why this works perfectly when the Linux server's active aggregator is Switch A (SG300-28) with its links to Switch B still connected and the pings originating from a workstation on Switch B. The pings are correctly sent from Workstation -> Switch B -> inter-switch LAG -> Switch A -> Server... completely ignoring the fact that the server also has a connection into Switch B but it's passive (no LACPDUs) at the OS level. If it was unsupported I would expect ping drops this way around as well.
    – batfastad
    Commented Jan 13, 2016 at 9:50
  • How the switches are connected is irrelevant. This "works" because interfaces are added one at a time, so link 1 is added and comes up - because it's the only link. 2 is added and comes up because it matched 1. 3 and 4 are added, but do not come up because they aren't part of the same LAG as they're connected to a different switch.
    – Ricky
    Commented Jan 14, 2016 at 3:45

If possible (not sure how configurable those switches are) change the channel load-balance. Example: channel load-balance src-dst-mac

Other options: {dst-ip | dst-mac | src-dst-ip | src-dst-mac | src-ip | src-mac}

It looks to me like the lag algorithm is alternating to the links that have come up/up on the switch but aren't active on your server.
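
As a rough illustration of what such a load-balance policy does, here is a toy sketch of a src-dst-mac style selection. The actual hash used by these switches is not something I know, so the XOR-and-modulo function, the MAC addresses and the link count below are all assumptions made for the example.

```python
def mac_to_int(mac):
    """Convert a colon-separated MAC address string to an integer."""
    return int(mac.replace(":", ""), 16)


def pick_link(src_mac, dst_mac, num_links):
    """Toy src-dst-mac policy: XOR both addresses, then modulo the link count.

    This is a simplified stand-in, not the real switch algorithm.
    """
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % num_links


# Hypothetical addresses for a workstation and a server.
workstation = "00:00:00:00:00:01"
server = "00:00:00:00:00:02"

print(pick_link(workstation, server, 2))
```

Which header fields feed the hash (MAC only, IP only, or both) decides which flows can end up on different member links, which is why trying another policy is a cheap test.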

  • Thanks for the info. Yes the behaviour seems similar to that. Just checked and both switches had port-channel load-balance src-dst-mac. So not sure why the behaviour would be different between switches. The other option available on these switches is src-dst-mac-ip so will try that instead.
    – batfastad
    Commented Jan 12, 2016 at 16:56
  • Just tried src-dst-mac-ip and no luck, still same behaviour.
    – batfastad
    Commented Jan 12, 2016 at 17:11
  • It was a shot in the dark. =P Just thought I would bring it up since I had a very similar issue with a NAS a while back, and that was the cause.
    – Helkas
    Commented Jan 12, 2016 at 17:11
  • No problem - all suggestions welcome :D
    – batfastad
    Commented Jan 12, 2016 at 17:20
  • Everything is up and up on all interfaces? Am I correct to assume each switch has 2 separate LACP ether channels? (1 to the server, 1 to the opposite switch)
    – Helkas
    Commented Jan 12, 2016 at 17:26
