
I am currently setting up a two-node storage network with dual-port Mellanox 10 Gbit cards, connected through one MikroTik CRS326-24S+2Q+RM switch. I have configured bonding in 802.3ad mode with layer 3/4 hashing on each storage node. When I connect the two storage nodes directly, everything works correctly: iperf reaches the theoretical 20 Gbit/s without any retransmissions.

However, when I connect the nodes through the switch, I frequently see retransmissions and limited throughput. Sometimes I reach 20 Gbit/s without any packet loss; other times it cannot reach 20 Gbit/s and iperf shows a lot of packet loss. The issue is intermittent, occurring about half the time during testing. On the switch, a bond is configured over the two ports each storage node is connected to, matching the bonding configuration of the node.

Network diagram: https://pasteboard.co/uNcXSUU5siM9.png

Bonding config of each storage node: https://pasteboard.co/vtvpEy0nwDXQ.png

Bonding config on the switch for one connected storage node (ports 5/6 in this config): https://pasteboard.co/4j1n2wg4Wo9H.png

Iperf3 results when connected to the switch:

Does someone have an explanation for this behavior?

Thank you for your help

  • For a device to be on-topic here, the manufacturer must offer optional, paid support. Unfortunately, MikroTik does not offer that. Also, hosts/servers are off-topic here. You could try to ask this question on Server Fault for a business network.
    – Ron Maupin
    Commented Feb 28 at 16:02

1 Answer


While your switch is off topic here (optional, paid vendor support is required for a device to be on topic, see the help center), that effect can be observed regardless of the switch's quality.

The basic message: link aggregation does not create an interface with a multiple of its member interfaces' speeds. Traffic is not distributed evenly, by load, or in any similar dynamic fashion.

Instead, traffic is distributed statelessly by hashing a subset of source interface index, source/destination MAC addresses, source/destination IP addresses, source/destination L4 port numbers, and possibly other fields. You've stated "layer 3/4 hashing", so the switch's egress port is determined by the IP addresses and the TCP ports used. When the TCP source port is ephemeral/random, the egress port is equally random: there's a 50% chance that two streams received on the ingress LAG are distributed to the same physical egress port, exceeding its capacity and causing dropped frames. So, what you see is to be expected.
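
To make that 50% figure concrete, here is a minimal Python sketch. The hash below is only an illustrative stand-in (the CRS326 and the Linux bonding driver each use their own formula), and the addresses and server port 5201 are made-up values; what matters is that a fixed 5-tuple always maps to the same member while the ephemeral client port varies per stream:

```python
import random

LAG_MEMBERS = 2  # two 10 Gbit members per bond

def l3l4_hash(src_ip, dst_ip, src_port, dst_port):
    # Illustrative layer-3/4 hash: XOR the tuple and reduce it to a member
    # index. Real switches and drivers use their own formulas, but all of
    # them map a fixed 5-tuple to the same member every time.
    h = src_port ^ dst_port
    h ^= sum(int(o) for o in src_ip.split("."))
    h ^= sum(int(o) for o in dst_ip.split("."))
    return h % LAG_MEMBERS

# Two parallel TCP streams between the same pair of nodes (what iperf3 -P 2
# creates): fixed IPs and server port, random ephemeral client ports.
trials = 100_000
collisions = 0
for _ in range(trials):
    p1, p2 = random.sample(range(32768, 61000), 2)
    m1 = l3l4_hash("10.0.0.1", "10.0.0.2", p1, 5201)
    m2 = l3l4_hash("10.0.0.1", "10.0.0.2", p2, 5201)
    collisions += (m1 == m2)

print(f"both streams on the same 10G member: {collisions / trials:.1%}")
```

In this model roughly half of the runs put both streams onto one 10 Gbit member, which matches the intermittent "sometimes 20 Gbit/s, sometimes heavy loss" pattern you describe.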

For testing, you should take control of the L4 port numbers on the client side (iperf3 can bind the client port with its --cport option, for example) and closely monitor the per-port egress counters on the switch; a sketch of how to pick such ports follows below.
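
One way to pick the ports deliberately, again under the same illustrative hash assumption as above (so the switch's real mapping still has to be confirmed from its per-port counters), is to pre-compute two client source ports that land on different members and run one single-stream client bound to each:

```python
def l3l4_hash(src_ip, dst_ip, src_port, dst_port, members=2):
    # Same illustrative layer-3/4 hash as above; NOT the CRS326's real formula.
    h = src_port ^ dst_port
    h ^= sum(int(o) for o in src_ip.split("."))
    h ^= sum(int(o) for o in dst_ip.split("."))
    return h % members

SRC, DST, SERVER_PORT = "10.0.0.1", "10.0.0.2", 5201  # made-up test values

# Walk a port range and keep the first source port found for each member.
chosen = {}
for port in range(40000, 40100):
    member = l3l4_hash(SRC, DST, port, SERVER_PORT)
    chosen.setdefault(member, port)
    if len(chosen) == 2:
        break

print(chosen)  # one candidate client port per member under the assumed hash
# Then run one iperf3 client per port, e.g. "iperf3 -c <server> --cport <port>",
# and check the switch's per-port traffic counters to confirm the spread.
```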

The general lesson: L3/L4-based traffic distribution in a LAG may work well with a large number of streams, but it usually performs poorly with only a few streams. Hashes based on L3 only, L2 only, or L2/L3 combined key on even fewer varying fields, so they distribute even worse in this scenario.
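
A quick way to see why the number of streams matters, modelling each flow's hash placement as a uniform random pick (which is effectively what random ephemeral ports give you):

```python
import random
from statistics import mean

def busiest_share(flows, members=2, trials=5000):
    # Average fraction of total traffic on the most loaded member when
    # `flows` equal-rate flows are each hashed onto a random member.
    shares = []
    for _ in range(trials):
        load = [0] * members
        for _ in range(flows):
            load[random.randrange(members)] += 1
        shares.append(max(load) / flows)
    return mean(shares)

for n in (2, 4, 16, 64, 256):
    print(f"{n:4d} flows: busiest member carries ~{busiest_share(n):.0%} of the traffic")
```

In this model, two flows leave the busiest link with about 75% of the traffic on average (and 100% of it half the time), while hundreds of flows approach an even 50/50 split, which is the regime LAG hashing is designed for.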

  • To summarize, the problem is that your testing is artificial and doesn't reveal the strengths of link bonding; it reveals the weaknesses. In a more organic environment with many clients for the servers, the traffic will be more evenly distributed. If you want peak performance for a single client or very few clients, the only option is to increase the link speed (25/50/200 Gbit links), disk speed, etc. Commented Apr 1 at 14:58
