
I am having a problem with my core network.

Currently I have a Dell PowerConnect 5524 with an SFP+ (10GbE) module connected to another 5524 (same transceiver on both), which is also stacked with a PowerConnect 5548 (via HDMI).

Like this: Network Layout

However, I was recently investigating why our transatlantic IPsec VPN was performing so badly, and started taking Wireshark packet captures and tcpdumps from the WatchGuard.

I noticed huge numbers of TCP retransmissions and duplicate ACKs carrying SACKs (after the 4th identical ACK, a retransmit is triggered).
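To illustrate the rule in parentheses, here is a minimal Python sketch of how a sender following the classic fast-retransmit rule reacts to a stream of ACK numbers. The function name, the threshold parameter and the sample ACK values are all made up for illustration. It also simplifies what counts as a duplicate ACK, since a real stack checks for an empty payload and an unchanged window too.

```python
def fast_retransmit_points(ack_numbers, dup_threshold=3):
    """Return the indexes at which a fast retransmit would be triggered.

    An ACK counts as a duplicate when it repeats the previous ACK number.
    The retransmit fires on the dup_threshold-th duplicate, i.e. on the
    fourth identical ACK when dup_threshold is 3.
    """
    triggers = []
    last_ack = None
    dup_count = 0
    for index, ack in enumerate(ack_numbers):
        if ack == last_ack:
            dup_count += 1
            if dup_count == dup_threshold:
                triggers.append(index)
        else:
            # A new ACK number resets the duplicate counter.
            last_ack = ack
            dup_count = 0
    return triggers


# Hypothetical ACK numbers: 2000 is seen four times in a row.
acks = [1000, 2000, 2000, 2000, 2000, 3000]
print(fast_retransmit_points(acks))  # [4]
```

Index 4 is the fourth ACK for 2000, which matches the "4th ACK" behaviour seen in the capture.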

However, I then connected my laptop directly to the 5524 on the left and SSH'd into the device while running a packet capture. I was still getting lots of the above errors.

Can anyone help me, or tell me why they think that connecting directly to the switch and talking to the switch itself would still give me these results?

I've turned off practically all Layer 3 features on the port my laptop is using and still the errors remain.

UPDATE: Example packet capture uploaded here: https://www.cloudshark.org/captures/882b8189541d


2 Answers


You're (192.168.0.164) directly connected to the switch (10.168.0.106) that you're SSH'ing into, and you get dup ACKs and retransmits... Something is not right; it seems the network gremlins are hungry.

What do the switch's error counters say?

Also note that some devices put traffic to/from the control plane at a lower priority than the data plane. SSH directly to the switch might not reveal the problem you think it's revealing. What about connecting to something else on the switch? Do you see the same symptoms?

  • Hi Remi, there definitely seems to be something wrong with the switch, as you suggested and I suspected. Dell confirmed that the switch prioritises network-side traffic over in-band management traffic. Dell are perplexed by this issue, but it seems to affect all three of my switches, so they're investigating the firmware. Thanks for your help. Commented Jun 16, 2014 at 23:17

I would probably post this as a comment, but this is my first post on Stack Exchange and it won't let me.

ssh isn't the best application for testing performance because of its interactive nature, unless you're saying that you see a delay in the characters being echoed back to you. Without knowing more about what you were doing with ssh, the things I noticed are:

  1. In general, the number of retransmissions doesn't bother me. However, the fact that this is your laptop plugged directly into the switch with no router or other devices to drop packets does present a potential problem.
  2. The retransmissions are strange. Every one marked as a retransmission that I looked at had the original TCP segment present in the capture i.e. the data wasn't dropped. In some cases your laptop ACKs the segment but the switch retransmits anyway. Maybe the ACK from your laptop got dropped. In other cases the switch sends a segment, say 50 bytes, then within 10 or so milliseconds sends it again but with more data in the payload e.g. 100 bytes. This is repacketization and is fine. But your laptop hadn't ACKed yet and it's extremely doubtful the switch has a retransmission timeout of ~10ms.
  3. At first, I thought your switch was extremely busy due to the large difference in IP IDs between the original segment and the retransmission. You can sometimes tell how busy a machine is by looking at the difference in IP IDs of consecutive packets it sends. Your laptop, for example, is using sequential IP IDs e.g. 1234, 1235, 1236. But the switch's IP IDs vary so wildly, I imagine it's using IP ID randomization. So it might not be crazy busy.
  4. A few ACKs from the switch have Ethernet checksum errors according to Wireshark. I'm not well versed in Ethernet, so I don't know when a checksum is in use and when it isn't, because clearly not every frame has one when I inspect it in Wireshark. The ones with the error are 0x00000000, so it's wrong in the sense that a checksum field is present but was never calculated. However, if the checksums were really incorrect, you wouldn't see those frames in the capture unless you were capturing on the wire between the two endpoints.
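The IP ID comparison in point 3 can be sketched in a few lines of Python. This is only a rough heuristic under stated assumptions: the function names, the `max_step` threshold of 64 and the switch's sample IDs are invented for illustration and are not taken from the capture. Only the laptop-style values 1234, 1235, 1236 follow the example above.

```python
def ip_id_deltas(ip_ids):
    """Differences between consecutive IP IDs, allowing for 16-bit wrap."""
    return [(b - a) % 65536 for a, b in zip(ip_ids, ip_ids[1:])]


def looks_sequential(ip_ids, max_step=64):
    """Heuristic: True if every step is a small positive increment.

    max_step is an arbitrary cut-off chosen for this sketch. A busy host
    with a single global counter would show larger gaps between the
    packets you can see, because it is also sending to other peers.
    """
    deltas = ip_id_deltas(ip_ids)
    return bool(deltas) and all(0 < d <= max_step for d in deltas)


laptop_ids = [1234, 1235, 1236, 1237]     # sequential, as in the example
switch_ids = [40211, 873, 59120, 12004]   # made-up, randomised-looking IDs

print(ip_id_deltas(laptop_ids))       # [1, 1, 1]
print(looks_sequential(laptop_ids))   # True
print(looks_sequential(switch_ids))   # False
```

With a sequential sender, the average delta gives a rough idea of how many other packets it sent in between. With randomised IDs, as the switch appears to use, the deltas tell you nothing about load.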

So none of that is likely a root cause; they're just things I noticed.

Going forward, I recommend:

  1. Clarify exactly what the performance issue is with ssh. Since it's an interactive application, some delays are expected.
  2. Test with some other protocol or application and get captures.
  3. Inspect the switch for physical layer issues on that interface.
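For step 2, one option is a plain bulk TCP transfer, which is not interactive the way ssh is. Below is a minimal Python sketch using only the standard library. The function names, the chunk size and the loopback demo are assumptions made for illustration, and a dedicated tool such as iperf would be the more usual choice.

```python
import socket
import threading
import time


def run_sink(listener, result):
    """Accept one connection and count bytes until the peer closes."""
    conn, _ = listener.accept()
    received = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            received += len(chunk)
    result["received"] = received


def send_bulk(host, port, total_bytes, chunk_size=65536):
    """Send total_bytes of zeros; return (elapsed_seconds, bytes_sent)."""
    payload = b"\0" * chunk_size
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
    return time.perf_counter() - start, sent


def loopback_demo(total_bytes=1_000_000):
    """Run sink and sender on 127.0.0.1 just to show the script works."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    result = {}
    sink = threading.Thread(target=run_sink, args=(listener, result))
    sink.start()
    elapsed, sent = send_bulk("127.0.0.1", port, total_bytes)
    sink.join()
    listener.close()
    return sent, result["received"], elapsed


if __name__ == "__main__":
    sent, received, elapsed = loopback_demo()
    print(f"sent {sent} bytes, received {received} bytes in {elapsed:.3f}s")
```

For a real test, the sink would run on a host behind the switch and the sender on the laptop, with a capture running alongside so the retransmissions can be compared with the ssh capture.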
