We recently had a sitewide power failure that drained the UPSs. After everything came back up, we started seeing some strange networking behavior.

The server seems able to ping only one other machine on the network, and none of its network shares are available. Which machine it can reach changes when we reboot the workstations.

Current Status:

  • The server can ping, and be pinged from, exactly one machine on the network.
  • The DNS hostname resolves to the correct IP on ping (from all machines).
  • Server network shares (NFS/SMB) are down for all machines (even from the box that can ping).
  • The NFS and SMB services are running.
  • The server can be reached by ssh from whatever machine is currently able to ping it.
  • Server cannot ping intermediary switches?
  • Workstations can ping all intermediary hardware.

ENV:

  • DNS/Auth: Active Directory (all static IPs, no DHCP)
  • Server: Debian 6.3.0 (connected by 4 bonded 40GbE links, all of which are up)
  • Topology: Server <-> Mellanox Switch sn2100 <-> Mellanox Fiber 10G (sn1016) <-> Workstations

Mixed-OS workstations (OSX 10.14 and up, Windows 10, CentOS 7)

Suspect:

I currently suspect some kind of routing issue on the sn2100, but other devices route through it just fine.

  • Can you disable the bond and test each link individually? It makes me suspect only one link is allowing packets through.
    – user1686
    Commented Jun 24, 2022 at 14:57
  • @user1686 Yeah, that was my thought as well. I'm pulling off some data at the moment, but that will be up next for troubleshooting.
    – user36659
    Commented Jun 24, 2022 at 15:51

1 Answer

So, @user1686, that was the correct debugging path.

Solution:

When I started breaking apart the bonded connection on the Mellanox switch, I noticed that the LAG Mode settings were inconsistent across the ports: one was grayed out, one was in static mode, and the rest were correct. Either the configuration was corrupted during the power failure, or the switch rolled back to a state from when the LAG was still being built.

I removed all ports from the LAG, set the LAG Mode on each one to LACP Active, and then recreated the LAG.

The machines could immediately ping the server, and the NFS shares only needed a remount to come back up.
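
A mixed LAG mode like this can sometimes be spotted from the server side as well. The following is only a rough sketch under stated assumptions: it assumes the server uses the Linux bonding driver in 802.3ad (LACP) mode and that the bond is named bond0 (neither detail is confirmed above, so adjust as needed). It parses the text of /proc/net/bonding/bond0 and lists member interfaces that are down or that sit outside the active aggregator, which is one way a switch port in the wrong LAG mode might show up. The function names are made up for illustration.

```python
from pathlib import Path


def parse_bond_status(text):
    """Parse bonding status text.

    Returns (active_aggregator_id, {slave_name: {field: value}}).
    The aggregator ID is None if the text has no 802.3ad section.
    """
    active_id = None
    slaves = {}
    current = None  # name of the slave section currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
            slaves[current] = {}
        elif current is None:
            # Before the first slave section, "Aggregator ID" belongs
            # to the "Active Aggregator Info" block.
            if key == "Aggregator ID":
                active_id = value
        else:
            # Keep the first value seen for each field in this section.
            slaves[current].setdefault(key, value)
    return active_id, slaves


def suspicious_slaves(text):
    """Return {slave_name: [reasons]} for members that look unhealthy."""
    active_id, slaves = parse_bond_status(text)
    problems = {}
    for name, fields in slaves.items():
        reasons = []
        if fields.get("MII Status") != "up":
            reasons.append("link down")
        if active_id is not None and fields.get("Aggregator ID") != active_id:
            reasons.append("outside active aggregator")
        if reasons:
            problems[name] = reasons
    return problems


if __name__ == "__main__":
    # "bond0" is an assumed bond name; substitute your own.
    status_file = Path("/proc/net/bonding/bond0")
    if status_file.exists():
        found = suspicious_slaves(status_file.read_text())
        for name, reasons in found.items():
            print(f"{name}: {', '.join(reasons)}")
    else:
        print(f"{status_file} not found; is the bonding driver loaded?")
```

An empty result does not prove the switch side is configured correctly. It only means every member is up and shares the active aggregator, so checking the LAG Mode on each switch port is still worthwhile.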
