User Details
- User Since: May 10 2021, 3:25 PM
- Availability: Available
- IRC Nick: topranks
- LDAP User: Cathal Mooney
- MediaWiki User: CMooney (WMF)
Today
The gNMI stats proved very helpful for keeping an eye on the bandwidth shifting around.
Work completed. Traffic is currently bridged through the two spine switches over the AEs from the row C/D virtual-chassis, and the CR interfaces connected to the spines are acting as the VRRP gateways.
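For reference, a minimal sketch of what a VRRP gateway on the CR side can look like in JunOS; the interface name, unit, addresses and priority below are hypothetical placeholders, not the production config:

    interfaces {
        ae1 {
            unit 0 {
                family inet {
                    address 10.64.32.2/24 {
                        /* 10.64.32.1 is the shared gateway IP the hosts point at */
                        vrrp-group 1 {
                            virtual-address 10.64.32.1;
                            /* higher priority wins mastership */
                            priority 150;
                            /* answer pings etc. sent to the virtual address */
                            accept-data;
                        }
                    }
                }
            }
        }
    }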
There is possibly a variant of option 1:
Yesterday
Though there didn't seem to be a problem afterwards, the timing makes me think of T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f2-eqiad
Tue, Jul 16
Upgrade completed, all hosts back online and pinging ok. Thanks all for the assistance!
Thu, Jul 11
Switch upgrade complete; all looks good, hosts are online and responding to ping again. Thanks for the assistance!
Wed, Jul 10
Switch upgraded successfully and all hosts back online/pinging. Thanks everyone for the assistance!
Closing task - it's a duplicate; the work was completed under T365169.
I think the work on this can be done in tandem with the review of the setup in T367203: Sub-optimal cloud routing for WMCS in eqiad when link fails.
Tue, Jul 9
Switch upgrade completed without issue. All connected hosts are back online and responding to ping now, thanks all for the help.
Fri, Jul 5
Seems like a great tool, but we are going to move forward with pulling these stats using gnmic, having successfully tested it under T326322. If we find any blockers that gNMI can't cover we can revisit junos_exporter, but hopefully that won't be needed. Future gnmic pipeline development will be tracked in T369384: Productionize gnmic network telemetry pipeline
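For anyone curious, a quick sketch of the kind of gnmic subscription involved; the hostname, port, credentials and path here are illustrative assumptions rather than the exact production pipeline settings:

    # Stream interface counters from a switch every 30s over gNMI.
    # lsw1-f2-eqiad:57400 and the credential variables are placeholders.
    gnmic -a lsw1-f2-eqiad:57400 -u "$GNMI_USER" -p "$GNMI_PASS" --skip-verify \
      subscribe \
      --path "/interfaces/interface/state/counters" \
      --stream-mode sample \
      --sample-interval 30s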
I'm going to close this task now. The current gnmic collection is providing what we need in terms of the queue stats for observing how QoS operates, and it seems best to track future, general improvements to our network telemetry in a separate task (see below).
Bit of an update on this one. We recently had a related problem after lvs2011 was rebooted, which we need to address.
Thu, Jul 4
All seems good with the policy changes now, closing task.
All is working well on the test host. Well, Puppet was giving me a headache, but I just skipped all that :)
Ok, change merged; we are now announcing codfw ranges from eqord again:
cmooney@cr2-eqord> show route advertising-protocol bgp 192.80.17.197
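For context, announcing those ranges comes down to the BGP export policy on cr2-eqord. A minimal sketch of the shape of such a policy; the policy name, group name and prefix below are hypothetical placeholders (198.51.100.0/24 is a documentation range), not our real config:

    policy-options {
        policy-statement export-codfw {
            term codfw-ranges {
                from {
                    /* placeholder prefix; the real codfw ranges go here */
                    route-filter 198.51.100.0/24 orlonger;
                }
                then accept;
            }
        }
    }
    protocols {
        bgp {
            group transit {
                /* apply the export policy towards the peer */
                export export-codfw;
            }
        }
    }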
Wed, Jul 3
So one thing I noticed is that we are not getting stats for LAG/ae interfaces with the current setup, nor for routed sub-interfaces.
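If it helps, one likely fix is to widen the subscription paths. A sketch of what that could look like in a gnmic config file; the subscription name and intervals are assumptions, and whether JunOS actually streams counters for ae interfaces on these paths needs verifying:

    subscriptions:
      interface-counters:
        paths:
          # physical and LAG (ae) interface counters
          - /interfaces/interface/state/counters
          # routed sub-interface (unit) counters
          - /interfaces/interface/subinterfaces/subinterface/state/counters
        mode: stream
        stream-mode: sample
        sample-interval: 30s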
Switch is back up, all looks good at first glance from the network side.
Gonna close this one as the design is finalised; see the detail on Wikitech here:
Tue, Jul 2
Also @Jhancock.wm, when you're next on site can you check the mgmt/iDRAC connection for this one? It doesn't seem to be trying to get an IP via DHCP, and the old IP from when it was mw2289 isn't working either.
@Jhancock.wm can you confirm what position in the rack the server is in?
All seems ok following the increase:
So the change to the timeout has made a big difference, but there are still some small gaps:
Sun, Jun 30
Folks, just FYI, I've pushed the time here back an hour if that's ok; it seems to suit most people best.
Fri, Jun 28
I may have spoken too soon when I said things were working fine. It seems that in codfw, since the change, we are only getting stats some of the time:
@fgiunchedi I was perhaps a little cheeky and merged this, but it was clear the volume of new metrics was well within what you'd previously said was ok. Everything is working nicely, I'm glad to say.
Thanks all for the help with this one!