Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f3-eqiad
Open, Medium, Public

Description

Moved forward one week: 2024-07-23, 15:00 UTC
Hosts on this rack: https://netbox.wikimedia.org/dcim/racks/91/

  • db1195 - s1
  • db1202 - s7
  • db1203 - s8
  • db1205 - backup
  • moss-be1003
  • ms-be1075
  • an-presto1015
  • an-worker1156
  • an-worker1146
  • kafka-jumbo1015
  • kafka-stretch1002
  • elastic1100
  • elastic1101
  • elastic1102
  • wdqs1016
  • dse-k8s-worker1008
  • ml-serve1008
  • kubernetes1025
  • kubernetes1026
  • kubernetes1052
  • kubernetes1053
  • kubernetes1054
  • kubernetes1055
  • kubernetes1056
  • mw1496

Teams Involved: Data Persistence, Data Platform, Search, Machine Learning, Service Ops

Expected outage: 15-30 minutes

Please use the below sheet to detail any actions that are required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
MatthewVernon

Event Timeline

Swift-wise, we just need to check that the cluster is happy afterwards.

moss-be1003 is part of the apus Ceph cluster, which should be in production by end of this quarter (i.e. before this work is due to happen), and will need a bit of care. Should just be a case of putting it into maintenance mode beforehand, but it's 1/3 of the cluster capacity.

cmooney triaged this task as Medium priority.
cmooney updated the task description.

db1205 is the secondary media backups metadata db server, usually just a standby to db1204. Unless it has become the active server because the primary is unavailable, the only thing needed is to check that replication restarts correctly after the maintenance.
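
For the replication check on db1205, a minimal sketch of what "restarted correctly" could mean in practice. It assumes the host runs MariaDB and that the status row comes from `SHOW REPLICA STATUS` (or `SHOW SLAVE STATUS`) with the usual `Slave_IO_Running`, `Slave_SQL_Running` and `Seconds_Behind_Master` columns. The function name `replication_ok` and the 60 second lag threshold are illustrative assumptions, not anything defined in this task.

```python
def replication_ok(status: dict, max_lag_seconds: int = 60) -> bool:
    """Return True if both replication threads run and lag is within bounds.

    `status` is one row of SHOW REPLICA STATUS as a column -> value mapping
    (assumed MariaDB column names; hypothetical helper, not WMF tooling).
    """
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    if not (io_running and sql_running):
        return False
    lag = status.get("Seconds_Behind_Master")
    # The lag column is NULL (None here) when the lag cannot be determined.
    if lag is None:
        return False
    return int(lag) <= max_lag_seconds


# Example rows as they might look before and after the switch comes back.
healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
           "Seconds_Behind_Master": 3}
broken = {"Slave_IO_Running": "Connecting", "Slave_SQL_Running": "Yes",
          "Seconds_Behind_Master": None}
print(replication_ok(healthy), replication_ok(broken))
```

If the IO thread stays in a connecting state after the switch is back, replication has not resumed and needs a manual look.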

Icinga downtime and Alertmanager silence (ID=6a298ae5-e736-4051-8220-9ec4f352950a) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e3-eqiad

lsw1-e3-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=39fcbcd0-8c16-4208-ac06-f4b442e55a54) set by cmooney@cumin1002 for 0:30:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e3-eqiad

lsw1-e3-eqiad,lsw1-e3-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=2a5cb43e-793c-4103-9499-369354315479) set by cmooney@cumin1002 for 0:40:00 on 27 host(s) and their services with reason: JunOS upgrade lsw1-e3-eqiad

an-presto1010.eqiad.wmnet,an-worker1154.eqiad.wmnet,backup1009.eqiad.wmnet,cephosd1003.eqiad.wmnet,db[1192,1198-1199,1204].eqiad.wmnet,druid1010.eqiad.wmnet,dse-k8s-worker1006.eqiad.wmnet,elastic[1093-1095].eqiad.wmnet,kafka-jumbo1012.eqiad.wmnet,kafka-stretch1001.eqiad.wmnet,kubernetes[1047-1051,1061].eqiad.wmnet,ml-serve1006.eqiad.wmnet,ms-be1074.eqiad.wmnet,mw[1491-1493].eqiad.wmnet,wdqs1015.eqiad.wmnet
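
The host list above uses a folded bracket notation (`db[1192,1198-1199,1204]`). To compare it against a rack listing, it can be expanded into individual names. The sketch below is a hand-rolled expander covering only the forms seen in this comment (one bracket group per name, comma-separated numbers and ranges). The helper names `split_top_level` and `expand` are hypothetical; this is not the implementation used by the downtime tooling.

```python
import re


def split_top_level(expr: str) -> list[str]:
    """Split on commas that are not inside a [...] group."""
    parts, current, depth = [], [], 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand(expr: str) -> list[str]:
    """Expand names such as 'db[1192,1198-1199].eqiad.wmnet' into hosts."""
    hosts = []
    for part in split_top_level(expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not match:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                low, high = item.split("-")
                width = len(low)  # preserve any zero padding
                for n in range(int(low), int(high) + 1):
                    hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


print(expand("db[1192,1198-1199,1204].eqiad.wmnet"))
```

Run over the full list in the comment above, this yields 27 names, matching the "27 host(s)" in the downtime message.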

Mentioned in SAL (#wikimedia-operations) [2024-07-09T15:04:20Z] <topranks> rebooting lsw1-e3-eqiad to install updated JunOS version T365998

Mentioned in SAL (#wikimedia-operations) [2024-07-18T14:47:54Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365998 - depooling db1195 - s1 db1202 - s7 db1203 - s8', diff saved to https://phabricator.wikimedia.org/P66816 and previous config saved to /var/cache/conftool/dbconfig/20240718-144754-arnaudb.json

data-persistence hosts handled, ready whenever you are @cmooney