Rename wikikube worker nodes during OS reimage
Open, Needs Triage, Public

Assigned To

None

Authored By

	JMeybohm
	May 22 2024, 9:57 AM

Description

During the OS reimage that comes with the next k8s upgrade, we should rename the wikikube worker nodes to wikikube-workerXXXX, using the reimage cookbook (https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1008818) or something like a rolling-reimage cookbook for k8s.

RESERVED NODE NAMES
Please avoid using wikikube-worker[12]0[01][56] for anything other than the dedicated sessionstore nodes (just to keep the numbering identical when changing the name).
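As a rough illustration of the reservation above, the following sketch checks a candidate name against the reserved pattern and picks the lowest free name for a site. It assumes the bracket expression can be read as a regular expression and that the first digit encodes the site (1 for eqiad, 2 for codfw, as the renames in this task suggest). The helper names `is_reserved` and `first_free_name` are hypothetical and not part of any cookbook.

```python
import re

# Reserved pattern copied from the task description, read here as a regex (assumption).
RESERVED_PATTERN = re.compile(r"wikikube-worker[12]0[01][56]")


def is_reserved(hostname: str) -> bool:
    """Return True if the host (short name or FQDN) is reserved for sessionstore."""
    short_name = hostname.split(".", 1)[0]
    return RESERVED_PATTERN.fullmatch(short_name) is not None


def first_free_name(site_digit: int, used: set[str]) -> str:
    """Hypothetical helper: lowest wikikube-workerNNNN that is neither used nor reserved."""
    for index in range(1, 1000):
        candidate = f"wikikube-worker{site_digit}{index:03d}"
        if candidate not in used and not is_reserved(candidate):
            return candidate
    raise ValueError("no free name left for this site")
```

With wikikube-worker1001 through wikikube-worker1004 already taken, the helper skips the reserved 1005 and 1006 and lands on 1007.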

Details

Subject	Repo	Branch	Lines +/-
mw1358: Put back insetup::serviceops	operations/puppet	production	+8 -2
Rename kubernetes2032 to wikikube-worker2002	operations/puppet	production	+3 -3
Rename kubernetes2023 to wikikube-worker2001	operations/puppet	production	+6 -3
Add wikikube-worker config	operations/puppet	production	+6 -4

Customize query in gerrit

Related Objects
Search...

Status	Subtype	Assigned	Task
Open		None	T341984 Update Kubernetes clusters to >1.25
Open		None	T336861 Fix naming confusion around main/wikikube kubernetes clusters
Open		None	T365571 Rename wikikube worker nodes during OS reimage
Duplicate		None	T366085 Relabel kubernetes2032 to wikikube-worker2002
Duplicate		None	T366468 Relabel kubernetes2023 to wikikube-worker2001
Resolved	Request	Jclark-ctr	T366583 hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet
Resolved		VRiley-WMF	T367285 Relabel eqiad wikikube worker nodes
Declined		None	T367286 Relabel codfw wikikube worker nodes

Event Timeline

JMeybohm created this task.May 22 2024, 9:57 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 22 2024, 9:57 AM

hnowlan subscribed.May 22 2024, 10:06 AM

• MoritzMuehlenhoff subscribed.May 22 2024, 10:37 AM

JMeybohm added a parent task: T336861: Fix naming confusion around main/wikikube kubernetes clusters.May 22 2024, 11:35 AM

@ayounsi I think we could test the rename cookbook together with T350152: Automation to change a server's vlan on the already cordoned kubernetes2023.codfw.wmnet, right?

JMeybohm mentioned this in T351074: Move servers from the appserver/api cluster to kubernetes.May 22 2024, 1:41 PM

Sounds good! I'd recommend first doing a rename followed by a normal re-image, then just a move-vlan, and then, on a different host, testing both actions one after the other.

Mentioned in SAL (#wikimedia-operations) [2024-05-22T14:33:24Z] <jayme> drained, cordoned and pooled=inactive kubernetes2023 and kubernetes2032 for cookbook testing - T350152 T365571

I've cleared out kubernetes2023 and kubernetes2032 for you to run tests. As the hosts are pooled=inactive and cordoned in k8s, all you have to do is downtime them (which the cookbooks probably do).

After renaming, the old nodes need to be manually removed from the k8s API (https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes#Delete_the_node_from_Kubernetes_API) and the new ones need to be "uncordoned" (sudo -i; kube-env admin codfw; kubectl uncordon wikikube-workerXXXX on a deployment host). The uncordon should only be done after Puppet has had the chance to run on all other k8s nodes.
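To make the ordering of the manual follow-up explicit, here is a minimal sketch that only builds the command strings for one renamed node. It does not run anything. The function name `post_rename_steps` and its parameters are hypothetical. The commands themselves are the ones quoted in this task.

```python
def post_rename_steps(old_node: str, new_node: str, site: str) -> list[tuple[str, str]]:
    """Return (when, command) pairs for the manual steps after a rename and reimage."""
    return [
        ("on a deployment host", "sudo -i"),
        ("select the cluster", f"kube-env admin {site}"),
        ("remove the stale node object", f"kubectl delete node {old_node}"),
        # Uncordon last: Puppet should first have run on all other k8s nodes.
        ("after Puppet ran on all other k8s nodes", f"kubectl uncordon {new_node}"),
    ]


for when, command in post_rename_steps(
    "kubernetes2023.codfw.wmnet", "wikikube-worker2001", "codfw"
):
    print(f"{when}: {command}")
```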

JMeybohm updated the task description. (Show Details)May 22 2024, 3:01 PM

Change #1034956 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add wikikube-worker config

https://gerrit.wikimedia.org/r/1034956

gerritbot added a project: Patch-For-Review.May 22 2024, 3:08 PM

Change #1034976 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Rename kubernetes2023 to wikikube-worker2001

https://gerrit.wikimedia.org/r/1034976

Change #1034977 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Rename kubernetes2032 to wikikube-worker2002

https://gerrit.wikimedia.org/r/1034977

Change #1034956 merged by JMeybohm:

[operations/puppet@production] Add wikikube-worker config

https://gerrit.wikimedia.org/r/1034956

Change #1034976 merged by JMeybohm:

[operations/puppet@production] Rename kubernetes2023 to wikikube-worker2001

https://gerrit.wikimedia.org/r/1034976

Cookbook cookbooks.sre.hosts.rename started by ayounsi@cumin1002 from kubernetes2023 to wikikube-worker2001 completed:

kubernetes2023 (WARN)
- ✔️ Downtimed host on Icinga/Alertmanager
- ✔️ Netbox updated
- ⚠️Renaming failed but rollback succedded⚠️ Please check the logs for the reason and follow up with I/F if needed.

Cookbook cookbooks.sre.hosts.rename started by ayounsi@cumin1002 from kubernetes2023 to wikikube-worker2001 completed:

kubernetes2023 (PASS)
- ✔️ Downtimed host on Icinga/Alertmanager
- ✔️ Netbox updated
- ✔️ IDRAC updated
- ✔️ DNS updated
- ✔️ Switch description updated
- ✔️ Removed from DebMonitor
- ✔️ Removed from Puppet master and PuppetDB
- Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2001.codfw.wmnet with OS bullseye completed:

wikikube-worker2001 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405230825_jayme_2257839_wikikube-worker2001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

After the reimage I needed to run the following for calico to start up properly:

kubectl delete node kubernetes2023.codfw.wmnet
sudo cumin -b 35 "A:wikikube-worker-codfw" 'systemctl restart ferm' # actually just the nodes running typha

The ferm restart was required because the new node name is added to the ferm rule before it can be resolved to an IP. ferm then does not create an iptables access rule for that host, and the rule is not refreshed once DNS works because the ferm rule does not change again (the new node name is already in it). See T365687: Improve calico-typha firewall rules.
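The failure mode described above can be reproduced with a toy model. This is not ferm or Puppet code. It is only a sketch assuming that names are resolved when the rules are loaded and that a reload happens only when the list of names changes. The class `ToyFirewall` and the addresses from the 192.0.2.0/24 documentation range are invented for illustration.

```python
from typing import Callable, Optional

Resolver = Callable[[str], Optional[str]]


class ToyFirewall:
    """Toy model (NOT real ferm): names are resolved only when rules are (re)loaded."""

    def __init__(self) -> None:
        self.config: tuple[str, ...] = ()
        self.allowed_ips: set[str] = set()

    def apply(self, hostnames: list[str], resolve: Resolver) -> bool:
        """Reload only if the hostname list changed; return True if a reload happened."""
        if tuple(hostnames) == self.config:
            return False  # config unchanged, so nothing triggers a reload
        self.restart(hostnames, resolve)
        return True

    def restart(self, hostnames: list[str], resolve: Resolver) -> None:
        """Unconditionally reload, silently skipping names that do not resolve."""
        self.config = tuple(hostnames)
        self.allowed_ips = {ip for ip in map(resolve, hostnames) if ip is not None}


# Invented example data: the new name is in the rule before DNS knows about it.
dns = {"kubernetes2024": "192.0.2.24"}
names = ["kubernetes2024", "wikikube-worker2001"]
firewall = ToyFirewall()
firewall.apply(names, dns.get)                  # loads rules, new node is skipped
dns["wikikube-worker2001"] = "192.0.2.1"        # DNS catches up later
reloaded = firewall.apply(names, dns.get)       # same names, so no reload
stale = set(firewall.allowed_ips)               # still lacks the new node
firewall.restart(names, dns.get)                # the manual restart fixes it
```

In this model the second `apply` is a no-op, which mirrors why a manual `systemctl restart ferm` was needed on the typha nodes.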

kubernetes2023 is still cordoned and depooled for additional tests of the move-vlan process.

Before I forget, please notify DCops so they update the physical labels on the server.

hnowlan mentioned this in T365712: Relabel codfw Kubernetes hosts.May 23 2024, 1:30 PM

Change #1034977 merged by Hnowlan:

[operations/puppet@production] Rename kubernetes2032 to wikikube-worker2002

https://gerrit.wikimedia.org/r/1034977

Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from kubernetes2032 to wikikube-worker2002 completed:

kubernetes2032 (PASS)
- ✔️ Downtimed host on Icinga/Alertmanager
- ✔️ Netbox updated
- ✔️ IDRAC updated
- ✔️ DNS updated
- ✔️ Switch description updated
- ✔️ Removed from DebMonitor
- ✔️ Removed from Puppet master and PuppetDB
- Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2002.codfw.wmnet with OS bullseye executed with errors:

wikikube-worker2002 (FAIL)
- Failed to migrate host to the new VLAN, sre.hosts.move-vlan cookbook returned 94
- The reimage failed, see the cookbook logs for the details, You can also try typing "install-console" wikikube-worker2002.codfw.wmnet to get a root shell but depending on the failure this may not work.

Maintenance_bot removed a project: Patch-For-Review.May 28 2024, 12:30 PM

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2002.codfw.wmnet with OS bullseye completed:

wikikube-worker2002 (PASS)
- Host successfully migrated to the new VLAN
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405281512_hnowlan_3310614_wikikube-worker2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

hnowlan moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.May 30 2024, 2:10 PM

Mentioned in SAL (#wikimedia-operations) [2024-05-30T18:41:26Z] <cdanis> T365571 💙root@deploy1002.eqiad.wmnet ~ 🕝⁉ kubectl delete node kubernetes2032.codfw.wmnet

18:58:42	<+jinxer-wm>	RESOLVED: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown

JMeybohm mentioned this in T366470: Support creating phab tasks in wmflib.phabricator.Jun 3 2024, 12:58 PM

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1358 to wikikube-worker1001 completed:

mw1358 (WARN)
- ✔️ Downtimed host on Icinga/Alertmanager
- ✔️ Netbox updated
- ✔️ Netbox rolled back
- ⚠️Renaming failed but rollback succedded⚠️ Please check the logs for the reason and follow up with I/F if needed.

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1426 to wikikube-worker1002 completed:

mw1426 (PASS)
- ✔️ Downtimed host on Icinga/Alertmanager
- ✔️ Netbox updated
- ✔️ IDRAC updated
- ✔️ DNS updated
- ✔️ Switch description updated
- ✔️ Removed from DebMonitor
- ✔️ Removed from Puppet master and PuppetDB
- Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1427 to wikikube-worker1003 completed:

mw1427 (PASS)
- ✔️ Downtimed host on Icinga/Alertmanager
- ✔️ Netbox updated
- ✔️ IDRAC updated
- ✔️ DNS updated
- ✔️ Switch description updated
- ✔️ Removed from DebMonitor
- ✔️ Removed from Puppet master and PuppetDB
- Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1443 to wikikube-worker1004 completed:

mw1443 (PASS)
- ✔️ Downtimed host on Icinga/Alertmanager
- ✔️ Netbox updated
- ✔️ IDRAC updated
- ✔️ DNS updated
- ✔️ Switch description updated
- ✔️ Removed from DebMonitor
- ✔️ Removed from Puppet master and PuppetDB
- Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1490 to wikikube-worker1007 completed:

mw1490 (PASS)
- ✔️ Downtimed host on Icinga/Alertmanager
- ✔️ Netbox updated
- ✔️ IDRAC updated
- ✔️ DNS updated
- ✔️ Switch description updated
- ✔️ Removed from DebMonitor
- ✔️ Removed from Puppet master and PuppetDB
- Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Change #1038395 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw1358: Put back insetup::serviceops

https://gerrit.wikimedia.org/r/1038395

gerritbot added a project: Patch-For-Review.Jun 3 2024, 5:05 PM

Change #1038395 merged by Clément Goubert:

[operations/puppet@production] mw1358: Put back insetup::serviceops

https://gerrit.wikimedia.org/r/1038395

Maintenance_bot removed a project: Patch-For-Review.Jun 3 2024, 5:30 PM

Renamed:
mw1426 to wikikube-worker1002
mw1427 to wikikube-worker1003
mw1443 to wikikube-worker1004
mw1490 to wikikube-worker1007

On hold because its iDRAC is too old to be upgraded by the script:
mw1358 to wikikube-worker1001

Icinga downtime and Alertmanager silence (ID=7674428f-f194-4d51-ae42-1bbedb9b1fde) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Waiting on iDrac update