Rename wikikube worker nodes during OS reimage
Open, Needs Triage, Public

Description

During the OS reimage that accompanies the next k8s upgrade, we should rename the wikikube worker nodes to wikikube-workerXXXX, using either the reimage cookbook (https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1008818) or something like a rolling-reimage cookbook for k8s.

RESERVED NODE NAMES
Please avoid using wikikube-worker[12]0[01][56] for anything other than the dedicated sessionstore nodes (this keeps their numbering identical when the name changes).
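To make the reserved range concrete, here is a minimal sketch that checks a candidate name against the pattern above and expands it into the names it covers. The function names `is_reserved` and `reserved_names` are hypothetical helpers for illustration, not part of any existing cookbook.

```python
import re

# Pattern copied from the task description: wikikube-worker[12]0[01][56]
RESERVED = re.compile(r"wikikube-worker[12]0[01][56]")


def is_reserved(hostname: str) -> bool:
    """Return True if the name is set aside for the sessionstore nodes."""
    short = hostname.split(".", 1)[0]  # accept FQDNs as well as short names
    return RESERVED.fullmatch(short) is not None


def reserved_names() -> list[str]:
    """Expand the pattern into the eight concrete names it covers."""
    return [
        f"wikikube-worker{site}0{tens}{unit}"
        for site in "12"
        for tens in "01"
        for unit in "56"
    ]
```

A check like this could run before picking the next free number during a rename, so that names such as wikikube-worker1005 are skipped.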

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 22 2024, 9:57 AM

@ayounsi I think we could test the rename cookbook together with T350152: Automation to change a server's vlan on the already cordoned kubernetes2023.codfw.wmnet, right?

Sounds good! I'd recommend first doing a rename followed by a normal re-image, then just a move-vlan, and then, on a different host, testing both actions one after the other.

Mentioned in SAL (#wikimedia-operations) [2024-05-22T14:33:24Z] <jayme> drained, cordoned and pooled=inactive kubernetes2023 and kubernetes2032 for cookbook testing - T350152 T365571

I've cleared out kubernetes2023 and kubernetes2032 for you to run tests. As the hosts are pooled=inactive and cordoned in k8s all you have to do is to downtime them (which the cookbooks probably do).

After renaming, the old nodes need to be manually removed from the k8s API (https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes#Delete_the_node_from_Kubernetes_API) and the new ones need to be "uncordoned" (on a deployment host: sudo -i; kube-env admin codfw; kubectl uncordon wikikube-workerXXXX). The uncordon should only be done after puppet has had the chance to run on all other k8s nodes.
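The manual follow-up can be written down as an ordered checklist. The sketch below only builds the command strings and executes nothing. `post_rename_steps` is a hypothetical helper, and the node names passed in are assumed to match what `kubectl get nodes` reports for the cluster.

```python
def post_rename_steps(old_node: str, new_node: str, cluster: str = "codfw") -> list[str]:
    """Build the manual follow-up commands in order; nothing is executed."""
    return [
        "sudo -i",
        f"kube-env admin {cluster}",
        # Remove the stale node object left behind under the old name.
        f"kubectl delete node {old_node}",
        # Only after Puppet has run on all other k8s nodes:
        f"kubectl uncordon {new_node}",
    ]


if __name__ == "__main__":
    for step in post_rename_steps("kubernetes2023.codfw.wmnet", "wikikube-worker2001.codfw.wmnet"):
        print(step)
```

Keeping the delete before the uncordon mirrors the order described above. The wait for Puppet stays a human decision and is only noted as a comment.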

Change #1034956 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add wikikube-worker config

https://gerrit.wikimedia.org/r/1034956

Change #1034976 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Rename kubernetes2023 to wikikube-worker2001

https://gerrit.wikimedia.org/r/1034976

Change #1034977 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Rename kubernetes2032 to wikikube-worker2002

https://gerrit.wikimedia.org/r/1034977

Change #1034956 merged by JMeybohm:

[operations/puppet@production] Add wikikube-worker config

https://gerrit.wikimedia.org/r/1034956

Change #1034976 merged by JMeybohm:

[operations/puppet@production] Rename kubernetes2023 to wikikube-worker2001

https://gerrit.wikimedia.org/r/1034976

Cookbook cookbooks.sre.hosts.rename started by ayounsi@cumin1002 from kubernetes2023 to wikikube-worker2001 completed:

  • kubernetes2023 (WARN)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ⚠️Renaming failed but rollback succedded⚠️ Please check the logs for the reason and follow up with I/F if needed.

Cookbook cookbooks.sre.hosts.rename started by ayounsi@cumin1002 from kubernetes2023 to wikikube-worker2001 completed:

  • kubernetes2023 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2001.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405230825_jayme_2257839_wikikube-worker2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

After the reimage I needed to run the following for calico to start up properly:

kubectl delete node kubernetes2023.codfw.wmnet
sudo cumin -b 35 "A:wikikube-worker-codfw" 'systemctl restart ferm' # actually just the nodes running typha

The ferm restart was required because the new node name gets added to the ferm rule before it can be resolved to an IP. ferm then does not create an iptables access rule for that host. It is also not refreshed once DNS works, because the ferm rule itself does not change again (the new node name is already in there). See T365687: Improve calico-typha firewall rules
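The failure mode described above can be illustrated with a toy model. This is not ferm's actual implementation: the `Firewall` class, the `render_rules` function, and all host names and addresses are hypothetical, and the only behavior assumed is what the comment states (names are resolved when rules are rendered, and a reload happens only when the configured host list changes).

```python
def render_rules(allowed_hosts, resolve):
    """Render one accept rule per host that resolves; skip the rest."""
    rules = []
    for host in allowed_hosts:
        ip = resolve(host)
        if ip is not None:
            rules.append(f"ACCEPT from {ip}")
    return rules


class Firewall:
    """Toy firewall that re-renders only when its host list changes."""

    def __init__(self, resolve):
        self.resolve = resolve
        self.config = []
        self.active_rules = []

    def apply_config(self, allowed_hosts):
        if list(allowed_hosts) == self.config:
            return False  # config unchanged, so no reload is triggered
        self.config = list(allowed_hosts)
        self.restart()
        return True

    def restart(self):
        self.active_rules = render_rules(self.config, self.resolve)


dns = {"old-node": "10.0.0.1"}  # hypothetical names and addresses
fw = Firewall(dns.get)

fw.apply_config(["old-node", "new-node"])  # new-node is not in DNS yet
stale = list(fw.active_rules)

dns["new-node"] = "10.0.0.2"  # DNS catches up later
fw.apply_config(["old-node", "new-node"])  # same config, so no reload
still_stale = list(fw.active_rules)

fw.restart()  # stands in for the manual "systemctl restart ferm"
fixed = list(fw.active_rules)
```

In this model the rule for the new node only appears after the explicit restart, which matches why the manual ferm restart was needed on the nodes running typha.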

kubernetes2023 is still cordoned and depooled for additional tests of the move-vlan process.

Before I forget, please notify DCops so they update the physical labels on the server.

Change #1034977 merged by Hnowlan:

[operations/puppet@production] Rename kubernetes2032 to wikikube-worker2002

https://gerrit.wikimedia.org/r/1034977

Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from kubernetes2032 to wikikube-worker2002 completed:

  • kubernetes2032 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2002.codfw.wmnet with OS bullseye executed with errors:

  • wikikube-worker2002 (FAIL)
    • Failed to migrate host to the new VLAN, sre.hosts.move-vlan cookbook returned 94
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-worker2002.codfw.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2002.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2002 (PASS)
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405281512_hnowlan_3310614_wikikube-worker2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-05-30T18:41:26Z] <cdanis> T365571 💙root@deploy1002.eqiad.wmnet ~ kubectl delete node kubernetes2032.codfw.wmnet

18:58:42 <+jinxer-wm> RESOLVED: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1358 to wikikube-worker1001 completed:

  • mw1358 (WARN)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ Netbox rolled back
    • ⚠️Renaming failed but rollback succedded⚠️ Please check the logs for the reason and follow up with I/F if needed.

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1426 to wikikube-worker1002 completed:

  • mw1426 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1427 to wikikube-worker1003 completed:

  • mw1427 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1443 to wikikube-worker1004 completed:

  • mw1443 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1490 to wikikube-worker1007 completed:

  • mw1490 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Netbox updated
    • ✔️ IDRAC updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Change #1038395 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw1358: Put back insetup::serviceops

https://gerrit.wikimedia.org/r/1038395

Change #1038395 merged by Clément Goubert:

[operations/puppet@production] mw1358: Put back insetup::serviceops

https://gerrit.wikimedia.org/r/1038395

Renamed:
mw1426 to wikikube-worker1002
mw1427 to wikikube-worker1003
mw1443 to wikikube-worker1004
mw1490 to wikikube-worker1007

On hold because its iDRAC is too old to be upgraded by the script:
mw1358 to wikikube-worker1001

Icinga downtime and Alertmanager silence (ID=7674428f-f194-4d51-ae42-1bbedb9b1fde) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Waiting on iDrac update

mw1358.eqiad.wmnet