Introduction
This is the parent task for moving ns1.wikimedia.org (208.80.153.231), which is currently announced only from codfw, to anycast so that we can announce it from all seven sites.
netops [ @ayounsi / @cmooney ]: this requires your input and also can you spare a /24 please :)
Motivation
This task is similar to when we moved ns2.wikimedia.org to anycast as part of the knams move in T343942; continuing with that, we are also moving ns1 to anycast from the current unicast setup.
The motivation for doing so is better performance and operational resiliency. Currently all requests for ns1 reach codfw. After this change, they should in most cases reach the DC closest to them, which improves latency. We also gain resiliency, as the IP will be advertised from all 16 DNS hosts across our sites instead of the current three in codfw. T98006 is the long-running task that talks about the benefits of doing this and has the changes for nsa.wikimedia.org, which was our first anycast nameserver but wasn't actually being used.
How
Puppet / bird
The operational side of this change is fairly easy to undertake because of the work we performed in T347054 and T343942. Currently, the ns1 configuration looks like:
    profile::bird::advertise_vips:
      ns1.wikimedia.org:
        address: 208.80.153.231 # ns1 IP, unicast
        check_cmd: '/usr/local/bin/check_authdns_ns1_state /usr/lib/nagios/plugins/check_dns_query -H 208.80.153.231 -a -l -d www.wikipedia.org -t 1'
        ensure: present
        service_type: authdns-ns1
We will need to move this to hieradata/role/common/dnsbox.yaml instead so that it can be picked up by all hosts. We will also need to set skip_loopback for codfw in hieradata/common.yaml as bird will now take care of those.
conftool-data/node/<site>.yaml will also need to be updated for the relevant service bit (authdns-ns1) to be present everywhere and not just in codfw.
We have tested this for ns2 and for the refactoring done in T347054 so we don't expect any surprises there.
homer
We will need to update config/sites.yaml and add the new IP to bgp_out at the very least. I am not sure about the other router configuration required here so will defer to netops on that.
wikimedia.org zone file / glue records
We will need to update the glue record for ns1 and coordinate with MarkMonitor for that. There should be no charge for this as per our last conversation but we will need to check this again.
    @    1D IN NS ns0
    @    1D IN NS ns1
    @    1D IN NS ns2
    ns0  1D IN A  208.80.154.238
    ns1  1D IN A  208.80.153.231
    ns2  1D IN A  198.35.27.27 ; anycasted authdns
Given the TTL here, and given that recursors do not always respect TTLs, we need to figure out the right order of operations so that we can support both the current ns1 IP and the future one for at least a week (?). See more below.
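As a rough sanity check on the one-week figure, the sketch below computes how many multiples of the record TTL a given overlap window covers. The 1D TTL comes from the zone snippet above; the function name is made up for illustration, the one-week window is still an open question, and the TTL of the glue record at the parent zone is not modelled here.

```python
from datetime import timedelta


def overlap_in_ttls(ttl: timedelta, overlap: timedelta) -> float:
    """Return how many TTL periods fit into the overlap window."""
    if ttl <= timedelta(0):
        raise ValueError("ttl must be positive")
    # Dividing two timedeltas yields a plain float ratio.
    return overlap / ttl


zone_ttl = timedelta(days=1)   # "1D" from the wikimedia.org zone file
overlap = timedelta(weeks=1)   # proposed window, not yet decided
ratio = overlap_in_ttls(zone_ttl, overlap)
print(f"overlap covers {ratio:.0f}x the record TTL")
```

A large multiple only helps with recursors that overrun the TTL by a bounded amount; it says nothing about resolvers that pin an address indefinitely.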
Measuring performance improvement
This time around, since we are doing a more controlled move of a nameserver to anycast, we plan to measure the performance improvement that might result from this change. We plan to make use of the dns metric. It is not a perfect measure, as it can be influenced by any part of the DNS lookup (so anywhere between the browser and the nameserver) and not just the path between the stub/recursor and our nameserver. But given that I am not aware of a better metric, this should be fine for what we are trying to do here.
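One possible way to compare the metric before and after the move is to look at a few percentiles rather than the mean, since lookup times tend to be long-tailed. This is only a sketch: it assumes the dns metric can be exported as a plain list of lookup durations in milliseconds, and the function names and the choice of p50/p75/p95 are illustrative, not an existing tool.

```python
from statistics import median, quantiles


def summarize(samples_ms):
    """Return (p50, p75, p95) for a list of DNS lookup times in ms."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 74 is p75 and index 94 is p95.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return median(samples_ms), cuts[74], cuts[94]


def compare(before_ms, after_ms):
    """Return after-minus-before deltas per percentile (negative = faster)."""
    before = summarize(before_ms)
    after = summarize(after_ms)
    names = ("p50", "p75", "p95")
    return {name: a - b for name, b, a in zip(names, before, after)}
```

Calling `compare(before_samples, after_samples)` on windows of equal length (for example one week on each side of the glue change) would give a first indication. Because the metric covers the whole lookup path, small deltas should be read with caution.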
Order of changes
- assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this
- make all the Puppet/confctl/bird changes so that we can start advertising the new IP from all hosts except those in codfw
- start announcing the new /24 from all sites except codfw
- update the zone file and glue records for ns1
- after one week (or more?) has passed, remove all traces of 208.80.153.231 and start announcing the new IP from codfw as well
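The ordering above can be checked against one invariant: at every step, any IP that resolvers might still be using must be announced from at least one site. The sketch below models that under stated assumptions. `NEW_IP` is a documentation placeholder (RFC 5737) since the real /24 is not assigned yet, and the phase names and the `unreachable` helper are made up for illustration.

```python
OLD_IP = "208.80.153.231"  # current unicast ns1 address, from this task
NEW_IP = "192.0.2.1"       # placeholder only; the real address is not assigned yet

# "announced": IPs advertised by at least one site in this phase.
# "in_use": IPs resolvers may query (published records or cached copies).
PHASES = [
    {"name": "today",
     "announced": {OLD_IP}, "in_use": {OLD_IP}},
    {"name": "announce new /24 from non-codfw sites",
     "announced": {OLD_IP, NEW_IP}, "in_use": {OLD_IP}},
    {"name": "update zone file and glue",
     "announced": {OLD_IP, NEW_IP}, "in_use": {OLD_IP, NEW_IP}},
    {"name": "overlap window has passed",
     "announced": {OLD_IP, NEW_IP}, "in_use": {NEW_IP}},
    {"name": "remove old IP, codfw announces new IP",
     "announced": {NEW_IP}, "in_use": {NEW_IP}},
]


def unreachable(phases):
    """Return (phase name, missing IPs) for phases that break the invariant."""
    problems = []
    for phase in phases:
        missing = phase["in_use"] - phase["announced"]
        if missing:
            problems.append((phase["name"], missing))
    return problems


print(unreachable(PHASES) or "every in-use IP is announced in every phase")
```

The model makes it easy to see why withdrawing 208.80.153.231 before the overlap window has passed would break the invariant: the old IP would still be in use with no site announcing it.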