Anycast ns1.wikimedia.org
Open, Medium, Public

Description

Introduction

This is the parent task for moving ns1.wikimedia.org (IP 208.80.153.231, currently announced from codfw) to anycast, so that we can announce it from all seven sites.

netops [ @ayounsi / @cmooney ]: this requires your input and also can you spare a /24 please :)

Motivation

This task is similar to when we moved ns2.wikimedia.org to anycast as part of the knams move in T343942; continuing with that, we are also moving ns1 to anycast from the current unicast setup.

The motivation for doing so is better performance and operational resiliency. Currently all requests for ns1 reach codfw -- after this change, they should in most cases reach the DC closest to them, improving latency, and the IP will be advertised from all 16 DNS hosts across our sites instead of the current three in codfw, improving operational resiliency. T98006 is the long-running task that discusses the benefits of doing this and has the changes for nsa.wikimedia.org, which was our first anycast nameserver but wasn't actually being used.

How

Puppet / bird

The operational side of this change is fairly easy to undertake because of the work we performed in T347054 and T343942. Currently, the ns1 configuration looks like:

hieradata/role/codfw/dnsbox.yaml
profile::bird::advertise_vips:
  ns1.wikimedia.org:
    address: 208.80.153.231 # ns1 IP, unicast
    check_cmd: '/usr/local/bin/check_authdns_ns1_state /usr/lib/nagios/plugins/check_dns_query -H 208.80.153.231 -a -l -d www.wikipedia.org -t 1'
    ensure: present
    service_type: authdns-ns1

We will need to move this to hieradata/role/common/dnsbox.yaml instead so that it can be picked up by all hosts. We will also need to set skip_loopback for codfw in hieradata/common.yaml, as bird will now take care of those addresses.
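Roughly, the end state should look something like this (a sketch only; 198.51.100.1 below is just a placeholder for the new anycast IP, which still needs to be allocated from the new /24):

hieradata/role/common/dnsbox.yaml
profile::bird::advertise_vips:
  ns1.wikimedia.org:
    address: 198.51.100.1 # placeholder for the new ns1 anycast IP
    check_cmd: '/usr/local/bin/check_authdns_ns1_state /usr/lib/nagios/plugins/check_dns_query -H 198.51.100.1 -a -l -d www.wikipedia.org -t 1'
    ensure: present
    service_type: authdns-ns1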

conftool-data/node/<site>.yaml will also need to be updated for the relevant service bit (authdns-ns1) to be present everywhere and not just in codfw.
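As a sketch (the hostname and cluster key below are illustrative; the point is simply that authdns-ns1 ends up in every dnsbox host's service list):

conftool-data/node/esams.yaml
esams:
  dnsbox:
    dns3001.wikimedia.org: [authdns-ns1]  # appended to the host's existing service list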

We have tested this for ns2 and for the refactoring done in T347054 so we don't expect any surprises there.

homer

We will need to update config/sites.yaml and add the new IP to bgp_out at the very least. I am not sure about the other router configuration required here so will defer to netops on that.
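As a placeholder sketch only (netops to confirm the exact structure and anything else the routers need; the prefix below stands in for the yet-to-be-allocated /24):

config/sites.yaml
bgp_out:
  - 198.51.100.0/24  # new ns1 anycast /24 (placeholder)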

wikimedia.org zone file / glue records

We will need to update the glue record for ns1 and coordinate with MarkMonitor for that. There should be no charge for this as per our last conversation but we will need to check this again.

@           1D  IN NS   ns0
@           1D  IN NS   ns1
@           1D  IN NS   ns2
ns0         1D  IN A    208.80.154.238
ns1         1D  IN A    208.80.153.231
ns2         1D  IN A    198.35.27.27 ; anycasted authdns

Given the 1D TTL here, and general inconsistencies in how recursors respect TTLs, we need to figure out the right order of operations so that we can serve both the current ns1 IP and the future one for at least a week (?). See more below.
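For illustration, the interim state could look something like this, with 198.51.100.1 standing in for the new anycast IP and 208.80.153.231 continuing to answer from codfw until the window has passed:

@           1D  IN NS   ns0
@           1D  IN NS   ns1
@           1D  IN NS   ns2
ns0         1D  IN A    208.80.154.238
ns1         1D  IN A    198.51.100.1 ; anycasted authdns (placeholder for the new IP); old unicast IP still answering
ns2         1D  IN A    198.35.27.27 ; anycasted authdns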

Measuring performance improvement

This time around, given that we are doing a more controlled move of a nameserver to anycast, we plan to measure the performance improvement that might result from this change. We plan to make use of the dns metric. This is not perfect by nature, as it can be influenced by all parts of the DNS lookup (anywhere between the browser and the nameserver) and not just strictly the path between the stub/recursor and our nameserver. But since I am not aware of a better metric, it should be fine for what we are trying to do here.
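For reference (assuming the usual Navigation Timing definition applies here), the dns number is roughly:

dns = domainLookupEnd - domainLookupStart

i.e. the whole lookup as seen by the browser; our authdns only contributes to it when the client's recursor has a cache miss.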

Order of changes

  • assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this
  • make all the Puppet/confctl/bird changes so that we can start advertising the new IP from all hosts except the hosts in codfw
  • start announcing the new /24 from all sites (not the codfw hosts)
  • update the zone file and glue records for ns1
  • after one week (or more?) has passed, remove all traces of 208.80.153.231 and start announcing the new IP from codfw as well

Event Timeline

ssingh triaged this task as Medium priority. May 29 2024, 3:32 PM

assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this

As we couldn't get a /24 from LACNIC for magru, we only have two free /24s
We have to decide between a few options:

  1. allocate a new whole /24 for ns1 right now
    • Pro: quick turnaround, no added cost
    • Con: risk of lacking public v4 IPs for future projects (e.g. new PoPs) or core site growth; can be mitigated by applying for more prefixes in parallel, but with no guarantee of getting them
  2. Apply (and pay) for more IPs at RIPE or ARIN (T288342) and wait for an allocation before anycasting ns1
    • Pro: limited cost, more flexible on IP usage
    • Con: long turnaround (months to years)
  3. Buy a /24 on resale market
    • Pro: faster turnaround
    • Con: higher cost
  4. Don't Anycast ns1
    • Listing it only for the sake of completeness, but not preferred. Even though there is a diminishing return after anycasting ns2, we believe anycasting one more ns would bring performance improvements to users
  5. Use the DoH Anycast prefix for ns1
    • Pro: quick turnaround, no added cost
    • Con: risk of providers blocking ns1 as a side effect of blocking DoH, mitigated by having 2 other NS.

Some side notes looking at traffic usage https://w.wiki/AER8

  • A significant chunk of clients seem to be "smart" and send requests to the faster anycasted ns2, so even though it seems to make sense to anycast one more NS, I'm less sure about anycasting all 3 of them. To be weighed against the safety net of having at least one NS unicasted.
  • ns0 does a little bit more traffic than ns1, which might indicate that some "dumb" clients only use the first NS from the list returned by the authdns. It's not significant, but it might be a reason to anycast ns0 instead, or return ns1 before ns0, or return them in a random order.

assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this

As we couldn't get a /24 from LACNIC for magru, we only have two free /24s

On balance, given we would still have a free /24 afterwards, it is probably ok to allocate one of our /24s for ns1. That leaves us with one more /24 for another POP (or ns0). Given it's a finite resource with a potential cost to replace, we should probably make sure management are ok with it being used for this purpose too, especially in terms of any future POP buildouts.

While space is in short supply, it is available on the open market, at something like $12-15k for a /24. We can also investigate the slightly cheaper options through the RIRs if they are still viable.

  1. Don't Anycast ns1
    • Listing it only for the sake of completeness, but not preferred. Even though there is a diminishing return after anycasting ns2, we believe anycasting one more ns would bring performance improvements to users

Agreed. The stats you pulled indicate that many operators are doing as suggested in RFC4697, keeping tabs on the best-performing IP from a set of NS entries and preferring that. But resolver behaviour varies wildly, and we all know some don't follow the standards at all, so there are benefits, although diminishing, from anycasting them all.

  1. Use the DoH Anycast prefix for ns1
    • Pro: quick turnaround, no added cost
    • Con: risk of providers blocking ns1 as a side effect of blocking DoH, mitigated by having 2 other NS.

I think the risk of this is not huge (if someone blocked the whole range the other NS IPs would still be reachable). But that said I think right now we can possibly use a different range for ns1, and revisit this when we come to do ns0.

Re: anycast-ns1 and future plans, etc (I won't quote all the relevant bits from both msgs above):

  • Currently we still strongly prefer to avoid mixing any other service with the DoH anycast, due to the different potential censorship risk models. There might be a future world in some years where that's less of a concern, but for now it's better to keep it isolated.
  • In the future, we may experiment with operating anycast HTTP/[23] for the main high-traffic public endpoints as well, at least in a limited capacity for certain client networks or use-cases. How broadly-used that eventually becomes depends on a lot of future unknowns and experiments. If/when we do that, I think that could share one or both of the /24 we're using for authdns.
  • The current most-basic version of the forward-looking authdns plan is that after both ns1 and ns2 are successfully anycasting on separate allocations and we've had some time to validate that we're happy with how it's working, we'll withdraw the ns0 hostname from the records completely, rather than trying to anycast a third address. Getting rid of the last unicast helps with the client caches that aren't smart about selection, basically, and we don't see much gain on any front going from 2 to 3 anycasts, given our small site count / network footprint.
  • While just having disparate ns1+ns2 anycasts at all (disparate /24, but all announced from all) does mitigate some address-space/routing risks and give us some operational flexibility during incidents (to maybe, depending on some scenario, manually withdraw or prepend one or the other routes at various sites if we have to), we have some rough ideas about how maybe we could switch to A/B sets of sites and normally advertise only one of the two from each PoP to potentially gain greater operational resiliency and possibly some perf. We're kinda waiting to really dive deep about this one until we get past ns1-anycast existing in the first place, but it's a subject we'll want netops input on, obviously. We've only talked about this in some IRC chats before, but I recorded a summary of my current draft thinking on this subject: AuthDNS Anycast Sets
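(Purely as an illustration of the A/B idea, not the actual assignment, which is what the draft doc is for: something like

set A (ns1 /24) advertised from: eqiad, esams, eqsin, magru
set B (ns2 /24) advertised from: codfw, drmrs, ulsfo

so that from any given region the two NS IPs route to two different nearby sites.)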

assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this

Thanks for the detailed breakdown of the options!

As we couldn't get a /24 from LACNIC for magru, we only have two free /24s
We have to decide between a few options:

  1. allocate a new whole /24 for ns1 right now
    • Pro: quick turnaround, no added cost
    • Con: risk of lacking public v4 IPs for future projects (e.g. new PoPs) or core site growth; can be mitigated by applying for more prefixes in parallel, but with no guarantee of getting them
  2. Apply (and pay) for more IPs at RIPE or ARIN (T288342) and wait for an allocation before anycasting ns1
    • Pro: limited cost, more flexible on IP usage
    • Con: long turnaround (months to years)

Might be hard to answer but do we know if it is months or it can actually be years? Do they share some sort of an expectation on when the request will be completed?

  1. Buy a /24 on resale market
    • Pro: faster turnaround
    • Con: higher cost

Other than the cost, do we have some sort of established background for the process on your (netops) end? The last time Traffic set out to do this it was quite the adventure but also it's been a while so maybe things have changed.

  1. Don't Anycast ns1
    • Listing it only for the sake of completeness, but not preferred. Even though there is a diminishing return after anycasting ns2, we believe anycasting one more ns would bring performance improvements to users
  2. Use the DoH Anycast prefix for ns1
    • Pro: quick turnaround, no added cost
    • Con: risk of providers blocking ns1 as a side effect of blocking DoH, mitigated by having 2 other NS.

This is the only option on the list for which I would say no because of the con you mentioned. It's a risk that I don't think we can or should take.

Some side notes looking at traffic usage https://w.wiki/AER8

  • A significant chunk of clients seem to be "smart" and send requests to the faster anycasted ns2, so even though it seems to make sense to anycast one more NS, I'm less sure about anycasting all 3 of them. To be weighed against the safety net of having at least one NS unicasted.
  • ns0 does a little bit more traffic than ns1, which might indicate that some "dumb" clients only use the first NS from the list returned by the authdns. It's not significant, but it might be a reason to anycast ns0 instead, or return ns1 before ns0, or return them in a random order.

At least pdns-recursor seems to do this:

SyncRes::doResolveAt first shuffles the nameservers both randomly and on performance order. If it knows a nameserver was fast in the past, it will get queried first. More about this later.

What knot-resolver does is not well documented, so the best guess I have is from looking at lib/selection.c in the source code, where it seems to first try a server it has not seen before, so that it tries every option before going back to the last selected one. In any case, this is all speculation on my end in assuming that pdns-recursor and knot-resolver are the most commonly used ones, and we will need to dig deeper if we have to make a decision based on what they are doing. It might just be that the random sample returns ns0, but the performance preference overrules that once we turn on ns1/ns2, and the resolver then sticks to it.

At least pdns-recursor seems to do this:

Anecdotally, Bind seems to do the same: in a test this morning my local server went to ns2 626 times when I queried a bunch of wikis, and to ns1 and ns0 5 times each (from Ireland, so ns0/ns1 in the US would be worse for me).
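(For anyone who wants to eyeball the latency gap from their own vantage point, a rough loop like the one below works; note it only measures one-off query RTT to each server, not what a resolver's selection logic will actually do:)

for ns in ns0 ns1 ns2; do
  printf '%s: ' "$ns"
  dig +tries=1 +time=2 @"${ns}.wikimedia.org" www.wikipedia.org A | grep 'Query time'
done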

I still think it makes sense to Anycast, and announce all ranges from all POPs. But if the majority of recursors do this, it may make sense to announce them unequally: for instance, prepend one prefix at one POP and do the reverse at another. That may provide different network paths for resolvers talking to them, enabling them to select a "better" one than if all announcements were equal and thus routing the same. For a future discussion anyway.

On that future discussion topic (sorry I'm getting nerdsniped!) - Yeah, I had thought about prepending (vs the hard A/B cutoff) as well, but I tend to think it doesn't offer as much resiliency as the clean split.

For an example scenario:

  • Suppose the client cache in question exists in Tunisia, and the only "relevant" sites are esams+drmrs (everything else is way further away AS-path-wise). Suppose for this client, the path to drmrs is 2 hops closer than esams, so that's their default all else being equal.
  • Now suppose the drmrs option has a byzantine failure mode of some kind from the client's POV: they're still getting the advert indirectly for drmrs, and it still has the shortest natural AS path, but actual traffic between the sites is broken or unreliable, possibly in only one direction, due to random Internet $issues. Or: the Internet itself is fine, but our software stack did something stupid, and we're advertising from drmrs but our actual authdns service there is totally borked and unresponsive. How does this play out in various deployment scenarios?
    • All Equal (ns1+ns2 advertised equally from everywhere): the Tunisia client gets dead authdns, game over.
    • Prepend x1 (ns1 prepends +1 hop from esams, and ns2 prepends +1 hop from drmrs): Not enough to change anything, still game over.
    • Prepend x2 (as above with +2s): ns1 is still drmrs. ns2 now has equal lengths to both, who knows depending on other things?
    • Prepend x3 (as above with +3s): now ns2 will clearly prefer esams and still work, client saved?
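(Putting illustrative numbers on it: say this client's AS path to drmrs is 3 hops and to esams is 5, and ns2 is the prefix being prepended at drmrs:)

              ns2 via drmrs   ns2 via esams   ns2 lands at
no prepend    3               5               drmrs (broken)
prepend x1    4               5               drmrs (broken)
prepend x2    5               5               tie, depends on other tie-breakers
prepend x3    6               5               esams (works)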

Of course the prepend amounts needed to get to a working solution would vary for different client scenarios. In a case like this one, with nearby sites, maybe even a fixed value like 3 might work fairly well and fairly universally. However, in a case like magru or eqsin where the next site is very far away, I imagine it could take a lot of prepend to make this work reliably for every edge case? Regardless of the "how much prepend?" question, though: we can only control it this way in one direction. Our prepends will affect their traffic reaching us, but we have less control over the return path unless we manually engineer that for every case. Both directions have to work.

If we do the clean A/B split with the above failure scenario: ns1->drmrs dies from the client POV, and ns2->esams continues working fine, no matter the path length diffs or return path questions, etc, because we're never offering both options from the same site at all in the first place. Basically we're offering them 2x NS IPs they can try which take diverse paths to diverse sites with hopefully-diverse failure risks.

It's quite interesting to see the variation in tradeoffs, and it can be quite (an important) rabbit hole. Is the goal to figure this out before anycasting ns1, or to first anycast ns1 from everywhere and then figure out how to modify the setup for possibly better redundancy?
It could be useful to list all the failure scenarios (server or network misconfig, Bird bug, etc.) and whether we need to mitigate them or not. In other words, are we putting too many of our eggs in the same basket?

On the AS path length, we're quite well connected. From https://stat.ripe.net

[Screenshot: RIPEstat data for AS14907, 2024-05-31]

Might be hard to answer but do we know if it is months or it can actually be years? Do they share some sort of an expectation on when the request will be completed?

There is more info at https://www.arin.net/resources/guide/ipv4/waiting_list/ and https://www.ripe.net/manage-ips-and-asns/ipv4/ipv4-waiting-list/; tl;dr: less than 2 years.

Other than the cost, do we have some sort of established background for the process on your (netops) end? The last time Traffic set out to do this it was quite the adventure but also it's been a while so maybe things have changed.

Not for me at least; services like https://ipv4connect.com/buy-ipv4 seem to make it easy, but I have no first-hand feedback.

It's quite interesting to see the variation in tradeoffs, and it can be quite (an important) rabbit hole. Is the goal to figure this out before anycasting ns1, or to first anycast ns1 from everywhere and then figure out how to modify the setup for possibly better redundancy?
It could be useful to list all the failure scenarios (server or network misconfig, Bird bug, etc.) and whether we need to mitigate them or not. In other words, are we putting too many of our eggs in the same basket?

For the redundancy, and while I can't speak on Brandon's behalf, the idea at least per my understanding was to look into this after ns1 is on the anycast IP. In doing so, we will also have a better idea of this move because we plan to measure the performance metrics around DNS lookups as reported by navigation timing. And then once we have some data, discuss whether we want to extend this further.

The bit I wanted to focus on was the dependency on Bird/BGP in general. That's something that I have also thought about, and it is true that we are putting more and more things under it. That does worry me a bit as well. On the flip side though, a lot of stuff is already under bird, including even the current ns0 and ns1 advertisements, where we are using bird to do unicast announcements for them. So for me, nothing really changes here: we will be doing anycast announcements for the ns1 IP instead of the current unicast one. That does say something about the dependency on bird though, and it makes upgrades or any other work around it trickier. Perhaps in that sense we can spread this a bit and have ns1 behind Liberica instead; I can't seem to put my finger on which is better, but that can also depend on the timing, since Liberica is a bit further away, and if we wanted to do this sooner (ignoring the /24 for a second even though that's a blocker), then maybe sticking to bird is the way to go here.

Regardless, I guess we should look into strengthening the tooling and checks around our bird setup, which now serves the nameservers (all three), internal recursors, Wikimedia DNS (and check service), and the NTP for the Debian installer. That's a lot!

On the AS path length, we're quite well connected. From https://stat.ripe.net

[Screenshot: RIPEstat data for AS14907, 2024-05-31]

Might be hard to answer but do we know if it is months or it can actually be years? Do they share some sort of an expectation on when the request will be completed?

There is more info at https://www.arin.net/resources/guide/ipv4/waiting_list/ and https://www.ripe.net/manage-ips-and-asns/ipv4/ipv4-waiting-list/; tl;dr: less than 2 years.

Other than the cost, do we have some sort of established background for the process on your (netops) end? The last time Traffic set out to do this it was quite the adventure but also it's been a while so maybe things have changed.

Not for me at least; services like https://ipv4connect.com/buy-ipv4 seem to make it easy, but I have no first-hand feedback.

Thanks for sharing both of the above; something to keep in mind depending on the direction we take.

Yeah, my general thinking was to get ns1-anycast going first, and then figure out any of the above about better resiliency before we consider withdrawing ns0-unicast completely.

It's quite interesting to see the variation in tradeoffs, and it can be quite (an important) rabbit hole. Is the goal to figure this out before anycasting ns1, or to first anycast ns1 from everywhere and then figure out how to modify the setup for possibly better redundancy?

Certainly worth teasing out the exact plan and any potential tradeoffs before we do ns1. I've actually been thinking a little more on it, reflecting on the fact that:

  • Most major resolvers/dns providers appear to be 'smart' and pick the lowest-latency server (as per RFC4697)
  • If we Anycast them equally from all sites no resolvers will see any difference to any given IP
  • If there is then some odd fault at the closest POP to a given resolver, it could result in total failure of authdns for them

Which makes me think that perhaps the status quo is actually kind of good? Or we could even decide to:

  • Leave ns0 only in eqiad
  • Leave ns1 only in codfw
  • Anycast ns2, but only from POPs

This means that:

  • From any given place on the internet, the 3 NS IPs go to 3 different locations
  • As ns2 is Anycast, even if they are far from a core site the latency to it should be reasonable
  • Most resolvers act "smart" and will pick the lowest latency one
  • For those that don't, not a major disaster as it's one request every 5 min per wiki
  • If there is a problem at any particular site there are always 2 other sites the resolver can try to reach to get an answer

If most resolvers are "smart" it changes the equation in my mind. It means Anycasting just one of the IPs ensures optimal latency for most users. The trade-off is between slightly more latency for users behind a resolver which is not "smart" - e.g. one in Germany which will pick ns0 rather than the lower-latency ns2 - and the additional redundancy that the IPs routing to different sites gives in a failure scenario.

Yes, from a resiliency POV, in some senses keeping unicasts in the mix is an answer (and it's the answer we currently rely on). In a world with only very smart and capable resolvers, the simplest answer probably is the current setup. And indeed, not-advertising ns2 from the core DCs would be a very slight resiliency win over that.

However, it's definitely not the case that all resolvers are smart. We don't even know what awful/custom/hacked-up resolver/cache implementations many of the world's ISPs are using at any given time or what bugs they have, especially when you get out into the edge cases. What's the resolver behavior for the DNS cache used by all the clients of the 4th-most-popular ISP in Turkmenistan? It could be some ancient unpatched closed-source software from 20 years ago running on a Windows XP box for all I know. We'll never know this or a thousand other such cases. In the worst cases, some of the old/bad resolvers could even roll through all the available nsX IPs in a round-robin fashion and actually take timeouts between failures instead of doing them in parallel.

So, generally, the ideal state is that all the IPs of all the NS records we publish should be reliable and performant from everywhere. Site failures on our end are going to happen, random link problems around the internet are going to happen, and bad resolver behaviors are going to happen, but the closer we are to the ideal state under all conditions, the better. It's especially nice if we can maintain the best conditions we can even when one site (possibly the closest) is unreachable or has a dns software issue.

Anycast is a win in general because it solves a lot of reachability and perf issues for a single nsX IP. They have several potential paths to that IP hosted at diverse sites that don't share failure risks. So long as "this client can reach siteX directly" and "siteX is functioning" and "this client sees siteX as the best anycast route" all coincide all the time, it's all good, at least for that one authdns IP. However, in some other senses, that anycasted nsX IP is still a SPOF just like the unicast ones: if whatever site anycast routing has them locked onto is failing/unreachable-but-still-the-winning-advert, then this IP fails for this client, even though there were other functioning sites advertising this IP that they could have hypothetically reached successfully.

Unicast can be (and is currently!) the backup for that, but if the unicasts are only in the US, they're quite distant and thus less-reliable/performant in places far from the US. And for dumb resolvers, they're an active hindrance. You could imagine expanding to, say, 7x unicast nsX IPs, one at each site, but then the performance and resiliency can get terrible for dumb resolvers that are rotating through the whole global set, and the list doesn't scale well as we keep adding sites.

The ideal implementation is to have as many independent (as in, separate /24s) nsX IPs in your NS list as you can reasonably make use of (and still fit in a reasonable DNS response packet!), and have each of them be anycasted from a diverse, distinct array of independent sites scattered around the globe. This is how the major authdns operators work: the root servers, the major TLD servers, the major DNS service providers, etc. For the most part, though, they're using far more edge sites than we are. The roots cap their nsX list sizes based on response-packet-size concerns, but the pragmatic limit beneath that is there's little point having more nsX IPs than you have available sites in a typical region.

With our current setup of merely 7 sites, and various regions having ~1-2 (but 3 in the US only) nearby sites, going beyond 2x anycast IPs split up something like I'm suggesting in https://etherpad.wikimedia.org/p/authdns-anycast-sets is our closest approximation of this. From every client cache's POV, we have a truly-failure-independent pair of IPs, which are probably routing to (at least a close approximation of) the best pair of nearby sites we have available in our network, and in the byzantine case of one authdns IP advertising-but-failing from a given client's POV, the other still has completely-independent odds of success and is likely the next-best choice anyways.

authdns is one of those things you really want to overengineer. It underlies basically-everything, and anytime it doesn't work for somebody in the world, nothing else works either and all the rest of our redundancy plans at all the other layers are for naught.

I think the difficult part is where to stop the overengineering; for example, it could make sense to use Liberica to healthcheck/advertise one of the NS anycast IPs, but it might not be worth using a different AuthDNS software on half the servers, or a different Puppet infra.
Before going full anycast we need to make sure we're covering all major failure scenarios, or alternatively make a call to keep some unicast, knowing some places with broken/dumb implementations won't be the fastest, but that may be an ok tradeoff for better resiliency.

If I understand correctly, we can put the clients into 3 main categories:

  • smart, that picks an NS based on latency; 1 anycast is enough
  • dumb, that only picks the first NS from the list we send them (and maybe falls back to the next if no reply); here we should either return an anycast IP as the first record, or return them randomly
  • dumb, that always randomly picks an NS; here the more anycast IPs we return, the better for the client, as it reduces the odds of landing on a unicast real server far away, so going from 1/3rd anycast to 2/3rd anycast should be a good improvement

For the last case, random thought, but maybe we could tilt the odds by having NS3, NS4, NS5 from the same anycast ranges, while keeping NS0 (and/or NS1) as a last-hope backup unicast. No idea if the client would try them all though, or give up after a few.

For those that don't, not a major disaster as it's one request every 5 min per wiki

That's one point I'd like to understand better, being not that familiar with authdns. How important is it to always be the fastest? What's the real impact on users, as they (I guess) usually use their ISP/Google/CF resolvers/caches? Could we use NEL data to figure out the impact when we make any design change?

  • i.e. one in Germany which will pick ns0 rather than lower latency ns2

Seems like the main one is adguard-dns.com, which picks them randomly.
https://w.wiki/AGmr

We can't really afford to email all the DNS providers, but maybe a few of them? Or it could be a nice side project to automate it all :)

I think the difficult part is where to stop the overengineering; for example, it could make sense to use Liberica to healthcheck/advertise one of the NS anycast IPs, but it might not be worth using a different AuthDNS software on half the servers, or a different Puppet infra.

I guess it's a matter of taste really. Elegant/Simple wins, all other things being equal. But either way, the internal concerns are somewhat separate from the external POV about which adverts are coming from where.

Before going full anycast we need to make sure we're covering all major failure scenarios, or alternatively make a call to keep some unicast, knowing some places with broken/dumb implementations won't be the fastest, but that may be an ok tradeoff for better resiliency.

If I understand correctly, we can put the clients into 3 main categories:

  • smart, that picks an NS based on latency; 1 anycast is enough
  • dumb, that only picks the first NS from the list we send them (and maybe falls back to the next if no reply); here we should either return an anycast IP as the first record, or return them randomly
  • dumb, that always randomly picks an NS; here the more anycast IPs we return, the better for the client, as it reduces the odds of landing on a unicast real server far away, so going from 1/3rd anycast to 2/3rd anycast should be a good improvement

There are probably more scenarios than that, but those are good models for the broad behaviors to expect. Note we can't really control any explicit ordering in a meaningful way. In some cases that works out, but in others it just won't (even caches can randomly re-order RR-sets; in general, RR-sets are un-ordered sets).

For the last case, random thought, but maybe we could tilt the odds by having NS3, NS4, NS5 from the same anycast ranges, while keeping NS0 (and/or NS1) as a last-hope backup unicast. No idea if the client would try them all though, or give up after a few.

Possible, but complicated and fuzzy as to the effects on different scenarios. If we "weight" a single anycast/24 by loading up several distinct NS IPs from it, it also has an outsized negative impact anytime the unicast fallback was really needed (if the anycast isn't working from this client cache POV). Also, doesn't scale well, as it starts inflating response packet sizes for NS lists (esp in a DNSSEC future), and causing general concerns about amplification.

For those that don't, not a major disaster as it's one request every 5 min per wiki

That's one point I'd like to understand better, being not that familiar with authdns. How important is it to always be the fastest? What's the real impact on users, as they (I guess) usually use their ISP/Google/CF resolvers/caches? Could we use NEL data to figure out the impact when we make any design change?

Stats stuff

All numbers here to be taken with a huge grain of salt!

On the numeric front: It's generally going to be ~2 reqs every 5 minutes per client cache, at least for the heavy caches, for dyna.wikimedia.org and upload.wikimedia.org. There are other hostnames involved, but they're either low-rate meta-stuff, or they're the CNAMEs into those with ~1d TTLs, and so not very statistically significant. Some client caches are very-broadly shared, and some are relatively-small. The average reqrate the past week is ~12K RPS. If you just kind of go with these rough assumptions and math it out, then the traffic level leads to an estimate of ~1.8 million distinct caches hitting us (and who knows the implementation/behavior breakdowns within that).

Our stats say we get about 2B unique visitors in a month, so if you divide those you get an average of ~1.1K humans served by each cache. If you assume our uniques are off-base and everyone hits wikipedia occasionally, Internet population estimates are more like 5.4B, which would give closer to 3K humans/cache. So, yes, there's some amortization in play here, especially with the larger caches that probably serve a whole lot more than the average count of humans. Less so with the smaller ones, in some cases down to a single user or household.

You could also take another angle on estimating related things: ~12K RPS of global authdns lookup rate, vs ~135K RPS of global http reqrate into the caches, which implies a ratio of about 11.25 http reqs per authdns req. Another related lens: "pageviews" the past month worked out to an average ~9334 pageviews/sec, so the ratio there is ~1.25 DNS lookups per pageview.
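(Spelled out, the rough arithmetic behind those numbers is:)

per-cache rate    ≈ 2 reqs / 300 s          ≈ 0.0067 req/s
distinct caches   ≈ 12,000 req/s / 0.0067   ≈ 1.8M
humans per cache  ≈ 2B / 1.8M               ≈ 1.1K   (or 5.4B / 1.8M ≈ 3K)
http reqs per dns ≈ 135K / 12K              ≈ 11.25
dns per pageview  ≈ 12,000 / 9,334          ≈ 1.25-1.3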

There's a lot of junk noise in all these calculations: not all req sources are human-shaped, and not all DNS requests are legitimate, etc. Still, it does paint a picture that DNS lookup perf is probably more impactful than it would seem at first glance (which is: not very, given caching).

Measurement

On the measurement front: we have the RUM metrics from perf-team that break down stages like dns->tls->etc. I believe @ssingh was going to take a look at them in the wake of turning up ns2 in magru to see if we can get a fresh regional look there. We looked back at these for when ns2-anycast first went online in esams as well, and there looked to be evidence of small reductions in DNS lookup latency, but it's a little murky in the data.

More important than shaving a couple of averaged-out milliseconds, though, is reliability. The closer all of the NS list is to the client cache on the network, the better. The more-diverse our multiple NS records are from each other on failure risk from one client's POV, the better. Again, this is why the major authdns networks operate the way they do. The most extreme case is the root servers: they run 13 distinct nsX hostnames/IPs at [a-m].root-servers.net. Each of those is anycasted from tons of sites all over the planet, and each is distinct. You can see the map here for the K-root anycast set. Whereas at the more typical big-commercial-dns scale, Cloudflare's Foundation DNS has fewer NS records, but a similar buildout philosophy: anycast them all, but make sure you have 2+ distinct anycast ranges and sets of independent sites.

IMHO, the A/B set solution with a pair of anycasts is the most elegant and simple way to achieve the best balance of resiliency and perf for our authdns. It avoids complexity and maximizes the distinction between the failure risks of the sets. We get there in stages, though, and we keep looking at things as we go. Step 1 is to even have the second anycast defined and working similarly to the current ns2 setup.

IMHO, the A/B set solution with a pair of anycasts is the most elegant and simple way to achieve the best balance of resiliency and perf for our authdns.

I think this (as set out in the Etherpad) seems like a reasonable approach. It seems like a good compromise between 'maximum resiliency' as I set out above (1 anycast), and 'minimal latency everywhere' (3 anycasts), while netting us some benefit for users behind "smart" resolvers.

Each of those is anycasted from tons of sites all over planet

Unfortunately, an issue for our own Anycasting is the limited number of points of presence we have compared to, say, the root operators or the Cloudflares of the world.

We can't really afford to email all the DNS providers, but maybe a few of them ? Or it could be a nice side project to automate it all :)

While there is no harm in maybe contacting this one operator, in general I think it's not a road we should go down. Given the behaviour is likely to do with the software they are running I'd not be that confident they could "flip a switch" to fix it anyway.

Let's work out the best compromise of tradeoffs as we see it, and live with whatever inefficiencies we end up with. There is definitely no perfect answer here.

Possible, but complicated and fuzzy as to the effects on different scenarios. If we "weight" a single anycast/24 by loading up several distinct NS IPs from it, it also has an outsized negative impact anytime the unicast fallback was really needed (if the anycast isn't working from this client cache POV). Also, doesn't scale well, as it starts inflating response packet sizes for NS lists (esp in a DNSSEC future), and causing general concerns about amplification.

Thanks, that makes sense. I'm wondering how it would compare with the other proposal in terms of cost/complexity/latency/availability.

To expand on my idea:

  • As the NS records are always returned sequentially (P65828), set the anycast prefix first in the list, either by making ns2 first, or by changing the ns0 IP to be in the anycast range.
  • Keep the two eqiad/codfw unicast IPs for redundancy in case the anycast prefix is unreachable (globally or locally)
  • Add ns3 and ns4 from the same anycast range as our current ns2

So for example:
ns0: 198.35.27.27
ns1: eqiad unicast
ns2: codfw unicast
ns3: 198.35.27.28
ns4: 198.35.27.29

That way:

  • Clients that pick the records sequentially will use the anycast and fall back to unicast if there is any issue
  • Clients that pick them randomly will pick an anycast prefix 3/5ths of the time (so almost 2/3rds, but this can be changed by only keeping one unicast, or more/less anycast) and fall back to unicast if there is any issue (but the fallback might take a bit longer?)
  • Clients that pick the fastest one won't see a difference compared to now

Just a thought, but maybe moving the anycast one to the top of the list is good low-hanging fruit?