User Details
- User Since
- Apr 3 2017, 6:23 PM (378 w, 6 d)
- Availability
- Available
- IRC Nick
- xionox
- LDAP User
- Ayounsi
- MediaWiki User
- AYounsi (WMF) [ Global Accounts ]
Today
Fri, Jul 5
Enable/test BFD between Ganeti and its VMs
Adding the BFD statement works fine for v4, but on the hypervisor side I don't think it can be added for v6 in the current state of things.
Reopening because of T369341: Some VRTS emails to Gmail accounts fail the SPF check
Thu, Jul 4
Possible, but complicated and fuzzy as to the effects on different scenarios. If we "weight" a single anycast/24 by loading up several distinct NS IPs from it, it also has an outsized negative impact anytime the unicast fallback was really needed (if the anycast isn't working from this client cache POV). Also, doesn't scale well, as it starts inflating response packet sizes for NS lists (esp in a DNSSEC future), and causing general concerns about amplification.
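The "weighting" tradeoff above can be sketched numerically, assuming a resolver that picks one NS record uniformly at random (the names and counts below are hypothetical, not our real NS set):

```python
# "Weighting" an anycast /24 means listing several NS records whose
# addresses all fall in that prefix; a resolver picking an NS record
# uniformly at random then lands on anycast more often.

def anycast_pick_probability(ns_records: dict) -> float:
    """Fraction of uniform-random NS selections that hit anycast.

    ns_records maps an NS name to True if its address is anycast.
    """
    anycast = sum(1 for is_anycast in ns_records.values() if is_anycast)
    return anycast / len(ns_records)

# One anycast NS out of three: picked about a third of the time.
base = {"ns0": False, "ns1": False, "ns2": True}

# Duplicating anycast addresses as extra names "weights" it up...
weighted = {**base, "ns3": True, "ns4": True}

print(anycast_pick_probability(base))      # 0.333...
print(anycast_pick_probability(weighted))  # 0.6

# ...but each extra NS name also inflates the NS RRset in referral
# responses (worse once DNSSEC signatures are added), and if anycast
# is unreachable from a given resolver's vantage point, 60% of its
# picks now fail before retrying instead of 33%.
```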
Thanks, that makes sense. I'm wondering how it would compare with the other proposal in terms of cost/complexity/latency/availability.
The above patch should work around the issue for v6 (based on @cmooney's testing).
Import script fixed and orphaned IP for the now deleted sretest2005 removed.
Wed, Jul 3
All is done here.
Confirmed working:
Jul 3 08:47:14 install2004 dhcpd[3728660]: DHCPDISCOVER from aa:00:00:f4:44:8d via 10.192.6.6
Jul 3 08:47:14 install2004 dhcpd[3728660]: DHCPOFFER on 208.80.152.130 to aa:00:00:f4:44:8d via 10.192.6.6
Jul 3 08:47:15 install2004 dhcpd[3728660]: DHCPDISCOVER from aa:00:00:f4:44:8d via 10.192.6.6
Jul 3 08:47:15 install2004 dhcpd[3728660]: DHCPOFFER on 208.80.152.130 to aa:00:00:f4:44:8d via 10.192.6.6
Jul 3 08:47:17 install2004 dhcpd[3728660]: DHCPREQUEST for 208.80.152.130 (208.80.153.105) from aa:00:00:f4:44:8d via 10.192.6.6
Jul 3 08:47:17 install2004 dhcpd[3728660]: DHCPACK on 208.80.152.130 to aa:00:00:f4:44:8d via 10.192.6.6
Jul 3 08:47:18 install2004 atftpd[479]: Serving lpxelinux.0 to 208.80.152.130:13292
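The dhcpd log above is a complete DORA exchange (two DISCOVER/OFFER rounds, then REQUEST/ACK, followed by the TFTP fetch). A quick sketch to pull the message sequence out of such a log (the sample lines are the ones above, trimmed to just the fields the parser uses):

```python
import re

# The dhcpd messages from the log above, one per line, with the
# syslog timestamp/host/pid prefix stripped for brevity.
LOG = """\
DHCPDISCOVER from aa:00:00:f4:44:8d via 10.192.6.6
DHCPOFFER on 208.80.152.130 to aa:00:00:f4:44:8d via 10.192.6.6
DHCPDISCOVER from aa:00:00:f4:44:8d via 10.192.6.6
DHCPOFFER on 208.80.152.130 to aa:00:00:f4:44:8d via 10.192.6.6
DHCPREQUEST for 208.80.152.130 (208.80.153.105) from aa:00:00:f4:44:8d via 10.192.6.6
DHCPACK on 208.80.152.130 to aa:00:00:f4:44:8d via 10.192.6.6
"""

def dhcp_message_types(log: str) -> list:
    """Extract the DHCP message type from each log line."""
    return re.findall(r"^(DHCP[A-Z]+)", log, flags=re.MULTILINE)

# Ends with REQUEST/ACK, i.e. the lease was granted.
print(dhcp_message_types(LOG))
# ['DHCPDISCOVER', 'DHCPOFFER', 'DHCPDISCOVER', 'DHCPOFFER', 'DHCPREQUEST', 'DHCPACK']
```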
$ ip addr show dev tap1
23: tap1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 22:22:22:22:22:01 brd ff:ff:ff:ff:ff:ff
    inet 208.80.152.129/32 scope link tap1
       valid_lft forever preferred_lft forever
    inet6 fe80::2022:22ff:fe22:2201/64 scope link
       valid_lft forever preferred_lft forever
$ ip route show dev tap1
208.80.152.130 proto static scope link
Mon, Jul 1
Fri, Jun 28
Indeed, amazing! Just a few lines of code to replace multiple VMs and router policies :)
IPIP encapsulation is a necessary step in the right direction, whatever solution we decide on for load balancing, for the reasons mentioned by Cathal and Valentin. As a data point, the VXLAN license is an extra 100k for a 10-rack setup (plus yearly support).
Thu, Jun 27
Strictly on the network side, there is no blocker one way or the other.
Wed, Jun 26
Tue, Jun 25
Your proposal seems good to me.
Mon, Jun 24
Opened https://github.com/netbox-community/netbox/issues/16698 for a Netbox regression on how it handles Scripts compared to... 3.2.9
Fri, Jun 21
It's not possible to do the DB migration directly from 3.2.9 to 4.x. We need to do a pit-stop on 3.7.x.
Thu, Jun 20
Wed, Jun 19
Some notes before I forget: to make sre.deploy.python-code work I had to:
Mon, Jun 17
We had a quick look at the network side and couldn't find any smoking gun.
Of course! Not planning on doing it today :) The task is there so we don't forget.
Yeah, I think that's what I was trying to say with
We can also decide that batch means silently skipping any device that has a different diff, so we don't risk blocking the run in the middle if a device has local changes
Basically, decide whether the batch behavior is (3) or (4) and then stick to it. Four options seem a bit too many.
I tend to prefer (3), and would be OK with not supporting (4), especially as in a good state there should be no local changes.
I don't understand why they need to be moved to get upgraded to 10G. If we take wikikube-ctrl2001 for example, the switch in rack B6 has plenty of available, ready-to-use 10G ports (for example 44-47).
Can we move the cables instead of moving the servers?
It's necessary to do the diff on all target devices anyway, so that behavior is fine.
Jun 7 2024
Jun 6 2024
Jun 5 2024
Plan so far is to merge https://gerrit.wikimedia.org/r/1037784 to be able to have a puppetized test server compatible with the new deploy directory scheme (netbox-dev)
Then to merge https://gerrit.wikimedia.org/r/1038694 and check it out from /srv/deployment/netbox-dev/deploy
Then load a copy of the prod Netbox DB on the dev instance pbsql
Then run the deploy-python-code cookbook to get a working Netbox 4 setup (and fix any issue that could prevent it)
Then check if the DB migration went well
In parallel, merge https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/905570/ and its parent change to have better CI on netbox-extras ahead of fixing all the Netbox 4 breaking changes.
Then send/merge patches to fix those netbox-extras changes.
Then (non-blocking) update the sre.netbox.update-extras cookbook to account for those changes.
Then send Spicerack, Cookbooks and Homer patches to fix Netbox's breaking changes. Ideally by moving some of the Cookbook's Netbox API calls to Spicerack.
Jun 3 2024
Last time we rolled out this change, it was simply updating modules/install_server/files/autoinstall/common.cfg. Do you have any other place in mind where this might need to be reconfigured? I am personally for removing this completely but it's not a big deal and we can keep it around as well.
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Opengear_Serial_Consoles
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/ServerTech
Our engineering team has now indicated that the compact json is not supported, due to hardware limitations with respect to compact json formatting. The feature will be deprecated in Junos 24.4. So, please do not use compact json to export data.
Moving the dynamic part of the NTP definition to some automated system instead of humans or Puppet is a great idea :)
Human as in: right now, for network devices, the list is hard-coded at https://github.com/wikimedia/operations-homer-public/blob/master/config/common.yaml#L365
- i.e. one in Germany will pick ns0 rather than the lower-latency ns2
Seems like the main one is adguard-dns.com, which picks them randomly.
https://w.wiki/AGmr
I think the difficult part is knowing where to stop the overengineering; for example, it could make sense to use Liberica to healthcheck/advertise one of the NS anycast IPs, but it might not be worth using a different AuthDNS software on half the servers, or a different Puppet infra.
Before going full anycast we need to make sure we're covering all major failure scenarios, or alternatively make a call to keep some unicast, knowing that some places with broken/dumb implementations won't be the fastest, but that may be an OK tradeoff for better resiliency.
May 31 2024
It's quite interesting to see the variation of tradeoffs, and this can be quite (an important) rabbithole. Is the goal to figure it out before anycasting ns1, or to first anycast ns1 from everywhere and then figure out how to modify the setup for possibly better redundancy?
It could be useful to list all the failure scenarios and whether we need to mitigate them or not (server or network misconfig, Bird bug, etc.). In other words, are we putting too many of our eggs in the same basket?
May 30 2024
assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this
As we couldn't get a /24 from LACNIC for magru, we only have two free /24s
We have to decide between a few options:
- allocate a new whole /24 for ns1 right now
- Pro: quick turnaround, no added cost
- Con: risk of lacking public v4 IPs for future projects (eg. new PoPs) or core site growth; can be mitigated by applying for more prefixes in parallel, but there's no guarantee of getting them
- Apply (and pay) for more IPs at RIPE or ARIN (T288342) and wait for an allocation before anycasting ns1
- Pro: limited cost, more flexible on IP usage
- Con: long turnaround (months to years)
- Buy a /24 on resale market
- Pro: faster turnaround
- Con: higher cost
- Don't Anycast ns1
- Listing it only for the sake of completeness, but not preferred. Even though there are diminishing returns after anycasting ns2, we believe anycasting one more NS would bring performance improvements to users
- Use the DoH Anycast prefix for ns1
- Pro: quick turnaround, no added cost
- Con: risk of providers blocking ns1 as a side effect of blocking DoH, mitigated by having 2 other NS.
May 28 2024
The Typha firewall service is now based on firewall::service and does dynamic name resolution on the puppet server side, let's see if this improves things with the next rename.
The issue didn't happen again, but we also did the vlan move in addition to the rename (so the IP changed too).
May 27 2024
JTAC was able to confirm/duplicate the bug on 22.3R3-S2.4, they're escalating it to their engineering team.
May 24 2024
Opened JTAC case 2024-0524-163553
May 23 2024
sudo cookbook sre.network.tls --system lsw1-f8-eqiad
Before I forget, please notify DCops so they update the physical labels on the server.
May 22 2024
Sounds good! I'd recommend first doing a rename then a normal re-image, then just a move-vlan, then, on a different host, testing both actions one after the other.
Tested on an MX204 running Junos 21.2 and 22.4R3.25; the returned JSON is invalid...
Diff:
@@ -100,7 +100,7 @@
 {
 },
 "address-family" :
-{
+[{
 "address-family-name" : "inet",
 "mtu" : "4456",
 "max-local-cache" : "100000",
@@ -142,7 +142,7 @@
 "internal-flags" : "0x0"
 },
 "interface-address" :
-{
+[{
 "ifa-flags" :
 {
 "ifaf-current-preferred" : "[null]",
@@ -174,6 +174,7 @@
 }
 }
 }
+]
 },
 {
 "address-family-name" : "multiservice",
@@ -182,9 +183,8 @@
 {
 "internal-flags" : "0x0"
 }
-}
+}]
 }
 }
 }
 }
Valid:
{ "interface-information" : { "physical-interface" : { "name" : "xe-0/1/2", "admin-status" : "up", "oper-status" : "up", "local-index" : "164", "snmp-index" : "536", "description" : "Transit: Arelion (IC-) {#1071}", "link-level-type" : "Ethernet", "sonet-mode" : "LAN-PHY", "mtu" : "4470", "mru" : "4478", "source-filtering" : "disabled", "speed" : "10Gbps", "bpdu-error" : "none", "ld-pdu-error" : "none", "l2pt-error" : "none", "loopback" : "none", "if-flow-control" : "enabled", "if-speed-cfg" : "Auto", "pad-to-minimum-frame-size" : "Disabled", "if-device-flags" : { "ifdf-present" : "[null]", "ifdf-running" : "[null]" }, "ifd-specific-config-flags" : { "internal-flags" : "0x100200" }, "if-config-flags" : { "iff-snmp-traps" : "[null]", "internal-flags" : "0x4000" }, "if-media-flags" : { "ifmf-none" : "[null]" }, "physical-interface-cos-information" : { "physical-interface-cos-hw-max-queues" : "8", "physical-interface-cos-use-max-queues" : "8", "physical-interface-schedulers" : "0" }, "current-physical-address" : "f0:4b:3a:ef:7e:45", "hardware-physical-address" : "f0:4b:3a:ef:7e:45", "interface-flapped" : "2023-03-09 08:13:40 UTC (62w6d 04:14 ago)", "traffic-statistics" : { "input-bps" : "7184824", "input-pps" : "8233", "output-bps" : "85449600", "output-pps" : "8470" }, "active-alarms" : { "interface-alarms" : { "alarm-not-present" : "[null]" } }, "active-defects" : { "interface-alarms" : { "alarm-not-present" : "[null]" } }, "ethernet-pcs-statistics" : { "bit-error-seconds" : "3", "errored-blocks-seconds" : "3" }, "interface-transmit-statistics" : "Disabled", "logical-interface" : { "name" : "xe-0/1/2.0", "local-index" : "343", "snmp-index" : "555", "if-config-flags" : { "iff-up" : "[null]", "iff-snmp-traps" : "[null]", "internal-flags" : "0x4004000" }, "encapsulation" : "ENET2", "policer-overhead" : { }, "traffic-statistics" : { "input-packets" : "1184552371726", "output-packets" : "1206955771514" }, "filter-information" : { }, "address-family" : [{ 
"address-family-name" : "inet", "mtu" : "4456", "max-local-cache" : "100000", "new-hold-limit" : "100000", "intf-curr-cnt" : "1", "intf-unresolved-cnt" : "0", "intf-dropcnt" : "0", "address-family-flags" : { "ifff-rpf-check" : "[null]", "ifff-rpf-loose-mode" : "[null]", "ifff-sendbcast-pkt-to-re" : "[null]", "internal-flags" : "0x0" }, "interface-address" : { "ifa-flags" : { "ifaf-current-preferred" : "[null]", "ifaf-current-primary" : "[null]" }, "ifa-destination" : "80.239.192.64/30", "ifa-local" : "80.239.192.66", "ifa-broadcast" : "80.239.192.67" } }, { "address-family-name" : "inet6", "mtu" : "4456", "max-local-cache" : "75000", "new-hold-limit" : "75000", "intf-curr-cnt" : "2", "intf-unresolved-cnt" : "0", "intf-dropcnt" : "0", "address-family-flags" : { "ifff-rpf-check" : "[null]", "ifff-rpf-loose-mode" : "[null]", "internal-flags" : "0x0" }, "interface-address" : [{ "ifa-flags" : { "ifaf-current-preferred" : "[null]", "ifaf-current-primary" : "[null]" }, "ifa-destination" : "2001:2000:3080:a9a::/64", "ifa-local" : "2001:2000:3080:a9a::2", "interface-address" : { "in6-addr-flags" : { "ifaf-none" : "[null]" } } }, { "ifa-flags" : { "ifaf-current-preferred" : "[null]", "internal-flags" : "0x800" }, "ifa-destination" : "fe80::/64", "ifa-local" : "fe80::f24b:3aff:feef:7e45", "interface-address" : { "in6-addr-flags" : { "ifaf-none" : "[null]" } } } ] }, { "address-family-name" : "multiservice", "mtu" : "Unlimited", "address-family-flags" : { "internal-flags" : "0x0" } }] } } } }