
We are using Talos Linux to set up Kubernetes clusters in VMware. It works fine on one cluster on one VMware host, but on another everything works except DNS inside pods/containers.

I've enabled DNS on all hosts, and talosctl "get resolvers" and "get dnsupstream" both return the correct DNS data:

NODE             NAMESPACE   TYPE             ID          VERSION   RESOLVERS
192.168.130.82   network     ResolverStatus   resolvers   2         ["10.203.32.2","10.203.32.3"]

talosctl -n 192.168.130.${ip} -e 192.168.130.${ip} get dnsupstream 
NODE             NAMESPACE   TYPE          ID            VERSION   HEALTHY   ADDRESS
192.168.130.80   network     DNSUpstream   10.203.32.2   1         true      10.203.32.2:53
192.168.130.80   network     DNSUpstream   10.203.32.3   1         true      10.203.32.3:53

Yet when I fire up a "curlimages/curl" pod and curl an internal server, it works by IP, but resolving the hostname does not:

~ $ curl http://10.203.32.90:32005
RABBITMQ API 1.0.15~ $
~ $ curl http://rabbit.domain.com:32005
curl: (6) Could not resolve host: rabbit.domain.com
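For reference, the throwaway curl pod can be started with something along these lines (the pod name curl-test is arbitrary):

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- sh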

According to the docs, resolv.conf should contain 10.96.0.9 on all nodes, and it does:

talosctl read /system/resolved/resolv.conf
nameserver 10.96.0.9
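It is also worth comparing that with what the pod itself sees; from inside the test pod:

~ $ cat /etc/resolv.conf

which should list the cluster DNS service IP (10.96.0.10 in this cluster) as the nameserver, not the host's 10.96.0.9.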

This works on another Talos cluster (where the node IPs are in the same network as the DNS servers), so I have no idea how to debug this further or how to fix it.

As another test, I started a dnsutils pod with the image

image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3

and, seeing that it was scheduled on node 192.168.130.15, checked the DNS logs on that node.
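(Such a pod can be created with a one-liner along the lines of

kubectl run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 --restart=Never --command -- sleep 3600

which follows the usual pattern from the Kubernetes DNS debugging guide.)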

The logs there do show the lookup for gcr.io (the image registry):

192.168.130.15: 2024-07-04T07:22:00.210Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 7044\n;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;gcr.io.\tIN\t A\n\n;; ANSWER SECTION:\ngcr.io.\t300\tIN\tA\t142.250.102.82\n"}.

So DNS works on the node itself and images can be pulled, yet dnsutils itself errors out:

kubectl exec dnsutils -it -- nslookup google.com
;; connection timed out; no servers could be reached

On the working cluster:

kubectl exec dnsutils -it -- nslookup google.com
;; connection timed out; no servers could be reached
kubectl exec dnsutils -it -- nslookup -query=any google.com
Server:         10.96.0.10
Address:        10.96.0.10#53

Non-authoritative answer:
google.com      nameserver = ns1.google.com.
google.com      nameserver = ns3.google.com.

On the failing cluster:

kubectl exec dnsutils -it -- nslookup -query=any google.com
;; Connection to 10.96.0.10#53(10.96.0.10) for google.com failed: timed out.
;; Connection to 10.96.0.10#53(10.96.0.10) for google.com failed: timed out.
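One thing that may help narrow this down is querying the upstream DNS servers directly from the pod, bypassing the cluster DNS service, for example:

kubectl exec dnsutils -it -- nslookup google.com 10.203.32.2

If that also times out, the issue is reachability from the pod network to the upstream servers rather than CoreDNS or the host DNS cache itself.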

How can I debug further to see why DNS inside pods is not working? Or does anyone know how to fix this?

1 Answer


Using traceroute, it turned out to be a firewall blocking port 53. On the working cluster, traceroute gets to 172.17.0.252 (a firewall VM) and beyond:

/ # traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 46 byte packets
 1  10.244.6.1 (10.244.6.1)  0.022 ms  0.031 ms  0.019 ms
 2  10.203.32.254 (10.203.32.254)  0.755 ms  1.011 ms  0.639 ms
 3  172.17.0.252 (172.17.0.252)  1.515 ms  0.854 ms  0.730 ms

while on the failing cluster it is blocked:

/ # traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 46 byte packets
 1  10.244.5.1 (10.244.5.1)  0.013 ms  0.010 ms  0.002 ms
 2  192.168.130.254 (192.168.130.254)  0.550 ms  0.706 ms  0.009 ms
 3  *  *  *
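To confirm that it is specifically port 53 being blocked (traceroute only shows the path, not a particular port), a direct query from the pod against an upstream server with a short timeout can be used, for example:

/ # dig @10.203.32.2 google.com +time=2 +tries=1

A timeout here, while the node itself resolves names fine, points at a firewall rule between the pod network and the DNS servers.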
