We are using Talos Linux to set up Kubernetes clusters on VMware. It works fine on one cluster on a VMware host, but on another cluster everything works except DNS inside pods/containers.
I've enabled DNS on all hosts, and talosctl "get resolvers" and "get dnsupstream" both return the correct DNS data:
NODE NAMESPACE TYPE ID VERSION RESOLVERS
192.168.130.82 network ResolverStatus resolvers 2 ["10.203.32.2","10.203.32.3"]
talosctl -n 192.168.130.${ip} -e 192.168.130.${ip} get dnsupstream
NODE NAMESPACE TYPE ID VERSION HEALTHY ADDRESS
192.168.130.80 network DNSUpstream 10.203.32.2 1 true 10.203.32.2:53
192.168.130.80 network DNSUpstream 10.203.32.3 1 true 10.203.32.3:53
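For completeness, I run those checks against every node in one go (a sketch; the last octets here are just the ones from the output above, adjust to your node list):

```shell
# Loop the resolver checks over all nodes; 80 and 82 are the last octets
# from the examples above, substitute your own
for ip in 80 82; do
  talosctl -n 192.168.130.${ip} -e 192.168.130.${ip} get resolvers
  talosctl -n 192.168.130.${ip} -e 192.168.130.${ip} get dnsupstream
done
```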
Yet when I fire up a "curlimages/curl" pod and curl an internal server, it works by IP but resolving the hostname does not:
~ $ curl http://10.203.32.90:32005
RABBITMQ API 1.0.15~ $
~ $ curl http://rabbit.domain.com:32005
curl: (6) Could not resolve host: rabbit.domain.com
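(As an aside, curl's error 6 just means getaddrinfo() failed. Inside the pod you can separate "nameserver unreachable" from "name unknown" with getent, assuming it exists in the image, which minimal images may not have:)

```shell
# getent exercises the same libc resolver path curl uses.
# A name from /etc/hosts still resolves even when the nameserver is
# unreachable, which pins the failure on the nameserver, not on libc.
getent hosts localhost
getent hosts rabbit.domain.com || echo "resolver lookup failed"
```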
According to the docs, resolv.conf should contain 10.96.0.9 on all nodes, and it does:
talosctl read /system/resolved/resolv.conf
nameserver 10.96.0.9
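That 10.96.0.9 entry is the node-level (host) resolver, though; a pod gets its own /etc/resolv.conf, which is worth comparing against the node's (a sketch; "curl-test" is a hypothetical name for the curlimages/curl pod):

```shell
# Check which resolver the pod was actually handed; on a default cluster this
# should be the kube-dns service IP (10.96.0.10), not the node's 10.96.0.9
kubectl exec curl-test -- cat /etc/resolv.conf
```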
This works on another Talos cluster (where the node IPs are in the same network as the DNS servers), so I no longer know how to even debug this, let alone fix it.
As another test, I started a dnsutils pod with the image
image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
and after seeing that it started on node 192.168.130.15, I checked the DNS logs there. They show the registry lookup for gcr.io:
192.168.130.15: 2024-07-04T07:22:00.210Z DEBUG dns response {"component": "dns-resolve-cache", "data": ";; opcode: QUERY, status: NOERROR, id: 7044\n;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1\n\n;; OPT PSEUDOSECTION:\n; EDNS: version 0; flags:; udp: 1232\n\n;; QUESTION SECTION:\n;gcr.io.\tIN\t A\n\n;; ANSWER SECTION:\ngcr.io.\t300\tIN\tA\t142.250.102.82\n"}.
So DNS works on the node itself and images are pulled, yet dnsutils itself errors out:
kubectl exec dnsutils -it -- nslookup google.com
;; connection timed out; no servers could be reached
On the working cluster:
kubectl exec dnsutils -it -- nslookup google.com
;; connection timed out; no servers could be reached
kubectl exec dnsutils -it -- nslookup -query=any google.com
Server: 10.96.0.10
Address: 10.96.0.10#53
Non-authoritative answer:
google.com nameserver = ns1.google.com.
google.com nameserver = ns3.google.com.
On the failing cluster:
kubectl exec dnsutils -it -- nslookup -query=any google.com
;; Connection to 10.96.0.10#53(10.96.0.10) for google.com failed: timed out.
;; Connection to 10.96.0.10#53(10.96.0.10) for google.com failed: timed out.
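Since node-level DNS works but the cluster DNS IP (10.96.0.10) times out from inside pods, these are the next steps I can think of to narrow it down (a sketch; the service and label names are the usual kube-dns defaults, adjust if your cluster differs):

```shell
# 1. Does the cluster DNS service exist and have endpoints behind it?
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get endpoints kube-dns

# 2. Are the CoreDNS pods up, and do their logs show queries arriving?
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# 3. Is it UDP-only? -vc forces nslookup over TCP; if TCP works while UDP
#    times out, that points at the CNI / kube-proxy UDP path.
kubectl exec dnsutils -- nslookup -vc google.com

# 4. Can the pod reach a CoreDNS pod IP directly, bypassing the service VIP?
#    If the pod IP answers but 10.96.0.10 does not, suspect kube-proxy rules.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide   # note a pod IP
kubectl exec dnsutils -- nslookup google.com <coredns-pod-ip>  # hypothetical placeholder
```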
How can I debug further to see why DNS inside pods is not working? Or, if anyone knows, how can I fix this?