
I am experiencing a network/http timeout issue with a docker-in-docker app that's running in a Kubernetes cluster and I need help in figuring out what may be happening.

I am running a docker container within docker (it's a build tool). In the innermost container, the docker build hangs on executing this line in the Dockerfile: apk add --no-cache tzdata

The console output says: fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz

I have tried a simple curl against this URL: it works about 50% of the time and times out the rest of the time. The issue is also limited to the Alpine CDN URL; for example, I can download an image from flickr.com 100% of the time. The same URL also downloads 100% of the time in a different cluster in a different VPC. So something specific to this Kubernetes stack, combined with this particular URL, is causing the issue. What I need help with is how to dig further to identify the problem.
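
To put a number on the intermittent failure, a probe along these lines could be run in turn from the node, the pod, and the innermost container. This is only a minimal sketch: `fetch_once` and `success_rate` are illustrative names, and the 10-second timeout and 20 attempts are arbitrary assumptions, not values from the setup above.

```python
import http.client
import urllib.request

# URL taken from the console output above.
ALPINE_INDEX = "http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz"


def fetch_once(url, timeout=10.0):
    """Return True if the full body is read without an error.

    Note: the timeout applies to each blocking socket operation,
    not to the download as a whole.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return True
    except (OSError, http.client.HTTPException):
        # URLError, socket timeouts and connection resets are all OSError.
        return False


def success_rate(url, attempts=20, fetch=fetch_once):
    """Fetch the URL `attempts` times and return the fraction that succeeded."""
    successes = sum(1 for _ in range(attempts) if fetch(url))
    return successes / attempts
```

Calling `success_rate(ALPINE_INDEX)` at each layer shows where the rate first drops below 1.0, which narrows down the network layer that introduces the problem.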

I have stripped the app down to the bare essence that highlights the problem. Here is the project structure:

[image: project file structure]

Here is app.py:

from time import sleep

while True:
    sleep(60)

This is the Dockerfile:

FROM python:3.7-alpine3.11

RUN apk add --no-cache docker

COPY entrypoint.sh /
RUN chmod 0700 /entrypoint.sh

RUN mkdir /app
WORKDIR /app/
COPY app /app/

ENTRYPOINT [ "/entrypoint.sh" ]

This is entrypoint.sh:

#!/bin/sh
set -e

echo 'Starting dockerd...'
# check if docker pid file exists (can linger from docker stop or unclean shutdown of container)
if [ -f /var/run/docker.pid ]; then
  rm -f /var/run/docker.pid
fi
mkdir -p /etc/docker
echo '{ "storage-driver": "vfs" }' > /etc/docker/daemon.json
# dockerd logs to stderr, so capture both streams in the log file
nohup dockerd > /var/log/dockerd.log 2>&1 &

# The following command does not spawn execution to the background as
#     we need to leave something holding the container in run state.
echo "Starting canary app..."
exec python3 app.py

And this is service.yml:

apiVersion: v1
kind: List
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    labels:
      run: canary
    name: canary
  spec:
    replicas: 1
    selector:
      matchLabels:
        run: canary
    template:
      metadata:
        labels:
          run: canary
      spec:
        containers:
          - image: canary
            imagePullPolicy: IfNotPresent
            name: canary
            securityContext:
              capabilities:
                add:
                  - SYS_ADMIN
              privileged: true
        dnsPolicy: ClusterFirst
- apiVersion: v1
  kind: Service
  metadata:
    name: canary
    labels:
      run: canary
  spec:
    ports:
      - port: 80
        protocol: TCP
    selector:
      run: canary
    sessionAffinity: None
    type: ClusterIP


2 Answers


The issue was related to MTU. Our cluster uses Calico VXLAN networking, which has an MTU of 1450. The inner Docker daemon was not aware of this, and path MTU discovery (PMTUD) does not seem to have compensated for it. Oddly, the problem appeared only with the Fastly CDN and not with the other hosts I tried, which was an additional confounding factor. The issue went away when I set the MTU for the inner Docker containers to 1450 as well.
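
The arithmetic behind the mismatch can be sketched as follows. It assumes an IPv4 underlay with a 1500-byte MTU, headers without options, and Docker's default bridge MTU of 1500 in the inner daemon; only the 1450 figure comes from the cluster described above, and the function names are illustrative.

```python
IPV4_HEADER = 20
TCP_HEADER = 20
# VXLAN over IPv4: inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20
VXLAN_OVERHEAD = 14 + 8 + 8 + 20


def overlay_mtu(underlay_mtu):
    """Largest inner IP packet that fits in one VXLAN-encapsulated packet."""
    return underlay_mtu - VXLAN_OVERHEAD


def tcp_mss(mtu):
    """TCP payload per segment for a given MTU, assuming no IP/TCP options."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(overlay_mtu(1500))  # 1450, the Calico VXLAN MTU
print(tcp_mss(1500))      # 1460, what an inner container with MTU 1500 advertises
print(tcp_mss(1450))      # 1410, what actually fits through the overlay
```

A server that sends full 1460-byte segments produces 1500-byte packets, which are 50 bytes too large for the overlay. If the resulting ICMP "fragmentation needed" messages never reach the sender, PMTUD cannot shrink the segments and the transfer stalls, which would be consistent with the timeouts seen here.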

  • My first idea would have been that IPv6 support was enabled in Docker, but not configured properly, so the container is assigned an address, which causes name lookup to start returning IPv6 addresses for services. Commented Sep 17, 2020 at 13:20

We had a very similar symptom running a docker-in-docker build in our CI pipeline, which runs in GKE. We suddenly started getting intermittent failures building Docker images when apt-get attempted to download packages from deb.debian.org (and also from other hosts we tested manually, e.g. github.com). We fixed the problem in much the same way as @Sushil describes, namely by setting the MTU for the inner Docker containers. We did this by creating a custom Docker image derived from the official docker:dind image:

FROM docker:dind
COPY daemon.json /etc/docker/

Where daemon.json contains:

{
  "mtu": 1460
}

(1460 being the MTU of the VPC network associated with our Kubernetes cluster)
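
Rather than hard-coding the value, the MTU could be derived from the pod's own network interface before dockerd starts. The following is a sketch under assumptions: the interface name `eth0` and all function names are hypothetical, and it relies on Linux exposing the MTU at `/sys/class/net/<interface>/mtu`. The `mtu` key in `daemon.json` is the same one used above.

```python
import json
from pathlib import Path


def read_interface_mtu(interface="eth0", sysfs_root="/sys/class/net"):
    """Read the MTU of a network interface from sysfs (Linux only)."""
    return int(Path(sysfs_root, interface, "mtu").read_text().strip())


def build_daemon_config(mtu, extra=None):
    """Return a daemon.json dict with `mtu` set, preserving any extra keys."""
    config = dict(extra or {})
    config["mtu"] = mtu
    return config


def write_daemon_config(config, path="/etc/docker/daemon.json"):
    """Write the config as JSON, creating the parent directory if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(config, indent=2) + "\n")
```

In an entrypoint this would run before dockerd is launched, for example `write_daemon_config(build_daemon_config(read_interface_mtu()))`, so that the inner bridge never uses a larger MTU than the pod interface it sits behind.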
