
My server recently crashed because the GitLab Docker/Nomad container reached its defined memory limit (10G). When it hit the limit, the container spent 100% of its CPU time in kernel space (the container was limited to 4 CPU cores). Eventually the host locked up and became unresponsive to SSH connections:

[Grafana screenshot: the last minutes before the freeze]

The kernel log did not indicate any OOM kill. I also noticed a spike in disk I/O, which I can't explain.
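For reference, a rough sketch of the checks that rule out an OOM kill. The cgroup path assumes cgroup v2 with Docker's systemd cgroup driver (the layout differs under cgroup v1 or the cgroupfs driver), and the container name is a placeholder:

# Kernel log: an OOM kill would show up here
dmesg -T | grep -iE 'oom|killed process'

# Per-container memory events: the "oom" and "oom_kill" counters show
# whether the cgroup OOM killer fired for this container
container=gitlab   # placeholder: use the name of the container in question
cid=$(docker inspect --format '{{.Id}}' "$container")
cat "/sys/fs/cgroup/system.slice/docker-${cid}.scope/memory.events"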

I tried to create a smaller example, without any existing GitLab data and without Nomad:

config=$(cat <<'EOS'
external_url 'https://xxxxxxxxx'
nginx['listen_port'] = 80
nginx['listen_https'] = false
nginx['proxy_set_headers'] = {
 'Host' => '$http_host_with_default',
 'X-Real-IP' => '$remote_addr',
 'X-Forwarded-For' => '$proxy_add_x_forwarded_for',
 'X-Forwarded-Proto' => 'https',
 'X-Forwarded-Ssl' => 'on',
 'Upgrade' => '$http_upgrade',
 'Connection' => '$connection_upgrade'
}
EOS
)

# Started with
docker run --rm -d --memory="5G" \
    --name testgitlab \
    --publish 10.0.0.1:9001:80 \
    -e "GITLAB_SSL=true" \
    -e "GITLAB_OMNIBUS_CONFIG=$config" \
    gitlab/gitlab-ce:latest

The same thing happened. I stopped the container right when it went to 100% system CPU (no CPU limit this time):

[Grafana screenshot]
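A sketch of how the condition can be watched from a second shell while the test runs (standard procps and Docker tooling, nothing specific to this setup):

# System-wide CPU split (us vs. sy) and I/O, refreshed every 2 seconds
vmstat 2

# Live memory usage vs. the 5G limit of the test container
docker stats testgitlab

# Stop the test before the host becomes unresponsive
docker stop -t 30 testgitlab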

At first I thought this was a bug in GitLab, but when I tried the same thing on my laptop, the OOM killer instantly killed a container process as soon as it hit the memory limit. The problem also never appeared on my old server, which used to run the GitLab container.

None of the systems mentioned in this post have any special Docker settings; all of them have swap disabled, and the only sysctl settings that were modified are in the network subsystem (a sketch of how to verify this follows the list).

  • Old server: Debian 10
    • Kernel from buster-backports (5.10.0-0.bpo.9-amd64)
    • Filesystem: btrfs on md-raid 1 (2x nvme ssds)
  • Current server: Debian 11
    • Kernels from bullseye (5.10.0-9-amd64) and from bullseye-backports (5.14.0-0.bpo.2-amd64); the problem happens with both.
    • Filesystem: btrfs on lvm on md-raid 1 (2x nvme ssds)
  • My laptop: Arch
    • Kernel: 5.15.5-arch1-1
    • Filesystem: ext4 on lvm on nvme ssd
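A sketch of the commands that show these facts on any of the hosts (standard util-linux and coreutils tools, nothing specific to this setup):

uname -r                                # running kernel
swapon --show                           # no output means swap is disabled
sysctl -a 2>/dev/null | grep '^net\.'   # the only sysctls that were changed
lsblk                                   # NVMe / md-raid / LVM stacking
findmnt -t btrfs,ext4                   # which filesystem is mounted where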

How can I avoid a host freeze when GitLab reaches its memory limit? And why does it only affect my current server?

UPDATE: I could reproduce this behaviour with the sonatype/nexus container, so it's not a GitLab-specific problem.
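For reference, a minimal reproduction along the same lines; the memory limit here is an assumption, it just has to be low enough that the service inside the image pushes against it:

docker run --rm -d --memory="2G" \
    --name testnexus \
    sonatype/nexus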

UPDATE 2: I noticed a spike in page cache misses. Maybe this is the cause: the same kind of thrashing, just confined by the cgroup memory limit? But why aren't my old server and my laptop affected?
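A sketch of how to watch for this without extra tooling; the workingset_* counters in /proc/vmstat roughly correspond to page cache misses on recently evicted pages, and /proc/pressure/memory requires a kernel with PSI enabled:

# Refault and major-fault counters; a fast-growing workingset_refault*
# while the container sits at its limit points to cache thrashing
grep -E 'workingset|pgmajfault' /proc/vmstat

# Memory pressure stall information (kernel >= 4.20, PSI enabled)
cat /proc/pressure/memory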

UPDATE 3: I can reproduce it when spinning up a VM with md-raid 1 + LVM + btrfs! Investigating further...
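Roughly, the storage stack in the test VM looks like this; the device names (/dev/vdb, /dev/vdc), volume group name and mount point are placeholders, not a verbatim transcript:

# RAID 1 mirror over two virtual disks
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/vdb /dev/vdc

# LVM on top of the mirror
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -n data -l 100%FREE vg0

# btrfs on the logical volume
mkfs.btrfs /dev/vg0/data
mount /dev/vg0/data /mnt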

1 Answer

Sounds like memory thrashing. The container runs out of memory, so it needs to use swap space. Writing out to swap causes high kernel and disk activity, because swapping is a function of the kernel and it swaps to disk.
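If swap were involved, it would be visible with standard tools; a quick check (nothing here is specific to this setup):

swapon --show    # no output means there is no swap device at all
free -h          # "Swap: 0B" when swap is disabled
vmstat 2         # si/so columns stay at 0 without swap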

  • Swap is disabled. Do you mean swapping out of e.g. code sections? Can this be prevented?
    – Gabscap
    Commented Dec 2, 2021 at 2:02
  • From nomadproject.io/docs/drivers/docker#memory_hard_limit: "memory_hard_limit - (Optional) The maximum allowable amount of memory used (megabytes) by the container. If set, the memory parameter of the task resource configuration becomes a soft limit passed to the docker driver as --memory_reservation, and memory_hard_limit is passed as the --memory hard limit. When the host is under memory pressure, the behavior of soft limit activation is governed by the Kernel." That could explain the maxed-out kernel activity. (A Docker-level illustration of soft vs. hard limits follows below.)
    – Brian
    Commented Dec 4, 2021 at 2:11
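For illustration, the Docker-level equivalent of that Nomad behaviour is a soft reservation plus a hard limit (the actual Docker flag is spelled --memory-reservation); the values below are placeholders, not taken from the question:

# --memory-reservation is a soft limit, only enforced through reclaim when
# the host is under memory pressure; --memory is the hard cgroup limit
docker run --rm -d \
    --memory-reservation="8g" \
    --memory="10g" \
    --name gitlab \
    gitlab/gitlab-ce:latest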
