Edit: The issue turned out to be my umask, which was set to 027 rather than the default of 022. See the answer below for details.

I'm experiencing a bewildering set of issues with LXC that, once triggered, affect the whole system.

When starting/stopping LXC containers, occasionally the start or stop will hang indefinitely. When this happens on startup, the container's init process is running but unkillable, even using kill -9. The container never comes online, and the only way to end the process is a system reboot.

Thing is, the system won't reboot any more either. Around the same time I noticed that update-initramfs also hangs indefinitely. After finding this: https://unix.stackexchange.com/questions/428001/update-initramfs-hangs-on-debian-stretch I concluded that sync (both the utility and the system call) is hanging, which causes LXC to fail, update-initramfs to hang, and system shutdown to hang (since a sync is done before unmounting filesystems). Once the issue occurs, calling sync (the utility) from the command line consistently hangs indefinitely. I have tried running it under strace, but the trace ends where it enters the kernel call, so I can't debug further. I've monitored the caches using this but it just hovers in the <100kB range.
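For anyone trying to confirm the same symptom, here is a rough sketch of how a hung sync could be checked from a second shell. It is not the tool mentioned above; it only assumes a Linux /proc filesystem and a procps-style ps, and the function names filter_d_state and dirty_counters are made up for illustration.

```shell
# Keep the header line plus any process whose STAT column starts with D
# (uninterruptible sleep), which is where a hung sync would show up.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Print the Dirty and Writeback counters from a meminfo-style file
# (defaults to /proc/meminfo; the path argument is only there for testing).
dirty_counters() {
    grep -E '^(Dirty|Writeback):' "${1:-/proc/meminfo}"
}

# Usage on a live system:
#   ps -eo pid,stat,wchan:32,comm | filter_d_state
#   dirty_counters
```

A process that stays in state D across repeated runs, while the Dirty and Writeback counters stay small, would match what is described here: sync is blocked in the kernel rather than waiting on a large amount of pending I/O.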

Since sync deals with filesystems, I suspect something is wrong with the way LXC handles some filesystem. I have another identical server that does not use LXC; after comparing the output of mount, I unmounted the filesystems not present on that server, to no avail. sync continues to hang.

Now, on a clean boot, and without touching LXC, sync always works and continues to work. For this reason, and because I'm not seeing other problems, I am positive there are no actual I/O issues. Also, when a container does start successfully, it doesn't seem to have any problems.

I have scoured the internet far and wide regarding this issue, with no success.

LXC 2.0.7-2+deb9u2 on Debian 9 (stable) with kernel 4.19.0-0.bpo.4-amd64 (although it happened with other recent kernels too), with two SSDs in RAID 1 for / and three HDDs in RAID 5 (mdadm) for /home. Guests are Debian 9 (stretch) or 10 (buster), running as unprivileged containers. I seem to have narrowed it down to this: the issue did not occur with privileged containers.

Example guest container config:

# Template used to create this container: /usr/share/lxc/templates/lxc-download

# Distribution configuration
lxc.include = /usr/share/lxc/config/debian.common.conf
lxc.include = /usr/share/lxc/config/debian.userns.conf
lxc.arch = linux64

# Container specific configuration
lxc.id_map = u 0 200000 100000
lxc.id_map = g 0 200000 100000

# Network configuration
#lxc.network.type = empty
lxc.network.type = veth
lxc.network.link = lxcbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:e9:4a:e7
lxc.rootfs = /var/lib/lxc/somename/rootfs
lxc.rootfs.backend = dir
lxc.utsname = somename

# Mounts
lxc.mount.entry = /var/lib/lxc/temp mnt/temp none bind 0 0

and subuid/gid mappings:

# cat /etc/s*id
root:100000:1000000000
root:100000:1000000000

Example container creation, startup, and failing stop:

# lxc-create -n test -t download
...
Distribution: debian
Release: stretch
Architecture: amd64

Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs

---
You just created a Debian stretch amd64 (20190522_05:24) container.

# lxc-ls -f
NAME          STATE   AUTOSTART GROUPS IPV4 IPV6 
test          STOPPED 0         -      -    -    

# lxc-start -n test

# lxc-ls -f
NAME          STATE   AUTOSTART GROUPS IPV4 IPV6 
test          RUNNING 0         -      -    -    

# lxc-attach -n test
root@test:/# ls -alh /
total 68K
drwxr-xr-x  21 root   root    4.0K May 22 05:26 .
drwxr-xr-x  21 root   root    4.0K May 22 05:26 ..
drwxr-xr-x   2 root   root    4.0K May 22 05:26 bin
drwxr-xr-x   2 root   root    4.0K Mar 28 09:12 boot
drwxr-xr-x   4 root   root     400 May 22 09:26 dev
drwxr-xr-x  42 root   root    4.0K May 22 09:24 etc
drwxr-xr-x   2 root   root    4.0K Mar 28 09:12 home
drwxr-xr-x   9 root   root    4.0K May 22 05:25 lib
drwxr-xr-x   2 root   root    4.0K May 22 05:25 lib64
drwxr-xr-x   2 root   root    4.0K May 22 05:25 media
drwxr-xr-x   2 root   root    4.0K May 22 05:25 mnt
drwxr-xr-x   2 root   root    4.0K May 22 05:25 opt
dr-xr-xr-x 225 nobody nogroup    0 May 22 09:26 proc
drwx------   2 root   root    4.0K May 22 05:25 root
drwxr-xr-x   3 root   root      60 May 22 09:26 run
drwxr-xr-x   2 root   root    4.0K May 22 05:26 sbin
drwxr-xr-x   2 root   root    4.0K May 22 05:25 srv
dr-xr-xr-x  13 nobody nogroup    0 May 19 17:07 sys
drwxrwxrwt   2 root   root    4.0K May 22 05:25 tmp
drwxr-xr-x  10 root   root    4.0K May 22 05:25 usr
drwxr-xr-x  11 root   root    4.0K May 22 05:25 var
root@test:/# exit

# lxc-ls -f
NAME          STATE   AUTOSTART GROUPS IPV4 IPV6 
debian_buster STOPPED 0         -      -    -    
rtorrent      STOPPED 0         -      -    -    
test          RUNNING 0         -      -    -    

# lxc-stop -n test
^C

# lxc-stop -n test
... continues to hang ...
# ^C
# sync
^C^C^Z^X^C^Z^X^C^Z^C^Z^X^C
... won't die.

1 Answer

As it turns out, the problem was my less-permissive-than-default umask. Debian's default is 022, which I had changed to 027 in my user account's .bashrc for security reasons. When I use su to become root, this umask is inherited, so all lxc-* commands are executed with umask 027. This triggers a known issue with LXC, which for some reason still hasn't been fixed (in the Debian packages, at least?).

Changing the umask back to 022 (in the current session or by modifying .bashrc) lets me run my existing containers just fine.
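A minimal sketch of the in-session variant, assuming a POSIX-style shell. The helper name with_default_umask is hypothetical; it runs a single command under umask 022 inside a subshell, so the stricter umask from .bashrc stays in effect for everything else.

```shell
# Run one command with umask 022 without changing the caller's own umask.
# The parentheses start a subshell, so the umask change is discarded afterwards.
with_default_umask() {
    ( umask 022 && "$@" )
}

# Usage (as root; "test" is the container name from the question):
#   umask                              # shows the inherited value, e.g. 0027
#   with_default_umask lxc-start -n test
#   with_default_umask lxc-stop -n test
```

Simply running umask 022 once in the root shell before any lxc-* command does the same thing for the whole session.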

Resources:

https://discuss.linuxcontainers.org/t/cannot-stop-unprivileged-container-not-even-kill-9-its-systemd-process-on-host/1079

https://github.com/lxc/lxc/issues/2277 (not sure this is the same issue)

https://github.com/lxc/lxc/issues/1403

https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1642767
