I will attach the minimized test case below. In short, it is a simple Dockerfile that has these lines:

VOLUME ["/sys/fs/cgroup"]
CMD ["/lib/systemd/systemd"]

The image is based on debian:buster-slim and runs systemd inside the container. I used to run the container like this:

$ docker run  --name any --tmpfs /run \
    --tmpfs /run/lock --tmpfs /tmp \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro -it image_name

It used to work fine before I upgraded a bunch of host Linux packages. The host kernel/systemd now seems to default to cgroup v2; before, it was cgroup v1. After the upgrade the container stopped working. However, if I pass the kernel option that makes the host use cgroup v1, it works again.
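
To see which layout a host actually ended up with after such an upgrade, something like the following might help. It is only a sketch: the function name cgroup_mode is made up for illustration, and the "legacy" branch is a heuristic (any non-empty directory that has no cgroup.controllers file), not an authoritative test.

```shell
#!/bin/sh
# Print which cgroup layout is mounted under a given root
# (default: /sys/fs/cgroup).
#   unified - cgroup v2 only (cgroup.controllers sits at the root)
#   hybrid  - v1 controllers plus a v2 tree mounted under unified/
#   legacy  - heuristic: something is mounted, but no v2 tree was found
#   unknown - the directory is missing or empty
cgroup_mode() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        echo unified
    elif [ -f "$root/unified/cgroup.controllers" ]; then
        echo hybrid
    elif [ -d "$root" ] && [ -n "$(ls -A "$root" 2>/dev/null)" ]; then
        echo legacy
    else
        echo unknown
    fi
}

cgroup_mode
```

Running the same function inside the container (where /sys/fs/cgroup is whatever Docker mounted) shows what the container's systemd will see.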

Without giving the kernel option, the fix was to add --cgroupns=host to docker run besides mounting /sys/fs/cgroup as read-write (:rw in place of :ro).

I'd like to avoid forcing the users to give the kernel option. Although I am far from an expert, forcing the host namespace for a docker container does not sound right to me.

I am trying to understand why this is happening, and figure out what should be done. My goal is to run systemd inside a Docker container on a host that uses cgroup v2.

Here's the error I am seeing:

$ docker run --name any --tmpfs /run --tmpfs /run/lock --tmpfs /tmp \
    -v /sys/fs/cgroup:/sys/fs/cgroup:rw -it image_name
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <5e089ab33b12>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

It does not look right, and this line in particular seems suspicious:

Failed to create /init.scope control group: Read-only file system

It seems like there should have been something before /init.scope. That is why I reviewed the docker run options and tried the --cgroupns option. If I add --cgroupns=host, it works. If I keep --cgroupns=host but mount /sys/fs/cgroup read-only, it fails with a different error, and the corresponding line looks like this:

Failed to create /system.slice/docker-0be34b8ec5806b0760093e39dea35f4305262d276ecc5047a5f0ff43871ed6d0.scope/init.scope control group: Read-only file system

To me, it looks like the docker daemon/engine fails to configure XXX.slice or something like that for the container. I assume that docker may be to some extent responsible for setting up the namespace, and that something is not going well there. However, I can't be sure at all. What is the issue, and what would be the fix?
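
One way to compare the two cases is to look at the cgroup path a process reports for itself. This is a sketch with a made-up helper name (unified_cgroup_path); it only relies on the "0::<path>" line format that /proc/<pid>/cgroup uses for the cgroup v2 hierarchy. As far as I understand, inside a private cgroup namespace the reported path is relative to the namespace root, which would explain a bare /init.scope, while with --cgroupns=host the full docker-….scope path is visible.

```shell
#!/bin/sh
# Print the cgroup v2 path recorded in a /proc/<pid>/cgroup style file.
# On the unified hierarchy the entry has the form "0::<path>";
# v1 entries ("<id>:<controllers>:<path>") are ignored.
unified_cgroup_path() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

if [ -r /proc/self/cgroup ]; then
    unified_cgroup_path
fi
```

Run it once on the host and once inside the container (for example via docker exec) and compare the output with the path in the error message.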

The Dockerfile I used for this experiment is as follows:

FROM debian:buster-slim

ENV container docker
ENV DEBIAN_FRONTEND noninteractive

USER root

RUN set -x

RUN apt-get update -y \
    && apt-get install --no-install-recommends -y systemd \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
    && rm -f /var/run/nologin

RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
    /etc/systemd/system/*.wants/* \
    /lib/systemd/system/local-fs.target.wants/* \
    /lib/systemd/system/sockets.target.wants/*udev* \
    /lib/systemd/system/sockets.target.wants/*initctl* \
    /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup*

VOLUME [ "/sys/fs/cgroup" ]

CMD ["/lib/systemd/systemd"]

I am using Debian. The docker version is 20.10.3 or so. Google search told me that docker supports cgroup v2 as of 20.10 but I don't actually understand what that "support" means.

  • I have actually encountered the exact same problem. I've been convinced I was running cgroupv2 the whole time and wondered why systemd inside the container cannot create its own user namespaces. The goal is to actually use v2 in order to get the most functionality of the system inside the container. I've figured out the host's systemd is not using v2 and by extension also docker - enabled it - and everything stopped working. I will test in a moment, but it seems systemd inside container needs to be told to switch to v2.
    – pinkeen
    Commented Feb 19, 2021 at 22:30
  • I need to get back to reading the cg v1/v2 technical documentation top-to-bottom because I feel lost. According to my understanding the cgroupns private mode should create a private group for the container's init. It seems not to do that. I think you should not mount /sys/fs/cgroup at all, it should be populated automatically. For reference see Rootlesskit docs here: github.com/rootless-containers/rootlesskit/blob/master/…. Also podman option docs seem to be quite insightful: docs.podman.io/en/latest/markdown/…
    – pinkeen
    Commented Feb 19, 2021 at 23:16
  • Systemd containers work with podman OOTB, I already tried that a long time ago. I would gladly use podman for my own purposes, but I want to create a solution that will feel familiar to everybody, so compatibility-wise docker seems to be preferred... There's also LXC/LXD, but it has a very different approach, and the selling point is support for both (or almost all) types of workloads.
    – pinkeen
    Commented Feb 19, 2021 at 23:21
  • I will try setting up docker in a clean VM from scratch, cause I've got a feeling this system might be misconfigured (it's Debian Buster but the Proxmox flavour). Even Docker for Mac does not support cgroupv2 (at least not in the stable version).
    – pinkeen
    Commented Feb 19, 2021 at 23:23
  • You can use nsenter and mount to change the :ro permission of /sys/fs/cgroup in the container to :rw. See this very comprehensive post on github: github.com/mviereck/x11docker/issues/…
    – mviereck
    Commented Feb 10, 2022 at 12:52

It seems to me that this use case is not explicitly supported yet. You can almost get it working but not quite.

The root cause

When systemd sees a unified cgroupfs at /sys/fs/cgroup, it assumes it can write to it. Normally that would be possible, but it is not the case here.

The basics

First of all, you need to create a systemd slice for docker containers and tell docker to use it - my current docker/daemon.json:

{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "features": { "buildkit": true },
  "experimental": true,
  "cgroup-parent": "docker.slice"
}

Note: Not all of these options are necessary. The most important one is cgroup-parent. The cgroupdriver should already be switched to "systemd" by default.

Each slice gets its own nested cgroup. There is one caveat though: each group can only be either a "leaf" or an "intermediary". Once a process takes ownership of a cgroup, no other process can manage it. This means that the actual container process needs, and will get, its own private group attached below the configured one in the form of a systemd scope.

Reference: Please find more about systemd resource control, handling of cgroup namespaces and delegation.

Note: At this point the docker daemon should use --cgroupns private by default, but you can force it anyway.

Now a newly started container will get its own group which should be available in a path that (depending on your setup) resembles:


And here is the important part: You must not mount a volume into container's /sys/fs/cgroup. The path to its private group mentioned above should get mounted there automatically.

The goal

Now, in theory, the container should be able to manage this delegated, private group by itself almost fully. This would allow its own init process to create child groups.

The problem

The problem is that the /sys/fs/cgroup path in the container gets mounted read-only. I've checked apparmor rules and switched seccomp to unconfined to no avail.
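
To confirm how the path is actually mounted from inside the container, a small check against /proc/mounts can be used. This is just a sketch; mount_access is a hypothetical helper name, and it assumes the usual /proc/mounts column layout (device, mount point, type, options).

```shell
#!/bin/sh
# Print "ro" or "rw" for a mount point, reading a /proc/mounts style table.
# If the mount point is listed more than once, the last entry wins.
# Prints nothing if the mount point is not listed at all.
mount_access() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro" || opts[i] == "rw") acc = opts[i]
        }
        END { if (acc != "") print acc }
    ' "${2:-/proc/mounts}"
}

if [ -r /proc/mounts ]; then
    mount_access /sys/fs/cgroup
fi
```

Inside an affected container this would be expected to print ro, and rw once one of the workarounds discussed here takes effect.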

The hypothesis

I am not completely certain yet - my current hypothesis is that this is a security feature of docker/moby/containerd. Without private groups it makes perfect sense to mount this path ro.

Potential solutions

What I've also discovered is that enabling user namespace remapping causes the private /sys/fs/cgroup to be mounted with rw as expected!

This is far from perfect though - the cgroup mount (among others) has the wrong ownership: it's owned by the real system root (UID 0) while the container has been remapped to a completely different user. Once I manually adjusted the owner, the container was able to start a systemd init successfully.

I suspect this is a deficiency of docker's userns remapping feature and might be fixed sooner or later. Keep in mind that I might be wrong about this - I did not confirm.


Userns remapping has got a lot of drawbacks and the best possible scenario for me would be to get the cgroupfs mounted rw without it. I still don't know if this is done on purpose or if it's some kind of limitation of the cgroup/userns implementation.


It's not enough that your kernel has cgroupv2 enabled. Depending on the Linux distribution, the bundled systemd might prefer to use v1 by default.

You can tell systemd to use cgroupv2 via kernel cmdline parameter:

It might also be necessary to explicitly disable hybrid cgroupv1 support to avoid problems, using: systemd.legacy_systemd_cgroup_controller=0

Or completely disable cgroupv1 in the kernel with: cgroup_no_v1=all
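
To check which of these settings the running kernel was actually booted with, the command line can be inspected. A minimal sketch follows; the helper name cgroup_cmdline_params is made up, and the parameter names it looks for are simply the ones mentioned in this thread.

```shell
#!/bin/sh
# List the cgroup-related parameters found on a kernel command line.
# Reads /proc/cmdline by default; another file can be passed for testing.
# Exits non-zero (grep's status) when none of the parameters is present.
cgroup_cmdline_params() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" |
        grep -E '^(systemd\.unified_cgroup_hierarchy|systemd\.legacy_systemd_cgroup_controller|cgroup_no_v1)(=|$)'
}

if [ -r /proc/cmdline ]; then
    cgroup_cmdline_params || echo "no cgroup-related kernel parameters found"
fi
```

If nothing is listed, systemd falls back to whatever default the distribution compiled in.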

  • Upon consideration I wonder if it's even possible to do without userns-remapping. It might be that the kernel does not support clone/unshare inside a child namespace of UID 0?
    – pinkeen
    Commented Feb 20, 2021 at 18:45
  • You can use nsenter to mount /sys/fs/cgroup to :rw. See my comment under the question, it would have fit here better.
    – mviereck
    Commented Feb 10, 2022 at 14:07

Thanks to @pinkeen's answer, here are my Dockerfile and command line, which work fine. I hope this helps:

FROM debian:bullseye
# Using systemd in docker: https://systemd.io/CONTAINER_INTERFACE/
# Make sure cgroupv2 is enabled. To check this: cat /sys/fs/cgroup/cgroup.controllers
ENV container docker
VOLUME [ "/tmp", "/run", "/run/lock" ]
# Remove unnecessary units
RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
  /etc/systemd/system/*.wants/* \
  /lib/systemd/system/local-fs.target.wants/* \
  /lib/systemd/system/sockets.target.wants/*udev* \
  /lib/systemd/system/sockets.target.wants/*initctl* \
  /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup*
CMD [ "/lib/systemd/systemd", "log-level=info", "unit=sysinit.target" ]

docker build -t systemd_test .
docker run -t --rm --name systemd_test \
  --privileged --cap-add SYS_ADMIN --security-opt seccomp=unconfined \
  --cgroup-parent=docker.slice --cgroupns private \
  --tmpfs /tmp --tmpfs /run --tmpfs /run/lock \
  systemd_test

Note: you MUST use Docker 20.10 or above, and your system must have cgroupv2 enabled (check whether /sys/fs/cgroup/cgroup.controllers exists).
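
The version requirement can be checked in a script. The following is only a sketch: version_at_least is a hypothetical helper, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# Succeed if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort); the smaller of the two versions
# sorts first, so $2 must come out on top for the check to pass.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example with the version mentioned in the question:
if version_at_least 20.10.3 20.10; then
    echo "Docker version is new enough for cgroup v2"
fi
```

On a real host the first argument could come from docker version --format '{{.Server.Version}}'.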

  • This is the only solution that I found that worked for me as I'm running on CGroupv2 on OpenSuSE.
    – vanthome
    Commented Feb 2, 2022 at 8:15
  • I've done this FROM arm64v8/ubuntu:20.04 on a raspberrypi. I will end up on [ OK ] Reached target Graphical Interface. I thought I've used a server image and I wonder if it is normal ending up on graphic?
    – woodz
    Commented Nov 22, 2022 at 15:35

For those wondering how to solve this with the kernel commandline:

# echo 'GRUB_CMDLINE_LINUX=systemd.unified_cgroup_hierarchy=false' > /etc/default/grub.d/cgroup.cfg
# update-grub

This creates a "hybrid" cgroup setup, which makes the host cgroup v1 available again for the container's systemd.


  • I encountered this issue with LXC on Ubuntu for some of my Ubuntu/systemd-based LXC containers after upgrading to Ubuntu 22.04 (Jammy). I can confirm the workaround suggested works and my containers are now running again. Commented May 23, 2022 at 18:59
  • From the systemd documentation: "To say this clearly, legacy and hybrid modes have no future. If you develop software today and don’t focus on the unified mode, then you are writing software for yesterday, not tomorrow."
    – tstenner
    Commented Oct 27, 2022 at 7:25

I have discovered two additional workarounds for this issue that effectively retain all features of unified cgroupv2 while maintaining security - no need for the --privileged flag and no access to the root of cgroupv2 hierarchy:

  1. Use the --cgroupns host Docker option and a cgroupv2 sub-hierarchy volume binding for the container. Here is an example command:
# docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns host -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
Detected virtualization container-other.
Detected architecture x86-64.
Initializing machine ID from random generator.
Queued start job for default target Minimal target for containerized FreeIPA server.

Not perfect, next option is better IMO.

  2. Mount /sys/fs/cgroup on the host without the nsdelegate mount option. Although there isn't an explicit option to disable nsdelegate like nodiscard for discard (see link 1, link 2 for more information), there is a workaround. Simply run any container using Docker with the --cgroupns host option and without any cgroup volume bindings. For example:
# grep cgroup /proc/mounts 
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
# docker run --rm --cgroupns host ubuntu:latest echo done
# grep cgroup /proc/mounts 
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime 0 0
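
The before/after comparison above can also be scripted. This is a rough sketch with a made-up helper name (mount_has_option); it simply looks for one option in the options column of a /proc/mounts style table.

```shell
#!/bin/sh
# Succeed if the given mount point carries the given mount option.
# Usage: mount_has_option <mount point> <option> [mounts file]
# If the mount point is listed more than once, the last entry decides.
mount_has_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            found = 0
            n = split($4, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${3:-/proc/mounts}"
}

if [ -r /proc/mounts ] && mount_has_option /sys/fs/cgroup nsdelegate; then
    echo "nsdelegate is set on /sys/fs/cgroup"
else
    echo "nsdelegate is not set on /sys/fs/cgroup"
fi
```

This makes it easy to verify, after a reboot for instance, whether the option has come back and the workaround needs to be repeated.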

After implementing these steps, you can run a container with Docker using --cgroupns private flag and volume binding of cgroupv2 sub-hierarchy. For example:

# docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns private -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
Detected virtualization container-other.
Detected architecture x86-64.
Initializing machine ID from random generator.
Queued start job for default target Minimal target for containerized FreeIPA server.

Please note that the information provided above applies specifically to CentOS Stream release 9 with kernel-ml-6.3.7-1.el9.elrepo, systemd-252.4-598.13.hs.el9 (Hyperscale SIG) and docker-ce-24.0.2-1 (systemd cgroup driver), although it may help in a wide range of other scenarios.


It's interesting to note that with Docker Desktop for Mac 4.13.1 this Dockerfile works:

FROM debian:bullseye

VOLUME [ "/tmp", "/run", "/run/lock" ]

RUN apt-get update && apt-get install -y systemd bash && apt-get clean && mkdir -p /lib/systemd && ln -s /lib/systemd/system /usr/lib/systemd/system;


RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
  /etc/systemd/system/*.wants/* \
  /lib/systemd/system/local-fs.target.wants/* \
  /lib/systemd/system/sockets.target.wants/*udev* \
  /lib/systemd/system/sockets.target.wants/*initctl* \
  /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup*

CMD [ "/lib/systemd/systemd" ]

With just:

docker build . -t debiansys
docker run --rm -it --privileged debiansys

And this doesn't:

FROM amazonlinux:2

VOLUME [ "/tmp", "/run", "/run/lock" ]

RUN yum -y update && yum install -y systemd systemd-sysv bash && mkdir -p /lib/systemd && ln -s /lib/systemd/system /usr/lib/systemd/system


RUN cd /lib/systemd/system/sysinit.target.wants/ ; \
    for i in *; do [ $i = systemd-tmpfiles-setup.service ] || rm -f $i ; done ; \
    rm -f /lib/systemd/system/multi-user.target.wants/* ; \
    rm -f /etc/systemd/system/*.wants/* ; \
    rm -f /lib/systemd/system/local-fs.target.wants/* ; \
    rm -f /lib/systemd/system/sockets.target.wants/*udev* ; \
    rm -f /lib/systemd/system/sockets.target.wants/*initctl* ; \
    rm -f /lib/systemd/system/basic.target.wants/* ; \
    rm -f /lib/systemd/system/anaconda.target.wants/*

ENTRYPOINT [ "/lib/systemd/systemd" ]

docker build . -t al2sys
docker run --rm -it --privileged al2sys
[!!!!!!] Failed to mount API filesystems, freezing.

I've tried remounting the /sys/fs/cgroup inside the hyperkit machine but nothing seems to stick...

 Context:    default
 Debug Mode: false
  buildx: Docker Buildx (Docker Inc., v0.9.1)
  compose: Docker Compose (Docker Inc., v2.12.1)
  dev: Docker Dev Environments (Docker Inc., v0.0.3)
  extension: Manages Docker extensions (Docker Inc., v0.2.13)
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
  scan: Docker Scan (Docker Inc., v0.21.0)

 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 20.10.20
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
   Profile: default
 Kernel Version: 5.15.49-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 6
 Total Memory: 7.675GiB
 Name: docker-desktop
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
 Live Restore Enabled: false
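
The two fields that matter most for this question can be pulled out of docker info output like the listing above. A small sketch, with a hypothetical helper name (docker_info_field) that just strips the "Name:" prefix from the matching line:

```shell
#!/bin/sh
# Read "docker info" text on stdin and print the value of one field,
# e.g. "Cgroup Driver" or "Cgroup Version".
docker_info_field() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p"
}

# Example using two lines copied from the output above:
docker_info_field 'Cgroup Version' <<'EOF'
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
EOF
# prints: 2
```

On a live system: docker info | docker_info_field 'Cgroup Driver'. Note that the listing above reports cgroupfs rather than systemd as the driver, which may or may not be related to the failure.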

Based on @BubbleQuote's answer, I got it working in docker-compose (with some slight modifications to docker-compose itself):

version: '3.9'

services:
  systemd:  # placeholder service name, the original one is not shown
    image: nyamisty/systemd-ubuntu-v2:18.04
    # -it --privileged --cap-add SYS_ADMIN --security-opt seccomp=unconfined --cgroup-parent=docker.slice --cgroupns private --tmpfs /tmp --tmpfs /run --tmpfs /run/lock
    stdin_open: true # docker run -i
    tty: true # docker run -t
    privileged: true
    cap_add:
      - SYS_ADMIN
    security_opt:
      - seccomp=unconfined
    cgroup_parent: docker.slice
    cgroupns: private
    tmpfs:
      - /run
      - /run/lock
      - /tmp

However, docker-compose currently still does not support a cgroupns key. We can modify the standalone v1 docker-compose (that is, the Python version) to add it:

  1. Install docker-compose via PyPI
  2. grep -r 'cgroup_parent' in site-packages/compose
  3. Add lines for 'cgroupns' just like 'cgroup_parent'. In my case, I changed these files:
    • config/config.py
    • config/compose_spec.json
    • config/config_schema_v1.json
    • service.py
  4. Bump docker-py's version to support cgroupns: pip3 install -U docker
  • using cap_add with privileged: true doesn't make sense, as privileged true already sets all effective capabilities.
    – TheDiveO
    Commented Dec 29, 2023 at 16:19
  • yeah, that's for debugging. If you can achieve it with privileged: true, then you can go back and try to use only cap_add
    – Misty
    Commented Dec 30, 2023 at 20:44
  • not exactly: privileged is more than just setting all capabilities. For instance, it does not create overmounts in various places of the container's file system, especially it does not create tmpfs overmounts inside the proc fs.
    – TheDiveO
    Commented Dec 31, 2023 at 16:17


        if uidMap := daemon.idMapping.UIDMaps; uidMap != nil || c.HostConfig.Privileged || c.HostConfig.CgroupnsMode.IsPrivate() {

I just added || c.HostConfig.CgroupnsMode.IsPrivate() to the condition. Does that fix the issue?

