Docker Swarm with a thousand containers

In a research project at the FH Aachen we are simulating multi-agent systems. Since I am quite a fan of Docker and we already have deployable containers coming from the CI/CD pipeline, we want to use the containerized deployment for the simulation too.

At first glance, using Docker to run a thousand containers for a simulation program seems like a bad design choice: the communication between the participants generates overhead, while running on the same physical machine provides no performance improvement. But we have 5 physical machines at the institute (running as a Proxmox cluster, which is another story), so some kind of distribution management is needed. Furthermore, we want to keep the option to run the simulation on a Raspberry Pi 4 cluster (consisting of 144 RPi4 with 4 GB each, which is another story too).

Additionally, one of our use cases focuses on emulating the real-world deployment, which also includes the network stack and the communication between the services. Using docker-compose to define the services is a very easy way to generate a dynamic, fast test deployment, as sketched below.
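A minimal sketch of what such a service definition could look like; the image names and the agent/broker services are placeholders for illustration, not our actual setup:

cat > docker-compose.yml <<'EOF'
version: "3.8"
services:
  agent:
    # hypothetical simulation agent image
    image: registry.example.com/sim-agent:latest
  broker:
    # placeholder message broker for inter-agent communication
    image: eclipse-mosquitto:2
EOF
# fast local test deployment with three agent instances
docker-compose up -d --scale agent=3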

For a distributed deployment, a container orchestration tool is needed:

  • Kubernetes (k8s)
  • Docker Swarm (not to be confused with the old classic-swarm)

A good analogy for comparing their capabilities is Proxmox vs. OpenStack: Proxmox is less configurable and cannot be set up as a multi-tenant application, but works great for smaller deployments. In the same way, Docker Swarm comes integrated with Docker and does not need a lot of configuration, and as a bonus it is compatible with docker-compose.yml files. For Kubernetes a lot has to be configured manually (even though plug-and-play options for single nodes exist with minikube and k3s). Setting up a cluster with HA capability cannot get easier than running docker swarm init.
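For reference, bootstrapping such a cluster is just two commands; the IP address below is a placeholder, and the join token is printed by the init command:

# On the first manager node (placeholder address):
docker swarm init --advertise-addr 192.168.1.10
# On each additional node, run the join command printed above:
docker swarm join --token <worker-token> 192.168.1.10:2377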

Sounds too good

I know it sounds too good, and it indeed is. For large deployments one slowly reaches the point where the complexity of a system can’t be hidden anymore, things go wrong, and one questions whether it would have been better to opt for the “big setup” from the beginning instead of wasting time configuring the “easy setup” solution.

Running out of IP addresses

When running docker swarm init, a default address pool of 10.0.0.0/8 is reserved and each deployed stack gets a /24 network, so 2^8 = 256 addresses are available inside the stack. As each container gets an IPv4 address behind the docker0 NAT interface, a /24 per stack quickly runs out at our container counts. Additional attention must be spent on the host network, as it may interfere with the docker network.

I used docker swarm init --default-addr-pool 10.128.0.0/12 --default-addr-pool-mask-length 16, which gives each stack a /16 with 2^16 = 65536 addresses and leaves room for 2^4 = 16 stacks running at the same time. Of course one could also override the network in the docker-compose file, but I don’t like to include deployment-specific information with the source code.
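To verify what the swarm and a deployed stack actually got, the following can be checked on a manager node (the stack name sim is a placeholder):

# Address pool and subnet size the swarm was initialized with
docker info --format '{{.Swarm.Cluster.DefaultAddrPool}} /{{.Swarm.Cluster.SubnetSize}}'
# Subnet assigned to the stack's default overlay network
docker network inspect sim_default --format '{{(index .IPAM.Config 0).Subnet}}'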

This solved the problem that containers could not be started because no IP address could be assigned. With the larger pool I could run more than two hundred docker containers (docker runs out of addresses much earlier than the theoretical limit, as old containers seem to still hold their IPs).
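When stale containers are holding on to addresses, pruning usually releases them again:

# Remove stopped containers and unused networks so their addresses are freed
docker container prune -f
docker network prune -f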

Hanging VM, can’t assign MAC address for bridge

The next problem came when running 400 containers. The system was unreliable, became unresponsive and slow, and sometimes the only solution was to restart the VM forcefully. journalctl -e showed some errors about the initialization of the network interfaces of each container.

The error message is described in this issue, which does not seem very docker-related and has already been “fixed” for a while. Luckily someone else in this thread described my exact problem and analyzed what goes wrong. Setting MACAddressPolicy=none instead of persistent for bridges and bonds in /etc/systemd/network/98-default.link helped:

[Match]
Driver=bridge bonding

[Link]
MACAddressPolicy=none
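For the new policy to take effect, udev has to re-read the .link files; rebooting the node works, and reloading udev should suffice as well (a sketch):

# Make udev re-read the .link files so newly created
# bridge/bond interfaces get MACAddressPolicy=none
sudo udevadm control --reload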

Finally, I did not see the hanging /sbin/ifquery --allow hotplug -l again, which had used up all the CPU and slowed down the system. I successfully ran 700 containers in a single stack distributed across 4 hardware nodes.
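For reference, deploying and scaling such a stack is a one-liner each; the stack and service names (sim, sim_agent) are placeholders matching the sketch above:

# Deploy the compose file as a swarm stack, then scale the agents up
docker stack deploy -c docker-compose.yml sim
docker service scale sim_agent=700
# Check on which nodes the running replicas were scheduled
docker service ps sim_agent --filter desired-state=running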

The RAM usage was 52 GB/64 GB on each node (due to the big docker image). CPU idle was very low. As about 650 containers share the same image and thus its layers on disk, the hard disk usage is less than 10 GB per host.

Next I will look at how the system behaves under load and see how the RAM usage could be improved, but otherwise this setup is very good and easy.

Other deployments suggested increasing the ARP cache size and tuning other net.ipv4 sysctls, but I never had issues with that and did not change anything.
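For completeness, the tunables usually mentioned are the kernel's neighbour-table thresholds; a sketch of what such a change would look like, with commonly suggested values that I did not apply myself:

# Raise the ARP/neighbour cache limits (illustrative values)
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192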

The docker nodes are running bare Debian 11 with a cloud-init setup and the following install script for docker:

Docker install script
#!/bin/bash
set -e
# Check for root privileges
if [[ $EUID -ne 0 ]]; then
   printf "Please run as root:\nsudo %s\n" "${0}"
   exit 1
fi

apt-get update
apt-get install ca-certificates curl gnupg lsb-release -y
curl -fsSL https://download.docker.com/linux/debian/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian $(lsb_release -cs) stable" > /etc/apt/sources.list.d/docker.list
apt-get update
apt-get install -y docker-ce

curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

# Add the invoking user to the docker group ($USER would be root when run via sudo)
usermod -aG docker "${SUDO_USER:-$USER}"
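After the script has run (and a re-login, so the docker group membership takes effect), a quick sanity check:

docker --version
docker-compose --version
docker run --rm hello-world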