Swarm manager “active but down” and dockerd failing to start: no space left on device

Today I ran into a broken swarm.

I have not figured out why it would report itself as “reachability: active, status: down”.

The manager

Fixing the manager turned out to be somewhat easy, though. It would not leave the swarm, it would not come back “up” after “systemctl restart docker” calls, and the “down” status even survived host reboots.

Thus I decided to force it out of the swarm:

# docker swarm leave --force
Error response from daemon: context deadline exceeded
# See https://github.com/moby/moby/issues/25432 for a similar issue

# Here comes the hard kill:
# rm -rf /var/lib/docker/swarm
# docker swarm init

After that, the manager was wiped clean of running services, but stopped containers and, importantly, volumes were still intact.

The worker

The worker had the following issue when starting dockerd:

Error starting daemon: Unable to get the TempDir under /var/lib/docker: mkdir /var/lib/docker/tmp: no space left on device

A quick “df -h” reported abundant amounts of storage space.
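Free space and free inodes are separate resources: a filesystem can run completely out of inodes while `df -h` still shows plenty of room. A quick throwaway demonstration of the effect (everything happens in a `mktemp` directory; none of this touches the broken host's data):

```shell
# Thousands of empty files consume one inode each but almost no block
# space -- which is why `df -h` can look fine while inodes are exhausted.
demo=$(mktemp -d)
for i in $(seq 1 1000); do
  : > "$demo/file_$i"    # empty file: one inode, zero data blocks
done
du -sh "$demo"           # apparent size stays tiny
ls -f "$demo" | wc -l    # but the directory holds ~1000 entries
rm -rf "$demo"
```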

I found a nifty script that counts the files in every subdirectory, but with all inodes exhausted it can take ages to finish:

# count_em - count files in all subdirectories under current directory.
echo 'echo $(ls -a "$1" | wc -l) $1' >/tmp/count_em_$$
chmod 700 /tmp/count_em_$$
find . -mount -type d -print0 | xargs -0 -n1 /tmp/count_em_$$ | sort -n
rm -f /tmp/count_em_$$
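For reference, the same idea can be sketched more simply, with one `find` per top-level directory instead of a generated helper script. This is an assumption-laden variant (GNU find, the /var/lib/docker layout from this post), not the script I actually ran:

```shell
# Count all entries under each top-level directory and sort ascending,
# so the inode hog ends up at the bottom of the output.
for d in /var/lib/docker/*/; do
  printf '%8d %s\n' "$(find "$d" | wc -l)" "$d"
done | sort -n
```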

What gave me a hint was this GitHub issue, which was confirmed by `df -i` listing inode usage:

# df -i
Filesystem      Inodes   IUsed   IFree IUse% Mounted on
/dev/root       393216   13332  379884    4% /
/dev/sdb1      1310720 1310712       8  100% /var/lib/docker
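To catch this condition before the daemon falls over, the `df -i` output can be filtered for filesystems near inode exhaustion. A rough sketch (the 90% threshold is an arbitrary assumption, and long device names that wrap onto two lines would need extra handling):

```shell
# Print mount point and inode usage for filesystems at or above 90% IUse%.
# Skips the header line and strips the '%' sign before comparing.
df -i | awk 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= 90) print $6, use "%" }'
```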

To find the folder with many files (as the above script was taking ages to finish), I used

# du -sh *
20K	builder
72K	buildkit
1.2M	containerd
12K	containers
4.0K	count.sh
9.0M	image
16K	lost+found

As you can see, I cancelled it at that point: counting the “network” folder (which would have been next in line) was taking a tremendous amount of time.


/var/lib/docker# ll network/files/ > output.txt

gave me an output.txt of 324,627 lines and 36 MB file size. So I figured the network folder had far too many files, exhausting all inodes on /var/lib/docker.
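Counting the entries directly is much cheaper than writing the whole listing to a file, since the expensive part of listing huge directories is sorting them. A sketch using `find`, with the same path as above:

```shell
# Count the immediate entries of the suspect directory without sorting
# and without materializing a multi-megabyte listing on disk.
find /var/lib/docker/network/files -mindepth 1 -maxdepth 1 | wc -l
```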

Checking the contents of network/files, a single entry looks like this:

ll network/files/0001b54304af9d1ec13b5624636851e3b5e26933b108851504a95b74c7bebe3a
total 34268
    4 drwxr-xr-x 2 root root     4096 May 25 00:29 .
34252 drwxr-x--- 1 root root 35008512 Jun 18 11:38 ..
    4 -rw-r--r-- 1 root root      160 May 25 00:29 hosts
    4 -rw-r--r-- 1 root root       38 May 25 00:29 resolv.conf
    4 -rw-r--r-- 1 root root       71 May 25 00:29 resolv.conf.hash

Then I found moby/moby#27893, where mvollrath commented:

We’ve been seeing this issue where the daemon fails to start due to the conflict @dansanders pasted, our fix is to rm /var/lib/docker/network/files/* and restart the daemon. It resets all networks.

mvollrath – github

This of course doesn’t work with that many files:

# rm -f /var/lib/docker/network/files/*
-su: /bin/rm: Argument list too long

But this will:

# rm -rf /var/lib/docker/network/files
# mkdir /var/lib/docker/network/files
# df -i
Filesystem      Inodes IUsed   IFree IUse% Mounted on
/dev/root       393216 13332  379884    4% /
/dev/sdb1      1310720 12253 1298467    1% /var/lib/docker

That’s much better. Restarting Docker now succeeded:

# systemctl restart docker

Of course, after my deletion of /var/lib/docker/swarm, the node was no longer part of a swarm:

docker node ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.

So I joined it back up with the manager, using the join token created by the manager’s “docker swarm init” above.

(on the manager) # docker swarm join-token worker
To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-1bcrp3j...rxdmz1 x.x.x.x:pppp

(on the worker) # docker swarm join --token SWMTKN-1-1bcrp3j...rxdmz1 x.x.x.x:pppp
This node joined a swarm as a worker.

A final check confirmed the cluster was back in operation, with volumes and containers intact (if stopped). So all I had to do was recreate the services, and they were restored to full strength.

# docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
labqmv1ocvl8xiys6dcink34j *   test-xxx            Ready               Active              Leader              18.06.2-ce
bqsto21zgm3kb0prz8rx6jkz7     test-xxx            Ready               Active                                  18.06.2-ce

For the record: the file-count script eventually returned, giving me

426 ./overlay2/bb6365191d3c2d882b84657677697bfd09875ebd91c4313aa1b2790f0659e760/...
456 ./overlay2/bb6365191d3c2d882b84657677697bfd09875ebd91c4313aa1b2790f0659e760/...
649 ./overlay2/bb6365191d3c2d882b84657677697bfd09875ebd91c4313aa1b2790f0659e760/...
977 ./image/overlay2/distribution/...
977 ./image/overlay2/distribution/...
324626 ./network/files

The last entry is the /var/lib/docker/network/files problem already found above; the other takeaway is that the remaining folders had normal file counts.