Today I ran into a broken swarm. I have not figured out why the manager would report itself as “reachability: active, status: down”.
The manager
Fixing the manager was, in the end, somewhat easy, even though it would not leave the swarm, would not come back “up” after “systemctl restart docker”, and kept its “down” status across host reboots.
Thus I decided to remove it from the swarm:
# docker swarm leave --force
Error response from daemon: context deadline exceeded
# See https://github.com/moby/moby/issues/25432 for a similar issue
# Here comes the hard kill:
# rm -rf /var/lib/docker/swarm
# docker swarm init
After that, the manager was wiped clean of running services, but stopped containers and, most importantly, volumes were still intact.
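To confirm that the stopped containers and the volumes really had survived the wipe, the standard listing commands are enough (nothing here is specific to this incident):
# docker ps -a
# docker volume ls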
The worker
The worker had the following issue when starting dockerd:
Error starting daemon: Unable to get the TempDir under /var/lib/docker: mkdir /var/lib/docker/tmp: no space left on device
A quick “df -h” reported plenty of free disk space.
I found a nifty script that counts the files in every subdirectory, but with all inodes exhausted it can take ages to finish:
#!/bin/bash
# count_em - count files in all subdirectories under current directory.
echo 'echo $(ls -a "$1" | wc -l) $1' >/tmp/count_em_$$
chmod 700 /tmp/count_em_$$
find . -mount -type d -print0 | xargs -0 -n1 /tmp/count_em_$$ | sort -n
rm -f /tmp/count_em_$$
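A quicker alternative for per-directory inode usage, assuming a GNU coreutils version that supports du --inodes (I did not check which version was on this box), would be:
# du --inodes -s -- * | sort -n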
What gave me a hint was this GitHub issue, which was confirmed by `df -i` listing the inode usage:
# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/root 393216 13332 379884 4% /
...
/dev/sdb1 1310720 1310712 8 100% /var/lib/docker
To find the folder with many files (as the above script was taking ages to finish), I used
# du -sh *
20K builder
72K buildkit
1.2M containerd
12K containers
4.0K count.sh
9.0M image
16K lost+found
^C
As you can see, I cancelled it in the end, as it was taking a tremendous amount of time traversing the “network” folder (which would have been the next one printed).
Running
/var/lib/docker ]# ll network/files/ > output.txt
gave me an output.txt of 324,627 lines and 36 MB in size. So I figured the network folder had far too many files: about 324,626 entries, each a directory holding three small files, adds up to roughly 1.3 million inodes, which matches the 1,310,712 used inodes reported above for /var/lib/docker.
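If all you need is the entry count, an unsorted directory listing is much faster and avoids writing a 36 MB file; ls -f disables sorting (and includes . and .., so the count is off by two):
# ls -f network/files/ | wc -l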
Checking the contents of one of these entries shows the per-container hosts and resolv.conf files that Docker maintains:
ll network/files/0001b54304af9d1ec13b5624636851e3b5e26933b108851504a95b74c7bebe3a
total 34268
4 drwxr-xr-x 2 root root 4096 May 25 00:29 .
34252 drwxr-x--- 1 root root 35008512 Jun 18 11:38 ..
4 -rw-r--r-- 1 root root 160 May 25 00:29 hosts
4 -rw-r--r-- 1 root root 38 May 25 00:29 resolv.conf
4 -rw-r--r-- 1 root root 71 May 25 00:29 resolv.conf.hash
Then I found moby/moby#27893, where mvollrath commented:
We’ve been seeing this issue where the daemon fails to start due to the conflict @dansanders pasted, our fix is to
rm /var/lib/docker/network/files/*
and restart the daemon. It resets all networks.
– mvollrath on GitHub
This of course doesn’t work with that many files:
# rm -f /var/lib/docker/network/files/*
-su: /bin/rm: Argument list too long
But this will:
# rm -rf /var/lib/docker/network/files
# mkdir /var/lib/docker/network/files
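An alternative that avoids re-creating the directory (and thus keeps its original mode, the drwxr-x--- seen in the listing above) would be to let find delete the contents; this assumes a find that supports -delete, e.g. GNU find:
# find /var/lib/docker/network/files -mindepth 1 -delete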
# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/root 393216 13332 379884 4% /
...
/dev/sdb1 1310720 12253 1298467 1% /var/lib/docker
That’s much better. Restarting Docker now succeeded:
# systemctl restart docker
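To double-check that the daemon really was up again, and in what swarm state, docker info can be queried directly (the Go template assumes the .Swarm.LocalNodeState field, present in recent Docker versions):
# systemctl is-active docker
# docker info --format '{{.Swarm.LocalNodeState}}'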
Of course, because of my deletion of /var/lib/docker/swarm and the fresh “docker swarm init”, the node was not part of a swarm anymore:
# docker node ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
So I joined the worker back up with the manager, using the join token created by the manager’s “docker swarm init” above.
(on the manager) # docker swarm join-token worker
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-1bcrp3j...rxdmz1 x.x.x.x:pppp
(on the worker) # docker swarm join --token SWMTKN-1-1bcrp3j...rxdmz1 x.x.x.x:pppp
This node joined a swarm as a worker.
A final check confirmed a cluster back in operation, with volumes and containers intact (if stopped). So all I had to do was recreate the services, and they were restored to their old strength.
# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
labqmv1ocvl8xiys6dcink34j * test-xxx Ready Active Leader 18.06.2-ce
bqsto21zgm3kb0prz8rx6jkz7 test-xxx Ready Active 18.06.2-ce
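Recreating the services is then just a matter of one docker service create per service; a hypothetical example (service name, image, replica count and port are placeholders, the real services were specific to this cluster):
# docker service create --name web --replicas 2 --publish 80:80 nginx:alpine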
For the record: the file-count script did eventually return, giving me
...
426 ./overlay2/bb6365191d3c2d882b84657677697bfd09875ebd91c4313aa1b2790f0659e760/...
456 ./overlay2/bb6365191d3c2d882b84657677697bfd09875ebd91c4313aa1b2790f0659e760/...
649 ./overlay2/bb6365191d3c2d882b84657677697bfd09875ebd91c4313aa1b2790f0659e760/...
977 ./image/overlay2/distribution/...
977 ./image/overlay2/distribution/...
324626 ./network/files
The last entry is the /var/lib/docker/network/files problem already found above; the other takeaway is that all the other folders contained a normal number of files.