docker and zombie processes – pid 1 in containers

I realize that the issue I’ll write about is most likely common and known to you. I only came across it today, and this post is for me to remember the most valuable lessons I learned.

The problem

I run Icinga2 in Docker containers. It was a hassle, to say the least. The Docker image they provide didn’t satisfy me as (and they say it themselves) it’s not ready for production. It’s merely meant for development purposes. Because everyone needs monitoring in development… But I guess they do. So yay for them, it helps them!

It didn’t help me so I needed to run my own thing.

After having fought with it for 78 hours now, I got it fully up and running and I believe to the best of my knowledge it’s production ready. (I already put it in production so it better be).

Right off the bat I saw it producing a huge number (178) of zombie processes. Initially they seemed to come from failing ping checks; towards the end of my project they came from email notifications, even ones that went through.

The cause

Looking into what could be causing the issue, I found a wealth of sources talking about a zombie process problem in Docker in general.

The issue is that PID 1 in Linux has a major responsibility that distinguishes it from every other process: it adopts orphaned processes and, once they exit and become zombies, has to call wait() on them so the kernel can remove them from its process table.
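What a zombie is and what wait() does to it can be seen in a few lines. The following is only a minimal sketch, assuming Linux with /proc mounted; the helper names process_state and make_and_reap_zombie are made up for illustration and do not come from Icinga or Docker. It forks a child that exits immediately, watches it sit in state "Z", and then reaps it.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state of a process from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        # Layout is "pid (comm) state ..."; comm may contain spaces or
        # parentheses, so split after the last ")".
        return stat_file.read().rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie():
    """Fork a child that exits at once, observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with a recognisable code

    # Parent: poll (without waiting on the child) until it shows up as "Z".
    deadline = time.monotonic() + 5
    state = process_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)

    # waitpid() collects the exit status; only now does the kernel drop the entry.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)
```

Until the parent calls waitpid(), the dead child keeps its slot in the process table. When the parent never makes that call, the entry stays, and that is exactly the kind of leftover that piles up in a container.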

As Docker advocates the “one process per container” philosophy, PID 1 in a container is more often than not an ordinary application process that is not good at handling these additional responsibilities. A Java or Python application, for example, does not normally wait on arbitrary orphaned processes it inherits; it mostly handles only its own children.

On top of that, software not specifically tailored to run in containers often simply relies on PID 1 to pick up its orphaned processes and wait on them. Icinga, in my case, does exactly that. It spawns its checks as child processes and orphans them, relying on PID 1 to wait on them and remove them from the process table.

But inside the container Icinga itself is PID 1, and it does not wait on the processes it inherits, so un-waited-on zombie processes keep piling up.

The solution

To my surprise, it is common knowledge in the container community that zombies are an issue. I had simply never come across it before.

There are a number of solutions to the problem, all revolving around giving the (Docker) container a minimalistic init system that runs as PID 1. The init system starts the actual application as its child, reaps any zombies, and proxies the signals the container receives through to the application process.
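To show what such an init system boils down to, here is a rough Python sketch of the three jobs: start the application, forward signals, reap everything. It is not how tini or any other real init is implemented, the name run_as_init is hypothetical, and it skips edge cases (for example a signal arriving before the handlers are installed). Orphans are only re-parented to this process when it really runs as PID 1 (or as a subreaper).

```python
import os
import signal


def run_as_init(argv):
    """Start argv as a child, forward signals to it, reap every child that exits."""
    child_pid = os.fork()
    if child_pid == 0:
        try:
            os.execvp(argv[0], argv)  # child: become the actual application
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, _frame):
        # Pass the signal on to the application instead of acting on it here.
        try:
            os.kill(child_pid, signum)
        except ProcessLookupError:
            pass  # the application is already gone

    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, forward)

    while True:
        # -1 waits for ANY child, including orphans adopted by PID 1.
        pid, status = os.waitpid(-1, 0)
        if pid != child_pid:
            continue  # an adopted orphan was reaped; keep going
        if os.WIFSIGNALED(status):
            return 128 + os.WTERMSIG(status)  # shell convention for signal deaths
        return os.WEXITSTATUS(status)
```

Used as an entrypoint it would be called with the application's command line and its return value handed to sys.exit(), so the container exits with the application's exit code.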

I will simply list a couple of solutions that were interesting for me and then link the relevant articles I found for further reading, as all of them are worth your while and take too long to summarise here:

Simply add tini to your Dockerfile like so:

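# Pin the tini release to install
ENV TINI_VERSION=v0.19.0
# Download the binary and its published checksum
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini.sha256sum /tini.sha256sum
# Run the check from / so the file name listed in the checksum file resolves to /tini
RUN cd / && sha256sum -c tini.sha256sum && chmod +x /tini

# Run tini as PID 1; it starts CMD as its child
ENTRYPOINT ["/tini", "--"]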

If you already have an entrypoint defined, define the ENTRYPOINT directive like so (assuming your entrypoint was /docker-entrypoint.sh):

ENTRYPOINT ["/tini", "--", "/docker-entrypoint.sh"]

This results in tini starting as PID 1 and then directly starting whatever follows the -- in ENTRYPOINT, with the CMD from your Dockerfile appended as arguments.

I will remember this! Maybe it helps someone else too.

Sources/further reading