Secure Docker-in-Kubernetes

Cesar Talledo

January 03, 2022

Intro

This post shows you how to run Docker inside a secure (rootless) Kubernetes pod. That is, you create one or more Kubernetes pods and run Docker inside each of them.

While running Docker inside pods is not new, what’s different here is that the pod will not be an insecure “privileged” pod. Instead, it will be a fully unprivileged pod launched with Kubernetes and the new Sysbox runtime (free & open-source), which means you can use this setup in enterprise settings where security is very important.

I will show you how to set this up quickly and easily with examples, and afterwards you can adjust these per your needs.

Motivation

There are several use cases for running Docker inside a Kubernetes pod; a couple of useful ones are:

  1. Creating a pool of Docker engines on the cloud. Each user is assigned one such engine and connects remotely to it via the Docker CLI. Each Docker engine runs inside a Kubernetes pod (instead of a VM), so operators can leverage the power of Kubernetes to manage the pool’s resources.
  2. Running Docker inside Kubernetes-native CI jobs. Each job is deployed inside a pod and uses the Docker engine running inside the pod to build container images (e.g., with BuildKit), push them to some repo, run them, etc.

In this blog post I focus on the first use case. A future blog post will focus on the second use case.

Setup

The setup I will create is as follows:

  • Kubernetes will deploy the pods with the Sysbox runtime.
  • Each pod will run a Docker engine and an SSH server inside it.
  • Each Docker engine will be assigned to a user (say a developer working from home with a laptop).
  • The user will connect remotely to her assigned Docker engine using the Docker CLI.

Why is the Sysbox Runtime Needed Here?

Prior to Sysbox, the setup shown above required insecure “privileged” containers or VM-based alternatives such as KubeVirt.

But privileged containers are too insecure, and VMs are slower, heavier, and harder to set up (e.g., KubeVirt requires nested virtualization on the cloud).

With Sysbox, you can do this more easily and efficiently, using secure (rootless) containers and without resorting to VMs.

Kubernetes Cluster Creation

Ok, let’s get to it.

First, you need a Kubernetes cluster with Sysbox installed in it. It’s pretty easy to set this up as Sysbox works on EKS, GKE, AKS, on-prem Kubernetes, etc.

See these instructions to install Sysbox on your cluster.

For this example, I am using a 3-node Kubernetes cluster on GKE, and I’ve installed Sysbox on it with this single command:

kubectl apply -f https://raw.githubusercontent.com/nestybox/sysbox/master/sysbox-k8s-manifests/sysbox-install.yaml
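This manifest deploys a daemonset that installs Sysbox on the worker nodes; it should also register a runtime class for Sysbox. A quick way to verify the installation (assuming the default sysbox-runc runtime class name):

$ kubectl get runtimeclass   # sysbox-runc should appear in the list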

Defining the Pods (with Docker inside)

Once Sysbox is installed on your cluster, the next step is to define the pods that will carry the Docker engine inside them.

We need a container image that carries the Docker engine. In this example, I use an image called nestybox/alpine-supervisord-docker:latest that carries Alpine + Supervisord + sshd + Docker. The Dockerfile is here.

Next, let’s create a Kubernetes StatefulSet that will provision 6 pod instances (e.g., 2 per node). Each pod will allow remote access to the Docker engine via ssh. Here is the associated yaml file (dockerd-statefulset.yaml):
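The yaml below is a sketch reconstructed from the description in this post rather than a verbatim copy of the original file; in particular, the dockerd serviceName, the app: dockerd label, and the docker-cache volume claim name are assumptions.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dockerd-statefulset
spec:
  serviceName: dockerd
  replicas: 6
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: dockerd
  template:
    metadata:
      labels:
        app: dockerd
      annotations:
        io.kubernetes.cri-o.userns-mode: "auto:size=65536"
    spec:
      runtimeClassName: sysbox-runc
      containers:
      - name: alpine-supervisord-docker
        image: nestybox/alpine-supervisord-docker:latest
        ports:
        - containerPort: 22
        volumeMounts:
        - mountPath: /var/lib/docker
          name: docker-cache
  volumeClaimTemplates:
  - metadata:
      name: docker-cache
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gce-pd
      resources:
        requests:
          storage: 2Gi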

Before we apply this yaml, let’s analyze a few things about it.

First, we use a StatefulSet (instead of a Deployment) because we want each pod to have unique and persistent network and storage resources across its life cycle. This way if a pod goes down, we can recreate it and it will have the same IP address and the same persistent storage assigned to it.

Second, note the following about the StatefulSet spec:

  • It creates 6 pods in parallel (see replicas and podManagementPolicy).
  • The pods are rootless by virtue of using Sysbox (see the cri-o annotation and sysbox-runc runtimeClassName).
  • Each pod exposes port 22 (ssh).
  • Each pod has a persistent volume mounted onto the pod’s /var/lib/docker directory (see next section).

Persistent Docker Cache

In the StatefulSet yaml shown above, we mounted a persistent volume on each pod’s /var/lib/docker directory.

Doing this is optional, but enables us to preserve the state of the Docker engine (aka “the Docker cache”) across the pod’s life cycle. This state includes pulled images, Docker volumes and networks, and more. Without this, the Docker state will be lost when the pod stops.

Note that each pod must have a dedicated volume for this. Multiple pods can’t share the same volume because each Docker engine must have a dedicated cache (it’s a Docker requirement).

Also, note that the persistent storage is provisioned dynamically (at pod creation time, one volume per pod). This is done via a volumeClaimTemplate directive, which claims a 2GiB volume of a storage class named "gce-pd".

What is “gce-pd”? It’s a storage class that uses the Google Compute Engine (GCE) storage provisioner. The resource definition is below (gce-pd.yaml):
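A sketch of such a storage class, using the in-tree GCE PD provisioner (the pd-standard disk type is an assumption):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gce-pd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard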

Since my cluster is on GKE, using the GCE storage provisioner makes sense. Depending on your scenario, you can use any other provisioner supported by Kubernetes (e.g., AWS EBS, Azure Disk, etc).

In addition, whenever we use volumeClaimTemplate, we must also define a dummy local-storage class (as otherwise Kubernetes will fail to deploy the pod). Here is the resource definition (local-storage.yaml):
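A sketch of the dummy local-storage class, using the no-provisioner placeholder:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer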

Deploying the Pods

With this in place, we can now apply the yamls shown in the prior section.

$ kubectl apply -f gce-pd.yaml
$ kubectl apply -f local-storage.yaml
$ kubectl apply -f dockerd-statefulset.yaml

If all goes well, you should see the StatefulSet pods deployed within 10 to 20 seconds.
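A quick way to check (pod names follow the dockerd-statefulset-N pattern):

$ kubectl get pods -o wide   # all six pods should reach the Running state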

You should also see the persistent volumes that Kubernetes dynamically allocated to the pods:
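Each pod should have one bound claim and volume:

$ kubectl get pvc
$ kubectl get pv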

Verify the Pods are Working

Let’s exec into one of the pods to verify all is good:
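Using the first pod of the StatefulSet as an example:

$ kubectl exec dockerd-statefulset-0 -- ps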

Perfect: supervisord (our process manager in the pod) is running as PID 1, and it has started dockerd and sshd.

Let’s check that Docker is working well:
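A couple of illustrative commands against the Docker engine inside the pod:

$ kubectl exec dockerd-statefulset-0 -- docker ps
$ kubectl exec dockerd-statefulset-0 -- docker run --rm alpine echo "docker works"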

Great, Docker is responding normally.

Finally, check that the pod is unprivileged (rootless):
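One way to do this is to inspect the pod's user-namespace ID mappings:

$ kubectl exec dockerd-statefulset-0 -- cat /proc/self/uid_map
         0     362144      65536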

This means user-ID 0 in the pod (root) is mapped to user-ID 362144 on the host, and the mapping extends for 65536 user-IDs.

In other words, you can work as root inside the pod without fear, as it has no privileges on the host.

Exposing the Pod’s IP Outside the Cluster

Now that the pods are running, we want to access the Docker engine inside each pod. In this example, we want to access the pods from outside the cluster, and do it securely.

For example, we want to give a developer sitting at home with her laptop access to a Docker engine inside one of the pods we’ve deployed.

To do this, we are going to create a Kubernetes LoadBalancer service that exposes the pod’s SSH port externally.

Note that we need one such service per pod (rather than a single service that load balances across several pods). The reason is that the pods we’ve created are not fungible: each one carries a stateful Docker engine.

The simplest (but least automated) way to do this is to manually create a LoadBalancer service for each pod. For example, for pod dockerd-statefulset-0:
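A sketch of such a service; the statefulset.kubernetes.io/pod-name selector is the per-pod label Kubernetes adds to StatefulSet pods, and the service name is illustrative.

apiVersion: v1
kind: Service
metadata:
  name: dockerd-statefulset-0-lb
spec:
  type: LoadBalancer
  selector:
    statefulset.kubernetes.io/pod-name: dockerd-statefulset-0
  ports:
  - name: ssh
    protocol: TCP
    port: 22
    targetPort: 22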

Applying this yaml causes Kubernetes to expose port 22 (SSH) of the dockerd-statefulset-0 pod via an external IP:
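For example (the yaml filename is illustrative):

$ kubectl apply -f dockerd-statefulset-0-lb.yaml
$ kubectl get svc dockerd-statefulset-0-lb   # the EXTERNAL-IP column shows the assigned address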

We need to repeat this for each of the pods of the StatefulSet.

Note that there are automated ways to do this, but they are beyond the scope of this blog.

Connecting Remotely to the Pods

Now that you have the pods running on the cluster (each pod running an instance of Docker engine) and a service that exposes each externally, let’s connect to them remotely.

There are two parts to accomplish this:

  1. Configure ssh access to the pod.
  2. Use the Docker CLI to connect to the pod remotely via ssh.

SSH config

To do this:

  • Exec into one of the pods and create a password for user root inside the pod (see the example after this list).
  • Give the pod’s external IP address (see prior section) and password to the remote user in some secret way.
  • The remote user copies her machine’s public SSH key (e.g., generated with ssh-keygen) to the pod.

For example, if the pod’s external IP is 35.194.9.153:
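A sketch of both steps; the passwd prompt is interactive, and the IP address is the one from this example:

# Inside the pod: set a password for the root user.
$ kubectl exec -it dockerd-statefulset-0 -- passwd root

# On the remote user's machine: copy her public SSH key to the pod.
$ ssh-copy-id root@35.194.9.153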

Docker CLI Access

After SSH is configured, the last step is to set up the Docker client to connect to the remote Docker engine. For example:
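A minimal sketch using a Docker context over SSH (the context name remote-dockerd is illustrative):

$ docker context create remote-dockerd --docker "host=ssh://root@35.194.9.153"
$ docker context use remote-dockerd

Alternatively, setting DOCKER_HOST=ssh://root@35.194.9.153 in the environment has the same effect for a single shell session.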

And now we can access the remote Docker engine:
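For example:

$ docker ps
$ docker run --rm alpine echo "hello from the remote Docker engine"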

There it is! The remote user can now use her dedicated Docker engine to pull and run images as usual.

At this point you have a working setup and you can use your remote Docker as if it were your local one.

The remaining sections describe topics you should keep in mind as you work with the remote cluster.

Sharing Docker Images across Docker Engines

In the current setup, each Docker engine was configured with a dedicated persistent Docker cache (to cache container images, Docker volumes, networks, etc.).

But what if you want multiple Docker engines to share an image cache?

You may be tempted to do this by having multiple Docker pods share the same Docker cache. For example, create a persistent volume for a Docker cache and mount the same volume into the /var/lib/docker directory of multiple pods. But this won’t work, because each Docker engine must have a dedicated cache.

A better way to do this is to set up a local image registry using the open-source Docker registry. For example, this local registry could run in a pod within your cluster, and you can then direct the Docker engine instances to pull/push images from it.

How to do this is beyond the scope of this article, but the sketch below shows the basic idea.
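A rough sketch (the registry pod/service names are illustrative, and the Docker engines must be configured to trust this registry, e.g., as an insecure registry if it serves plain HTTP):

# Run the open-source Docker registry inside the cluster.
$ kubectl run registry --image=registry:2 --port=5000
$ kubectl expose pod registry --port=5000

# From any of the Docker pods, push and pull images through the shared registry.
$ docker tag alpine registry:5000/alpine
$ docker push registry:5000/alpine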

Scaling Pod Instances

To scale the pods (i.e., scale up or down), simply modify the replicas: clause in the StatefulSet yaml and apply it again.
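Alternatively, kubectl scale has the same effect without editing the file (assuming the StatefulSet is named dockerd-statefulset):

$ kubectl scale statefulset dockerd-statefulset --replicas=8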

You will also need to create the Load Balancer service for any newly added pods.

Note however that when you scale down, the Load Balancer services and persistent volumes mounted on the pod’s /var/lib/docker are not removed automatically (you must explicitly remove them as shown next).

Persistent Volume Removal

In the StatefulSet we created above, we asked Kubernetes to dynamically create a persistent volume for each pod and mount it on the pod’s /var/lib/docker directory when the pod is created (see section Persistent Docker Cache above).

When the pod is removed however, Kubernetes will not remove the persistent volume automatically. This is by design, because you may want to keep the contents of the volume in case you recreate the pod in the future.

To remove the persistent volume do the following:

  1. Stop the pod using the persistent volume.
  2. List the persistent volume claims (pvc):

  3. Remove the desired pvc; this will also remove the persistent volume:
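The commands below cover steps 2 and 3; the claim name is illustrative, since it depends on the volumeClaimTemplate and pod names:

$ kubectl get pvc
$ kubectl delete pvc docker-cache-dockerd-statefulset-0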

Docker Build Context

When running the Docker engine remotely, be careful with Docker builds. The reason: the Docker CLI will transfer the “build context” (i.e., the directory tree where the Dockerfile is located) over the network to the remote Docker engine. This can take a long time when the build context is large.

Docker BuildKit may help here, since it tracks changes and only transfers the portion of the build context that has changed since a prior build.
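For example, BuildKit can be enabled per invocation via an environment variable (the image tag is illustrative):

$ DOCKER_BUILDKIT=1 docker build -t myimage .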

Conclusion

Running Docker inside Kubernetes pods has several use cases, such as offloading the Docker engine away from local development machines (e.g., for efficiency or security reasons).

However, until recently doing this required very insecure privileged pods or VMs.

In this blog, I showed how to do this easily & securely with pure containers, using Kubernetes + Sysbox. Hope you found this helpful.

If you are interested in learning more about Sysbox, check out the Sysbox GitHub repo or join the Sysbox Slack channel.

Originally published at https://blog.nestybox.com on January 3, 2022.
