The last few years have seen an explosion of new technologies for running applications, and unprecedented collective lapses in applying security best practices. Forgetting all we learned over the past 40 years, we tell ourselves that running microservices in containers provides all the security we need. We convince ourselves that operating behind a firewall protects us from harm. We are sure it is okay that we terminate TLS at the load balancer. What could possibly go wrong?

Q: What could go wrong?
A: A lot

As it turns out, a lot can go wrong. While container runtimes provide good isolation for the running container from the host operating system, the isolation is not complete and should not be assumed to be free from exploits. The last few years have seen several privilege escalation defects in Docker. That is not to say you should not use containers. Rather, the takeaway is that when you use containers, you should continue to use the security best practices we relied on when we were not using containers.

In this post, we'll take a look at some common regressions in security practices associated with the migration to Docker and Kubernetes and suggest ways to avoid them. We will also give a few pointers on how to increase your security when running in containers as compared to virtual machines or standard hosts.

Running as root

One of the most common security lapses, and one of the easiest to address, is running binaries as root. Running binaries as root introduces a few security concerns. First, file system permissions do not apply to the root user. The root user can read and, more importantly, write any file on the file system. If someone can compromise your container, they can change any file in the container, including executables. If your main process calls other executables, the attacker can modify one of those executables to get you to execute arbitrary code within your container, for example, code that opens a reverse shell into your container. Game over! Second, by default, root within a container is no different from root on the host operating system. It is true that container runtimes take pains to confine the processes running in a container and their privileges, but these runtimes do have flaws, and vulnerabilities have been found that allow escape from the container. If such an escape is successful, it is much better if the escaped process is not running as a privileged user.

In the olden days, we created different users for each daemon or service, made sure file ownership and permissions were correct, and ran the binary using that user. In the even older days we ran all the daemons and services as the nobody user because we couldn't be bothered with creating all those different users. Many have returned to those days, despite how simple it is to create a user in a Docker container. One simply needs to add a single RUN command to the Dockerfile. For example,

RUN groupadd --gid 2866 atomist \
    && useradd --home-dir /home/atomist --create-home --uid 2866 \
        --gid 2866 --shell /bin/sh --skel /dev/null atomist

Once you've created the user, you can use the USER instruction in your Dockerfile to make it the default user when your image is run as a container.

USER atomist:atomist

If the process you are running in the container writes files, e.g., temporary files, you will have to make sure the user the process is being run as has write access to those directories.
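One way to catch both problems early is a startup check inside the application itself. The following is a minimal sketch in Python, not something the Dockerfile requires: the `preflight` function and its messages are hypothetical names chosen for illustration, and it assumes the application writes its temporary files to the directory reported by `tempfile.gettempdir()`.

```python
import os
import tempfile


def preflight(euid: int, scratch_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if euid == 0:
        problems.append("running as root (uid 0)")
    # os.access tests permissions with the process's real uid and gid.
    # Creating files in a directory needs both write and search permission.
    if not os.access(scratch_dir, os.W_OK | os.X_OK):
        problems.append(f"no write access to {scratch_dir}")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.geteuid(), tempfile.gettempdir()):
        print(f"preflight: {problem}")
```

Whether the application should refuse to start or only log a warning when the list is not empty is a design choice left to you.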

In Kubernetes, you can enforce running containers as non-root using the pod and container security context.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: app
      name: app
    spec:
      containers:
      - image: my/app:1.0.0
        name: app
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
      securityContext:
        fsGroup: 2866
        runAsNonRoot: true
        runAsUser: 2866
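If you want to check manifests for these settings before they reach the cluster, a small script can do it. The sketch below is only an illustration: `non_root_findings` is a hypothetical helper, it assumes the manifest has already been loaded into a dictionary (for example from `kubectl get -o json` output or a YAML parser), and it covers only the fields used in the manifest above.

```python
def non_root_findings(deployment: dict) -> list[str]:
    """List the ways a Deployment manifest misses the non-root settings."""
    pod_spec = deployment["spec"]["template"]["spec"]
    pod_ctx = pod_spec.get("securityContext", {})
    findings = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext", {})
        name = container.get("name", "<unnamed>")
        # Container-level values take precedence over pod-level values.
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        run_as_user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        if not run_as_non_root:
            findings.append(f"{name}: runAsNonRoot is not true")
        if run_as_user == 0:
            findings.append(f"{name}: runAsUser is 0")
        # Escalation is treated as allowed unless it is explicitly disabled.
        if ctx.get("allowPrivilegeEscalation", True):
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
        if ctx.get("privileged", False):
            findings.append(f"{name}: privileged is true")
    return findings
```

An empty list means every container in the pod template passed these particular checks; it says nothing about settings the sketch does not look at.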

Read-only file system

While we are talking about writing files, do you really need to write files within a container? If you need to write temporary/cache files, fine, but since you are going to lose everything when that container dies, you shouldn't be writing anything of import within a container. Since you are only going to write temporary files, you really don't need your container to have a writable layer. Just mount a volume at /tmp and run your container with a read-only root file system. Here's how to do that with the Docker CLI:

$ docker run --read-only --tmpfs /tmp ...

In Kubernetes, you set the root file system to read-only using the container security context and create an emptyDir volume to mount at /tmp.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: app
      name: app
    spec:
      containers:
      - env:
        - name: TMPDIR
          value: /tmp
        image: my/app:1.0.0
        name: app
        securityContext:
          readOnlyRootFilesystem: true
        volumeMounts:
        - mountPath: /tmp
          name: tmp
      volumes:
      - emptyDir: {}
        name: tmp
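The TMPDIR variable in the manifest matters because many runtimes consult it when choosing where temporary files go. As an illustration, here is a minimal Python sketch; `write_scratch` is a hypothetical helper, and the example assumes the application uses the standard `tempfile` module rather than hard-coded paths.

```python
import os
import tempfile


def write_scratch(data: bytes) -> str:
    """Write data to a new temporary file and return its path."""
    # With no explicit directory, tempfile consults TMPDIR (then TEMP and TMP)
    # before falling back to platform defaults, so the file lands on the
    # writable volume instead of the read-only root file system.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(data)
        return handle.name


if __name__ == "__main__":
    print(write_scratch(b"cache entry"))
```

Code that writes to a fixed location elsewhere on the root file system will fail with an OSError once the root file system is read-only, which is a useful way to find such code.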

Terminating TLS too soon

As long as there have been firewalls, people have chosen to terminate SSL/TLS at the edge of the network. Even back in the days of on-prem corporate IT, we often terminated encrypted network communication at the edge and allowed unencrypted network traffic within our "private" network. As we moved to IaaS, we brought this bad practice along with us, terminating encryption at the load balancer. As these examples indicate, this security risk is not container specific, but container platforms increase the attack surface that unencrypted network communication presents. Since we have continued the practice as we migrate to Kubernetes, an attacker who compromises one container can sniff traffic from many more sources over the internal network.

There are a few different ways to mitigate this risk. An approach that is portable to any Kubernetes cluster involves first moving termination of TLS within the cluster and then transparently encrypting all network traffic within the cluster. Moving the TLS termination within the cluster can be completely automated and managed by Kubernetes using ingress resources fulfilled by the nginx-ingress controller with TLS certificates provisioned by cert-manager. There are a few options for transparently encrypting all network traffic within a Kubernetes cluster, e.g., linkerd and weave.

(Note: There is a note on the weave documentation page that says:

To avoid leaking your password via the kernel process table or your shell history, we recommend you store it in a file and capture it in a shell variable prior to launching weave: export WEAVE_PASSWORD=$(cat /path/to/password-file)

It is worth noting that a process's environment is also exposed through the kernel process table (on Linux, via /proc/<pid>/environ), so the above method does not prevent leaking the password via the kernel process table.)
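The point about the environment is easy to verify on a Linux host. The sketch below is an illustration only: `secret_visible_in_proc` is a hypothetical helper that starts a short-lived child process with a variable in its environment and then looks for that variable in the child's /proc entry, which is readable by the same user and by root. On systems without /proc it returns None instead of an answer.

```python
import os
import subprocess
import sys
from typing import Optional


def secret_visible_in_proc(name: str, value: str) -> Optional[bool]:
    """Start a child with name=value in its environment and look for it in /proc.

    Returns True or False when /proc/<pid>/environ can be read, None otherwise.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        env={**os.environ, name: value},
    )
    try:
        with open(f"/proc/{child.pid}/environ", "rb") as handle:
            # Entries are NUL-separated NAME=value pairs.
            entries = handle.read().split(b"\0")
    except OSError:
        return None
    finally:
        child.kill()
        child.wait()
    return f"{name}={value}".encode() in entries


if __name__ == "__main__":
    print(secret_visible_in_proc("WEAVE_PASSWORD", "not-a-real-password"))
```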

Denial of service

Setting resource limits for your containers protects against a host of denial of service attacks. If the resource usage of the processes in your container is constrained properly, those processes cannot exhaust the memory and CPU resources of your host, which means they will not cause your host to die. The Docker CLI provides the --memory and --cpus command-line options to set memory and CPU resource limits, respectively. The Kubernetes pod specification, which is available for pods, deployments, daemon sets, and jobs, allows you to configure these limits in the container resources property.

To limit a container to half a CPU and half a GiB of memory, the Docker command line would look something like

$ docker run --cpus=0.5 --memory=512m my/app:1.0.0

and a Kubernetes deployment specification would look like

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: app
      name: app
    spec:
      containers:
      - image: my/app:1.0.0
        name: app
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
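The units differ between the two tools, so it helps to confirm that the numbers agree. The following Python sketch converts the Kubernetes quantities to cores and bytes. The helper names are hypothetical, and the parser is deliberately partial: it handles only the suffixes listed in the code, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '500m' (millicores) or '2' to cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    # Binary suffixes are checked first so that 'Mi' is never read as 'M'.
    for suffix, factor in {**_BINARY_SUFFIXES, **_DECIMAL_SUFFIXES}.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)


if __name__ == "__main__":
    print(cpu_cores("500m"))      # 0.5, the same as docker --cpus=0.5
    print(memory_bytes("512Mi"))  # 536870912, the same as docker --memory=512m
```

Note that 512M, without the i, would be 512,000,000 bytes, slightly less than the 512 MiB that Docker's 512m means.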

It is also a good idea to make sure that an unhealthy application shuts down properly so it can be replaced. Kubernetes can help you with this if your application can respond to liveness and readiness checks and you configure them in your pod specification. Like the resource limits, these checks are configured at the container level.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: app
      name: app
    spec:
      containers:
      - image: my/app:1.0.0
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 20
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 3
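        name: app
        ports:
        # named port referenced by the probes; 8080 is an assumed value
        - containerPort: 8080
          name: http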
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 20
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 3

The distinction between a liveness probe and a readiness probe is that the former should indicate whether the application is running, while the latter should indicate whether the application can service requests. If the specified number of consecutive liveness probes fail, Kubernetes kills the container and restarts it according to the pod's restart policy. If the specified number of consecutive readiness probes fail, Kubernetes stops routing traffic to the pod through any services that reference it.
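On the application side, the two endpoints can be very small. Here is a minimal sketch using only the Python standard library. The `/health` and `/ready` paths come from the manifest above; the `ProbeHandler` class, the `ready` flag, the `serve` function, and the default port 8080 are assumptions made for illustration. Kubernetes treats any HTTP status from 200 through 399 as a successful probe.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set by the application once its startup work has finished.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = 200  # the process is up and able to answer
        elif self.path == "/ready":
            status = 200 if ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def serve(port: int = 8080) -> ThreadingHTTPServer:
    """Start the probe endpoints on a background thread and return the server."""
    server = ThreadingHTTPServer(("", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would call `serve()` early and `ready.set()` once it can take requests; a real liveness check would usually also verify that the main work loop is still making progress.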

Kubernetes policies

Kubernetes provides network and pod security policies that give you control over what pods can communicate with each other and what types of pods can be started, respectively.

Pod security policies allow you to control what capabilities pods can have. When pod security policies are enabled, Kubernetes will only start pods that satisfy the constraints of the pod security policies. If someone or something, e.g., a deployment resource, tries to start a pod that violates the pod security policies, Kubernetes will refuse to start it. Some examples of what a pod security policy can restrict are the types of volumes a pod can use, whether it can run privileged containers, whether it can use the host network, which users and groups it can run as, and its SELinux context and sysctl profile. Note that the PodSecurityPolicy API was deprecated in Kubernetes 1.21 and removed in 1.25 in favor of Pod Security Admission, so the following applies to clusters that still serve it. Here is an example of a pod security policy that enforces some of the best practices we have already mentioned.

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: best-practices
spec:
  # non-privileged containers
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  runAsUser:
    rule: MustRunAsNonRoot
  supplementalGroups:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  # restrict file systems
  readOnlyRootFilesystem: false
  volumes:
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
    - persistentVolumeClaim
  # limit interaction with host
  hostNetwork: false
  hostIPC: false
  hostPID: false

Network policies allow you to define ingress and egress rules, i.e., firewall rules, for your pods using IP CIDR ranges and Kubernetes label selectors for pods and namespaces, similar to how Kubernetes service resources select pods. Using label selectors to define sources and destinations for network traffic allows for very flexible and resilient firewall rules in Kubernetes, since pods are treated as cattle and the IP addresses of your applications can vary over time. Here is an example of a network policy that, when created in a namespace, will deny ingress from pods in other namespaces but allow pods within the namespace to communicate with each other.

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: deny-from-other-namespaces
  namespace: mine
spec:
  podSelector:
    matchLabels:
  ingress:
  - from:
    - podSelector: {}

The above policy has a pod selector that is empty, meaning it selects all pods in the namespace. Since pods in other namespaces are not selected by the ingress rule, traffic from them is rejected. There is a GitHub repository of common network policies to help you get started using network policies.
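To make the selection rules concrete, here is a small Python model of how this policy is evaluated. It is a simplified sketch and not how Kubernetes or any network plugin implements it: `selector_matches` and `ingress_allowed` are hypothetical helpers, only matchLabels selectors are supported, and peers that use namespaceSelector or ipBlock are ignored, as are ports.

```python
def selector_matches(match_labels, labels: dict) -> bool:
    """True when every listed key/value pair is present in labels.

    An empty or missing matchLabels therefore matches every pod.
    """
    return all(labels.get(k) == v for k, v in (match_labels or {}).items())


def ingress_allowed(policy: dict, source_namespace: str, source_labels: dict) -> bool:
    """Decide whether a source pod may reach the pods selected by the policy."""
    for rule in policy["spec"].get("ingress", []):
        if not rule.get("from"):
            return True  # a rule with no "from" admits every source
        for peer in rule["from"]:
            if "podSelector" in peer and "namespaceSelector" not in peer:
                # A podSelector on its own only considers pods in the
                # namespace the policy lives in.
                same_namespace = source_namespace == policy["metadata"]["namespace"]
                match_labels = (peer["podSelector"] or {}).get("matchLabels")
                if same_namespace and selector_matches(match_labels, source_labels):
                    return True
    return False


policy = {
    "metadata": {"name": "deny-from-other-namespaces", "namespace": "mine"},
    "spec": {"podSelector": {"matchLabels": None},
             "ingress": [{"from": [{"podSelector": {}}]}]},
}
```

With this model, `ingress_allowed(policy, "mine", {"app": "web"})` is true and `ingress_allowed(policy, "other", {"app": "web"})` is false, matching the behavior described above.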

Batten down the hatches!

This is not an exhaustive list of suggestions for hardening your containers and Kubernetes cluster, but it provides a good starting point of things that are not too difficult to do but still increase the depth of your defensive posture. That's right, "defense in depth" is still important even in the world of containers. The container is not safe. The operating system is not safe. The host is not safe. The network is not safe.

Remain vigilant! As Mr. Melville teaches us, even if we are not aware of them, dangers are everywhere.

All men live enveloped in whale-lines. All are born with halters round their necks; but it is only when caught in the swift, sudden turn of death, that mortals realise the silent, subtle, ever-present perils of life.

-- Herman Melville, Moby-Dick; or, The Whale