When a pod is unhealthy in a Kubernetes cluster, does anyone notice? Have you ever deployed a new version of an app to Kubernetes, tried to test the new feature you added or the bug you fixed, and found the same behavior as before? Have you then double-checked your code, rerun your tests, and checked a few more things, only to realize that while the deployment got updated, the new pods never replaced the old ones because of a misconfiguration or some other mistake? If so, the Kubernetes Pod Health Monitor skill is for you.

Kubernetes does a great job of hiding problems from developers and operators. When a container dies, Kubernetes automatically replaces it. When a container is unhealthy, Kubernetes stops routing requests to it. When you add a reference to a non-existent secret to a deployment, Kubernetes keeps the old version running. When there are insufficient resources to schedule a pod, Kubernetes can even spin up new nodes to accommodate the request.

While this built-in resilience is great for availability and for being able to sleep when you are on call, it can hide significant problems. For example, if an application has a memory leak, Kubernetes will gladly kill it every time its memory use grows over the limit, making it seem from the outside that everything is okay. Or, if an application has performance issues, Kubernetes may scale up the number of replicas to service all requests in a timely manner, significantly increasing your operating costs when a small development effort could have fixed the underlying issue.

We certainly do not want to forfeit the resilience Kubernetes provides, but it would be great to know when pods in your clusters are having issues so you can address root causes rather than having Kubernetes paper over them. This is why we created the Kubernetes Pod Health Monitor.

Kubernetes Pod Health Monitor

The Kubernetes Pod Health Monitor is an Atomist Skill that listens for changes to pods, examines the pod status, and sends alerts to either Slack or Microsoft Teams if a pod is not healthy. The types of unhealthy states the pod health monitor checks for include the following (the sketch after this list shows where each surfaces in the pod status):

  • Image pull back-off
  • Crash loop back-off
  • OOMKilled containers, i.e., containers killed because they exceeded their memory limit
  • Elevated container restarts
  • Containers not in a ready state
  • Unscheduled pods
  • Misconfigured pods, e.g., a pod that references a secret that does not exist

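To make these states concrete, here is a minimal sketch in Go, using the Kubernetes client-go library, of where each signal lives in the pod status. This is an illustration of the API, not the skill's actual implementation, and the restart threshold is an assumption made for the example.

```go
// podcheck.go: a sketch of where the unhealthy states above surface in
// the Kubernetes pod status, using client-go. Illustrative only.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; code running in-cluster would use
	// rest.InClusterConfig() instead.
	home, _ := os.UserHomeDir()
	config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(home, ".kube", "config"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List pods across all namespaces.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	const restartThreshold = 5 // illustrative cutoff for "elevated" restarts
	for _, pod := range pods.Items {
		// Unscheduled pods report a PodScheduled=False condition
		// with reason Unschedulable.
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse && cond.Reason == "Unschedulable" {
				fmt.Printf("%s/%s: unschedulable: %s\n", pod.Namespace, pod.Name, cond.Message)
			}
		}
		for _, cs := range pod.Status.ContainerStatuses {
			// Image pull back-off, crash loop back-off, and config errors
			// (e.g., a missing secret) appear as container waiting reasons.
			if w := cs.State.Waiting; w != nil {
				switch w.Reason {
				case "ImagePullBackOff", "CrashLoopBackOff", "CreateContainerConfigError":
					fmt.Printf("%s/%s [%s]: %s\n", pod.Namespace, pod.Name, cs.Name, w.Reason)
				}
			}
			// An OOM-killed container records OOMKilled as the reason
			// of its last termination.
			if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
				fmt.Printf("%s/%s [%s]: OOMKilled\n", pod.Namespace, pod.Name, cs.Name)
			}
			if cs.RestartCount > restartThreshold {
				fmt.Printf("%s/%s [%s]: %d restarts\n", pod.Namespace, pod.Name, cs.Name, cs.RestartCount)
			}
			if !cs.Ready {
				fmt.Printf("%s/%s [%s]: not ready\n", pod.Namespace, pod.Name, cs.Name)
			}
		}
	}
}
```

Run against a live cluster, this prints one line per unhealthy signal; the skill performs the equivalent inspection continuously, on every pod change, rather than as a one-off list.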
The pod health monitor works in concert with the Atomist k8svent utility, which runs in your Kubernetes clusters and sends pod statuses to Atomist.
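For a feel of how such a utility fits together, here is a sketch of the watch-and-forward pattern: subscribe to pod changes with client-go and POST each status to an endpoint. This is not k8svent's actual code, and the endpoint URL is a placeholder; a production forwarder would also batch, retry, authenticate, and use informers to survive dropped watches.

```go
// watch.go: a sketch of the watch-and-forward pattern used by utilities
// like k8svent. The endpoint URL below is a placeholder, not Atomist's API.
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Running inside the cluster, use the pod's service account credentials.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch pod changes across all namespaces.
	watcher, err := clientset.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		// Forward the pod status as JSON to the receiving endpoint.
		body, _ := json.Marshal(map[string]interface{}{
			"namespace": pod.Namespace,
			"name":      pod.Name,
			"status":    pod.Status,
		})
		resp, err := http.Post("https://example.com/pod-status", "application/json", bytes.NewReader(body))
		if err != nil {
			fmt.Println("post failed:", err)
			continue
		}
		resp.Body.Close()
	}
}
```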

ChatOps

When an unhealthy pod is detected, an alert message is sent to a channel in Slack or Microsoft Teams. For example, here is a Slack alert:

[Screenshot: a Slack alert for a couple of unhealthy Kubernetes pods]

So now, instead of being blithely ignorant of problems in your cluster, you have the choice to do something about them. You may, of course, choose to do nothing and let Kubernetes do what it does best, but if a problem is persistent or costly, you now know about it, know how severe it is, and can work to fix the root cause. Knowing truly is half the battle.

Get in the Know

Don't remain in the dark about the problems in your Kubernetes cluster any longer! Here's a video showing you how to enable the Kubernetes Pod Health Monitor skill in your Kubernetes clusters.

[Video: enabling the Kubernetes Pod Health Monitor skill]

The Atomist channel on YouTube has other videos to help you sign up with Atomist and configure the Kubernetes and Slack integrations.

Let us know what you think of the Kubernetes Pod Health Monitor and whether there are any other unhealthy states we should send alerts for.