By Joseph Villarreal
When dealing with critical workloads in a Kubernetes cluster, starting and stopping containers is a delicate task that requires orchestration and control. A Canary Release adds even more complexity to the mix, as does a node that is suddenly shut down. The Kubernetes community knows this pain well.
Graceful Node Shutdown was recently introduced in Kubernetes 1.20 (in alpha; it graduated to beta in 1.21). With it, the kubelet can instruct systemd to postpone a node shutdown for a configured duration: ShutdownGracePeriod sets the total delay, and ShutdownGracePeriodCriticalPods reserves part of that window for terminating critical pods. This timeout gives the node a chance to drain and terminate the pods on the system. This feature will help resolve some of these concerns.
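For reference, on clusters where the feature is available, it is configured through the kubelet configuration file. A minimal sketch, with illustrative values rather than recommendations:

# KubeletConfiguration sketch: durations here are illustrative
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true  # required while the feature is alpha/beta
# total time the node delays shutdown to terminate pods
shutdownGracePeriod: 30s
# portion of shutdownGracePeriod reserved for critical pods
shutdownGracePeriodCriticalPods: 10s

With these values, regular pods get the first 20 seconds of the shutdown window and critical pods get the last 10.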
Unfortunately, those of us still running workloads on versions prior to 1.20 are left without a seamless way to handle critical workload termination during a node shutdown, and even more so under other cluster life cycle circumstances, like canary upgrades, rollbacks, and more.
As DevOps developers, we still need to implement logic around graceful pod termination so that we don’t lose data. In this article, I’ll explain two options for shutting down pods gracefully.
Graceful Pod Termination
In sharing this workaround for graceful pod termination, I wanted to avoid clutter and maximize existing resources. By leveraging tools already running in our clusters, and leaning on those we use for CI/CD pipelines, we can make an efficient workaround.
In this walkthrough, we’ll discuss using a combination of Flagger Webhooks, Argo Events, and Argo Workflows with native Kubernetes capabilities – like the Lifecycle Event Hook – to manage graceful pod termination during a node shutdown or other cluster events.
Pod Termination During Upgrades: Flagger and Container Hooks
Kubernetes provides containers with lifecycle hooks. These hooks enable containers to be aware of events in their management lifecycle and run custom code implemented in a handler when the corresponding lifecycle hook is executed. In this instance, these events can be associated with some of the different stages of a release.
The same goes for Flagger: the rollout spec upgrades a deployment by completing a series of steps that belong to predefined stages, and each stage is associated with states and hook calls. There are several types of hooks, each tied to a phase of a Flagger rollout (a gate built on one of them is sketched after this list):
- Confirm-rollout
- Pre-rollout
- Rollout
- Confirm-traffic-increase
- Confirm-promotion
- Post-rollout
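For example, a confirm-promotion webhook acts as a gate: Flagger halts the promotion and keeps retrying until the endpoint answers with HTTP 200. A minimal sketch of how such a gate might be wired into a Canary spec; the canary name, target, and port are assumptions for illustration:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app                # hypothetical canary
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  service:
    port: 8080
  analysis:
    webhooks:
      # Flagger pauses the promotion until this URL returns HTTP 200
      - name: "confirm promotion gate"
        type: confirm-promotion
        url: http://webhook-eventsource-svc.test:12000/flagger/gate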
Graceful Node Termination With Webhooks
The idea is to connect the significant event generated by Flagger and initiate a call to the container webhook in the target deployment. This way, we can initiate a graceful termination instead of forcefully killing the pod, as a normal rolling upgrade does. We can achieve this chain of events through an Argo Workflow.
From the container side, containers expose two hooks to the kubelet: PostStart and PreStop. For the graceful termination strategy via webhooks, we will use PreStop.
The PreStop hook is called immediately before a container is targeted for termination. This termination order can come from a direct API request or from a cluster management event such as a liveness/startup probe failure, pod preemption, resource contention, or a node shutdown.
Upon receiving SIGTERM, the pod should start a graceful shutdown and exit. If a pod doesn’t terminate within the grace period, a SIGKILL signal will still be sent and the container will be terminated, and all existing data on the container’s file system will be lost. It’s important to note that the pod’s termination grace period countdown begins right before the PreStop hook is executed. Regardless of the outcome of the handler, the container will terminate within the pod’s termination grace period (30 seconds by default), even if the handler has not finished running.
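As a concrete illustration, an entrypoint script that honors SIGTERM might look like this minimal sketch; the drain and flush steps are placeholders for real application logic:

#!/bin/sh
# Sketch: trap SIGTERM so the container can shut down cleanly.
graceful_exit() {
  echo "SIGTERM received, draining connections and flushing state..."
  # drain-connections.sh && flush-state.sh   # hypothetical helpers
  exit 0
}
trap graceful_exit TERM

echo "app running (PID $$)"
# keep the main process in the foreground so it receives signals,
# and background the sleep so the trap can fire immediately
while true; do
  sleep 1 &
  wait $!
done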
Possible Implementation for a Termination Hook During an Upgrade
Let’s imagine this scenario: Flagger initiates a Canary Release and reaches the confirm-promotion stage. From this point, it makes an HTTP call to a local cluster service endpoint announcing the old pods need to start closing operations, draining open connections, and so on.
This cluster-local service is created and backed by an Argo Event, so it’s very flexible: the payload, port, and authentication can all be adjusted as needed. From Flagger’s perspective, this webhook call acts as a gate. Any response code other than 200 makes Flagger pause and retry. We will exploit this behavior with an intermediate service that keeps responding with a non-200 code while the workflow runs.
- name: "send events to argo-events webhook"
type: event
url: "http://webhook-eventsource-svc.test:12000/flagger/gate"
metadata:
environment: "flagger-init"
cluster: "Cloud"
Once the Argo Event is captured, it is processed and initiates an Argo Workflow. The complexity or simplicity of this workflow depends on the application and the tasks it needs to carry out before the old pods are deleted. The common denominator is that one of the final steps performs a callback to the gate service to confirm execution – something like http://webhook-eventsource-svc.test:12000/flagger/gate/open. This call changes the response on http://webhook-eventsource-svc.test:12000/flagger/gate to a 200 response code, allowing Flagger to continue the release and kill the pods we can now safely terminate.
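The plumbing between that HTTP endpoint and the workflow is standard Argo Events: a webhook EventSource exposes the service, and a Sensor submits the workflow when an event arrives on it. A rough sketch, where the resource names, namespace, and ports are assumptions chosen to match the webhook-eventsource-svc URLs above:

# Sketch only: names and ports are illustrative assumptions
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: webhook-eventsource
spec:
  service:
    ports:
      - port: 12000
        targetPort: 12000
  webhook:
    flagger-gate:
      port: "12000"
      endpoint: /flagger/gate
      method: POST
---
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: flagger-gate-sensor
spec:
  dependencies:
    - name: gate-event
      eventSourceName: webhook-eventsource
      eventName: flagger-gate
  triggers:
    - template:
        name: run-termination-workflow
        argoWorkflow:
          operation: submit
          source:
            resource:
              # the exit-handlers Workflow shown below would be inlined here
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow

The workflow the Sensor submits is where the application-specific termination logic lives: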
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-handlers-
spec:
  entrypoint: save-my-data
  onExit: exit-handler  # invoke exit-handler template at end of the workflow
  templates:
    # primary workflow template
    - name: save-my-data
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          # this is where you write your app's termination logic,
          # e.g. flush buffers, drain connections, export state
          print("backing up data before the old pods are deleted")
    # Exit handler templates
    # After the completion of the entrypoint template, the status of the
    # workflow is made available in the global variable {{workflow.status}}.
    # {{workflow.status}} will be one of: Succeeded, Failed, Error
    - name: exit-handler
      steps:
        - - name: notify
            template: send-email
          - name: celebrate
            template: celebrate
            when: "{{workflow.status}} == Succeeded"
          - name: cry
            template: cry
            when: "{{workflow.status}} != Succeeded"
    - name: send-email
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["echo send e-mail: {{workflow.name}} {{workflow.status}}"]
    - name: celebrate
      container:
        image: curlimages/curl:latest  # alpine:latest does not ship curl
        command: [sh, -c]
        args: ["curl -X POST http://webhook-eventsource-svc:12000/flagger/gate/open"]
        # the pod can now be deleted
    - name: cry
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["echo 'ERROR: pod was not backed up'"]  # Retry? Call the workflow again
An Alternative Approach: preStop Handler Hook
Another approach, and maybe a more flexible one since it works under more circumstances, is to let the termination workflow be called from within the dying pod itself. Instead of using a logic gate, we allow Flagger, or whatever controller is trying to kill our pod, to proceed through the termination step and send the kill signal to the pod. From the targeted container’s side, this incoming request translates into running the preStop handler hook.
Containers can access a hook by implementing and registering a handler for that hook. There are three types of hook handlers that can be implemented:
- Exec - Executes a specific command or script, such as pre-stop.sh, inside the container.
- TCP - Opens a TCP connection against a specific port on the container.
- HTTP - Executes an HTTP request against a specific endpoint on the container.
In this example, the hook runs a bash script:
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "pre-stop.sh"]
As described above, the script will have “terminationGracePeriodSeconds” seconds to finish – in this case, 60 seconds – before it is killed forcefully.
Notice how this termination logic can also be decoupled from the pod itself. It can simply be a call to an external system, like one of our Argo Events/Workflows-backed API examples, or a direct call to export the pod’s current state via an HTTP request with a payload. We could make that call directly with an HTTP-type hook, or insert it as one of the steps in the Exec hook’s script:
#!/bin/sh
# pre-stop.sh
# … back up and termination logic here …
curl -d '{"data": "invaluableData.json","namespace":"test"}' http://webhook-eventsource-svc:12000/state/save
Some Notes About the Hook Handler Execution
When a container lifecycle management hook is called, the Kubernetes management system executes the handler according to the hook action. httpGet and tcpSocket are executed by the Kubelet process, and exec is executed in the container.
PreStop hooks are not executed asynchronously from the signal to stop the container; the hook must complete its execution before the TERM signal can be sent. If a PreStop hook hangs during execution, the pod will remain in a Terminating state until it is killed after its terminationGracePeriodSeconds expires.
This grace period applies to the total time it takes for both the PreStop hook to execute and for the container to stop normally. If, for example, terminationGracePeriodSeconds is 60 seconds, and the hook takes 55 seconds to complete, and the container takes 10 seconds to stop normally after receiving the signal, the container will be killed before it can stop normally. This is because terminationGracePeriodSeconds is less than the total time (55+10) it takes for these two things to happen. If either a PostStart or PreStop hook fails, it kills the container.
Users should make their hook handlers as lightweight as possible. There are cases, however, when long-running commands make sense, such as when saving state prior to stopping a container.
Debugging Hook Handlers
The logs for a hook handler are not exposed in pod logs. This makes sense if we consider that the pod is being terminated while still executing code, at which point operations like logging have already stopped. If a handler fails for some reason, it instead broadcasts an event. For PostStart, this is the FailedPostStartHook event, and for PreStop, this is the FailedPreStopHook event. You can see these events by running kubectl describe pod <pod-name>.
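To surface these failures across the whole cluster rather than one pod at a time, the same events can be filtered directly:

# list FailedPreStopHook events in every namespace
kubectl get events --all-namespaces --field-selector reason=FailedPreStopHook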
As expected, additional logic can be triggered from these events, including follow-up workflows like alerts or retries.
Final Thoughts
Since Kubernetes is a distributed system with many moving parts, we need to be prepared for errors and crashes. It’s expected that, at some point, one of the multitude of microservices running in our environments will fail or simply need to be terminated and replaced. Our goal as site reliability engineers is to stop those circumstances from becoming the catalyst for a catastrophic series of events that takes down an application or ends in massive data loss.
Many of these situations are unexpected, and we definitely need to prepare for them. However, as we just discussed, one of the most commonly recurring scenarios is simply ending the life of an application pod. Since we will be dealing with this constantly, automating these steps, so that pods are found and replaced safely with graceful shutdown logic, is well worth the effort.