Tutorial Highlights & Transcript
00:00 - Kubernetes Deployment Setup
Here I have a super simple deployment object in Kubernetes. I have set up 20 replicas. I have the strategy commented out for now because I want to show you how the default works. It’s a super simple container: a Node.js application that starts a server, because we need it to keep running. I’ll show you the custom things it has as we get to them. For now, I’m just going to apply this file. Now if we go to our cluster, we see our 20 pods starting up, almost done, and the last one should be ready soon. There we go. If this were a real application serving traffic, then at some point we would need to do a deployment to release a new version. Kubernetes comes with a bunch of built-in features that allow us to perform those tasks with zero downtime. If we just make a change and apply the deployment, Kubernetes itself will do it as best it can without any downtime.
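For reference, the manifest described here might look roughly like the sketch below. It is not the exact file from the video; the name, labels, and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                  # hypothetical name
spec:
  replicas: 20
  # strategy is commented out for now, so the default RollingUpdate
  # behavior (25% maxSurge, 25% maxUnavailable) applies:
  # strategy:
  #   type: RollingUpdate
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
        - name: worker
          image: my-registry/node-worker:v1   # placeholder Node.js image
```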
01:44 - Running the Kubernetes Deployment
I ran into this specific requirement, right? It was not for a traffic-serving application, not for a web application; it was for a back-end worker running in Kubernetes. It doesn’t need to expose a port or anything. It’s just a worker doing stuff in the background. Let’s assume that’s what this is: this is just the back-end worker. And again, same situation, I need to update it. But the thing with workers is that they will always be in the middle of some task while they are running. If I tell Kubernetes I want it to replace version one with version two, it will first send the termination signal to the container. If the container doesn’t exit gracefully within a specified timeout window, then it will just kill it. For workers that might be a problem. In those cases, you might want to implement a graceful termination strategy, and that’s what I’m trying to simulate here.
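That “specified timeout window” is the pod’s terminationGracePeriodSeconds. A minimal sketch of where it lives in the pod spec, using the placeholder image from before (30 seconds is the default if you don’t set it):

```yaml
spec:
  # On pod deletion, the kubelet sends SIGTERM to the container, waits up
  # to terminationGracePeriodSeconds, and then sends SIGKILL.
  terminationGracePeriodSeconds: 30   # the default
  containers:
    - name: worker
      image: my-registry/node-worker:v1   # placeholder image
```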
05:06 - Simulating a Graceful Termination Strategy
Let’s simulate that. I’m going to say that my timeout to terminate is one hour, just to make it super obvious. This means I’m giving it one hour to complete its task, and if it doesn’t finish, then I’m going to terminate it. Okay, so let’s see if this one is changing. Yeah, it’s changing. So now something different will happen, right? These ones are just finishing up. Now I have my graceful termination configured, and I’m going to release a new version. Let’s apply the deployment. It is almost done. If you see here, we have 39; we have almost twice the capacity. I think it’s not reaching 40 because I don’t have enough nodes to hit 40. Let’s say we have 20 running here, all running correctly, on the new version, version three. But because we now have a graceful termination, you can see another 20 that are stuck in terminating. For these, I built a custom script so they would capture the termination signal and then exit at random times. If we take a look at one and check the logs, it says it got the termination signal and is going to wait eight minutes, to simulate that it’s doing something before exiting.
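Putting that together, the pod spec for this simulation might look something like the sketch below. My actual script was Node.js; here a shell trap stands in for it, which is enough to show the behavior:

```yaml
spec:
  terminationGracePeriodSeconds: 3600   # one hour, to make it super obvious
  containers:
    - name: worker
      image: busybox   # placeholder; the real container runs a Node.js app
      command: ["sh", "-c"]
      args:
        - |
          # Stand-in for the custom script: trap the termination signal,
          # then wait a random number of minutes (1-9) before exiting,
          # as if finishing an in-flight task.
          on_term() {
            delay=$(awk 'BEGIN { srand(); print (int(rand() * 9) + 1) * 60 }')
            echo "got SIGTERM, exiting in $delay seconds"
            sleep "$delay"
            exit 0
          }
          trap on_term TERM
          while true; do sleep 1; done
```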
This is the problem I ran into. Basically, my workers were finishing up a task, but when we did a new deployment, because the termination grace period was so long, we ended up with twice the capacity in the cluster while all the old containers finished. That time was variable, because we could have some containers running a task that would take half an hour to finish and others that would take five minutes. But the problem remains that while the cluster was in this state, it was consuming twice the expected capacity. At this scale it’s not a problem, because this is just a simple cluster; I have three nodes here and my deployment is configured for 20 replicas, so at worst we briefly run 40 pods instead of 20. But in our production case, we had around 250 containers. When something like this happens in production, that means we have 250 containers running the new version and close to 250 containers running the old version, waiting to finish up their tasks. Because of that jump in capacity, auto-scaling triggered and we scaled up a bunch of nodes. So while the old version hadn’t finished, we were consuming twice the EC2 instances. Then, when the old pods were finally done with their tasks, we ended up with extra nodes because Kubernetes wouldn’t scale them all the way back down. What we wanted was to do this same process in a more controlled manner. That’s where the strategies come in.
10:14 - Configuring Termination Strategies
Let’s do version four, just to keep track, and apply it. Now we have our 20 containers starting up. Okay, so now, with the rolling update strategy configured, if I apply this I would expect it to add two more containers, wait for those to be ready, terminate two, and then repeat: keep adding two and deleting two until it’s done. Let’s take a look at that. And we’re done; we have 40. The same thing happened again: I have 20 terminating and 20 running. The only thing that changed is that it started them two at a time, because that’s how I configured it. But we still ended up with twice the capacity running in the cluster. I investigated this a little and found out that the Deployment object in Kubernetes doesn’t count terminating pods as part of the capacity. As soon as a pod enters the terminating state, it no longer counts toward the deployment, and because of that, it doesn’t count toward the maxSurge limit that I have here. I also found that there is no way to make it count those terminating pods, so I always ended up with the same problem: my cluster was running at twice the capacity for close to an hour. That’s what we wanted to avoid. That brought me to realize that a Deployment was not what I wanted to use here. So I’m going to delete this one and force-terminate everything to clean it up. What I wanted to do was not possible with a regular Deployment.
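For reference, the strategy block I’m talking about looks roughly like this sketch (maxSurge of 2 matches what I configured; maxUnavailable of 0 is my assumption, since the rollout only ever added pods before removing any):

```yaml
spec:
  replicas: 20
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # start at most two new pods above the desired count
      maxUnavailable: 0   # keep all 20 replicas available during the update
```

The catch, as described above, is that terminating pods fall outside these counts, so maxSurge does nothing to limit them.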
13:45 - Deploying a StatefulSet
Let’s do the same exercise and release a new version; again, you’ll see the difference. This time, if I order them by age, you can see it better. See how it’s only terminating one, and nothing else is happening: it’s not starting new ones and it’s not terminating more than one at a time. This will take a while. As I mentioned before, this container gets the signal and then decides to wait a random time to simulate work. If we check the logs, we see that this one is going to wait seven minutes. That means we would be waiting here seven minutes for this one to terminate; then it would replace it and move on to number 17, terminate that one, restart another 17, and continue one by one. That is almost what I wanted to implement, except that it’s too slow, because as I mentioned, in the real production scenario I have 250 workers. If each of those takes 10 minutes to exit, we still end up waiting a long time. This is almost the feature I want, but it’s too slow. We are going to have to end with bad news, because, remember, on the Deployment we had this configuration: for a Deployment object, we can configure a rolling update with a maxSurge and a maxUnavailable. For a StatefulSet, sadly, we cannot do that yet. This one-by-one rollout is the best and fastest we can get, and the only thing left to improve is the time the worker itself takes to exit.
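For reference, the StatefulSet version of the worker might look something like this sketch (same placeholder names and image as before):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker                # hypothetical name
spec:
  replicas: 20
  serviceName: worker         # governing (headless) service; required for StatefulSets
  selector:
    matchLabels:
      app: worker
  updateStrategy:
    type: RollingUpdate       # replaces one pod at a time, highest ordinal first
  template:
    metadata:
      labels:
        app: worker
    spec:
      terminationGracePeriodSeconds: 3600
      containers:
        - name: worker
          image: my-registry/node-worker:v5   # placeholder for the new version
```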
The feature will be available. I mean, it’s already available, just not on EKS, because it’s a Kubernetes 1.24 feature, and it’s still alpha in that version. EKS right now goes up to 1.23, so we still cannot use it, but 1.24 is supposed to be available on EKS by the end of the year. By then we could test this feature out, which allows us to configure a maxUnavailable on a StatefulSet’s rolling update. Instead of going one by one, terminating one at a time, we could configure it to go 10 at a time and do more controlled worker updates, but also faster than with a regular StatefulSet.
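Once it’s available, the configuration should look something like the sketch below: the alpha maxUnavailable field on the StatefulSet update strategy, which in 1.24 sits behind the MaxUnavailableStatefulSet feature gate:

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10   # replace up to 10 pods at a time instead of 1
```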
Carlos Rodríguez
DevOps Team Lead
nClouds
Carlos has been a Senior DevOps Engineer at nClouds since 2017 and works with customers to build modern, well-architected infrastructure on AWS. He has a long list of technical certifications, including AWS Certified DevOps Engineer - Professional, AWS Certified Solutions Architect - Professional, and AWS Certified SysOps Administrator - Associate.