Run Kubernetes Clusters for Less with Amazon EC2 Spot and Karpenter
Learn how to run Kubernetes clusters for up to 90% off with Amazon Elastic Kubernetes Service (EKS), Amazon EC2 Spot Instances, and Karpenter - all in less than 60 minutes.
If you’re getting started with Spot in Amazon Elastic Kubernetes Service (EKS) or are struggling with the complexity of configuring multiple node groups, I recommend using Karpenter. However, if you’re already using Cluster Autoscaler (CA) and want to start spending less, you can find the detailed configuration to use Spot with CA here.

| Attribute | Details |
|---|---|
| ✅ AWS experience | Advanced - 300 |
| ⏱ Time to complete | 75 minutes |
| 💰 Cost to complete | < $10.00 USD |
| 🧩 Prerequisites | AWS Account, AWS CLI, Kubernetes CLI (kubectl), Terraform CLI, Helm |
| 📢 Feedback | Any feedback, issues, or just a 👍 / 👎 ? |
| 💾 Code | Download the code |
| 🛠 Contributors | @jakeskyaws |
| ⏰ Last Updated | 2024-04-26 |
- You need access to an AWS account with IAM permissions to create an EKS cluster, and an AWS Cloud9 environment if you're running the commands listed in this tutorial.
- Install and configure the AWS CLI
- Install the Kubernetes CLI (kubectl)
- Install the Terraform CLI
- Install Helm (the package manager for Kubernetes)
💡 Tip: You can skip this step if you already have a Cloud9 environment or if you’re planning to run all steps on your own computer. Just make sure you have the permissions listed in the prerequisites section of this tutorial.
💡 Tip: You can control which region the Cloud9 environment is launched in by setting the `AWS_REGION` environment variable.
💡 IMPORTANT: You need to use the same IAM user/role in both the AWS Console and the AWS CLI setup. Otherwise, you won't have permission to open the Cloud9 environment.
💡 NOTE: If the CloudFormation stack has not reached the "CREATE_COMPLETE" status, the CLI tools may not have been installed yet. Wait until the stack completes before running any CLI commands.
💡 Tip: The Terraform template used in this tutorial is using an On-Demand managed node group to host the Karpenter controller. However, if you have an existing cluster, you can use an existing node group with On-Demand instances to deploy the Karpenter controller. To do so, you need to follow the Karpenter getting started guide.
`aws-auth` configmap to allow nodes to connect, and creates an On-Demand managed node group for the `kube-system` and `karpenter` namespaces.

`kube.config` file to interact with the cluster through `kubectl`:

💡 Tip: If you’re using a different region or changed the name of the cluster, you can get the previous command for your setup from the Terraform output by running this command: `terraform output -raw configure_kubectl`.
`kube-system` and `karpenter` namespaces, and it’s the only one you’ll need. For the rest of the pods, Karpenter launches nodes through a `NodePool` custom resource (CRD). The NodePool sets constraints on the nodes that Karpenter can create and on the pods that can run on those nodes. A single Karpenter NodePool can handle many different pod shapes, so for this tutorial you’ll only create the `default` NodePool.

💡 Tip: Karpenter simplifies data plane capacity management using an approach called group-less auto scaling, because it no longer uses node groups (which map to Auto Scaling groups) to launch nodes. With node groups, clusters that run different types of applications, each needing different capacity types, end up over time with a complex configuration and operational model in which node groups must be defined and provided in advance.
`main.tf` file lives and run the following command:

💡 NOTE: If you're working with an existing EKS cluster, make sure to set the proper values for the previous environment variables, as we'll use those values to set up the Karpenter NodePool.
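On AWS, Karpenter splits node configuration in two. The NodePool holds scheduling constraints, while an EC2NodeClass holds AWS-specific settings such as the AMI family, the node IAM role, and which subnets and security groups to use. The following is a minimal sketch of such a node class. It assumes the `karpenter.k8s.aws/v1beta1` API, which Karpenter releases used around this tutorial's Last Updated date. It also assumes your subnets and security groups are tagged for discovery. The name `default`, the cluster name `my-cluster`, and the role `KarpenterNodeRole-my-cluster` are hypothetical placeholders; substitute the values behind the environment variables mentioned in the note above.

```yaml
# Minimal EC2NodeClass sketch (karpenter.k8s.aws/v1beta1 API, assumed).
# "my-cluster" and "KarpenterNodeRole-my-cluster" are hypothetical placeholders.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default                       # referenced from the NodePool's nodeClassRef
spec:
  amiFamily: AL2                      # EKS-optimized Amazon Linux 2 AMIs
  role: KarpenterNodeRole-my-cluster  # IAM role for the nodes Karpenter launches
  subnetSelectorTerms:                # launch nodes only in subnets carrying this tag
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:         # attach security groups carrying this tag
    - tags:
        karpenter.sh/discovery: my-cluster
```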
`NodePool` by running the following commands:

`NodePool` you just created:

- `requirements`: Here’s where you define the type of nodes Karpenter can launch. Be as flexible as possible and let Karpenter choose the right instance type based on the pod requirements. For this `NodePool`, you’re saying Karpenter can launch either Spot or On-Demand Instances from the `c`, `m`, and `r` instance families, with a minimum of 4 vCPUs and 8 GiB of memory. With this configuration, you’re choosing around 150 instance types from the 700+ available today in AWS. Read the next section to understand why this is important.
- `limits`: This is how you constrain the maximum amount of resources that the `NodePool` will manage. Karpenter can launch instances with different specs, so instead of limiting the maximum number of instances (as you’d typically do in an Auto Scaling group), you define a maximum amount of vCPUs or memory, which in turn limits how many nodes Karpenter can launch. Karpenter provides a metric to monitor the percentage usage of this `NodePool` based on the limits you configure.
- `disruption`: Karpenter does a great job of launching only the nodes you need, but as pods come and go, the cluster capacity can end up fragmented over time. To avoid fragmentation and optimize the compute nodes in your cluster, you can enable consolidation. When enabled, Karpenter automatically discovers disruptable nodes (for example, nodes whose pods could run on other nodes or on a cheaper instance) and spins up replacements when needed.
- `expireAfter`: Here’s where you define how long a node can live before Karpenter deletes it. This is useful to force nodes to be replaced with new ones running up-to-date AMIs. In this example, the value is set to 7 days.
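Put together, the four settings above map roughly onto a NodePool manifest like the following sketch. It assumes the `karpenter.sh/v1beta1` API, where `expireAfter` lives under `spec.disruption` and `consolidationPolicy: WhenUnderutilized` turns on consolidation. The `limits` values are hypothetical. The architecture requirement is an assumption, added so that Graviton (arm64) instances such as the `c7g.2xlarge` in the sample output are eligible.

```yaml
# NodePool sketch (karpenter.sh/v1beta1 API, assumed); limit values are hypothetical.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default                      # hypothetical EC2NodeClass name
      requirements:
        - key: karpenter.sh/capacity-type  # allow both Spot and On-Demand
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category  # c, m, and r families
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-cpu       # more than 3, i.e. at least 4 vCPUs
          operator: Gt
          values: ["3"]
        - key: karpenter.k8s.aws/instance-memory    # in MiB: at least 8 GiB
          operator: Gt
          values: ["8191"]
        - key: kubernetes.io/arch                   # assumption: x86-64 and Graviton
          operator: In
          values: ["amd64", "arm64"]
  limits:                                  # cap on total resources this NodePool provisions
    cpu: 1000                              # hypothetical
    memory: 1000Gi                         # hypothetical
  disruption:
    consolidationPolicy: WhenUnderutilized # remove or replace underutilized nodes
    expireAfter: 168h                      # 7 days
```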
`NodePool` here.

`NodePool`, we’re letting Karpenter choose from a diverse set of instance types so it can launch the best instance type possible. For On-Demand Instances, Karpenter uses the `lowest-price` allocation strategy to launch the cheapest instance type that has available capacity. Using multiple instance types also helps you avoid the `InsufficientInstanceCapacity` error.

`price-capacity-optimized` (PCO) allocation strategy. PCO looks at both price and capacity availability to launch from the Spot Instance pools that are the least likely to be interrupted and have the lowest possible price. For Spot Instances, diversification is key: Spot Instances are spare capacity that EC2 can reclaim when it needs it back. Karpenter lets you diversify extensively, so reclaimed Spot Instances are replaced automatically with instances from other pools where capacity is available.
`NodePool` can launch both On-Demand and Spot Instances, but Karpenter takes into account the constraints you configure within a pod to launch the right node(s). Let’s create a `Deployment` with a `nodeSelector` to run the pods on Spot Instances. To do so, run the following command:

`Pending`, making Karpenter react and launch the nodes, similar to this output:

- Noticed there were 10 pending pods, and decided it can fit all of them on a single node.
- Is accounting for the DaemonSet pods that run on every node, such as `kube-proxy` (2 additional pods), and is aggregating the resources needed for 12 pods. Moreover, Karpenter noticed that 100 instance types match these requirements.
- Launched a `c7g.2xlarge` Spot Instance in `eu-west-2a`, as this was the pool with the most spare capacity at the lowest price.
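The Spot-only Deployment described above can be sketched as follows. The name, labels, and resource requests are hypothetical. The replica count matches the 10 pending pods in the output above. The image is a lightweight placeholder, the pause image also used in Karpenter's getting started examples. The line that matters is the `nodeSelector` on Karpenter's well-known `karpenter.sh/capacity-type` label, which restricts the pods to Spot nodes.

```yaml
# Spot-only Deployment sketch; name, labels, and requests are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-spot
spec:
  replicas: 10
  selector:
    matchLabels:
      app: workload-spot
  template:
    metadata:
      labels:
        app: workload-spot
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # schedule only on Spot nodes launched by Karpenter
      containers:
        - name: pause
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7  # placeholder container
          resources:
            requests:                      # Karpenter sizes nodes from these requests
              cpu: 512m
              memory: 512Mi
```

For a workload that can't tolerate interruptions, the same pattern works with the label value `on-demand` instead of `spot`.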
`Deployment`.

`Deployment`:

💡 NOTE: To see pods being spread across AZs with similar instance sizes, wait until the pods and the existing EC2 instances launched by Karpenter are removed.

`nodeSelector` and the `containers` block from the `workload.yaml` file you downloaded before:

💡 Tip: You can download the full version of the deployment manifest, including the TSP, here.
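A zone-based topology spread constraint is one way to get the AZ spreading that the note above refers to, and Karpenter takes such constraints into account when deciding where to launch new nodes. This is a sketch only, not the contents of `workload.yaml`. Only the path down to the pod spec is shown, and the `app` label is hypothetical: it must match your pod template's labels.

```yaml
# Topology spread sketch; only the path to the pod spec is shown.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # pod counts per zone may differ by at most 1
          topologyKey: topology.kubernetes.io/zone  # one topology domain per Availability Zone
          whenUnsatisfiable: DoNotSchedule          # keep pods Pending rather than break the spread
          labelSelector:
            matchLabels:
              app: workload-split                   # hypothetical label, must match the pods
```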
`Deployment` again. If you downloaded the manifest from GitHub, you can simply run:

`NodePool` starts a new node as soon as it sees the Spot interruption warning. EC2 sends this warning two minutes before it reclaims the instance. Given Karpenter’s average node startup time, there is generally enough time for the new node to become ready and for the pods to move to it before the old node is reclaimed.

`STATUS`, and 2) for the Karpenter logs. In one terminal, watch the nodes using this command:

`eks-node-viewer`, a tool for visualizing dynamic node usage within a cluster. It was originally developed as an internal tool at AWS for demonstrating consolidation with Karpenter. It displays the scheduled pod resource requests versus the allocatable capacity on each node.

💡 Tip: You might end up seeing only one or two Spot nodes running. If you review the Karpenter logs, you’ll see that this is the result of the consolidation process.
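Karpenter can only react to Spot interruption warnings if it receives them. On AWS, that means an Amazon SQS queue fed by EventBridge rules for EC2 interruption events. If you're wiring Karpenter into an existing cluster yourself, the Helm chart values below sketch how the cluster and queue are passed in. The `settings` keys assume Karpenter v0.32 or later. `my-cluster` is a hypothetical placeholder for both the cluster name and the queue name.

```yaml
# Karpenter Helm chart values sketch (settings keys as in Karpenter v0.32+, assumed).
# "my-cluster" is a hypothetical placeholder.
settings:
  clusterName: my-cluster        # EKS cluster that Karpenter manages
  interruptionQueue: my-cluster  # SQS queue receiving EC2 Spot interruption warnings
```

You'd pass a file like this to `helm install` or `helm upgrade` with the `-f` flag. The queue and the EventBridge rules that feed it must already exist.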
`NodePool` you created before, but make sure you configure the workload properly. One way to do it is to use the same approach as for the Spot-friendly workload: a `nodeSelector`, this time with `karpenter.sh/capacity-type: on-demand`.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.