Amazon EKS Upgrade Guidance (v1.25 to 1.26)

Co-authored by Containers Specialists at Amazon Web Services (AWS)

Abhishek Nanda, Containers Specialist Solutions Architect, AWS India
Arun Nalpet Ramakrishna, Sr. Specialist Technical Account Manager, AWS India
Gowtham S, Sr. Specialist Technical Account Manager, AWS India
Gladwin Neo, Associate Containers Specialist Solutions Architect, AWS Singapore
Glendon Thaiw, AWS Startups Solutions Architect, AWS Singapore

Overview

As one of the largest and most popular open-source projects for building cloud-native applications, the Kubernetes project continually integrates new features, design requests, and bug fixes through version upgrades. A new minor version is released roughly every four months (three releases per year).

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes platform that enables customers to deploy, manage, and scale Kubernetes clusters on the AWS Cloud. Because EKS is based on open-source Kubernetes, AWS regularly updates EKS to support the latest Kubernetes versions, while continuing to support older versions for a defined period.

At AWS, Containers Specialists work closely with customers every day to help them migrate and upgrade large-scale EKS deployments. In this blog post, we have consolidated a list of best practices, strategies, and gotchas gathered from the field to help you kickstart your EKS upgrade journey.

What you can expect from this post

This blog post distils best practices, techniques, and advanced considerations for performing EKS version upgrades. While it focuses primarily on the EKS v1.25 to v1.26 upgrade, the tips and strategies shared here can be applied to other version upgrades.

Use this article as a companion guide and a starting point for further research as you drive towards a successful EKS cluster upgrade.

Let’s get to it!

Dependencies & Considerations

1. Update VPC CNI to v1.12 or later

If your EKS clusters run a VPC CNI version older than v1.12, you must upgrade the VPC CNI plugin to v1.12 or later before upgrading the cluster to v1.26. Older VPC CNI versions will crash if left as-is. We also recommend keeping the VPC CNI add-on on its latest version.

At the time of writing, the latest version of VPC CNI available for EKS v1.26 is v1.18.0-eksbuild.1.
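As a quick pre-check, you can compare the installed VPC CNI version string against the v1.12 minimum. The following is a minimal Python sketch that assumes you have already retrieved the version string, for example from the `addon.addonVersion` field returned by `aws eks describe-addon --cluster-name <cluster> --addon-name vpc-cni` for the managed add-on, or from the image tag of the `aws-node` DaemonSet if you self-manage the plugin. The helper names are hypothetical.

```python
import re

# Sketch: helper names are hypothetical. Minimum VPC CNI version before moving to Kubernetes 1.26.
MIN_VPC_CNI = (1, 12, 0)


def parse_addon_version(version: str) -> tuple[int, ...]:
    """Parse strings such as 'v1.18.0-eksbuild.1' or '1.11.4' into (major, minor, patch)."""
    match = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def vpc_cni_ready_for_126(installed: str) -> bool:
    """True if the installed VPC CNI version is v1.12.0 or later (build suffix ignored)."""
    return parse_addon_version(installed) >= MIN_VPC_CNI
```

Only major, minor, and patch are compared, because the `-eksbuild.N` suffix does not affect the v1.12 requirement.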

2. P2 instances are not supported out of the box with Amazon EKS optimized AMIs (1.25 and later)

Starting with Kubernetes version 1.25, Amazon EC2 P2 instances cannot be used with the Amazon EKS optimized accelerated Amazon Linux AMIs out of the box. The AMIs for Kubernetes versions 1.25 or later will support NVIDIA 525 series or later drivers, which are incompatible with the P2 instances.

Migrate any P2 instances to P3, P4, or P5 instances before upgrading your Amazon EKS clusters to version 1.25 or later. You should also proactively update your applications to work with NVIDIA 525 series or later drivers.
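To find nodes that still need to move off P2, you can filter nodes by the well-known `node.kubernetes.io/instance-type` label. This minimal sketch assumes the input is the `items` list from `kubectl get nodes -o json`; `find_p2_nodes` is a hypothetical helper name.

```python
# Sketch: hypothetical helper; input assumed to be the `items` list of `kubectl get nodes -o json`.
INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"


def find_p2_nodes(nodes: list[dict]) -> list[str]:
    """Return the names of nodes whose EC2 instance type is in the P2 family (e.g. p2.xlarge)."""
    p2_nodes = []
    for node in nodes:
        labels = node.get("metadata", {}).get("labels", {})
        instance_type = labels.get(INSTANCE_TYPE_LABEL, "")
        if instance_type.startswith("p2."):
            p2_nodes.append(node["metadata"]["name"])
    return p2_nodes
```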

3. Kubernetes 1.26 no longer supports CRI v1alpha2

Kubernetes 1.26 no longer supports CRI v1alpha2, a version of the interface between Kubernetes and container runtimes. As a result, the kubelet cannot register a node whose container runtime does not support the newer CRI v1 interface.

Specifically, Kubernetes 1.26 does not support containerd versions older than 1.6.0, because those versions implement only CRI v1alpha2. containerd, which is now the only supported container runtime in EKS, must therefore be at v1.6.0 or later. The EKS optimized AMIs and Bottlerocket AMIs already include containerd v1.6.6.

If you build your own base AMI, ensure that its container runtime supports CRI v1 and that containerd is v1.6.0 or later.
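Each node reports its runtime in `status.nodeInfo.containerRuntimeVersion` (for example `containerd://1.6.6`), the same value `kubectl get nodes -o wide` shows in the CONTAINER-RUNTIME column. This minimal sketch flags nodes that are not running containerd 1.6.0 or later; it assumes the `items` list from `kubectl get nodes -o json` as input, and the helper names are hypothetical.

```python
import re

# Sketch: hypothetical helpers; input assumed to be the `items` list of `kubectl get nodes -o json`.


def containerd_supports_cri_v1(runtime_version: str) -> bool:
    """True if e.g. 'containerd://1.6.6' is containerd 1.6.0 or later."""
    match = re.match(r"^containerd://v?(\d+)\.(\d+)\.(\d+)", runtime_version)
    if match is None:
        # Not containerd, or an unrecognised format: treat the node as not ready.
        return False
    return tuple(int(part) for part in match.groups()) >= (1, 6, 0)


def nodes_blocking_126(nodes: list[dict]) -> list[str]:
    """Names of nodes whose runtime would stop the kubelet registering on Kubernetes 1.26."""
    blocking = []
    for node in nodes:
        runtime = node["status"]["nodeInfo"]["containerRuntimeVersion"]
        if not containerd_supports_cri_v1(runtime):
            blocking.append(node["metadata"]["name"])
    return blocking
```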

Add-on Version Requirements

Add-on	Recommended version for EKS cluster v1.26
VPC CNI	`v1.18.0-eksbuild.1`
CoreDNS	`v1.9.3-eksbuild.11`
kube-proxy	`v1.26.13-minimal-eksbuild.2`
AWS Load Balancer Controller	`v2.7.2`

Refer to the Amazon EKS documentation for each add-on's available versions, prerequisites, and update instructions.

Apart from the core add-ons above, you might also be running other add-ons such as the EBS CSI driver, Cluster Autoscaler, Karpenter, and Prometheus. Validate each of these individually for version compatibility.
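To see which EKS managed add-ons differ from the versions in the table, you could use the boto3 EKS client's `list_addons` and `describe_addon` calls. This is a minimal sketch: the client is passed in so that it can be exercised with a stub, and only the three EKS managed add-ons from the table are checked, because the AWS Load Balancer Controller is usually installed with Helm rather than as an EKS add-on. It reports any version that is not an exact match, including newer ones. `addon_drift_report` and `RECOMMENDED_126` are hypothetical names.

```python
# Sketch: hypothetical helpers; `eks` is assumed to be a boto3 EKS client (or a stub with the same methods).
RECOMMENDED_126 = {
    "vpc-cni": "v1.18.0-eksbuild.1",
    "coredns": "v1.9.3-eksbuild.11",
    "kube-proxy": "v1.26.13-minimal-eksbuild.2",
}


def list_all_addons(eks, cluster: str) -> list[str]:
    """Follow nextToken pagination of ListAddons and return every add-on name."""
    names, token = [], None
    while True:
        kwargs = {"clusterName": cluster}
        if token:
            kwargs["nextToken"] = token
        page = eks.list_addons(**kwargs)
        names.extend(page.get("addons", []))
        token = page.get("nextToken")
        if not token:
            return names


def addon_drift_report(eks, cluster: str) -> dict[str, tuple[str, str]]:
    """Map add-on name -> (installed, recommended) for add-ons not on the recommended version."""
    report = {}
    for name in list_all_addons(eks, cluster):
        if name not in RECOMMENDED_126:
            continue  # not covered by the table
        installed = eks.describe_addon(clusterName=cluster, addonName=name)["addon"]["addonVersion"]
        if installed != RECOMMENDED_126[name]:
            report[name] = (installed, RECOMMENDED_126[name])
    return report
```

For example, `addon_drift_report(boto3.client("eks"), "my-cluster")`, where `my-cluster` is a placeholder.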

General Cluster Upgrade Recommendations

This section contains general best practices for cluster upgrades regardless of upgrade technique or Kubernetes versions:

- Be aware of allocated AWS resources, especially the number of available IP addresses in each subnet, because IP address consumption increases during the upgrade process.
- Before updating the control plane, make sure the subnets you specified when creating the cluster have at least five free IP addresses.
- While updating node groups, scale the Cluster Autoscaler deployment down to zero replicas to avoid conflicting scaling actions.
- Configure Pod Disruption Budgets (PDBs) for your application workloads. For stateful workloads, also make sure zone-aware scheduling constraints are set, so that pods are scheduled onto nodes in the same Availability Zone (AZ) as their PersistentVolumes.
- Read the Kubernetes release notes and understand the Kubernetes API changes before upgrading clusters, because your manifest files may need changes. AWS publishes the important changes for each Kubernetes version in the EKS documentation.
- Start by upgrading non-production clusters: test the upgrade process in development clusters, then progressively upgrade your QA and staging environments. This allows issues to be detected before they reach production.
- While testing in non-production environments, identify any impacts on current workloads and controllers. You can automate this by building a continuous integration workflow that tests the compatibility of your applications, controllers, and custom integrations before moving to a new Kubernetes version.
- Turn on control plane logging and review the logs for any errors.
- Use eksup, an open-source, EKS-specific CLI designed both to report on your cluster's upgrade readiness and to provide a custom upgrade playbook.
- Use Amazon EKS Upgrade Insights to surface issues that may affect your ability to upgrade a cluster to a newer Kubernetes version. You can use the EKS API and console at any time to check for upgrade readiness issues detected in your environment against all future Kubernetes versions supported by EKS.
- Alternatively, use tools such as kubent or pluto to scan for deprecated or removed APIs in your cluster. The tools use different methods to detect deprecated APIs; review the kubent documentation for details on how it looks for them.
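Two of these checks are easy to script. The minimal sketch below assumes you have already fetched the cluster's subnets (for example with the boto3 EC2 `describe_subnets` call, using the subnet IDs from the `resourcesVpcConfig.subnetIds` field of `describe_cluster`) and parsed your manifest files into dictionaries (for example with PyYAML's `safe_load_all`). The removed-API list covers only the APIs removed in Kubernetes 1.26 (`flowcontrol.apiserver.k8s.io/v1beta1` and `autoscaling/v2beta2`) and inspects only `apiVersion` strings in files, so treat it as a complement to kubent, pluto, or Upgrade Insights, not a replacement. The helper names are hypothetical.

```python
# Sketch: hypothetical helpers; inputs are assumed to be already fetched/parsed.

# Free IPs EKS may need in the cluster subnets during a control plane update.
MIN_FREE_IPS = 5

# (apiVersion, kind) pairs no longer served as of Kubernetes 1.26.
REMOVED_IN_126 = {
    ("flowcontrol.apiserver.k8s.io/v1beta1", "FlowSchema"),
    ("flowcontrol.apiserver.k8s.io/v1beta1", "PriorityLevelConfiguration"),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"),
}


def subnets_short_on_ips(subnets: list[dict]) -> list[str]:
    """`subnets` is the 'Subnets' list returned by EC2 DescribeSubnets."""
    return [s["SubnetId"] for s in subnets if s["AvailableIpAddressCount"] < MIN_FREE_IPS]


def manifests_using_removed_apis(manifests: list) -> list[str]:
    """Return 'Kind/name (apiVersion)' for each manifest that uses an API removed in 1.26."""
    hits = []
    for manifest in manifests:
        if not manifest:
            continue  # empty YAML documents load as None
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        if key in REMOVED_IN_126:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            hits.append(f"{manifest['kind']}/{name} ({manifest['apiVersion']})")
    return hits
```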

Strategies for Upgrade

Making cluster updates easier with EKS add-ons

Under the Shared Responsibility Model, customers are responsible for keeping cluster add-ons up to date. To streamline add-on management, we recommend using EKS add-ons.

Amazon EKS add-ons can be used with any EKS cluster on version 1.18 or later. The cluster can include self-managed and custom AMI nodes, Amazon EKS managed node groups, and Fargate. Amazon EKS supports managing the installation and version of CoreDNS, kube-proxy, the Amazon VPC CNI, the Amazon EBS CSI driver, and AWS Distro for OpenTelemetry (ADOT) as managed add-ons. Amazon EKS does not update add-ons automatically: after you update your cluster to a new Kubernetes minor version, you must initiate the add-on updates yourself, per the options mentioned here.
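If you prefer the add-on version that EKS marks as default for the new Kubernetes minor version, one way is to read it from the boto3 EKS `describe_addon_versions` response, whose `addonVersions[].compatibilities[]` entries carry `clusterVersion` and `defaultVersion`, and pass it to `update_addon`. This is a minimal sketch with the client injected. The default version is not necessarily the latest one, and `resolveConflicts="PRESERVE"` is an assumption you should adjust to how you manage add-on configuration. The helper names are hypothetical.

```python
# Sketch: hypothetical helpers; `eks` is assumed to be a boto3 EKS client (or a stub).


def default_addon_version(response: dict, cluster_version: str):
    """Return the addonVersion EKS marks as default for cluster_version, or None.

    `response` is the dict returned by the EKS DescribeAddonVersions API."""
    for addon in response.get("addons", []):
        for candidate in addon.get("addonVersions", []):
            for compat in candidate.get("compatibilities", []):
                if compat.get("clusterVersion") == cluster_version and compat.get("defaultVersion"):
                    return candidate["addonVersion"]
    return None


def update_addon_to_default(eks, cluster: str, addon_name: str, cluster_version: str) -> str:
    """Look up the default version for the new Kubernetes version and start the add-on update."""
    response = eks.describe_addon_versions(kubernetesVersion=cluster_version, addonName=addon_name)
    version = default_addon_version(response, cluster_version)
    if version is None:
        raise LookupError(f"no default {addon_name} version listed for Kubernetes {cluster_version}")
    # PRESERVE keeps field values you changed instead of overwriting them with EKS defaults.
    eks.update_addon(clusterName=cluster, addonName=addon_name,
                     addonVersion=version, resolveConflicts="PRESERVE")
    return version
```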

Cluster Upgrade Techniques and Best Practices

Blue Green cluster upgrade

The blue/green cluster upgrade technique creates a completely new cluster on the newer Kubernetes version, then deploys all the resources from the old cluster into the newly created cluster. Traffic is then diverted from the old cluster to the new one.

This technique is most practical when the EKS cluster is bootstrapped with infrastructure-as-code (IaC) automation, e.g., Terraform, CDK, or eksctl. If GitOps has been implemented, replicating all the configuration in the new cluster can be as simple as bootstrapping the GitOps tool on the new cluster and applying the manifests.

Otherwise, backup and restore tools such as Velero or Portworx can be used: create the new cluster, restore the manifest backup into it, and test the new cluster once the restore completes. To divert traffic from the old cluster to the new one, we recommend a Route 53 weighted routing policy. Shifting traffic in small percentage batches lets the new cluster scale up gradually without performance degradation.
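A weighted shift can be expressed as a schedule of weight pairs plus one Route 53 change batch per step. This minimal sketch builds the `ChangeBatch` payload for the `ChangeResourceRecordSets` API (for example boto3's `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=...)`). It assumes CNAME records pointing at each cluster's load balancer DNS name, although many setups use alias records instead. Record names, set identifiers, TTL, and step size are hypothetical.

```python
# Sketch: hypothetical helpers; assumes CNAME weighted records, one per cluster.


def traffic_shift_schedule(step_percent: int = 10) -> list[tuple[int, int]]:
    """(blue_weight, green_weight) pairs moving traffic to the new cluster in equal steps."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in (0, 100]")
    schedule, green = [], 0
    while green < 100:
        green = min(green + step_percent, 100)
        schedule.append((100 - green, green))
    return schedule


def weighted_change_batch(record_name: str, targets: dict[str, tuple[str, int]], ttl: int = 60) -> dict:
    """ChangeBatch that UPSERTs one weighted record per SetIdentifier.

    `targets` maps a SetIdentifier (e.g. 'blue') to (dns_value, weight)."""
    changes = []
    for set_id, (value, weight) in targets.items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": value}],
            },
        })
    return {"Changes": changes}
```

Applying one step at a time, and watching the new cluster's scaling and error rates before the next, keeps each shift small enough to reverse by re-applying the previous weights.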

Pros:
- Skip multiple versions
- Easy rollback (can run current cluster configuration and next version at the same time)
- Unlimited time for testing
Cons:
- Difficult to implement when running stateful workloads
- Traffic flow is determined outside the Kubernetes construct (by CDN or DNS)
- Uses twice the resources
  - IPs
  - Control planes
  - Storage
  - Workers
  - Load balancers
- Migrating subgroups of nodes (node groups) can cause unnecessary bin packing
- Expensive
- No external DNS

In-place with Worker Node Rolling Update

An in-place upgrade first upgrades the cluster control plane and then performs a rolling upgrade of the worker nodes to the new version, the same way EKS managed node groups perform upgrades. During the upgrade, managed node groups drain each worker node properly, so there is no downtime as long as node evictions do not disrupt your workloads.

Upgrade process workflow:
- Upgrade Control Plane - no downtime
- Upgrade Add-Ons (coredns, aws-node, kube-proxy, cluster-autoscaler, etc.) - in small and mid-sized clusters there is no downtime. In larger clusters with high network traffic, expect some dropped packets.
- Upgrade Worker Nodes - if deployments are not configured to handle worker node evictions, expect some downtime. We recommend configuring Pod Disruption Budgets as described in the earlier section. When you initiate a managed node group update, Amazon EKS automatically updates your nodes, completing the steps listed in Managed node update behavior.
Pros:
- Greater control over AMI upgrade and rollback of AMI if needed.
- Control plane is upgraded in place, not replaced.
- Infrastructure cost savings
Cons:
- Control plane upgrade is a one-way process and cannot be rolled back. If you encounter failures or issues, you need to contact AWS Support.
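The in-place workflow can be scripted in the same order: control plane, then EKS managed add-ons, then managed node groups, waiting for each update to finish before starting the next. This minimal sketch assumes a boto3 EKS client (`update_cluster_version`, `update_addon`, `update_nodegroup_version`, `describe_update`) passed in as a parameter. Self-managed components such as Cluster Autoscaler are out of scope, and there is no rollback for the control plane step. Function names, the 30-second poll interval, and `resolveConflicts="PRESERVE"` are illustrative assumptions.

```python
import time

# Sketch: hypothetical helpers; `eks` is assumed to be a boto3 EKS client (or a stub).


def wait_for_update(eks, cluster, update_id, poll_seconds=30, sleep=time.sleep, **scope):
    """Poll DescribeUpdate until the update leaves 'InProgress'; raise unless it succeeded.

    `scope` may carry addonName=... or nodegroupName=... for add-on and node group updates."""
    while True:
        status = eks.describe_update(name=cluster, updateId=update_id, **scope)["update"]["status"]
        if status == "Successful":
            return
        if status in ("Failed", "Cancelled"):
            raise RuntimeError(f"update {update_id} ended with status {status}")
        sleep(poll_seconds)


def in_place_upgrade(eks, cluster, target_version, addon_versions, nodegroups, sleep=time.sleep):
    """Upgrade control plane, then EKS add-ons, then managed node groups, one step at a time."""
    # 1. Control plane: one minor version per update, and it cannot be rolled back.
    update = eks.update_cluster_version(name=cluster, version=target_version)["update"]
    wait_for_update(eks, cluster, update["id"], sleep=sleep)

    # 2. EKS managed add-ons, in the order given.
    for addon_name, addon_version in addon_versions.items():
        update = eks.update_addon(clusterName=cluster, addonName=addon_name,
                                  addonVersion=addon_version, resolveConflicts="PRESERVE")["update"]
        wait_for_update(eks, cluster, update["id"], sleep=sleep, addonName=addon_name)

    # 3. Managed node groups: EKS drains and replaces the nodes of each group.
    for nodegroup in nodegroups:
        update = eks.update_nodegroup_version(clusterName=cluster, nodegroupName=nodegroup,
                                              version=target_version)["update"]
        wait_for_update(eks, cluster, update["id"], sleep=sleep, nodegroupName=nodegroup)
```

In practice you might call `in_place_upgrade(boto3.client("eks"), "my-cluster", "1.26", {"vpc-cni": "v1.18.0-eksbuild.1"}, ["ng-general"])`, where the cluster and node group names are placeholders. Stopping on the first failed update keeps a later step from running against a half-upgraded cluster.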

In-Place with Blue/Green Worker Nodes

This technique upgrades the control plane and then performs blue/green-style deployments for the EKS node groups. It creates a new set of Auto Scaling groups using the new worker node version while keeping the old worker nodes joined to the cluster. Workloads are drained from the old worker nodes onto the new ones, after which the old worker nodes are terminated. This is the upgrade process eksctl uses when upgrading clusters.

Upgrade process workflow:
- Upgrade Control Plane - no downtime
- Upgrade Add-Ons (coredns, aws-node, kube-proxy, cluster-autoscaler, etc.) - in small and mid-sized clusters there is no downtime. In larger clusters with high network traffic, expect some dropped packets.
- Add an Auto Scaling group with the new worker node version to the EKS cluster - no downtime
- Drain the old worker nodes and wait until pods have fully migrated to the new set of worker nodes - if deployments are not configured to handle worker node evictions, expect some downtime. Further reference here.
- Once all pods have been migrated, it is safe to delete old Auto Scaling Groups.
- You can create a new node group, gracefully migrate your existing applications to the new group, and remove the old node group from your cluster. Reference here.
Pros:
- Allows new DaemonSets to come online and be tested independently of workloads
- Workloads can be tested one at a time via canary deployments
- Deployment rolling updates can be used for zero packet loss
- Rollback is an option
Cons:
- Automation is needed to roll out the different node groups
- Setups with multiple ASGs and the Cluster Autoscaler expander set to random can cause issues
- Workloads may bin-pack onto the same new nodes instead of spreading out according to topology constraints
- Provisioning new worker nodes while keeping the existing ones costs more
- Large clusters with hundreds or thousands of nodes might not be realistic
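The drain step is where most of the downtime risk sits. As a rough illustration of what draining does, the sketch below builds the cordon patch (the `spec.unschedulable` field that `kubectl cordon` sets) and selects the pods to evict, skipping DaemonSet-managed and mirror (static) pods the way `kubectl drain --ignore-daemonsets` does. It is an approximation under stated assumptions: the input is the `items` list from `kubectl get pods -A -o json --field-selector spec.nodeName=<node>`, and unmanaged pods, which `kubectl drain` only evicts with `--force`, are not treated specially. Evict the selected pods through the Eviction API (for example `CoreV1Api.create_namespaced_pod_eviction` in the Kubernetes Python client) so that Pod Disruption Budgets are respected. The helper names are hypothetical.

```python
# Sketch: hypothetical helpers; input assumed to be the pod `items` scheduled on one old node.
MIRROR_POD_ANNOTATION = "kubernetes.io/config.mirror"


def cordon_patch() -> dict:
    """Patch body that marks a node unschedulable, the same field `kubectl cordon` sets."""
    return {"spec": {"unschedulable": True}}


def pods_to_evict(pods: list[dict]) -> list[tuple[str, str]]:
    """Return (namespace, name) of pods to evict from a node being drained.

    Skips mirror (static) pods and DaemonSet-managed pods, as
    `kubectl drain --ignore-daemonsets` does."""
    evict = []
    for pod in pods:
        meta = pod["metadata"]
        if MIRROR_POD_ANNOTATION in meta.get("annotations", {}):
            continue  # static pods are managed by the kubelet, not the API server
        owners = meta.get("ownerReferences", [])
        if any(owner.get("kind") == "DaemonSet" for owner in owners):
            continue  # DaemonSet pods tolerate cordoning and stay until the node is removed
        evict.append((meta["namespace"], meta["name"]))
    return evict
```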

Karpenter Node upgrades - Automate Kubernetes node updates

Use Drift or node expiry to update instances continually
- If you use one of the EKS optimized AMIs, Drift and node expiry can continuously patch your data plane, reducing operational complexity.
  - When managing many node groups or nodes, you need a way to patch at scale. If you use one of the EKS optimized AMIs, Karpenter automatically uses the latest AMI version by querying AWS Systems Manager (SSM) Parameter Store.
  - Drift handles changes to the NodePool or EC2NodeClass. Values set in the NodePool/EC2NodeClass are reflected on the nodes Karpenter launches; when a node no longer matches them, Karpenter marks it as drifted and replaces it in a rolling fashion. Karpenter uses this mechanism to upgrade Kubernetes nodes. Starting with Karpenter v0.33.x, the Drift feature gate is enabled by default, so node upgrades respect Drift.
  - Alternatively, you can replace nodes on a schedule by setting a time-to-live value (spec.disruption.expireAfter) on the Karpenter NodePool. Nodes are terminated once that period elapses and are replaced with newer nodes.
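A minimal sketch of the two Karpenter resources involved, assuming the `karpenter.sh/v1beta1` and `karpenter.k8s.aws/v1beta1` APIs used by Karpenter v0.33.x, the AL2 AMI family, and the `karpenter.sh/discovery` tag convention from the Karpenter getting-started guide. Leaving out `amiSelectorTerms` lets Karpenter resolve the latest EKS optimized AMI, so a new AMI after a control plane upgrade surfaces as Drift, while `expireAfter` sets the node time-to-live. Resource names, the node role, the architecture requirement, and the `720h` value are placeholders; check the field names against the Karpenter version you run.

```python
import json

# Sketch: assumes Karpenter v0.33.x (v1beta1 APIs); names, role, and TTL are placeholders.


def karpenter_manifests(cluster: str, node_role: str, expire_after: str = "720h") -> list[dict]:
    """Build an EC2NodeClass and a NodePool as plain dicts."""
    node_class = {
        "apiVersion": "karpenter.k8s.aws/v1beta1",
        "kind": "EC2NodeClass",
        "metadata": {"name": "default"},
        "spec": {
            # No amiSelectorTerms: Karpenter resolves the latest EKS optimized AL2 AMI,
            # so an AMI change after a control plane upgrade is detected as Drift.
            "amiFamily": "AL2",
            "role": node_role,
            "subnetSelectorTerms": [{"tags": {"karpenter.sh/discovery": cluster}}],
            "securityGroupSelectorTerms": [{"tags": {"karpenter.sh/discovery": cluster}}],
        },
    }
    node_pool = {
        "apiVersion": "karpenter.sh/v1beta1",
        "kind": "NodePool",
        "metadata": {"name": "default"},
        "spec": {
            "template": {"spec": {
                "nodeClassRef": {"apiVersion": "karpenter.k8s.aws/v1beta1",
                                 "kind": "EC2NodeClass", "name": "default"},
                "requirements": [{"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]}],
            }},
            "disruption": {
                "consolidationPolicy": "WhenUnderutilized",
                # Replace every node once it reaches this age, even if it has not drifted.
                "expireAfter": expire_after,
            },
        },
    }
    return [node_class, node_pool]


def to_kubectl_list(manifests: list[dict]) -> str:
    """Wrap manifests in a v1 List so the JSON can be piped to `kubectl apply -f -`."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2)
```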

Fargate pods update

Make sure the kubelet on each Fargate node runs the same Kubernetes minor version as the current EKS control plane. You can generally achieve this by restarting the Fargate pods, because each restarted pod is scheduled onto a new Fargate node.

Easing your worker node upgrades using Managed Node Group

We recommend using EKS managed node groups, as they come with powerful management features such as automated worker node upgrades. They also enable additional Cluster Autoscaler features, such as automatic EC2 Auto Scaling group discovery and graceful node termination. You can still use managed node groups if you need a custom AMI; you can enable this capability by following the guide here.

For more best practices on EKS Cluster Upgrades, refer to the official AWS EKS documentation here.

Looking Ahead... Preparing for your future EKS Cluster Upgrades

EKS cluster upgrades are an ongoing process. To make your future upgrades smoother, consider evaluating the following deprecations.

Kubernetes v1.27
1. The alpha seccomp annotations (seccomp.security.alpha.kubernetes.io/pod and container.seccomp.security.alpha.kubernetes.io/<container-name>) were deprecated in v1.19 and are planned to be removed in v1.27.
2. If any of your existing pod manifests use these annotations, migrate them to the securityContext.seccompProfile field in the pod spec.
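A minimal sketch of that migration for a pod-shaped dictionary (a Pod, or the `spec.template` of a Deployment), assuming the annotation values map as described in the Kubernetes seccomp documentation: `runtime/default` (and the older `docker/default`) to `RuntimeDefault`, `unconfined` to `Unconfined`, and `localhost/<path>` to `Localhost` with `localhostProfile: <path>`. The function names are hypothetical.

```python
# Sketch: hypothetical helpers; operates on a pod-shaped dict with `metadata` and `spec`.
POD_ANNOTATION = "seccomp.security.alpha.kubernetes.io/pod"
CONTAINER_PREFIX = "container.seccomp.security.alpha.kubernetes.io/"


def to_seccomp_profile(value: str) -> dict:
    """Translate an alpha annotation value into a securityContext.seccompProfile."""
    if value in ("runtime/default", "docker/default"):
        return {"type": "RuntimeDefault"}
    if value == "unconfined":
        return {"type": "Unconfined"}
    if value.startswith("localhost/"):
        return {"type": "Localhost", "localhostProfile": value[len("localhost/"):]}
    raise ValueError(f"unsupported seccomp annotation value: {value!r}")


def migrate_seccomp(pod: dict) -> dict:
    """Move alpha seccomp annotations into seccompProfile fields (mutates and returns pod)."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    spec = pod["spec"]
    pod_value = annotations.pop(POD_ANNOTATION, None)
    if pod_value is not None:
        spec.setdefault("securityContext", {})["seccompProfile"] = to_seccomp_profile(pod_value)
    for container in spec.get("initContainers", []) + spec.get("containers", []):
        value = annotations.pop(CONTAINER_PREFIX + container["name"], None)
        if value is not None:
            container.setdefault("securityContext", {})["seccompProfile"] = to_seccomp_profile(value)
    return pod
```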

About the Authors

Abhishek Nanda, Containers Specialist Solutions Architect, AWS India

Abhishek is a Containers Specialist Solutions Architect at AWS based out of Bengaluru with over 7 years of IT experience. He is passionate about designing and architecting secure, resilient, and cost-effective containerized and serverless infrastructure and applications.

Arun Nalpet Ramakrishna, Sr. Specialist Technical Account Manager, AWS India

Arun is a Senior Container Specialist TAM with the Enterprise Support organization at AWS. EKS is one of Arun’s key focus areas, and he spends the majority of his time working with Enterprise Support customers to drive operational excellence and implement best practices. Arun is based out of Bengaluru and enjoys playing badminton every single day.

Gowtham S, Sr. Specialist Technical Account Manager, AWS India

Gowtham is a Senior Container Specialist TAM at AWS, based out of Bengaluru. Gowtham works with AWS Enterprise Support customers helping them to optimize Kubernetes workloads through operations reviews and deep dive sessions. He is passionate about Kubernetes and open-source technologies.

Gladwin Neo, Associate Containers Specialist Solutions Architect, AWS Singapore

Gladwin Neo is a Containers Solutions Architect at AWS. He is a tech enthusiast with a passion for containers. He now focuses on helping customers from a wide range of industries modernize their workloads using container technologies on AWS, including Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS).

Glendon Thaiw, AWS Startups Solutions Architect, AWS Singapore

Glendon is a Startup Solutions Architect at AWS, where he works with Scale-ups across Asia Pacific to grow on the AWS Platform. Over the years, Glendon has had the opportunity to work in varying complex environments, designing, architecting, and implementing highly scalable and secure solutions for diverse organizations.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
