Easier EC2 instance maintenance with managed draining for Amazon ECS capacity providers

Easier EC2 instance maintenance with managed draining for Amazon ECS capacity providers

Learn how Amazon ECS capacity providers can help you to gracefully maintain your fleet of Amazon EC2 instances without causing downtime for your application containers.

Published Jan 23, 2024
Last Modified Jan 26, 2024
Amazon Elastic Container Service (ECS) deploys and manages your containerized tasks on AWS infrastructure. Customers can avoid the need to maintain compute instances by using Amazon ECS to deploy tasks on serverless AWS Fargate capacity. But some customers prefer to use Amazon ECS with Amazon Elastic Compute Cloud (Amazon EC2) as capacity. Using Amazon EC2 instances as capacity to run your container gives you greater control over the underlying compute infrastructure, but this comes with the downside of increased maintenance overhead. Traditionally, it has been necessary for cluster operators to build custom tooling for Amazon EC2 instance maintenance in order to avoid unexpectedly disrupting the container workloads running on the Amazon EC2 instances. Amazon ECS has the built-in ability to drain tasks that are running on an Amazon EC2 instance and move these tasks to another instance, in order to allow the original instance to be replaced or terminated. However, utilizing this draining feature required customers to implement a custom solution that relied on Auto Scaling lifecycle hooks to set container instances to draining while all tasks were drained.
Amazon ECS now provides managed instance draining as a built-in feature of Amazon ECS capacity providers. This new feature enables Amazon ECS to safely and automatically drain tasks from Amazon EC2 instances that are part of an Amazon EC2 Auto Scaling Group associated with an Amazon ECS capacity provider. This simplification will allow many Amazon ECS customers to eliminate custom lifecycle hooks that they were previously using to drain Amazon EC2 instances. Customers can now perform infrastructure updates such as rolling out a new version of the ECS agent by seamlessly using Auto Scaling Group instance refresh, with Amazon ECS ensuring workloads are not interrupted.

Amazon EC2 Auto Scaling Groups are the primary mechanism for scaling out a fleet of Amazon EC2 instances. EC2 instances in an Auto Scaling Group are configured using a Launch Template, which defines what type of Amazon EC2 instance to launch, and what Amazon Machine Image (AMI) to base the Amazon EC2 instance on. For Amazon ECS you should launch Amazon EC2 instances that are based off of the ECS Optimized AMI. This special AMI comes with everything needed to connect the instance to an Amazon ECS cluster, and launch container workloads on the host. New versions of the Amazon ECS Optimized AMI are regularly released in order to provide new Amazon ECS features and patches to bugs or security vulnerabilities in the underlying host operating system. Auto Scaling Group instance refresh is one solution for updating the AMI that is powering the Amazon EC2 instances in your cluster. It helps you to relaunch Amazon EC2 instances in the Auto Scaling Group by providing a new Launch Template that references the latest version of the Amazon ECS optimized AMI.
There are a variety of ways that you can trigger an AMI update for your EC2 Auto Scaling Group. You may wish to automate instance refreshes by periodically triggering an instance refresh via Amazon EventBridge Scheduler and an AWS Lambda function. If you prefer to use AWS CloudFormation or AWS Cloud Development Kit, you can also use the UpdatePolicy setting to configure an infrastructure as code driven rolling replacement of your EC2 instances.
No matter how you choose to refresh your Amazon EC2 instances, it will cause Amazon EC2 instances that are part of an Auto Scaling Group to be replaced. If an Amazon EC2 instance is stopped and replaced while it is running tasks, then those tasks will be stopped as well. Amazon ECS will detect the loss of any task that is part of an Amazon ECS service. In order to maintain the service’s desired count of tasks, Amazon ECS will launch a replacement task onto a different Amazon EC2 instance in the cluster. But this is a reactive fail-safe that only happens after a container has already been stopped, and the service’s running task count has already dropped below it’s desired count of tasks.
Managed instance draining is proactive, rather than reactive. Whenever an Amazon EC2 instance is set to be terminated by the Auto Scaling Group, Amazon ECS will temporarily delay the instance termination, and automatically place the instance into draining mode. This draining mode prevents any more task launches on the instance, and causes any service launched tasks running on the instance to be proactively replaced onto new hosts. The task replacement attempts to launch new replacement tasks prior to stopping existing tasks that are currently running on the draining Amazon EC2 instance. You can read more about draining behavior under Container Instance Draining in the official Amazon ECS documentation.
New managed instance draining interacts with the following Amazon ECS features that are also designed to keep your workloads running and available.

When you configure an Amazon ECS capacity provider attached to an Auto Scaling Group, one of the available options in Amazon ECS is managed termination protection, which sets scale in protection of Amazon EC2 instances that are currently running Amazon ECS tasks. Once enabled, any Amazon EC2 instance that is running one or more tasks will be ineligible to be stopped while the Auto Scaling Group is being scaled in. Managed scale in protection does not protect against all forms of Amazon EC2 instance termination, but it does prevent many of the situations where an Amazon EC2 instance would need to be drained.
If you have a workload that is particularly sensitive to disruptions you may want to enable both managed scale in protection and managed draining. When both features are enabled you have maximum protection against interruptions of your production workload. Managed scale in protection blocks many types of disruptive Amazon EC2 instance termination, and managed draining ensures that when an Amazon EC2 instance does need to be terminated, it’s running workloads are handled gracefully.
However, if your workload is stateless, and your highest priority is reducing infrastructure costs, you may prefer to leave managed termination protection turned off and rely exclusively on managed draining. This approach avoids the scenario where sparsely utilized Amazon EC2 instances stay running for hours or even days because they are hosting a task from a long running service. With managed termination protection turned off, Amazon EC2 instances can safely drained so that the cluster can scale in to an appropriate size. You will save on infrastructure cost by disabling termination protection and allowing sparsely utilized instances to be drained.

When defining an Amazon ECS task, you can specify a stop timeout for the task. If not specified, this stop timeout defaults to 30 seconds. When Amazon ECS is draining an Amazon EC2 instance, it will send a SIGTERM stop signal to each task container that needs to be stopped, and then wait for the duration of the stop timeout to see if the container gracefully exits on its own. If the container does not gracefully exit by the time the stop timeout has passed, then Amazon ECS will send a SIGKILL signal to force stop the container’s process. If your task is doing some heavy work that you can’t complete quickly, then you can configure a longer stop timeout on the task. Managed instance draining will keep the Amazon EC2 instance in a draining status until the task gracefully exits on its own, or until the stop timeout period is exceeded and Amazon ECS force stops the task. However, while Amazon ECS stop timeout for Amazon EC2 tasks can be set to wait for years if you wish, the Amazon EC2 draining period can not exceed 48 hours. Therefore, it is not advisable to set task stop timeout to greater than 48 hours.

Amazon ECS gives running tasks the ability to mark themselves as protected. This feature tells Amazon ECS that the task is busy doing important work, and the task should not be stopped. If you are using managed termination protection, then an instance that is running such a protected task is already partially protected from being stopped as part of an Auto Scaling Group scale-in. However, if managed termination protection is disabled, then the Amazon EC2 Auto Scaling Group is not aware of whether the Amazon EC2 instance has protected tasks on it. Additionally, an Amazon EC2 instance refresh can be configured to replace Amazon EC2 instances even if they are protected from scale-in by Amazon ECS managed termination protection. The Auto Scaling Group can still choose to terminate an Amazon EC2 instance that is hosting protected tasks. This will place the Amazon EC2 instance into a draining state, and any running tasks on the instance will immediately be sent a SIGTERM signal to stop, irrespective of their task protection status. You can use a stop timeout on the task to delay the task force quit, and hold the Amazon EC2 instance in a draining state for up to 48 hours.
In general it is considered best practice to set a stop timeout for any task that will also be using the task protection feature. Additionally, any mission critical tasks that will be hosted on Amazon EC2 may wish to use the Amazon ECS metadata endpoint to discover the ID of the instance that is hosting the task, then use the Amazon EC2 ModifyInstanceAttribute API to set the disableApiStop or disableApiTermination attributes on the Amazon EC2 instance that hosts the task. This will provide an additional layer of protection against any automated Amazon EC2 instance disrupting actions.

Service launched tasks can be safely drained from an Amazon EC2 instance because they are part of a replica set, and the service can always launch other tasks on other Amazon EC2 instances. However, Amazon ECS does not drain standalone tasks that were launched with the RunTask API. Instead, Amazon ECS waits for these tasks to exit on their own. As long as these tasks are running they will keep the Amazon EC2 instance in a draining status. The Amazon EC2 instance will be allowed to remain in a draining state for up to 48 hours. After that point all tasks on the instance will be force stopped, whether they were RunTask launched, or CreateService launched.

The new Amazon ECS managed instance draining feature originated directly from the feature requests on our public container roadmap. The open source RFC received over 300 reactions on GitHub. We appreciate your continued interest and engagement with the future of Amazon ECS, and we are excited to continue delivering powerful features that make it easier to deploy containers at scale on AWS. If you have a feature request of your own then please submit an issue to our public roadmap, or upvote an existing feature request to express your interest in it.
To get started with managed instance draining, please visit the Amazon ECS documentation for step by step details on how to enable managed draining for your capacity provider and autoscaling group.