How to Build and Manage a Resilient Service Using Health Checks, Decoupled Dependencies, and Load Balancing using AWS SDKs

Are you a developer who wants to build and manage a resilient service? Have you worked through tutorials like Health Checks and Dependencies and want to accomplish the same thing using code to build and manage your infrastructure? This article and code example show you how to use AWS SDKs to set up a resilient architecture that includes AWS services like Amazon EC2 Auto Scaling and Elastic Load Balancing (ELB). You’ll learn how to write code to monitor your service, manage health checks, and make real-time repairs to make your service more resilient. And you’ll do it all without ever opening the AWS Management Console.

This example is available now in the AWS SDK Code Example Library.

Just want to see the code?

The code for this example is part of a collection of code examples that show you how to use AWS software development kits (SDKs) with your development language of choice to work with AWS services and tools. You can download all of the examples, including this one, from the AWS Code Examples GitHub repository.

To see the code and follow along with this example, clone the repository and choose your preferred language from the list below. Each language link takes you to a README that includes setup instructions and prerequisites for that specific version.

Python

What are we building?

A web service that returns recommendations of books, movies, and songs. This example takes a phased approach so you can see how the web service evolves as the system becomes more sophisticated in how it responds to various kinds of failures. All resources and components are deployed and managed with AWS SDK code, giving you insight on how to create your own system that can be built in a repeatable manner.

The example is an interactive scenario that you run at a command prompt. It starts by deploying all the resources you need, then moves to a demo phase where failures are simulated and resilient solutions are implemented. By the end of the demo phase, the service has become more resilient and presents a more consistent and positive user experience even when failures occur. Finally, all resources are deleted.

Building a resilient web service involves a number of interconnected parts. The main components use by this this demo are:

Amazon EC2 Auto Scaling is used to create Amazon Elastic Compute Cloud (Amazon EC2) instances based on a launch template. The Auto Scaling group ensures that the number of instances is kept in a specified range.
Elastic Load Balancing handles HTTP requests, monitors the health of instances in the Auto Scaling group, and distributes requests to healthy instances.
A Python web server runs on each instance to handle HTTP requests. It responds with recommendations and health checks and takes different actions depending on a set of AWS Systems Manager parameters that simulate failures and demonstrate improved resiliency.
An Amazon DynamoDB table simulates a recommendation service that the web server depends on to get recommendations.

Explore the interactive scenario

The interactive scenario has three main phases: deploy resources, demonstrate resiliency, and destroy resources.

Note: This example uses your default VPC and its default security group, which must allow inbound HTTP traffic on port 80 from your computer's IP address. If you prefer, you can create a custom VPC and modify the example code to use your custom VPC instead. Find out more in the Amazon Virtual Private Cloud User Guide.

Deploy resources

The first part of the example sets up basic web servers by deploying the first set of resources:

The DynamoDB table that is used as a recommendation service. The table is populated with a few initial values.
An AWS Identity and Access Management (IAM) policy, role, and instance profile that grants permission to each Amazon EC2 instance so that it can access the DynamoDB recommendations table and Systems Manager parameters.
An Amazon EC2 launch template that specifies how instances are started. The launch template includes a startup Bash script that installs Python packages and starts a Python web server.
An Auto Scaling group that is configured to ensure that you have three running instances in three Availability Zones.

After this deployment phase, you have three instances, each acting as a web server. Each instance listens for HTTP requests on port 80 and responds with recommendations from the DynamoDB table.

The next phase of the example sets up a load-balanced endpoint by deploying the following resources:

An ELB target group that is attached to the Auto Scaling group. The target group forwards HTTP requests to instances in the Auto Scaling group on port 80, and is configured to verify the health of instances. To speed up this demo, the health check is configured with shortened times and lower thresholds. In production, you might want to decrease the sensitivity of your health checks to avoid unwanted failures.
An Application Load Balancer that provides a single endpoint for your users, and a listener that the load balancer uses to distribute requests to the underlying instances.

After this part of the deployment, you have a single endpoint that receives HTTP requests and distributes them to the underlying web servers to get recommendations and health checks.

Demonstrate resiliency

This part of the examples toggles different parts of the system by setting Systems Manager parameters that are used by the Python web server to take different actions depending on the parameter values. This creates situations where the web service fails, and shows how using a resilient architecture can keep the web service running in spite of these failures and improve your customers' experience.

After each update, the demo gives you a chance to send GET requests to the endpoint or to check the health of the instances. Each GET request responds with a recommendation from the DynamoDB table and also includes the instance ID and its Availability Zone so that you can see how the load balancer distributes requests.

The selection screen looks like this:

Initial state

At the beginning, the recommendation service successfully responds and all instances are healthy.

Broken dependency

The next phase simulates a broken dependency by setting the table name parameter to a non-existent table name. When the web server tries to get a recommendation, it fails because the table doesn't exist.

However, all instances report as healthy because they use shallow health checks, which means that they simply report success under all conditions.

Static response

The next phase sets a parameter that instructs the web server to return a static response when it cannot get a recommendation from the recommendation service. This technique lets you decouple your web server response from the failing dependency and return a successful response to your users instead of reporting a failure. The static response is to always suggest the 404 Not Found coloring book.

Bad credentials

The next phase replaces the credentials on a single instance with credentials that don't allow access to the recommendation service. Now, repeated requests sometimes get a good response and sometimes get the static response, depending on which instance is selected by the load balancer.

For example, the instance on us-west-2a gives real recommendations:

While the instance on us-west-2b gives a static response:

Deep health checks

The next phase sets a parameter that instructs the web server to use a deep health check. This means that the web server returns an error code when it can't connect to the recommendations service. Remember, it takes a minute or two for the load balancer to detect an unhealthy instance because of the threshold configuration, so if you check health right away, the instance might report as healthy.

Note that the deep health check is only for ELB routing and not for Auto Scaling instance health. This kind of deep health check is not recommended for Auto Scaling instance health, because it risks accidental termination of all instances in the Auto Scaling group when a dependent service fails. For a detailed explanation of health checks and their tradeoffs, including use of the heartbeat table pattern to automatically detect and replace failing instances, see Choosing the right health check with Elastic Load Balancing and EC2 Auto Scaling.

The instance with bad credentials reports as unhealthy:

The load balancer takes unhealthy instances out of its rotation, so now all requests to the endpoint result in good recommendations.

Replace the failing instance

This next phase uses an SDK action to terminate the unhealthy instance, at which point Auto Scaling automatically start a new instance. While the old instance is shutting down and the new instance is starting, GET requests to the endpoint continue to return recommendations because the load balancer dispatches requests only to the healthy instances.

While the instances are transitioning, you will see various results from the health check, for example:

After the new instance starts, it reports as healthy and is again included in the load balancer's rotation.

Fail open

This last phase of the example again sets the table name parameter to a non-existent table to simulate a failure of the recommendation service. This causes all instances to report as unhealthy.

When all instances in a target group are unhealthy, the load balancer continues to forward requests to them, allowing for a fail open behavior. In this case, because the web server returns a static response, users get a static response instead of a failure code.

Destroy resources

After you're done exploring the resiliency features of the example, you can keep or destroy all the resources that it created. Typically, it's a good practice to destroy the resources, to avoid unwanted charges on your account.

If you answer 'yes', the example deletes all resources and terminates all instances:

Resiliency and you

Congratulations, you've made it to the end! By following this example, you've learned how to use AWS SDKs to deploy all the resources to build a resilient web service.

You used a load balancer to let your users target a single endpoint that automatically distributed traffic to web servers running in your target group.
You used an Auto Scaling group so you could remove unhealthy instances and automatically keep the number of instances within a specified range.
You decoupled your web server from its dependencies and returned a successful static response even when the underlying service failed.
You implemented deep health checks to report unhealthy instances to the load balancer so that it dispatched requests only to instances that responded successfully.
You used a load balancer to let the system fail open when something unexpected went wrong. Your users got a successful static response, buying you time to investigate the root cause and get the system running again.

You can find more details about this example, and the complete code, in the AWS Code Examples GitHub repository.

You can explore more AWS code examples in the Code Library.

Now go build your own!

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.

Select your cookie preferences

Site Terms, Privacy, and more.