Implementing multi-Region failover with standby takes over primary

This post is written by Khubyar Behramsha, Sr. AWS Solutions Architect and Marcos Ortiz, Principal AWS Solutions Architect.

In a disaster recovery (DR) scenario, a simple and effective failover approach is crucial to enacting your disaster recovery strategy and resuming normal operations quickly. This blog post showcases a sample implementation of the "standby takes over primary" (STOP) pattern for multi-Region applications. This pattern controls the failover process from a healthy standby Region. This approach enables failover initiation without depending on resources in the primary Region or on Amazon Route 53 control plane operations for changing DNS records, which keeps the process reliable and streamlined.

Our sample application, available on GitHub, implements the STOP pattern and highlights how companies can trigger a failover within a private context, using Amazon Route 53 health checks with Amazon CloudWatch alarms. This STOP implementation is just one of many failover mechanisms you can use as a simple, resilient way to shift application traffic from one location to another. More information on these other patterns can be found in this blog about creating disaster recovery mechanisms using Route 53. In this post, you will see the process of deploying the sample application, configuring the necessary health checks and alarms, and testing the failover and failback process.

Advantages of this implementation

Flexible manual or automated failover mechanism

In this sample, a human decision maker can initiate the failover by updating the CloudWatch metric, or you can embed that step into an existing disaster recovery playbook. A fully automated failover and failback procedure that relies solely on automatic health checks could cause potentially harmful flapping, where traffic repeatedly shifts back and forth between Regions. On the other hand, if you have an established, manually triggered disaster recovery playbook, it is easy to add this metric update as an additional step in that automation.

Failover process does not have a dependency on the primary Region

A resilience best practice associated with disaster recovery is to have a failover process that can be triggered or activated without a dependency on your primary location. A location could be an Availability Zone or a Region. In this implementation, we’re showcasing a multi-Region application. The failover action is triggered by updating the CloudWatch metric in the secondary Region.

Affordability and ease of implementation

While less feature-rich than alternative solutions, this implementation is extremely cost effective and can be easier to set up than other failover solutions such as Amazon Application Recovery Controller. The major cost components for this solution are a single Route 53 health check and a single CloudWatch metric and alarm, resulting in a low cost to maintain.

Overview

Our sample workload consists of an Amazon Route 53 record with the routing policy as failover pointing to an Amazon API Gateway public endpoint that triggers an AWS Lambda function. Our API returns the service name and Region it is currently running in so we can easily test and validate the pattern.
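To make the API behavior concrete, here is a minimal sketch of a Lambda handler that returns a service name and the Region it runs in. It is not the code from the sample repository: the service name `stop-sample-api` and the response field names are assumptions chosen for illustration. The `AWS_REGION` environment variable is set by the Lambda runtime.

```python
import json
import os

# Hypothetical service name, not taken from the sample repository.
SERVICE_NAME = "stop-sample-api"


def lambda_handler(event, context):
    """Return the service name and the Region this function is running in."""
    # AWS_REGION is provided by the Lambda runtime; the fallback only
    # matters when the handler is run outside of Lambda.
    region = os.environ.get("AWS_REGION", "unknown")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"service": SERVICE_NAME, "region": region}),
    }
```

Deploying the same handler to both Regions means the response body alone tells you which Region served a request, which is what makes the failover easy to observe during testing.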

Sample application architecture

We start with a Route 53 health check that monitors the state of a CloudWatch alarm in our secondary Region. When the alarm state is “OK”, Route 53 will continue to route traffic to the primary Region. When the alarm enters the “ALARM” state, either by an engineer manually updating the state of the CloudWatch metric or by an automated disaster recovery process, the Route 53 health check will register as “unhealthy” and route traffic to the secondary Region.
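The wiring between the alarm and the health check can be sketched with boto3 request parameters. This is an illustration under stated assumptions, not the sample's actual template: the alarm name, namespace, threshold, period, and missing-data settings are hypothetical, and only the metric name `FailoverToSecondary` comes from this post. The parameter names are those of the CloudWatch `put_metric_alarm` and Route 53 `create_health_check` APIs.

```python
import uuid

SECONDARY_REGION = "us-west-2"           # secondary Region used in this post
ALARM_NAME = "FailoverToSecondaryAlarm"  # hypothetical alarm name
NAMESPACE = "STOP/Failover"              # hypothetical metric namespace
METRIC_NAME = "FailoverToSecondary"      # metric name used in this post


def alarm_params():
    """Parameters for an alarm that enters ALARM when the metric reaches 1."""
    return {
        "AlarmName": ALARM_NAME,
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Statistic": "Maximum",
        "Period": 60,                        # assumed evaluation period in seconds
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # assumed: no data does not trigger failover
    }


def health_check_params(caller_reference=None):
    """Parameters for a Route 53 health check that follows the alarm state."""
    return {
        "CallerReference": caller_reference or str(uuid.uuid4()),
        "HealthCheckConfig": {
            "Type": "CLOUDWATCH_METRIC",
            "AlarmIdentifier": {"Region": SECONDARY_REGION, "Name": ALARM_NAME},
            "InsufficientDataHealthStatus": "LastKnownStatus",  # assumed setting
            "Inverted": False,
        },
    }


def create_resources():
    """Create the alarm and health check (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=SECONDARY_REGION)
    cloudwatch.put_metric_alarm(**alarm_params())
    route53 = boto3.client("route53")
    return route53.create_health_check(**health_check_params())
```

Creating the health check is a one-time setup step. At failover time, only the CloudWatch metric in the secondary Region changes, so no Route 53 control plane call is needed.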

The table below outlines how Route 53 determines the status of a health check that monitors a CloudWatch alarm.

Configuration of the Route 53 health check

Some additional notes on this setup:

Route 53 considers a new health check to be healthy until there's enough data to determine the actual status, healthy or unhealthy. If you choose the option to invert the health check status, Route 53 considers a new health check to be unhealthy until there's enough data. If you invert the health check, Route 53 treats a healthy endpoint as unhealthy and vice versa.
For new health checks that have no last known status, the default status for the health check is healthy. More details.
If you omit the health check for the secondary record, and if the health check endpoint for the primary record is unhealthy, Route 53 always responds to DNS queries by using the secondary record. More details.
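The rules in these notes can be summarized as a small model. This is our simplified reading of the behavior described above, not Route 53's implementation. In particular, applying inversion as the last step is an assumption, and the function and parameter names are hypothetical.

```python
from typing import Optional


def health_check_is_healthy(
    alarm_state: str,
    insufficient_data_status: str = "LastKnownStatus",
    last_known_healthy: Optional[bool] = None,
    inverted: bool = False,
) -> bool:
    """Model how a health check that monitors a CloudWatch alarm resolves."""
    if alarm_state == "OK":
        healthy = True
    elif alarm_state == "ALARM":
        healthy = False
    elif alarm_state == "INSUFFICIENT_DATA":
        if insufficient_data_status == "Healthy":
            healthy = True
        elif insufficient_data_status == "Unhealthy":
            healthy = False
        elif insufficient_data_status == "LastKnownStatus":
            # With no last known status, the default is healthy.
            healthy = True if last_known_healthy is None else last_known_healthy
        else:
            raise ValueError(f"unknown setting: {insufficient_data_status}")
    else:
        raise ValueError(f"unknown alarm state: {alarm_state}")
    # Assumption: inversion flips the final result.
    return not healthy if inverted else healthy


def record_to_serve(primary_healthy: bool) -> str:
    """Failover routing when the secondary record has no health check."""
    return "PRIMARY" if primary_healthy else "SECONDARY"
```

For example, `record_to_serve(health_check_is_healthy("ALARM"))` evaluates to `"SECONDARY"`, which is the failover path this post exercises.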

Deploying the sample application

Pre-requisites

A public domain (example.com) registered with Amazon Route 53. Follow the instructions here on how to register a domain and the instructions here to configure Amazon Route 53 as your DNS service.
An AWS Certificate Manager certificate (*.example.com) for your domain name in both the primary and secondary Regions where you plan to deploy the sample APIs.

Deploy the application to the secondary Region

Here you will clone the repo, change to the right directory, then use the AWS Serverless Application Model (SAM) to deploy the application into the secondary Region.

Follow the detailed instructions here to deploy the secondary API Gateway stack.

Deploy the application to the primary Region

Here you will use the AWS Serverless Application Model (SAM) to deploy the application into the primary Region.

Follow the detailed instructions here to deploy the primary API Gateway stack.

Testing

Remember, if you're not testing your solution, then you're just hoping it works! To avoid that, in this section we validate our failover configuration and make sure everything is working as intended. To do so, we’ll execute four scripts: (1) a test script to simulate load on our API, (2) a second test script to show the status of our CloudWatch metric alarm, (3) our failover to secondary script, and (4) our failback to primary script. We’ll walk through an overview of the steps below, but you can see the detailed instructions here for testing the STOP pattern.

Deploying the heartbeat script

First, we’ll start by executing a test heartbeat script designed to repeatedly send traffic to our API every 5 seconds and output some information about the service. We expect that the API response will show our service deployed in the primary Region, in this case, “us-east-1”.

Output of the heartbeat script (testing.sh) showing the Region as us-east-1

Heartbeat script showing Region switch from primary to secondary
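For readers who want to see the idea behind the heartbeat without opening the repository, the following is a rough Python equivalent. It is not the repository's script. The URL `https://api.example.com/` and the `region` field in the JSON response are assumptions. The `fetch` and `sleep` parameters exist so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

# Hypothetical endpoint under the example.com domain used in this post.
API_URL = "https://api.example.com/"


def fetch_region(url=API_URL, timeout=5.0):
    """Call the API once and return the Region reported in the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # Assumes the response body carries a "region" field.
    return payload.get("region", "unknown")


def heartbeat(fetch=fetch_region, interval_seconds=5.0, max_beats=None, sleep=time.sleep):
    """Poll the API every few seconds, print each result, and return them."""
    seen = []
    beat = 0
    while max_beats is None or beat < max_beats:
        try:
            region = fetch()
        except (OSError, ValueError) as error:
            # Network errors are OSError subclasses; bad JSON raises ValueError.
            region = f"error: {error}"
        print(f"{time.strftime('%H:%M:%S')} region={region}")
        seen.append(region)
        beat += 1
        if max_beats is None or beat < max_beats:
            sleep(interval_seconds)
    return seen
```

Calling `heartbeat()` with the defaults polls until interrupted. Passing `max_beats=3` stops after three requests.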

Monitoring the CloudWatch alarm metric

Open a new terminal window and run the alarm monitoring script using the command below. After it runs, you should see that the alarm status is “OK”.

Image showing alarm status as OK
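The alarm check amounts to reading the alarm's `StateValue` from the CloudWatch `describe_alarms` API in the secondary Region. The sketch below shows that lookup with boto3. It is an illustration, not the repository's script, and the alarm name you pass in is whatever your deployment uses.

```python
def alarm_state_from_response(response, alarm_name):
    """Extract the state of one alarm from a describe_alarms response."""
    for alarm in response.get("MetricAlarms", []):
        if alarm.get("AlarmName") == alarm_name:
            return alarm.get("StateValue")  # "OK", "ALARM", or "INSUFFICIENT_DATA"
    return None  # alarm not found


def get_alarm_state(alarm_name, region="us-west-2"):
    """Query CloudWatch in the secondary Region (requires boto3 and credentials)."""
    import boto3  # imported lazily so the parser above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    return alarm_state_from_response(response, alarm_name)
```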

Execute the failover

Next, we’ll execute our failover script. It’s recommended to do this from a new terminal window so that we can continue to observe the API responses being returned by our test script. Use the command below to execute the script:

Our failover script updates the value of the “FailoverToSecondary” metric to “1.0”. This sends our CloudWatch alarm into the “ALARM” state, making our Route 53 health check “unhealthy”. Once the health check is “unhealthy”, the Route 53 failover routing policy routes traffic to our secondary Region. You should now see the output in the terminal window look like the following:

Alarm status switching from OK to ALARM

Now flip back to the terminal window where the test script is executing. After a short wait, you should see the API response showing your secondary Region, in our case “us-west-2”.

Results of the test.sh script showing the Region change to “us-west-2”
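The failover step in this section comes down to publishing one data point to the “FailoverToSecondary” metric in the secondary Region. Below is a minimal boto3 sketch of that call, not the repository's script. The namespace `STOP/Failover` is hypothetical, and using `0.0` as the failback value is our assumption, because this post only states the failover value of `1.0`.

```python
def failover_metric_params(value, namespace="STOP/Failover", metric_name="FailoverToSecondary"):
    """Build put_metric_data parameters for the failover metric."""
    if value not in (0.0, 1.0):
        raise ValueError("value must be 1.0 (fail over) or 0.0 (assumed failback)")
    return {
        "Namespace": namespace,  # hypothetical namespace
        "MetricData": [{"MetricName": metric_name, "Value": float(value)}],
    }


def set_failover(active, region="us-west-2"):
    """Publish the metric in the secondary Region (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    # The client targets the secondary Region only, so this call does not
    # depend on any resource in the primary Region.
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_data(**failover_metric_params(1.0 if active else 0.0))
```

Under these assumptions, `set_failover(True)` would start the failover and `set_failover(False)` would start the failback.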

Execute the failback

In a new terminal window, execute the failback script by running the command below:

As before, you’ll first see the CloudWatch alarm state change from “ALARM” to “OK”.

Alarm status switching from ALARM to OK

Then the heartbeat script should show the Region switch back to our primary Region, us-east-1.

Heartbeat script showing Region switch from secondary to primary Region, us-east-1

As an additional check, you can visit the Route 53 health check in the AWS Console. Once you’ve selected the appropriate health check, you should see a chart similar to the one below showing our health check status going from 1.0 to 0.0 when unhealthy then back to 1.0 when healthy:

Route 53 health check status line chart
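The same chart data can also be read programmatically. Route 53 publishes health check metrics to CloudWatch in the `AWS/Route53` namespace in us-east-1, with `HealthCheckStatus` reported as 1 for healthy and 0 for unhealthy. The sketch below builds a `get_metric_statistics` query for it. The health check ID is a placeholder you would supply, and the one-hour window and 60-second period are arbitrary choices.

```python
from datetime import datetime, timedelta, timezone


def health_check_status_query(health_check_id, minutes=60, now=None):
    """Build get_metric_statistics parameters for a health check's status."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Route53",
        "MetricName": "HealthCheckStatus",
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,
        "Statistics": ["Minimum"],  # 0 in a period means it was unhealthy at some point
    }


def fetch_health_check_status(health_check_id):
    """Return (timestamp, status) pairs (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    # Route 53 health check metrics are available in us-east-1.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    response = cloudwatch.get_metric_statistics(**health_check_status_query(health_check_id))
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Minimum"]) for p in points]
```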

Cleanup

After you are finished, follow the cleanup instructions on GitHub.

Conclusion

This solution can help you implement a cost-effective, performant, and resilient pattern for failing over your workloads privately through the use of Route 53 health checks and CloudWatch metric alarms. After following along with this blog, you should be better equipped to set up this slightly tricky configuration.

While this sample showcased a multi-Region application, the same approach could be used for a workload running in multiple Availability Zones in a single Region.

Check out this blog for a detailed approach on implementing multi-Region failover for API Gateway.

For more resilience learning, visit AWS Architecture Blog – Resilience or Resilient Cloud Architectures on our AWS Community Blog.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
