Cost Conscious Disaster Recovery
Measuring the RTO & RPO of a Pilot-Light disaster recovery strategy for a 3-tier architecture.
description: "Measuring the RTO & RPO of a Pilot-Light disaster recovery strategy for a 3-tier architecture."
tags:
- disaster-recovery
- pilot-light
- resilience
waves:
- resilience
spaces:
- cost-optimization
- resilience
authorGithubAlias: sjeversaws
authorName: Steven Evers
additionalAuthors:
- authorGithubAlias: aws-wylatowska
  authorName: Samantha Wylatowska
date: 2023-09-05
- The database tier employs an Aurora PostgreSQL database in a Multi-AZ configuration to ensure high availability, and uses AWS Secrets Manager to handle the storage, replication, and rotation of database credentials.
- The application tier incorporates an Application Load Balancer (ALB) connected to a horizontally scalable API running on Amazon Elastic Container Service (ECS). The API is deployed to an ALB target group spanning multiple Availability Zones, which ensures high availability at the application tier.
- The web tier comprises a React-based static website stored in an Amazon S3 bucket and delivered through a CloudFront distribution.
- The RDS PostgreSQL database.
- The AWS Secrets Manager database credentials.
- The Amazon Elastic Container Registry repository that contains our API image.
- Credentials are kept separate from the codebase.
- Credentials can be rotated manually or automatically without affecting the application.
- Credentials are always encrypted.
- Only the service(s) with permission to access and decrypt the credentials can do so, satisfying a least-privilege security posture.
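
As a minimal sketch of how an application might read these credentials at runtime, the function below takes a Secrets Manager client (for example `boto3.client("secretsmanager")`) and parses the secret's JSON payload. The secret identifier and the `username`/`password` key names are assumptions for illustration; the layout of our actual secret is not shown here.

```python
import json


def load_db_credentials(secrets_client, secret_id):
    """Fetch and parse database credentials stored as a JSON secret.

    `secrets_client` is expected to behave like
    boto3.client("secretsmanager"). `secret_id` is a hypothetical
    secret name or ARN.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    # Key names are an assumption; adjust them to match your secret.
    return secret["username"], secret["password"]
```

Because the application looks the secret up on demand rather than embedding it, a rotated credential is picked up the next time the secret is read.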
- Our API application, running in ECS, has a /health endpoint that includes a database query. If the query fails, the endpoint returns an error.
- Our ALB’s Target Group includes a health check that queries the /health endpoint.
- We have a Route 53 health check that targets the /health endpoint on the primary ALB.
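
The health check chain above depends on the `/health` endpoint failing whenever the database is unreachable. A rough sketch of that logic, independent of any web framework, might look like the following. The `run_query` callable is hypothetical: it stands in for whatever trivial query (such as `SELECT 1`) the API issues, and the 503 status code is an assumption about how the error is reported.

```python
import json


def health_response(run_query):
    """Return (status_code, body) for a /health request.

    `run_query` is a hypothetical callable that executes a trivial
    database query and raises an exception on failure.
    """
    try:
        run_query()
    except Exception as exc:  # any database error marks the API unhealthy
        return 503, json.dumps({"status": "unhealthy", "error": str(exc)})
    return 200, json.dumps({"status": "healthy"})
```

With this shape, a blocked database port surfaces as a non-2xx response, which both the ALB target group health check and the Route 53 health check can detect.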
- Create a cluster.
- Create a control panel.
- Create 2 routing controls, one for each region.
- Open each routing control and create a health check.
- In your Route 53 hosted zone, associate your DNS Failover Primary record with the primary region health check.
- In your Route 53 hosted zone, associate your DNS Failover Secondary record with the secondary region health check.
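
The routing control and health check steps above can also be scripted. The sketch below assumes the cluster and control panel already exist and have finished deploying, and that the two clients behave like `boto3.client("route53-recovery-control-config")` and `boto3.client("route53")`. The function name and its parameters are illustrative and not part of our original setup.

```python
import uuid


def create_routing_control_with_health_check(arc_config, route53, cluster_arn,
                                             control_panel_arn, name):
    """Create one routing control and a Route 53 health check that tracks it.

    Assumes the cluster and control panel are already deployed. A new
    routing control can take a moment to deploy, so production code
    would wait for that before creating the health check.
    """
    control = arc_config.create_routing_control(
        ClusterArn=cluster_arn,
        ControlPanelArn=control_panel_arn,
        RoutingControlName=name,
    )
    control_arn = control["RoutingControl"]["RoutingControlArn"]

    health_check = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),  # must be unique per request
        HealthCheckConfig={
            "Type": "RECOVERY_CONTROL",
            "RoutingControlArn": control_arn,
        },
    )
    return control_arn, health_check["HealthCheck"]["Id"]
```

Calling this once per region yields the two health check IDs that are then associated with the Primary and Secondary failover records in the hosted zone.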
- Navigate to RDS in the console at https://console.aws.amazon.com/rds/
- Select Databases
- Select your secondary database
- Select Actions > Promote
- On the Promote Read Replica page, enter the backup retention period and the backup window for the newly promoted DB instance
- When the settings are as you want them, choose Continue
- On the acknowledgment page, choose Promote Read Replica
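
For teams that prefer to script this step, the console procedure above corresponds roughly to a single RDS API call. This is a sketch only: it assumes an instance-level read replica and a client that behaves like `boto3.client("rds")` in the secondary region. An Aurora cluster would use a cluster-level operation instead. The default retention value and the instance identifier are placeholders.

```python
def promote_replica(rds, db_instance_id, backup_retention_days=7,
                    backup_window=None):
    """Promote a read replica to a standalone, writable DB instance.

    `db_instance_id` is a hypothetical identifier for the secondary
    database. `backup_window` uses the "hh24:mi-hh24:mi" UTC format.
    """
    params = {
        "DBInstanceIdentifier": db_instance_id,
        "BackupRetentionPeriod": backup_retention_days,
    }
    if backup_window:
        params["PreferredBackupWindow"] = backup_window
    return rds.promote_read_replica(**params)
```

Promotion is not instantaneous, so the time it takes counts toward the measured RTO.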
- Navigate to ECS in the console at https://console.aws.amazon.com/ecs/
- Select Clusters
- Select your cluster
- Select the checkbox for your service
- Click Update
- Set Desired Tasks to the number you need to handle traffic (the number that was running in the primary region)
- Click Update
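
The same scale-up can be sketched in code, assuming a client that behaves like `boto3.client("ecs")` in the secondary region. The cluster name, service name, and task count are placeholders for your own values.

```python
def scale_service(ecs, cluster, service, desired_count):
    """Set the desired task count for an ECS service.

    Returns the desired count reported back by the API so the caller
    can confirm the update was accepted.
    """
    response = ecs.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,
    )
    return response["service"]["desiredCount"]
```

In a pilot-light setup the service in the secondary region typically runs few or no tasks until this call is made, which is where most of the cost saving comes from.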
- Navigate to Route 53 ARC in the console at https://console.aws.amazon.com/route53recovery/
- Under Multi-Region > Routing control, select Clusters
- Select your cluster
- Select your control panel
- In Routing controls, select the topmost checkbox that selects all of your routing controls
- Click Change routing control states
- In the Change traffic routing window that appears, toggle the routing controls so that the routing control associated with the health check for your secondary region is the only one that is On
- Type Confirm and click Change traffic routing
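
The routing control change above can be sketched as a single batch update. This assumes a client that behaves like `boto3.client("route53-recovery-cluster")`, which has to be created against one of the cluster's regional endpoints. The ARNs are placeholders and the function names are illustrative.

```python
def build_failover_entries(primary_arn, secondary_arn):
    """Build the state changes that leave only the secondary control On."""
    return [
        {"RoutingControlArn": primary_arn, "RoutingControlState": "Off"},
        {"RoutingControlArn": secondary_arn, "RoutingControlState": "On"},
    ]


def fail_over(arc_cluster, primary_arn, secondary_arn):
    """Apply both routing control changes in one request."""
    entries = build_failover_entries(primary_arn, secondary_arn)
    return arc_cluster.update_routing_control_states(
        UpdateRoutingControlStateEntries=entries
    )
```

Updating both controls in one request avoids an intermediate state in which neither or both regions are marked healthy.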
- 10:54 AM: We begin the scenario by applying a constant load to our API using the k6 testing framework.
- 10:57 AM: After a few minutes have passed, a Security Group rule is removed, effectively blocking all traffic to the database over port 5432 (the PostgreSQL port). Request success rates plummet to zero almost instantly.
- 11:00 AM: The e-mail arrives indicating the primary ALB is unhealthy.
- We perform our recovery procedure as described above.
- 11:14 AM: Traffic begins returning successfully.
- MTTD is the time it took for us to receive the e-mail after we removed network connectivity.
- RTO is the time it took for traffic to return successfully after we removed network connectivity.
- RPO is the amount of data from the primary database that we do not find in the secondary database.
- MTTD: 3 minutes
- RTO: ~17 minutes
- RPO: 0 minutes
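
The MTTD and RTO figures follow directly from the timeline. As a small worked example, the arithmetic can be reproduced from the three timestamps recorded during the scenario (the helper name is illustrative):

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day clock times like '10:57 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


failure_injected = "10:57 AM"   # Security Group rule removed
alert_received = "11:00 AM"     # e-mail reports the primary ALB unhealthy
traffic_restored = "11:14 AM"   # requests succeed again

mttd = minutes_between(failure_injected, alert_received)
rto = minutes_between(failure_injected, traffic_restored)
print(f"MTTD: {mttd} minutes, RTO: {rto} minutes")
```

Note that the RTO is measured from the moment of failure, so it includes the detection time as well as the manual recovery steps.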
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.