Fix Gray Failures Fast Using Automation and Route 53 ARC Zonal Shift
Your application health check dashboard is all green, but your customers are having a poor user experience. Here is how to detect and automatically fix such events.
- Normal Operations: both service monitoring and customers see the application as healthy.
- Over-reaction: service monitoring sees the application as unhealthy but everything looks fine to the customers.
- Hard Failure: both service monitoring and customers perceive the application as unhealthy.
- Gray Failure: the service monitoring perceives the application as healthy but the customers are experiencing issues and perceive the application as unhealthy.
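The four quadrants above can be sketched as a small lookup on two observations. This is only an illustration; the names `FailureState` and `classify` are hypothetical and do not come from the sample repository.

```python
from enum import Enum


class FailureState(Enum):
    NORMAL = "normal operations"
    OVER_REACTION = "over-reaction"
    HARD_FAILURE = "hard failure"
    GRAY_FAILURE = "gray failure"


def classify(monitoring_healthy: bool, customers_healthy: bool) -> FailureState:
    """Map the monitoring view and the customer view onto one quadrant."""
    if monitoring_healthy and customers_healthy:
        return FailureState.NORMAL
    if not monitoring_healthy and customers_healthy:
        return FailureState.OVER_REACTION
    if not monitoring_healthy and not customers_healthy:
        return FailureState.HARD_FAILURE
    # Monitoring is green while customers are suffering.
    return FailureState.GRAY_FAILURE
```

The gray failure case is the only one where the monitoring signal gives no hint of trouble, which is why it needs a separate detection mechanism.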
- Detecting gray failures with outlier detection in Amazon CloudWatch Contributor Insights, which shows how to use Amazon CloudWatch Contributor Insights to help detect gray failures
- Implementing health checks, which shares best practices for using health checks
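To give a feel for what outlier detection means here, the sketch below flags an AZ whose error rate is far above that of its peers. It is a simplified stand-in, not how CloudWatch Contributor Insights works internally; the function name, the input shape, and the `min_rate` and `factor` thresholds are all assumptions chosen for illustration.

```python
from statistics import median


def find_outlier_az(stats, min_rate=0.05, factor=3.0):
    """Return the AZ ID whose error rate stands out from the others, or None.

    stats maps an AZ ID to a tuple of (error_count, request_count).
    """
    rates = {
        az: (errors / total if total else 0.0)
        for az, (errors, total) in stats.items()
    }
    if len(rates) < 2:
        return None  # nothing to compare against

    worst = max(rates, key=rates.get)
    baseline = median(rate for az, rate in rates.items() if az != worst)

    # Flag the worst AZ only if it is both bad in absolute terms and
    # clearly worse than the other AZs.
    if rates[worst] >= min_rate and rates[worst] > factor * baseline:
        return worst
    return None
```

If every AZ is failing at a similar rate, nothing is flagged, because shifting traffic away from a single AZ would not help in that situation.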
Use the StartZonalShift API to stop traffic from going to the impaired AZ. The AWS SDKs and developer tools let you call the Zonal Shift API endpoint from the programming language of your choice.
- The App Server has lost network access to the Database. The NLB still sees the App Server as healthy and continues to send requests to it. Client requests are failing with 5XX errors. A Gray Failure has occurred in the AZ.
- The Application monitoring mechanism detects the gray failure. A message is sent to the SQS queue.
- The SQS queue triggers a Lambda Function with information about the degraded AZ.
- The Lambda function makes an API call to start Route 53 ARC Zonal Shift.
- Route 53 ARC Zonal Shift shifts traffic from the degraded AZ.
- The Lambda function publishes a message to an SNS topic for notification.
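The flow above could look roughly like the following Lambda handler. Treat it as a minimal sketch rather than the code in the sample repository: the message fields `resource_arn` and `az_id`, the `TOPIC_ARN` environment variable, and the one-hour expiry are assumptions made for this example. The `start_zonal_shift` and `publish` calls are the Boto3 operations for Route 53 ARC zonal shift and SNS.

```python
import json


def parse_records(event):
    """Extract (resource_arn, az_id) pairs from an SQS-triggered Lambda event."""
    shifts = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        # Hypothetical message fields; the real payload format may differ.
        shifts.append((body["resource_arn"], body["az_id"]))
    return shifts


def handle(event, arc, sns, topic_arn, expires_in="1h"):
    """Start a zonal shift for each message and notify through SNS."""
    started = []
    for resource_arn, az_id in parse_records(event):
        response = arc.start_zonal_shift(
            resourceIdentifier=resource_arn,  # ARN of the load balancer
            awayFrom=az_id,                   # AZ ID, for example "use1-az1"
            expiresIn=expires_in,             # the shift expires on its own
            comment="Automated shift after gray failure detection",
        )
        shift_id = response.get("zonalShiftId")
        sns.publish(
            TopicArn=topic_arn,
            Subject="Zonal shift started",
            Message=json.dumps(
                {"resource": resource_arn, "awayFrom": az_id, "zonalShiftId": shift_id}
            ),
        )
        started.append(shift_id)
    return started


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS access.
    import os
    import boto3

    return handle(
        event,
        boto3.client("arc-zonal-shift"),
        boto3.client("sns"),
        os.environ["TOPIC_ARN"],  # assumed environment variable
    )
```

Passing the clients in as arguments keeps the shift logic testable with simple fakes, and setting `expiresIn` means the shift ends automatically if nobody cancels it.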
- Lambda Layer: contains the Boto3 SDK for Python, which the function needs to make zonal shift API calls.
- Lambda function: contains the Python logic to start the zonal shift.
- SQS queue: receives the gray failure message from the monitoring app and triggers the Lambda function.
- Clone the GitHub repository.
    git clone https://github.com/build-on-aws/automated-arc-zonal-shift
- Change directory into the cloned repo.
    cd automated-arc-zonal-shift
- Ensure CDK is installed.
    npm install -g aws-cdk
- Create a Python virtual environment.
    python3 -m venv .venv
- Activate the virtual environment.
  On macOS or Linux:
    source .venv/bin/activate
  On Windows:
    .venv\Scripts\activate.bat
- Install the required dependencies.
    pip install -r requirements.txt
- Synthesize (cdk synth) or deploy (cdk deploy) the example.
    cdk deploy
Test the deployment by sending a message to the SQS queue using the payload.json file. This should trigger the Lambda function and start the Zonal Shift.
- From the root directory, change to src/sample.
    cd src/sample
- Edit the payload.json file to reflect your load balancer's name. DO NOT change the Key.
    vi payload.json
- Log into the AWS console, navigate to the Load Balancer page, and note the status of the NLB. At this point there is no column for Zonal Shift.
- Trigger the Zonal Shift by running the command below. The AWS CLI requires the --queue-url option; replace <your-queue-url> with the URL of the SQS queue that receives the gray failure messages.
    aws sqs send-message --queue-url <your-queue-url> --message-body file://payload.json
- To verify the shift, return to the Load Balancer page of the AWS console and check the status of the NLB. New columns for Zonal Shift have now appeared.
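The send and verify steps can also be scripted with Boto3 instead of the CLI and console. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the queue URL and load balancer ARN must come from your own deployment, and pagination of `list_zonal_shifts` is ignored for brevity.

```python
from pathlib import Path


def send_payload(sqs, queue_url, payload_path="payload.json"):
    """Send the contents of the payload file to the queue; return the message ID."""
    body = Path(payload_path).read_text()
    response = sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return response["MessageId"]


def active_shifts(arc, resource_arn):
    """Return the active zonal shifts that target the given resource ARN."""
    response = arc.list_zonal_shifts(status="ACTIVE")
    return [
        item
        for item in response.get("items", [])
        if item.get("resourceIdentifier") == resource_arn
    ]


# Example usage (requires AWS credentials and your own values):
#   import boto3
#   send_payload(boto3.client("sqs"), "<your-queue-url>")
#   print(active_shifts(boto3.client("arc-zonal-shift"), "<your-nlb-arn>"))
```

An empty list from `active_shifts` shortly after sending the message may simply mean the Lambda function has not run yet, so a retry with a short delay is reasonable.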
To clean up, delete the deployed stack:
    cdk destroy
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.