Automate RTO and Data Recovery Validations for AWS Backup with Restore Automation

This guide covers automating continuous Recovery Time Objective (RTO) and data integrity validations for AWS resources backed up with AWS Backup. It includes strategies and sample code for rolling out a solution with a Restore Testing Plan.

Ajay Singh
Amazon Employee
Published Oct 11, 2024
Last Modified Oct 15, 2024
The AWS Well-Architected Framework's Reliability Pillar recommends performing periodic recovery testing of data backups to verify their integrity and the effectiveness of the backup and restoration processes. This practice involves regularly restoring data from backups in a non-production environment to ensure that the backup data is not corrupted and that the restoration procedures work as expected. By doing so, organizations can identify and address any potential issues with backup and recovery mechanisms before an actual data loss or disaster scenario occurs.
Conducting periodic recovery tests for a large number of resources can be a substantial effort, requiring careful planning, coordination, execution, and reporting. A typical recovery test may involve a central team responsible for compute recovery, a database management team conducting database recoveries, and an application team validating the recovered data and measuring compliance against the target Recovery Time Objectives (RTOs). Due to the significant effort involved, these tests are often conducted semi-annually or annually.
As data volumes grow or application usage patterns change, the RTO compliance numbers calculated in the past may become outdated. This is where the Restore Test feature within AWS Backup can assist organizations in building a mechanism for continuous and automated Recovery Time Objective (RTO) and data recovery evaluation workflows, enabling them to gain confidence in the recovery process.
Advantages of creating a restore test plan in AWS Backup
  1. Validate Recovery Time Objectives (RTOs): Restore test plans allow organizations to simulate real-world restore scenarios and measure the actual time it takes to restore data. This helps validate if the backup and restore processes meet defined Recovery Time Objectives (RTOs) for business continuity and disaster recovery.
  2. Verify Data Integrity: Restore test plans enable the verification of backup data integrity by restoring it to a temporary location and performing validation checks on the restored data. This ensures that backups are not corrupted and can be successfully restored when needed.
  3. Test Restore Procedures: Organizations can use restore test plans to validate and refine restore procedures, including any manual steps or automated scripts. This helps identify and address potential issues in the restore process before an actual disaster or data loss event occurs.
  4. Isolate Testing Environment: Restore test plans create an isolated testing environment separate from the production environment, allowing restore testing to be performed without impacting live systems or data.
  5. Automate Testing: Restore test plans can be scheduled and automated using AWS Backup, enabling regular and consistent testing without manual intervention.
  6. Cross-Account and Cross-Region Testing: Restore test plans support restoring backups across AWS accounts and regions, allowing the simulation of different recovery scenarios and validation of cross-account or cross-region restore processes.
  7. Audit and Compliance: Restore test plans provide detailed logs and reports, which can be used for auditing purposes and to demonstrate compliance with backup and recovery requirements.
  8. Continuous Improvement: By regularly conducting restore tests, areas for improvement in backup and recovery strategies can be identified, leading to more robust and resilient systems.
The next section outlines a high-level architecture to build an automated restore test validation workflow and how to measure RTO compliance and data integrity after recoveries. A later section discusses the key implementation steps and concepts. Detailed instructions on implementing the workflow are published in follow-on articles.

Architecture: Automated RTO and Recovery Validation Workflow in AWS Backup

  1. Backup plans trigger scheduled backups of resources (such as DB instances, EC2, S3, FSx, etc.).
  2. Backup copies/recovery points are stored in a backup vault.
  3. The Restore Test plan selects resources based on specific tags or names and initiates a scheduled restore test job.
  4. The Restore Test plan chooses an appropriate recovery point.
  5. New restored resources are created (the restored resources are automatically terminated after the retention period defined in the Restore Test plan expires).
  6. The restore job sends job state change events to EventBridge.
  7. The EventBridge rule matches the COMPLETED event and invokes a Lambda function or Step Function, which runs a validation routine that includes RTO compliance and data integrity checks for the restored resources.
  8. The Restore Job Validation Status is updated.
  9. Restore Reports can be reviewed (refreshed at a 24-hour interval) for compliance purposes.

Key Factors to Consider for Restore Validations

Users should review the AWS documentation to gain a general understanding of AWS Backup and how restores work within it. Understanding the steps listed below will help in building a solution tailored to a specific use case.
  • Make a decision on where to run recovery tests
    The restore test can run in the workload account or in a separate recovery account. If the plan is to run restore tests in a separate Region or account, AWS Backup can copy backups to a vault in the recovery Region and in a separate AWS account.
  • Create a Backup Vault
    A vault is a logical container that stores and organizes backups/recovery points. It provides a centralized location for managing and accessing backups across different AWS services and AWS Regions. One can configure access policies, encryption settings, and other configurations at the vault level, which applies to all backups stored within that vault. Vaults can be created in multiple accounts and across regions, depending on the need and where recoveries need to be tested.
  • Create a Backup Plan
    A Backup Plan is a policy that defines when and how AWS resources, such as Amazon DynamoDB tables, Amazon Elastic File System (Amazon EFS) file systems, Amazon RDS databases, or EC2 instances, should be backed up. AWS Backup automatically takes incremental backups and stores them in a vault. The backup policy can be created centrally in a management account in the case of an AWS Organization, or it can be created separately in each AWS account. During backups, multiple copies can be created and stored across vaults in remote Regions and AWS accounts (see the sketch below).
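For illustration, here is a minimal boto3 sketch of creating a vault and a daily backup plan. The vault name, KMS key ARN, schedule, and lifecycle values are placeholders; resources would then be assigned to the plan with CreateBackupSelection (not shown).

```python
import boto3

backup = boto3.client("backup")

# Create a vault to hold recovery points (name and KMS key are illustrative).
backup.create_backup_vault(
    BackupVaultName="restore-test-demo-vault",
    EncryptionKeyArn="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)

# Create a backup plan with a single daily rule targeting that vault.
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "restore-test-demo-plan",
        "Rules": [
            {
                "RuleName": "daily-backups",
                "TargetBackupVaultName": "restore-test-demo-vault",
                "ScheduleExpression": "cron(0 5 ? * * *)",  # daily at 05:00 UTC
                "StartWindowMinutes": 60,
                "Lifecycle": {"DeleteAfterDays": 35},
            }
        ],
    }
)
print(plan["BackupPlanId"])
```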
  • Create a Restore Test Plan
    A Restore Test plan specifies the frequency of restore tests, target start time, resources, criteria for selecting recovery points, and the retention period for restored resources. By using tags, multiple resources can be included in restore test plans. Restored resources are terminated/deleted after the retention period expires, which helps reduce costs (a sketch of creating a plan programmatically follows below). Completed restore jobs have two statuses - a job status and a validation status. By default, all completed restore jobs have their validation status marked as "Validating".
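The plan can also be created programmatically. The sketch below uses boto3's create_restore_testing_plan; the nested parameter names are written from memory of the CreateRestoreTestingPlan API and should be verified against the current documentation, and all names and ARNs are placeholders. Resources are then added to the plan with create_restore_testing_selection (for example, by tag).

```python
import boto3

backup = boto3.client("backup")

# Sketch: a weekly restore test that picks the latest recovery point from one vault.
backup.create_restore_testing_plan(
    RestoreTestingPlan={
        "RestoreTestingPlanName": "weekly-restore-test",
        "ScheduleExpression": "cron(0 6 ? * SUN *)",  # every Sunday, 06:00 UTC
        "StartWindowHours": 1,
        "RecoveryPointSelection": {
            "Algorithm": "LATEST_WITHIN_WINDOW",
            "IncludeVaults": [
                "arn:aws:backup:us-east-1:111122223333:backup-vault:restore-test-demo-vault"
            ],
            "RecoveryPointTypes": ["SNAPSHOT"],
            "SelectionWindowDays": 7,
        },
    }
)
```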
  • Understand Event Patterns
    AWS Backup sends events to Amazon EventBridge when the state of a backup, restore, or copy job changes. For restore jobs, events are sent for the CREATED, PENDING, RUNNING, COMPLETED, and FAILED state changes. By monitoring the COMPLETED event and understanding the fields contained in these events, one can build a recovery validation routine. Restores that run as part of a restore testing plan include the field "restoreTestingPlanArn" in the event JSON, which differentiates them from events produced by regular restore jobs.
Restore Job Event Examples
Event from a regular restore job for an RDS resource. This event does not contain the field "restoreTestingPlanArn".
Event from a restore test job for an EC2 resource. This event contains the field "restoreTestingPlanArn".
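Again abbreviated, with placeholder values:

```json
{
  "detail-type": "Restore Job State Change",
  "source": "aws.backup",
  "detail": {
    "restoreJobId": "RESTORE-TEST-JOB-ID-EXAMPLE",
    "resourceType": "EC2",
    "status": "COMPLETED",
    "creationDate": "2024-10-11T02:00:00Z",
    "completionDate": "2024-10-11T02:12:00Z",
    "createdResourceArn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",
    "restoreTestingPlanArn": "arn:aws:backup:us-east-1:111122223333:restore-testing-plan:weekly-restore-test"
  }
}
```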
  • Add an EventBridge rule
    To validate RTO compliance and data integrity of restored resources, an EventBridge rule can be configured to capture the COMPLETED event for restore jobs. This rule can then invoke a Lambda function or a Step Functions state machine, which executes a validation routine. The validation routine performs checks to ensure the restored resources meet the desired Recovery Time Objective (RTO) and verifies the integrity of the restored data. It is recommended to create different matching patterns for each type of resource, such as RDS and EC2.
    • Example rule matching patterns below
      It is a best practice to include restoreTestingPlanArn in the matching rule's event pattern to ensure that the Lambda function is only invoked for restore test jobs. In examples a and b, the restoreTestingPlanArn is not included in the event pattern, which results in the Lambda function being triggered for both regular restore jobs and restore test jobs. In example c, for EC2 test restores, restoreTestingPlanArn is included in the event pattern, so the rule only matches when the restore job is run as part of the specified restore testing plan.
    • When using patterns like examples a and b, which do not include the restoreTestingPlanArn, it is still possible to process restore test jobs and ignore regular restore job events. This can be achieved by adding a conditional check in the target Lambda function that verifies the presence of the restore testing plan ARN in the event payload. If the ARN is present, the Lambda function processes the event as a restore test job; if it is absent, the Lambda function ignores the event, treating it as a regular restore job. An illustrative event pattern is sketched below.
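As an illustration (not the article's original examples a-c), a pattern in the spirit of example c could look like the following; the "exists" matcher is used here instead of hard-coding a specific plan ARN.

```json
{
  "source": ["aws.backup"],
  "detail-type": ["Restore Job State Change"],
  "detail": {
    "resourceType": ["EC2"],
    "status": ["COMPLETED"],
    "restoreTestingPlanArn": [{ "exists": true }]
  }
}
```

A specific restoreTestingPlanArn value (or a prefix match) can be listed instead when the rule should fire only for one plan.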
  • Recovery validation routine
    The validation logic can be incorporated in either a Lambda function or a Step Functions state machine. For simple validations, a Lambda function will be sufficient. For complex validations that run for more than 15 minutes (the Lambda timeout limit) or have multiple steps and dependencies, Step Functions can be explored.
    • How to validate RTO (Recovery Time Objective) compliance?
      To measure RTO compliance, the restore job run time needs to be calculated and compared with the expected RTO value (see the sketch below).
      Current RTO - The restore job run time can be identified by taking the difference between the create time and completion time of a restore test job. This information is available in a) the incoming restore job event payload and b) the restore job details. The run time can also be found in scheduled restore job reports, the CSV/JSON files created by scheduled report runs within AWS Backup.
      Expected RTO - Instead of hard-coding the expected RTO values in the Lambda function, it is recommended to store them in the AWS Systems Manager Parameter Store. This approach allows updating the RTO values without modifying the Lambda function when changes are required. Alternatively, if the protected resources are part of an application in AWS Resilience Hub, an API call to AWS Resilience Hub can be used to retrieve the expected RTO value.
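A minimal sketch of the RTO check, assuming the event detail carries ISO-8601 creationDate and completionDate values and that the expected RTO (in minutes) is stored under a hypothetical parameter name such as /restore-test/ec2/expected-rto-minutes:

```python
import boto3
from datetime import datetime

ssm = boto3.client("ssm")

def check_rto(event):
    detail = event["detail"]
    created = datetime.fromisoformat(detail["creationDate"].replace("Z", "+00:00"))
    completed = datetime.fromisoformat(detail["completionDate"].replace("Z", "+00:00"))
    actual_minutes = (completed - created).total_seconds() / 60

    # Expected RTO kept in Parameter Store so it can change without a code deploy.
    expected_minutes = float(
        ssm.get_parameter(Name="/restore-test/ec2/expected-rto-minutes")["Parameter"]["Value"]
    )
    return actual_minutes <= expected_minutes, actual_minutes
```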
    • How to get details about the restored resource?
      The Lambda function extracts the createdResourceArn key from the event, derives the resource ID, and runs the corresponding describe command. This can be used to confirm that the restored resource is running and to retrieve other key details (a sketch appears after this list).
      For EC2 instances, use describe_instances
      For Aurora clusters, use describe_db_clusters
      For RDS instances, use describe_db_instances
      For other resource types, use the corresponding describe command
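For example, a rough sketch for an EC2 restore test, assuming the created resource ARN ends in the instance ID:

```python
import boto3

ec2 = boto3.client("ec2")

def get_restored_instance(event):
    # e.g. arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0
    arn = event["detail"]["createdResourceArn"]
    instance_id = arn.split("/")[-1]

    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    return instance_id, instance["State"]["Name"], instance.get("PrivateIpAddress")
```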
    • How to validate data integrity?
      Depending on the requirements, either a simple validation routine that completes in a few minutes or a deep, multi-step validation can be defined. Here are some suggested data integrity validation approaches for EC2 and RDS resource types (an example sketch follows below):
      Inspect recovered data on the restored EC2 instance and compare with expected data - This could involve file comparison, comparing hash values, or checking other content. A Lambda function can execute remote scripts on the restored and source EC2 instances by calling the AWS Systems Manager documents AWS-RunShellScript (for Linux) or AWS-RunPowerShellScript (for Windows), and compare the data content between the instances. The script can be as simple as a single Linux command, a command that runs the SHA-256 cryptographic hash function on a file, or a more complex shell/Python/PowerShell script.
      For web servers, inspect the data served by EC2 - If the EC2 instances run web servers that expose APIs and web pages, use the private IP address to programmatically fetch web pages or run API calls. The private IP can be obtained by running describe_instances against the restored resource. Compare the received responses with the expected responses to validate the integrity of the web server content.
      For RDS databases - The Lambda function can connect to the restored RDS instance and run SQL queries with aggregate functions (SUM, AVG, COUNT) on the restored database tables. The results can be compared with those of the same queries run against the original database. To securely access the database credentials, store them in AWS Secrets Manager and have the Lambda function retrieve them before connecting to the database.
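As an illustration of the first approach, the sketch below compares a file's SHA-256 hash on the source and restored instances via SSM Run Command. The file path and instance IDs are placeholders, and both instances are assumed to be managed by Systems Manager.

```python
import time
import boto3

ssm = boto3.client("ssm")

def file_sha256(instance_id, path):
    """Run sha256sum on an SSM-managed instance and return the hash string."""
    cmd = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [f"sha256sum {path} | cut -d' ' -f1"]},
    )
    command_id = cmd["Command"]["CommandId"]

    # Poll until the command finishes (simplified; production code should bound retries).
    while True:
        time.sleep(5)
        result = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
        if result["Status"] in ("Success", "Failed", "Cancelled", "TimedOut"):
            break
    return result["StandardOutputContent"].strip()

def data_matches(source_instance_id, restored_instance_id, path="/data/app.db"):
    return file_sha256(source_instance_id, path) == file_sha256(restored_instance_id, path)
```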
  • Update Restore Job Validation Status
    Upon completion of a restore job, the validation status remains "Validating". After the recovery validation routine runs, the PutRestoreValidationResult API can be invoked to update the status to one of the following values: FAILED, SUCCESSFUL, TIMED_OUT, or VALIDATING. If the validation was successful, the status should be updated to SUCCESSFUL (see the sketch below).
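A minimal sketch of the status update, taking the restore job ID from the event:

```python
import boto3

backup = boto3.client("backup")

def report_validation(event, passed):
    backup.put_restore_validation_result(
        RestoreJobId=event["detail"]["restoreJobId"],
        ValidationStatus="SUCCESSFUL" if passed else "FAILED",
        ValidationStatusMessage="Automated RTO and data integrity checks",
    )
```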
  • Monitoring Automated Restore Test
    The AWS Backup console provides job dashboards for monitoring job health, success/failure metrics. Additionally, Amazon CloudWatch and Amazon EventBridge can be utilized for monitoring AWS Backup processes related to the automated restore tests. To receive notifications about AWS Backup-related events, including backup, restore, and copy events, Amazon Simple Notification Service (Amazon SNS) can be used by subscribing to the relevant topics.
  • Building Custom Reports - Use AWS Backup Scheduled Automated Reports
    AWS Backup can be configured to generate automated reports in an S3 bucket, showing the status of all historical restore jobs. These CSV files list details such as the restore test plan ARN, job run time, and status, among other information. By comparing the job run time with the expected RTO, a running report can be derived to show the variance of RTO over time. These CSV files can also be fed into business intelligence tools for reporting.
  • Deleting Test Restores
    AWS Backup provides a cost-saving feature that automatically deletes restored data from test restores. This feature helps manage costs associated with resource usage during testing and verification processes. When performing a test restore, AWS Backup adds a tag called "awsbackup-restore-test" to the restored resources. This tag identifies and enables the deletion of these resources after testing is complete, ensuring that they do not incur ongoing costs. However, if some restored resources are actively in use and cannot be deleted by AWS Backup, the "awsbackup-restore-test" tag can be utilized to aggregate and manage these resources separately. In such cases, a custom deletion routine can be implemented to remove the tagged resources when they are no longer needed.
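For that last case, a rough sketch of a custom cleanup routine for EC2 instances, assuming the restored instances still carry the awsbackup-restore-test tag:

```python
import boto3

ec2 = boto3.client("ec2")

def cleanup_restore_test_instances(dry_run=True):
    # Find instances that AWS Backup tagged during restore testing.
    paginator = ec2.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(
        Filters=[
            {"Name": "tag-key", "Values": ["awsbackup-restore-test"]},
            {"Name": "instance-state-name", "Values": ["running", "stopped"]},
        ]
    ):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])

    # With dry_run=True the routine only reports what it would delete.
    if instance_ids and not dry_run:
        ec2.terminate_instances(InstanceIds=instance_ids)
    return instance_ids
```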

Example Lambda Function Showing Data Recovery & RTO Validation for an EC2 Resource

This Lambda function parses the restore job event described earlier. The SSM Parameter Store holds the expected RTO and the path of the file whose hash value needs to be calculated.
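The original code listing is not reproduced here; the following is a minimal sketch of what such a function could look like. It assumes hypothetical Parameter Store keys (/restore-test/ec2/expected-rto-minutes, /restore-test/ec2/validation-file, /restore-test/ec2/expected-sha256), that the restored instance is managed by AWS Systems Manager, and that the function's role has the backup and ssm permissions used below.

```python
import time
import boto3
from datetime import datetime

backup = boto3.client("backup")
ssm = boto3.client("ssm")

def get_param(name):
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]

def run_shell(instance_id, command):
    """Run a shell command on an SSM-managed instance and return its stdout."""
    cmd = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [command]},
    )
    command_id = cmd["Command"]["CommandId"]
    while True:
        time.sleep(5)
        result = ssm.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
        if result["Status"] in ("Success", "Failed", "Cancelled", "TimedOut"):
            return result["StandardOutputContent"].strip()

def lambda_handler(event, context):
    detail = event["detail"]

    # Process only restore test jobs (regular restore jobs lack this field).
    if "restoreTestingPlanArn" not in detail:
        return {"skipped": True}

    # 1. RTO check: job run time vs. expected RTO from Parameter Store.
    created = datetime.fromisoformat(detail["creationDate"].replace("Z", "+00:00"))
    completed = datetime.fromisoformat(detail["completionDate"].replace("Z", "+00:00"))
    actual_minutes = (completed - created).total_seconds() / 60
    rto_ok = actual_minutes <= float(get_param("/restore-test/ec2/expected-rto-minutes"))

    # 2. Data integrity check: SHA-256 of a known file on the restored instance.
    instance_id = detail["createdResourceArn"].split("/")[-1]
    validation_file = get_param("/restore-test/ec2/validation-file")
    actual_hash = run_shell(instance_id, f"sha256sum {validation_file} | cut -d' ' -f1")
    data_ok = actual_hash == get_param("/restore-test/ec2/expected-sha256")

    # 3. Report the result back to AWS Backup.
    passed = rto_ok and data_ok
    backup.put_restore_validation_result(
        RestoreJobId=detail["restoreJobId"],
        ValidationStatus="SUCCESSFUL" if passed else "FAILED",
        ValidationStatusMessage=f"RTO ok: {rto_ok} ({actual_minutes:.1f} min), data ok: {data_ok}",
    )
    return {"rto_ok": rto_ok, "data_ok": data_ok, "restore_minutes": actual_minutes}
```

Because the SSM polling can take several minutes, the function timeout should be set accordingly, or the hash check moved to a Step Functions workflow for longer validations.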
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
