Choosing The Right Chaos Engineering Tool for the Job
Basic guide for choosing chaos engineering tools for your AWS workload
Quickly restoring service after an outage requires resilient architecture and thorough disaster recovery testing.
Chaos engineering helps developers set up and run controlled experiments across a range of AWS services, and can be used to find blind spots and respond to infrequent but critical events.
- A known workload you want to test.
- A data-driven hypothesis of how you expect a workload to respond to disruption.
- The type of disruption you want to introduce.
- A tool to run the experiment.
For more information on using FIS to stop EC2 instances, please visit here
- AWS APIs
- Amazon CloudWatch
- Amazon EBS
- Amazon EC2
- Amazon ECS
- Amazon EKS
- AWS Networking
- Amazon RDS
- AWS Systems Manager
FIS can be used both for simple scenarios, like throttling the CPU of a single EC2 instance, and for complex real-world scenarios that gradually and simultaneously impair the performance of different types of resources, APIs, services, and geographic locations. Affected resources can be randomized, and custom fault types can be created using AWS Systems Manager to further increase complexity. You can set up guardrails to only affect resources with specific tags, and set rules based on CloudWatch alarms or other tools to stop an experiment.
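To make the guardrails concrete, here is a minimal sketch of what an experiment template for a simple scenario could look like, built as a plain Python dictionary. The function name, tag, and ARNs are hypothetical placeholders. The field names follow the FIS experiment template format as I understand it, so check them against the FIS documentation before relying on them.

```python
import json


def build_stop_instances_template(role_arn, alarm_arn, tag_key, tag_value):
    """Return a dict shaped like an FIS experiment template (illustrative only)."""
    return {
        "description": "Stop one randomly chosen tagged EC2 instance",
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                # Guardrail: only instances carrying this tag are eligible.
                "resourceTags": {tag_key: tag_value},
                # COUNT(1) selects a single eligible instance at random.
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the instance after five minutes (ISO 8601 duration).
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "taggedInstances"},
            }
        },
        # Stop condition: the experiment halts if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "roleArn": role_arn,
    }


if __name__ == "__main__":
    # All ARNs and the tag below are made-up placeholders.
    template = build_stop_instances_template(
        role_arn="arn:aws:iam::111122223333:role/example-fis-role",
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-alarm",
        tag_key="chaos-ready",
        tag_value="true",
    )
    print(json.dumps(template, indent=2))
```

A dictionary like this could then be passed to the `create_experiment_template` call of the boto3 `fis` client, assuming the IAM role and the CloudWatch alarm already exist in your account.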
FIS is integrated with AWS Identity and Access Management (IAM) so you can control which users and resources have permission to access and run experiments, and which resources and services can be affected.
FIS provides visibility throughout every stage of an experiment via the AWS console and APIs. You can observe which actions have executed while an experiment is running, and view details of actions, stop conditions which were triggered, how metrics compared to your expected behavior, and more. You can use FIS from within the AWS console, AWS CLI, and AWS SDKs. You can access the FIS service programmatically to integrate experiments into your CI/CD pipelines.
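As a rough illustration of the CI/CD integration, a pipeline step could start an experiment and wait for it to finish. The sketch below assumes a client object that behaves like the boto3 `fis` client (`start_experiment` and `get_experiment`). The function name, the polling defaults, and the set of terminal statuses are my own assumptions, so confirm them against the FIS API reference.

```python
import time
import uuid

# Statuses assumed to mean the experiment is no longer running.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment_and_wait(fis_client, template_id, poll_seconds=10,
                            timeout_seconds=1800, sleep=time.sleep):
    """Start an experiment from a template and poll until it reaches a terminal status.

    Returns the final status string, or raises TimeoutError if the
    experiment is still running after roughly timeout_seconds.
    """
    response = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = response["experiment"]["id"]

    waited = 0
    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return state["status"]
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"experiment {experiment_id} still {state['status']} "
                f"after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

In a pipeline you might call `run_experiment_and_wait(boto3.client("fis"), template_id)` and fail the build unless the result is `"completed"`. The `sleep` parameter exists only so the loop can be exercised in tests without real waiting.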
You can get started with AWS FIS here
Chaos Toolkit can be deployed locally, onto an EC2 instance, or in AWS Batch as a Docker image to run from inside your AWS environment. Extension modules for AWS have been added to Chaos Toolkit and can be found on GitHub.
Each Chaos Toolkit Experiment is built around a single file using JSON. The JSON file will consist of the following sections:
- steady-state-hypothesis describes the normal state of your workload. It is checked before the experiment runs to make sure the normal state is in place, and again after the experiment runs for comparison.
- method contains the actual experiment activities which will take place.
- action is the activity which will be applied to the workload during the experiment.
- probes define how to observe the workload during the experiment.
- controls are declared operational controls which affect the experiment execution.
- rollbacks define how to revert back to a normal state.
Experiments can be built using JSON to run AWS modules which call the AWS API to perform actions. Each experiment consists of Actions, which are operations performed against the workload, and Probes, which collect information from the workload during the experiment.
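The sections described above could come together roughly as follows. This sketch builds a small experiment in Python and writes it out as JSON. The health check URL, the tag filter, and the function name are hypothetical, and the `chaosaws.ec2.actions` functions `stop_instances` and `start_instances` are referenced as I understand the AWS extension, so verify the module and argument names against its documentation.

```python
import json


def build_experiment(health_url, instance_filters):
    """Return a dict shaped like a Chaos Toolkit experiment (illustrative only)."""
    return {
        "title": "Workload stays healthy when tagged instances stop",
        "description": "Stop EC2 instances matching a filter and watch the health endpoint.",
        # Checked before and after the method runs.
        "steady-state-hypothesis": {
            "title": "Health endpoint responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "health-endpoint-is-ok",
                    "tolerance": 200,
                    "provider": {"type": "http", "url": health_url, "timeout": 3},
                }
            ],
        },
        # The disruption itself: one action from the AWS extension.
        "method": [
            {
                "type": "action",
                "name": "stop-tagged-instances",
                "provider": {
                    "type": "python",
                    "module": "chaosaws.ec2.actions",
                    "func": "stop_instances",
                    "arguments": {"filters": instance_filters},
                },
                # Give the workload a minute to react before re-checking.
                "pauses": {"after": 60},
            }
        ],
        # Revert to the normal state once the experiment is done.
        "rollbacks": [
            {
                "type": "action",
                "name": "start-tagged-instances",
                "provider": {
                    "type": "python",
                    "module": "chaosaws.ec2.actions",
                    "func": "start_instances",
                    "arguments": {"filters": instance_filters},
                },
            }
        ],
    }


if __name__ == "__main__":
    # The URL and tag are placeholders for your own workload.
    experiment = build_experiment(
        health_url="https://example.com/health",
        instance_filters=[{"Name": "tag:chaos-ready", "Values": ["true"]}],
    )
    with open("experiment.json", "w") as handle:
        json.dump(experiment, handle, indent=2)
```

The resulting file could then be run with `chaos run experiment.json`, provided Chaos Toolkit and its AWS extension are installed and AWS credentials are available.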
Chaos Toolkit allows you to run a discovery which can be used to help build your experiments. After running an experiment you can generate a report as a PDF or HTML to view the results.
Chaos Toolkit is suitable for more experienced teams who want to run specific experiments against their AWS workloads, but it requires a significant amount of hands-on effort.
You can get started with Chaos Toolkit here
- Resource experiments targeting compute resources.
- Network experiments targeting network latency, packet loss, DNS, and certificates.
- State experiments targeting instance state, processes, and system time.
Gremlin does not have the same native AWS API integration as FIS, or extension modules like those of Chaos Toolkit. You can, however, create custom experiments to simulate scenarios like an AZ failure by dropping all network traffic or killing an application process.
Gremlin is a good fit for a team that wants to run experiments against just its compute resources, rather than use native AWS API integration to target managed services. A less experienced team can benefit from its ease of use and the minimal effort needed to get started.
You can get started with Gremlin here
Feature | AWS FIS | Chaos Toolkit | Gremlin |
---|---|---|---|
License | Pay-For-What-You-Use | Open Source | Commercial |
Deployment | AWS Managed Service | EC2, Docker container using AWS Batch | Agent based on EC2, EKS |
Metrics/Scoring | Yes (when integrated with AWS Resilience Hub) | No | Yes |
Custom experiments | Yes | Yes | Yes |
Rollback | Yes | Yes | Yes |
ECS/EKS | Yes | Yes | Yes |
EC2 | Yes | Yes | Yes |
RDS failover | Yes | Yes | No |
GUI | Yes | No | Yes |
CLI | Yes | Yes | Yes |
Application testing | Yes | No | Yes |
Randomized target | Yes | Yes | Yes |
Network testing | Yes | Yes | Yes |
AZ failure | Yes (Managed scenario from library to simulate complete AZ power interruption including loss of zonal compute, no re-scaling, subnet connectivity loss, RDS failover, ElastiCache failover, unresponsive EBS volumes) | Yes (Blackhole ACL, ELB target changes, ElastiCache failover, ActiveMQ failover) | Yes (Blackhole ACL) |
Region failure | Yes (Managed scenario from library to simulate loss of cross-region connectivity including pausing cross-region replication for S3 and DynamoDB) | No | Yes (Blackhole ACL) |
Thoughtfully incorporating the right chaos tools into your testing regimen will strengthen system resilience and improve incident response when outages strike.
- Chaos Engineering with AWS Fault Injection Service: https://www.youtube.com/watch?v=AThR8dFmPP4
- Chaos engineering leveraging AWS Fault Injection Service in a multi-account AWS environment: https://aws.amazon.com/blogs/mt/chaos-engineering-leveraging-aws-fault-injection-simulator-in-a-multi-account-aws-environment/
- AWS Fault Injection Service blogs: https://aws.amazon.com/blogs/devops/tag/aws-fault-injection-simulator/
- How Finbourne Assures Resiliency Through Chaos Engineering Events Every 17 min: https://www.youtube.com/watch?v=lkDq9g43djw
- DPG Media Successfully Launches Video On Demand Service with Gremlin and AWS: https://aws.amazon.com/partners/success/dpg-media-gremlin/?did=ps_card&trk=ps_card
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.