
Chaos Engineering with AWS FIS and Lambda
Use AWS FIS with Lambda to practice chaos engineering and test resiliency. Learn how to set up experiments and simulate failures to build confidence in your serverless app.
"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production."
— Principles of Chaos Engineering
- Aurora DB clusters
- RDS DB instances
- DynamoDB global tables
- EBS volumes
- EC2 Auto Scaling groups
- EC2 instances
- EC2 Spot Instances
- ECS clusters
- ECS tasks
- EKS clusters
- EKS node groups
- EKS Kubernetes pods
- S3 buckets
- VPC subnets
- Lambda functions
- ElastiCache (Redis OSS) Replication Groups
- IAM roles
- Transit gateways
- Invocation delay (invocation-add-delay)
- Invocation error (invocation-error)
- Invocation HTTP integration response (invocation-http-integration-response)
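As a small illustration, the short action names listed above can be expanded into the full FIS action IDs used in an experiment template. This is a minimal sketch: the `aws:lambda:` prefix follows the full ID `aws:lambda:invocation-add-delay`, and the helper name `full_action_id` is hypothetical, introduced only for this example.

```python
# Short action names as listed above, keyed by a human-readable label.
LAMBDA_FIS_ACTIONS = {
    "Invocation delay": "invocation-add-delay",
    "Invocation error": "invocation-error",
    "Invocation HTTP integration response": "invocation-http-integration-response",
}

# Prefix shared by the Lambda actions in FIS experiment templates.
ACTION_PREFIX = "aws:lambda:"


def full_action_id(short_name: str) -> str:
    """Return the full FIS action ID for one of the known short names."""
    if short_name not in LAMBDA_FIS_ACTIONS.values():
        raise ValueError(f"unknown Lambda FIS action: {short_name}")
    return ACTION_PREFIX + short_name


for label, short_name in LAMBDA_FIS_ACTIONS.items():
    print(f"{label}: {full_action_id(short_name)}")
```

Rejecting unknown names up front is a deliberate choice here: a typo in an action ID would otherwise only surface when the experiment template is created.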
AWS_LAMBDA_EXEC_WRAPPER to /opt/aws-fis/bootstrap. This is what enables that Lambda Runtime API proxy. FIS uses some of the Lambda runtime environment modification capabilities to provide functionality. You don't need to worry too much about the details unless you are using additional extensions to the Lambda environment, in which case you'll need to set up a proxy chain.

AWS_FIS_CONFIGURATION_LOCATION with an S3 bucket ARN that points to the FisConfigs prefix in the S3 bucket you set up, for example arn:aws:s3:::my-config-distribution-bucket/FisConfigs/. This lets the FIS layer know where it should look for configuration details. The second is AWS_LAMBDA_EXEC_WRAPPER with /opt/aws-fis/bootstrap as the value. This sets up the Lambda Runtime API proxy mentioned earlier.

- Short experiment action durations
- Using SnapStart
- Fast and infrequently invoked functions
- Functions already using Lambda extensions
- Functions using container runtimes
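The two environment variables described earlier (AWS_FIS_CONFIGURATION_LOCATION and AWS_LAMBDA_EXEC_WRAPPER) can be assembled with a small helper. The following is a minimal sketch, not a definitive setup: the function name `fis_environment_variables` is hypothetical, and the bucket name is the example value from the text.

```python
from typing import Dict, Optional


def fis_environment_variables(
    config_bucket: str,
    existing: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Merge the two FIS-related variables into a function's environment.

    `existing` holds variables the function already defines, so they are
    preserved rather than dropped when the environment is rewritten.
    """
    variables = dict(existing or {})
    # Tells the FIS layer where to look for configuration details.
    variables["AWS_FIS_CONFIGURATION_LOCATION"] = (
        f"arn:aws:s3:::{config_bucket}/FisConfigs/"
    )
    # Enables the Lambda Runtime API proxy.
    variables["AWS_LAMBDA_EXEC_WRAPPER"] = "/opt/aws-fis/bootstrap"
    return variables


env = fis_environment_variables(
    "my-config-distribution-bucket",
    existing={"LOG_LEVEL": "info"},  # hypothetical pre-existing variable
)
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

The resulting dictionary could then be supplied as the `Variables` map when updating the function configuration, for example through infrastructure-as-code or the Lambda API. Merging with the existing variables matters because a function's environment is typically replaced as a whole rather than patched.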
aws:lambda:invocation-add-delay action with a startup delay of 30,000 milliseconds (30 seconds). This will ensure our Lambda function invocations time out. Since we can break the processing of these messages, we can set our action to run 100% of the time. We need to do a little math to ensure we keep our experiment running long enough.

60 seconds × 3 delivery attempts = 180 seconds
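The arithmetic above, together with the action settings from the text, can be sketched in a few lines. This is a hedged example: the parameter keys `duration`, `startupDelayMilliseconds`, and `invocationPercentage` reflect my understanding of the aws:lambda:invocation-add-delay action and should be checked against the FIS actions reference, and the helper names are hypothetical.

```python
from typing import Dict


def minimum_experiment_seconds(seconds_per_attempt: int, delivery_attempts: int) -> int:
    """Time needed for a message to go through every delivery attempt."""
    return seconds_per_attempt * delivery_attempts


def to_iso8601_duration(total_seconds: int) -> str:
    """Format whole seconds as an ISO 8601 duration such as PT3M."""
    if total_seconds <= 0:
        raise ValueError("duration must be positive")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    if seconds:
        text += f"{seconds}S"
    return text


def delay_action_parameters(
    duration_seconds: int,
    startup_delay_ms: int = 30_000,    # 30 seconds, as in the text
    invocation_percentage: int = 100,  # affect every invocation
) -> Dict[str, str]:
    """Build string-valued parameters for the add-delay action (assumed keys)."""
    return {
        "duration": to_iso8601_duration(duration_seconds),
        "startupDelayMilliseconds": str(startup_delay_ms),
        "invocationPercentage": str(invocation_percentage),
    }


required = minimum_experiment_seconds(seconds_per_attempt=60, delivery_attempts=3)
parameters = delay_action_parameters(required)
print(required)                # 180
print(parameters["duration"])  # PT3M
```

Treat 180 seconds as a lower bound: running the action somewhat longer leaves headroom so the final delivery attempt still falls inside the experiment window.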