At re:Invent 2024, Jennifer and I ran a chaos engineering workshop for generative AI workloads. The goal was to help customers reason about how to apply chaos engineering practices to GenAI workloads to improve resilience. As part of the workshop, we asked all the attendees to spend 15 minutes writing down some possible failure modes to consider for the hypothesis backlog.
Top chaos engineering concerns for GenAI workloads
The chart below shows the distribution of hypothesis backlog entries grouped by topic. Let’s explore some of the entries from the top four areas: performance, latency, availability, and database.
Performance
Performance is a top concern, which is not surprising. The performance of GenAI workloads varies quite a bit based on the LLM you use, the application pattern (e.g., agents and RAG), where your LLM is hosted, and the overall load on the system.
A sudden spike in load was the most common theme in the performance entries.
If the request rate to the LLM is 1,000 requests per second, then the p95 response time will be under 1200 ms, and no throttling will occur.
A related concern is the size of the context, including documents, passed to the LLM.
If the median number of tokens in the input context is 20,000, the p95 response time will be under 5000 ms.
Testing for these concerns is relatively easy using a load testing tool like Locust. You can simulate extra load and vary the input payload (prompt) size. Mitigating performance concerns may be as simple as making sure that your Bedrock on-demand inference quotas are set appropriately, or you could consider using Bedrock Provisioned Throughput.
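As a rough illustration, here is a minimal Locust sketch for this kind of test. The /chat endpoint, host, and payload shape are assumptions about your own application front end, not anything prescribed by the workshop; adjust them to match how your system calls the LLM.

```python
# locustfile.py -- minimal sketch of a load test for a GenAI API endpoint.
# Assumes your application exposes an HTTP endpoint (here called /chat)
# that forwards prompts to the LLM; adjust the path and payload to match.
from locust import HttpUser, task, between

SHORT_PROMPT = "Summarize our refund policy in two sentences."
# Pad the prompt to roughly simulate a large retrieved context.
LONG_PROMPT = SHORT_PROMPT + " " + ("Relevant policy text. " * 2000)


class ChatUser(HttpUser):
    wait_time = between(1, 3)  # seconds between requests per simulated user

    @task(4)
    def short_request(self):
        self.client.post("/chat", json={"prompt": SHORT_PROMPT})

    @task(1)
    def large_context_request(self):
        # Occasionally send a much larger payload to observe p95 latency
        # and throttling behavior with big input contexts.
        self.client.post("/chat", json={"prompt": LONG_PROMPT})
```

Running this with `locust -f locustfile.py --host https://your-app.example.com` lets you ramp the request rate toward your target (for example, 1,000 requests per second) while watching p95 latency and throttling errors.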
Variation in context size can have a big impact on LLM latency. If you expect to occasionally get large input context, you should evaluate your LLM’s latency with that context size. If you find that you cannot get the latency you want, you can consider techniques like prompt chaining, which breaks a single call to an LLM with a large context into a series of calls with smaller context.
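As a sketch of what prompt chaining can look like, the following uses the Amazon Bedrock Converse API via boto3. The model ID, chunk size, and summarization prompts are illustrative assumptions rather than recommendations.

```python
# Sketch of prompt chaining with the Amazon Bedrock Converse API (boto3).
# The model ID and chunk size are illustrative assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # assumption: any Converse-capable model


def converse(prompt: str) -> str:
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]


def summarize_large_document(document: str, chunk_chars: int = 20_000) -> str:
    # Break one large-context call into several smaller calls...
    chunks = [document[i : i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partial_summaries = [
        converse(f"Summarize the following text:\n\n{chunk}") for chunk in chunks
    ]
    # ...then combine the intermediate results in a final, smaller call.
    return converse(
        "Combine these partial summaries into one:\n\n" + "\n\n".join(partial_summaries)
    )
```

The trade-off is more LLM calls (and potentially more total latency end to end), in exchange for keeping each individual call within a context size that meets your latency target.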
Latency
Similar to performance, latency is also a concern. Many LLMs are used in systems that directly interact with end users. Waiting more than a couple of seconds for a response leads to a poor user experience.
Two common themes emerged in the entries for latency. The first is unavailability of an LLM, which might manifest as an API timeout:
If the inference call to the LLM times out without responding, the application will return a useful error message to the user.
The second is latency issues with the vector database.
If the vector database response time increases by 500 ms, the application will proceed with limited retrieval context.
You can test for these scenarios by using chaos engineering tools that let you inject latency or simulate API error codes. Mitigating these types of problems requires thinking about what type of user experience you want to offer. Your front-end application can fail gracefully if an LLM is unavailable, or attempt to automatically retry, while keeping the user informed about what’s happening. On the database side, you can use techniques like automatic scaling to increase database resources if the latency is due to an increase in load. You can also have the front-end application respond to database latency by limiting the amount of context retrieved; that will reduce the quality of the overall response, but can let the system operate in a degraded way.
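One possible sketch of this kind of graceful degradation is shown below, assuming a boto3 Bedrock Converse client; the model ID is an assumption, and retrieve_context() is a placeholder for your own vector database lookup, which is expected to return None when it times out or degrades.

```python
# Sketch: fail gracefully when the LLM call times out, and proceed with
# limited retrieval context when the vector database is slow.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ReadTimeoutError

# Cap how long we wait for the model and let the SDK retry transient errors.
bedrock = boto3.client(
    "bedrock-runtime",
    config=Config(read_timeout=10, retries={"max_attempts": 2, "mode": "adaptive"}),
)


def answer(question: str, retrieve_context) -> str:
    docs = retrieve_context(question, limit=8)  # placeholder vector DB call
    if docs is None:
        # Vector DB was slow or unavailable: continue with no/limited context
        # rather than failing the whole request.
        docs = []
    prompt = "\n\n".join(docs + [question])
    try:
        response = bedrock.converse(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
            messages=[{"role": "user", "content": [{"text": prompt}]}],
        )
        return response["output"]["message"]["content"][0]["text"]
    except (ReadTimeoutError, ClientError):
        # Surface a useful message instead of a stack trace when the LLM
        # times out or returns an error such as throttling.
        return "The assistant is temporarily unavailable. Please try again shortly."
```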
Availability
In analyzing the availability-focused hypotheses from workshop participants, a clear pattern emerged. Participants consistently identified the need to test system behavior during partial and complete failures across different infrastructure components. The collected hypotheses ranged from single-node failures to Availability Zone impairments, with most participants suggesting controlled tests lasting between 1 and 10 minutes.
Hypothesis: If we fail one processing node in the LLM engine for 1 minute, then: Steady state maintains less than 1% request failures and p99 latency under 200ms with all nodes serving requests. Under experiment conditions, expect increased latency not exceeding 400ms p99 with traffic redistribution to healthy nodes. Expected RTO is 2 minutes with full recovery to steady-state metrics; requests routed to the failed node return errors until recovery completes. Monitoring systems detect and alert within 30 seconds of failure.
Node failures in LLM processing engines disrupt customer requests. Prevent these disruptions through infrastructure design that prioritizes fast recovery. Deploy LLM workloads across multiple nodes in different Availability Zones with dedicated spare capacity. Auto scaling works well for gradual traffic increases but requires model loading time for new instances. For production workloads, maintain spare capacity with pre-warmed models to handle node failures. This trade-off increases infrastructure costs but delivers consistent performance and faster recovery times.
Test node failures through controlled experiments, starting with process termination tests to validate basic recovery and request routing. Progress to network partition tests that simulate complete node isolation and verify traffic redistribution. Finally, conduct resource exhaustion tests to validate behavior under degraded performance. Each test should begin with a single node and gradually expand to multiple nodes. Process termination testing simulates sudden node failure and validates immediate failover mechanisms. Network isolation tests verify traffic redistribution when nodes become unreachable. Resource constraint testing simulates degraded performance through CPU and memory pressure, validating graceful degradation patterns.
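A minimal sketch of a process termination test is shown below, assuming your model server runs on EC2 instances managed by AWS Systems Manager; the instance ID and process name are placeholders, and a managed fault-injection tool such as AWS FIS is generally preferable for production experiments.

```python
# Sketch: terminate the inference process on a single node to test failover.
# The instance ID and "model-server" process name are placeholders.
import boto3

ssm = boto3.client("ssm")

response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],           # start with a single node
    DocumentName="AWS-RunShellScript",
    Comment="Chaos experiment: kill the model server process",
    Parameters={"commands": ["sudo pkill -f model-server || true"]},
)
print(response["Command"]["CommandId"])            # track the command for audit
```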
Hypothesis: If we inject a 10% failure rate into a randomly selected Availability Zone (AZ) within our VPC for 10 minutes, then: Expected behavior includes increased error rates and latency while maintaining service availability. During steady state, maintain less than a 1% error rate and p99 latency under 200ms. During degradation, expect up to a 10% error rate and p99 latency under 500ms. Expected RTO is 5 minutes with full recovery to steady-state metrics.
Partial AWS Availability Zone (AZ) impairments create complex failure modes in distributed systems. Prevent service disruption through infrastructure design that handles degraded AZ performance while maintaining strict service level objectives. Deploy workloads across multiple AZs with sufficient capacity to handle full AZ evacuation. Design applications for Availability Zone Independence (AZI) by removing cross-AZ dependencies and maintaining independent data caches. Implement adaptive routing with health checking that responds to performance degradation. This enables graceful handling of both partial failures and full AZ evacuation scenarios.
Test AZ degradation through controlled experiments starting with partial failure injection to validate basic routing and failover mechanisms. Progress to increased error rates and latency tests that verify traffic redistribution patterns. Finally, conduct complete AZ isolation tests to validate evacuation procedures. Each test should begin with minimal impact and gradually increase failure rates. Network throttling tests validate performance degradation handling. Error injection tests verify application resilience and routing behavior. Complete isolation tests simulate AZ power loss scenarios, validating full evacuation capabilities. For complete failure scenarios, the AWS Fault Injection Service AZ Availability: Power Interruption scenario injects a complete power loss to an AZ, testing your application's ability to detect the failure and redistribute workloads to healthy AZs.
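If you have created an FIS experiment template from that scenario, a minimal sketch for starting it programmatically might look like the following; the template ID and tag are placeholders, and the template itself must be created first (for example, in the FIS console).

```python
# Sketch: start a pre-created AWS FIS experiment template, such as one built
# from the "AZ Availability: Power Interruption" scenario.
import uuid

import boto3

fis = boto3.client("fis")

experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),                # idempotency token
    experimentTemplateId="EXT123456789example",   # placeholder template ID
    tags={"purpose": "genai-az-resilience-test"},
)
print(experiment["experiment"]["id"], experiment["experiment"]["state"]["status"])
```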
Database
In analyzing the database-focused hypotheses from workshop participants, three distinct failure patterns emerged, with particular attention to vector and Aurora databases supporting GenAI applications. Participants identified the need to test complete database access loss, performance degradation, and connection pool failures. The collected hypotheses emphasized the importance of maintaining data availability and consistent performance during failures. Most participants suggested controlled tests lasting between 30 seconds and 10 minutes, with clearly defined recovery time objectives and success criteria.
Hypothesis: If we disable vector database access for 10 minutes, then: Steady state maintains less than 1% request failures and p99 latency under 200ms with a 95% cache hit rate. Under experiment conditions, expect graceful request failures with proper error messages, while the cache continues serving eligible requests. Expected RTO is 60 seconds with full recovery to steady-state metrics. Monitoring systems detect and alert within 60 seconds of access loss.
Database failures disrupt application functionality and data access patterns. Prevent disruptions through infrastructure design that prioritizes redundancy and fast recovery. Deploy managed database services with Multi-AZ configuration and read replicas across AZs and even across AWS Regions. Implement automated backups with defined RPO/RTO targets and enable Point-in-Time Recovery when possible. Maintain connection pooling with proper sizing and implement retry logic with exponential backoff and jitter. This ensures data availability and consistent performance.
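A minimal sketch of retry logic with exponential backoff and full jitter follows; the query_vector_db() call in the usage comment is a placeholder for your own data-access function.

```python
# Sketch: retry a database call with exponential backoff and full jitter.
import random
import time


def with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # narrow to transient error types in real code
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))


# Example usage (placeholder call):
# results = with_backoff(lambda: query_vector_db(embedding, top_k=5))
```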
Test database failures through controlled experiments starting with failover testing to validate recovery mechanisms and measure database recovery times. Progress to network disruption tests that simulate complete access loss and verify application behavior. Finally, conduct backup restoration tests to validate data recovery procedures. Each test validates both the system's resilience and recovery mechanisms across all infrastructure layers. Common testing methods include security group modifications, network traffic blocking, and controlled service restarts.
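As one hedged sketch of the security group approach, the following temporarily revokes the database ingress rule and then restores it; the security group ID, port, and CIDR are placeholders, and this should only be run against a test environment.

```python
# Sketch: temporarily revoke the database ingress rule to simulate complete
# access loss, then restore it.
import time

import boto3

ec2 = boto3.client("ec2")

DB_SG = "sg-0123456789abcdef0"   # placeholder: security group on the database
RULE = {
    "IpProtocol": "tcp",
    "FromPort": 5432,
    "ToPort": 5432,
    "IpRanges": [{"CidrIp": "10.0.0.0/16"}],   # placeholder application CIDR
}

ec2.revoke_security_group_ingress(GroupId=DB_SG, IpPermissions=[RULE])
try:
    time.sleep(600)   # observe application behavior for 10 minutes
finally:
    # Always restore access, even if the observation step is interrupted.
    ec2.authorize_security_group_ingress(GroupId=DB_SG, IpPermissions=[RULE])
```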
For complete database failures, implement automated failover to read replicas (single-region) and enable rapid recovery procedures. This ensures minimal disruption to application functionality while maintaining data consistency and availability.
Hypothesis: If we reduce Aurora vector DB read throughput by 50% for 10 minutes, then: Steady state maintains less than a 0.1% query error rate and p99 latency under 100ms with full throughput capacity. Under experiment conditions, expect error rates below 5% and p99 latency not exceeding 500ms while maintaining 90% of normal throughput. Expected RTO is 5 minutes with full recovery to steady-state metrics. Monitoring systems detect and alert within 45 seconds of degradation.
Database performance degradation impacts system responsiveness and user experience. Prevent disruptions through infrastructure design that prioritizes consistent performance under load. Deploy read replicas for horizontal scaling and implement query optimization with proper indexing. Maintain caching strategies and implement request prioritization. Enable load shedding capabilities and feature flags for non-critical operations. This ensures stable performance during degraded states.
Test performance degradation through controlled experiments starting with CPU load testing to validate system behavior under resource constraints. Progress to network latency tests that simulate degraded connectivity and verify application response. Finally, conduct resource contention tests across read replicas to validate load balancing effectiveness. Each test validates both performance degradation handling and automatic recovery mechanisms. Common testing methods include resource stress testing, network throttling, and replica load simulation.
Hypothesis: If we terminate 20% of active database connections for 30 seconds, then: Steady state maintains connection pool utilization below 75% with less than 0.1% query failures and p99 latency under 100ms. Under experiment conditions, expect success rates above 95% and p99 latency not exceeding 300ms while connection pools rebalance automatically. Expected RTO is 1 minute with full recovery to steady-state metrics. Monitoring systems detect and alert within 10 seconds of termination.
Database connection failures impact application stability and request processing. Prevent disruptions through infrastructure design that prioritizes connection management and recovery. Deploy proper connection pooling with optimal sizing and implement comprehensive health checks. Maintain connection monitoring and automated cleanup procedures. Enable connection reuse and recycling with proper timeout policies. This ensures stable database access patterns.
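One way to implement this is with SQLAlchemy's built-in pooling, sketched below; the connection URL and pool sizes are illustrative assumptions.

```python
# Sketch: connection pooling with health checks, recycling, and timeouts
# using SQLAlchemy. The connection URL and pool parameters are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app_user:***@aurora-endpoint:5432/appdb",  # placeholder
    pool_size=10,          # steady-state connections kept open
    max_overflow=5,        # extra connections allowed under burst load
    pool_timeout=5,        # seconds to wait for a free connection
    pool_recycle=300,      # periodically recycle long-lived connections
    pool_pre_ping=True,    # health-check each connection before reuse
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))   # simple health-check query
```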
Test connection failures through controlled experiments starting with random connection termination to validate retry mechanisms and pool recovery. Progress to service restart tests that verify reconnection behavior and application resilience. Finally, conduct connection pool exhaustion tests to validate queuing and error handling. Each test validates both connection management and recovery procedures. Common testing methods include connection termination, service restarts, and load testing.
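A minimal sketch of random connection termination on a PostgreSQL-compatible database such as Aurora PostgreSQL follows; the connection settings, the application_name filter, and the LIMIT value are placeholders you would adjust to target roughly 20% of your pool in a test environment.

```python
# Sketch: terminate a random sample of active connections to test retry
# behavior and connection pool recovery.
import psycopg2

conn = psycopg2.connect(
    host="aurora-endpoint", dbname="appdb", user="chaos_admin", password="***"
)
conn.autocommit = True
with conn.cursor() as cur:
    # pg_terminate_backend() closes the server process for each selected
    # connection; adjust LIMIT to about 20 percent of your pool size.
    cur.execute(
        """
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE application_name = 'genai-app'
          AND pid <> pg_backend_pid()
        ORDER BY random()
        LIMIT 10
        """
    )
conn.close()
```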
Conclusion
Chaos engineering reveals critical insights about GenAI system behavior under stress conditions. Teams should approach testing methodically, starting with controlled, small-scale experiments before progressing to complex failure scenarios. This structured approach serves two purposes: validating system resilience and verifying monitoring effectiveness.
Through regular testing, teams identify gaps in recovery mechanisms and validate observability tooling. These experiments lead to refined alert thresholds and improved detection accuracy, building operational confidence through repeated validation of recovery procedures.
As organizations deploy more GenAI applications to production, resilience testing becomes essential. Our workshop findings highlight that system performance, response latency, and service availability remain top concerns for teams. Using tools like AWS Fault Injection Service, teams can run controlled experiments that test both system resilience and monitoring capabilities. This systematic approach ensures GenAI applications deliver consistent, reliable performance in production environments.
Randy DeFauw and Jennifer Moran
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.