Any Day Can Be Prime Day: How Amazon.com Search Uses Chaos Engineering to Handle Over 84K Requests Per Second
Discover how Amazon Search combines technology and culture to empower its builder teams, ensuring platform resilience through Chaos Engineering.
DevOps is an approach to solving problems collaboratively. It values teamwork and communication, fast feedback and iteration, and removing friction or waste through automation.
- Empowering teams: DevOps fosters a culture where teams are empowered to try new ideas and learn from both successes and failures.
- Ownership and responsibility: In DevOps, teams own the services they build and are responsible for ensuring the right outcomes. Empowerment is a prerequisite for this, because the team needs to understand how its services are used and be able to implement the changes it sees fit.
- Breaking down walls: The impulse behind "DevOps" is to break down the traditional barriers that often exist between development and operations teams. By promoting collaboration and shared goals, DevOps aims to eliminate silos and create a more streamlined and efficient workflow.
- Enabling teams to do more: This is where automation and tools can play a major role, but organizational structure and responsibilities are also crucial. A team that takes a hand-off from the development team and operates the service for them is not ideal (see “breaking down walls” above). A specialized team that works with development and reduces the undifferentiated heavy lifting of operating the service is better. Undifferentiated heavy lifting is all the hard work (“heavy lifting”) that is necessary to accomplish a task (say, deploy and operate a service) but does not vary appreciably from service to service (“undifferentiated”). If every service team has to do this work itself, the effort is wasted. Having one team create tools and processes that do much of this heavy lifting removes the burden from the service teams, which is liberating!
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. – Principles of Chaos Engineering
When the search load exceeds [some value] and errors and latency start to climb (specify which metrics and by how much), then activating the emergency lever to disable non-critical services will keep errors and latency within acceptable limits (define these), up to loads of [specified amount].
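A hypothesis in this form becomes easier to test once its placeholders are written down as explicit thresholds. The following is a minimal sketch of that idea, not Amazon Search's actual tooling: the names `Hypothesis`, `Sample`, and `violations` are hypothetical, and every number in the usage example is an illustrative assumption rather than a value from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Thresholds for one experiment (hypothetical structure)."""
    trigger_load_rps: float    # load above which the emergency lever is activated
    max_load_rps: float        # highest load the hypothesis claims to cover
    max_error_rate: float      # acceptable fraction of failed requests
    max_p99_latency_ms: float  # acceptable p99 latency in milliseconds


@dataclass(frozen=True)
class Sample:
    """One observation collected while the experiment runs."""
    load_rps: float
    error_rate: float
    p99_latency_ms: float
    lever_active: bool


def violations(h: Hypothesis, samples: list[Sample]) -> list[Sample]:
    """Return the samples that contradict the hypothesis.

    A sample is in scope only if the lever was active and the load was
    above the trigger and at or below the claimed maximum.
    """
    found = []
    for s in samples:
        in_scope = s.lever_active and h.trigger_load_rps < s.load_rps <= h.max_load_rps
        out_of_bounds = (
            s.error_rate > h.max_error_rate
            or s.p99_latency_ms > h.max_p99_latency_ms
        )
        if in_scope and out_of_bounds:
            found.append(s)
    return found


if __name__ == "__main__":
    # Illustrative numbers only; real thresholds come from the service's own targets.
    h = Hypothesis(trigger_load_rps=1000, max_load_rps=2000,
                   max_error_rate=0.01, max_p99_latency_ms=500)
    observed = [
        Sample(load_rps=1500, error_rate=0.005, p99_latency_ms=300, lever_active=True),
        Sample(load_rps=1800, error_rate=0.020, p99_latency_ms=300, lever_active=True),
    ]
    bad = violations(h, observed)
    print("hypothesis holds" if not bad else f"{len(bad)} violating sample(s)")
```

Keeping the thresholds in one immutable object means the experiment's pass or fail criteria are fixed before the experiment starts, which is the point of stating a hypothesis up front.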
- Learn about the Search Resilience team and their progression from running load tests in the production environment to adopting chaos engineering and conducting numerous large-scale experiments
- Read about three more examples of Amazon teams using DevOps to drive resilience
- Learn about Amazon Prime Video's journey to enable teams to use DevOps practices and Chaos Engineering
- With real-world examples of massive-scale production workloads from IMDb, Amazon Search, Amazon Selection and Catalog Systems, Amazon Warehouse Operations, and Amazon Transportation, this presentation shows how Amazon builds and runs cloud workloads at scale and how they reliably process millions of transactions per day
- This blog introduces you to Chaos Engineering for cloud-based applications