Twelve Resilience Sessions at re:Invent 2023 You Won't Want to Miss
There is a lot of great stuff to see and do at re:Invent! Here we pare it down to the 12 must-see sessions for those building resilient architectures on AWS.
The resilience kiosk: Get help from AWS resilience experts
The Resilience Lifecycle Framework — learn about this new roadmap to better resilience
Everyone wants to talk about disaster recovery (DR)
Keeping your application highly available
Chaos engineering is a must for your resilience strategy
Hear from AWS operational leaders
AWS services purpose-built for resilience
- ARC302 | Drive resilience with Amazon Route 53 Application Recovery Controller — Let’s start by getting hands-on. In this workshop, your mission is to implement the tools and processes to ensure a tic-tac-toe game is resilient to even the biggest adverse events. In our experience, folks will sometimes neglect to make their resilience strategy itself...resilient. This workshop shows you how to do that using Route 53 Application Recovery Controller, a highly available data-plane API you can use to fail over to your recovery site when you need it. We also like that this workshop covers both multi-Region and in-Region recovery options.
- ARC208 | Backup and disaster recovery strategies for increased resilience — This breakout session will give you a good understanding of disaster recovery basics. But then it takes you deeper, showing how you can use AWS Elastic Disaster Recovery to protect your applications whether hosted on AWS, on-prem, or even on other cloud providers. [video]
- ARC320 | Using business metrics to make failover decisions — This is a chalk talk, so most of the conversation will be driven by audience questions. While ARC302 shows you how to implement effective failover, this session tackles the sometimes harder problem of determining when to fail over, and how to use business metrics to confirm that the failover was successful.
- ARC301 | Advanced Multi-AZ resilience patterns: Mitigating gray failures — Sure, you already know to use multiple Availability Zones (AZs) as fault isolation boundaries to achieve high availability. But in this hands-on workshop you’ll learn how to take your multi-AZ strategy to the next level. Sometimes failures are intermittent, or not easily detectable by metrics and alarms that look broadly at your application — we call these gray failures. In this workshop, you’ll implement differential observability to detect gray failures, and then implement resilience strategies to protect against these failures. This is not your grandparents’ multi-AZ strategy.
- ARC309 | Build applications that recover from an Availability Zone impairment — This session and ARC301 make a great pair. In this breakout, you’ll learn about Amazon Route 53 Application Recovery Controller zonal shift. OK, that service name is a mouthful, but what it does is super-powerful — it gives you control over which AZs are in or out for your application (which ones are receiving traffic). Using the monitoring techniques covered in this session, you’ll be able to detect when an AZ needs to be taken out of service, learn how to take it out, and keep healthy AZs online to serve your customer traffic. [video]
- ARC306 | Reducing your area of impact and surviving difficult days — In this breakout session, you’ll learn about cell-based architectures and sharding. These are two ways you can structure your AWS resources (like compute, storage, and network) to improve resilience. These advanced techniques give you control over the fault isolation boundaries in your architecture, constraining faults to a small number of resources while the rest continue to serve requests from your customers. [video]
- ARC317 | Improve application resilience with AWS Fault Injection Simulator — There are lots of good tools out there for fault injection, but if you’re running applications on AWS, then AWS Fault Injection Simulator (FIS) offers a lot of great functionality, built right into AWS. This breakout session takes you on a deep dive into FIS and covers how implementing chaos engineering can supercharge your resilience. [video]
- ARC321 | Improve the resilience of AWS workloads using chaos engineering — If you want a quick overview of chaos engineering with a lot more time to ask questions, then check out this chalk talk session too.
- ARC303 | Navigate the storm: Unleashing controlled chaos for resilient systems — Get hands-on with chaos engineering in this workshop, and learn firsthand how to create safe and controlled chaos experiments that you can incorporate into your day-to-day operations.
- ARC314 | Anticipating failures with resilience modeling — Almost all materials on chaos engineering focus on the experiments themselves. We like this session because it moves the focus to what happens before we can even design an experiment. In this chalk talk, you’ll get the chance to ask questions about how to use resilience modeling to identify what’s missing from your resilience plans, and get a handle on the scenarios that can lead to “bad stuff” for your application’s resilience. Equipped with this knowledge, you can better determine what you should even be testing with your chaos experiments.
- ARC327 | 5 things you should know about resilience at scale — What happens when you get a senior principal engineer, a distinguished engineer, and a VP with combined experience of over 30 years operating AWS to give a talk on how to build mitigations for when things go wrong? Well, if you want to find out, then make sure to attend this can’t-miss breakout session. They will take you “beyond the 9s,” explaining how, particularly at AWS’s scale, 9s are important, but time-to-recovery and blast-radius containment are even more important. [video]
- ARC305 | Resilient architectures at scale: Real-life use cases from Amazon.com — I find I learn best by example, which is why I seek out different Amazon teams and services each year to share with you how they design for scale and resilience using AWS. This is the third year for this crowd pleaser — you can also check out previous years here: 2021 and 2022. [video]
- ARC201 | Monitoring resilient architectures with AWS Resilience Hub
- ARC208 | Backup and disaster recovery strategies for increased resilience (AWS Elastic Disaster Recovery) [video]
- ARC302 | Drive resilience with Amazon Route 53 Application Recovery Controller
- STG306 | Protect AWS resources with AWS Backup
- ARC317 | Improve application resilience with AWS Fault Injection Simulator [video]
- CON401 | Deep dive into Amazon ECS resilience and availability [video]
- STG318 | Deploying Amazon S3 in multiple Regions to support global applications
- STG319 | Beyond 11 9s of durability: Data protection with Amazon S3 [video]
- STG208 | Build resilient architectures with Amazon EBS
- STG344 | How to protect unstructured files to achieve data resiliency (Amazon FSx family and Amazon EFS)
- HYB310 | Building highly available and fault-tolerant edge applications (AWS Local Zones, AWS Outposts, and the AWS Snow Family)
- COM308 | Serverless data streaming: Amazon Kinesis Data Streams and AWS Lambda [video]
- DAT306 | Improve resilience of database workloads by using chaos engineering (Amazon RDS and Amazon Aurora)
- SVS323 | I didn’t know Amazon API Gateway did that [video]
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.