Four Things Everyone Should Know About Resilience

Dive into these essential concepts for building resilient applications in the cloud on AWS.

Seth Eliot
Amazon Employee
Published Sep 7, 2023
Last Modified Mar 19, 2024
Resilience is a big topic. I cannot possibly cover all of it here. But I can get you started on your journey to building resilient applications in the cloud by sharing these four concepts:
  • What is resilience?
  • How to prevent faults from becoming failures
  • How to think about resilience
  • How the cloud helps you build resilient applications
There is a wealth of information out there on resilience, but it can sometimes be hard to find. Therefore, while I will share with you the essentials of what you need to know about each concept, I will also give you the links to the resources you need to go deeper on each topic.

1. What is Resilience?

Resilience is the ability of an application to resist or recover from certain types of faults or load spikes, and to remain functional from the customer perspective.
Anyone who has built or operated an application in production (whether in an on-prem data center or in the cloud) knows it's a mess out there. In any complex system, for example a distributed software application, there are faults, unanticipated inputs and user behaviors, and various instabilities. With the reality that this is the environment in which your applications must operate, it is the best practices of resilience that will enable your applications to weather the storm. By recovering quickly, or avoiding impact altogether, your applications can continue to reliably serve the needs of your users.

2. How to Prevent Faults From Becoming Failures

A failure is the inability of your application to perform its required function.
A fault is a condition that has the potential to cause a failure in your application.
Faults can result in failures. But through the application of resilience best practices, faults can be mitigated via automated or operational means to avoid failures.

Types of Faults

Code and configuration

Mistakes in code or configuration can be introduced whenever code or configuration is deployed. When teams do not address their tech debt, the risk of such faults rises.

Infrastructure

Servers, storage devices, and networks are all susceptible to failure. At small scale (e.g. individual hard drives), this happens all the time, but in the cloud, such small-scale failures are often abstracted away so that your application never even "sees" the fault. Larger-scale issues, such as failures of entire racks or widespread data center problems (like fires or power outages), can also occur. Even at this larger scale, your applications can be designed to be resilient to these events (see Resilience in the cloud below). Huge-scale events that impact all data centers in a geographic location are also possible, such as a meteor strike that flattens multiple data centers. However, events of that scale are highly unlikely.

Data and State

Corruption or deletion of data can occur accidentally or maliciously. Errors in code or deployment can also cause data to be corrupted or deleted.

Dependencies

Your application depends on other systems such as AWS services and third-party dependencies, as well as DNS resolution and the internet itself. Faults, or changes in behavior, of these can impact your application.

Categories of Failure

When a fault occurs, it can cause the application to become unavailable. The Resilience Analysis Framework identifies five common categories of failure, abbreviated as SEEMS after the first letter of each category listed below:

Shared Fate

This is when a fault in one component or location cascades to other components or locations. For example, a request to a server triggers a bug that causes the server to fail. That failed request is retried by the client, impacting another server in the fleet, and continuing until all servers have failed.
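One common way to reduce the risk of such retry storms is to cap retries on the client and add exponential backoff with jitter. The following is a minimal Python sketch of that pattern; the call_server function and its failure behavior are hypothetical stand-ins for your own client code.

```python
import random
import time


def call_with_backoff(call_server, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Call a (hypothetical) server function with capped retries,
    exponential backoff, and full jitter to avoid synchronized retry storms."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_server()
        except Exception:
            if attempt == max_attempts:
                raise  # give up instead of hammering the remaining servers
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Capping attempts limits how much extra load a failing request adds to the fleet, and jitter spreads the retries out so clients do not all hit the next server at the same moment.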
Apply these best practices to protect your application from this failure mode:

Excessive Load

Without sufficient capacity to handle high demand, users will face slow and failed responses as resources are exhausted. For example, if AWS Lambda concurrency is limited to 100 in-flight requests, and requests take 1 second each to process, then when traffic exceeds 100 requests per second, requests will be throttled, which can manifest as unavailability to the end user of your application. The required concurrency can be calculated like this:
Required Lambda Concurrency = Function duration × Request rate
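As a quick illustration of this formula, here is a small Python sketch. The 1-second duration and 100-request limit come from the example above; the spike to 120 requests per second is a hypothetical value added for illustration.

```python
def required_lambda_concurrency(function_duration_s: float, request_rate_per_s: float) -> float:
    """Required Lambda Concurrency = Function duration x Request rate."""
    return function_duration_s * request_rate_per_s


# 1-second invocations during a hypothetical spike to 120 requests per second.
needed = required_lambda_concurrency(function_duration_s=1.0, request_rate_per_s=120.0)
concurrency_limit = 100

if needed > concurrency_limit:
    print(f"Needs {needed:.0f} concurrent executions; requests beyond "
          f"{concurrency_limit} in flight will be throttled")
```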
Apply these best practices to protect your application from this failure mode:

Excessive Latency

This is when system processing or network traffic latency for your application takes longer than allowed by the business requirements. For example, failures in a dependency might require multiple retries by your application when calling that dependency, which can cause perceived slowdowns to the application end user.
Apply these best practices to protect your application from this failure mode:

Misconfiguration and Bugs

As a direct result of code and configuration faults, bugs can cause application slowdowns, loss of availability, or even incorrect execution. For example, configuring a client with no timeout can leave it hanging indefinitely if there is a transient error calling the database. This can manifest as a lockup to the end user.
This was a real bug I had to fix here for my chaos engineering lab that would cause processes to lock up when the database failed over from primary to standby.
You can also have misconfigurations other than code bugs, such as network access control lists that block traffic they should not, or choosing insufficient memory or CPU for your Lambda function.
Apply these best practices to protect your application from this failure mode:
  • Design for operations: Adopt approaches that improve the flow of changes into production and that help refactoring, fast feedback on quality, and bug fixing (Well-Architected best practices).
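For example, the hanging-client problem described above can be avoided by always setting explicit timeouts. Here is a minimal sketch assuming a MySQL-compatible RDS database and the PyMySQL client; the endpoint, credentials, and timeout values are placeholders, and your driver will differ.

```python
import pymysql

# Explicit timeouts (in seconds) so a transient database error or a failover
# results in a fast, retryable exception instead of a hung process.
connection = pymysql.connect(
    host="mydb.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # hypothetical endpoint
    user="app_user",
    password="replace-with-a-secret",
    database="orders",
    connect_timeout=5,
    read_timeout=10,
    write_timeout=10,
)
```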

Single Points of Failure (SPOF)

This is when a failure in a single component disrupts the application due to lack of redundancy of that component. For example, if requests require a call to the relational database to succeed, and that database fails, then all requests fail unless there is a standby database that the application can fail over to.
Apply these best practices to protect your application from this failure mode:

3. How to Think About Resilience

Figure 1 illustrates a mental model for resilience.
Figure 1. A mental model for resilience
To understand resilience, it is helpful to keep these three aspects of resilience in mind, and how they relate to each other.

High Availability (HA)

High availability is about resilience against the smaller, more frequent faults an application will encounter in production. These include component failures, transient errors, latency, and load spikes.
Availability is one of the primary ways we can quantitatively measure resilience. We define availability, A, as the percentage of time that a workload is available for use. It is the ratio of its "uptime" (being available) to the total time being measured (the "uptime" plus the "downtime"):
Availability = Uptime / (Uptime + Downtime)
The result of this calculation is often called "the nines" of availability, where, for example, a value of 99.99% is called "four nines of availability".
Mean time to repair (or recovery) (MTTR) and mean time between failures (MTBF) are other measures, and can be related to availability (A) via the equation below.
  • MTTR: Average time it takes to repair the application, measured from the start of an outage to when the application functionality is restored.
  • MTBF: Average time that the application is operating, from the time it is restored until the next failure.
Availability = MTBF / (MTBF + MTTR)
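These relationships are simple enough to check with a few lines of Python. This sketch computes availability from MTBF and MTTR, and the yearly downtime budget implied by a given number of nines; the MTBF and MTTR values are examples only.

```python
def availability_from_mtbf_mttr(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(availability: float) -> float:
    """Downtime budget per year implied by an availability target."""
    return (1 - availability) * 365 * 24 * 60


# Example: an MTBF of 1,000 hours with an MTTR of 1 hour gives roughly "three nines".
print(f"{availability_from_mtbf_mttr(1000, 1):.4%}")      # ~99.90%
# "Four nines" allows roughly 52.6 minutes of downtime per year.
print(f"{yearly_downtime_minutes(0.9999):.1f} minutes")   # ~52.6
```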

Disaster Recovery (DR)

Disaster recovery is about resilience against the large-scale, much less frequent faults an application may encounter. Such disaster events fall into these categories: natural disasters, major technical issues, or deletion or corruption of data or resources (usually due to human action). These are faults with a large scope of impact on your application, and recovery therefore usually involves a recovery site. The primary site may be unable to support your availability requirements, so you must fail over to a recovery site where your application can run on recovery versions of your infrastructure, data, and code. Or, in some cases, you can continue to run in the primary site but must restore data from the recovery site.
While disaster events will also affect availability, the primary measures of resilience during disaster recovery are how long the application is down and how much data was lost. Therefore a DR strategy must be selected based on the recovery time objective (RTO) and recovery point objective (RPO). Figure 2 shows a timeline where a disaster has occurred. The RPO is the maximum acceptable amount of time since the last data recovery point and measures potential data loss, while the RTO is the maximum acceptable delay between the interruption of service and the restoration of service to end users. To learn more about these, read about Business Continuity Planning (BCP) here.
Figure 2. Recovery objectives: RPO (Recovery Point Objective) and RTO (Recovery Time Objective) defined with respect to a disaster event
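As a rough illustration: if you back up data on a fixed schedule, your worst-case data loss is about one backup interval, and your recovery time is roughly the sum of your recovery steps. The sketch below compares those estimates against objectives; the step names, durations, and objectives are all hypothetical.

```python
# Hypothetical DR strategy: hourly backups, recovery by provisioning
# infrastructure, restoring data, and redirecting traffic.
backup_interval_minutes = 60
recovery_steps_minutes = {
    "provision infrastructure": 30,
    "restore data from backup": 45,
    "redirect traffic (DNS)": 5,
}

worst_case_rpo = backup_interval_minutes        # data written since the last backup is lost
estimated_rto = sum(recovery_steps_minutes.values())

rpo_objective, rto_objective = 15, 60           # hypothetical business objectives
print(f"Worst-case RPO {worst_case_rpo} min vs objective {rpo_objective} min")
print(f"Estimated RTO {estimated_rto} min vs objective {rto_objective} min")
# Here both objectives are missed, so a different DR strategy (for example,
# continuous replication to a warm standby site) would be needed.
```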
For more information on DR see:

Continuous Resilience

Continuous resilience spans both HA and DR. It is about anticipating, monitoring, responding to, and continuously learning from failure, as described in the blog post, Towards continuous resilience. The blog covers all these in detail. Here I only cover some highlights of each one:
Anticipating
This includes understanding the faults and failure modes above, and implementing resilience best practices to avoid or mitigate them.
Monitoring
Key here is observability, which comprises metrics, logs, system events, and traces. You use and explore this information to determine the nature of an issue and to facilitate action. As you build applications in the cloud, be aware of any gaps in your observability strategy.
Responding
Response to failures should be based on the criticality of the application. Many responses can be automated to decrease downtime and improve consistency of recovery.
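For example, many automated responses start with an alarm on a key metric. The following is a minimal boto3 sketch that alarms on elevated 5XX errors from a load balancer and notifies an SNS topic, which could in turn trigger an automated runbook; the metric dimensions, thresholds, and ARN are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when the (hypothetical) load balancer returns too many 5XX errors,
# and notify an SNS topic that drives the automated or human response.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/orders-alb/1234567890abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:resilience-response"],
)
```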
Continuously Learning
This includes chaos engineering, which uses the scientific method. You pose a hypothesis about the impact of a fault or load spike, and then run an experiment to test that hypothesis, thereby gaining greater understanding of how your application will respond to similar events in production.
Figure 3. Chaos engineering experimentation is a continuous cycle of learning
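In code form, an experiment follows that same loop: confirm steady state, inject a fault, measure, clean up, and learn. This is only a skeleton; the steady_state_ok, inject_fault, and remove_fault helpers are hypothetical placeholders for your own metrics checks and fault injection (for example, via AWS Fault Injection Service).

```python
def run_chaos_experiment(steady_state_ok, inject_fault, remove_fault):
    """Hypothesis: the application stays within its steady state while the fault is present."""
    assert steady_state_ok(), "Do not experiment on an already-unhealthy system"
    inject_fault()
    try:
        hypothesis_holds = steady_state_ok()    # measure impact while the fault is active
    finally:
        remove_fault()                          # always clean up the injected fault
    assert steady_state_ok(), "System did not return to steady state after the experiment"
    return hypothesis_holds                     # False means you learned something to fix
```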

How This Model Helps You

The reason I call this a "mental model" is because it is not a perfect conceptualization, but provides a useful way to think about resilience.
It is less-than-perfect because HA and DR are not completely separate. If a disaster event causes downtime, then this also impacts the application availability. Downtime measured as "Recovery Time" (RT) for DR also counts against the availability ("nines") of HA.
However the model is still useful. A resilient architecture needs to design for both HA and DR, and these each have different strategies used to mitigate different types of faults, and are measured using different metrics and objectives.
To compare High Availability (HA) and Disaster Recovery (DR):
  • Fault frequency. HA: smaller, more frequent. DR: larger, less frequent.
  • Scope of faults. HA: component failures, transient errors, latency, and load spikes. DR: natural disasters, major technical issues, and deletion or corruption of data or resources that cannot be recovered from using HA strategies.
  • Objective measurements. HA: availability ("nines"). DR: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Strategies used. HA: mitigations are run in place and include replacement and failover of components, or adding capacity. DR: mitigations require a separate recovery site and include failing over to the recovery site or recovering data from the recovery site.

4. How Does the Cloud Help You Build Resilient Applications?

The cloud makes it easier to implement resilience best practices. The cloud provides multiple sites for deployment, helping with fault isolation. It also makes it easier to provision redundant resources to avoid single point of failure (SPOF). The cloud offers tools and automation to implement resilience best practices. Let’s look at some specific practices on the AWS Cloud.

Resilience in the Cloud on AWS

Fault Isolation Boundaries

To avoid shared fate failures, we implement fault isolation, which prevents faults from spreading. In the AWS cloud, you can make use of availability zones (AZs) and Regions as isolation boundaries. Each AWS Region is a location around the world where AWS runs multiple data centers, serving AWS services. There are over 30 Regions. Each Region is divided into three or more AZs. AZs are physically separate from each other, and are housed in different data center buildings. They are far enough apart from each other so they should not share fate (including disasters like fires or floods). But they are close enough to each other, and connected using high-bandwidth, low-latency networking, that an application can be deployed across them.
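For example, you can list the AZs available to your account in a Region with a short boto3 call, and then spread your resources (subnets, instances, and so on) across several of them; the Region name below is just an example.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Availability Zones available to this account in the Region;
# deploy across several of them to avoid shared fate.
zones = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)["AvailabilityZones"]
print([z["ZoneName"] for z in zones])
```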
These resources discuss Regions, AZs, and other fault isolation boundaries on AWS, and when to use them:

Multi-AZ, Single-Region

Once you understand your application criticality, which includes considering reputational, financial or regulatory risk, you can choose a resilience strategy.
Most resilience needs can be satisfied by deploying an application in a single AWS Region, using multiple AZs. Figure 4 shows such an architecture where all tiers of the architecture make use of three AZs for high availability. As part of disaster recovery, data stores are backed up or versioned so that they can be restored to a last known good state. In Figure 4 the backups are in-Region, and copying these to another Region would provide an extra layer of resilience.
Services like Amazon S3 and Amazon DynamoDB do not appear in multiple AZs in Figure 4; however, they too are multi-AZ. They are managed and serverless, and AWS takes responsibility for running them across multiple AZs. Resilience is always a shared responsibility between AWS and its customers: for some services AWS takes more of the responsibility, and for others more responsibility falls to the customer.
Figure 4. A resilient application architecture using multiple availability zones in a single AWS Region
Also note in Figure 4 that this architecture avoids a single point of failure (SPOF) on the RDS database. If the primary fails, then RDS will fail over to the standby, which will continue operation. And for read transactions, servers can continue to use the read replica even if the primary RDS instance fails. See Protect Your Data in Amazon RDS Against Disaster or Accidental Deletion to learn more.
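As an illustration of removing that SPOF, Multi-AZ is a single flag when you create the database. Here is a hedged boto3 sketch; the identifier, instance class, storage size, and credentials are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# MultiAZ=True gives the instance a synchronously replicated standby in another
# AZ that RDS fails over to automatically if the primary fails.
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",
    DBInstanceClass="db.r6g.large",
    Engine="postgres",
    AllocatedStorage=100,
    MasterUsername="app_admin",
    MasterUserPassword="replace-with-a-secret",  # use Secrets Manager in practice
    MultiAZ=True,
)
```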
For applications with high resilience needs, you can implement Advanced Multi-AZ Resilience Patterns. This includes Availability Zone Independence (AZI), also sometimes called Availability Zone Affinity. This architectural pattern isolates resources inside an Availability Zone and prevents interaction among resources in different Availability Zones except where absolutely required.

Multi-Region

For applications with the highest resilience needs, a multi-Region architecture can make sense. The reliability pillar best practice, Select the appropriate locations for your multi-location deployment, discusses when an application might require multi-Region, including considerations for DR, HA, proximity to users, and data residency.

Mitigations for Excessive Load

The cloud also makes it easier to accommodate excessive load. In AWS, you can add capacity, such as by launching more EC2 instances or raising the concurrency quota for AWS Lambda. The most resilient workloads ensure capacity is already available ahead of high traffic needs, using static stability. Scheduled auto scaling or predictive auto scaling can also be used to scale up ahead of high traffic. However, in the case of unanticipated load, you can also set up dynamic auto scaling, which scales up in response to traffic metrics you configure. The links I provided are for Amazon EC2; however, many AWS services can auto scale, such as Lambda, DynamoDB, Aurora, ECS, and EKS. And once again, since resilience is a shared responsibility, how you write your software to build in resilience patterns matters here. For the highest resilience, implementing a throttling mechanism or load shedding in your software will keep your application available while reactive auto scaling spins up more capacity.
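To make that last point concrete, here is a minimal sketch of load shedding with a token bucket in application code; the request handler is hypothetical, and the rate and burst values are placeholders you would tune for your workload.

```python
import time


class TokenBucket:
    """Simple token bucket: admit at most `rate` requests/s with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=100, burst=20)


def handle(request):
    if not bucket.allow():
        return {"status": 429, "body": "Too Many Requests"}   # shed load, stay responsive
    return {"status": 200, "body": "ok"}                      # normal (hypothetical) handling
```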

Conclusion

Your customers expect an always-on experience with your application. To deliver this, you must build with resilience in mind. With resilience best practices and concepts, your cloud-based application will be able to resist or recover from faults or load spikes and remain available. Using the mental model for resilience presented here, you can implement strategies for High Availability (HA) and Disaster Recovery (DR), while continuously learning in response to resilience challenges in the cloud.

Learn More

The AWS Well-Architected Framework documents the best practices for creating and operating applications in the cloud. The Reliability pillar and Operational Excellence pillar cover the best practices for resilience. The former focuses on architecting resilient workloads, while the latter focuses on operating them.
AWS services focused on resilience:
Other resources that will help you with your resilience journey:
More content on resilience in the cloud:

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
