Four Things Everyone Should Know About Resilience

Dive into these essential concepts for building resilient applications in the cloud on AWS.

Seth Eliot
Amazon Employee
Published Sep 7, 2023
Last Modified Mar 19, 2024
Resilience is a big topic. I cannot possibly cover all of it here. But I can get you started on your journey to building resilient applications in the cloud by sharing these four concepts:
  • What is resilience?
  • How to prevent faults from becoming failures
  • How to think about resilience
  • How the cloud helps you build resilient applications
There is a wealth of information out there on resilience, but it can sometimes be hard to find. Therefore, while I will share with you the essentials of what you need to know about each concept, I will also give you the links to the resources you need to go deeper on each topic.

1. What is Resilience?

Resilience is the ability of an application to resist or recover from certain types of faults or load spikes, and to remain functional from the customer perspective.
Anyone who has built or operated an application in production (whether in an on-prem data center or in the cloud) knows it's a mess out there. In any complex system, for example a distributed software application, there are faults, unanticipated inputs and user behaviors, and various instabilities. With the reality that this is the environment in which your applications must operate, it is the best practices of resilience that will enable your applications to weather the storm. By recovering quickly, or avoiding impact altogether, your applications can continue to reliably serve the needs of your users.

2. How to Prevent Faults From Becoming Failures

A failure is the inability of your application to perform its required function.
A fault is a condition that has the potential to cause a failure in your application.
Faults can result in failures. But through the application of resilience best practices, faults can be mitigated via automated or operational means to avoid failures.

Types of Faults

Code and configuration

Mistakes in code or configuration can be introduced whenever code or configuration is deployed. When teams do not address their tech debt, the risk of such faults rises.

Infrastructure

Servers, storage devices, and networks are all susceptible to failure. At small scale (e.g. individual hard drives), this happens all the time, but in the cloud, such small-scale failures are often abstracted away so that your application never even "sees" the fault. Larger-scale issues, such as failures of entire racks or widespread data center problems (like fires or power outages), can also occur. Even at this larger scale, your applications can be designed to be resilient to these events (see Resilience in the cloud below). Huge-scale events that impact all data centers in a geographic location are also possible, such as a meteor strike that flattens multiple data centers. However, events of that scale are highly unlikely.

Data and State

Corruption or deletion of data can occur accidentally or maliciously. Errors in code or deployment can also cause data to be corrupted or deleted.

Dependencies

Your application depends on other systems such as AWS services and third-party dependencies, as well as DNS resolution and the internet itself. Faults, or changes in behavior, of these can impact your application.

Categories of Failure

When a fault occurs, it can cause the application to become unavailable. The Resilience Analysis Framework identifies five common categories of failure, abbreviated as SEEMS after the first letter of each category listed below:

Shared Fate

This is when a fault in one component or location cascades to other components or locations. For example, a request to a server triggers a bug that causes the server to fail. That failed request is retried by the client, impacting another server in the fleet, and continuing until all servers have failed.
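One common way to reduce the risk of such retry storms is to cap retries on the client and add exponential backoff with jitter. The following is a minimal Python sketch of that pattern; the call_server function and its failure behavior are hypothetical stand-ins for your own client code.

```python
import random
import time


def call_with_backoff(call_server, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Call a (hypothetical) server function with capped retries,
    exponential backoff, and full jitter to avoid synchronized retry storms."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_server()
        except Exception:
            if attempt == max_attempts:
                raise  # give up instead of hammering the remaining servers
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Capping attempts limits how much extra load a failing request adds to the fleet, and jitter spreads the retries out so clients do not all hit the next server at the same moment.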
Apply these best practices to protect your application from this failure mode:

Excessive Load

Without sufficient capacity to handle high demand, users will face slow and failed responses as resources are exhausted. For example, if AWS Lambda concurrency is limited to 100 in-flight requests, and requests take 1 second each to process, then when traffic exceeds 100 requests per second, requests will be throttled, which can manifest as unavailability to the end user of your application. The required concurrency can be calculated like this:
Required Lambda Concurrency = Function duration × Request rate
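As a quick illustration of this formula, here is a small Python sketch. The 1-second duration and 100-request limit come from the example above; the spike to 120 requests per second is a hypothetical value added for illustration.

```python
def required_lambda_concurrency(function_duration_s: float, request_rate_per_s: float) -> float:
    """Required Lambda Concurrency = Function duration x Request rate."""
    return function_duration_s * request_rate_per_s


# 1-second invocations during a hypothetical spike to 120 requests per second.
needed = required_lambda_concurrency(function_duration_s=1.0, request_rate_per_s=120.0)
concurrency_limit = 100

if needed > concurrency_limit:
    print(f"Needs {needed:.0f} concurrent executions; requests beyond "
          f"{concurrency_limit} in flight will be throttled")
```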
Apply these best practices to protect your application from this failure mode:

Excessive Latency

This is when system processing or network traffic latency for your application takes longer than allowed by the business requirements. For example, failures in a dependency might require multiple retries by your application when calling that dependency, which can cause perceived slowdowns to the application end user.
Apply these best practices to protect your application from this failure mode:

Misconfiguration and Bugs

As a direct result of code and configuration faults, bugs can cause application slowdowns, loss of availability, or even incorrect execution. For example, configuring a client with no timeout can leave it hanging indefinitely if there is a transient error calling the database. This can manifest as a lockup to the end user.
This was a real bug I had to fix here for my chaos engineering lab that would cause processes to lock up when the database failed over from primary to standby.
You can also have misconfigurations other than code bugs, such as network access control lists that block traffic they should not, or choosing insufficient memory or CPU for your Lambda function.
Apply these best practices to protect your application from this failure mode:
  • Design for operations: Adopt approaches that improve the flow of changes into production and that help refactoring, fast feedback on quality, and bug fixing (Well-Architected best practices).
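For example, the hanging-client problem described above can be avoided by always setting explicit timeouts. Here is a minimal sketch assuming a MySQL-compatible RDS database and the PyMySQL client; the endpoint, credentials, and timeout values are placeholders, and your driver will differ.

```python
import pymysql

# Explicit timeouts (in seconds) so a transient database error or a failover
# results in a fast, retryable exception instead of a hung process.
connection = pymysql.connect(
    host="mydb.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # hypothetical endpoint
    user="app_user",
    password="replace-with-a-secret",
    database="orders",
    connect_timeout=5,
    read_timeout=10,
    write_timeout=10,
)
```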

Single Points of Failure (SPOF)

This is when a failure in a single component disrupts the application due to lack of redundancy of that component. For example, if requests require a call to the relational database to succeed, and that database fails, then all requests fail unless there is a standby database that the application can fail over to.
Apply these best practices to protect your application from this failure mode:

3. How to Think About Resilience

Figure 1 illustrates a mental model for resilience.
Figure 1. A mental model for resilience
To understand resilience, it is helpful to keep these three aspects of resilience in mind, and how they relate to each other.

High Availability (HA)

High availability is about resilience against the smaller, more frequent faults an application will encounter in production. These include component failures, transient errors, latency, and load spikes.
Availability is one of the primary ways we can quantitatively measure resilience. We define availability, A, as the percentage of time that a workload is available for use. It is the ratio of its "uptime" (being available) to the total time being measured (the "uptime" plus the "downtime"):
Availability = Uptime / (Uptime + Downtime)
The result of this calculation is often called "the nines" of availability, where, for example, a value of 99.99% is called "four nines of availability".
Mean time to repair (or recovery) (MTTR) and mean time between failures (MTBF) are other measures, and can be related to availability (A) via the equation below.
  • MTTR: Average time it takes to repair the application, measured from the start of an outage to when the application functionality is restored.
  • MTBF: Average time that the application is operating, from the time it is restored until the next failure.
Availability = MTBF / (MTBF + MTTR)
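These relationships are simple enough to check with a few lines of Python. This sketch computes availability from MTBF and MTTR, and the yearly downtime budget implied by a given number of nines; the MTBF and MTTR values are examples only.

```python
def availability_from_mtbf_mttr(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(availability: float) -> float:
    """Downtime budget per year implied by an availability target."""
    return (1 - availability) * 365 * 24 * 60


# Example: an MTBF of 1,000 hours with an MTTR of 1 hour gives roughly "three nines".
print(f"{availability_from_mtbf_mttr(1000, 1):.4%}")      # ~99.90%
# "Four nines" allows roughly 52.6 minutes of downtime per year.
print(f"{yearly_downtime_minutes(0.9999):.1f} minutes")   # ~52.6
```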

Disaster Recovery (DR)

Disaster recovery is about resilience against the large-scale, much less frequent faults an application may encounter. Such disaster events fall into these categories: natural disasters, major technical issues, or deletion or corruption of data or resources (usually due to human action). These are faults with a large scope of impact on your application, and recovery therefore usually involves a recovery site. The primary site may be unable to support your availability requirements, so you must fail over to a recovery site where your application can run on recovery versions of your infrastructure, data, and code. Or, in some cases, you can continue to run in the primary site but must restore data from the recovery site.
While disaster events will also affect availability, the primary measures of resilience during disaster recovery are how long the application is down and how much data was lost. Therefore a DR strategy must be selected based on the recovery time objective (RTO) and recovery point objective (RPO). Figure 2 shows a timeline where a disaster has occurred. The RPO is the maximum acceptable amount of time since the last data recovery point and measures potential data loss, while the RTO is the maximum acceptable delay between the interruption of service and the restoration of service to end users. To learn more about these, read about Business Continuity Planning (BCP) here.
Figure 2. Recovery objectives: RPO (Recovery Point Objective) and RTO (Recovery Time Objective) defined with respect to a disaster event
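As a rough illustration: if you back up data on a fixed schedule, your worst-case data loss is about one backup interval, and your recovery time is roughly the sum of your recovery steps. The sketch below compares those estimates against objectives; the step names, durations, and objectives are all hypothetical.

```python
# Hypothetical DR strategy: hourly backups, recovery by provisioning
# infrastructure, restoring data, and redirecting traffic.
backup_interval_minutes = 60
recovery_steps_minutes = {
    "provision infrastructure": 30,
    "restore data from backup": 45,
    "redirect traffic (DNS)": 5,
}

worst_case_rpo = backup_interval_minutes        # data written since the last backup is lost
estimated_rto = sum(recovery_steps_minutes.values())

rpo_objective, rto_objective = 15, 60           # hypothetical business objectives
print(f"Worst-case RPO {worst_case_rpo} min vs objective {rpo_objective} min")
print(f"Estimated RTO {estimated_rto} min vs objective {rto_objective} min")
# Here both objectives are missed, so a different DR strategy (for example,
# continuous replication to a warm standby site) would be needed.
```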
For more information on DR see:

Continuous Resilience

Continuous resilience spans both HA and DR. It is about anticipating, monitoring, responding to, and continuously learning from failure, as described in the blog post, Towards continuous resilience. The blog covers all these in detail. Here I only cover some highlights of each one:
Anticipating
This includes understanding the faults and failure modes above, and implementing resilience best practices to avoid or mitigate them.
Monitoring
Key here is observability, which comprises metrics, logs, system events, and traces. You use and explore this information to determine the nature of an issue and to facilitate action. As you build applications in the cloud, be aware of any gaps in your observability strategy.
Responding
Response to failures should be based on the criticality of the application. Many responses can be automated to decrease downtime and improve consistency of recovery.
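For example, many automated responses start with an alarm on a key metric. The following is a minimal boto3 sketch that alarms on elevated 5XX errors from a load balancer and notifies an SNS topic, which could in turn trigger an automated runbook; the metric dimensions, thresholds, and ARN are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when the (hypothetical) load balancer returns too many 5XX errors,
# and notify an SNS topic that drives the automated or human response.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/orders-alb/1234567890abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:resilience-response"],
)
```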
Continuously Learning
This includes chaos engineering, which uses the scientific method. You pose a hypothesis about the impact of a fault or load spike, and then run an experiment to test that hypothesis, thereby gaining greater understanding of how your application will respond to similar events in production.
Figure 3. Chaos engineering experimentation is a continuous cycle of learning
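In code form, an experiment follows that same loop: confirm steady state, inject a fault, measure, clean up, and learn. This is only a skeleton; the steady_state_ok, inject_fault, and remove_fault helpers are hypothetical placeholders for your own metrics checks and fault injection (for example, via AWS Fault Injection Service).

```python
def run_chaos_experiment(steady_state_ok, inject_fault, remove_fault):
    """Hypothesis: the application stays within its steady state while the fault is present."""
    assert steady_state_ok(), "Do not experiment on an already-unhealthy system"
    inject_fault()
    try:
        hypothesis_holds = steady_state_ok()    # measure impact while the fault is active
    finally:
        remove_fault()                          # always clean up the injected fault
    assert steady_state_ok(), "System did not return to steady state after the experiment"
    return hypothesis_holds                     # False means you learned something to fix
```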

How This Model Helps You

The reason I call this a "mental model" is because it is not a perfect conceptualization, but provides a useful way to think about resilience.
It is less-than-perfect because HA and DR are not completely separate. If a disaster event causes downtime, then this also impacts the application availability. Downtime measured as "Recovery Time" (RT) for DR also counts against the availability ("nines") of HA.
However the model is still useful. A resilient architecture needs to design for both HA and DR, and these each have different strategies used to mitigate different types of faults, and are measured using different metrics and objectives.
To compare High Availability (HA) and Disaster Recovery (DR):
  • Fault frequency. HA: smaller, more frequent. DR: larger, less frequent.
  • Scope of faults. HA: component failures, transient errors, latency, and load spikes. DR: natural disasters, major technical issues, and deletion or corruption of data or resources that cannot be recovered from using HA strategies.
  • Objective measurements. HA: availability ("nines"). DR: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Strategies used. HA: mitigations are run in place and include replacement and failover of components, or adding capacity. DR: mitigations require a separate recovery site and include failing over to the recovery site or recovering data from the recovery site.

4. How Does the Cloud Help You Build Resilient Applications?

The cloud makes it easier to implement resilience best practices. The cloud provides multiple sites for deployment, helping with fault isolation. It also makes it easier to provision redundant resources to avoid single point of failure (SPOF). The cloud offers tools and automation to implement resilience best practices. Let’s look at some specific practices on the AWS Cloud.

Resilience in the Cloud on AWS

Fault Isolation Boundaries

To avoid shared fate failures, we implement fault isolation, which prevents faults from spreading. In the AWS cloud, you can make use of availability zones (AZs) and Regions as isolation boundaries. Each AWS Region is a location around the world where AWS runs multiple data centers, serving AWS services. There are over 30 Regions. Each Region is divided into three or more AZs. AZs are physically separate from each other, and are housed in different data center buildings. They are far enough apart from each other so they should not share fate (including disasters like fires or floods). But they are close enough to each other, and connected using high-bandwidth, low-latency networking, that an application can be deployed across them.
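For example, you can list the AZs available to your account in a Region with a short boto3 call, and then spread your resources (subnets, instances, and so on) across several of them; the Region name below is just an example.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Availability Zones available to this account in the Region;
# deploy across several of them to avoid shared fate.
zones = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)["AvailabilityZones"]
print([z["ZoneName"] for z in zones])
```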
These resources discuss Regions, AZs, and other fault isolation boundaries on AWS, and when to use them:

Multi-AZ, Single-Region

Once you understand your application criticality, which includes considering reputational, financial or regulatory risk, you can choose a resilience strategy.
Most resilience needs can be satisfied by deploying an application in a single AWS Region, using multiple AZs. Figure 4 shows such an architecture where all tiers of the architecture make use of three AZs for high availability. As part of disaster recovery, data stores are backed up or versioned so that they can be restored to a last known good state. In Figure 4 the backups are in-Region, and copying these to another Region would provide an extra layer of resilience.
Services like Amazon S3 and Amazon DynamoDB do not appear in multiple AZs in Figure 4; however, they too are multi-AZ. They are managed and serverless, and AWS takes responsibility for running them across multiple AZs. Resilience is always a shared responsibility between AWS and its customers: for some services AWS takes more of the responsibility, and for others more responsibility falls to the customer.
Figure 4. A resilient application architecture using multiple availability zones in a single AWS Region
Also note in Figure 4 that this architecture avoids a single point of failure (SPOF) on the RDS database. If the primary fails, then RDS will fail over to the standby, which will continue operation. And for read transactions, servers can continue to use the read replica even if the primary RDS instance fails. See Protect Your Data in Amazon RDS Against Disaster or Accidental Deletion to learn more.
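As an illustration of removing that SPOF, Multi-AZ is a single flag when you create the database. Here is a hedged boto3 sketch; the identifier, instance class, storage size, and credentials are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# MultiAZ=True gives the instance a synchronously replicated standby in another
# AZ that RDS fails over to automatically if the primary fails.
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",
    DBInstanceClass="db.r6g.large",
    Engine="postgres",
    AllocatedStorage=100,
    MasterUsername="app_admin",
    MasterUserPassword="replace-with-a-secret",  # use Secrets Manager in practice
    MultiAZ=True,
)
```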
For applications with high resilience needs, you can implement Advanced Multi-AZ Resilience Patterns. This includes Availability Zone Independence (AZI), also sometimes called Availability Zone Affinity. This architectural pattern isolates resources inside an Availability Zone and prevents interaction among resources in different Availability Zones except where absolutely required.

Multi-Region

For applications with the highest resilience needs, a multi-Region architecture can make sense. The reliability pillar best practice, Select the appropriate locations for your multi-location deployment, discusses when an application might require multi-Region, including considerations for DR, HA, proximity to users, and data residency.

Mitigations for Excessive Load

The cloud also makes it easier to accommodate excessive load. In AWS, you can add capacity, such as by launching more EC2 instances or raising the concurrency quota for AWS Lambda. The most resilient workloads ensure capacity is already available ahead of high traffic needs, using static stability. Scheduled auto scaling or predictive auto scaling can also be used to scale up ahead of high traffic. However, in the case of unanticipated load, you can also set up dynamic auto scaling, which scales up in response to traffic metrics you configure. The links I provided are for Amazon EC2; however, many AWS services can auto scale, such as Lambda, DynamoDB, Aurora, ECS, and EKS. And once again, since resilience is a shared responsibility, how you write your software to build in resilience patterns matters here. For the highest resilience, implementing a throttling mechanism or load shedding in your software will keep your application available while reactive auto scaling spins up more capacity.
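To make that last point concrete, here is a minimal sketch of load shedding with a token bucket in application code; the request handler is hypothetical, and the rate and burst values are placeholders you would tune for your workload.

```python
import time


class TokenBucket:
    """Simple token bucket: admit at most `rate` requests/s with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=100, burst=20)


def handle(request):
    if not bucket.allow():
        return {"status": 429, "body": "Too Many Requests"}   # shed load, stay responsive
    return {"status": 200, "body": "ok"}                      # normal (hypothetical) handling
```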

Conclusion

Your customers expect an always-on experience with your application. To deliver this, you must build with resilience in mind. With resilience best practices and concepts, your cloud-based application will be able to resist or recover from faults or load spikes and remain available. Using the mental model for resilience presented here, you can implement strategies for High Availability (HA) and Disaster Recovery (DR), while continuously learning in response to resilience challenges in the cloud.

Learn More

The AWS Well-Architected Framework documents the best practices for creating and operating applications in the cloud. The Reliability pillar and Operational Excellence pillar cover the best practices for resilience. The former focuses on architecting resilient workloads, while the latter focuses on operating them.
AWS services focused on resilience:
Other resources that will help you with your resilience journey:
More content on resilience in the cloud:

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
