Everything Fails All the Time

The massive power outage that struck the northeastern United States and Canada on August 14, 2003, is a cautionary tale. A cascading series of events, triggered by overloaded transmission lines in Ohio, ultimately left over 50 million people without power. The incident shows how seemingly isolated failures within a complex system can quickly escalate into widespread service disruption.

Just as electrical engineers fortify power grids with redundancy and robust infrastructure, we can design fault-tolerant, high-performing cloud architectures by understanding how systems fail and applying the same system design methods.

This series of articles dives deep into the Reliability Block Diagram (RBD) method, exploring its importance and developing strategies for designing resilient systems.

Over the course of the series, we'll explore how the following system types are modeled in RBD:

Series systems
Parallel systems
Standby redundant systems
Load-sharing systems
Complex interconnected systems

Read the original article published on Medium.

Availability and Availability Service Level Agreements (SLAs)

Cloud providers and power companies use availability Service Level Agreements (SLAs) to guarantee a certain level of uptime to their customers. These SLAs are typically expressed as a percentage, such as 99.9% (i.e. “three nines”). A 99.9% SLA translates to an allowable downtime of just 0.1% over a specific period. The higher the percentage in the SLA, the more resilient the service is designed to be.

Each availability percentage translates directly into a maximum amount of downtime over the measurement period. Here's a quick breakdown:
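The breakdown can be reproduced with a short Python sketch. It is only an illustration: the function name `allowed_downtime_hours` is hypothetical, and the calculation assumes a 365-day year, whereas real SLAs usually define their own measurement period (often a monthly billing cycle).

```python
# Illustrative sketch: convert an availability percentage into allowable downtime.
# Assumption: the measurement period is a 365-day year (8,760 hours).
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted over the given period."""
    return (1 - availability_pct / 100) * period_hours


if __name__ == "__main__":
    # Two, three, four, and five nines.
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes/year)")
```

Under these assumptions, "three nines" allows roughly 8.76 hours of downtime per year, and each additional nine cuts that budget by a factor of ten.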

Calculating Availability

Let's look at how availability is calculated for different system configurations, starting with two simple but common types: series and parallel.

Series Systems

In a series configuration, all components must function for the system to work. Think of Christmas lights strung together. If one bulb burns out, the entire strand goes dark.

Assuming the components fail independently, the availability of a series system (A_s) is the product of the individual availabilities (A_i) of its components:

A_s = A_1 * A_2 * A_3 * … * A_n

For example, if three components in series each have an individual availability of 99% (0.99), the overall system availability (A_s) would be:

A_s = 0.99 * 0.99 * 0.99 = 0.9703 (approximately 97%)
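The series formula can be expressed as a small Python helper. This is a minimal sketch, assuming independent component failures; the function name `series_availability` is made up for illustration.

```python
from math import prod  # available in Python 3.8+


def series_availability(availabilities):
    """Availability of components in series: the product of all A_i.

    Assumes component failures are independent of one another.
    """
    return prod(availabilities)


if __name__ == "__main__":
    a_s = series_availability([0.99, 0.99, 0.99])
    print(round(a_s, 4))  # 0.9703
```

Notice that every component added in series can only lower the result, since each factor is at most 1.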

Example implementation on AWS: We can estimate the system availability (A_s) from the SLA that AWS publishes for each service.

A web server with a load balancer and a storage unit configured in series.

Parallel Systems

In a parallel configuration, the system remains functional as long as at least one component is operational. Imagine multiple traffic lanes leading to a city. Even if one lane is closed, traffic can flow through the others.

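The system fails only if every component fails at the same time. Assuming independent failures, the availability of a parallel system (A_p) is therefore one minus the product of the individual unavailabilities:

A_p = 1 - (1 - A_1) * (1 - A_2) * … * (1 - A_n)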

Using the same example of three components with 99% availability (0.99) each, the overall system availability (A_p) for a parallel configuration would be:

A_p = 1 - (1 - 0.99) * (1 - 0.99) * (1 - 0.99) = 1 - 0.000001 = 0.999999 (approximately 99.9999%)

As you can see, parallel configurations offer significantly higher overall availability compared to series configurations.
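A matching Python sketch for the parallel formula follows. As before, it assumes the components fail independently, and the name `parallel_availability` is hypothetical.

```python
from math import prod  # available in Python 3.8+


def parallel_availability(availabilities):
    """Availability of redundant components in parallel.

    The system is down only when every component is down, so we multiply
    the unavailabilities (1 - A_i) and subtract the result from 1.
    Assumes component failures are independent of one another.
    """
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    a_p = parallel_availability([0.99, 0.99, 0.99])
    print(f"{a_p:.6f}")
```

For three components at 0.99 each, the combined unavailability is 0.01 * 0.01 * 0.01 = 0.000001, so the function evaluates to 0.999999. In practice, redundant components often share dependencies (power, network, deployment pipeline), so the independence assumption makes this an upper bound rather than a guarantee.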

Example implementation on AWS: (Combining both series and parallel concepts)

A Multi-AZ load balancer forwards traffic to two web servers in different availability zones.

Step 1: Calculate the availability of each web server, modeled as an EC2 instance and its EBS volume in series.

A_a = A_ec2 * A_ebs = 0.99 * 0.99 = 0.9801
A_b = A_ec2 * A_ebs = 0.99 * 0.99 = 0.9801

Step 2: Calculate the availability of two series systems in parallel.

A_a//b = 1 - (1 - A_a) * (1 - A_b) = 1 - (1 - 0.9801) * (1 - 0.9801) = 0.9996

Step 3: Calculate the availability of the above, in series with the load balancer.

A_s = A_a//b * A_elb = 0.9996 * 0.9999 = 0.9995 (99.95%)
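The three steps can be chained in a few lines of Python. This is a sketch under the same assumptions as the worked example: the availability figures (0.99 for EC2 and EBS, 0.9999 for the load balancer) are the illustrative values used above, not actual AWS SLA numbers, and the helper names `series` and `parallel` are hypothetical.

```python
from math import prod  # available in Python 3.8+


def series(*availabilities):
    """All components must work: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities):
    """At least one component must work: 1 minus the product of unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


# Illustrative values from the worked example, not published SLA figures.
A_EC2 = 0.99
A_EBS = 0.99
A_ELB = 0.9999

# Step 1: each web server is an EC2 instance in series with its EBS volume.
a_a = series(A_EC2, A_EBS)
a_b = series(A_EC2, A_EBS)

# Step 2: the two web servers sit in parallel across availability zones.
a_ab = parallel(a_a, a_b)

# Step 3: the parallel pair is in series with the load balancer.
a_s = series(a_ab, A_ELB)

if __name__ == "__main__":
    print(f"A_a = {a_a:.4f}, A_a//b = {a_ab:.4f}, A_s = {a_s:.4f}")
```

Composing the two helpers this way mirrors how a reliability block diagram is reduced: collapse the innermost series or parallel group into a single block, then repeat outward.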

Building a Resilient Future

By understanding these basic principles of availability and system design, we can build more robust architectures to withstand shared fate and single-point-of-failure events.

In a future article, we’ll explore how to calculate availability for more complex systems, such as “m of n” configurations, where the system remains operational as long as a minimum number (m) out of a total number (n) of components are functioning. This will provide a more comprehensive understanding of designing highly resilient systems for critical applications.


Designing Highly Available Systems 1/5 - Series & Parallel
