How to Build Resilience, One Step at a Time

Learn how AWS teams foster and develop a culture of resilience incrementally and use structured best practices to improve the resilience of teams, business functions, and systems.

Nina Kurth
Amazon Employee
Published Dec 12, 2023
Last Modified Mar 7, 2024
The Chinese adage “cross the river by feeling the stones” (taking one step, stabilizing yourself, and assessing your options before taking another) refers to a programmatic approach to undertaking change and dealing with uncertainty. I like to use this metaphor to describe how one can think about building a resilient product or service, another journey best done one step at a time.
Just as crossing a river requires patience, determination, and adaptability, so does moving toward resilience. By carefully choosing your path, testing the waters, preparing with solid strategies, and taking incremental steps, you can successfully navigate changing circumstances.

Four Stepping Stones

AWS teams know that in the real world, everything fails all the time. So they’re encouraged to approach resilience through four stepping stones that strengthen teams, processes, and systems.

1. Start with the Customer and Work Backward

Resilient teams relentlessly prioritize their customers and continuously seek to understand their needs and preferences. Teams that are not resilient, on the other hand, often lose touch with customer demands, resulting in wasted effort and declining productivity. At Amazon, the first stepping stone for achieving resilience is to understand what it means for our customers’ businesses. We do not start with an underlying technology or application workload in mind; instead, we obsess about understanding what business functions are critical to a particular customer journey.
How? Start by identifying your customers’ journeys through visual representations of their entire experience with a product, service, or brand, from initial to final interactions.
Once you have laid out that map, you can work with the business’s product and marketing teams to identify the critical functions that can make or break your customer’s experience. The greatest value of starting resilience planning with a customer-journey map is that it forces you to take an outside-in perspective before diving into technical aspects. It allows you to define what truly matters to your customers and their businesses, opening the door to evaluating the dependencies between business functions and underlying systems, recognizing vulnerabilities, and assessing the impacts of potential points of failure.
Demonstrating your commitment to understanding customers reassures stakeholders that you prioritize customers’ interests over internal priorities. Once you have this understanding and trust with the business, you can better tailor technical resilience requirements and related investments. With insights and data from business resilience planning, you can make informed decisions about resource allocation, technology investments, and risk mitigation strategies, whether for redundancy, failover mechanisms, or data-recovery plans.
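To make the idea of tracing dependencies from a customer journey down to underlying systems more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of an AWS tool or process: the journey steps, business functions, system names, and the helper steps_by_system are all hypothetical.

```python
from collections import defaultdict

# Hypothetical customer journey: each step lists the business functions it relies on.
journey = {
    "browse catalog": ["search", "product content"],
    "place order": ["checkout", "payments"],
    "track delivery": ["order status", "notifications"],
}

# Hypothetical mapping from business functions to the systems behind them.
function_systems = {
    "search": ["search-index"],
    "product content": ["cms", "cdn"],
    "checkout": ["order-db", "cdn"],
    "payments": ["payment-gateway"],
    "order status": ["order-db"],
    "notifications": ["email-service"],
}


def steps_by_system(journey, function_systems):
    """Return {system: sorted list of journey steps that depend on it}."""
    impacted = defaultdict(set)
    for step, functions in journey.items():
        for function in functions:
            for system in function_systems.get(function, []):
                impacted[system].add(step)
    return {system: sorted(steps) for system, steps in impacted.items()}


impact = steps_by_system(journey, function_systems)

# Systems that touch the most journey steps come first; ties sort by name.
ranked = sorted(impact.items(), key=lambda item: (-len(item[1]), item[0]))
for system, steps in ranked:
    print(f"{system}: {', '.join(steps)}")
```

Even a rough table like this can help start the conversation with product and marketing teams about which systems sit behind the moments that matter most to customers.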

2. Align Business and IT

We typically think of resilience in terms of technology, systems, applications, and workloads—but really it starts with customers and employees. So our second stepping stone is about alignment. Resilient teams break down silos and maintain open communications with stakeholders, including other teams and internal and external customers. They have to be data-driven and collect experiences from every level of the organization to understand and enhance customer experiences.
AWS recently helped a global medical-device company with its business-resilience planning as it prepared for a global product launch. We brought together cross-functional teams from business, R&D, marketing, IT, support, field staff, data, and managed services to focus on a single goal: providing the best possible customer experience for patients, doctors, and field staff.
Because we approached it this way, “barriers were broken down, exposing some false assumptions and increasing transparency between groups,” according to the company’s head of R&D. “Difficult discussions translated to measurable goals and tangible outputs to work toward, and allowed teams to prioritize resources and investments.”
The key reason for bringing these cross-functional teams together was to understand the impact of disruptions on multiple business and IT functions. We gained valuable insights into the consequences of various scenarios by quantifying potential financial, organizational, operational, technological, and reputational risks. Leaders could then better prioritize and allocate business and IT resources, minimizing the guesswork in capacity and resilience planning.
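One simple way to picture this kind of quantification is a likelihood-times-impact score per disruption scenario. The sketch below is only an illustration in Python: the scenarios, the likelihood estimates, the 1-to-5 impact scale, and the scoring formula are assumptions made for the example, not figures or methods from the engagement described above.

```python
from dataclasses import dataclass

# The five risk categories named in the text.
CATEGORIES = ("financial", "organizational", "operational",
              "technological", "reputational")


@dataclass
class Scenario:
    name: str
    likelihood: float  # assumed estimate of yearly probability, 0.0 to 1.0
    impact: dict       # category -> assumed score, 1 (minor) to 5 (severe)

    def score(self) -> float:
        """Likelihood multiplied by the summed impact across all categories."""
        return self.likelihood * sum(self.impact.get(c, 0) for c in CATEGORIES)


# Hypothetical scenarios with made-up numbers, for illustration only.
scenarios = [
    Scenario("regional outage", 0.1,
             {"financial": 5, "organizational": 3, "operational": 5,
              "technological": 4, "reputational": 3}),
    Scenario("support backlog at launch", 0.5,
             {"financial": 2, "organizational": 3, "operational": 3,
              "technological": 1, "reputational": 1}),
]

# Highest score first: a starting point for discussion, not a final ranking.
prioritized = sorted(scenarios, key=lambda s: s.score(), reverse=True)
for s in prioritized:
    print(f"{s.name}: {s.score():.1f}")
```

The value of such a score lies less in the number itself than in the cross-functional discussion needed to agree on the inputs.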

3. Learn from Failures

In cloud-based development, success hinges on navigating challenges and delivering results. Incidents and failures happen, but our job is to minimize their impact on customers. At Amazon, we say that failure and invention are inseparable twins, and to invent, you have to experiment without fear of failure. But what happens when errors occur? Nonresilient teams focus on fixing the problem rather than learning from it. Our third stepping stone involves doing both: fix and learn. This is where the broader company culture comes into play: if the organization gives learning low priority, the quality and depth of the results will reflect this. And nonresilient teams will be prone to repeating past mistakes. So, how do you introduce this learning culture to teams? AWS uses a structured mechanism called correction of error (COE).
At Amazon, post-incident activity is never treated as just a repair. A huge part of the postmortem is devoted to determining actions to prevent such errors from occurring again. COE is a proactive mechanism that empowers anyone to take control. It involves identifying, analyzing, and rectifying an error to minimize its impact on a project and, more broadly, the organization. COE is how we stop repeating past mistakes. It is a powerful mechanism to change how teams handle, perceive, value, and treat the incidents they experience.
“COE encourages my developers to dive deep and collect all possible data from the incident internally and from the customer,” said one customer who adopted COE across his business and IT organizations. “It forces them to seek information that was not initially collected, but that’s crucial to identifying the root cause of the error. No learning is left on the table, and teams generate a list of actionable and trackable next steps to prioritize and focus on.”

4. Build a Culture of Resilience

Building business resilience was the top priority for CEOs in McKinsey & Company’s article and podcast “Six CEO Priorities for 2023”—perhaps because, as McKinsey research has found, resilient companies generate 50 percent higher total shareholder returns than their less resilient peers.
Resilience is not just about defense; it’s also about competitive advantage. Resilient customer experiences create shareholder value. Companies that set resilience OKRs and KPIs at every level of the organization, including for the board and CEO, are better positioned to adapt and build long-term value. But for operational teams to translate these targets into actions, the entire organization has to understand the importance of resilience in achieving the company’s long-term goals. This requires a shift in culture, which takes time, effort, and resources.
Imagine the banks of a river forming the letter “S.” The safest place to cross is typically the straight section in the middle. If you lose your footing, the current will likely carry you to the bank on the outside of the bend. In a culture of resilience, this current is the collective mindset and approach to working that promotes and nurtures the ability to adapt, recover, and thrive in the face of challenges, setbacks, or adversity—and to plan for it.
The right culture can ensure that resilience isn’t just included in a one-off initiative but is instead an integral part of the organization. Resilience is a choice, just as speed is. But you have to establish a culture that embraces resilience, whether for a new application, a cloud transformation, or a business or IT function.
The culture of an organization is defined by its leaders. They must live the culture by talking about and acting according to it. Good governance ensures that best practices are integrated into everyday work and that everyone understands the importance of the culture of resilience. Governance is critical to creating this culture, with a feedback loop from customers, especially when transitioning to new ways of working. This governance ensures that the customer experience is consistently considered and enhanced as part of resilience-building.

Make Resilience a Principle

Resilience is built one step at a time. Just like when crossing a river, each small move forward counts. At AWS, we look at resilience from the perspective of both business and technology. Successful businesses thrive on change, but those with longevity tend to anticipate and prepare for marketplace disruption by making resilience a principle, one integrated with every element of their business, technology, and culture. A business is only as resilient as its weakest link.
As an avid hiker in Northern Lapland, I often find myself in remote locations needing to cross a river or a fast-flowing mountain stream. I would love to hear your best tips for safely getting to the other side—and how you are building a culture of resilience within your team and organization. Let’s connect via LinkedIn.


Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.