Step Function Reliability: A/B Testing


Transition from a legacy code monolith to micro-services with AWS Step Functions. Focus on A/B testing to enhance customer experience and reliability. Includes attempts with AppConfig, final solution with Step Function aliases, and practical examples using Terraform and Python.

Published Apr 15, 2025

Introduction:

After many years of maintaining a legacy code monolith with inaccurate logging, 'if' statements nested 8 layers deep, and repeated code, the team was given the green light to architect a rewrite. It was evident that AWS Step Functions would let us build the application with a micro-service approach, leveraging logic flows and parallel states to improve maintainability and observability. Fast forward a few months of development and a new question arose: "What if we wanted to shadow deploy our code and ensure new features are working?" The aim is to improve customer experience through new features while maintaining the reliability of a production-critical system. Though the product has unit tests, it is difficult to account for all edge cases and scenarios.
The following article is a recap of our exploration into implementing A/B testing with Step Functions to increase product reliability.

First Try: AppConfig

Being new to A/B testing, we needed some research to find the right terms for what we wanted. Was it shadow deployment, blue/green deployment, or A/B testing? It’s still a bit unclear, but we landed on calling it A/B testing and first found CloudWatch Evidently, a feature which was now a part of the AppConfig service. We started by making a feature flag, configuring a deployment environment, and creating an empty step function, just to see how the service worked. It was rather simple to get the configuration, set the flag to a variable, and use it in later states, as seen in the state machine below:
Basic StepFunction Accessing AppConfig
However, while the feature flag was working, there were a few hurdles to make this into a full deployment:
  1. This service wasn't approved for use within our company, meaning a delay of a few months before I could make it out of sandbox.
  2. The AppConfig deployment strategies didn't support using the execution metrics of the step function.
  3. Using feature flags meant tracking the variable across multiple Lambdas. JSONata would have simplified this, but the existing step function used JSONPath and there wasn't time to refactor.

Second Try: Step Function Aliases

Given that AppConfig didn’t satisfy our requirements, it was back to looking into Step Functions features. The goal was to find the equivalent of a load balancer using a weighted routing policy. It turned out that a step function alias was exactly what was needed... well, almost. While aliases handled the routing of incoming events, unlike a load balancer there was no 'health check' to ensure the step function version being executed would succeed. So it was clear we would need custom logic that updates the alias, incrementally routing more traffic to the newest version until it handles 100% of traffic. Below is an architectural overview of the process just described.

Architecture Overview:

Overall Architecture Diagram
  1. An API Gateway routes webhook events as input to the step function.
  2. Upon invoking the step function alias, executions are split 90% to the old version and 10% to the latest version.
  3. A CloudWatch alarm on the `ExecutionsSucceeded` metric uses the `SampleCount` statistic with a threshold of 10 in a 12-hour period.
  4. The CloudWatch alarm action is configured to kick off a Lambda.
  5. The Lambda shifts 10% of the alias routing from the old version to the new version. Once the alias reaches 100%, the alarm is disabled.
This architecture allows us to measure the success of the new version and send it more traffic with confidence. Given a 12-hour execution period, the traffic routing weight would change over time as follows:
Routing Weight Over Time
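The progression can also be sketched in a few lines of Python. This is a minimal illustration, assuming the alarm fires exactly once per 12-hour period and every shift moves 10 points; `rollout_schedule` is a hypothetical helper, not part of the deployed code.

```python
def rollout_schedule(start_weight=10, step=10, period_hours=12):
    """Return (hours_elapsed, old_weight, new_weight) until the new version holds 100%.

    Assumes one alarm-triggered shift per period; real timing depends on traffic.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    schedule = []
    hours, new_weight = 0, start_weight
    while True:
        schedule.append((hours, 100 - new_weight, new_weight))
        if new_weight >= 100:
            return schedule
        new_weight = min(100, new_weight + step)
        hours += period_hours


for hours, old, new in rollout_schedule():
    print(f"t+{hours:>3}h  old={old:>3}%  new={new:>3}%")
```

With these defaults the new version reaches 100% after nine shifts, or 108 hours, provided every period sees enough successful executions.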

Sample Code:

Below is an explanation of sample Terraform and Python that would mimic this setup.
The CloudWatch Alarm monitors successful executions with the dimension of the Step Function ARN. The description is used to send the Step Function Alias ARN to the Release Lambda, so it has the context of which Alias to update.
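For readers who prefer Python to Terraform, the same alarm settings might look roughly like the keyword arguments below, which match the shape of boto3's CloudWatch `put_metric_alarm` call. This is a sketch under assumptions: the alarm name, the comparison operator, and the single evaluation period are illustrative choices, and `build_release_alarm` is a hypothetical helper.

```python
def build_release_alarm(state_machine_arn, alias_arn, release_lambda_arn):
    """Build keyword arguments for a CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "step-function-release",  # hypothetical name
        # The description carries the alias ARN so the Release Lambda knows what to update.
        "AlarmDescription": alias_arn,
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsSucceeded",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "SampleCount",
        "Period": 12 * 60 * 60,  # 12 hours, in seconds
        "EvaluationPeriods": 1,  # assumption: a single period decides
        "Threshold": 10,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",  # assumption
        "AlarmActions": [release_lambda_arn],
    }


params = build_release_alarm(
    "arn:aws:states:us-east-1:123456789012:stateMachine:orders",
    "arn:aws:states:us-east-1:123456789012:stateMachine:orders:live",
    "arn:aws:lambda:us-east-1:123456789012:function:release",
)
print(params["Period"], params["Statistic"], params["Threshold"])
```

With credentials configured, the dictionary could be passed on as `boto3.client("cloudwatch").put_metric_alarm(**params)`.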
The Step Function Alias imports all existing Step Function versions, indexes the second version, and gives it a weight of 90% of the traffic. This acts as the “old version.” The other route is referenced directly from the published Step Function. This is the “new version,” which starts with 10% of traffic and is gradually increased by the Release Lambda.
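The starting split can be expressed as a small pure function. This sketch assumes the list of version ARNs is ordered newest first, so index 1 is the “old version”; `initial_routing` is a hypothetical helper, and the dictionaries it returns follow the `routingConfiguration` shape used by the Step Functions alias APIs.

```python
def initial_routing(version_arns, new_weight=10):
    """Build an alias routing configuration from version ARNs ordered newest first."""
    if not version_arns:
        raise ValueError("at least one published version is required")
    if len(version_arns) == 1:
        # Nothing to compare against yet: send everything to the only version.
        return [{"stateMachineVersionArn": version_arns[0], "weight": 100}]
    new_arn, old_arn = version_arns[0], version_arns[1]
    return [
        {"stateMachineVersionArn": old_arn, "weight": 100 - new_weight},
        {"stateMachineVersionArn": new_arn, "weight": new_weight},
    ]


base = "arn:aws:states:us-east-1:123456789012:stateMachine:orders"
for route in initial_routing([f"{base}:3", f"{base}:2", f"{base}:1"]):
    print(route)
```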
The Lambda accepts the standard CloudWatch Alarm event. It extracts the Step Function Alias ARN from the description field of the event and then describes the Alias to get the current routing configuration.
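A minimal sketch of that first step is shown below. It assumes the alarm event nests the description under `alarmData.configuration.description`, and it takes the Step Functions client as a parameter so the logic can be exercised without AWS access; the function names and the trimmed sample event are illustrative.

```python
def alias_arn_from_event(event):
    """Pull the alias ARN out of the alarm description carried by the event."""
    description = event["alarmData"]["configuration"].get("description", "").strip()
    if not description.startswith("arn:"):
        raise ValueError("alarm description does not contain an ARN")
    return description


def current_routes(sfn_client, alias_arn):
    """Describe the alias and return its routing configuration."""
    response = sfn_client.describe_state_machine_alias(stateMachineAliasArn=alias_arn)
    return response["routingConfiguration"]


# Trimmed, hypothetical event carrying only the fields used here.
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "alarmData": {
        "alarmName": "step-function-release",
        "configuration": {
            "description": "arn:aws:states:us-east-1:123456789012:stateMachine:orders:live"
        },
    },
}

print(alias_arn_from_event(SAMPLE_EVENT))
```

Inside a real handler, `sfn_client` would typically be `boto3.client("stepfunctions")`.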
If there is only one route configured in the Alias, the “new version” of the Step Function has successfully been executed enough times that it is accepting 100% of incoming traffic. As a result, the Release Lambda will disable the CloudWatch Alarm, effectively unenrolling the “new version” of the Step Function from the A/B testing process and labeling it as “stable.”
If there is more than one route configured in the Alias, the “new version” routing is increased by 10% and the “old version” is decreased by 10%. Finally, the Lambda returns the CloudWatch Alarm from the “In alarm” state to the “Insufficient data” state so the A/B testing process can continue with more successful executions.
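Both branches can be sketched as follows. Several things here are assumptions rather than the deployed code: the new version is identified as the route whose ARN ends in the higher version number, the alias collapses to a single route once the new version reaches 100%, and "disabling" the alarm is done with `disable_alarm_actions`. Clients are passed in so the logic can be tested with stand-ins.

```python
def version_number(route):
    """Version ARNs end in ':<number>'; a higher number is assumed to be newer."""
    return int(route["stateMachineVersionArn"].rsplit(":", 1)[1])


def shift_routes(routes, step=10):
    """Move `step` points of weight from the old route to the new one."""
    old, new = sorted(routes, key=version_number)
    new_weight = min(100, new["weight"] + step)
    if new_weight == 100:
        # Collapse to a single route so the next alarm takes the "stable" branch.
        return [{"stateMachineVersionArn": new["stateMachineVersionArn"], "weight": 100}]
    return [
        {"stateMachineVersionArn": old["stateMachineVersionArn"], "weight": 100 - new_weight},
        {"stateMachineVersionArn": new["stateMachineVersionArn"], "weight": new_weight},
    ]


def release_step(sfn_client, cloudwatch_client, alias_arn, alarm_name):
    """Run one iteration of the release process and report which branch ran."""
    routes = sfn_client.describe_state_machine_alias(
        stateMachineAliasArn=alias_arn
    )["routingConfiguration"]
    if len(routes) == 1:
        # The new version already takes all traffic: stop the A/B process.
        cloudwatch_client.disable_alarm_actions(AlarmNames=[alarm_name])
        return "stable"
    sfn_client.update_state_machine_alias(
        stateMachineAliasArn=alias_arn,
        routingConfiguration=shift_routes(routes),
    )
    # Reset the alarm so it can fire again after more successful executions.
    cloudwatch_client.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="INSUFFICIENT_DATA",
        StateReason="Routing shifted; waiting for more successful executions",
    )
    return "shifted"
```

Keeping `shift_routes` free of AWS calls makes the weight arithmetic easy to unit test, which matters when the function controls production traffic.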

What Next?

While this architecture has proven effective so far, there is one large feature left to implement so that failures do not impact customer experience: a system that keeps track of the last successful version, so that if the newest version is unstable and causing errors, an alarm action can revert the alias changes. It would also ensure that feedback is provided to customers by starting a new execution on the stable version. Initial testing of this has used a DynamoDB table to track versions and their stability status, but there are many edge cases to work through.
Shoutout to my colleague Keelan Zeigler for helping me bring the solution to life and writing this article.
 
