AI found the concurrency bug in this code, then fixed it
Fixing distributed system bugs with the assistance of Amazon Q Developer
Nathan Peck
Amazon Employee
Published Apr 25, 2024
Last Modified Apr 26, 2024
The jump from coding for a single local user to coding a distributed system for many concurrent users is a leap similar to a student going from elementary school math to algebra, or algebra to calculus. Suddenly, many of the code patterns that worked just fine before now no longer work properly. In this post I’m going to show a common bug when coding a distributed system that has concurrent users, and then show an example of how Amazon Q can help catch and diagnose this bug, then fix it.
When writing code that will be run as a distributed system with many concurrent users, one thing you’ll learn very quickly is the importance of data consistency. But data consistency issues aren’t always obvious, especially since they can be hard to test for and difficult to reproduce. Imagine that you are building a basic counter shared by multiple concurrent users. Perhaps you are counting the number of views of a page, and each time a user visits the page, the counter is incremented.
The naive way to approach this problem is:
- Fetch the current value of the counter from some shared state store, such as your database
- Increment the value
- Store the incremented counter back into the shared state store
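The three steps above can be sketched roughly as follows. This is only an illustration: a plain `Map` stands in for the shared state store, and the names `sharedStore` and `naiveIncrement` are hypothetical.

```javascript
// Hypothetical in-memory stand-in for a shared state store such as a database.
const sharedStore = new Map([["page-views", 42]]);

// Naive read-modify-write: three separate steps with nothing tying them together.
function naiveIncrement(store, key) {
  const current = store.get(key) ?? 0; // 1. fetch the current value
  const next = current + 1;            // 2. increment the local copy
  store.set(key, next);                // 3. store the result back
  return next;
}

console.log(naiveIncrement(sharedStore, "page-views")); // 43
```

With a single caller running one increment at a time, nothing can slip in between the fetch and the store, so the result is always correct.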
This approach works just fine when there is only a single user contributing to the counter, as long as they are only contributing a single increment at a time. But what happens when you have multiple users and multiple distributed processes that are mutating the counter?
Let’s imagine that three people load the page at nearly the exact same time:
- Each of the three concurrent page views fetches the current value of the counter at the time of the page load. All three page views observe an initial page view count of 42.
- Each of the three operations increments its own local copy of the value from 42 to 43.
- Each of them saves the value back to the central store of truth as 43.
We have a problem... there have been three page views, but the counter only went up by one!
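The interleaving described above can be reproduced with a small deterministic simulation. This is a sketch, not real concurrent code: the function name `simulateConcurrentViews` is hypothetical, and the three phases are run one after another to force the worst-case ordering.

```javascript
// Simulates several page views that all read the counter before any of them writes.
function simulateConcurrentViews(initialCount, viewers) {
  let storedCount = initialCount;

  // Phase 1: every request fetches the same current value.
  const localCopies = Array.from({ length: viewers }, () => storedCount);

  // Phase 2: every request increments its own local copy.
  const incremented = localCopies.map((value) => value + 1);

  // Phase 3: every request writes its copy back; each write overwrites the last.
  for (const value of incremented) {
    storedCount = value;
  }
  return storedCount;
}

console.log(simulateConcurrentViews(42, 3)); // 43, not 45
```

Two of the three increments are lost because every write is based on the same stale read.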
If you’ve ever played an online game that had a money duplication glitch or item duplication glitch, then this bug, or some variation of it, is likely to blame. These things are easy to miss, especially if you come from a background of coding local applications for a single concurrent user.
As an example, I wrote the following vulnerable code, using the naive approach listed above. If multiple users attempt to access this endpoint concurrently then the counter will not behave as expected. When requests overlap, most of their increments will be lost.
Fortunately, if you have Amazon Q installed in your IDE, you have access to an assistant who can review this code and identify the problem!
Playing the role of a junior developer who has never encountered this issue before, I asked a fairly open-ended question:
I’m trying to build a hit counter. Why is this code not keeping track of hits properly?
After reviewing my code, Amazon Q correctly identified that there was an issue with concurrency:
Concurrency issues - If multiple requests are hitting the endpoint simultaneously, it's possible the counts could get overwritten or missed. DynamoDB supports atomic increment/decrement operations that can help avoid this.
Use DynamoDB's atomic counter operation instead of a separate Get/Update
Next I asked Amazon Q how to rewrite this code to be safe for concurrency. It offered two options:
Both options turn the counter update into an atomic operation that happens within the DynamoDB service. DynamoDB ensures that updates are applied in a consistent manner that will not lose any of the updates.
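To see why moving the update into the store helps, here is a toy in-memory model. It is not DynamoDB, and `AtomicCounterStore` is a hypothetical name. The point is only that clients send "add 1" instead of a value computed from a possibly stale read.

```javascript
// Toy model of a store that applies increments itself.
class AtomicCounterStore {
  constructor(initial = {}) {
    this.items = new Map(Object.entries(initial));
  }

  // The read, the addition, and the write happen in one step inside the store,
  // so no client ever holds a stale copy of the counter.
  add(key, delta) {
    const next = (this.items.get(key) ?? 0) + delta;
    this.items.set(key, next);
    return next;
  }
}

const counterStore = new AtomicCounterStore({ "page-views": 42 });

// Three page views each ask the store to add 1; none of them reads the value first.
for (let viewer = 0; viewer < 3; viewer++) {
  counterStore.add("page-views", 1);
}

console.log(counterStore.items.get("page-views")); // 45
```

Because each request carries only the delta, the order in which the requests arrive no longer matters.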
I chose to generate code for the UpdateItemCommand approach, and Amazon Q gave me the following rewritten piece of code:
This code worked out of the box, with no changes necessary. It even includes the ability to specify an initial value for the counter if the value was not yet set. But best of all, it no longer "misses" hits. The endpoint can be called as many times concurrently as necessary, and the counter field updates to the appropriate value in DynamoDB.
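For comparison, here is a minimal sketch of how an atomic increment with an initial value might be expressed for the AWS SDK for JavaScript v3. This is not the code Amazon Q generated. The helper `buildIncrementInput`, the table name `hit-counters`, the key attribute `id`, and the counter attribute `hits` are all assumptions made for illustration.

```javascript
// Hypothetical helper: builds the input for a single atomic UpdateItem call.
// Table, key, and attribute names are illustrative assumptions.
function buildIncrementInput(tableName, counterId, startValue = 0) {
  return {
    TableName: tableName,
    Key: { id: { S: counterId } },
    // if_not_exists supplies the starting value when the attribute is missing,
    // and DynamoDB applies the addition itself as part of the same update.
    UpdateExpression: "SET #hits = if_not_exists(#hits, :start) + :inc",
    ExpressionAttributeNames: { "#hits": "hits" },
    ExpressionAttributeValues: {
      ":start": { N: String(startValue) },
      ":inc": { N: "1" },
    },
    ReturnValues: "UPDATED_NEW",
  };
}

const input = buildIncrementInput("hit-counters", "home-page");

// With the SDK installed, the input would be sent roughly like this:
//   const { DynamoDBClient, UpdateItemCommand } = require("@aws-sdk/client-dynamodb");
//   const result = await new DynamoDBClient({}).send(new UpdateItemCommand(input));
//   // result.Attributes.hits.N holds the new count as a string
console.log(input.UpdateExpression);
```

The request never contains a counter value read by the client, only the increment, which is what makes it safe to call from many requests at once.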
As you can see, generative AI can be used to help catch and diagnose complex issues such as distributed system bugs. This capability will only get better over time. I’m optimistic about a future where junior developers can work alongside generative AI assistants in learning to build systems that are concurrency-safe, as well as diagnosing and resolving complex concurrency bugs in existing systems.
Want to learn more about how to use Amazon Q in your day-to-day work as a software builder? Find more tips and tricks in the Amazon Q Developer Center.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.