How We Built The Model Brawl League: A Chat Bot Arena for LLMs
Learn how we built a video game to benchmark LLMs
Banjo Obayomi
Amazon Employee
Published Jul 2, 2024
Benchmarking Large Language Models (LLMs) has become increasingly popular in the AI community. My previous experiment using Street Fighter as a benchmark provided valuable insights into how models can compete in a controlled environment.
However, we faced limitations in controlling the game mechanics and relied on workarounds like pixel analysis to identify characters. To address these challenges and create a more tailored experience, we decided to re:invent the chat bot arena from the ground up. Thus, the Model Brawl League was born.
The Model Brawl League is a 2.5D fighting game built in Unity. It allows players to pit two LLMs powered by Amazon Bedrock against each other in combat. Moves are executed in real time, and players can observe each LLM's "thought process" behind its move selection.
Now, let's dive into the technical details of how we brought this exciting concept to life.
The Model Brawl League is built on Unity, leveraging the Universal Fighting Engine (UFE) to create a robust and flexible fighting game framework. Unity's powerful capabilities allowed us to create visually appealing environments, while UFE provided essential fighting game mechanics such as hit detection, combo systems, and character movement.
One of the key advantages of using Unity is its ability to compile our game to WebGL, allowing the entire game to run directly in a web browser, where we can control it using JavaScript.
We get each LLM's actions by sending a prompt through a Lambda function that calls Amazon Bedrock; the JSON response describing the next moves is then executed in real time.
One of the key challenges in creating the Model Brawl League was translating the game state into a format that LLMs could understand and respond to. Fortunately, the UFE provides a global configuration of game state information, which greatly simplified this process.
By leveraging UFE's built-in event system, we were able to access variable data such as current HP, character positions, and other relevant game state information. This data is then transformed into a natural language prompt that provides context and asks the LLM to decide on its next moves. Because we have access to the full game state, no image processing was needed.
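As an illustration, a prompt builder along these lines might look like the following Python sketch. The field names, HP scale, and move list here are assumptions for demonstration, not the game's actual prompt:

```python
# Illustrative only: the state fields and move names are assumptions,
# not the actual prompt format used in the Model Brawl League.
def build_prompt(state: dict) -> str:
    """Turn a game-state snapshot into a natural-language prompt."""
    return (
        f"You are {state['player']} in a fighting game. "
        f"Your HP: {state['my_hp']}/1000. Opponent HP: {state['opp_hp']}/1000. "
        f"Distance to opponent: {state['distance']}. "
        "Choose your next moves from: move_forward, move_back, jump, "
        "punch, kick, block. "
        'Respond with JSON: {"moves": [...], "reasoning": "..."}'
    )

state = {"player": "Player 1", "my_hp": 820, "opp_hp": 640, "distance": "close"}
print(build_prompt(state))
```

Keeping the prompt a pure function of the state snapshot makes each turn reproducible for later analysis.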
What's particularly exciting about this process is that it didn't require much knowledge of C#. Instead, we utilized Amazon Q Developer, a powerful coding assistant integrated into our IDE. This AI-powered tool helped us write the necessary code to interact with UFE's global configuration, making the development process more efficient and accessible.
Here's an example of how we set up event listeners in UFE to capture game state changes:
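A C# sketch along these lines is shown below. The exact event names and handler signatures vary by UFE version, so treat the delegates and parameter types here as illustrative rather than the project's actual code:

```csharp
using UnityEngine;

public class GameStateListener : MonoBehaviour
{
    void OnEnable()
    {
        // UFE exposes global event delegates we can subscribe to; the
        // names and signatures below follow UFE's documented global
        // events but may differ across versions.
        UFE.OnRoundBegins += HandleRoundBegins;
        UFE.OnHit += HandleHit;
    }

    void OnDisable()
    {
        UFE.OnRoundBegins -= HandleRoundBegins;
        UFE.OnHit -= HandleHit;
    }

    void HandleRoundBegins(int roundNumber)
    {
        // Reset any per-round state that the prompt builder reads.
        Debug.Log("Round " + roundNumber + " started");
    }

    void HandleHit(HitBox strokeHitBox, MoveInfo move, ControlsScript hitter)
    {
        // Record the hit so the browser-side JavaScript can include it
        // in the next prompt built from the global configuration.
        Debug.Log(move.moveName + " connected");
    }
}
```

Subscribing in `OnEnable` and unsubscribing in `OnDisable` keeps the listener from leaking handlers across scene reloads.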
Once the game state data is captured in the global configuration, we access this data using JavaScript and construct our prompt. This prompt is then sent to a Lambda function, which acts as an intermediary between our game and Amazon Bedrock.
With our game state converted into a suitable prompt, we leverage Amazon Bedrock to communicate with various LLMs.
This serverless approach allows us to efficiently manage API calls, implement any necessary pre-processing or post-processing of the prompts and responses, and maintain a smooth game flow even when dealing with varying response times from different models. Using the Converse API, we can process requests for every model with the same code.
Our Lambda function handles the following tasks:
- Receives the game state prompt from the JavaScript front-end
- Formats the prompt if necessary
- Calls the appropriate LLM through Amazon Bedrock's API
- Processes the LLM's response
- Returns the processed response back to the game
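A minimal Python sketch of such a handler, using boto3's Converse API, might look like the following. The event shape (`modelId`, `prompt` in the request body) and the move-parsing helper are assumptions for illustration:

```python
import json

def parse_moves(model_text: str) -> dict:
    """Extract the JSON move object from a model reply; models sometimes
    wrap the JSON in extra prose, so take the outermost braces."""
    start, end = model_text.find("{"), model_text.rfind("}")
    return json.loads(model_text[start:end + 1])

def lambda_handler(event, context):
    """Hypothetical handler shape: expects {"modelId": ..., "prompt": ...}."""
    import boto3  # deferred import so the module loads without the AWS SDK
    body = json.loads(event["body"])

    # The Converse API gives every Bedrock model the same request/response shape.
    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.converse(
        modelId=body["modelId"],
        messages=[{"role": "user", "content": [{"text": body["prompt"]}]}],
        inferenceConfig={"maxTokens": 256, "temperature": 0.7},
    )
    text = response["output"]["message"]["content"][0]["text"]
    return {"statusCode": 200, "body": json.dumps(parse_moves(text))}
```

Because only `modelId` changes between matchups, swapping fighters is a one-field change in the request.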
This architecture enables us to easily switch between different models, facilitating matchups between LLMs while keeping the core game logic separate from the AI interaction logic.
Now that we have our AI responses, the next challenge was to translate these decisions into actual game moves.
To make the Model Brawl League accessible and easy to use, we implemented an LLM "controller" which allows moves to be executed directly in the browser using JavaScript. This approach eliminates the need for complex server-side processing and enables real-time gameplay with minimal latency.
Our JavaScript execution engine interprets the LLM's response, translates it into game commands, and applies those commands to the characters in the Unity game.
The heart of the Model Brawl League is its game loop, which continuously cycles through the process of capturing the game state, generating prompts, receiving LLM responses, and executing moves. This loop continues until a winner is determined.
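In the PyGame version, that loop can be sketched roughly as follows. The move-to-input mapping, state fields, and callback names are illustrative assumptions, not the real bindings:

```python
import json

# Illustrative mapping from LLM move names to engine inputs; real bindings differ.
COMMANDS = {"punch": "attack1", "kick": "attack2", "jump": "up",
            "move_forward": "forward", "move_back": "back", "block": "block"}

def run_match(get_state, ask_llm, execute, max_turns=100):
    """Capture state, prompt the LLM, execute its moves; stop when a
    fighter's HP reaches zero or the turn budget runs out."""
    state = get_state()
    for _ in range(max_turns):
        if state["my_hp"] <= 0 or state["opp_hp"] <= 0:
            break
        reply = ask_llm(f"State: {json.dumps(state)}. Reply with JSON moves.")
        for move in reply.get("moves", []):
            if move in COMMANDS:  # ignore moves the model hallucinated
                execute(COMMANDS[move])
        state = get_state()
    if state["opp_hp"] <= 0:
        return "player"
    return "opponent" if state["my_hp"] <= 0 else "draw"
```

Passing `get_state`, `ask_llm`, and `execute` in as callbacks keeps the loop testable without a running game or a live model.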
We've also implemented a robust logging system that captures each step of the process, allowing for post-game analysis and providing valuable data on how effective each LLM was and how much each match cost.
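Such a log can be as simple as one structured record per turn. This Python sketch uses placeholder prices (real Amazon Bedrock pricing varies per model) and assumes the token counts the Converse API returns in its `usage` field:

```python
import time

# Placeholder per-1K-token prices; look up real Bedrock pricing per model.
PRICES = {"example-model": {"input": 0.00025, "output": 0.00125}}

def log_turn(log, model_id, prompt, reply, usage):
    """Append one structured record per game-loop turn. `usage` mirrors the
    Converse API's token counts (inputTokens / outputTokens)."""
    price = PRICES.get(model_id, {"input": 0.0, "output": 0.0})
    cost = (usage["inputTokens"] / 1000 * price["input"]
            + usage["outputTokens"] / 1000 * price["output"])
    record = {"ts": time.time(), "model": model_id, "prompt": prompt,
              "reply": reply, "usage": usage, "cost_usd": round(cost, 6)}
    log.append(record)
    return record
```

Summing `cost_usd` over a match gives a per-model price tag to set alongside the win/loss record.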
On average, smaller models perform better due to their faster response times, though we are still collecting data to get more comprehensive results.
The Model Brawl League represents an exciting new frontier in LLM benchmarking. By creating a controlled, purpose-built environment for AI combat, we've opened up new possibilities for understanding and improving an LLM's decision-making capabilities in dynamic, adversarial settings.
While the full version of the Model Brawl League offers high-fidelity graphics and advanced features, we understand that not everyone has access to Unity or the resources to set up a complete fighting game engine.
To that end, we've developed a mini, open-source version of the Model Brawl League using PyGame, based on this tutorial from Coding With Russ. This simplified version maintains the core concepts and functionality of the full game, but in a much more digestible format.
To get started with the PyGame Model Brawl League and see how all the pieces fit together, check out our open-source code repository.
We're excited to see how the community will use and expand upon this framework. Whether you're benchmarking the latest language models, exploring AI decision-making, or just curious about how LLMs perform in a fighting game context, we invite you to join us in the Model Brawl League arena!
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.