Five LLMs battled Pokemon. Claude Opus was super effective
Gotta prompt ’em all!
Banjo Obayomi
Amazon Employee
Published Apr 17, 2024
As an avid Pokémon player, this has definitely been my favorite experiment to build with LLMs. When I saw the PokéLLMon paper from Georgia Institute of Technology, which created an agent to play Pokémon battles, I just had to see which model was the very best. On the surface, Pokémon battles may seem simple, with a limited action space of four moves or switching to one of five other Pokémon. However, the game's depth lies in the countless strategies that arise from the interplay of 18 Pokémon types, unique stats, and more.
The question remained: could LLMs use game state information paired with a Pokédex to pick the best action? In this post, I'll share three fascinating lessons I learned as LLMs battled their way to the top of the Pokémon League.
I used the Poke-env battle simulator to provide an arena for automated battles. The simulator exposes the current state of the game, such as the stats of each Pokémon, move data, and the previous actions taken.
I then set up two LLM agents with Amazon Bedrock to face each other in a match. Here is how the test bed works:
The current state of the match is translated into text that is added to a prompt with all the relevant context for the LLM, such as available moves, stats, and previous turns.
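To make this concrete, here is a minimal sketch of that translation step using poke-env's battle object. The helper name `battle_to_prompt` and the exact fields included are my own illustration rather than the experiment's actual code:

```python
def battle_to_prompt(battle) -> str:
    """Render a poke-env Battle object as text an LLM can reason over."""
    me = battle.active_pokemon
    foe = battle.opponent_active_pokemon
    moves = ", ".join(
        f"{m.id} (type {m.type.name}, power {m.base_power})"
        for m in battle.available_moves
    )
    switches = ", ".join(p.species for p in battle.available_switches)
    return (
        f"Your active Pokemon: {me.species} ({me.current_hp_fraction:.0%} HP).\n"
        f"Opponent: {foe.species} (types: {[t.name for t in foe.types if t]}).\n"
        f"Available moves: {moves}.\n"
        f"Available switches: {switches}.\n"
        'Reply with JSON: {"action": "move" | "switch", "name": "..."}'
    )
```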
Given the state of the game, I then ask the LLM to make the best move. The matches are not in real time, so the models have time to think through their actions and are not penalized for slower response times. Here is an example response from a model given the game state:
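The shape of the reply, reconstructed here for illustration rather than quoted verbatim:

```json
{
  "reasoning": "Alakazam is frail and weak to Ghost, so Shadow Ball should knock it out before it can attack.",
  "action": "move",
  "name": "shadowball"
}
```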
The environment then orchestrated the moves returned in the JSON object until one side was defeated. Under the hood, the glue looks roughly like the sketch below: a poke-env Player subclass that sends the prompt to a model on Amazon Bedrock and plays whatever action comes back. The `LLMPlayer` class and `ask_bedrock` helper are my own simplified reconstruction, not the repo's exact code:
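```python
import json
import boto3
from poke_env.player import Player

bedrock = boto3.client("bedrock-runtime")

def ask_bedrock(model_id: str, system: str, prompt: str) -> dict:
    """Send the battle state to a Bedrock model and parse its JSON reply."""
    response = bedrock.converse(
        modelId=model_id,
        system=[{"text": system}],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

class LLMPlayer(Player):
    """A poke-env player whose moves are chosen by a Bedrock-hosted LLM."""

    def __init__(self, model_id: str, system_prompt: str, **kwargs):
        super().__init__(**kwargs)
        self.model_id = model_id
        self.system_prompt = system_prompt

    def choose_move(self, battle):
        try:
            action = ask_bedrock(
                self.model_id, self.system_prompt, battle_to_prompt(battle)
            )
            if action["action"] == "move":
                for move in battle.available_moves:
                    if move.id == action["name"]:
                        return self.create_order(move)
            else:
                for pokemon in battle.available_switches:
                    if pokemon.species.lower() == action["name"].lower():
                        return self.create_order(pokemon)
        except (json.JSONDecodeError, KeyError):
            pass  # malformed output falls through to a random legal action
        return self.choose_random_move(battle)
```

Two such players can then fight a series with poke-env's built-in coroutine, e.g. `await opus.battle_against(mistral, n_battles=5)`. Now, let's see how the models stacked up against each other.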
My previous Street Fighter experiment showed that the Claude models were the best when it came to gaming, so I wanted to see how they stacked up against the Mistral models, especially with Claude 3 Opus having just launched on Amazon Bedrock. For each matchup I used the same parameters and system prompt in a Generation 8, best-of-five battle format with random Pokémon.
Haiku won here 3-2. It was a close match, with Haiku showcasing its speed and responsiveness. However, there were instances where Haiku made wrong moves or switched Pokémon randomly.
Sonnet won 3-2, another close one that could have gone either way. The models seemed to be on par with each other, but Mistral occasionally made mistakes or less-than-ideal choices. We'll dive deeper into these inconsistencies later in the post.
Opus won 4-1. While Opus took its time to respond, the extra processing paid off. It consistently made optimal moves based on the scenario and maintained a steady attacking strategy. In contrast, Mistral would randomly switch Pokémon, losing momentum and allowing Opus to capitalize on these missteps.
Opus was declared the champion!!! Being the most powerful model, however, comes with a higher computational cost: on average it took 21 seconds to pick a move versus 3 seconds for Haiku, making it seven times slower. But when you need top-tier performance and intelligence from an LLM, Opus is a clear front runner.
| Model | Ranking | Average Speed (seconds) |
| --- | --- | --- |
| 🥇 Claude 3 Opus | Champion | 21 |
| 🥈 Claude 3 Sonnet | Tied 2nd | 10 |
| 🥈 Mistral Large | Tied 2nd | 8 |
| 🥉 Claude 3 Haiku | 3rd | 3 |
| Mixtral 8x7B | 4th | 8 |
While I'd need to run more matches to get definitive data, watching the models battle still revealed several fascinating aspects of how they approach complex tasks, from the importance of prompt engineering to the challenges of hallucinations and suboptimal decision-making. Here are the lessons I learned:
When I initially used the original prompt from the PokéLLMon paper (written for and tested on GPT models) with Claude 3 Sonnet, the results were underwhelming: a mere 5% win rate against the bot. To improve performance, I crafted a new system prompt following prompting best practices, which boosted the win rate to an impressive 50%.
The new prompt strongly emphasizes not switching and provides examples of how to think through moves based on the current matchup. Also, by giving the model a more "aggressive" tone, it focused more on attacking and seemed to really get into the battles.
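Condensed, that style of prompt looks something like this (an illustrative sketch, not the exact text I used):

```python
SYSTEM_PROMPT = """You are an aggressive, expert Pokemon battler.
Prefer attacking over switching: only switch if your active Pokemon has
no way to deal meaningful damage. Think through the type matchup, judge
whether you can knock the opponent out before it knocks you out, then
commit to the strongest available move. Respond only with JSON:
{"action": "move" | "switch", "name": "<move or pokemon name>"}"""
```

With that framing, the models got noticeably more decisive, producing responses like this one: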
Therefore, the best move is to attack with Shadow Ball. It has a high chance to OHKO Alakazam from full health, and even if it doesn't, Gengar will survive thanks to Focus Sash and can then finish the job. Don't overthink it - just click Shadow Ball and remove Alakazam from the field immediately.
This experience aligns with similar tests in which Claude 2's accuracy at recalling data from long documents shot up from 27% to 98% just by adding a single phrase to the prompt.
I suspect that the Mistral models' poorer performance might be attributable to bad prompting. For further experiments, I plan to follow Mistral's prompting best practices to ensure a fairer comparison.
The models hallucinated frequently during the battles. The vast complexity of the Pokémon universe, with 18 different types and over 1,000 Pokémon, led the models to make logical errors, such as:
- Stating incorrect type matchups
- Confusing defense with special defense
- Forgetting about abilities and held items
Here are some examples of these hallucinations:
- The optimal play is to maintain offensive pressure with Toxic to cripple Bronzong over time. Then Air Slash for damage while Bronzong is poisoned. Vespiquen should be able to win this matchup without much trouble. (Bronzong is a Steel type and can't be poisoned)
- Looking at the rest of my team, Arcanine stands out as the best switch-in. It resists Orbeetle's Bug attacks, is immune to its Psychic attacks, and can threaten with super-effective Fire moves like Flare Blitz. (Arcanine is not immune to Psychic attacks)
- While Blacephalon's Special Attack is boosted, Aurorus has a Defense stat of 203, which could help mitigate some of the damage. (Wrong stat to compare: special attacks are mitigated by Special Defense, not Defense)
This highlights that as the intelligence required for a task increases, more context doesn't always help, even with stronger models. For high-intelligence tasks, giving the model specialized tools, for example a tool that can do battle damage calculations like the sketch below, would be more helpful than having it try to do the "math" itself.
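For instance, a deterministic damage calculator like this hypothetical helper could be exposed to the model as a tool. It implements the standard mainline damage formula (simplified, ignoring per-step rounding); the stat values in the usage example are illustrative:

```python
def damage_range(level: int, power: int, attack: int, defense: int,
                 stab: float = 1.0, effectiveness: float = 1.0) -> tuple:
    """Approximate min/max damage using the mainline Pokemon formula."""
    base = ((2 * level / 5 + 2) * power * attack / defense) / 50 + 2
    modifier = stab * effectiveness
    # Damage rolls span 85% to 100% of the computed value
    return int(base * modifier * 0.85), int(base * modifier)

# e.g. Gengar's Shadow Ball (power 80, STAB, super effective)
# into an uninvested Alakazam (Sp. Def ~115 at level 50)
print(damage_range(level=50, power=80, attack=222, defense=115,
                   stab=1.5, effectiveness=2.0))  # -> roughly (178, 209)
```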
During the battles I noticed the models would display the "panic switching" observed in the original paper. The LLMs tend to switch Pokémon frequently when facing strong or stat-boosted opponents, giving those opponents ample time to set up and attack, ultimately leading to defeat. Even with my prompt explicitly warning against this behavior, some models still made consecutive switches, allowing their opponents to win with ease.
The model would even justify why the switch was worth it:
I acknowledge switching twice in a row gives Gardevoir free turns to attack. However, preserving Turtonator and swapping in the bulkier, offensive threat in Malamar is worth it to put pressure back on Gardevoir.
This taught me that while prompt engineering can improve outcomes, it won't be enough to get the desired behavior every time. In the paper, the authors used a self-consistency approach: they sampled three outputs from the LLM and went with the most consistent action, which led to a 7% increase in win rate against the bot.
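A rough sketch of that idea, reusing the hypothetical `ask_bedrock` helper from the earlier test-bed sketch (voting here is over the action/name pair, with ties broken by first occurrence):

```python
from collections import Counter

def self_consistent_action(model_id: str, system: str, prompt: str, k: int = 3) -> dict:
    """Sample k actions and keep the most frequent one (self-consistency)."""
    votes = [ask_bedrock(model_id, system, prompt) for _ in range(k)]
    tally = Counter((v["action"], v["name"]) for v in votes)
    (action, name), _ = tally.most_common(1)[0]
    return {"action": action, "name": name}
```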
In the next section, we'll explore how you can try your hand at setting up your own LLM Pokémon champion.
Ready to build your own LLM Pokémon master? All the code and documentation you need to get started are available on GitHub. I'm excited to see how the community can improve upon this experiment by:
- Tweaking prompts to optimize LLM performance
- Trying out different LLMs to find the best contenders
- Exploring model behaviors to gain deeper insights
If you're interested in experimenting with the winning Claude 3 models, check out my comprehensive getting started guide for detailed instructions and best practices.
Have an idea for taking this experiment to the next level? Want to share your findings or discuss the implications of LLMs in gaming and beyond? Leave a comment below and let's keep the conversation going!
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.