Five LLMs battled Pokemon. Claude Opus was super effective

As an avid Pokemon player, this is definitely been my favorite experiment building with LLMs. When I saw the PokéLLMon paper from Georgia Institute of Technology create an agent to do Pokemon battles, I just had to see which model was the very best. On the surface, Pokémon battles may seem simple, with a limited action space of four moves or switching to one of five other Pokémon. However, the game's depth lies in the countless strategies that arise from the interplay of 18 Pokémon types, unique stats, and more.

The question remained: could LLMs use game state information paired with a Pokédex to help them pick the best action? In this post, I'll share 3 fascinating lessons I learned as LLMs battled their way to the top of the Pokémon League.

How It Works

I was able to use the Poke-env battle simulator to provides an arena for automated battles. The simulator can get the current state of the game such as the stats of the Pokémon, move data, and the previous actions taken.

I was then able to setup two LLM agents with Amazon Bedrock to face each other in a match. Here is how the test bed works:

Gather Game State Data

This current state of the match is translated into text that is added into a prompt with all the relevant context for the LLM such as available moves and stats, and previous turns.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Battle start: Opponent sent out Oranguru. You sent out Lycanroc.
Turn 1: Current battle state:
Opponent has 6 pokemons left.

Opposing pokemon:oranguru,Type:Normal and Psychic,HP:100%,Is dynamax:False,Attack:158,Defense:193,Special attack:211,Special defense:247,Speed:158,Ability:Inner Focus(The Pokémon's intense focus prevents it from flinching or being affected by Intimidate.) Bug, Dark-type attack is super-effective (2x damage) to oranguru. Psychic-type attack is ineffective (0.5x damage) to oranguru. Ghost-type attack is zero effect (0x damage) to oranguru.
oranguru's all the possible attacks:[psychic,psychic,Power:90],[thunderbolt,electric,Power:90],[focusblast,fighting,Power:120],

Your current pokemon:lycanroc,Type:Rock,HP:100%,Attack:227,Defense:148,Special attack:132,Special defense:148,Speed:223(faster than oranguru).Ability:Sand Rush(Boosts the Pokémon's Speed stat in a sandstorm.),Item:Focus Sash(An item to be held by a Pokémon. If it has full HP, the holder will endure one potential KO attack, leaving 1 HP.) Fighting-type attack is super-effective (2x damage) to lycanroc. Normal-type attack is ineffective (0.5x damage) to lycanroc.

Your lycanroc has 4 moves:
Move:stoneedge,Type:Rock,Power:118,Acc:80%,Effect:Has an increased chance for a critical hit.
Move:psychicfangs,Type:Psychic,Power:100,Acc:100%,Effect:Destroys Reflect and Light Screen.(ineffective (0.5x damage) to oranguru)
Move:closecombat,Type:Fighting,Power:141,Acc:100%,Effect:Lowers the user's Defense and Special Defense by one stage after inflicting damage.
Move:swordsdance,Type:Normal,Status-move,Power:0,Acc:100%,Effect:Raises the user's Attack by two stages.

You have 5 pokemons:
Pokemon:gourgeistsuper,Type:Ghost and Grass,HP:100%,Attack:219,Defense:256,Special attack:147,Special defense:176,Speed:96(slower than oranguru). Moves:[powerwhip,Grass,1x damage],[poltergeist,Ghost,0x damage],[rockslide,Rock,1x damage], Electric-type attack is ineffective (0.5x damage) to gourgeistsuper. Normal, Fighting-type attack is zero effect (0x damage) to gourgeistsuper.

Make a move

Given the state of the game, I then ask the LLMs to make the best move. The matches are not in real time so the models do have time to think through their actions and not penalized for having slower response times. Here is an example response from a model given the game state:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Bewear is still at full health thanks to not being hit yet. It can probably survive one boosted attack from Dragonite. 
The best play is to go for another Ice Punch, hoping either for a freeze or a high damage roll to KO.

If Bewear attacks and doesn't KO, it will likely go down next turn. 
But switching out Bewear gives Dragonite a free turn to keep setting up or attack. 

None of the remaining Pokemon want to take +2 Outrage or Dual Wingbeat damage.

Gengar is already fainted so can't switch to it. Cinderace is faster than Bewear but is very low and will get KOed by any attack. 
The others are all slower and will take huge damage. 

Therefore, the best option is to stay in with Bewear and fire off another Ice Punch. 
We need to try to KO that Dragonite before it runs through the team with its boosts. Switching is too risky here.

Bewear, finish it off with Ice Punch! No fear!
{"move":"Ice Punch"}

The environment was able to orchestrate the moves returned in the JSON object until one side was defeated. Now, let's see how the models stacked up against each other.

The Elite Four + Champion

My previous Street Fighter experiment showed that the Claude models were the best when it came to gaming, so I wanted to see how they stacked up vs the Mistral models especially Claude Opus which just launched on Amazon Bedrock. For each match up I used the same parameters and system prompt in a generation 8 best-of-five battle format with random Pokémon.

Claude 3 Haiku vs Mixtral 8x7B

Haiku won here 3-2. It was a close match, with Haiku showcasing its speed and responsiveness. However, there were instances where Haiku made wrong moves or switched Pokémon randomly.

Claude 3 Sonnet vs Mistral Large

Sonnet won 3-2, another close one that could have gone either way. The models seemed to be on par with each other, but Mistral occasionally made mistakes or less-than-ideal choices. We'll dive deeper into these inconsistencies later in the post.

Claude 3 Opus vs Mistral Large

Opus won 4-1. While Opus took its time to respond, the extra processing paid off. It consistently made optimal moves based on the scenario and maintained a steady attacking strategy. In contrast, Mistral would randomly switch Pokémon, losing momentum and allowing Opus to capitalize on these missteps.

Winners

Opus was declared the champion!!! While being the most powerful model, comes with a higher computational cost. On average it took 21 seconds to pick a move vs 3 seconds from Haiku making it 7 times slower. However, when you need top-tier performance and intelligence from an LLM, Opus is a clear front runner.

Model	Ranking	Average Speed (seconds)
🥇 Claude 3 Opus	Champion	21
🥈 Claude 3 Sonnet	Tied 2nd	10
🥈 Mistral Large	Tied 2nd	8
🥉 Claude 3 Haiku	3rd	3
Mixtral 8x7B	4th	8

While I'd need to run more matches to get definitive data, here are the lessons I learned from watching the models battle.

Lessons learned

The LLM Pokémon battles revealed several fascinating aspects of how these models approach complex tasks, from the importance of prompt engineering to the challenges of dealing with hallucinations and sub optimal decision-making. Here are the lessons I learned:

Lesson 1: Gotta Prompt 'Em All

When I initially used the original prompt from the PokéLLMon paper tested on GPT models with Claude 3 Sonnet, the results were underwhelming, with a mere 5% win rate against the bot. To improve performance, I crafted a new system prompt following best practices, which boosted the win rate to an impressive 50%.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
You are a highly skilled and strategic Pokemon battler. Your primary goal is to make optimal move choices and switch decisions to defeat opposing Pokemon teams. Focus on knocking out the opposing Pokemon and only switch when absolutely necessary.

Your responses should have a confident, aggressive tone focused on maximizing damage output and securing KOs. Analyze the situation carefully, but prioritize attacking moves over switching whenever possible.

Given the current battle state with your active Pokemon, the opposing Pokemon, and any additional battlefield information, decide on the optimal action to take this turn - either choosing an attack move or switching to another Pokemon on your team if attacking is not viable.

Your decision should factor in:

    Type advantages/disadvantages
    Current boosts/debuffs on each Pokemon
    Entry hazards on the field
    Potential to set up for bigger damage later
    Revenge killing opportunities
    Preserving your own Pokemon's health, but not at the cost of missing KO opportunities

Use status-boosting moves like swords dance, calm mind, dragon dance, nasty plot strategically. The boosting will be reset when pokemon switch out. Set traps like stick yweb, spikes, toxic spikes, stealth rock strategically. When faced with a opponent that is boosting or has already boosted it's attack/special attack/speed, knock it out as soon as possible, even sacrificing your pokemon.    

If your active Pokemon has a reasonable chance to KO the opponent's Pokemon, even if it is low on health, prioritize attacking over switching. "Panic switching" will lead to poor outcomes and lost battles, so focus on attacking first and only switch when your active Pokemon is guaranteed to faint to the opponent's next move.

Explain your reasoning step-by-step in arriving at your chosen action, emphasizing why attacking is the optimal play whenever possible and why you are confident in your choices.
<examples>
Example 1

Your Mesprit (full HP) vs Opponent's Metagross (7% HP)

Thinking process:

    Metagross outspeeds Mesprit and can hurt with with Meteor Mash
    Mesprit's Psychic attacks are not very effective against Metagross
    Attacking has a high chance to KO Metagross, whereas switching lets it get off a free hit
    Even at low HP, Mesprit's best play is to attack

Output move: Psychic
...
</examples>

For reference here was your last move: 
{self.last_action}

Remember, your goal is to win. Be decisive and go for KOs whenever possible. Switching should be a last resort, not a go-to option. If you do switch, choose a Pokemon that can threaten the opponent or tank their hits. Seize every opportunity to deal big damage and remove opposing threats from the field. Maintain offensive pressure and don't allow unnecessary free turns. 

If your previous move was a switch think long and hard before saying to switch again, explain why you will make two switches in a row which gives the opponent two free moves. Don't worry if the opponent Pokemon is strong because of boosts do not switch twice in a row, doing so will cause you to lose the match. Also, Do not worry about preserving pokemon that will not help in the battle any more, trying to preserve a pokemon that is about to faint will cost you the match. You play to win!!!

The new prompt really emphasizes don’t switch and provides example of how to think through moves based on the current match up. Also, by giving the model a more “aggressive” tone it focused more on attacking and seemed to really get into the battles.

Therefore, the best move is to attack with Shadow Ball. It has a high chance to OHKO Alakazam from full health, and even if it doesn't, Gengar will survive thanks to Focus Sash and can then finish the job.
Don't overthink it - just click Shadow Ball and remove Alakazam from the field immediately.

This experience aligns with similar tests that improved Claude 2's accuracy on recalling data from long documents shoot up from 27% to 98% just by adding in a single phrase to the prompt.

I suspect that the Mistral models' poorer performance might be attributed to bad prompting. For further experiments, I plan to follow Mistral's best practices for prompting, to ensure a better comparison.

Lesson 2: Models will hallucinate all the time

The models hallucinated frequently during the battles. The vast complexity of the Pokémon universe, with 18 different types and over 1,000 Pokémon, led to the models making logical errors, such as:

Stating incorrect type match ups
Confusing defense with special defense
Forgetting about abilities and held items

Here are some examples of these hallucinations:

The optimal play is to maintain offensive pressure with Toxic to cripple Bronzong over time. Then Air Slash for damage while Bronzong is poisoned. Vespiquen should be able to win this matchup without much trouble. (Bronzong is a steel type and immune to poison attacks)
Looking at the rest of my team, Arcanine stands out as the best switch-in. It resists Orbeetle's Bug attacks, is immune to its Psychic attacks, and can threaten with super-effective Fire moves like Flare Blitz. (Arcanine is not immune to Psychic attacks)
While Blacephalon's Special Attack is boosted, Aurorus has Defense stat of 203, which could help mitigate some of the damage. (Wrong Stat to compare)

This highlights that as the intelligence required for a task increases, more context doesn't always help even with stronger models. For high intelligence tasks, specialized tools for example a tool that can do battle calculations would be more helpful for the model to use instead of trying to do the "math' itself.

Lesson 3: How to handle poor responses

During the battles I noticed the models would display "Panic switching" as observed in the original paper. The LLMs tend to switch Pokémon frequently vs strong or stats boosted Pokémon, giving their opponents ample time to set up and attack, ultimately leading to their defeat. Even with my prompt explicitly warning against this behavior, some models still made consecutive switches, allowing their opponents to win with ease.

The model would even justify that its worth switching:

I acknowledge switching twice in a row gives Gardevoir free turns to attack. However, preserving Turtonator and swapping in the bulkier, offensive threat in Malamar is worth it to put pressure back on Gardevoir.

These taught me while prompt engineering, can help outcomes it wont be enough to get the desired outcome all the time. In the paper they used a method which compared 3 outputs from the LLM and it chose the best result. These led to a 7% increase win rate against the bot.

In the next section, we'll explore how you can try your hand at setting up your own LLM Pokémon champion.

Get Started with the Code

Ready to build your own LLM Pokémon master? All the code and documentation you need to get started are available on GitHub. I'm excited to see how the community can improve upon this experiment by:

Tweaking prompts to optimize LLM performance
Trying out different LLMs to find the best contenders
Exploring model behaviors to gain deeper insights

If you're interested in experimenting with the winning Claude 3 models, check out my comprehensive getting started guide for detailed instructions and best practices.

Join the Conversation

Have an idea for taking this experiment to the next level? Want to share your findings or discuss the implications of LLMs in gaming and beyond? Leave a comment below and let's keep the conversation going!

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.

Select your cookie preferences

Site Terms, Privacy, and more.