Can Claude Play Pokemon? Kind of
LLMs still struggle with spatial reasoning
Banjo Obayomi
Amazon Employee
Published May 20, 2024
In 2014, the internet witnessed a phenomenon known as Twitch Plays Pokémon. This social experiment turned a solo video game adventure into a chaotic, collective endeavor where thousands of players controlled a single game of Pokémon Red through chat commands. The result was an unpredictable yet fascinating experience that earned a Guinness World Record, with 1,165,140 players participating at once.
But what if we could recreate this experiment using the latest in generative AI? Enter Claude 3, a multi-modal language model capable of processing both text and images. By leveraging the PokémonRedExperiments library, I embarked on a journey to create a bot that could take snapshots of the game state and determine the next moves in Pokémon Red. Here's how it went.
In previous experiments, I transformed the game state into text to make it easier for the LLM to process. However, for this project, I wanted to see how well the model could understand the game state directly from images. The goal was to keep the setup as light as possible and test the limits of Claude 3’s capabilities.
To achieve this, I needed two helper functions to translate the model's text output into valid actions for the game. This is where Amazon Q Developer, a generative AI assistant for coding in the IDE, proved invaluable.
By describing the function inputs and outputs, I could generate the necessary code quickly and accurately. For example, I asked Amazon Q Developer
Can you write a function that, given two arrays, returns the indexes of the matching items, e.g.

```python
next_moves = ["DOWN", "DOWN", "DOWN"]
valid_actions = ["DOWN", "LEFT", "RIGHT", "UP", "A", "B", "START"]
values = get_action_indices(next_moves, valid_actions)
# values = [0, 0, 0]
```
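The generated helper looked roughly like this (a minimal sketch; the exact code Q Developer produced may have differed):

```python
def get_action_indices(next_moves, valid_actions):
    """Map each requested move to its index in the emulator's action list,
    skipping anything that isn't a valid button."""
    return [valid_actions.index(move) for move in next_moves if move in valid_actions]

valid_actions = ["DOWN", "LEFT", "RIGHT", "UP", "A", "B", "START"]
print(get_action_indices(["DOWN", "DOWN", "DOWN"], valid_actions))  # [0, 0, 0]
```

The emulator wrapper then consumes those indices as button presses.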
This resulted in code that fit perfectly into the test bed. This is a great prompting technique you can leverage when you know what the inputs and outputs will be. Amazon Q Developer is currently leading SWE-bench, a benchmark built to see if LLMs can solve real-world GitHub issues. I'm a "Gen AI"-first builder, and I like using tools like Q to enhance my workflow.
With the code working, the setup now involved taking a snapshot of the game screen, feeding it into Claude using Amazon Bedrock, and receiving the suggested button presses in return.
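A single step of that loop can be sketched as follows (the model ID and prompt wording are illustrative, not the exact ones from my setup; the request body follows the Anthropic Messages format that Bedrock expects for Claude 3):

```python
import base64
import json

def build_request(screenshot_b64, valid_actions):
    """Build a Claude 3 multi-modal request body (Anthropic Messages format)."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                # The raw game frame, base64-encoded
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png",
                            "data": screenshot_b64}},
                # The instruction asking for button presses
                {"type": "text",
                 "text": "This is the current Pokemon Red screen. "
                         f"Reply with the next button presses, chosen from {valid_actions}."},
            ],
        }],
    }

def suggest_moves(screenshot_path):
    """Send one game frame to Claude 3 on Bedrock and return its reply text."""
    import boto3  # AWS SDK; requires configured credentials
    with open(screenshot_path, "rb") as f:
        img = base64.b64encode(f.read()).decode()
    body = build_request(img, ["DOWN", "LEFT", "RIGHT", "UP", "A", "B", "START"])
    client = boto3.client("bedrock-runtime")
    resp = client.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(resp["body"].read())["content"][0]["text"]
```

The reply text is then parsed into moves and mapped to emulator actions with the helper above.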
One of the immediate challenges was the model's spatial understanding, or lack thereof. Claude struggled to navigate the game environment and frequently got stuck in loops. This was particularly evident when trying to get past Professor Oak's Lab, a relatively simple task for a human player but a significant hurdle for Claude.
I experimented with slight updates to the prompt to improve performance. I also attempted to incorporate lessons from the recent paper on “Visualization of Thought,” which discusses how LLMs can navigate tile grids. Despite these efforts, the model still couldn't navigate the game.
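One of the variants I tried, loosely following the paper's idea, asked the model to draw its surroundings as a grid before committing to a move. The wording below is an illustrative sketch, not the exact prompt from the paper or my runs:

```python
def vot_prompt(valid_actions):
    """A Visualization-of-Thought-style prompt: sketch the map, then act."""
    return (
        "Before choosing any buttons, redraw the visible map as an ASCII grid, "
        "marking the player as P, walls as #, and the exit as E. "
        "Then reason step by step about which direction moves P toward E, "
        f"and finally output the button presses, chosen from {valid_actions}."
    )
```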
Testing other models like GPT-4V, Gemini, and LLaVA yielded similar results. The experiments highlighted a fundamental issue: current LLMs lack spatial reasoning abilities.
This aligns with the observations made by Yann LeCun, Chief AI Scientist at Meta, who has provided numerous examples, such as an LLM failing to reason about directions when moving from the North Pole.
Given the challenges faced by the LLM when playing independently, I turned to the concept of Twitch Plays Pokémon for inspiration. What if I could create a system where Claude would read commands from a Twitch stream and summarize them to determine the next actions?
To test this idea, I set up a Twitch bot using tmi.js. This bot read text from a Twitch channel and fed the commands into Claude’s prompt. Claude would then execute the moves based on the summarized commands. The results were significantly better than the autonomous attempts.
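The bot itself ran on tmi.js on the Node side, but the core idea, collecting a window of chat commands and condensing them before they reach Claude, can be sketched in Python (the function and message format here are illustrative):

```python
from collections import Counter

VALID = {"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "START"}

def summarize_chat(messages, top_n=3):
    """Tally valid button commands from a batch of chat messages and return
    the most requested ones, which are then handed to Claude's prompt."""
    votes = Counter(
        word for msg in messages
        for word in msg.upper().split()
        if word in VALID
    )
    return [move for move, _ in votes.most_common(top_n)]

chat = ["down down", "down", "left", "a"]
print(summarize_chat(chat))  # ['DOWN', 'LEFT', 'A']
```

Claude receives the summarized list, comments on what chat wants, and the chosen moves are mapped to emulator actions as before.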
This hybrid approach is similar to Twitch Plays Pokémon but adds a fun layer where the model can comment on what chat is saying and then take actions.
This experiment with Claude playing Pokémon Red highlighted several important insights about the capabilities and limitations of current LLMs:
- Spatial Reasoning in LLMs: One of the most significant takeaways from this experiment is the current limitation of LLMs in spatial reasoning. Navigating a game environment involves understanding and interpreting spatial relationships, something these models are not yet adept at.
- Prompt Engineering: Changes in the prompt can lead to varying degrees of success; in my earlier Pokémon battle experiment, prompt updates raised the win rate from 5% to 50%. However, there is still a lot to learn about optimizing prompts for specific tasks, especially those involving visual inputs.
- Human-AI Collaboration: Combining human input with LLM processing can lead to better outcomes. The Twitch integration demonstrated that while the LLM might struggle with certain tasks independently, it can enhance its actions by processing and executing commands from human input. Amazon Bedrock Agents recently added the "return to control" feature to support this.
While Claude 3 and other LLMs are not yet capable of playing Pokémon Red autonomously, the experiment provided valuable insights into the current capabilities and limitations of these models.
There is still much to learn and explore in this space. If there is enough interest, we could set up a full stream to see if the collective efforts of a Twitch audience, mediated by Claude, can beat Pokémon Red.
For those interested in experimenting further, all the code and documentation are available on GitHub.
Let me know in the comments if you want us to set up a stream of Claude playing Pokémon!
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.