
Using sparse autoencoders for LLM interpretation
Understanding LLM behavior
Randy D
Amazon Employee
Published Feb 7, 2025
Understanding LLM behavior - why an LLM gives a certain answer in a certain context - is a challenging problem. While we understand the basic architecture of an LLM, the output for any given prompt emerges from the interaction of billions of parameters. It is, in many respects, a black box.
In some cases that doesn't matter. An LLM is a useful tool, and prompt engineering techniques give us a working sense of how to drive it toward the behavior we want. But beyond its academic interest, explainability matters in practice: some scenarios require more certainty about what's driving a model's behavior. If we want to use LLMs in those scenarios in the future, we have to get better at explaining their behavior.
I've been tracking this problem space for a couple of years now. It's the subject of much research in the academic world, and LLM providers like Anthropic have also published their own work on how to understand LLM behavior.
One of the most promising techniques uses sparse autoencoders (SAEs). An SAE is trained to reconstruct an LLM's internal activations, decomposing them into thousands of "features" that represent the information the LLM has learned. These features can range from simple concepts, like understanding that a question is being asked, to more sophisticated concepts like travel or art.
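To make that concrete, here's a minimal sketch of the standard SAE architecture: a linear encoder into a much wider feature space, a ReLU, and a linear decoder back, with an L1 penalty during training to keep most features inactive. The dimensions here are illustrative, not tied to any particular model.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode activations into a wide, sparse feature space."""

    def __init__(self, d_model: int = 2048, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        # ReLU keeps feature activations non-negative; during training an
        # L1 penalty on `features` pushes most of them to zero (sparsity).
        features = torch.relu(self.encoder(acts))
        reconstruction = self.decoder(features)
        return reconstruction, features

# Training objective (sketch): reconstruct the LLM's activations while
# keeping the feature vector sparse.
sae = SparseAutoencoder()
acts = torch.randn(8, 2048)  # stand-in for real LLM activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 5e-3 * feats.abs().mean()
```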
There's a really useful open-source library called SAELens that provides tools for building and using SAEs. I built a set of Jupyter notebooks that show you concisely how to use SAELens to explore LLM behavior with a pre-trained SAE, steer model behavior with an identified feature, train your own SAE for a new LLM, and then put that new SAE to use.
To give you an example of the types of questions you can answer with an SAE, consider the prompt "Would you be able to travel through time using a wormhole?". SAELens has a pre-built SAE for the Gemma-2B model, and using that SAE, we find that our prompt activates features related to travel in general, time travel in particular, and lower-level concepts like the use of the word 'through' to express intent.
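Here's roughly what that exploration looks like in code. This is a sketch of SAELens's documented pattern of loading a pre-trained SAE and caching its feature activations; the release and hook-point strings (and whether from_pretrained returns a tuple) vary by SAELens version, so treat them as assumptions and check the library's pretrained-SAE listing.

```python
import torch
from sae_lens import SAE, HookedSAETransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load Gemma-2B through TransformerLens plus a pre-trained SAE for one of
# its residual-stream hook points (release/sae_id names are assumptions).
model = HookedSAETransformer.from_pretrained("gemma-2b", device=device)
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gemma-2b-res-jb",
    sae_id="blocks.6.hook_resid_post",
    device=device,
)

prompt = "Would you be able to travel through time using a wormhole?"
_, cache = model.run_with_cache_with_saes(prompt, saes=[sae])

# Feature activations are cached under the SAE's hook name.
feature_acts = cache[f"{sae.cfg.hook_name}.hook_sae_acts_post"]  # [batch, seq, n_features]

# The most active features at the final token are the candidates to
# inspect for travel- and time-travel-related concepts.
top = torch.topk(feature_acts[0, -1], k=10)
for idx, val in zip(top.indices.tolist(), top.values.tolist()):
    print(f"feature {idx}: {val:.2f}")
```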
After building a new SAE for one of the Qwen models, I used the same prompt, and again identified features related to travel and some scientific contexts.
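Training a new SAE like that follows a similar pattern. The sketch below uses SAELens's LanguageModelSAERunnerConfig and SAETrainingRunner; the model name, dataset, and hyperparameters are illustrative assumptions, and the exact config fields shift between SAELens versions.

```python
from sae_lens import LanguageModelSAERunnerConfig, SAETrainingRunner

# All concrete values below are illustrative assumptions: pick the model,
# hook point, dataset, and hyperparameters for your own setup.
cfg = LanguageModelSAERunnerConfig(
    model_name="Qwen/Qwen2-0.5B",               # any TransformerLens-supported model
    hook_name="blocks.12.hook_resid_post",      # activation stream to train on
    hook_layer=12,
    d_in=896,                                   # residual-stream width of that model
    dataset_path="monology/pile-uncopyrighted", # text dataset to stream tokens from
    expansion_factor=16,                        # n_features = expansion_factor * d_in
    l1_coefficient=5e-3,                        # sparsity penalty strength
    training_tokens=100_000_000,
    train_batch_size_tokens=4096,
    context_size=512,
    device="cuda",
)

sae = SAETrainingRunner(cfg).run()
```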
As a final fun example, I used an SAE to find a feature related to sailboat racing (inspired by a recent trip to San Diego). I then steered the model to weight this feature more heavily, which makes it talk about sailboats and racing more often than usual. For the prompt "Should I travel by plane or by...", the steered completion worked sailboats and racing into the answer where the standard completion did not.
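Mechanically, steering amounts to adding a scaled copy of the feature's decoder direction to the residual stream during generation. Here's a minimal sketch reusing the model and sae objects from the earlier snippet; the feature index and scale are hypothetical placeholders you'd tune by hand.

```python
from functools import partial
import torch

SAILBOAT_FEATURE = 1234  # hypothetical index found by browsing feature activations
STEERING_SCALE = 8.0     # strength of the nudge; tuned by eye

steering_vector = sae.W_dec[SAILBOAT_FEATURE]  # decoder direction, shape [d_model]

def steering_hook(resid, hook, vector, scale):
    # Nudge every token position's residual stream toward the feature.
    return resid + scale * vector

prompt = "Should I travel by plane or by"
with model.hooks(fwd_hooks=[(
    sae.cfg.hook_name,
    partial(steering_hook, vector=steering_vector, scale=STEERING_SCALE),
)]):
    print(model.generate(prompt, max_new_tokens=40))
```

Setting the scale too high tends to degrade fluency, so in practice you sweep it until the model mentions the concept without losing coherence.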
At the moment, this is mostly a thought exercise. But I encourage you to keep an eye on this field of research. There is a lot of work being done to automatically find circuits of features that more fully explain complex LLM output, and you could imagine a scenario where you can steer a model in more useful directions.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.