Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability
Abstract
The article introduces "logit prisms", a method for breaking the outputs of a transformer model down into the contributions of its individual components, making the model's behavior easier to interpret. The key idea is to treat the non-linear activations in the network as constants, so that the output logits can be expressed as a sum of linear projections. The contributions of different parts of the model, such as attention heads, MLP neurons, and input embeddings, can then be calculated and analyzed separately. The authors demonstrate the approach through two examples analyzing the behavior of the gemma-2b model.
Q&A
[01] Introduction
1. What is the logit lens and how does the article extend it?
- The logit lens is a technique for inspecting how a transformer model arrives at its predictions by projecting intermediate residual-stream activations through the unembedding matrix to read them as logits.
- The article extends the logit lens in a mathematically rigorous and effective way: by treating certain network activations as constants, it exploits the linearity of the remaining operations to break the logit output down into the contributions of individual components.
2. What are the "prisms" introduced in the article?
- The article introduces simple "prisms" for the residual stream, attention layers, and MLP layers that allow the authors to calculate how much each component contributes to the final logit output.
3. What are the two illustrative examples presented in the article?
- The first example examines how the gemma-2b model performs the task of retrieving a capital city from a country name.
- The second example explores how the gemma-2b model adds two small numbers (ranging from 1 to 9).
[02] Method
1. What is the key idea behind the logit prisms approach?
- The key idea is to treat nonlinear activations as constants, so that the contribution of any component of the network can be computed through a series of linear transformations (see the sketches after this list).
2. How does the article decompose the residual stream?
- By treating the final normalization factor as a constant, the article splits the output logits into separate terms for the input embedding and for each attention and MLP layer (see the residual-stream sketch after this list).
3. How does the article decompose the attention mechanism?
- The article breaks the attention output down into a sum over attention heads, and further decomposes each head's output into a weighted sum of per-token circuit outputs across the sequence (see the attention sketch after this list).
4. How does the article decompose the MLP layers?
- By treating the scaling vector and normalization factor as constants, the article breaks the MLP output down into a sum of individual neuron contributions (see the MLP sketch after this list).
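The following sketches illustrate the three prisms described above. All notation, shapes, and weights below are assumptions made for illustration; none of it is taken from the article's code. The linearization idea itself fits in one line: if the final residual stream x is a sum of component outputs c_i and the normalization factor is frozen at its observed value, the logits split into one term per component.

```latex
% Assumed notation: W_U = unembedding matrix, \gamma = norm scale vector,
% \sigma = observed normalization factor (treated as a constant),
% c_i = the output written into the residual stream by component i.
\mathrm{logits} \;=\; W_U\,\mathrm{norm}(x)
  \;\approx\; W_U\,\frac{\gamma \odot x}{\sigma}
  \;=\; \sum_i W_U\,\frac{\gamma \odot c_i}{\sigma},
  \qquad x = \sum_i c_i .
```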
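A minimal numpy sketch of the residual-stream prism under these assumptions: the final residual stream is the input embedding plus the per-layer attention and MLP outputs, so with the normalization factor frozen, each of those terms gets its own additive slice of the logits.

```python
# Toy residual-stream prism: assumed shapes and random weights, not a real model.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 4

W_U = rng.normal(size=(vocab, d_model))          # unembedding
gamma = rng.normal(size=d_model)                 # final norm scale
embed = rng.normal(size=d_model)                 # token + position embedding
attn_out = rng.normal(size=(n_layers, d_model))  # per-layer attention outputs
mlp_out = rng.normal(size=(n_layers, d_model))   # per-layer MLP outputs

x_final = embed + attn_out.sum(0) + mlp_out.sum(0)
sigma = np.sqrt((x_final ** 2).mean())           # RMS norm factor, treated as a constant

def logit_contribution(component):
    """Project one residual-stream component through the frozen norm and unembedding."""
    return W_U @ (gamma * component / sigma)

parts = [logit_contribution(embed)]
parts += [logit_contribution(a) for a in attn_out]
parts += [logit_contribution(m) for m in mlp_out]

full_logits = W_U @ (gamma * x_final / sigma)
assert np.allclose(sum(parts), full_logits)      # component contributions sum to the total
```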
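The same idea for the attention prism: with the softmax attention pattern treated as a constant, a head's output at the query position is a weighted sum of per-token circuit outputs, so each source token gets its own logit contribution. Shapes and weights are again toy assumptions.

```python
# Toy attention prism for a single head: assumed shapes and random weights.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, seq_len, vocab = 16, 4, 5, 50

W_V = rng.normal(size=(d_head, d_model))         # value projection
W_O = rng.normal(size=(d_model, d_head))         # output projection
W_U = rng.normal(size=(vocab, d_model))          # unembedding
xs = rng.normal(size=(seq_len, d_model))         # residual stream at each position

scores = rng.normal(size=seq_len)
attn = np.exp(scores) / np.exp(scores).sum()     # attention pattern, treated as a constant

# Per-token contribution of this head to the residual stream, then to the logits.
per_token = np.array([attn[t] * (W_O @ (W_V @ xs[t])) for t in range(seq_len)])
per_token_logits = per_token @ W_U.T             # (seq_len, vocab)

head_out = per_token.sum(0)
assert np.allclose(W_U @ head_out, per_token_logits.sum(0))
```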
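And for the MLP prism: the layer's output is the down-projection of its hidden activations, i.e. a sum of activation-weighted output directions, so each neuron gets its own logit contribution. The ReLU here is a stand-in for the model's actual activation function.

```python
# Toy MLP prism: assumed shapes and random weights.
import numpy as np

rng = np.random.default_rng(2)
d_model, d_hidden, vocab = 16, 64, 50

W_up = rng.normal(size=(d_hidden, d_model))      # up projection
W_down = rng.normal(size=(d_model, d_hidden))    # down projection
W_U = rng.normal(size=(vocab, d_model))          # unembedding
x = rng.normal(size=d_model)                     # residual stream input

h = np.maximum(W_up @ x, 0.0)                    # neuron activations (ReLU stand-in)
mlp_out = W_down @ h

# Neuron j contributes h[j] * W_down[:, j] to the residual stream.
per_neuron = h[:, None] * W_down.T               # (d_hidden, d_model)
per_neuron_logits = per_neuron @ W_U.T           # (d_hidden, vocab)

assert np.allclose(per_neuron.sum(0), mlp_out)
assert np.allclose(per_neuron_logits.sum(0), W_U @ mlp_out)
```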
[03] Examples
1. What insights does the article gain about the gemma-2b model's retrieval of capital cities?
- The article finds that the gemma-2b model represents country names and capital-city names in such a way that a country's embedding can be transformed into its capital's unembedding by a simple linear projection (a hypothetical check is sketched after this list).
2. What insights does the article gain about the gemma-2b model's addition of small numbers?
- The article finds that the network predicts output numbers using interpretable templates learned by MLP neurons; when several neurons activate at once, their per-neuron predictions interfere with one another, and the combined prediction peaks at the correct number.
3. How does the article visualize the digit unembedding space?
- The article visualizes the digit unembedding space in 2D and finds that the digits form a heart-like shape with reflectional symmetry around the 0-5 axis (a rough recipe for this kind of plot is sketched after this list).
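One hypothetical way to probe the "country embedding to capital unembedding via a linear map" picture from the first example. Every name here (`embed`, `unembed`, `W_path`) is assumed for illustration and does not come from the article.

```python
# Hypothetical check of the "country embedding -> capital unembedding" picture.
# `embed(token)` and `unembed(token)` are assumed helpers returning the model's
# embedding and unembedding vectors as arrays, and `W_path` is the frozen linear
# map traced through the network. None of these names come from the article.
def nearest_capital(country_token, capital_tokens, embed, unembed, W_path):
    """Return the capital token whose unembedding best matches W_path @ embed(country)."""
    projected = W_path @ embed(country_token)
    scores = {c: float(unembed(c) @ projected) for c in capital_tokens}
    return max(scores, key=scores.get)

# e.g. nearest_capital(" France", [" Paris", " Rome", " Berlin"], embed, unembed, W_path)
```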
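A rough recipe for the kind of 2D digit-unembedding plot described above, assuming a hypothetical `unembedding` matrix and `digit_token_ids` lookup rather than anything from the article.

```python
# Project the unembedding vectors of the digit tokens "0"-"9" into 2D with PCA.
import numpy as np
import matplotlib.pyplot as plt

def plot_digit_unembeddings(unembedding, digit_token_ids):
    # Collect the unembedding row for each digit token and center them.
    vecs = np.stack([unembedding[digit_token_ids[d]] for d in range(10)])
    centered = vecs - vecs.mean(axis=0)
    # Top-2 principal components via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    for d, (x, y) in enumerate(coords):
        plt.scatter(x, y)
        plt.annotate(str(d), (x, y))
    plt.show()
```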
[04] Conclusion
1. What is the main contribution of the article?
- The main contribution of the article is the introduction of "logit prisms", a simple but effective way to break down transformer outputs and make them easier to interpret.
2. What key insights did the article gain by applying logit prisms to the gemma-2b model?
- The article gained valuable insights into how the gemma-2b model works internally, including how it represents and retrieves factual information, and how it performs arithmetic operations.