Language models can explain neurons in language models
Abstract
The article discusses the use of GPT-4 to automatically generate explanations for the behavior of neurons in large language models and to score those explanations. The authors release a dataset of these explanations and scores for every neuron in GPT-2. The goal is to improve the interpretability of language models and uncover additional information by looking inside the model.
Q&A
[01] Automated Explanation Generation
1. What is the motivation behind the authors' approach to interpretability research?
- The authors want to automate alignment research work itself, a promising approach because it can scale with the pace of AI development.
- As future models become increasingly intelligent and helpful as assistants, the authors hope to find better explanations for the models' behavior.
2. What are the key steps in the authors' methodology?
- The authors run three steps on every neuron:
- Explain: show GPT-4 relevant text sequences and the neuron's activations on them, and have it generate an explanation of the neuron's behavior
- Simulate: use GPT-4 to simulate what a neuron that fired for the explanation would do on the same text
- Compare: score the explanation based on how well the simulated activations match the real activations (see the scoring sketch after this list)
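Below is a minimal Python sketch of the scoring step, under the assumption that the score can be read as the correlation between the neuron's real activations and the activations a GPT-4-based simulator predicts from the explanation alone. The function name and toy data are illustrative, not the authors' implementation.

```python
# Minimal sketch of explanation scoring: compare a neuron's real activations
# on a text excerpt with the activations a simulator predicted from the
# explanation alone. Assumption: the score is their correlation coefficient.
import numpy as np

def score_explanation(real_activations, simulated_activations) -> float:
    """Return a score in [-1, 1]: how well the simulated activations
    (derived from the explanation) track the neuron's real activations."""
    real = np.asarray(real_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    if real.std() == 0 or sim.std() == 0:
        return 0.0  # degenerate case: a constant series carries no signal
    return float(np.corrcoef(real, sim)[0, 1])

# Toy usage: a simulation that roughly tracks the real activations scores near 1.
real = np.array([0.0, 0.1, 2.3, 0.0, 1.8, 0.0])
sim = np.array([0.1, 0.0, 2.0, 0.2, 1.5, 0.1])
print(score_explanation(real, sim))
```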
3. What are some of the limitations of the current approach?
- Neurons may have very complex behavior that is impossible to describe succinctly
- The approach only explains neuron behavior as a function of the original text input, without saying anything about its downstream effects
- The overall procedure is quite compute intensive
4. What are the authors' plans for future extensions and generalizations of their approach?
- The authors want to eventually automatically find and explain entire neural circuits implementing complex behaviors
- They want to use models to form, test, and iterate on fully general hypotheses, just as an interpretability researcher would
- The ultimate goal is to use these techniques to interpret their largest models as a way to detect alignment and safety problems before and after deployment
[02] Dataset Release and Scoring
1. What dataset are the authors releasing?
- The authors are releasing a dataset of the (imperfect) GPT-4-written explanations and scores for all 307,200 neurons in GPT-2.
2. How well do the GPT-4-generated explanations perform?
- The authors found that the vast majority of their explanations score poorly, with an average score of 0.34.
- However, they were able to improve scores by iterating on the explanations, using larger models to generate the explanations, and changing the architecture of the explained model.
- The authors found over 1,000 neurons with explanations that scored at least 0.8, meaning the explanation accounts for most of each neuron's top-activating behavior (a filtering sketch follows this list).
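As a rough illustration of how such a released dataset might be filtered for well-explained neurons, the sketch below assumes a hypothetical JSONL file with `layer`, `neuron`, `explanation`, and `score` fields; the actual dataset's layout and field names may differ.

```python
# Sketch: scan a hypothetical JSONL export of neuron explanations and keep
# only those whose score clears a threshold (e.g. 0.8, as discussed above).
import json

def top_scoring_neurons(path: str, threshold: float = 0.8):
    """Yield (layer, neuron, explanation, score) for well-explained neurons."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["score"] >= threshold:
                yield (record["layer"], record["neuron"],
                       record["explanation"], record["score"])

# Example usage (file name is hypothetical):
# for layer, neuron, text, score in top_scoring_neurons("explanations.jsonl"):
#     print(f"layer {layer}, neuron {neuron} ({score:.2f}): {text}")
```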
3. What are the authors' hopes for the research community's use of the released dataset and tools?
- The authors hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using the explanations.
- As the explanations improve, they hope to be able to rapidly uncover interesting qualitative understanding of model computations.