Prism: mapping interpretable concepts and features in a latent space of language | thesephist.com
Abstract
The article explores a scalable, automated way to directly probe embedding vectors representing sentences in a small language model and "map out" what human-interpretable attributes are represented by specific directions in the model's latent space. It discusses the key ideas, methodology, results and applications, caveats and limitations, and future work of this research.
Q&A
[01] Key ideas
1. What are the key ideas presented in the article?
- Sparse autoencoders can discover tens of thousands of human-interpretable features in text embedding models.
- Interventions in latent space enable precise and coherent semantic edits to text that compose naturally.
- Large language models can automatically label and score their own explanations for features discovered with sparse autoencoders, and this process can be made much more efficient with a new approach called normalized aggregate scoring.
2. What are some examples of the human-interpretable features discovered by the sparse autoencoders? The article provides a "Feature gallery" of discovered features, including:
- Topic/subject matter features (e.g. legal concepts, cellular biology, trains, baking)
- Sentiment and tone features
- Specific word or phrasing features (e.g. starts with 'L', uses 'despite' for contrast)
- Punctuation, grammar, and text structure features (e.g. first-person quotes, comma-separated adjectives)
- Number, date, and counting features
- Natural and programming language features
3. How does the author use these discovered features to enable semantic text editing? The author demonstrates precise semantic edits made by moving an embedding in the direction of specific features, such as turning a statement into a question using the "Interrogative sentence structure" feature, and introduces a novel "feature gradients" approach that uses gradient descent to make these edits even more precise.
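A minimal sketch of the basic intervention, assuming a trained sparse autoencoder whose decoder rows are feature directions; the names `sae_decoder`, `embed`, and `generate` are illustrative stand-ins, not the author's actual code:

```python
import torch

def edit_toward_feature(embedding: torch.Tensor,
                        sae_decoder: torch.Tensor,
                        feature_idx: int,
                        strength: float = 5.0) -> torch.Tensor:
    """Nudge a sentence embedding along one interpretable feature direction.

    `sae_decoder` is assumed to have shape (n_features, d_embed), one
    feature direction per row; `strength` is an illustrative guess.
    """
    direction = sae_decoder[feature_idx]
    direction = direction / direction.norm()  # work with a unit-norm direction
    return embedding + strength * direction

# Hypothetical usage: push a declarative sentence toward an
# "interrogative sentence structure" feature, then decode back to text
# with the embedding model's companion decoder:
#   edited = edit_toward_feature(embed("The sky is blue."), decoder_weights, 1234)
#   print(generate(edited))
```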
[02] Methodology
1. How does the author train the sparse autoencoders? The author uses a two-layer autoencoder whose hidden layer is 8x, 16x, 32x, or 64x wider than the input/output embedding. The training process (a code sketch follows this list) involves:
- Using a medium-sized English dataset (Minipile) to generate 31M sentence embeddings
- Training the sparse autoencoder to reconstruct the embeddings while also optimizing for sparsity
- Monitoring the log feature density histogram during training to look for a "second bump" indicating good, interpretable features
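A hedged sketch of what such an autoencoder and training step might look like in PyTorch; the specifics here (ReLU activations, an L1 sparsity penalty, the `l1_coef` value) are common choices in the sparse-autoencoder literature and assumptions on my part, not necessarily the author's exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Two-layer autoencoder; the hidden layer is 8-64x wider than the embedding."""
    def __init__(self, d_embed: int, expansion: int = 16):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_embed * expansion)
        self.decoder = nn.Linear(d_embed * expansion, d_embed)

    def forward(self, x: torch.Tensor):
        feats = F.relu(self.encoder(x))  # sparse, non-negative feature activations
        return self.decoder(feats), feats

def train_step(sae, optimizer, batch, l1_coef=1e-3):
    """One step: reconstruct the embeddings while penalizing dense activations."""
    recon, feats = sae(batch)
    loss = F.mse_loss(recon, batch) + l1_coef * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The log feature density histogram mentioned above can be computed from `feats` over a held-out batch: a feature's density is the fraction of inputs on which it activates above some small threshold, and the histogram plots the log of those densities across all features.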
2. How does the author use GPT-4 to automatically label and score the discovered features? The author uses a two-step process (a scoring sketch follows this list):
- Labeling with chain-of-thought: GPT-4 is shown the top 50 highest-activating examples for a feature and asked to describe the common attribute.
- Normalized aggregate scoring: GPT-4 is asked what fraction of high-activating examples and what fraction of non-activating examples fit the auto-generated label; the difference between the two fractions serves as a normalized confidence score.
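In code, normalized aggregate scoring reduces to a difference of two fractions. This sketch stubs out the GPT-4 call, since the exact prompt is the post's own and not reproduced here:

```python
def fraction_fitting_label(label: str, examples: list[str]) -> float:
    """Placeholder for one GPT-4 call: show the model `examples`, ask what
    fraction of them fit `label`, and parse the number it returns."""
    raise NotImplementedError  # wire up to your LLM of choice

def normalized_aggregate_score(label: str,
                               high_activating: list[str],
                               non_activating: list[str]) -> float:
    """Label confidence: the fraction of high-activating examples that fit
    the label, minus the fraction of non-activating examples that also fit
    it (which penalizes labels so generic they match everything)."""
    return (fraction_fitting_label(label, high_activating)
            - fraction_fitting_label(label, non_activating))
```

Because each fraction comes from a single aggregate judgment rather than one model call per example, this is what makes the scoring step much cheaper than per-example approaches.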
[03] Results and applications
1. What are some of the key findings from studying the discovered features?
- Larger models have more interpretable features, and the features become more specific as model size increases.
- The features can provide insights into what the embedding model "sees" in a given input text.
- The features can be used to explain why certain embeddings are closer together or how adding context affects the encoding (see the sketch below).
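One way to operationalize that kind of explanation, reusing the trained autoencoder from the earlier sketch: encode both embeddings and list the features they both activate. The threshold is an illustrative guess, not a value from the post:

```python
import torch

def shared_features(sae, emb_a: torch.Tensor, emb_b: torch.Tensor,
                    threshold: float = 0.1) -> list[int]:
    """Indices of sparse features active in both embeddings; a rough way to
    explain why two texts land near each other in latent space."""
    _, feats_a = sae(emb_a)
    _, feats_b = sae(emb_b)
    both_on = (feats_a > threshold) & (feats_b > threshold)
    return both_on.nonzero(as_tuple=True)[-1].tolist()
```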
2. How does the author use the discovered features for semantic text editing? The author demonstrates several precise semantic edits made by moving an embedding in the direction of specific features, such as adding nautical terms, introducing randomness, or translating to French, and shares a "feature gradients" approach that uses gradient descent to make these edits even more precise.
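The post does not spell out the feature-gradients implementation, but one plausible reading, under the assumption that it optimizes the embedding directly against the autoencoder's activation for a target feature, looks like this (every hyperparameter here is a guess):

```python
import torch

def feature_gradient_edit(sae, embedding: torch.Tensor, feature_idx: int,
                          target_activation: float = 10.0,
                          steps: int = 100, lr: float = 1e-2) -> torch.Tensor:
    """Gradient-descend on the embedding until the target feature reaches
    the desired activation, while staying close to the original embedding."""
    edited = embedding.clone().requires_grad_(True)
    opt = torch.optim.Adam([edited], lr=lr)
    for _ in range(steps):
        _, feats = sae(edited)
        act = feats[..., feature_idx]
        # The 0.1 proximity weight is an assumption; it keeps unrelated
        # attributes of the text from drifting during the edit.
        loss = ((act - target_activation) ** 2).mean() \
             + 0.1 * (edited - embedding).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return edited.detach()
```

Compared with adding a fixed feature direction, optimizing against the autoencoder's own readout lets the edit account for interactions between features, which is consistent with the post's claim that feature gradients make edits more precise.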
3. What are the limitations and caveats of this work? The author notes that sparse autoencoders are a relatively new technique, so the current approach is more of a proof-of-concept than a rigorous claim about the interpretability of embedding spaces. The author also acknowledges that the automated labeling and scoring approach is less rigorous than OpenAI's work, and that the feature labeling may miss some nuances due to only considering top-activating examples.
[04] Future work
1. What are the author's plans for future work? The author plans to:
- Scale the sparse autoencoder approach to larger, more production-grade models like Llama 3 8B and OpenAI's CLIP
- Explore extending the techniques to other modalities like images, audio, and code
- Investigate advancements in sparse autoencoder architecture and training recipes
- Deepen the understanding of how features and feature spaces relate across different models and datasets
2. What are the broader implications and design possibilities the author envisions from this work? The author is excited about the potential to use the discovered features and semantic editing capabilities to build richer, more direct interfaces for interacting with foundation models and information. This could include visualizing feature activations across documents, enabling "copy-paste" of stylistic elements, and treating semantics, style, and structure as first-class formatting controls in creative applications.