Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
🌈 Abstract
The article discusses the use of filler tokens (e.g. "......") in transformer language models, and how they can provide computational benefits independent of the choice of tokens. The key points are:
- Chain-of-thought responses from language models can improve performance, but it's unclear if this is due to human-like task decomposition or just the greater computation allowed by additional tokens.
- The authors show that transformers can use meaningless filler tokens to solve algorithmic tasks they could not solve without intermediate tokens (illustrative prompt formats for the different conditions are sketched after these key points).
- However, learning to use filler tokens is difficult and requires specific, dense supervision.
- The authors theoretically characterize the class of problems where filler tokens help, in terms of the quantifier depth of the first-order formula that defines the problem.
- The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations.
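To make the comparison concrete, here is a minimal Python sketch of the three prompting conditions contrasted above; the question string, token formats, and filler length are illustrative assumptions, not the paper's exact data format.

```python
# Illustrative prompt formats for the three conditions discussed above.
# All strings here are assumptions for illustration, not the paper's data.

question = "2 7 -9 : do any three of these numbers sum to 0?"

# 1) Immediate answer: no intermediate tokens, so the model gets no extra
#    sequence positions to compute over before committing to a label.
immediate_prompt = f"{question} Answer:"

# 2) Chain of thought: intermediate tokens carry human-readable reasoning.
cot_prompt = f"{question} Reasoning: 2 + 7 + (-9) = 0, so yes. Answer:"

# 3) Filler tokens: the intermediate tokens are meaningless dots, yet they
#    still give the transformer additional positions (and thus additional
#    computation) before the answer token is produced.
filler_prompt = f"{question} {'. ' * 12}Answer:"

for p in (immediate_prompt, cot_prompt, filler_prompt):
    print(p)
```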
🙋 Q&A
[01] Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
1. What is the motivation for studying the use of filler tokens in transformer language models? The motivation is to understand whether the performance gains from chain-of-thought responses are due to human-like task decomposition or just the greater computation allowed by additional tokens. By studying the use of meaningless filler tokens, the authors aim to determine if additional tokens can provide computational benefits independent of the token choice.
2. What are the key findings regarding the use of filler tokens? The key findings are:
- Transformers can use meaningless filler tokens to solve algorithmic tasks they couldn't solve without intermediate tokens.
- However, learning to use filler tokens is difficult and requires specific, dense supervision. Standard chain-of-thought data is insufficient for models to learn to leverage filler tokens effectively.
- The authors theoretically characterize the class of problems where filler tokens help, in terms of the quantifier depth of the first-order formula that defines the problem (see the formula sketch at the end of this Q&A).
3. What are the implications of the finding that intermediate tokens can act as filler tokens? The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens. This undermines the reliance on purely behavioral alignment methods that judge or compare model output tokens.
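As a rough illustration of what quantifier depth means here (the formula below is a sketch, not the paper's formal statement), 3SUM over inputs x_1, …, x_n can be written as a first-order formula with three nested existential quantifiers, giving quantifier depth three:

```latex
% 3SUM expressed in first-order logic: three nested existential
% quantifiers over index variables, hence quantifier depth 3.
\exists i \,\exists j \,\exists k \;\bigl( x_i + x_j + x_k = 0 \bigr)
```

Intuitively, each additional nested quantifier multiplies the number of index combinations that must be checked, which is where the extra parallel computation afforded by filler-token positions can help.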
[02] Synthetic data: 3SUM and 2SUM
1. What are the 3SUM and 2SUM-Transform tasks, and why were they chosen? In the 3SUM task, the model must decide whether any three of the input numbers sum to zero; in the 2SUM-Transform task, the inputs are first transformed in a way that blocks in-place computation over the input tokens, and the model must decide whether any pair of the transformed numbers sums to zero. These tasks were chosen because they are theoretically informative: 3SUM likely requires more computation than a single forward pass provides, while 2SUM-Transform rules out computing over the input tokens in place (a minimal sketch of both tasks appears after this Q&A).
2. What are the key findings from the experiments on these synthetic tasks? The key findings are:
- On the 3SUM task, transformers can solve the problem with filler tokens but fail without them, demonstrating the computational benefits of filler tokens.
- On the 2SUM-Transform task, transformers with filler tokens significantly outperform those without.
- Learning to use filler tokens is difficult and requires specific, dense supervision. Standard chain-of-thought data is insufficient for models to learn to leverage filler tokens effectively.
3. How do the results on the synthetic tasks inform the understanding of filler token usage in language models? The results on the synthetic tasks demonstrate that transformers can benefit from filler tokens to solve certain types of problems, particularly those that require resolving many nested quantifiers. However, the difficulty in learning to use filler tokens suggests that current language models are unlikely to spontaneously discover and utilize filler token computations, unless provided with appropriate training data.
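To make the two task definitions concrete, here is a minimal Python sketch of label functions for both tasks (referenced in Q&A 1 above). The additive-offset transform, the example values, and all function names are illustrative assumptions, not the paper's exact construction.

```python
from itertools import combinations

def three_sum_label(xs):
    """3SUM: is there any triple of inputs that sums to zero?"""
    return any(a + b + c == 0 for a, b, c in combinations(xs, 3))

def two_sum_transform_label(xs, offset):
    """2SUM-Transform: each input is first transformed (here modeled as a
    simple additive shift whose value is revealed only after the inputs),
    then the question is whether any pair of transformed inputs sums to
    zero. Because the transform is not known while reading the inputs,
    the pairwise computation cannot be done in place over the input
    token positions."""
    ys = [x + offset for x in xs]
    return any(a + b == 0 for a, b in combinations(ys, 2))

# Example usage with made-up instances
print(three_sum_label([2, 7, -9]))          # True: 2 + 7 + (-9) = 0
print(two_sum_transform_label([-5, 3], 1))  # True: (-5 + 1) + (3 + 1) = 0
```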