Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
🌈 Abstract
The article investigates the impact of vocabulary size on the scaling of large language models (LLMs). It proposes three complementary approaches to predict the compute-optimal vocabulary size, jointly modeling non-vocabulary parameters, vocabulary parameters, and the amount of training data under a given compute budget. The key findings are:
- Vocabulary parameters should be scaled more slowly than non-vocabulary parameters as the compute budget grows.
- Most existing LLMs use vocabulary sizes smaller than optimal according to the proposed approaches.
- Adopting the predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes.
🙋 Q&A
[01] Scaling Laws with Vocabulary
1. What are the key attributes considered in the scaling laws for language models? The key attributes, made concrete in the sketch after this list, are:
- Non-vocabulary parameters (θ_nv)
- Vocabulary parameters (θ_v)
- Number of training characters (D)
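A minimal Python sketch of how these quantities could be counted for a decoder-only Transformer. The 12·d² per-layer estimate, the embedding-tying choice, and the example configuration are illustrative assumptions, not the paper's exact accounting:

```python
def parameter_split(d_model: int, n_layers: int, vocab_size: int, tied: bool = True):
    """Rough split of Transformer parameters into vocabulary (theta_v)
    and non-vocabulary (theta_nv) parts; assumed accounting, for illustration only."""
    # Vocabulary parameters: input embedding, plus the output head if untied.
    theta_v = vocab_size * d_model * (1 if tied else 2)
    # Non-vocabulary parameters: attention + MLP blocks, roughly 12 * d_model^2 per layer.
    theta_nv = 12 * n_layers * d_model ** 2
    return theta_nv, theta_v

# Example: a roughly 3B-scale configuration (illustrative numbers only).
theta_nv, theta_v = parameter_split(d_model=3072, n_layers=28, vocab_size=32_000)
print(f"theta_nv ~ {theta_nv / 1e9:.2f}B, theta_v ~ {theta_v / 1e6:.0f}M")
```

Measuring the training data D in characters rather than tokens presumably keeps the data budget itself independent of whichever tokenizer is being compared.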
2. Why is it important to incorporate the vocabulary size when studying scaling laws for language models? Prior work on scaling laws has generally overlooked the impact of vocabulary size, leading to substantial variability in vocabulary sizes across existing LLMs. The article argues that the vocabulary size affects performance non-trivially and needs to be considered jointly with model parameters and training data.
3. How does the article propose to normalize the language modeling loss to enable fair comparison across models with different vocabulary sizes? The article introduces the unigram-normalized language modeling loss, which measures the model's loss relative to a unigram (frequency-only) baseline rather than in raw nats per token. This makes the metric largely insensitive to the vocabulary size, so the language model's efficacy can be assessed independently of the vocabulary it uses.
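As a hedged illustration only (the summary does not give the exact formula): the common definition of unigram-normalized perplexity compares the model's per-token log-probability against a frequency-only unigram baseline, which in log space looks like the following sketch.

```python
import math

def unigram_normalized_loss(model_logprobs, token_ids, unigram_probs):
    """Average of -(log p_model - log p_unigram) over the evaluated tokens.
    Assumed reading of the 'unigram-normalized' loss: the unigram baseline
    absorbs the effect of the vocabulary size, so models with different
    vocabularies can be compared on the same footing."""
    total = 0.0
    for logprob, token_id in zip(model_logprobs, token_ids):
        total -= logprob - math.log(unigram_probs[token_id])
    return total / len(token_ids)
```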
[02] Analysis: Why the Optimal Vocabulary Size is Bounded by Compute
1. What are the three key perspectives provided in the analysis to explain why the optimal vocabulary size is constrained by the computational budget?
- Fixed normalized loss perspective: Increasing the vocabulary size initially improves tokenization (fewer tokens are needed per character, i.e., lower fertility), but at very large vocabulary sizes these compression gains diminish while the additional vocabulary parameters can no longer be adequately trained on the limited data.
- Fixed FLOP budget perspective: For a fixed FLOP budget, the loss initially decreases as the vocabulary size increases, but eventually starts to rise again, indicating an intermediate optimal vocabulary size (a toy calculation after this list illustrates the underlying trade-off).
- Parameter growing perspective: Non-vocabulary parameters can benefit from increases in both depth and width, while vocabulary parameters are confined to a single layer, limiting their ability to benefit from increases in depth.
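A toy calculation of the fixed-FLOP trade-off, using the common C ≈ 6·D·(θ_nv + θ_v) approximation with made-up numbers; the paper counts training data in characters and converts to tokens via the tokenizer, which is omitted here:

```python
# Illustrative only: under a fixed compute budget C, growing the vocabulary V
# adds parameters (V * d_model) and therefore shrinks the number of tokens D
# that can be trained on, which is one side of the trade-off behind the optimum.
C = 1e21           # total training FLOPs (assumed)
theta_nv = 2.5e9   # non-vocabulary parameters (assumed)
d_model = 3072     # embedding width (assumed)

for V in (16_000, 32_000, 64_000, 128_000, 256_000):
    theta_v = V * d_model                     # vocabulary parameters
    D = C / (6 * (theta_nv + theta_v))        # affordable training tokens
    print(f"V={V:>7,}  theta_v={theta_v / 1e6:6.0f}M  tokens={D / 1e9:5.1f}B")
```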
[03] Estimating the Optimal Vocabulary Size
1. What are the three approaches proposed in the article to estimate the optimal vocabulary size?
- Approach 1 (Estimating power laws via IsoFLOPs): Trains models with varying vocabulary configurations under the same FLOP budget, then fits power laws to predict the optimal allocation of non-vocabulary parameters, vocabulary parameters, and training data (a fitting sketch follows this list).
- Approach 2 (Derivative-based Estimation): Takes the derivative of the FLOPs with respect to the vocabulary size and solves for the point where it vanishes, yielding an estimate of the optimal vocabulary size.
- Approach 3 (Parametric Fit of Loss Formula): Modifies the Chinchilla scaling law to incorporate the vocabulary size and fits the resulting formula to predict the normalized loss (a sketch of such a fit also follows this list).
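For Approach 1, the fitting step might look like the sketch below. The FLOP budgets and the loss-minimizing vocabulary-parameter counts are made-up placeholders standing in for the paper's measured IsoFLOPs optima:

```python
import numpy as np

# Fit a power law theta_v_opt ~ k * C**gamma by linear regression in log-log space.
flop_budgets = np.array([1e18, 3e18, 1e19, 3e19, 1e20])  # assumed FLOP budgets C
theta_v_opt = np.array([2e7, 4e7, 9e7, 1.8e8, 3.5e8])    # assumed loss-minimizing theta_v

gamma, log_k = np.polyfit(np.log(flop_budgets), np.log(theta_v_opt), 1)
print(f"theta_v_opt ~ {np.exp(log_k):.3g} * C^{gamma:.2f}")

# Repeating the same fit for theta_nv and D gives the relative scaling exponents;
# the paper's headline finding corresponds to the vocabulary exponent being smaller
# than the non-vocabulary one.
```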
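For Approach 3, a hedged sketch of the parametric fit, using a Chinchilla-style functional form extended with a vocabulary term. The functional form, symbols, and the curve_fit call are assumptions for illustration, not necessarily the paper's exact parameterization:

```python
from scipy.optimize import curve_fit  # used once real measurements are plugged in

def vocab_aware_loss(X, E, A, alpha, B, beta, G, gamma):
    """Assumed Chinchilla-style form with an extra vocabulary term:
    L_u ~ E + A/theta_nv**alpha + B/theta_v**beta + G/D**gamma."""
    theta_nv, theta_v, D = X
    return E + A / theta_nv**alpha + B / theta_v**beta + G / D**gamma

# With arrays of (theta_nv, theta_v, D) and the observed normalized losses from the
# IsoFLOPs runs of Approach 1, the coefficients would be fitted along the lines of:
# popt, _ = curve_fit(vocab_aware_loss, (theta_nv, theta_v, D), losses, maxfev=20_000)
```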
2. What are the key findings from the three approaches regarding the scaling of vocabulary parameters compared to non-vocabulary parameters? All three approaches converge on the finding that vocabulary parameters should be scaled more slowly than non-vocabulary parameters as the compute budget grows.
[04] Discussion
1. How do the predicted optimal vocabulary sizes compare to the vocabulary sizes used in existing large language models? The article finds that most existing LLMs use vocabulary sizes smaller than optimal according to the proposed approaches. For example, it predicts that the optimal vocabulary size for Llama2-70B should have been at least 216K, roughly 7 times its actual 32K vocabulary.
2. How does the article empirically verify the predictions of the optimal vocabulary size? The article trains 3B parameter models with the predicted optimal vocabulary sizes and compares their performance to baselines using commonly used vocabulary sizes. The results show that models with the suggested optimal vocabulary sizes consistently outperform the baselines under the same FLOP budget.
3. How does the article address scenarios with scarce or excessive training data, beyond the compute-optimal setting? The article shows that Approach 3 can handle both data-constrained (undertrained) and data-excessive (overtrained) settings, and demonstrates that adopting the predicted optimal vocabulary size improves performance in these practical scenarios as well.