gzip Predicts Data-dependent Scaling Laws
🌈 Abstract
The paper investigates whether scaling laws for neural language models (LMs) are agnostic to the training data, as suggested by prior work. The authors generate datasets of varying complexity by modulating the syntactic properties of a probabilistic context-free grammar (PCFG) and find that:
- Scaling laws are sensitive to differences in data complexity.
- The compressibility of the data, as measured by the gzip compression algorithm, is an effective predictor of how data complexity impacts scaling properties.
The authors propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility: as the training data becomes harder to compress, the compute-optimal frontier shifts its preference from parameter count toward dataset size.
🙋 Q&A
[01] Modulating Data Complexity via Syntactic Properties of a PCFG
1. How did the authors control the syntactic complexity of the datasets? The authors used a probabilistic context-free grammar (PCFG) to generate datasets of varying complexity. They controlled the syntactic properties of the PCFG by adjusting the number of terminals, the number of non-terminals, the maximum length of production-rule right-hand sides, and the maximum number of production rules per non-terminal (a sketch of this setup, together with the compressibility measurement from item 2, follows this list).
2. How did the authors measure the complexity of the generated datasets? The authors used the compressibility of the datasets, as measured by the gzip compression algorithm, as a proxy for complexity. They computed the ratio of compressed to original data size for a sample of 1000 token sequences from each dataset and used the median of these ratios as the complexity measure.
3. How did the compressibility of the PCFG datasets compare to real-world datasets? The authors found that as the syntactic complexity of the PCFG datasets increased, their compressibility decreased, making them more similar to the compressibility of natural language datasets. Some of the PCFG datasets were more similar in compressibility to code datasets, which are generally more compressible than natural language.
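The generation and measurement steps in items 1 and 2 are easy to reproduce in outline. Below is a minimal sketch, not the authors' code: the grammar construction, helper names, terminal/non-terminal mix, and depth cutoff are all illustrative assumptions. It builds a small random PCFG from the four syntactic knobs, samples token sequences from it, and reports the median ratio of gzip-compressed size to original size (lower means more compressible).

```python
import gzip
import random
import statistics

def random_pcfg(num_nonterminals=5, num_terminals=10,
                max_rhs_len=3, max_rules_per_nt=4, seed=0):
    """Build a toy random PCFG as {nonterminal: [(rhs, prob), ...]}.
    Symbol names and the terminal/non-terminal mix are illustrative choices."""
    rng = random.Random(seed)
    nts = [f"NT{i}" for i in range(num_nonterminals)]
    ts = [f"t{i}" for i in range(num_terminals)]
    grammar = {}
    for nt in nts:
        n_rules = rng.randint(1, max_rules_per_nt)
        rules = []
        for _ in range(n_rules):
            rhs_len = rng.randint(1, max_rhs_len)
            # Each right-hand-side symbol is a terminal ~60% of the time.
            rhs = tuple(rng.choice(ts if rng.random() < 0.6 else nts)
                        for _ in range(rhs_len))
            rules.append(rhs)
        # Uniform rule probabilities, for simplicity.
        grammar[nt] = [(rhs, 1.0 / n_rules) for rhs in rules]
    return grammar

def sample_sequence(grammar, start="NT0", max_depth=12, rng=None):
    """Sample one token sequence by recursively expanding non-terminals."""
    rng = rng or random.Random()
    def expand(symbol, depth):
        if symbol not in grammar:       # terminal symbol
            return [symbol]
        if depth >= max_depth:          # depth cutoff guarantees termination
            return []
        rhss, probs = zip(*grammar[symbol])
        rhs = rng.choices(rhss, weights=probs, k=1)[0]
        return [tok for sym in rhs for tok in expand(sym, depth + 1)]
    return " ".join(expand(start, 0))

def gzip_compressibility(sequences):
    """Median ratio of gzip-compressed size to original size (lower = more compressible)."""
    ratios = []
    for seq in sequences:
        raw = seq.encode("utf-8")
        ratios.append(len(gzip.compress(raw)) / max(len(raw), 1))
    return statistics.median(ratios)

grammar = random_pcfg()
rng = random.Random(1)
samples = [sample_sequence(grammar, rng=rng) for _ in range(1000)]
print(f"median gzip-compressibility: {gzip_compressibility(samples):.3f}")
```

Turning up the knobs (more non-terminals and terminals, longer right-hand sides, more rules per non-terminal) generally pushes this ratio upward, which is the complexity axis the paper sweeps.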
[02] Are Scaling Laws Sensitive to Data Complexity?
1. What did the authors observe about the convergence of models on datasets with different compressibilities? The authors observed that models converged faster on more compressible (less complex) datasets than on less compressible (more complex) ones, indicating that more compute is required to model the more complex data.
2. How did the authors compute the scaling laws for each PCFG dataset? The authors fitted the scaling-law function proposed by Hoffmann et al. (2022) to the training results for each PCFG dataset and found that the fitted parameters differed considerably across datasets (see the sketch after this list).
3. What did the authors observe about the compute-optimal frontier of the scaling laws for the different PCFG datasets? The authors found that as the datasets became harder to compress (more complex), the compute-optimal frontier of the scaling law gradually shifted to prefer dataset size over parameter count, crossing over the 1-to-1 frontier of the Chinchilla scaling law.
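Items 2 and 3 rest on the Chinchilla-style parametric loss from Hoffmann et al. (2022), L(N, D) = E + A/N^α + B/D^β, and the compute-optimal allocation it implies under C ≈ 6ND: N_opt ∝ C^a and D_opt ∝ C^b with a = β/(α+β) and b = α/(α+β), so a larger b means the frontier prefers dataset size over parameter count. The sketch below is illustrative only: it fits the five parameters to synthetic (N, D, loss) runs with an ordinary least-squares curve fit rather than the Huber-loss procedure used by Hoffmann et al., and all numbers are made up rather than taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta (Hoffmann et al., 2022)."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic "training runs": losses generated from known parameters plus noise,
# standing in for the per-dataset training results the paper fits against.
rng = np.random.default_rng(0)
N = np.array([5e6, 5e6, 5e7, 5e7, 5e8, 5e8, 5e9, 5e9])       # model parameters
D = np.array([1e9, 1e10, 1e9, 1e10, 1e10, 1e11, 1e10, 1e11])  # training tokens
true = dict(E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28)    # made-up values
loss = chinchilla_loss((N, D), **true) + rng.normal(0.0, 0.01, size=N.size)

p0 = [1.5, 300.0, 300.0, 0.3, 0.3]  # rough initialization
(E, A, B, alpha, beta), _ = curve_fit(chinchilla_loss, (N, D), loss, p0=p0, maxfev=50000)

# Compute-optimal allocation under C ≈ 6ND: N_opt ∝ C**a, D_opt ∝ C**b.
a = beta / (alpha + beta)   # exponent governing parameter count
b = alpha / (alpha + beta)  # exponent governing dataset size
print(f"E={E:.2f}  A={A:.0f}  B={B:.0f}  alpha={alpha:.3f}  beta={beta:.3f}")
print(f"compute-optimal exponents: a={a:.2f} (parameters), b={b:.2f} (data)")
```

Comparing the fitted a and b across datasets is what reveals the frontier shift described in item 3.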
[03] Predicting Scaling Law Parameters from gzip-compressibility
1. How did the authors fit a data-dependent scaling law using gzip-compressibility? The authors fitted linear regressions that predict each scaling-law parameter (E, A, B, α, β) as a function of a dataset's gzip-compressibility, then substituted these fits into the scaling law to obtain a new data-dependent formula (see the sketch after this list).
2. How did the authors show that gzip-compressibility, and not just syntactic properties, was the key predictor of scaling law parameter shifts? The authors conducted experiments where they held the vocabulary size constant while varying other syntactic properties, and found that gzip-compressibility still predicted the scaling law parameter shifts. They also showed that when the datasets had the same gzip-compressibility despite varying syntactic properties, there was no significant shift in scaling law parameters.
3. What are the implications of the data-dependent scaling law for training models on real-world datasets, such as code? The authors suggest that the data-dependent scaling law could lead to more optimal compute allocation for training language models on code datasets, which are generally more compressible than natural language datasets. They estimate this could save $278,000 in H100 hours when training a 6B parameter code-generation model.
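Item 1 amounts to five small regressions plus a substitution. The sketch below assumes per-dataset fits (E, A, B, α, β) and median gzip-compressibility H are already in hand; the numeric values are placeholders, not results from the paper. Each parameter is modeled as a linear function of H, and evaluating L(N, D) with those predicted parameters gives a data-dependent scaling law L(N, D, H).

```python
import numpy as np

# Placeholder per-dataset results: median gzip-compressibility H (compressed/original size)
# and the scaling-law parameters fitted on that dataset. Illustrative numbers only,
# not values from the paper.
H = np.array([0.25, 0.35, 0.45, 0.55, 0.65])
params = {
    "E":     np.array([1.2, 1.4, 1.7, 1.9, 2.2]),
    "A":     np.array([350.0, 380.0, 400.0, 430.0, 460.0]),
    "B":     np.array([500.0, 460.0, 410.0, 380.0, 340.0]),
    "alpha": np.array([0.30, 0.32, 0.34, 0.36, 0.38]),
    "beta":  np.array([0.36, 0.33, 0.30, 0.27, 0.24]),
}

# One linear regression per parameter: param(H) ≈ slope * H + intercept.
fits = {name: np.polyfit(H, values, deg=1) for name, values in params.items()}

def data_dependent_loss(N, D, h):
    """Evaluate L(N, D) with every scaling-law parameter predicted from the
    dataset's gzip-compressibility h via the per-parameter linear fits."""
    p = {name: np.polyval(coeffs, h) for name, coeffs in fits.items()}
    return p["E"] + p["A"] / N ** p["alpha"] + p["B"] / D ** p["beta"]

# Example: predicted loss for a 1B-parameter model trained on 100B tokens of data
# whose gzip-compressibility is 0.5.
print(f"{data_dependent_loss(1e9, 1e11, 0.5):.2f}")
```

With these placeholder numbers, the data exponent b = α/(α+β) rises with H, in line with the qualitative claim above that harder-to-compress data shifts the compute-optimal frontier toward dataset size; the paper derives the actual regressions from its PCFG training runs.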