Time Matters: Scaling Laws for Any Budget
Abstract
The article discusses how to estimate the training speed and final loss of large language models from their hyperparameters alone, without the need for extensive empirical training. The key points are summarized in the Q&A below.
Q&A
[01] Introduction
1. What is the primary cost driver for training large models? The primary cost driver for training large models is wall-clock training time.
2. Why are popular time estimates based on FLOPs poor estimates? The article shows that FLOP-based time estimates correlate poorly with actual wall-clock time because runtime is dominated by memory movement rather than arithmetic, and it constructs a more accurate proxy based on memory copies.
3. What does the article aim to do? The article aims to estimate the training speed of a transformer model from its hyperparameters, and combine this with a scaling law curve like Chinchilla to estimate the final loss of the model.
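For reference, the Chinchilla-style scaling law mentioned here gives the final loss as a function of parameter count N and training tokens D; the article fits its own coefficients E, A, B, α, β, as discussed in [03]:

$$
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$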
[02] The parameter equivalence principle
1. What is the parameter equivalence principle? The parameter equivalence principle states that above a certain scale, the final loss of a transformer is primarily a function of how many parameters there are, not where they are in the model.
2. What implication does this have? The implication is that models of the same size that allocate their parameters differently compete primarily on speed, so we should choose architectures that optimize for training speed.
[03] Estimating linear scaling law coefficients
1. What did the authors do to estimate the scaling law coefficients? The authors fitted their own scaling-law coefficients by sweeping over 1,535 different decoder-only transformer models trained on the C4 dataset (a sketch of one possible fitting procedure follows this list).
2. How do the authors' coefficients compare to previous work? The authors' computed coefficients differ significantly from the values quoted in previous papers like Chinchilla, suggesting the coefficients are highly sensitive to the details of the setup.
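A minimal sketch of how such a fit could be set up, assuming the Chinchilla parametric form and scipy's curve_fit on synthetic data; the article's actual fitting procedure (objective, weighting, and the 1,535-model dataset) is not reproduced here, and all numbers below are illustrative assumptions.

```python
# Hypothetical sketch: fitting Chinchilla-style coefficients to (params, tokens, loss)
# observations. Synthetic data stands in for the article's 1,535 C4-trained models.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(x, E, A, B, alpha, beta):
    """Parametric scaling law L(N, D) = E + A / N**alpha + B / D**beta."""
    N, D = x
    return E + A / N**alpha + B / D**beta

# Synthetic observations generated from an assumed, Chinchilla-like ground truth.
rng = np.random.default_rng(0)
N_obs = np.logspace(7, 9, 16)          # parameter counts
D_obs = np.logspace(8.5, 10.5, 16)     # training tokens
truth = dict(E=1.7, A=406.0, B=411.0, alpha=0.34, beta=0.28)
L_obs = chinchilla_loss((N_obs, D_obs), **truth) + rng.normal(0, 0.01, size=16)

# Recover the coefficients from the noisy observations.
popt, _ = curve_fit(chinchilla_loss, (N_obs, D_obs), L_obs,
                    p0=[2.0, 300.0, 300.0, 0.3, 0.3], maxfev=50_000)
print(dict(zip(["E", "A", "B", "alpha", "beta"], popt)))
```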
[04] Equations for estimating the speed of a model
1. Why are FLOPs a poor estimate of runtime? The article shows that the runtime of the model is driven by data copying (memory copies), not the actual computation (FLOPs).
2. What equations did the authors derive to estimate the speed of a model? The authors derived equations to estimate the number of parameters (PARAMS), memory copies (MEMCPYS), and FLOPs for a transformer model, and combined them into an equation to estimate the total training time per step (TIME).
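A rough sketch of what such per-step cost proxies might look like, using standard transformer approximations; the hyperparameter names and exact forms are assumptions for illustration, not the article's own PARAMS, MEMCPYS, FLOPs, and TIME equations.

```python
# Hypothetical cost proxies for one training step of a decoder-only transformer.
# Standard back-of-the-envelope forms, not the article's exact equations.

def params(n_layers: int, d_model: int, d_mlp: int, vocab: int) -> float:
    """Approximate parameter count: attention + MLP weights per layer, plus embeddings."""
    per_layer = 4 * d_model**2 + 2 * d_model * d_mlp   # Q/K/V/output projections + MLP
    return n_layers * per_layer + vocab * d_model

def flops_per_step(n_params: float, tokens_per_step: int) -> float:
    """~6 FLOPs per parameter per token for a combined forward and backward pass."""
    return 6 * n_params * tokens_per_step

def memcpys_per_step(n_layers: int, d_model: int, tokens_per_step: int) -> float:
    """Crude memory-traffic proxy: activation elements read/written per layer, per token."""
    return n_layers * d_model * tokens_per_step

def time_per_step(flops: float, memcpys: float,
                  c_flops: float, c_mem: float, c_const: float) -> float:
    """Assumed form of the TIME equation: a linear combination with fitted coefficients."""
    return c_flops * flops + c_mem * memcpys + c_const
```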
[05] Estimating the throughput
1. How did the authors determine the coefficients in the TIME equation? The authors conducted a large-scale sweep over model hyperparameters, trained each configuration for 5 minutes, and applied linear regression to determine the three coefficients in the TIME equation (see the regression sketch after this list).
2. What did the authors find about the importance of the different terms (FLOPs, MEMCPY) in the TIME equation? The authors found that MEMCPY is a much stronger predictor of runtime than FLOPs, and can account for essentially all of the explanatory power on its own.
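A minimal sketch of that regression step, assuming measured step times and precomputed FLOPs/MEMCPY features; all values below are illustrative placeholders, not the article's measurements.

```python
# Hypothetical sketch: recovering TIME-equation coefficients by least squares.
# Each row corresponds to one short (~5 minute) training run; feature names are assumed.
import numpy as np

# Per-step cost proxies and measured step times for each swept config (placeholders).
flops   = np.array([1.2e12, 3.4e12, 8.1e12, 2.0e13])
memcpys = np.array([2.5e9,  6.8e9,  1.5e10, 3.9e10])
seconds = np.array([0.11,   0.29,   0.63,   1.60])   # measured time per step

# Design matrix [FLOPS, MEMCPYS, 1]; solve for the three coefficients.
X = np.stack([flops, memcpys, np.ones_like(flops)], axis=1)
(c_flops, c_mem, c_const), *_ = np.linalg.lstsq(X, seconds, rcond=None)
print(c_flops, c_mem, c_const)

# Dropping the FLOPS column and refitting would test the article's finding that
# MEMCPYS alone retains essentially all of the explanatory power.
```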
[06] Putting it all together
1. How accurate is the authors' approach compared to using the traditional Chinchilla scaling law? The authors show that their approach produces estimates that are indistinguishable from using the traditional Chinchilla scaling law, achieving the same R^2 of 0.9.
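Putting the pieces together might look like the following sketch: estimate the time per step from the hyperparameters, convert the wall-clock budget into a token count, and plug parameters and tokens into the scaling law. The function name, inputs, and coefficient values are illustrative assumptions, not the article's fitted numbers.

```python
# Hypothetical end-to-end sketch: hyperparameters + wall-clock budget -> predicted loss.

def predict_final_loss(n_params, time_per_step, tokens_per_step, budget_seconds,
                       E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Estimate how many tokens fit in the budget, then apply the scaling law."""
    steps = budget_seconds / time_per_step
    tokens = steps * tokens_per_step
    return E + A / n_params**alpha + B / tokens**beta

# Example: a 125M-parameter model at 0.4 s/step, 0.5M tokens/step, 24-hour budget.
print(predict_final_loss(1.25e8, 0.4, 5e5, 24 * 3600))
```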
[07] Better loss with faster models
1. What architectural insights does the authors' approach provide? The authors show that increasing the embed size at the expense of other hyperparameters is favorable, and that narrow MLPs and shallow models with large embed sizes are preferred.
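To make the architectural claim concrete, a small self-contained sketch can compare two roughly parameter-matched configurations; under the parameter equivalence principle their final losses should be similar, so the one with less memory traffic trains faster and is preferred. The memory-traffic proxy and configurations below are illustrative assumptions, not the article's.

```python
# Hypothetical sketch: ranking roughly param-matched architectures by a speed proxy.

def approx_params(n_layers, d_model, d_mlp, vocab=50_000):
    """Rough parameter count: attention + MLP per layer, plus token embeddings."""
    return n_layers * (4 * d_model**2 + 2 * d_model * d_mlp) + vocab * d_model

def memcpy_proxy(n_layers, d_model, tokens_per_step=5e5):
    """Crude memory-traffic proxy; lower means faster per step."""
    return n_layers * d_model * tokens_per_step

shallow_wide = dict(n_layers=12, d_model=1536, d_mlp=3072)   # large embed, narrow MLP
deep_narrow  = dict(n_layers=36, d_model=768,  d_mlp=3072)   # more layers, smaller embed

for name, cfg in [("shallow_wide", shallow_wide), ("deep_narrow", deep_narrow)]:
    print(name,
          f"params={approx_params(**cfg):.2e}",
          f"memcpy={memcpy_proxy(cfg['n_layers'], cfg['d_model']):.2e}")
```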