# Neural scaling law - Wikipedia

## Abstract

The article discusses neural scaling laws, which are empirical laws relating key properties of a family of neural networks. It covers the quantities that characterize a neural model (model size, training dataset size, training cost, and error rate) and how they are empirically found to be related by simple statistical laws. The article also covers recent research on neural scaling laws, including the "Chinchilla scaling" law, and efforts to go "beyond Chinchilla scaling" by modifying the training pipeline to reach the same loss with less compute.

## Q&A

### [01] Neural Scaling Laws

**1. What are the 4 key parameters that characterize a neural model?**

- Size of the model (usually the number of parameters)
- Size of the training dataset (number of data points)
- Cost of training (time and computational resources)
- Error rate after training

**2. What is the relationship between these 4 parameters as described by neural scaling laws?**

- They are empirically found to be related by simple statistical laws, usually written in terms of the tuple $(N, D, C, L)$: number of parameters, dataset size, computing cost, and loss.

**3. What is the complication with sparse models like mixture-of-expert models?**

- In sparse models, only a fraction of the parameters are used during every inference, unlike most other neural networks where all parameters are used.
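The total-versus-active distinction can be made concrete with a toy parameter count. In a hypothetical mixture-of-experts layer (all numbers below are illustrative, not from any real model), only the top-k experts run per token:

```python
# Toy illustration (hypothetical numbers): total vs. active parameters
# in a mixture-of-experts (MoE) model, where only top_k experts run per token.

def moe_param_counts(shared_params, num_experts, params_per_expert, top_k):
    """Return (total, active-per-token) parameter counts."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

total, active = moe_param_counts(
    shared_params=2_000_000_000,   # attention, embeddings, etc.
    num_experts=8,
    params_per_expert=1_000_000_000,
    top_k=2,
)
print(total, active)  # 10_000_000_000 total, but only 4_000_000_000 per token
```

This is why "model size" is ambiguous for sparse models: scaling laws may be stated in terms of either count.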

**4. How does the size of the training dataset affect model performance?**

- Larger training datasets provide a richer and more diverse source of information for the model to learn from, leading to improved generalization performance.
- However, increasing the dataset size also increases the computational resources and time required for training.

**5. What are the two types of training datasets in the "pretrain, then finetune" method used in large language models?**

- The pretraining dataset and the finetuning dataset, where the finetuning dataset is typically less than 1% the size of the pretraining dataset.

**6. How does the cost of training a neural model depend on various factors?**

- The cost of training is a function of the size of the model, the size of the training dataset, the complexity of the training algorithm, and the computational resources available.
- Doubling the training dataset does not necessarily double the cost of training, as one can train the model for several epochs over the same dataset.
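A widely used rule of thumb in the scaling-law literature (an approximation, not stated in this summary) puts training compute for a dense Transformer at roughly $C \approx 6ND$ FLOPs for $N$ parameters and $D$ tokens. A minimal sketch of the epoch point above:

```python
# Back-of-the-envelope training cost, assuming the common C ~ 6*N*D
# FLOP approximation for dense Transformers (all numbers illustrative).

def training_flops(n_params, n_tokens, epochs=1):
    # Several epochs over the same D tokens multiply the cost, not the
    # dataset size -- so one epoch over 2x the data and two epochs over
    # 1x the data cost the same compute.
    return 6 * n_params * n_tokens * epochs

one_epoch_2x_data = training_flops(1e9, 4e10)
two_epochs_1x_data = training_flops(1e9, 2e10, epochs=2)
print(one_epoch_2x_data == two_epochs_1x_data)  # True
```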

**7. How can the performance of a neural model be improved?**

- Using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set.
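Of the levers above, early stopping is the simplest to sketch. A generic illustration, not tied to any particular framework (the loss list stands in for real validation evaluations):

```python
def train_with_early_stopping(val_losses_per_epoch, patience=3):
    """Stop when validation loss hasn't improved for `patience` epochs.

    `val_losses_per_epoch` stands in for real per-epoch validation runs.
    Returns (best_epoch, best_loss).
    """
    best, best_epoch = float("inf"), 0
    for epoch, val_loss in enumerate(val_losses_per_epoch):
        if val_loss < best:
            best, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best

# Validation loss improves, then degrades as the model overfits.
print(train_with_early_stopping([3.0, 2.5, 2.2, 2.3, 2.4, 2.6, 2.7]))
# -> (2, 2.2)
```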

### [02] Scaling Laws in Neural Networks

**1. What are the key findings from the 2017 paper on neural scaling laws?**

- Previous works found the loss to scale as $L \propto D^{-\alpha}$, with $\alpha \in \{0.5, 1, 2\}$, but the 2017 paper found $\alpha \in [0.07, 0.35]$ in practice.
- The exponent can change based on the task, but changing the architecture, optimizers, regularizers, and loss functions only changes the proportionality factor, not the exponent.
- The number of parameters necessary to reach the lowest levels of loss, given a fixed dataset size, grows like $N \propto D^{\beta}$ for another exponent $\beta$.

**2. What are the scaling laws found in the 2020 analysis?**

- The scaling laws found are power laws in model size, dataset size, and compute, of the form $L \propto N^{-\alpha_N}$, $L \propto D^{-\alpha_D}$, and $L \propto C^{-\alpha_C}$, holding over multiple modalities (text, video, image, text to image, etc.).
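Exponents like these are typically estimated by a straight-line fit on a log-log plot. A minimal sketch on synthetic data (the exponent 0.095 is an illustrative input that the fit recovers, not a value from this summary):

```python
import numpy as np

# Fit L = a * D^(-alpha) by linear regression in log-log space,
# using synthetic losses generated with a known exponent.
true_alpha, a = 0.095, 5.0
D = np.logspace(6, 10, 20)          # dataset sizes: 1e6 .. 1e10 tokens
L = a * D ** (-true_alpha)          # noiseless synthetic losses

# log L = log a - alpha * log D, so the slope gives -alpha.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
alpha_hat = -slope
print(round(alpha_hat, 3))  # 0.095
```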

**3. What is the "Chinchilla scaling" law for training Transformer language models?**

- The Chinchilla scaling law states that for a large language model autoregressively trained for one epoch with a cosine learning rate schedule, the loss follows $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$, where $N$ is model size, $D$ is dataset size in tokens, training compute is approximately $C \approx 6ND$, and the fitted statistical parameters are $E \approx 1.69$, $A \approx 406.4$, $B \approx 410.7$, $\alpha \approx 0.34$, $\beta \approx 0.28$.

**4. What are the implications of the Chinchilla scaling law?**

- It allows solving for the optimal model size and training dataset size for a given compute budget, or vice versa.
- It suggests that when given an increased budget, the number of model parameters and the number of tokens for training should scale in approximately equal proportions.
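Combined with the standard $C \approx 6ND$ compute approximation, "equal proportions" implies $N \propto C^{1/2}$ and $D \propto C^{1/2}$. A sketch using the often-quoted Chinchilla ratio of roughly 20 training tokens per parameter (both the ratio and the FLOP formula are literature rules of thumb, not exact constants from this summary):

```python
import math

# Chinchilla-style allocation sketch, assuming C ~ 6*N*D and the
# commonly quoted ratio D ~ 20*N (both approximations).
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e23)
# Doubling compute scales both N and D by sqrt(2): "equal proportions".
n2, d2 = chinchilla_optimal(2e23)
print(n2 / n, d2 / d)  # both ~ 1.414
```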

**5. What are the efforts to go "beyond Chinchilla scaling"?**

- The goal is to make the scaling-law exponent larger, so that the same loss can be reached with much less compute.
- Techniques like filtering data and using "denoising objectives" have been explored to achieve this.

**6. How does "overtraining" during training affect performance during inference?**

- Overtraining, where the model is trained on more data than is Chinchilla-optimal for its size, can lead to better performance during inference, since inference cost depends on model size rather than on the number of training tokens.

**7. What is the "Broken Neural Scaling Laws" (BNSL) framework?**

- BNSL describes the scaling behaviors of artificial neural networks as following a smoothly broken power law functional form, where the transitions between linear segments on a log-log plot are called "breaks".
- This functional form has been observed across a wide range of scenarios and architectures in AI.
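The smoothly broken power law is usually quoted in the literature in the following form (reproduced here as an assumption, with $x$ the quantity being scaled up, $y$ the performance metric, and $n$ the number of breaks):

```latex
y = a + b\,x^{-c_0} \prod_{i=1}^{n}
    \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}
```

Here $a$ is the limiting performance, $c_0$ the initial power-law slope, and each break $i$ has a location $d_i$, a sharpness $f_i$, and a slope change $c_i$; far from all breaks the curve looks like a plain power law on a log-log plot.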

### [03] Scaling Laws in Vision Transformers and Machine Translation

**1. What were the findings on scaling laws for vision transformers?**

- Vision transformers, similar to language transformers, exhibit scaling laws. A study found that the error probability of a finetuned vision transformer on ImageNet falls off as a power law in the compute spent on pretraining.

**2. What were the findings on scaling laws for neural machine translation?**

- A study on encoder-decoder Transformer models for English-to-German translation found that:
  - For source-natural datasets, the model quickly overfits, and the scaling exponent is larger.
  - For source-synthetic datasets, the scaling exponent is smaller.

- Another study found the Kaplan et al. (2020) scaling law applies to machine translation, with the BLEU score scaling with the test loss $L$ roughly as $\mathrm{BLEU} \approx C e^{-kL}$ for fitted constants $C$ and $k$.

**3. What were the findings on scaling laws for transfer learning in language models?**

- When pretraining on English text and finetuning on Python text, the "transferred token count" (the effective extra data gained from pretraining) scales as a power law in both the finetuning dataset size $D_F$ and the model size $N$, i.e. $D_T \propto D_F^{\alpha} N^{\beta}$.
- The same power-law form holds when pretraining on a mixture of English text and non-Python code, with different fitted exponents.