Pre-training Small Base LMs with Fewer Tokens
Abstract
The paper studies the effectiveness of a simple approach to develop a small base language model (LM) starting from an existing large base LM. The key idea is to inherit a few transformer blocks from the larger LM, and then train this smaller model on a very small subset (0.1%) of the raw pretraining data of the larger model. The authors call this approach "Inheritune" and demonstrate it for building a small base LM with 1.5B parameters using 1B tokens. They show that the resulting model compares favorably to publicly available base models of 1B-2B size, some of which have been trained using 50-1000 times more tokens.
The paper also explores a slightly different setting in which small LMs are trained using larger LMs and their full pre-training dataset. Here, the authors show that smaller LMs initialized with some of the layers of GPT2-medium (355M) and GPT2-large (770M) can match the validation loss of their bigger counterparts trained from scratch for the same number of training steps on the OpenWebText dataset (9B tokens).
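The data budget quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the only inputs are the figures stated above (a 1B-token subset that is 0.1% of the full corpus, and baselines trained on 50-1000 times more tokens).

```python
from fractions import Fraction

def implied_full_corpus_tokens(subset_tokens: int, subset_fraction: Fraction) -> int:
    """Size of the full corpus implied by a subset of known size and fraction."""
    return int(Fraction(subset_tokens) / subset_fraction)

def baseline_token_range(subset_tokens: int, low_multiple: int, high_multiple: int):
    """Token budgets of baselines trained on low..high times more tokens."""
    return subset_tokens * low_multiple, subset_tokens * high_multiple

SUBSET_TOKENS = 10**9                 # 1B tokens, as stated in the abstract
SUBSET_FRACTION = Fraction(1, 1000)   # 0.1%, as stated in the abstract

full_corpus = implied_full_corpus_tokens(SUBSET_TOKENS, SUBSET_FRACTION)
low, high = baseline_token_range(SUBSET_TOKENS, 50, 1000)

print(f"implied full corpus: {full_corpus:,} tokens")
print(f"baseline budgets: {low:,} to {high:,} tokens")
```

Exact fractions are used instead of floats so that the 0.1% division does not pick up rounding error.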
Q&A
[01] Developing a small base LM in low data regime using Inheritune
1. Questions related to the content of the section?
- What is the key idea behind the Inheritune method proposed in the paper?
- How does the performance of the 1.5B model developed using Inheritune compare to publicly available baseline models of similar size?
- How does the Inheritune method compare to a pruning technique like Sheared LLaMA in terms of performance?
Answers:
- The key idea behind Inheritune is to train a small target model by inheriting the first few transformer blocks (layers) from a larger reference model, and then further training this smaller model on a very small subset (0.1%) of the original pretraining data of the reference model.
- The 1.5B model developed using Inheritune reaches 89% of the 3B reference model's average downstream accuracy across 9 datasets and 94% of its MMLU (5-shot) score. It also performs comparably to or better than publicly available baseline models of similar size that were trained with 50-300 times more data.
- Inheritune can be better classified as an initialization technique, unlike pruning which is a compression method. While the performance of Inheritune is competitive with the Sheared LLaMA model, the two methods have different underlying approaches and computational requirements.
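The initialization step described in the answers above can be sketched as a filter over a checkpoint's parameter names. This is a minimal illustration, not the authors' code: the function name, the `layers.<index>.` naming scheme, and the choice to also copy non-block parameters (embeddings, final norm, output head) are assumptions made for the example.

```python
import re
from typing import Any, Dict

def inherit_first_blocks(
    ref_state: Dict[str, Any],
    n_keep: int,
    block_pattern: str = r"^layers\.(\d+)\.",  # assumed naming scheme
) -> Dict[str, Any]:
    """Return the subset of a reference checkpoint used to initialize a
    smaller model that keeps only the first `n_keep` transformer blocks."""
    pattern = re.compile(block_pattern)
    target_state: Dict[str, Any] = {}
    for name, weights in ref_state.items():
        match = pattern.match(name)
        if match is None:
            # Not part of a block (e.g. embeddings, head): copied as-is.
            # Copying these is an assumption of this sketch.
            target_state[name] = weights
        elif int(match.group(1)) < n_keep:
            # Parameter belongs to one of the first n_keep blocks.
            target_state[name] = weights
    return target_state

# Toy reference "checkpoint" with 4 blocks; values stand in for tensors.
reference = {"embed.weight": "E", "head.weight": "H"}
for i in range(4):
    reference[f"layers.{i}.attn.weight"] = f"A{i}"
    reference[f"layers.{i}.mlp.weight"] = f"M{i}"

small = inherit_first_blocks(reference, n_keep=2)
print(sorted(small))
```

The resulting dictionary is only the starting point; in the method as summarized here, the smaller model is then trained further on the small data subset.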
[02] Inheritune scales across different model sizes
1. Questions related to the content of the section?
- How does the performance of Inheritune-derived models vary with the choice of the number of inherited layers?
- What insights does the analysis of Inheritune-derived models of different sizes provide?
Answers:
- The paper presents an analysis of Inheritune-derived models with different choices of the number of inherited layers (8 models with 8-20 layers). The results show a positive trend in the MMLU (5-shot) score as the number of inherited layers increases, with a slight dip in performance for the 20-layer model potentially due to overfitting.
- The analysis of Inheritune-derived models of different sizes suggests that the Inheritune method can be scaled to develop small base LMs of varying sizes, and the choice of the number of inherited layers is an important hyperparameter that affects the performance of the resulting model.
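Treating the number of inherited layers as a hyperparameter amounts to a simple sweep. The sketch below shows one way to organize it; `build_and_train` and `evaluate` are hypothetical callables supplied by the user, and nothing here reproduces the paper's training or evaluation setup.

```python
from typing import Callable, Dict, Iterable, TypeVar

Model = TypeVar("Model")

def sweep_inherited_layers(
    layer_counts: Iterable[int],
    build_and_train: Callable[[int], Model],  # hypothetical: n layers -> trained model
    evaluate: Callable[[Model], float],       # hypothetical: model -> score, higher is better
) -> Dict[int, float]:
    """Train one model per candidate layer count and record its score."""
    results: Dict[int, float] = {}
    for n_layers in layer_counts:
        model = build_and_train(n_layers)
        results[n_layers] = evaluate(model)
    return results

def best_layer_count(results: Dict[int, float]) -> int:
    """Layer count with the highest recorded score."""
    return max(results, key=lambda n: results[n])
```

Keeping the full `results` mapping, rather than only the best entry, makes it possible to see trends such as a dip at the largest layer count.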
[03] Additional analysis with larger reference LMs and 50B data
1. Questions related to the content of the section?
- How does the performance of Inheritune-derived models change when using larger reference models (7B parameters) and a larger subset (50B) of the pretraining data?
- What insights does the analysis of the number of training epochs provide?
- What are the findings regarding the use of repeated vs. fresh tokens during training?
Answers:
- When using larger reference models (OpenLLaMA-7B and LLaMA2-7B) and a 50B subset of the pretraining data, the Inheritune-derived 1.5B/1.6B models show even greater improvements in the MMLU (5-shot) score compared to the results with the 1B data and OpenLLaMA-3B reference model.
- The analysis of the number of training epochs shows that repetition is helpful, particularly for the MMLU task, while the average performance on the other 9 datasets peaks at around 5 epochs and then deteriorates.
- The paper finds that one can safely reuse the 1B tokens for up to 10-20 epochs without compromising performance, suggesting that repeating the tokens can be a viable approach when the available pretraining data is limited.
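Reusing a fixed token subset for several epochs can be sketched as a generator that walks the same examples repeatedly. This is an illustrative sketch with hypothetical names; reshuffling the order at each epoch is an assumption of the example, not something stated in the summary above.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def repeated_epoch_batches(
    examples: Sequence[T], batch_size: int, epochs: int, seed: int = 0
) -> Iterator[List[T]]:
    """Yield batches over the same fixed subset for `epochs` passes,
    reshuffling the order at the start of each pass (an assumption)."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield [examples[i] for i in order[start:start + batch_size]]

def tokens_seen(subset_tokens: int, epochs: int) -> int:
    """Total tokens processed when a fixed subset is repeated."""
    return subset_tokens * epochs

# Ten passes over a 1B-token subset process 10B tokens in total,
# although only 1B of them are unique.
print(f"{tokens_seen(10**9, 10):,}")
```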
[04] Exploratory Analysis of Inheritune in the presence of full pre-training data
1. Questions related to the content of the section?
- What is the key finding when Inheritune is applied in the setting where the full pretraining dataset is available?
- How do the Inheritune-derived models compare to their larger reference models and same-sized models trained from scratch in terms of validation loss and convergence?
Answers:
- In the setting where the full pretraining dataset is available, the paper shows that a smaller target LM can be extracted from larger reference models (GPT2-large and GPT2-medium) without compromising the validation loss. For example, a 16-layer GPT2-medium model can match the validation loss of the 24-layer GPT2-medium model.
- The Inheritune-derived models not only match the validation loss of their larger reference models but also outperform same-sized models trained from scratch, even when the latter are trained for twice the number of steps. Additionally, the Inheritune-derived models exhibit a similar convergence pattern to their larger reference models.
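To get a feel for how much smaller a 16-layer variant of GPT2-medium is, a parameter count can be estimated from the architecture. The sketch below assumes the standard GPT-2 medium configuration (hidden size 1024, vocabulary 50257, context length 1024, output head tied to the token embedding); these values are not given in the summary above, and the function name is hypothetical.

```python
def gpt2_param_count(
    n_layer: int,
    d_model: int = 1024,      # assumed: GPT-2 medium hidden size
    vocab_size: int = 50257,  # assumed: GPT-2 vocabulary size
    n_ctx: int = 1024,        # assumed: GPT-2 context length
) -> int:
    """Parameter count of a GPT-2 style model with a tied output head."""
    # Per block: attention (4*d^2 weights) + MLP (8*d^2 weights),
    # plus 9*d bias terms and 4*d layer-norm terms.
    per_block = 12 * d_model**2 + 13 * d_model
    # Token and position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Final layer norm (scale and shift).
    final_ln = 2 * d_model
    return n_layer * per_block + embeddings + final_ln

full = gpt2_param_count(24)   # 354,823,168, i.e. the ~355M model
small = gpt2_param_count(16)  # the 16-layer variant
print(f"24 layers: {full:,}")
print(f"16 layers: {small:,} ({small / full:.0%} of the full model)")
```

Because the embeddings are shared by both variants, dropping a third of the layers removes less than a third of the parameters.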
Appendix B provides further details on the implementation and training hyperparameters used for the different experiments in the paper.