Data, Data Everywhere: A Guide for Pretraining Dataset Construction
🌈 Abstract
The article discusses how pretraining datasets for large language models are constructed. It highlights the lack of open information on how to develop effective pretraining sets and aims to provide insights across all steps of pretraining set development. The key findings are organized by step in the Q&A that follows.
🙋 Q&A
[01] Data Curation
1. What are the key findings related to data curation?
- Compared to raw text, deduplicated and quality-filtered data improves model accuracy.
- In deduplication, it is better to prioritize keeping samples from older sources than more recent ones.
2. How does data deduplication impact model performance?
- Prioritizing older documents over more recent ones in deduplication leads to significantly better results.
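The "keep the older source" rule can be illustrated with a minimal sketch. It assumes exact-match deduplication on whitespace- and case-normalized text, which is a simplification rather than the article's actual pipeline, and the field names `text` and `snapshot` are hypothetical.

```python
import hashlib


def dedup_prefer_older(docs):
    """Drop exact duplicates, keeping the copy from the oldest snapshot.

    `docs` is an iterable of dicts with hypothetical fields:
      - "text": the document body
      - "snapshot": a sortable label where smaller means older, e.g. "2019-35"
    """
    kept = {}
    for doc in docs:
        # Normalize whitespace and case so trivially different copies collide.
        normalized = " ".join(doc["text"].split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        current = kept.get(key)
        # Keep the first copy seen, or replace it if this one is older.
        if current is None or doc["snapshot"] < current["snapshot"]:
            kept[key] = doc
    return sorted(kept.values(), key=lambda d: d["snapshot"])
```

A fuzzy (near-duplicate) stage would need a similarity method such as MinHash in place of the hash lookup, but the tie-breaking rule on snapshot age would stay the same.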
[02] Data Selection
1. What are the key findings related to data selection using DSIR (Data Selection with Importance Resampling)?
- DSIR improves the quality of web crawl snapshots.
- DSIR functions best when applied across each data source individually rather than the entire pretraining corpus.
- DSIR is sensitive to the composition of the target dataset used for selection.
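To make the selection mechanism concrete, here is a simplified sketch of importance resampling in the spirit of DSIR. It is not the reference implementation: the bucket count, the unigram-plus-bigram features, the add-one smoothing, and all function names are assumptions made for illustration. Each source document is scored by the log-ratio of its hashed n-gram likelihood under a target model versus a source model, and documents are then sampled without replacement using the Gumbel top-k trick.

```python
import hashlib
import math
import random

NUM_BUCKETS = 1 << 16  # assumed size of the hashed feature space


def hashed_ngrams(text):
    """Map a text's unigrams and bigrams to hash-bucket indices."""
    words = text.lower().split()
    grams = words + [f"{a} {b}" for a, b in zip(words, words[1:])]
    return [
        int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
        for g in grams
    ]


def fit_log_probs(texts):
    """Fit a smoothed bag-of-hashed-n-grams model; return per-bucket log-probs."""
    counts = [1.0] * NUM_BUCKETS  # add-one smoothing
    for text in texts:
        for bucket in hashed_ngrams(text):
            counts[bucket] += 1.0
    total = sum(counts)
    return [math.log(c / total) for c in counts]


def importance_log_weights(source_texts, target_texts):
    """Log importance weight log p_target(x) - log p_source(x) per source doc."""
    log_p_target = fit_log_probs(target_texts)
    log_p_source = fit_log_probs(source_texts)
    return [
        sum(log_p_target[b] - log_p_source[b] for b in hashed_ngrams(text))
        for text in source_texts
    ]


def select_indices(source_texts, target_texts, k, seed=0):
    """Sample k source docs without replacement, proportional to exp(weight)."""
    rng = random.Random(seed)
    weights = importance_log_weights(source_texts, target_texts)
    keys = []
    for i, w in enumerate(weights):
        u = rng.random() or 1e-12  # guard against log(0)
        keys.append((w - math.log(-math.log(u)), i))  # add Gumbel noise
    return sorted(i for _, i in sorted(keys, reverse=True)[:k])
```

To follow the per-source finding, `select_indices` would be called once for each data source rather than once over the pooled corpus. Because the weights depend entirely on `target_texts`, changing the target set changes what gets selected, which is consistent with the sensitivity noted above.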
[03] Data Sampling
1. What are the key findings related to data sampling methods?
- UniMax provides the best sampling weights for the English and multilingual domains.
- Alpha sampling, with a value of α = 1.3, provides the best sampling weights for the code domain.
- DoReMi is unable to produce competitive sampling weights as it often gives the majority of the weight to a single source.
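The two sampling methods that performed best can be sketched as follows. Two assumptions are worth stating loudly. First, the alpha sampling function uses the temperature-style convention in which a source's weight is proportional to its token count raised to 1/α, so α > 1 flattens the distribution; if the article's formulation uses a different exponent convention, the exponent should be adjusted. Second, the UniMax function follows the commonly described procedure of splitting a token budget as evenly as possible while capping each source at a maximum number of epochs; the `budget` and `max_epochs` values are illustrative, not the article's settings.

```python
def alpha_weights(token_counts, alpha):
    """Alpha (temperature-style) sampling: weight_i proportional to n_i ** (1 / alpha).

    Assumed convention: alpha = 1 is proportional sampling, alpha > 1 flattens.
    """
    scaled = {name: n ** (1.0 / alpha) for name, n in token_counts.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


def unimax_weights(token_counts, budget, max_epochs):
    """UniMax-style weights: uniform budget split with a per-source epoch cap."""
    remaining = budget
    allocation = {}
    # Visit sources from smallest to largest so capped sources release
    # their unused share to the larger ones that follow.
    names = sorted(token_counts, key=token_counts.get)
    for i, name in enumerate(names):
        uniform_share = remaining / (len(names) - i)
        cap = token_counts[name] * max_epochs
        allocation[name] = min(uniform_share, cap)
        remaining -= allocation[name]
    total = sum(allocation.values())
    return {name: a / total for name, a in allocation.items()}


if __name__ == "__main__":
    counts = {"small": 10, "medium": 100, "large": 1000}  # hypothetical sizes
    print(alpha_weights(counts, alpha=1.3))
    print(unimax_weights(counts, budget=600, max_epochs=4))
```

With the hypothetical counts above, UniMax caps the smallest source at 4 epochs (40 tokens) and splits the remaining budget equally between the other two, which shows how it avoids over-repeating small sources while staying close to uniform.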
[04] Data Attributes
1. What insights were gained from analyzing the attributes of web crawl data?
- Website homepages, news articles, and blogs constitute the majority of web crawl documents, while conversational text is sparse.
- Technical domains like finance, law, and science are among the least represented in web crawl data.
- Explanatory or news articles on science and health are the most likely to be high quality documents.
- Domains or types of speech that are generally of high quality may also exhibit high toxicity, explaining why previous toxicity-based filtering has harmed model accuracy.
2. How can data attributes be used to improve pretraining set development?
- Buckets defined by data attributes substantially improve the performance of data sampling methods.
- Attribute information can be used to define more useful target sets for data selection.
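Both uses of attributes can be illustrated with a short sketch. It assumes each document already carries attribute labels from some upstream classifier; the field names (`domain`, `quality`, `text`) and label values are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import defaultdict


def bucket_token_counts(docs, attributes=("domain", "quality")):
    """Group documents into attribute-defined buckets and count tokens per bucket."""
    counts = defaultdict(int)
    for doc in docs:
        key = tuple(doc[a] for a in attributes)
        # Whitespace tokens are a stand-in for a real tokenizer's token count.
        counts[key] += len(doc["text"].split())
    return dict(counts)


def build_target_set(docs, **wanted):
    """Pick documents whose attributes match all requested values."""
    return [
        doc for doc in docs
        if all(doc.get(attr) == value for attr, value in wanted.items())
    ]
```

The per-bucket counts could be passed to whichever sampling method is in use in place of per-source counts, and a call such as `build_target_set(docs, quality="high")` would yield a candidate target set for data selection.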