Data curation via joint example selection further accelerates multimodal learning
Abstract
The article discusses the importance of data curation in large-scale pretraining and summarizes a method called joint example selection (JEST), which jointly selects batches of data that are more effective for learning than examples selected independently. The key points are covered in the Q&A below.
Q&A
[01] Data Curation and Model-Based Selection
1. What are the key challenges with current data pipelines for large-scale pretraining?
- Current data pipelines rely heavily on manual curation, which is difficult and expensive to scale
- Model-based data curation holds promise for improving the slow, power-law scaling of large-scale pretraining across modalities
2. How do existing methods for model-based data curation work?
- Existing methods apply curation at the level of individual data points, but the quality of a batch is also a function of its composition
- In multimodal learning, contrastive objectives directly expose the interactions between examples in a batch
3. What is the key idea behind the proposed JEST method?
- JEST derives a simple and tractable algorithm to jointly select relevant 'sub-batches' of data from much larger 'super-batches' based on their model-based scores
- This allows JEST to accelerate learning beyond what is possible by selecting examples independently
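To make the super-batch/sub-batch framing concrete, here is a minimal sketch in Python. The per-example score array, the filtering ratio, and the top-k baseline are illustrative assumptions, not the paper's API; JEST replaces this independent top-k with the joint sampler sketched after section [02].

```python
import numpy as np

def select_subbatch_independent(scores: np.ndarray, filter_ratio: float) -> np.ndarray:
    """Baseline: independently keep the top-scoring examples of a super-batch.

    `scores` holds one model-based score per super-batch example (an
    illustrative assumption). With filter_ratio=0.9, a super-batch of
    32768 examples yields a sub-batch of 3276. JEST instead scores
    combinations of examples jointly; see the sampler after section [02].
    """
    keep = max(1, int(len(scores) * (1.0 - filter_ratio)))
    return np.argsort(scores)[-keep:]  # argsort is ascending; take the top
```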
[02] Joint Example Selection (JEST)
1. How does JEST leverage multimodal contrastive objectives?
- The contrastive loss of a batch decomposes into a sum of conditional losses for each example given the other examples
- JEST samples batches in proportion to their joint learnability, enabled by a sequential approach inspired by blocked Gibbs sampling (sketched in code after this section)
2. What are the different model-based scoring criteria considered?
- Hard learner: prioritize batches with high loss under the current learner model
- Easy reference: prioritize batches with low loss under a pretrained reference model
- Learnability: prioritize batches that are both unlearned and learnable, combining the hard learner and easy reference criteria
3. How does JEST compare to independent example selection methods?
- JEST significantly outperforms independent example selection, particularly at high filtering ratios where independent selection leads to performance degradation
- Learnability-based JEST yields the best scaling behavior compared to other prioritization criteria
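A minimal sketch of the sequential, blocked-Gibbs-style sampler described above, under strong simplifying assumptions: the per-example learnability vector and the pairwise interaction matrix below are hypothetical stand-ins for the conditional contrastive losses, which the actual algorithm derives from the learner's and reference model's logits.

```python
import numpy as np

def joint_select(learn_scores, pair_scores, n_chunks, chunk_size, rng):
    """Sample a sub-batch in proportion to (approximate) joint learnability.

    Illustrative inputs, not the paper's API:
      learn_scores[i]   -- learnability of example i alone
                           (learner loss minus reference loss),
      pair_scores[i, j] -- hypothetical pairwise interaction between
                           examples i and j from the contrastive objective.
    """
    super_size = learn_scores.shape[0]
    selected = []
    for _ in range(n_chunks):
        # Conditional learnability of each candidate given the chunks
        # selected so far: base score plus pairwise interaction terms.
        cond = learn_scores.copy()
        if selected:
            cond = cond + pair_scores[:, selected].sum(axis=1)
        cond[selected] = -np.inf  # never re-select an example
        # Sample one chunk without replacement, in proportion to exp(score),
        # in the spirit of blocked Gibbs sampling.
        probs = np.exp(cond - cond.max())
        probs /= probs.sum()
        chunk = rng.choice(super_size, size=chunk_size, replace=False, p=probs)
        selected.extend(chunk.tolist())
    return np.asarray(selected)

# Toy usage with random stand-in scores: select 256 of 2048 examples.
rng = np.random.default_rng(0)
learn = rng.normal(size=2048)
pairs = 0.1 * rng.normal(size=(2048, 2048))
sub_batch = joint_select(learn, pairs, n_chunks=16, chunk_size=16, rng=rng)
```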
[03] Efficient Scoring and Multi-Resolution Training
1. What challenges arise from scoring large super-batches in JEST?
- Scoring large super-batches increases the computational cost per iteration, reducing the efficiency gains in terms of total FLOPs
2. How does the Flexi-JEST variant address this?
- Flexi-JEST uses online model approximation techniques like FlexiViT to efficiently score super-batches at low resolution
- It also leverages multi-resolution training, where the learner is trained on a mix of full- and low-resolution images (a simplified sketch follows this section)
3. What are the trade-offs between JEST and Flexi-JEST?
- JEST maximizes training speed, reaching state-of-the-art performance with up to 13× fewer training iterations
- Flexi-JEST minimizes total FLOPs, reaching state-of-the-art performance with 9.9× fewer FLOPs than a comparable baseline
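A simplified sketch of the low-resolution scoring plus multi-resolution training loop. The pooling-based downsampler, the `score_fn`/`train_fn` callables, and the 50% filtering ratio are all illustrative placeholders; the actual method reuses a single FlexiViT-style learner at multiple patch sizes rather than a separate pooling step.

```python
import numpy as np

def downsample(images: np.ndarray, factor: int = 4) -> np.ndarray:
    """Cheap average-pooling downsampler standing in for FlexiViT-style
    low-resolution patching (assumes height/width divisible by `factor`)."""
    n, h, w, c = images.shape
    return images.reshape(n, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))

def flexi_jest_step(images, score_fn, train_fn, lowres_fraction=0.5):
    """One illustrative Flexi-JEST-style step: score the super-batch cheaply
    at low resolution, then train on a mix of low- and full-resolution data.
    `score_fn` and `train_fn` are placeholder callables, not the paper's API."""
    scores = score_fn(downsample(images))         # cheap super-batch scoring
    keep = np.argsort(scores)[len(images) // 2:]  # illustrative 50% filtering
    batch = images[keep]
    n_low = int(lowres_fraction * len(batch))
    train_fn(downsample(batch[:n_low]))           # low-resolution portion
    train_fn(batch[n_low:])                       # full-resolution portion
```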
[04] Data Quality Bootstrapping
1. How does the choice of reference model affect JEST performance?
- JEST performance is decoupled from the reference model's own performance, suggesting the importance of dataset size over dataset quality
- Scaling the reference dataset size (WebLI-curated++) leads to significant improvements in both reference model and JEST performance
2. What is the role of data curation in JEST?
- The ability to steer the data selection process towards the distribution of smaller, well-curated datasets is essential to the performance of JEST
- This is enabled by the concept of a pretrained reference model, which prioritizes examples that most resemble the data it was trained on
3. How does JEST compare to prior art?
- JEST and its Flexi-JEST variant surpass state-of-the-art models on multiple benchmarks, using up to 13× fewer training iterations and 10× fewer FLOPs
- JEST also enables strong data quality bootstrapping, where a reference model trained on a small curated dataset can guide the curation of a much larger dataset
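The bootstrapping recipe reduces to a two-stage pipeline; a minimal sketch follows, in which all callables are hypothetical placeholders for the training routines. The point is only the data flow: small curated data trains a reference model, which then steers JEST's selection over a much larger uncurated dataset.

```python
def bootstrap_curation(small_curated_data, large_uncurated_data,
                       pretrain_reference, train_with_jest):
    """Hypothetical data-quality bootstrapping pipeline (all arguments
    are placeholder callables/datasets, not the paper's API)."""
    # Stage 1: train a reference model on a small, well-curated dataset.
    reference = pretrain_reference(small_curated_data)
    # Stage 2: use learnability scores against that reference to jointly
    # select sub-batches from the large uncurated dataset while training.
    learner = train_with_jest(large_uncurated_data, reference)
    return learner
```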