Data curation via joint example selection further accelerates multimodal learning
Abstract
The article discusses the importance of data curation in large-scale pretraining and summarizes a method called joint example selection (JEST), which jointly selects batches of data that are more effective for learning than examples selected independently. The key points are covered in the Q&A below.
Q&A
[01] Data Curation and Model-Based Selection
1. What are the key challenges with current data pipelines for large-scale pretraining?
- Current data pipelines rely heavily on manual curation, which is difficult and expensive to scale
- Model-based data curation holds promise for improving the slow, power-law scaling of large-scale pretraining across modalities
2. How do existing methods for model-based data curation work?
- Existing methods apply curation at the level of individual data points, but the quality of a batch is also a function of its composition
- In multimodal learning, contrastive objectives directly expose the interactions between examples in a batch
3. What is the key idea behind the proposed JEST method?
- JEST derives a simple and tractable algorithm to jointly select relevant 'sub-batches' of data from much larger 'super-batches' based on their model-based scores
- This allows JEST to accelerate learning beyond what is possible by selecting examples independently
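To make the super-batch/sub-batch framing concrete, here is a minimal sketch in Python. The per-example score array, the filtering ratio, and the top-k baseline are illustrative assumptions, not the paper's API; JEST replaces this independent top-k with the joint sampler sketched after section [02].

```python
import numpy as np

def select_subbatch_independent(scores: np.ndarray, filter_ratio: float) -> np.ndarray:
    """Baseline: independently keep the top-scoring examples of a super-batch.

    `scores` holds one model-based score per super-batch example (an
    illustrative assumption). With filter_ratio=0.9, a super-batch of
    32768 examples yields a sub-batch of 3276. JEST instead scores
    combinations of examples jointly; see the sampler after section [02].
    """
    keep = max(1, int(len(scores) * (1.0 - filter_ratio)))
    return np.argsort(scores)[-keep:]  # argsort is ascending; take the top
```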
[02] Joint Example Selection (JEST)
1. How does JEST leverage multimodal contrastive objectives?
- The contrastive loss of a batch decomposes into a sum of conditional losses for each example given the other examples
- JEST samples batches in proportion to their joint learnability, enabled by a sequential approach inspired by blocked Gibbs sampling (sketched in code after this section)
2. What are the different model-based scoring criteria considered?
- Hard learner: prioritize batches with high loss under the current learner model
- Easy reference: prioritize batches with low loss under a pretrained reference model
- Learnability: prioritize batches that are both unlearned and learnable, combining the hard learner and easy reference criteria
3. How does JEST compare to independent example selection methods?
- JEST significantly outperforms independent example selection, particularly at high filtering ratios where independent selection leads to performance degradation
- Learnability-based JEST yields the best scaling behavior compared to other prioritization criteria
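A minimal sketch of the sequential, blocked-Gibbs-style sampler described above, under strong simplifying assumptions: the per-example learnability vector and the pairwise interaction matrix below are hypothetical stand-ins for the conditional contrastive losses, which the actual algorithm derives from the learner's and reference model's logits.

```python
import numpy as np

def joint_select(learn_scores, pair_scores, n_chunks, chunk_size, rng):
    """Sample a sub-batch in proportion to (approximate) joint learnability.

    Illustrative inputs, not the paper's API:
      learn_scores[i]   -- learnability of example i alone
                           (learner loss minus reference loss),
      pair_scores[i, j] -- hypothetical pairwise interaction between
                           examples i and j from the contrastive objective.
    """
    super_size = learn_scores.shape[0]
    selected = []
    for _ in range(n_chunks):
        # Conditional learnability of each candidate given the chunks
        # selected so far: base score plus pairwise interaction terms.
        cond = learn_scores.copy()
        if selected:
            cond = cond + pair_scores[:, selected].sum(axis=1)
        cond[selected] = -np.inf  # never re-select an example
        # Sample one chunk without replacement, in proportion to exp(score),
        # in the spirit of blocked Gibbs sampling.
        probs = np.exp(cond - cond.max())
        probs /= probs.sum()
        chunk = rng.choice(super_size, size=chunk_size, replace=False, p=probs)
        selected.extend(chunk.tolist())
    return np.asarray(selected)

# Toy usage with random stand-in scores: select 256 of 2048 examples.
rng = np.random.default_rng(0)
learn = rng.normal(size=2048)
pairs = 0.1 * rng.normal(size=(2048, 2048))
sub_batch = joint_select(learn, pairs, n_chunks=16, chunk_size=16, rng=rng)
```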
[03] Efficient Scoring and Multi-Resolution Training
1. What challenges arise from scoring large super-batches in JEST?
- Scoring large super-batches increases the computational cost per iteration, reducing the efficiency gains in terms of total FLOPs
2. How does the Flexi-JEST variant address this?
- Flexi-JEST uses online model approximation techniques like FlexiViT to efficiently score super-batches at low resolution
- It also leverages multi-resolution training, where the learner is trained on a mix of full- and low-resolution images (a simplified sketch follows this section)
3. What are the trade-offs between JEST and Flexi-JEST?
- JEST maximizes training speed, reaching state-of-the-art performance with up to 13× fewer training iterations
- Flexi-JEST minimizes total FLOPs, reaching state-of-the-art performance with 9.9× fewer FLOPs than a comparable baseline
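A simplified sketch of the low-resolution scoring plus multi-resolution training loop. The pooling-based downsampler, the `score_fn`/`train_fn` callables, and the 50% filtering ratio are all illustrative placeholders; the actual method reuses a single FlexiViT-style learner at multiple patch sizes rather than a separate pooling step.

```python
import numpy as np

def downsample(images: np.ndarray, factor: int = 4) -> np.ndarray:
    """Cheap average-pooling downsampler standing in for FlexiViT-style
    low-resolution patching (assumes height/width divisible by `factor`)."""
    n, h, w, c = images.shape
    return images.reshape(n, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))

def flexi_jest_step(images, score_fn, train_fn, lowres_fraction=0.5):
    """One illustrative Flexi-JEST-style step: score the super-batch cheaply
    at low resolution, then train on a mix of low- and full-resolution data.
    `score_fn` and `train_fn` are placeholder callables, not the paper's API."""
    scores = score_fn(downsample(images))         # cheap super-batch scoring
    keep = np.argsort(scores)[len(images) // 2:]  # illustrative 50% filtering
    batch = images[keep]
    n_low = int(lowres_fraction * len(batch))
    train_fn(downsample(batch[:n_low]))           # low-resolution portion
    train_fn(batch[n_low:])                       # full-resolution portion
```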
[04] Data Quality Bootstrapping
1. How does the choice of reference model affect JEST performance?
- JEST performance is decoupled from the reference model's own performance, suggesting the importance of dataset size over dataset quality
- Scaling the reference dataset size (WebLI-curated++) leads to significant improvements in both reference model and JEST performance
2. What is the role of data curation in JEST?
- The ability to steer the data selection process towards the distribution of smaller, well-curated datasets is essential to the performance of JEST
- This is enabled by the concept of a pretrained reference model, which prioritizes examples that most resemble the data it was trained on
3. How does JEST compare to prior art?
- JEST and its Flexi-JEST variant surpass state-of-the-art models on multiple benchmarks, using up to 13× fewer training iterations and 10× fewer FLOPs
- JEST also enables strong data quality bootstrapping, where a reference model trained on a small curated dataset can guide the curation of a much larger dataset
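The bootstrapping recipe reduces to a two-stage pipeline; a minimal sketch follows, in which all callables are hypothetical placeholders for the training routines. The point is only the data flow: small curated data trains a reference model, which then steers JEST's selection over a much larger uncurated dataset.

```python
def bootstrap_curation(small_curated_data, large_uncurated_data,
                       pretrain_reference, train_with_jest):
    """Hypothetical data-quality bootstrapping pipeline (all arguments
    are placeholder callables/datasets, not the paper's API)."""
    # Stage 1: train a reference model on a small, well-curated dataset.
    reference = pretrain_reference(small_curated_data)
    # Stage 2: use learnability scores against that reference to jointly
    # select sub-batches from the large uncurated dataset while training.
    learner = train_with_jest(large_uncurated_data, reference)
    return learner
```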