
Thinking about High-Quality Human Data

🌈 Abstract

The article discusses the importance of high-quality human-annotated data for training modern deep learning models, and surveys techniques for ensuring data quality through careful task design, rater selection and training, and data aggregation and cleaning.

🙋 Q&A

[01] High-Quality Data for Model Training

1. What are the key steps involved in collecting high-quality human-annotated data?

  • Task design: Designing clear and simple task workflows to improve clarity and reduce complexity
  • Rater selection and training: Selecting annotators with matched skillsets and providing necessary training, as well as ongoing feedback and calibration
  • Data collection and aggregation: Applying ML techniques to clean, filter and smartly aggregate the collected data to identify true labels

2. What is the "wisdom of the crowd" concept and how has it been applied in human data collection?

  • The "wisdom of the crowd" concept, first mentioned in a 1907 Nature paper, suggests that the middlemost estimate from a large group of people can be very close to the true value.
  • Early studies like Callison-Burch (2009) showed that non-expert human annotators on platforms like Amazon Mechanical Turk can provide reliable evaluations for tasks like machine translation, even outperforming expert-generated gold references.

3. What are some techniques for aggregating labels from multiple raters?

  • Majority voting: Selecting the label chosen by the most raters as the final label (see the code sketch after this list)
  • Raw agreement: Measuring the percentage of other raters agreeing with each rater
  • Cohen's Kappa: Measuring inter-rater agreement while accounting for agreement by chance
  • Probabilistic graph modeling: Using techniques like MACE to model rater competence and identify potential "spammers"
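A minimal sketch of the first two ideas from the list, majority voting and Cohen's kappa, assuming a small in-memory matrix of labels; the toy `annotations` data and variable names are illustrative, not from any specific annotation tool.

```python
from collections import Counter

# rows = items, columns = raters; each cell is the label a rater assigned
annotations = [
    ["spam", "spam", "ham"],
    ["ham",  "ham",  "ham"],
    ["spam", "ham",  "spam"],
    ["ham",  "spam", "ham"],
]

def majority_vote(labels):
    """Return the most common label among raters for one item."""
    return Counter(labels).most_common(1)[0][0]

aggregated = [majority_vote(item) for item in annotations]
print(aggregated)  # ['spam', 'ham', 'spam', 'ham']

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters: observed agreement corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label at random,
    # estimated from each rater's empirical label distribution.
    labels = set(rater_a) | set(rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

rater_1 = [row[0] for row in annotations]
rater_2 = [row[1] for row in annotations]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

MACE-style probabilistic modeling additionally learns a per-rater competence parameter instead of treating all raters equally, which is why it can down-weight likely spammers.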

[02] Rater Disagreement and Annotation Paradigms

1. What are the two contrasting paradigms for human data annotation discussed in the article?

  • Descriptive paradigm: Encourages annotator subjectivity and tries to model diverse beliefs
  • Prescriptive paradigm: Discourages annotator subjectivity and tries to consistently apply one belief

2. What are the pros and cons of each paradigm?

Descriptive paradigm:

  • Pros: Can help identify subjective entries and embrace diversity
  • Cons: Metrics like rater disagreement cannot be used to measure data quality, and models trained on this data cannot be optimized for a single preset behavior

Prescriptive paradigm:

  • Pros: More aligned with standard NLP setup, easier to do quality control
  • Cons: Expensive and challenging to create high-quality annotation guidelines, cannot capture interpretable diversity of beliefs

3. What are some factors that can lead to rater disagreement?

  • Annotator identity (e.g. demographic background)
  • Topic of the content being annotated
  • Severity level of the content (e.g. extreme vs. benign)
  • Stochastic errors or inconsistency at the individual rater level

[03] Techniques for Identifying Mislabeled Data

1. How can influence functions be used to identify potentially mislabeled data?

  • Influence functions measure the effect of upweighting or removing a training data point on the model parameters and loss function
  • By measuring the influence of a data point on its own loss (self-influence), we can identify samples that are likely to be mislabeled, as removing them would improve the model's performance; the standard formulas are reproduced below
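For reference, these are the standard influence-function definitions from Koh & Liang (2017), on which this line of work builds; here $\hat{\theta}$ denotes the trained parameters, $L$ the loss, and $H_{\hat{\theta}}$ the Hessian of the average training loss.

```latex
% Effect of upweighting a training point z on the parameters:
\mathcal{I}_{\text{up,params}}(z)
  = \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0}
  = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta})

% Effect of upweighting z on the loss at a test point z_test:
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top
    H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})

% Self-influence: setting z_test = z scores how much a training point
% affects its own loss; unusually large values flag likely mislabels.
\mathcal{I}_{\text{up,loss}}(z, z)
```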

2. What are the key ideas behind the "Data Maps" and "Forgettable Examples" methods?

  • Data Maps tracks the model's confidence and variability in predictions during training to identify "hard-to-learn" samples, which are more likely to be mislabeled
  • Forgettable Examples identifies samples that undergo frequent "forgetting events" (flipping from correctly classified to misclassified) during training, which are also more likely to be mislabeled; both statistics are sketched in the code after this list
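A minimal sketch of the training-dynamics statistics behind both methods, assuming you have logged, per epoch, the model's probability for each sample's gold label and whether its prediction was correct; the arrays below are random placeholders, not the papers' original code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_epochs, num_samples = 10, 5
prob_true_label = rng.uniform(size=(num_epochs, num_samples))  # placeholder log
correct = prob_true_label > 0.5                                # placeholder log

# Data Maps: confidence = mean probability of the gold label across epochs,
# variability = its standard deviation. Low-confidence samples fall into the
# "hard-to-learn" region, which is enriched in label errors.
confidence = prob_true_label.mean(axis=0)
variability = prob_true_label.std(axis=0)
hard_to_learn = np.argsort(confidence)[: num_samples // 2]

# Forgettable examples: count forgetting events, i.e. transitions from
# correct at epoch e to incorrect at epoch e + 1. Frequently forgotten
# samples are more likely to carry wrong labels.
forgetting_events = np.logical_and(correct[:-1], ~correct[1:]).sum(axis=0)

print("confidence:", np.round(confidence, 2))
print("variability:", np.round(variability, 2))
print("forgetting events:", forgetting_events)
print("hard-to-learn candidates:", hard_to_learn)
```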

3. How does the AUM (Area Under the Margin) method work to detect mislabeled data?

  • AUM measures the area under the margin (difference between logits of true and next-highest class) for each sample during training
  • Mislabeled samples are expected to have smaller margins due to the tension between generalization and the wrong supervised signal
  • A threshold is determined using "threshold samples" that are deliberately mislabeled by assigning them to a fake extra class, and samples whose AUM falls below this threshold are flagged as potentially mislabeled (see the sketch below)
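A minimal sketch of the AUM statistic, assuming `logits[e][i]` holds the per-class logits for sample i at epoch e and `labels[i]` its assigned (possibly wrong) label; the toy tensors, the choice of which rows act as threshold samples, and the 99th-percentile cutoff are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_epochs, num_samples, num_classes = 10, 6, 4
logits = rng.normal(size=(num_epochs, num_samples, num_classes))  # placeholder
labels = rng.integers(num_classes, size=num_samples)              # placeholder

def margins(epoch_logits, labels):
    """Margin = logit of the assigned label minus the largest other logit."""
    assigned = epoch_logits[np.arange(len(labels)), labels]
    masked = epoch_logits.copy()
    masked[np.arange(len(labels)), labels] = -np.inf
    return assigned - masked.max(axis=1)

# AUM = average margin over training epochs; mislabeled samples tend to end
# up with small or negative AUM.
aum = np.mean([margins(logits[e], labels) for e in range(num_epochs)], axis=0)

# Threshold from "threshold samples": a subset deliberately assigned to a
# fake extra class; here we simply pretend the first two rows played that role.
threshold = np.percentile(aum[:2], 99)
flagged = np.where(aum < threshold)[0]
print("AUM:", np.round(aum, 2))
print("flagged as possibly mislabeled:", flagged)
```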

