Summarize by Aili

Benchmarking Benchmark Leakage in Large Language Models

🌈 Abstract

The article examines benchmark dataset leakage in large language models (LLMs). As the use of pre-training data expands, benchmark leakage has become more common, which undermines benchmark validity and leads to unfair comparisons. The authors introduce a detection pipeline built on two metrics, Perplexity and N-gram Accuracy, to identify potential leakage. Analyzing 31 LLMs on mathematical reasoning tasks, they find substantial evidence of training-set and even test-set misuse. They offer recommendations on model documentation, benchmark setup, and future evaluations, including a "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.

🙋 Q&A

[01] Benchmark Leakage Detection

1. What are the key challenges in detecting benchmark leakage?

No guarantee that the test data itself is leakage-free
Difficult to determine the threshold score for leakage due to multiple influencing factors
Unknown utilization of benchmarks during pre-training
Inaccessible model weights for closed-source models

2. How does the proposed detection pipeline address these challenges?

Uses Perplexity and N-gram Accuracy as atomic metrics to capture both continuous and discrete aspects of language modeling
Synthesizes reference benchmarks to provide a comparison baseline and mitigate the issue of potentially contaminated test sets
Normalizes the metric differences to enable meaningful comparisons across different models
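
As a rough illustration of the steps above (score the original benchmark, score a synthesized reference, then normalize the difference), the following Python sketch computes perplexity from per-token log-probabilities and a relative change between the two scores. The function names and the exact normalization are assumptions made for illustration; this summary does not give the authors' formula.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_change(original: float, reference: float) -> float:
    """Change from the original-benchmark score to the reference score,
    scaled by the original score (an assumed normalization)."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return (reference - original) / original


# Hypothetical log-probabilities a model assigns to the same sample
# in its original form and in a synthesized (rephrased) reference form.
ppl_original = perplexity([-0.1, -0.2, -0.1])
ppl_reference = perplexity([-1.0, -1.2, -0.8])
print(relative_change(ppl_original, ppl_reference))
```

Under this sketch, a model whose perplexity is much lower on the original text than on the reference would be a candidate for closer inspection. Scaling by the original score is what makes the value comparable across models with different baseline perplexities.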

3. How does the N-gram Accuracy metric help with instance-level leakage detection?

High accuracy in predicting n-grams of an example suggests a high probability that the sample was encountered during the training process
Leverages lenient metrics like ROUGE-L and edit distance similarity to account for potential data augmentation or reformatting
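
The instance-level check can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: `predict` is a hypothetical stand-in for a model's greedy continuation, the evenly spaced start points are an assumed sampling scheme, and `lcs_recall` is a simplified ROUGE-L-style score based on the longest common subsequence.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given a prefix and n, return the next n predicted tokens.
Predictor = Callable[[Sequence[str], int], List[str]]


def ngram_accuracy(tokens: Sequence[str], predict: Predictor,
                   n: int = 5, num_starts: int = 5) -> float:
    """Fraction of sampled start points where the predicted n-gram
    exactly matches the true n-gram that follows the prefix."""
    if len(tokens) <= n:
        raise ValueError("sample too short for the chosen n")
    last_start = len(tokens) - n  # last index that still leaves a full n-gram
    # Evenly spaced start points, each with a prefix of at least one token.
    starts = sorted({1 + (i * (last_start - 1)) // max(num_starts - 1, 1)
                     for i in range(num_starts)})
    hits = sum(list(predict(tokens[:s], n)) == list(tokens[s:s + n])
               for s in starts)
    return hits / len(starts)


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall(predicted: Sequence, target: Sequence) -> float:
    """Lenient ROUGE-L-style score: LCS length divided by target length."""
    return lcs_length(predicted, target) / len(target) if target else 0.0
```

Replacing the exact-match comparison in `ngram_accuracy` with a threshold on `lcs_recall` would give the lenient variant, which tolerates small reformatting differences between the benchmark text and what the model saw.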

[02] Evaluation of Existing Models

1. What were the key findings from the evaluation of 31 existing LLMs?

Many models, including well-known ones, may have inadvertently used benchmark training data to boost their performance on mathematical reasoning tasks, giving them an unfair advantage
The Aquila2 series and InternLM-2 (excluding the Base version) showed signs of potential benchmark data utilization
The N-gram Accuracy metric enabled the detection of instances where models could accurately predict n-grams from the training sets, suggesting potential data leakage

2. How did the authors recommend addressing the issue of data leakage in model development and evaluation?

Introduce the "Benchmark Transparency Card" to document the utilization of benchmarks during model training and evaluation
Construct benchmarks from the latest corpus to minimize the risk of overlap with pre-training data
Maintain private test sets and consider encrypting or dynamically updating benchmarks to guard against overfitting
Evaluate models using a variety of contemporary challenges, such as new exam questions, to provide a more balanced assessment of their capabilities


Shared by Daniel Chen
