A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Abstract
The article presents a careful examination of the performance of large language models (LLMs) on grade school arithmetic problems. It introduces a new dataset called Grade School Math 1000 (GSM1k), designed to mirror the established GSM8k benchmark, in order to investigate concerns about dataset contamination in LLM training. The key findings include: substantial accuracy drops (up to 13%) on GSM1k for many models, consistent overfitting in the Mistral and Phi model families, minimal overfitting among frontier models and the Llama2 family, and a positive correlation between a model's probability of generating GSM8k examples and its GSM8k-to-GSM1k performance gap.
Q&A
[01] Benchmark Creation
1. What is the purpose of creating the GSM1k dataset? The purpose of creating GSM1k is to investigate concerns about dataset contamination: data closely resembling benchmark questions may have leaked into the training data of LLMs, inflating benchmark performance without reflecting true reasoning ability.
2. How was the GSM1k dataset constructed? The GSM1k dataset was constructed entirely by human annotators, without any assistance from language models. Annotators were instructed to create novel grade school math problems similar in difficulty to those in GSM8k, as measured by the number of steps required to solve them. Extensive quality checks ensured that the problems were solvable using only basic arithmetic and had positive integer answers (a sketch of such a check appears after this list).
3. How does the GSM1k dataset compare to the GSM8k dataset in terms of difficulty? The creators of GSM1k took care to match the difficulty distribution of GSM1k problems to that of GSM8k, as measured by the number of steps required to solve each problem (see the step-distribution sketch below). Human evaluations also found similar solve rates on the two datasets, suggesting they are comparable in difficulty.
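The paper does not publish its annotation tooling, so the following is only a minimal sketch of the two mechanical criteria described above (positive integer answers, basic-arithmetic-only solutions), assuming annotators record solution steps as bare equations. All names here (CandidateProblem, passes_mechanical_checks) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CandidateProblem:
    """Hypothetical record for one annotator-written problem."""
    question: str
    solution_steps: list[str]  # assumed to be bare equations, e.g. "3 * 4 = 12"
    answer: str                # final answer exactly as the annotator wrote it

def passes_mechanical_checks(p: CandidateProblem) -> bool:
    """Check the two mechanical criteria: the answer must be a positive
    integer, and every solution step must use only basic arithmetic.
    (Human review handles everything else.)"""
    # Positive-integer answer check.
    try:
        value = int(p.answer.replace(",", "").strip())
    except ValueError:
        return False
    if value <= 0:
        return False
    # Basic-arithmetic check: each step may only use a small arithmetic alphabet.
    allowed = set("0123456789+-*/=(). ,")
    return all(set(step) <= allowed for step in p.solution_steps)

# Example: this candidate passes both checks.
ok = passes_mechanical_checks(CandidateProblem(
    question="Sam buys 3 packs of 4 pencils. How many pencils is that?",
    solution_steps=["3 * 4 = 12"],
    answer="12",
))
```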
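How closely two benchmarks line up on the step-count measure can be checked by comparing the fraction of problems at each step count. This is a toy sketch, not the paper's methodology: the step counts below are illustrative numbers, and the paper matched distributions during annotation rather than with code like this.

```python
from collections import Counter

def step_distribution(step_counts: list[int]) -> dict[int, float]:
    """Fraction of problems requiring each number of solution steps."""
    counts = Counter(step_counts)
    total = len(step_counts)
    return {steps: n / total for steps, n in sorted(counts.items())}

# Toy data standing in for hand-counted steps per problem in each benchmark.
gsm8k_steps = [2, 2, 3, 3, 3, 4, 5]
gsm1k_steps = [2, 3, 3, 3, 4, 4, 5]

print(step_distribution(gsm8k_steps))  # e.g. {2: 0.29, 3: 0.43, 4: 0.14, 5: 0.14}
print(step_distribution(gsm1k_steps))
```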
[02] Model Evaluation
1. What were the key findings from evaluating leading LLMs on the GSM1k dataset? Many LLMs showed substantial accuracy drops, of up to 13%, on GSM1k relative to GSM8k (the comparison is sketched after this list). This suggests that the strong performance of these models on GSM8k may have been partially due to dataset contamination rather than true reasoning ability.
2. Which model families showed the most evidence of overfitting? The Mistral and Phi model families in particular showed consistent overfitting across nearly all model sizes, with performance drops of around 10% between GSM8k and GSM1k.
3. Did all models show signs of overfitting? No, the authors found that frontier models, as well as the Llama2 family, showed minimal signs of overfitting. These models maintained similar performance on both GSM8k and GSM1k.
4. What was the relationship between a model's likelihood of generating GSM8k examples and its performance gap between GSM8k and GSM1k? The authors found a positive correlation between a model's probability of generating examples from GSM8k and its performance gap between GSM8k and GSM1k (see the correlation sketch below). This suggests that partial memorization of the GSM8k dataset may be one factor contributing to overfitting.
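The paper's own evaluation harness is open-sourced; the sketch below only illustrates the shape of the GSM8k-versus-GSM1k comparison, assuming a hypothetical generate() callable and last-number answer extraction (a common GSM8k grading heuristic, not necessarily the paper's exact one).

```python
import re

def extract_final_number(completion: str) -> str | None:
    """Grade by taking the last number in the completion, a common
    heuristic for GSM8k-style free-form answers."""
    numbers = re.findall(r"-?\d[\d,]*", completion)
    return numbers[-1].replace(",", "") if numbers else None

def accuracy(generate, problems) -> float:
    """generate: any callable mapping a prompt string to completion text
    (a hypothetical interface, not the paper's actual harness).
    problems: iterable of (question, gold_answer_string) pairs."""
    results = [extract_final_number(generate(q)) == gold for q, gold in problems]
    return sum(results) / len(results)

# The overfitting signal is the gap between the two benchmark scores:
# gap = accuracy(generate, gsm8k_problems) - accuracy(generate, gsm1k_problems)
```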
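As a sketch of how the memorization relationship in item 4 could be quantified, here is a plain Pearson correlation over one (log-likelihood, gap) point per model. The numbers are toy values, and the paper's exact statistic and likelihood normalization (e.g., per-character) may differ.

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient between two samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# One point per model (toy numbers, not the paper's measurements):
# x = mean log-likelihood the model assigns to GSM8k test examples
# y = accuracy gap, GSM8k minus GSM1k
loglik = [-0.58, -0.52, -0.47, -0.41]
gap = [0.01, 0.04, 0.08, 0.12]
print(pearson_r(loglik, gap))  # a positive r supports the memorization story
```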
[03] Dataset Release
1. When do the authors plan to publicly release the GSM1k dataset? The authors do not plan to release GSM1k immediately, in order to prevent future dataset contamination. They have committed to releasing it either when the top open-source models score over 95% on GSM1k or at the end of 2025, whichever comes first.
2. What steps will the authors take to enable public evaluation of the GSM1k dataset? The authors will run recurring evaluations of major open-source and closed-source model releases on GSM1k, and will open-source their entire evaluation code to allow the public to reproduce the results.