Many-Shot In-Context Learning
Abstract
The article discusses scaling in-context learning (ICL) in large language models (LLMs) by increasing the number of examples provided in the context at inference time from the few-shot to the many-shot regime. The key findings are:
- Scaling the number of in-context examples leads to significant performance gains across a wide variety of generative and discriminative tasks.
- To mitigate the need for human-generated outputs in many-shot ICL, the article explores two new settings: "Reinforced ICL", which uses model-generated chain-of-thought rationales, and "Unsupervised ICL", which removes rationales from the prompt altogether.
- The article also analyzes how the learning dynamics of ICL change from the few-shot to the many-shot regime, finding that many-shot ICL can overcome pre-training biases and learn high-dimensional functions with numerical inputs.
Q&A
[01] Scaling In-Context Learning (ICL)
1. What are the key findings regarding scaling the number of in-context examples for ICL?
- Scaling the number of in-context examples from few-shot to many-shot leads to significant performance gains across a wide variety of tasks, including translation, summarization, planning, reward modeling, mathematical problem solving, scientific question-answering, and algorithmic reasoning.
- For example, in machine translation from English to the low-resource languages Kurdish and Tamil, using the entire development set (997 examples) as in-context examples outperformed the 1-shot Gemini prompt on the test set by 4.5% and 1.5% respectively, establishing new state-of-the-art results (a prompt-assembly sketch follows this list).
- In summarization, many-shot ICL achieved performance remarkably close to specialized summarization models fine-tuned on the task datasets.
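As referenced above, here is a minimal sketch of how such a many-shot prompt can be assembled. The `build_many_shot_prompt` helper and the "Source:"/"Target:" template are illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of many-shot prompt assembly for translation.
# The "Source:"/"Target:" template is an assumed format, not the
# paper's exact one.

def build_many_shot_prompt(examples, query, n_shots):
    """examples: list of (source_text, target_translation) pairs."""
    blocks = [f"Source: {src}\nTarget: {tgt}" for src, tgt in examples[:n_shots]]
    blocks.append(f"Source: {query}\nTarget:")  # the model completes the target
    return "\n\n".join(blocks)
```

Scaling `n_shots` from a handful of examples to the full 997-example development set is what moves the setup from the few-shot to the many-shot regime.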
2. What are the limitations of scaling in-context examples observed in the experiments?
- For some tasks like summarization, performance on the original dataset (XSum) declined when using more than 50 in-context examples. This was attributed to the model occasionally generating summaries with fabricated dates and times, despite the absence of such data in the in-context examples.
- The order of in-context examples was found to significantly impact performance, even in the many-shot setting, leading to large variations in performance across different subdomains of a task.
[02] Reinforced and Unsupervised ICL
1. Why were Reinforced ICL and Unsupervised ICL introduced?
- Many-shot ICL could be limited by the availability of high-quality human-generated rationales or demonstrations, particularly for complex reasoning tasks.
- To address this, Reinforced ICL uses model-generated chain-of-thought rationales instead of human rationales, filtering them based on answer correctness.
- Unsupervised ICL goes a step further by removing rationales from the prompt altogether, prompting the model only with the inputs (e.g., problems); a sketch of both settings follows this list.
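A minimal sketch of both settings, assuming a hypothetical `model.generate` call and a caller-supplied `extract_answer` helper; the sampling details (prompt wording, number of attempts) are illustrative, not the paper's exact procedure.

```python
def reinforced_icl_examples(model, problems, answers, extract_answer,
                            attempts_per_problem=4):
    """Reinforced ICL sketch: sample chain-of-thought rationales from the
    model itself and keep only those whose final answer matches the ground
    truth. `model.generate` and `extract_answer` are assumed placeholders."""
    kept = []
    for problem, gold in zip(problems, answers):
        for _ in range(attempts_per_problem):
            rationale = model.generate(f"Q: {problem}\nA: Let's think step by step.")
            if extract_answer(rationale) == gold:  # filter on answer correctness
                kept.append((problem, rationale))
                break
    return kept

def unsupervised_icl_prompt(problems, query):
    """Unsupervised ICL sketch: the prompt lists only unsolved inputs,
    with no rationales or answers, followed by the query to solve."""
    listed = "\n\n".join(f"Problem: {p}" for p in problems)
    return f"{listed}\n\nProblem: {query}\nSolution:"
```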
2. How did Reinforced and Unsupervised ICL perform compared to ICL with human-written rationales?
- On mathematical problem-solving tasks like Hendrycks MATH, both Reinforced and Unsupervised ICL outperformed ICL with ground-truth solutions in both the few-shot and many-shot regimes.
- On the GPQA question-answering benchmark, Reinforced ICL with model-generated rationales matched or outperformed ICL with human-written rationales, especially in the few-shot setting.
- On the BIG-Bench Hard algorithmic reasoning tasks, Reinforced ICL strongly outperformed the standard 3-shot chain-of-thought prompt, often achieving the best performance with just 3 shots.
[03] Analyzing Many-Shot In-Context Learning
1. How does many-shot ICL help overcome pre-training biases?
- Experiments on the Financial PhraseBank sentiment analysis dataset showed that with few-shot ICL, the model struggled to overcome pre-existing biases from pre-training.
- However, as the number of shots increased, performance on flipped and abstract labels (which conflicted with pre-training biases) dramatically improved, approaching that of the default labels.
- This indicates that, with sufficiently many in-context examples, LLMs can overcome pre-training biases (see the label-scheme sketch below).
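A minimal sketch of the three label schemes implied above; the concrete flipped mapping and the abstract names ("A"/"B"/"C") are assumptions in the spirit of the experiment.

```python
# Three label schemes for probing pre-training bias on sentiment data.
# The flipped mapping and abstract names are assumed stand-ins.
DEFAULT  = {"negative": "negative", "neutral": "neutral", "positive": "positive"}
FLIPPED  = {"negative": "positive", "neutral": "neutral", "positive": "negative"}
ABSTRACT = {"negative": "A",        "neutral": "B",       "positive": "C"}

def relabel(examples, mapping):
    """examples: (sentence, sentiment) pairs, e.g. from Financial PhraseBank.
    Returns prompt-ready pairs under the chosen label scheme."""
    return [(text, mapping[label]) for text, label in examples]
```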
2. What insights did the analysis of learning non-natural language tasks provide?
- Many-shot ICL was able to substantially outperform random-chance accuracy on high-dimensional linear classification tasks, nearly matching the performance of a strong k-nearest neighbors baseline.
- On the sequential parity task, many-shot ICL with up to 8192 shots surpassed the performance of a GPT-2 Medium model trained from scratch on 20× more input-output examples (sketches of both synthetic tasks follow this list).
- These results suggest the potential of many-shot learning to adapt to new tasks and domains that might be misaligned with an LLM's training data.
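Minimal sketches of the two synthetic tasks and the k-nearest-neighbors baseline, with assumed dimensionality and data generation; the paper's exact constructions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_classification_data(n_examples, dim=16):
    """High-dimensional linear classification: the label is the sign of a
    hidden linear function w.x (dim=16 is an assumed value)."""
    w = rng.normal(size=dim)                 # hidden decision boundary
    x = rng.normal(size=(n_examples, dim))
    y = (x @ w > 0).astype(int)
    return x, y

def knn_predict(train_x, train_y, query, k=5):
    """The k-nearest-neighbors baseline mentioned above: majority vote
    over the k closest in-context examples."""
    dists = np.linalg.norm(train_x - query, axis=1)
    return int(train_y[np.argsort(dists)[:k]].mean() > 0.5)

def sequential_parity(bits):
    """Sequential parity: the target at step t is the parity of bits[:t+1]."""
    return np.cumsum(bits) % 2
```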
3. What were the key findings regarding the use of negative log-likelihood (NLL) as a predictor of ICL performance?
- While NLL was found to decrease predictably as the context length increased, it was not a reliable predictor of downstream task performance, especially for problem-solving and reasoning tasks.
- For example, NLL continued to decrease even as performance plateaued or declined for tasks like MATH and GPQA.
- The authors conclude that NLL may not be a good proxy for predicting final performance in problem-solving domains, as there can be many potentially correct reasoning paths the model can take (a sketch of the per-token NLL computation follows).
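For reference, a minimal sketch of the per-token NLL metric discussed above, assuming access to per-token log-probabilities of the target text under the model.

```python
def average_nll(token_logprobs):
    """Average negative log-likelihood over target tokens, given their
    log-probabilities under the model. Lower NLL means the targets are
    more likely, but, as noted above, this need not track downstream
    accuracy on problem-solving tasks."""
    return -sum(token_logprobs) / len(token_logprobs)

# Example: average_nll([-0.1, -2.3, -0.5]) ≈ 0.97
```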