
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

🌈 Abstract

The article examines the impact of introducing new factual knowledge through fine-tuning on the tendency of large language models (LLMs) to hallucinate responses that are not grounded in their pre-existing knowledge. The authors design a controlled setup focused on closed-book question answering to study this effect.

🙋 Q&A

[01] Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

1. What is the key conjecture explored in the article? The article explores the common conjecture that exposure to new factual knowledge during fine-tuning may encourage large language models (LLMs) to hallucinate factually incorrect responses, as the model is essentially trained to generate facts that are not grounded in its pre-existing knowledge.

2. How do the authors design their study to isolate the effect of new knowledge? The authors design a controlled setup focused on closed-book question answering, where they vary the proportion of fine-tuning examples that introduce new knowledge, while controlling for other factors.

3. What are the key findings from the study?

  • LLMs struggle to acquire new factual knowledge through fine-tuning: examples that introduce new knowledge are learned significantly more slowly than those consistent with the model's pre-existing knowledge.
  • However, as the examples that introduce new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate responses that are not grounded in its pre-existing knowledge.

4. What are the practical implications of the findings?

  • The findings highlight the risk of introducing new factual knowledge through fine-tuning, and support the view that LLMs mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.
  • Mitigating overfitting through early stopping, or by filtering out fine-tuning examples that introduce new knowledge, can help reduce the risk of hallucinations (see the sketch below).
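As a rough illustration of the filtering mitigation, here is a minimal Python sketch that drops examples introducing new knowledge before fine-tuning. It assumes each example has already been annotated with a SliCK-style knowledge category (introduced in section [02] below); the dict field names are illustrative, not taken from the paper.

```python
def filter_new_knowledge(examples):
    """Keep only examples consistent with the model's pre-existing knowledge,
    i.e. drop the Unknown category before fine-tuning."""
    return [ex for ex in examples if ex["category"] != "Unknown"]

# Toy usage: only the first example survives the filter.
train_set = [
    {"question": "Q1", "answer": "A1", "category": "HighlyKnown"},
    {"question": "Q2", "answer": "A2", "category": "Unknown"},
]
filtered_train_set = filter_new_knowledge(train_set)
```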

[02] Quantifying Knowledge in LLMs

1. What is the SliCK approach and how does it categorize (question, answer) pairs? SliCK is a method proposed by the authors to categorize (question, answer) pairs into four knowledge categories based on a continuous measure (PCorrect) that quantifies the agreement between model-generated answers and the ground-truth labels. The categories are: HighlyKnown, MaybeKnown, WeaklyKnown, and Unknown.

2. How does the PCorrect measure work and what is its purpose? PCorrect estimates how likely the model is to generate the correct answer to a given question when prompted with random few-shot exemplars, under both greedy decoding and higher decoding temperatures. This measure is used to annotate each (question, answer) pair with its knowledge category with respect to the model.
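For concreteness, here is a minimal Python sketch of how such a categorization could be implemented. It is an approximation, not the authors' exact protocol: the `model.generate` interface, the prompt format, the 4-shot exemplars, the sample counts, and the 0.5 sampling temperature are all assumptions. The category boundaries used here are one plausible reading: Unknown when the model never predicts the correct answer at any temperature (as noted in the insights below), HighlyKnown when greedy decoding is always correct, MaybeKnown when it is sometimes correct, and WeaklyKnown when only temperature sampling ever succeeds.

```python
import random
from enum import Enum

class KnowledgeCategory(Enum):
    HIGHLY_KNOWN = "HighlyKnown"
    MAYBE_KNOWN = "MaybeKnown"
    WEAKLY_KNOWN = "WeaklyKnown"
    UNKNOWN = "Unknown"

def normalize(text):
    """Crude exact-match normalization (lowercase, strip whitespace)."""
    return text.strip().lower()

def p_correct(model, question, answer, exemplar_pool, temperature, n_samples=16):
    """Estimate PCorrect: the fraction of model answers that match the ground
    truth, re-drawing a random few-shot prompt for each sample.
    `model.generate(prompt, temperature=...)` is an assumed interface."""
    hits = 0
    for _ in range(n_samples):
        shots = random.sample(exemplar_pool, k=4)  # random few-shot exemplars
        prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in shots)
        prompt += f"\nQ: {question}\nA:"
        prediction = model.generate(prompt, temperature=temperature)
        hits += int(normalize(prediction) == normalize(answer))
    return hits / n_samples

def slick_category(model, question, answer, exemplar_pool):
    """Assign a (question, answer) pair to one of the four SliCK categories
    with respect to `model`, based on greedy and sampled PCorrect."""
    p_greedy = p_correct(model, question, answer, exemplar_pool, temperature=0.0)
    if p_greedy == 1.0:
        return KnowledgeCategory.HIGHLY_KNOWN  # always correct under greedy decoding
    if p_greedy > 0.0:
        return KnowledgeCategory.MAYBE_KNOWN   # sometimes correct under greedy decoding
    # Greedy decoding never succeeds; check whether temperature sampling ever does.
    p_sampled = p_correct(model, question, answer, exemplar_pool, temperature=0.5)
    if p_sampled > 0.0:
        return KnowledgeCategory.WEAKLY_KNOWN
    return KnowledgeCategory.UNKNOWN           # never correct, i.e. new knowledge
```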

3. What are the key insights from the analysis of the SliCK categories?

  • The Unknown category effectively captures (question, answer) pairs for which the model never predicts the correct answer, indicating a lack of relevant knowledge.
  • The three Known categories (HighlyKnown, MaybeKnown, WeaklyKnown) capture different degrees of the model's knowledge, which is shown to be important for understanding the model's behavior during fine-tuning.

[03] How Harmful are Unknown Examples?

1. What is the key finding regarding the impact of Unknown fine-tuning examples on performance? Higher proportions of Unknown fine-tuning examples lead to performance degradation, regardless of the fine-tuning duration. This indicates that Unknown examples are less useful than Known examples.

2. How do the authors determine if Unknown examples are harmful or just neutral? By creating ablated variants of the fine-tuning dataset that contain only Known examples, the authors show that Unknown examples are actually harmful, especially when the model is trained for longer (CONVERGENCE). With early stopping (EARLY_STOP), the Unknown examples have a neutral effect.

3. What insight do the authors gain from analyzing the training dynamics? The authors find that Unknown fine-tuning examples are fitted substantially more slowly than Known examples. This suggests that LLMs struggle to acquire new factual knowledge through fine-tuning and instead mostly learn to expose their pre-existing knowledge using the Known examples.
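One way to surface these training dynamics is to periodically measure, for each knowledge category, the fraction of training examples the model already reproduces exactly. The sketch below assumes examples carry a pre-computed `category` label and that `model.generate` performs greedy decoding when `temperature=0.0`; both are illustrative assumptions.

```python
from collections import defaultdict

def fraction_fitted_by_category(model, train_examples):
    """Return, per knowledge category, the fraction of training examples the
    model currently 'fits', i.e. whose greedy generation exactly matches the
    ground-truth answer. Field names are illustrative."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in train_examples:
        totals[ex["category"]] += 1
        prediction = model.generate(ex["question"], temperature=0.0)
        if prediction.strip().lower() == ex["answer"].strip().lower():
            hits[ex["category"]] += 1
    return {category: hits[category] / totals[category] for category in totals}

# Evaluating this after every epoch would show the fitted fraction of Unknown
# examples rising much more slowly than that of the Known categories.
```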

4. How do the authors quantify the impact of Known vs Unknown examples on accuracy using a linear model? The authors fit a linear regression showing that fitting Unknown examples has a negative impact on test accuracy, while fitting Known examples has a positive impact. The two effects have roughly equal magnitudes, indicating that the eventual fitting of Unknown examples is the primary driver of the performance degradation.
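A hedged sketch of how such a regression could be set up follows; the exact predictors used by the authors (e.g., counts versus fractions of fitted examples, per-run versus per-checkpoint measurements) are assumptions here.

```python
import numpy as np

def accuracy_regression(known_fitted, unknown_fitted, test_accuracy):
    """Ordinary least squares fit of
        test_accuracy ~ b0 + b_kn * known_fitted + b_ukn * unknown_fitted.
    Each argument is a 1-D sequence with one entry per fine-tuning run or
    checkpoint. The reported finding corresponds to b_kn > 0 and b_ukn < 0
    with roughly equal magnitudes."""
    y = np.asarray(test_accuracy, dtype=float)
    design = np.column_stack([
        np.ones_like(y),
        np.asarray(known_fitted, dtype=float),
        np.asarray(unknown_fitted, dtype=float),
    ])
    (b0, b_kn, b_ukn), *_ = np.linalg.lstsq(design, y, rcond=None)
    return b0, b_kn, b_ukn
```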

[04] Understanding Knowledge Types: Their Value and Impact

1. What is the key finding regarding the importance of MaybeKnown fine-tuning examples? Surprisingly, a model fine-tuned only on HighlyKnown examples does not yield the best results. The authors find that incorporating MaybeKnown fine-tuning examples, which represent facts the model is less certain about, is important for the model to correctly handle such examples at inference time.

2. What is the impact of WeaklyKnown and Unknown fine-tuning examples on overfitting? The authors find that fine-tuning on WeaklyKnown and Unknown examples increases the risk of overfitting, leading to a significant performance drop from EARLY_STOP to CONVERGENCE. This is in contrast to the more robust performance of models fine-tuned on MaybeKnown or HighlyKnown examples.

3. How does the composition of the fine-tuning dataset (e.g., DNatural) affect performance compared to single-category variants? DNatural, which reflects the natural distribution of the four knowledge categories, performs on par with DMaybeKnown at EARLY_STOP. However, its performance degrades significantly at CONVERGENCE, where it underperforms DMaybeKnown, indicating that the WeaklyKnown and Unknown examples present in DNatural still lead to overfitting.
