Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
Abstract
The paper explores the relationship between the scale of language models (LMs) and their tendency to hallucinate, i.e., to generate factually incorrect information. To fully control the content of the training data, the authors build a dataset from a knowledge graph (KG) and train a series of increasingly large LMs on it. They find that, for a fixed dataset, larger and longer-trained LMs hallucinate less. However, keeping hallucinations to no more than 5% of the training data requires an LM an order of magnitude larger than what is currently considered optimal. The authors also study how hallucination detectors depend on scale and find an inverse relationship between the scale of the LM and the detectability of its hallucinations.
Q&A
[01] Controlling What an LM Knows
1. What is the core challenge in studying LM hallucinations? The core challenge is that typically we do not know what information the model was exposed to during training. Without this knowledge, it is difficult to determine whether the model output is wrong because (a) it has never been trained on a given fact, (b) it was trained on it but did not memorize it, or (c) the model memorized factually incorrect information that did appear in the training set.
2. How does using a Knowledge Graph (KG) dataset help address this challenge? Using a KG dataset provides full controllability of the factual content that the LMs are exposed to during training. Since the KG contains structured knowledge in the form of [subject, predicate, object] triplets, it is straightforward to query whether a generated fact from an LM indeed exists in the dataset or not, hence offering a quantifiable measure of hallucination.
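The check described above amounts to set membership over triplets. The sketch below is purely illustrative, assuming a tiny in-memory KG with hypothetical triplets and function names; it is not the paper's code.

```python
# Minimal sketch of KG-based hallucination checking (triplets and names are
# hypothetical): a completion counts as a hallucination iff the resulting
# [subject, predicate, object] triplet is absent from the KG.
kg_triplets = {
    ("Douglas Adams", "author_of", "The Hitchhiker's Guide to the Galaxy"),
    ("Douglas Adams", "born_in", "Cambridge"),
}

def is_hallucination(subject, predicate, generated_object):
    """True if the completed triplet does not appear in the KG."""
    return (subject, predicate, generated_object) not in kg_triplets

def hallucination_rate(generations):
    """Fraction of generated triplets not supported by the KG."""
    flags = [is_hallucination(s, p, o) for s, p, o in generations]
    return sum(flags) / len(flags)

# One faithful completion and one hallucinated one -> rate of 0.5.
generations = [
    ("Douglas Adams", "born_in", "Cambridge"),
    ("Douglas Adams", "born_in", "London"),
]
print(hallucination_rate(generations))  # 0.5
```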
3. How is the KG dataset constructed to enable studying the effect of scale? The authors design datasets that reflect three levels of visibility: 1) a fully visible set (FVS) that both the LMs and the detectors are trained on, 2) a partially visible set (PVS) that only the LMs are trained on, and 3) an invisible set (IVS) that neither the LM nor the detector has seen. They vary the sizes of FVS and PVS to study the effect of scale.
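A sketch of such a split, assuming illustrative fractions and a simple random partition (the paper's exact construction and proportions may differ):

```python
# Illustrative visibility split (fractions and the random partition are
# assumptions, not the paper's exact recipe).
import random

def split_visibility(triplets, fvs_frac=0.6, pvs_frac=0.3, seed=0):
    triplets = list(triplets)
    random.Random(seed).shuffle(triplets)
    n_fvs = int(len(triplets) * fvs_frac)
    n_pvs = int(len(triplets) * pvs_frac)
    fvs = triplets[:n_fvs]               # seen by both the LM and the detector
    pvs = triplets[n_fvs:n_fvs + n_pvs]  # seen by the LM only
    ivs = triplets[n_fvs + n_pvs:]       # seen by neither
    return fvs, pvs, ivs
```

Under this scheme the LM would be trained on FVS and PVS, the detector only on (outputs over) FVS, and IVS would serve to probe behavior on facts neither has seen, mirroring the three visibility levels above.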
[02] Hallucination Rate and How It Scales
1. What is the key finding about the relationship between LM scale and hallucination rate? For a fixed dataset size, larger and longer-trained LMs tend to hallucinate less. However, increasing the dataset size (adding more facts to learn) leads to a higher hallucination rate, given a fixed number of training epochs and LM size.
2. Why is multi-epoch training necessary to achieve low hallucination rates? Each unique fact in the KG dataset appears only once per epoch, so the data itself contains no repetition. Because the LM can only memorize a fact by encountering it repeatedly, multiple passes over the dataset are needed to drive the hallucination rate down.
3. What is the trade-off between hallucination rate and other model performance measures? The authors find that the longer training needed to achieve low hallucination rates can hurt the LM's generalization ability to unseen data. This presents a trade-off between hallucination rate and other performance metrics like out-of-distribution accuracy.
[03] Hallucination Detectability and How It Scales
1. What are the two main types of hallucination detection tasks explored in the paper? The paper explores two types of detection tasks:
- Sentence task: The detector ingests the original subject and predicate tokens together with the predicted object tokens, and judges whether the object is hallucinated.
- Token task: The detector takes the embedding of a generated token from a given layer of the trained LM and predicts whether that token is hallucinated (a minimal probe sketch follows this list).
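As a concrete but purely illustrative version of the token task, one could fit a linear probe on per-token hidden states. The synthetic embeddings and scikit-learn classifier below stand in for activations from a trained LM and are not the paper's detector architecture.

```python
# Hypothetical token-task probe (not the paper's detector): a linear classifier
# reads a single hidden-state embedding and predicts "hallucinated or not".
# Embeddings here are synthetic stand-ins for activations from a trained LM.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model, n = 64, 2000

# Synthetic data: hallucinated tokens (label 1) get a slightly shifted embedding.
labels = rng.integers(0, 2, size=n)
embeddings = rng.normal(size=(n, d_model)) + 0.5 * labels[:, None]

probe = LogisticRegression(max_iter=1000).fit(embeddings[:1500], labels[:1500])
scores = probe.predict_proba(embeddings[1500:])[:, 1]
print("held-out AUC:", roc_auc_score(labels[1500:], scores))
```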
2. What is the key finding about the relationship between LM scale and hallucination detectability? The authors find an inverse relationship between the scale of the LM and the detectability of its hallucinations. Larger, longer-trained LMs hallucinate less, but detecting their remaining hallucinations becomes increasingly hard.
3. Why is this inverse relationship observed? Larger, longer-trained LMs hallucinate on a smaller fraction of facts, which makes the detection problem more imbalanced: most outputs are correct, and the hallucinations that do remain are harder to separate from correct completions. Any apparent improvement in detector accuracy is therefore confounded by the already lower hallucination rate of the more powerful LMs, as the arithmetic below illustrates.
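To make the confound concrete, here is a back-of-the-envelope calculation with made-up numbers: when hallucinations are rare, a trivial detector already achieves high accuracy, so raw accuracy comparisons across LM scales say little on their own and a threshold-free measure such as AUC is more informative.

```python
# Illustrative arithmetic (all numbers invented): at a 5% hallucination rate,
# a detector that always predicts "not hallucinated" scores 95% accuracy.
hallucination_rate = 0.05
trivial_accuracy = 1.0 - hallucination_rate
print(trivial_accuracy)  # 0.95

# A detector that catches half the hallucinations at a 2% false-positive rate
# only edges past that trivial baseline in raw accuracy.
recall, fpr = 0.5, 0.02
accuracy = recall * hallucination_rate + (1 - fpr) * (1 - hallucination_rate)
print(round(accuracy, 3))  # 0.956
```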