
Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall
Abstract
The article focuses on evaluating the factual knowledge recall of large language models (LLMs). It introduces FACT-Bench, a comprehensive benchmark for assessing LLMs' ability to recall factual knowledge learned during pretraining, and presents a holistic assessment of 31 LLMs across 10 model families, examining the factors that affect their knowledge recall.
Q&A
[01] Introduction
1. What are the key challenges identified in evaluating the factuality of LLMs' generated outputs?
- The article identifies four key challenges in evaluating the factuality of LLMs:
  - Making the questions simple enough to require only knowledge recall, rather than complex reasoning or aggregating information from multiple sources
  - Ensuring the questions are fair and query knowledge that exists in the pretraining data of all LLMs
  - Making the questions diverse and representative
  - Making the questions specific enough that the answer is unique and grounded in a knowledge source
2. How does the FACT-Bench benchmark address these challenges?
- Simplicity: The benchmark constructs questions based on Wikidata triplets to elicit knowledge from LLMs.
- Validity: The benchmark selects triplets whose subject has a Wikipedia article and whose object also appears in the same article.
- Diversity: The benchmark covers 20 domains, 134 property types, and 3 answer types (entities, dates, and numbers).
- Specificity: The benchmark manually selects property types likely to yield unique answers and uses prompt engineering to generate specific questions.
[02] FACT-Bench
1. What are the key characteristics of the FACT-Bench dataset?
- Simplicity: Questions are based on Wikidata triplets to elicit knowledge recall.
- Validity: Triplets are selected such that the subject has a Wikipedia article and the object appears in the same article.
- Diversity: The dataset covers 20 domains, 134 property types, and 3 answer types (entities, dates, and numbers).
- Specificity: Property types are manually selected to yield unique answers, and prompts are engineered to generate specific questions (a toy sketch of the triplet-to-question step follows this list).
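A minimal sketch of the triplet-to-question idea, assuming a small hand-written template per property type; the property IDs, templates, and example triplet below are illustrative, not FACT-Bench's actual prompt-engineered questions.

```python
# Sketch: converting a Wikidata triplet (subject, property, object) into a
# question-answer pair. Templates and the example triplet are illustrative.

TEMPLATES = {
    # Wikidata property ID -> question template (hypothetical mapping)
    "P50": "Who is the author of {subject}?",         # author
    "P569": "In which year was {subject} born?",      # date of birth
    "P1082": "What is the population of {subject}?",  # population
}

def triplet_to_qa(subject, property_id, obj):
    """Return a QA pair if we have a unique-answer template for this property type."""
    template = TEMPLATES.get(property_id)
    if template is None:
        return None  # skip property types without a unique-answer template
    return {"question": template.format(subject=subject), "answer": obj}

print(triplet_to_qa("Pride and Prejudice", "P50", "Jane Austen"))
# {'question': 'Who is the author of Pride and Prejudice?', 'answer': 'Jane Austen'}
```

In practice the benchmark relies on prompt engineering rather than fixed templates, but the underlying mapping from (subject, property, object) to a question with a unique answer is the same.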
2. How is the FACT-Bench dataset constructed and evaluated?
- The dataset consists of 20K question-answer pairs, with 5K for training and 15K for evaluation.
- Evaluation metrics include Exact Match (EM), F1, and a new metric, "Contains", which checks whether any of the ground-truth answers appears in the prediction (see the sketch after this list).
- The upper-bound performance is estimated through human validation of a 2K subset, which yields an estimated 90% accuracy for the full 15K set and 100% for the validated 2K "Premium2k" subset.
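A minimal sketch of how these three metrics are typically computed for short-answer QA; the SQuAD-style answer normalization here is an assumption, not FACT-Bench's exact implementation.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, answers):
    """EM: prediction equals some ground-truth answer after normalization."""
    return any(normalize(prediction) == normalize(a) for a in answers)

def f1(prediction, answers):
    """Token-level F1, taking the max over all ground-truth answers."""
    def f1_single(pred, gold):
        pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
        overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)
    return max(f1_single(prediction, a) for a in answers)

def contains(prediction, answers):
    """'Contains': true if any ground-truth answer appears inside the prediction."""
    return any(normalize(a) in normalize(prediction) for a in answers)

print(exact_match("Jane Austen", ["Jane Austen"]))                   # True
print(contains("It was written by Jane Austen.", ["Jane Austen"]))   # True
```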
[03] Benchmarking LLMs
1. What are the key findings from benchmarking 31 LLMs across 10 model families?
- Instruction-tuning hurts knowledge recall: Pretraining-only models consistently outperform their instruction-tuned counterparts.
- Positive effect of model scaling: Larger models outperform smaller ones across all model families.
- Significant gap to the upper bound: Even the best-performing model, GPT-4, falls well short of the estimated upper bound.
- LLMs struggle with long-tail entities and certain answer types: Consistent with previous findings, LLMs perform worse on less popular entities and on questions whose answers are dates or numbers (a popularity-bucketing sketch follows this list).
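To make the long-tail finding concrete, here is a minimal sketch that buckets evaluation examples by subject popularity and scores each bucket; the `popularity` field (e.g. Wikipedia pageviews) and the `is_correct` callable are illustrative assumptions, not part of the benchmark's released format.

```python
from collections import defaultdict

def accuracy_by_popularity(examples, predictions, is_correct, num_buckets=4):
    """Group examples into popularity quantiles and report accuracy per bucket.

    examples:    list of dicts with 'popularity' (e.g. subject pageviews) and 'answer'
    predictions: model outputs aligned with `examples`
    is_correct:  callable(prediction, answer) -> bool, e.g. the 'Contains' metric above
    """
    ranked = sorted(range(len(examples)), key=lambda i: examples[i]["popularity"])
    bucket_size = max(1, len(ranked) // num_buckets)
    scores = defaultdict(list)
    for rank, i in enumerate(ranked):
        bucket = min(rank // bucket_size, num_buckets - 1)  # last bucket absorbs remainder
        scores[bucket].append(is_correct(predictions[i], examples[i]["answer"]))
    return {b: sum(v) / len(v) for b, v in sorted(scores.items())}
```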
2. How do in-context exemplars affect LLMs' factual knowledge recall?
- Counterfactual in-context exemplars (demonstrations paired with incorrect answers) significantly degrade factual knowledge recall in large models.
- The degradation is driven by exemplars that contradict knowledge the model already holds and grows with the number of such exemplars (a prompt-construction sketch follows this list).
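A minimal sketch of how counterfactual few-shot prompts might be assembled, by swapping in answers from other exemplars; the exemplar format and the swap strategy are illustrative assumptions, not the paper's exact setup.

```python
import random

def build_prompt(exemplars, test_question, num_counterfactual=0, seed=0):
    """Assemble a few-shot QA prompt, replacing the answers of the first
    `num_counterfactual` exemplars with answers drawn from other exemplars,
    so those demonstrations contradict the correct facts."""
    rng = random.Random(seed)
    answer_pool = [e["answer"] for e in exemplars]
    lines = []
    for i, ex in enumerate(exemplars):
        answer = ex["answer"]
        if i < num_counterfactual:
            # pick a different exemplar's answer as the incorrect answer
            answer = rng.choice([a for a in answer_pool if a != ex["answer"]])
        lines.append(f"Q: {ex['question']}\nA: {answer}")
    lines.append(f"Q: {test_question}\nA:")
    return "\n\n".join(lines)

exemplars = [
    {"question": "Who wrote Pride and Prejudice?", "answer": "Jane Austen"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "In which year did World War II end?", "answer": "1945"},
]
print(build_prompt(exemplars, "Who painted the Mona Lisa?", num_counterfactual=2))
```

Varying `num_counterfactual` while holding the exemplar set fixed is one way to separate the effect of contradiction from the effect of how many contradictory demonstrations appear in context.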
3. What are the findings from fine-tuning experiments on LLaMA-7B?
- Fine-tuning on known knowledge (facts the model already recalls) is beneficial and outperforms fine-tuning on unknown or mixed knowledge.
- Fine-tuning on unknown knowledge teaches the model to hallucinate, corroborating the hypothesis from prior work (a sketch of the known/unknown split is given after this list).
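A minimal sketch of one way to partition fine-tuning data into known vs. unknown examples by checking whether the base model already answers them correctly; `generate_answer` and `is_correct` are placeholder callables, and this simple check is an assumption rather than the paper's exact protocol.

```python
def split_known_unknown(model, qa_pairs, generate_answer, is_correct):
    """Partition QA pairs by whether the base (pre-fine-tuning) model
    already recalls the answer.

    generate_answer(model, question) -> str   # placeholder inference call
    is_correct(prediction, answer)   -> bool  # e.g. the 'Contains' metric above
    """
    known, unknown = [], []
    for qa in qa_pairs:
        prediction = generate_answer(model, qa["question"])
        (known if is_correct(prediction, qa["answer"]) else unknown).append(qa)
    return known, unknown

# Per the article's finding: fine-tuning on `known` is the beneficial setting,
# while fine-tuning on `unknown` encourages hallucination.
```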