
Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall
Abstract
The article focuses on evaluating the factual knowledge recall of large language models (LLMs). It introduces FACT-Bench, a comprehensive benchmark for assessing LLMs' ability to recall factual knowledge learned during pretraining, and presents a holistic assessment of 31 LLMs across 10 model families, examining the factors that affect their knowledge recall.
Q&A
[01] Introduction
1. What are the key challenges identified in evaluating the factuality of LLMs' generated outputs?
- The article identifies four key challenges in evaluating the factuality of LLMs:
  - Making the questions simple enough to require only knowledge recall, rather than complex reasoning or aggregating information from multiple sources
  - Ensuring the questions are fair and query knowledge that exists in the pretraining data of all LLMs
  - Making the questions diverse and representative
  - Making the questions specific enough that the answer is unique and grounded in a knowledge source
2. How does the FACT-Bench benchmark address these challenges?
- Simplicity: The benchmark constructs questions based on Wikidata triplets to elicit knowledge from LLMs.
- Validity: The benchmark selects triplets whose subject has a Wikipedia article and whose object also appears in the same article.
- Diversity: The benchmark covers 20 domains, 134 property types, and 3 answer types (entities, dates, and numbers).
- Specificity: The benchmark manually selects property types likely to yield unique answers and uses prompt engineering to generate specific questions.
[02] FACT-Bench
1. What are the key characteristics of the FACT-Bench dataset?
- Simplicity: Questions are based on Wikidata triplets to elicit knowledge recall.
- Validity: Triplets are selected such that the subject has a Wikipedia article and the object appears in the same article.
- Diversity: The dataset covers 20 domains, 134 property types, and 3 answer types (entities, dates, and numbers).
- Specificity: Property types are manually selected to yield unique answers, and prompts are engineered to generate specific questions (a toy sketch of the triplet-to-question step follows this list).
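A minimal sketch of the triplet-to-question idea, assuming a small hand-written template per property type; the property IDs, templates, and example triplet below are illustrative, not FACT-Bench's actual prompt-engineered questions.

```python
# Sketch: converting a Wikidata triplet (subject, property, object) into a
# question-answer pair. Templates and the example triplet are illustrative.

TEMPLATES = {
    # Wikidata property ID -> question template (hypothetical mapping)
    "P50": "Who is the author of {subject}?",         # author
    "P569": "In which year was {subject} born?",      # date of birth
    "P1082": "What is the population of {subject}?",  # population
}

def triplet_to_qa(subject, property_id, obj):
    """Return a QA pair if we have a unique-answer template for this property type."""
    template = TEMPLATES.get(property_id)
    if template is None:
        return None  # skip property types without a unique-answer template
    return {"question": template.format(subject=subject), "answer": obj}

print(triplet_to_qa("Pride and Prejudice", "P50", "Jane Austen"))
# {'question': 'Who is the author of Pride and Prejudice?', 'answer': 'Jane Austen'}
```

In practice the benchmark relies on prompt engineering rather than fixed templates, but the underlying mapping from (subject, property, object) to a question with a unique answer is the same.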
2. How is the FACT-Bench dataset constructed and evaluated?
- The dataset consists of 20K question-answer pairs, with 5K for training and 15K for evaluation.
- Evaluation metrics include Exact Match (EM), F1, and a new metric, "Contains", which checks whether any of the ground-truth answers appears in the prediction (see the sketch after this list).
- The upper-bound performance is estimated through human validation of a 2K subset, which yields an estimated 90% accuracy for the full 15K set and 100% for the validated 2K "Premium2k" subset.
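A minimal sketch of how these three metrics are typically computed for short-answer QA; the SQuAD-style answer normalization here is an assumption, not FACT-Bench's exact implementation.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, answers):
    """EM: prediction equals some ground-truth answer after normalization."""
    return any(normalize(prediction) == normalize(a) for a in answers)

def f1(prediction, answers):
    """Token-level F1, taking the max over all ground-truth answers."""
    def f1_single(pred, gold):
        pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
        overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)
    return max(f1_single(prediction, a) for a in answers)

def contains(prediction, answers):
    """'Contains': true if any ground-truth answer appears inside the prediction."""
    return any(normalize(a) in normalize(prediction) for a in answers)

print(exact_match("Jane Austen", ["Jane Austen"]))                   # True
print(contains("It was written by Jane Austen.", ["Jane Austen"]))   # True
```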
[03] Benchmarking LLMs
1. What are the key findings from benchmarking 31 LLMs across 10 model families?
- Instruction-tuning hurts knowledge recall: Pretraining-only models consistently outperform their instruction-tuned counterparts.
- Positive effect of model scaling: Larger models outperform smaller ones across all model families.
- Significant gap to the upper bound: Even the best-performing model, GPT-4, falls well short of the estimated upper bound.
- LLMs struggle with long-tail entities and certain answer types: Consistent with previous findings, LLMs perform worse on less popular entities and on questions whose answers are dates or numbers (a popularity-bucketing sketch follows this list).
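To make the long-tail finding concrete, here is a minimal sketch that buckets evaluation examples by subject popularity and scores each bucket; the `popularity` field (e.g. Wikipedia pageviews) and the `is_correct` callable are illustrative assumptions, not part of the benchmark's released format.

```python
from collections import defaultdict

def accuracy_by_popularity(examples, predictions, is_correct, num_buckets=4):
    """Group examples into popularity quantiles and report accuracy per bucket.

    examples:    list of dicts with 'popularity' (e.g. subject pageviews) and 'answer'
    predictions: model outputs aligned with `examples`
    is_correct:  callable(prediction, answer) -> bool, e.g. the 'Contains' metric above
    """
    ranked = sorted(range(len(examples)), key=lambda i: examples[i]["popularity"])
    bucket_size = max(1, len(ranked) // num_buckets)
    scores = defaultdict(list)
    for rank, i in enumerate(ranked):
        bucket = min(rank // bucket_size, num_buckets - 1)  # last bucket absorbs remainder
        scores[bucket].append(is_correct(predictions[i], examples[i]["answer"]))
    return {b: sum(v) / len(v) for b, v in sorted(scores.items())}
```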
2. How do in-context exemplars affect LLMs' factual knowledge recall?
- Counterfactual in-context exemplars (demonstrations paired with incorrect answers) significantly degrade factual knowledge recall in large models.
- The degradation is driven by exemplars that contradict knowledge the model already holds and grows with the number of such exemplars (a prompt-construction sketch follows this list).
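A minimal sketch of how counterfactual few-shot prompts might be assembled, by swapping in answers from other exemplars; the exemplar format and the swap strategy are illustrative assumptions, not the paper's exact setup.

```python
import random

def build_prompt(exemplars, test_question, num_counterfactual=0, seed=0):
    """Assemble a few-shot QA prompt, replacing the answers of the first
    `num_counterfactual` exemplars with answers drawn from other exemplars,
    so those demonstrations contradict the correct facts."""
    rng = random.Random(seed)
    answer_pool = [e["answer"] for e in exemplars]
    lines = []
    for i, ex in enumerate(exemplars):
        answer = ex["answer"]
        if i < num_counterfactual:
            # pick a different exemplar's answer as the incorrect answer
            answer = rng.choice([a for a in answer_pool if a != ex["answer"]])
        lines.append(f"Q: {ex['question']}\nA: {answer}")
    lines.append(f"Q: {test_question}\nA:")
    return "\n\n".join(lines)

exemplars = [
    {"question": "Who wrote Pride and Prejudice?", "answer": "Jane Austen"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "In which year did World War II end?", "answer": "1945"},
]
print(build_prompt(exemplars, "Who painted the Mona Lisa?", num_counterfactual=2))
```

Varying `num_counterfactual` while holding the exemplar set fixed is one way to separate the effect of contradiction from the effect of how many contradictory demonstrations appear in context.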
3. What are the findings from fine-tuning experiments on LLaMA-7B?
- Fine-tuning on known knowledge (facts the model already recalls) is beneficial and outperforms fine-tuning on unknown or mixed knowledge.
- Fine-tuning on unknown knowledge teaches the model to hallucinate, corroborating the hypothesis from prior work (a sketch of the known/unknown split is given after this list).
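A minimal sketch of one way to partition fine-tuning data into known vs. unknown examples by checking whether the base model already answers them correctly; `generate_answer` and `is_correct` are placeholder callables, and this simple check is an assumption rather than the paper's exact protocol.

```python
def split_known_unknown(model, qa_pairs, generate_answer, is_correct):
    """Partition QA pairs by whether the base (pre-fine-tuning) model
    already recalls the answer.

    generate_answer(model, question) -> str   # placeholder inference call
    is_correct(prediction, answer)   -> bool  # e.g. the 'Contains' metric above
    """
    known, unknown = [], []
    for qa in qa_pairs:
        prediction = generate_answer(model, qa["question"])
        (known if is_correct(prediction, qa["answer"]) else unknown).append(qa)
    return known, unknown

# Per the article's finding: fine-tuning on `known` is the beneficial setting,
# while fine-tuning on `unknown` encourages hallucination.
```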