
Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

🌈 Abstract

The article evaluates the factual knowledge recall of large language models (LLMs). It introduces FACT-Bench, a comprehensive benchmark for assessing LLMs' ability to recall factual knowledge learned during pretraining, and presents a holistic assessment of 31 LLMs across 10 model families, examining the factors that affect recall.

🙋 Q&A

[01] Introduction

1. What are the key challenges identified in evaluating the factuality of LLMs' generated outputs?

  • The article identifies four key challenges in evaluating the factuality of LLMs:
    • Making the questions simple enough that they require only knowledge recall, not complex reasoning or the aggregation of information from multiple sources
    • Ensuring the questions are fair and query knowledge that exists in the pretraining data of all LLMs
    • Making the questions diverse and representative
    • Making the questions specific enough so that the answer is unique and grounded in some knowledge source

2. How does the FACT-Bench benchmark address these challenges?

  • Simplicity: The benchmark constructs questions based on Wikidata triplets to elicit knowledge from LLMs.
  • Validity: The benchmark selects triplets whose subject has a Wikipedia article and whose object also appears in the same article.
  • Diversity: The benchmark covers 20 domains, 134 property types, and 3 answer types (entities, dates, and numbers).
  • Specificity: The benchmark manually selects property types likely to yield unique answers and uses prompt engineering to generate specific questions.

[02] FACT-Bench

1. What are the key characteristics of the FACT-Bench dataset?

  • Simplicity: Questions are based on Wikidata triplets to elicit knowledge recall.
  • Validity: Triplets are selected such that the subject has a Wikipedia article and the object appears in the same article (a small filtering sketch follows this list).
  • Diversity: The dataset covers 20 domains, 134 property types, and 3 answer types (entities, dates, and numbers).
  • Specificity: Property types are manually selected to yield unique answers, and prompts are engineered to generate specific questions.
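
The validity criterion above can be made concrete with a small filter over candidate triplets. The sketch below is a minimal illustration, not the benchmark's actual pipeline: the `ARTICLES` dictionary stands in for a real Wikipedia dump, and the example triplets are chosen only for demonstration.

```python
# Toy stand-in for a Wikipedia dump keyed by entity label (an assumption,
# not the benchmark's released data pipeline).
ARTICLES = {
    "Marie Curie": (
        "Marie Curie was a physicist and chemist who was awarded the "
        "Nobel Prize in Physics in 1903."
    ),
}

def is_valid_triplet(subject: str, prop: str, obj: str) -> bool:
    article = ARTICLES.get(subject)                # the subject must have an article
    return article is not None and obj in article  # the object must appear in that article

triplets = [
    ("Marie Curie", "award received", "Nobel Prize in Physics"),  # kept
    ("Marie Curie", "doctoral advisor", "Gabriel Lippmann"),      # dropped: not in the article text
]
valid = [t for t in triplets if is_valid_triplet(*t)]
print(valid)  # [('Marie Curie', 'award received', 'Nobel Prize in Physics')]
```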

2. How is the FACT-Bench dataset constructed and evaluated?

  • The dataset consists of 20K question-answer pairs, with 5K for training and 15K for evaluation.
  • Evaluation metrics include Exact Match (EM), F1 score, and a new metric, "Contains", which checks whether any ground-truth answer appears in the prediction (see the sketch below).
  • The upper-bound performance is estimated through human validation of a 2K subset: the estimated upper bound is about 90% for the full 15K evaluation set and 100% for the manually validated 2K "Premium2k" subset.
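
EM and F1 are standard question-answering metrics, and "Contains" is described above as checking whether any ground-truth answer appears in the prediction. The sketch below shows one plausible implementation, assuming SQuAD-style normalization and token-level F1; the benchmark's actual scoring code may differ in its details.

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (an assumed normalization)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction: str, answers: list[str]) -> float:
    return float(any(normalize(prediction) == normalize(a) for a in answers))

def token_f1(prediction: str, answers: list[str]) -> float:
    def score(pred: str, gold: str) -> float:
        pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
        common = sum(min(pred_toks.count(t), gold_toks.count(t)) for t in set(gold_toks))
        if common == 0:
            return 0.0
        precision, recall = common / len(pred_toks), common / len(gold_toks)
        return 2 * precision * recall / (precision + recall)
    return max(score(prediction, a) for a in answers)

def contains(prediction: str, answers: list[str]) -> float:
    pred = normalize(prediction)
    return float(any(normalize(a) in pred for a in answers))

pred = "Marie Curie won the Nobel Prize in Physics in 1903."
gold = ["Nobel Prize in Physics"]
print(exact_match(pred, gold), round(token_f1(pred, gold), 2), contains(pred, gold))
# 0.0 0.57 1.0
```

"Contains" is the most lenient of the three: a verbose but correct prediction scores 0 on EM and below 1 on F1, yet still counts as correct under "Contains".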

[03] Benchmarking LLMs

1. What are the key findings from benchmarking 31 LLMs across 10 model families?

  • Instruction-tuning hurts knowledge recall: Pretraining-only models consistently outperform their instruction-tuned counterparts.
  • Positive effect of model scaling: Larger models outperform smaller ones across all model families.
  • Significant gap to the upper bound: Even the best performance, achieved by GPT-4, still falls well short of the estimated upper bound.
  • LLMs struggle with long-tail entities and certain question types: Consistent with previous findings, LLMs perform poorly on less popular entities and on questions whose answers are dates or numbers.

2. How do in-context exemplars affect LLMs' factual knowledge recall?

  • Counterfactual in-context exemplars (with incorrect answers) lead to significant degradation of factual knowledge recall for large models.
  • The degradation is attributed to exemplars that contradict knowledge the model has already acquired, and it grows with the number of such exemplars (illustrated in the sketch below).
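
One way to picture this setup: prepend few-shot exemplars whose answers have been deliberately shuffled so that every demonstration pairs a question with a wrong answer. The sketch below is a hypothetical illustration of that idea; the exemplar questions and the rotation trick are assumptions, not the paper's actual prompt-construction code.

```python
# Hypothetical few-shot exemplars, used only for illustration.
exemplars = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
    ("What is the chemical symbol for gold?", "Au"),
]

def build_prompt(test_question: str, counterfactual: bool) -> str:
    answers = [answer for _, answer in exemplars]
    if counterfactual:
        # Rotate the answers by one position so every exemplar gets an incorrect answer.
        answers = answers[1:] + answers[:1]
    demos = "\n".join(
        f"Q: {question}\nA: {answer}"
        for (question, _), answer in zip(exemplars, answers)
    )
    return f"{demos}\nQ: {test_question}\nA:"

print(build_prompt("In which year did World War II end?", counterfactual=True))
```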

3. What are the findings from fine-tuning experiments on LLaMA-7B?

  • Fine-tuning on known knowledge (that the model is already familiar with) is beneficial and outperforms fine-tuning on unknown or mixed knowledge.
  • Fine-tuning on unknown knowledge teaches the model to hallucinate, verifying the hypothesis from previous work (one way to operationalize the known/unknown split is sketched below).
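
A plausible way to operationalize the known/unknown split is to probe the pretrained model first and label an example as "known" only if the model's own answer already contains a ground-truth answer. The sketch below is hypothetical: `generate` stands in for a call to the pretrained model and is not a specific library API, and the paper's actual criterion may differ.

```python
def split_known_unknown(examples, generate, contains):
    """Partition (question, answers) pairs by whether the model already answers them correctly."""
    known, unknown = [], []
    for question, answers in examples:
        prediction = generate(question)                       # query the pretrained model
        bucket = known if contains(prediction, answers) else unknown
        bucket.append((question, answers))
    return known, unknown

# Usage with a stub generator; in practice `generate` would query LLaMA-7B.
toy_examples = [("What is the capital of France?", ["Paris"])]
known, unknown = split_known_unknown(
    toy_examples,
    generate=lambda q: "The capital of France is Paris.",
    contains=lambda pred, answers: any(a.lower() in pred.lower() for a in answers),
)
print(len(known), len(unknown))  # 1 0
```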