SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Abstract
The paper introduces SPIQA (Scientific Paper Image Question Answering), the first large-scale QA dataset designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science. The dataset contains 270K questions that require reasoning over figures, tables, and textual content. The paper also proposes three tasks to evaluate the capabilities of multimodal large language models (MLLMs) in understanding scientific papers: direct QA with figures and tables, direct QA with the full paper, and Chain-of-Thought (CoT) QA. Additionally, the authors introduce LLMLogScore (L3Score), a novel metric for evaluating free-form QA responses based on the confidence of large language models. Extensive experiments on the SPIQA dataset with 12 prominent foundation models highlight the potential for developing specialized systems for scientific QA in the future.
Q&A
[01] Introduction
1. What are the key limitations of existing question-answering (QA) datasets based on scientific papers?
- Existing QA datasets are limited in scale and focus solely on textual content, overlooking the wealth of information presented in figures and tables.
2. What is the goal of introducing the SPIQA dataset?
- To address the limitation of existing datasets, SPIQA is introduced as the first large-scale QA dataset specifically designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science.
3. What are the three tasks proposed in the paper to evaluate the capabilities of multimodal large language models (MLLMs)?
- The three tasks are:
- Direct QA with figures and tables
- Direct QA with full paper
- Chain-of-Thought (CoT) QA
4. What is the novel metric introduced in the paper for evaluating free-form QA responses?
- The paper introduces LLMLogScore (L3Score), a metric that uses the log-likelihood probabilities generated by a large language model to evaluate the quality of candidate answers.
[02] SPIQA Dataset and Tasks
1. What are the key guidelines followed during the collection and curation of the SPIQA dataset?
- The dataset was curated from 25,859 computer science research papers published at top-tier conferences between 2018 and 2023.
- The questions and answers focus on the figures, tables, and textual content of the papers, requiring a holistic understanding of the research.
- The dataset includes three tasks: direct QA with figures and tables, direct QA with full paper, and Chain-of-Thought (CoT) QA.
2. How was the SPIQA dataset generated and filtered?
- The questions and answers were automatically generated using the Gemini 1.5 Pro model, and then manually filtered to ensure high quality.
- The dataset also includes questions from the QASA and QASPER datasets that require reasoning over figures and tables.
3. What are the key statistics and splits of the SPIQA dataset?
- SPIQA contains 270,194 question-answer-rationale triplets across 25,859 computer science papers.
- The dataset is divided into training, validation, and three evaluation splits (test-A, test-B, test-C).
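For orientation, here is a minimal sketch of iterating over one SPIQA-style split. It assumes each split is released as a JSON file keyed by paper ID with per-paper QA lists; the file name and the "qa", "question", "answer", and "rationale" field names are illustrative assumptions and may need adjusting to the official release schema.
```python
# Hedged sketch: walking over one SPIQA split stored as JSON keyed by paper id.
# Field names ("qa", "question", "answer", "rationale") are assumptions about
# the release format, not a confirmed schema.
import json
from pathlib import Path

def iter_qa(split_path: str):
    """Yield (paper_id, question, answer, rationale) tuples from one split file."""
    with Path(split_path).open(encoding="utf-8") as f:
        papers = json.load(f)
    for paper_id, paper in papers.items():
        for qa in paper.get("qa", []):
            yield paper_id, qa["question"], qa["answer"], qa.get("rationale")

if __name__ == "__main__":
    # Placeholder path; point this at a downloaded split file.
    for paper_id, question, answer, rationale in iter_qa("SPIQA_train.json"):
        print(paper_id, question[:80])
        break
```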
[03] LLMLogScore (L3Score): An Improved Metric for Free-form QA
1. What are the limitations of existing LLM-based metrics for evaluating free-form QA responses?
- Current metrics like LAVE, LIMA, and Prometheus-Vision rely on predefined rating scales and do not consider the confidence of the language models in their evaluations.
2. How does the proposed LLMLogScore (L3Score) metric address these limitations?
- L3Score directly uses the log-likelihood probabilities generated by a large language model (GPT-4o) to evaluate the semantic similarity between the candidate and ground-truth answers.
- It does not require a predefined rating scale and incorporates the model's confidence in the evaluation.
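To make the idea concrete, the sketch below shows one way to compute an L3Score-style value with the OpenAI chat API and log-probabilities enabled. The judge prompt wording, the top-5 cutoff, and the simple Yes/No normalization are assumptions for illustration, not the authors' released implementation or exact scoring rules.
```python
# Minimal sketch of the L3Score idea: ask an LLM judge whether the candidate
# answer matches the ground truth, read the log-probabilities of the Yes/No
# tokens from the first generated token, and normalize them into a score.
import math
from openai import OpenAI  # assumes the official openai>=1.x client

client = OpenAI()  # requires OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are given a question, a ground-truth answer, and a candidate answer.\n"
    "Question: {question}\n"
    "Ground-truth answer: {gold}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer semantically equivalent to the ground-truth answer? "
    "Reply with a single word: Yes or No."
)

def l3_score_sketch(question: str, gold: str, candidate: str,
                    model: str = "gpt-4o") -> float:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold=gold, candidate=candidate)}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,  # inspect the top-5 alternatives for the first token
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    p_yes = sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "yes")
    p_no = sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "no")
    if p_yes + p_no == 0.0:
        return 0.0  # judge produced neither token; treat as no support
    return p_yes / (p_yes + p_no)  # confidence-weighted score in [0, 1]
```
Because the score is derived from the judge's token probabilities rather than a coarse rating scale, a hesitant "Yes" yields a lower value than a confident one, which is the property the metric is designed to capture.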
[04] Experiments
1. What are the key findings from the experiments conducted on the SPIQA dataset?
- Closed-source models like Gemini, GPT-4, and Claude-3 generally outperform open-source models on the SPIQA tasks.
- Fine-tuning open-source models like InstructBLIP and LLaVA 1.5 on the SPIQA training set leads to significant performance improvements.
- Chain-of-Thought (CoT) prompting and access to the full paper text both improve the performance of the baseline models.
2. What are the key insights from the ablation studies?
- Captions accompanying the figures and tables are crucial for the models to comprehend the content.
- Models struggle more with understanding complex plots, charts, and tables compared to schematic diagrams.