A Survey on Benchmarks of Multimodal Large Language Models
Abstract
The paper presents a comprehensive review of 180 benchmarks and evaluations for Multimodal Large Language Models (MLLMs), organized around perception and understanding, cognition and reasoning, specific domains, key capabilities, and other modalities. Evaluating MLLMs is crucial for comparing models, understanding their strengths and limitations, and guiding their application in various fields. The survey covers the current state of MLLM evaluation, highlighting the rapid growth of this research area and the strong performance of models such as OpenAI's GPT-4 and Google's Gemini.
Q&A
[01] Perception and Understanding
1. What are the key aspects of evaluating MLLMs' perception and understanding capabilities?
- Evaluating MLLMs' accuracy in object identification and detection, understanding of scene context and object relationships, and ability to respond to questions about image content.
- Assessing MLLMs' fine-grained perception abilities, including visual grounding and object detection, fine-grained identification and recognition, and nuanced vision-language alignment.
- Evaluating MLLMs' image understanding, including multi-image understanding, implication understanding, and image quality and aesthetics perception.
2. What are some of the comprehensive evaluation benchmarks for MLLMs' perception and understanding?
- LLaVA-Bench, OwlEval, MME, MMBench, Open-VQA, TouchStone, SEED-Bench, MM-Vet, MDVP-Bench, LAMM, ChEF, and UniBench.
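A common protocol among such comprehensive benchmarks is accuracy on yes/no or multiple-choice questions about an image. As a rough illustration of that protocol, the sketch below scores a model on a JSONL file of questions; the `query_mllm` function and the field names (`image`, `question`, `options`, `answer`) are hypothetical placeholders, not the interface of any particular benchmark.

```python
# Minimal sketch of a multiple-choice perception evaluation loop.
# `query_mllm` is a hypothetical stand-in for whatever MLLM API is used;
# the JSON fields ("image", "question", "options", "answer") are illustrative.
import json

def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder: call your MLLM here and return its raw text reply."""
    raise NotImplementedError

def evaluate_multiple_choice(samples_path: str) -> float:
    correct, total = 0, 0
    with open(samples_path) as f:
        samples = [json.loads(line) for line in f]
    for s in samples:
        options = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(s["options"]))
        prompt = f"{s['question']}\n{options}\nAnswer with the option letter only."
        reply = query_mllm(s["image"], prompt).strip().upper()
        predicted = reply[0] if reply else ""          # take the first character as the chosen letter
        correct += int(predicted == s["answer"])       # ground truth stored as "A"/"B"/...
        total += 1
    return correct / max(total, 1)
```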
3. What are some of the fine-grained perception benchmarks for MLLMs?
- Flickr30k Entities, Visual7W, CODE, MagnifierBench, CV-Bench, P2GB, VisualCoT, Winoground, VALSE, VL-CheckList, ARO, and Eqben.
4. What are some of the image understanding benchmarks for MLLMs?
- Mementos, MileBench, MuirBench, MMIU, COMPBENCH, II-Bench, ImplicitAVE, and FABA-Bench.
[02] Cognition and Reasoning
1. What are the key aspects of evaluating MLLMs' cognition and reasoning capabilities?
- Assessing MLLMs' visual relation reasoning, vision-indispensable reasoning, and context-related reasoning.
- Evaluating MLLMs' knowledge-based question answering and knowledge editing abilities.
- Leveraging cognitive science principles to assess MLLMs' general intelligence and problem-solving skills through abstract visual reasoning, mathematical question answering, and multidisciplinary question answering.
2. What are some of the general reasoning benchmarks for MLLMs?
- VSR, What's Up, CRPE, MMRel, GSR-BENCH, CODIS, CFMM, and VL-ICL Bench.
3. What are some of the knowledge-based reasoning benchmarks for MLLMs?
- KB-VQA, FVQA, OK-VQA, A-OKVQA, SOK-Bench, MMEdit, and MIKE.
4. What are some of the intelligence and cognition benchmarks for MLLMs?
- RAVEN, MARVEL, VCog-Bench, M3GIA, MathVista, Math-Vision, and MATHCHECK-GEO.
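Math- and reasoning-oriented benchmarks of this kind typically need to pull a final numeric value or option letter out of a free-form model response before scoring. The sketch below shows one simple way to do that; the "The answer is ..." cue and the numeric tolerance are assumptions for illustration, not the official extraction rules of MathVista or any other benchmark listed above.

```python
# Minimal sketch of free-form answer checking for math-style benchmarks.
# The "answer is ..." cue and the numeric tolerance are illustrative assumptions.
import re

def extract_final_answer(reply: str) -> str:
    """Return the option letter or number following an 'answer' cue, else the last number."""
    hits = re.findall(r"answer\s*(?:is|:)?\s*([A-Ea-e]\b|-?\d+(?:\.\d+)?)", reply, flags=re.IGNORECASE)
    if hits:
        return hits[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reply)
    return numbers[-1] if numbers else ""

def is_correct(prediction: str, ground_truth: str, tol: float = 1e-3) -> bool:
    """Compare numerically when both sides parse as numbers, otherwise as option letters."""
    try:
        return abs(float(prediction) - float(ground_truth)) <= tol
    except ValueError:
        return prediction.strip().upper() == ground_truth.strip().upper()
```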
[03] Specific Domains
1. What are the key aspects of evaluating MLLMs' capabilities in specific domains?
- Assessing MLLMs' ability to integrate complex visual and textual information, adapt to decision-making roles in dynamic environments, and effectively process diverse cultural and linguistic data.
- Evaluating MLLMs' performance in specialized domains such as medicine, industry, and autonomous driving.
2. What are some of the text-rich VQA benchmarks for MLLMs?
- TextVQA, TextCaps, OCRBench, P2GB, and SEED-Bench-2-Plus.
3. What are some of the document-oriented question answering benchmarks for MLLMs?
- InfographicVQA, SPDocVQA, MP-DocVQA, DUDE, and MM-NIAH.
4. What are some of the chart-oriented question answering benchmarks for MLLMs?
- ChartQA, SciGraphQA, MMC-Benchmark, ChartBench, ChartX, and CharXiv.
[04] Key Capabilities
1. What are the key capabilities evaluated for MLLMs?
- Dialogue capabilities, including handling extended dialogues and accurately following instructions.
- Hallucination and trustworthiness, i.e., whether the model avoids describing content that is not actually present in the input and remains robust and safe when handling diverse or adversarial inputs, without producing harmful or inappropriate content.
2. What are some of the benchmarks for evaluating MLLMs' conversation abilities?
- DEMON, VisIT-Bench, CoIN, and MIA-Bench.
3. What are some of the benchmarks for evaluating MLLMs' hallucination?
- CHAIR, POPE, GAVIE, M-HalDetect, MMHAL-BENCH, and MHaluBench.
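CHAIR, the first metric in this list, quantifies object hallucination in generated captions: CHAIR_i is the fraction of mentioned objects that do not appear in the image, and CHAIR_s is the fraction of captions containing at least one such object. The sketch below computes both scores; the substring-based object matching against a fixed vocabulary is a simplification of the synonym matching used in practice.

```python
# Minimal sketch of the CHAIR hallucination metrics.
# Object extraction from captions is simplified to substring matching against a
# fixed vocabulary; real implementations use synonym lists tuned to MS-COCO.
from typing import Iterable

def chair_scores(captions: list[str],
                 gt_objects: list[set[str]],
                 vocabulary: Iterable[str]) -> tuple[float, float]:
    """Return (CHAIR_i, CHAIR_s) given per-image captions and ground-truth object sets."""
    hallucinated_mentions, total_mentions, hallucinated_captions = 0, 0, 0
    for caption, truth in zip(captions, gt_objects):
        mentioned = {obj for obj in vocabulary if obj in caption.lower()}
        hallucinated = mentioned - truth
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        hallucinated_captions += int(bool(hallucinated))
    chair_i = hallucinated_mentions / max(total_mentions, 1)   # instance-level hallucination rate
    chair_s = hallucinated_captions / max(len(captions), 1)    # caption-level hallucination rate
    return chair_i, chair_s
```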
4. What are some of the benchmarks for evaluating MLLMs' trustworthiness?
- BenchLMM, MMR, MAD-Bench, MM-SAP, VQAv2-IDK, MM-SPUBENCH, MM-SafetyBench, and JailBreakV-28K.
[05] Other Modalities
1. What are the key aspects of evaluating MLLMs' capabilities in other modalities?
- Assessing MLLMs' performance in understanding and interpreting video content, including temporal perception, long video understanding, and comprehensive evaluation.
- Evaluating MLLMs' audio understanding capabilities, including human speech processing, music understanding, and general audio comprehension.
- Assessing MLLMs' 3D scene perception and reasoning abilities.
2. What are some of the video understanding benchmarks for MLLMs?
- TimeIT, MVBench, Perception Test, VilMA, VITATECS, TempCompass, OsCaR, ADLMCQ, EgoSchema, MovieChat-1k, MLVU, and Event-Bench.
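Image-only MLLMs are often evaluated on video benchmarks like these by sampling a small, fixed number of frames and passing them as a multi-image input. The sketch below shows uniform frame sampling with OpenCV; the library choice and the default of 8 frames are illustrative assumptions, not part of any listed benchmark's protocol.

```python
# Minimal sketch of uniform frame sampling for video QA, assuming OpenCV is available.
# The default of 8 frames is illustrative, not mandated by any benchmark above.
import cv2

def sample_frames(video_path: str, num_frames: int = 8):
    """Return `num_frames` uniformly spaced BGR frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        index = int(i * max(total - 1, 0) / max(num_frames - 1, 1))  # spread indices over the clip
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```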
3. What are some of the audio understanding benchmarks for MLLMs?
- Dynamic-SUPERB, MuChoMusic, and AIR-Bench.
4. What are some of the 3D scene understanding benchmarks for MLLMs?
- ScanQA, LAMM, ScanReason, and M3DBench.