https://arxiv.org/pdf/2401.16420v1
Abstract
The article introduces InternLM-XComposer2, a cutting-edge vision-language model that excels in free-form text-image composition and comprehension. It highlights the model's ability to go beyond conventional vision-language understanding and adeptly craft interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation.
Q&A
[01] Introduction
1. What are the key advancements of InternLM-XComposer2 compared to its predecessor, InternLM-XComposer? InternLM-XComposer2 represents a significant advancement over its predecessor, InternLM-XComposer, in both text-image composition and comprehension. It is adept at producing high-quality, integrated text-image articles from a variety of free-form inputs, such as detailed specifications, structured outlines, and reference images, serving a wide range of application contexts. In the realm of multimodal understanding, it demonstrates exceptional capabilities in detailed perception, logical reasoning, and extensive knowledge integration.
2. What are the two critical design elements that enable the appealing capabilities of InternLM-XComposer2? The two critical design elements are:
- Partial LoRA (P-LoRA): The Partial LoRA design harmonizes the model's abilities in composition and comprehension by applying additional LoRA parameters exclusively to image tokens during the forward pass, while language tokens retain the original architecture. This selective enhancement ensures robust performance in both the visual and textual domains.
- High-quality and Diverse Data Foundation: The quality and diversity of the training data are pivotal. The dataset for free-form text-image composition excels in adhering to complex instructions, customization with text and image for tailored content, high-quality and stylistically diverse writing, and versatile text editing including condensing, expanding, and revising.
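The P-LoRA idea described above can be sketched as a linear layer in which a low-rank update is added only for image tokens, while language tokens see the frozen base weight alone. This is an illustrative sketch in PyTorch, not the paper's actual implementation; the class name, shapes, and rank are assumptions:

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Linear layer that applies a low-rank (LoRA) update only to image tokens.

    Language tokens pass through the frozen base weight alone; image tokens
    additionally receive the LoRA branch B(A(x)). Names and shapes here are
    illustrative, not the paper's exact code.
    """

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features); image_mask: (batch, seq_len) bool
        out = self.base(x)
        lora_out = self.lora_b(self.lora_a(x))
        # add the low-rank update only at positions marked as image tokens
        return out + lora_out * image_mask.unsqueeze(-1).to(x.dtype)
```

Because the mask gates the LoRA branch per position, the same layer processes an interleaved text-image sequence in one pass, training only the low-rank parameters while leaving the language model's behavior on text tokens untouched.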
[02] Experiments
1. How does InternLM-XComposer2 perform compared to closed-source APIs and previous open-source SOTA models on various benchmarks? InternLM-XComposer2 demonstrates competitive performance with closed-source APIs and significantly outperforms existing open-source models across a range of benchmarks, including MathVista, MMMU, AI2D, MME, MMBench, MMBench-Chinese, SEED-Bench (Image), LLaVA-Bench (In-the-Wild), QBench, MM-Vet, HallusionBench, and ChartVQA. Notably, it achieves SOTA results in 6 out of the 12 benchmarks with only 7B parameters, showcasing its remarkable proficiency in multimodal understanding.
2. How does InternLM-XComposer2 perform on the CreationBench benchmark for evaluating the creativity of language models? On the CreationBench benchmark from OpenCompass, InternLM-XComposer2 showcases outstanding performance, scoring an impressive 6.24 overall when evaluated without the GPT-4 referenced answer. Even when evaluated with the GPT-4 reference, the model maintained strong performance, demonstrating its ability to generate responses with high levels of creativity and logical structure, critical for user engagement and satisfaction in conversational AI applications.