LLaVA-OneVision: Easy Visual Task Transfer
Abstract
The article presents LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating insights from the LLaVA-NeXT blog series. The key contributions are:
- LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video.
- The design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities, particularly in video understanding and cross-scenario tasks.
- The authors release the generated multimodal instruction data, codebase, model checkpoints, and a visual chat demo to the public.
Q&A
[01] Modeling
1. What are the key components of the LLaVA-OneVision model architecture? The model architecture inherits the minimalist design of the LLaVA series and consists of the following components (a minimal wiring sketch follows this list):
- LLM: Qwen-2 as the language model
- Vision Encoder: SigLIP as the visual encoder
- Projector: A 2-layer MLP to project image features into the word embedding space
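The wiring of these three components can be illustrated with a minimal PyTorch-style sketch. The class names, argument names, and LLM interface below are illustrative assumptions, not the actual LLaVA-OneVision codebase API; the released model instantiates SigLIP and Qwen-2 from their public checkpoints.

```python
# Minimal sketch of the LLaVA-style "vision encoder -> projector -> LLM" wiring.
# Names and interfaces are illustrative, not the official LLaVA-OneVision API.
import torch
import torch.nn as nn


class TwoLayerProjector(nn.Module):
    """2-layer MLP mapping vision features into the LLM's word embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(vision_feats)


class LlavaStyleLMM(nn.Module):
    """Connects a vision encoder (e.g. SigLIP) to an LLM (e.g. Qwen-2)."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # assumed to return (B, N_patches, vision_dim)
        self.projector = TwoLayerProjector(vision_dim, llm_dim)
        self.llm = llm                        # assumed to accept inputs_embeds of width llm_dim

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.projector(self.vision_encoder(pixel_values))
        # Projected visual tokens are placed alongside the text embeddings
        # before being fed into the language model.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```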
2. How does LLaVA-OneVision represent visual signals? LLaVA-OneVision uses an "AnyRes" strategy to handle visual inputs of different resolutions and aspect ratios:
- For single-image, it uses a large maximum spatial configuration to maintain the original image resolution.
- For multi-image, it only considers the base image resolution to save computational resources.
- For video, it resizes each frame to the base resolution and uses bilinear interpolation to reduce the number of tokens per frame (a sketch of this pooling step follows this list).
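The per-frame token reduction for the video branch can be sketched as below. This is a minimal sketch assuming a square patch grid from the vision encoder; the frame count, grid sizes, and target grid in the example are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of shrinking each video frame's visual-token grid with
# bilinear interpolation, as described for the video branch of AnyRes.
import torch
import torch.nn.functional as F


def pool_frame_tokens(frame_tokens: torch.Tensor, out_grid: int = 14) -> torch.Tensor:
    """frame_tokens: (num_frames, grid*grid, dim) output of the vision encoder."""
    num_frames, num_tokens, dim = frame_tokens.shape
    grid = round(num_tokens ** 0.5)
    # Reshape each frame's token sequence back into a 2D feature map.
    feats = frame_tokens.view(num_frames, grid, grid, dim).permute(0, 3, 1, 2)
    # Bilinear interpolation shrinks the grid, reducing tokens per frame.
    feats = F.interpolate(feats, size=(out_grid, out_grid),
                          mode="bilinear", align_corners=False)
    return feats.permute(0, 2, 3, 1).reshape(num_frames, out_grid * out_grid, dim)


# Example (assumed sizes): 32 frames with a 27x27 token grid of width 1152,
# reduced to 14x14 tokens per frame.
tokens = torch.randn(32, 27 * 27, 1152)
reduced = pool_frame_tokens(tokens)  # -> (32, 196, 1152)
```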
3. What are the key principles behind the data used for training LLaVA-OneVision? The training data consists of two main components:
- High-quality knowledge learning data, which includes re-captioned detailed descriptions, document/OCR data, and Chinese/language data. Most of this data is synthetic, which allows it to be scaled efficiently.
- Visual instruction tuning data, which is carefully curated and categorized by vision, instruction, and response type to cover a diverse set of skills (see the sketch after this list).
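As a purely hypothetical illustration of this two-part composition, the mixture might be declared as below; the category names follow the description above, but the dictionary structure and file names are assumptions, not the released data format.

```python
# Hypothetical declaration of the two training-data components; file names and
# grouping are illustrative only, not the released LLaVA-OneVision data layout.
high_quality_knowledge_data = {
    "recaptioned_detailed_descriptions": "recap_detailed.jsonl",  # mostly synthetic
    "document_ocr": "doc_ocr.jsonl",
    "chinese_and_language": "zh_language.jsonl",
}

visual_instruction_tuning_data = {
    # Curated and categorized by vision type (single-image / multi-image / video),
    # instruction type, and response type.
    "single_image": "single_image_instructions.jsonl",
    "multi_image": "multi_image_instructions.jsonl",
    "video": "video_instructions.jsonl",
}
```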
[02] Experimental Results
1. How does LLaVA-OneVision perform on single-image benchmarks compared to other models? LLaVA-OneVision outperforms existing open-source models and approaches the performance of commercial models like GPT-4V and GPT-4o on various single-image benchmarks, including chart/diagram understanding, perception and reasoning, and real-world understanding tasks.
2. How does LLaVA-OneVision perform on multi-image and video benchmarks? LLaVA-OneVision consistently outperforms existing multi-image and video LMMs, demonstrating strong capabilities in tasks like multi-image reasoning, identifying differences, and understanding 3D environments. The model also shows improved performance after the "one-vision" training stage, which combines single-image, multi-image, and video data.
3. What are some of the emerging capabilities of LLaVA-OneVision through task transfer? The article highlights several examples of LLaVA-OneVision exhibiting new capabilities through task transfer, such as:
- Joint understanding of diagrams and charts
- Applying multimodal models to GUI-based tasks
- Set-of-mark reasoning
- Generating detailed video creation prompts from static images
- Analyzing differences between video sequences
- Understanding multi-camera video footage from self-driving cars
- Comprehending composed sub-videos
- Understanding visual prompts in videos
[03] Conclusions
1. What are the key contributions of the LLaVA-OneVision work? The key contributions are:
- Developing a family of open LMMs that achieve state-of-the-art performance across single-image, multi-image, and video scenarios.
- Demonstrating the ability to transfer tasks and capabilities across different modalities, leading to new emerging capabilities.
- Releasing the generated multimodal instruction data, codebase, model checkpoints, and a visual chat demo to the public.
2. How does the LLaVA-OneVision work contribute to the broader field of large multimodal models? The LLaVA-OneVision work serves as a valuable starting point for the community to build specific applications and develop stronger LMMs for diverse vision scenarios through further scaling. The authors believe that learning from large-scale synthetic data is becoming a trend as AI models continue to grow more powerful.