MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Abstract
The article discusses the development of efficient Multimodal Large Language Models (MLLMs) that can be deployed on end-side devices, such as mobile phones and personal computers. It introduces the MiniCPM-V series, a set of MLLMs that aim to achieve a good balance between performance and efficiency. The key contributions of the work include:
- Introducing and open-sourcing the MiniCPM-V series, a set of efficient end-side MLLMs.
- Investigating key techniques that enable MLLMs to achieve a performance-efficiency balance at scale, including architecture design, training, inference, and deployment.
- Summarizing the trend of MLLM development, which shows that the model size needed to achieve GPT-4V-level performance is rapidly decreasing over time, while end-side device computation capacity is steadily increasing.
Q&A
[01] Model Architecture
1. What are the key modules in the MiniCPM-V architecture? The MiniCPM-V model comprises three key modules: the visual encoder, the compression layer, and the large language model (LLM). The input image is first encoded by the visual encoder using an adaptive visual encoding approach, then compressed by the compression layer, and finally fed into the LLM along with the text input for conditional text generation.
2. How does the adaptive visual encoding method work? To handle high-resolution images with varying aspect ratios, the adaptive visual encoding method divides the input image into slices that better match the visual encoder's pre-training setting in terms of resolution and aspect ratio (a slice-grid selection sketch follows this list). It then adjusts and interpolates the position embeddings to fit the size of each slice, and compresses the visual tokens using a compression module.
3. What are the benefits of the adaptive visual encoding approach? The adaptive visual encoding approach ensures that the visual encoding respects the raw aspect ratio of the input and preserves sufficient visual details, while keeping the number of visual tokens moderate enough to be affordable on end-side devices (a minimal sketch of the compression layer also follows this list).
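To make the slicing step concrete, here is a minimal Python sketch of how a slice grid can be chosen. The scoring formula, the 448-pixel encoder resolution, the slice budget, and the function names are illustrative assumptions based on the description above, not the released implementation.

```python
import math

# Assumed resolution the visual encoder was pre-trained on (illustrative).
ENCODER_RES = 448


def score_grid(img_w, img_h, rows, cols):
    """Score a rows x cols partition: prefer grids whose slices are close to
    the encoder's pre-training aspect ratio (1:1 here) and resolution."""
    slice_w, slice_h = img_w / cols, img_h / rows
    # Log-space distance between the slice aspect ratio and the target ratio.
    aspect_penalty = abs(math.log(slice_w / slice_h))
    # Log-space distance between the slice area and the encoder's native area.
    area_penalty = abs(math.log((slice_w * slice_h) / (ENCODER_RES ** 2)))
    return -(aspect_penalty + area_penalty)


def choose_grid(img_w, img_h, max_slices=9):
    """Pick the best (rows, cols) partition among all grids with at most
    `max_slices` slices; small images naturally fall back to a 1x1 grid."""
    best, best_score = (1, 1), float("-inf")
    for rows in range(1, max_slices + 1):
        for cols in range(1, max_slices // rows + 1):
            s = score_grid(img_w, img_h, rows, cols)
            if s > best_score:
                best, best_score = (rows, cols), s
    return best


if __name__ == "__main__":
    # A wide 1344x896 image: the chooser picks (2, 3), giving 448x448 slices.
    print(choose_grid(1344, 896))
```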
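The compression layer can similarly be sketched as a single cross-attention block with a small set of learnable queries, in the spirit of a perceiver resampler. The dimensions (1152 for the visual encoder, 4096 for the LLM) and the 96-query budget are assumptions for illustration; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TokenCompressor(nn.Module):
    """Compress a variable number of visual tokens into `num_queries` tokens
    via one layer of cross-attention (perceiver-resampler style)."""

    def __init__(self, vis_dim=1152, llm_dim=4096, num_queries=96, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vis_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, vis_tokens):
        # vis_tokens: (batch, n_vis, vis_dim); n_vis varies with image slices.
        kv = self.kv_proj(vis_tokens)
        q = self.queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, kv, kv)   # (batch, num_queries, llm_dim)
        return self.out_proj(compressed)       # ready to prepend to text embeddings


# Example: 1024 visual tokens from one slice compressed to 96 LLM-space tokens.
tokens = torch.randn(1, 1024, 1152)
print(TokenCompressor()(tokens).shape)  # torch.Size([1, 96, 4096])
```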
[02] Training
1. What are the three phases of the MiniCPM-V training process? The training process consists of three phases: pre-training, supervised fine-tuning (SFT), and RLAIF-V (Reinforcement Learning from AI Feedback, adapted to vision).
2. How does the pre-training phase work? The pre-training phase aims to align the visual modules with the input space of the LLM and learn foundational multimodal knowledge. It is further divided into three stages: (1) warming up the compression layer, (2) extending the input resolution of the pre-trained visual encoder (see the position-embedding interpolation sketch after this list), and (3) training the visual modules using the adaptive visual encoding strategy.
3. What is the purpose of the RLAIF-V phase? The RLAIF-V phase addresses the hallucination problem in MLLMs, where models generate responses that are not factually grounded in the input image. It employs a divide-and-conquer strategy to collect high-quality feedback from open-source MLLMs and performs direct preference optimization (DPO) to align the model's behavior (a DPO loss sketch also follows this list).
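For stage (2) of pre-training, extending the visual encoder's input resolution requires resizing its learned position embeddings to new patch grids, which is also what adaptive encoding does per slice. Below is a minimal sketch of 2D interpolation of ViT position embeddings; the bicubic mode, grid sizes, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed, new_hw):
    """Resize a ViT's learned 2D position embeddings to a new patch grid.

    pos_embed: (old_h * old_w, dim) embeddings learned on a square grid.
    new_hw:    (new_h, new_w) patch grid of the current image or slice.
    """
    old_len, dim = pos_embed.shape
    old_side = int(old_len ** 0.5)
    grid = pos_embed.reshape(1, old_side, old_side, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_hw[0] * new_hw[1], dim)


# Example: embeddings for a 32x32 patch grid (e.g. 448 px / patch size 14)
# resized for a 24x40 grid corresponding to a wide slice.
pe = torch.randn(32 * 32, 1152)
print(interpolate_pos_embed(pe, (24, 40)).shape)  # torch.Size([960, 1152])
```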
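The preference-alignment step of RLAIF-V uses direct preference optimization. The following sketch implements the standard DPO loss on summed per-response log-probabilities; the beta value and variable names are illustrative, not taken from the paper's training code.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: push the policy to prefer the chosen response over
    the rejected one, relative to a frozen reference model.

    All inputs are summed log-probabilities of full responses, shape (batch,).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# Toy example with random log-probabilities for a batch of 4 preference pairs.
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(loss)
```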
[03] End-side Deployment
1. What are the key challenges in deploying MLLMs on end-side devices? The key challenges include memory constraints, CPU/GPU speed restrictions, and power consumption limitations, which are significantly more restrictive on end-side devices compared to high-performance servers.
2. What techniques does MiniCPM-V use to address these challenges? MiniCPM-V employs a suite of optimization techniques, including model quantization, memory usage optimization, compilation optimization, and NPU acceleration, to enable efficient deployment on end-side devices (a group-wise 4-bit quantization sketch follows this list).
3. What are the results of the end-side deployment evaluation? The evaluation shows that, with these optimizations, MiniCPM-Llama3-V 2.5 runs efficiently on both mobile phones and personal computers, delivering acceptable latency and a decoding throughput comparable to or faster than human reading speed.
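Model quantization is the largest single memory saver: group-wise 4-bit weights shrink an 8B-parameter model from roughly 16-17 GB at fp16 to around 5 GB. The sketch below shows a generic group-wise asymmetric int4 round trip for illustration; it is not the production quantization format or kernels used for actual end-side deployment.

```python
import numpy as np


def quantize_int4(weights, group_size=64):
    """Group-wise asymmetric 4-bit quantization: each group of `group_size`
    weights shares one fp16 scale and offset, so storage is ~4.5 bits/weight."""
    w = weights.reshape(-1, group_size)
    w_min, w_max = w.min(axis=1, keepdims=True), w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 16 levels: codes 0..15
    scale[scale == 0] = 1.0                   # guard against constant groups
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16), w_min.astype(np.float16)


def dequantize_int4(q, scale, w_min):
    """Recover approximate fp32 weights from the 4-bit codes."""
    return q.astype(np.float32) * scale.astype(np.float32) + w_min.astype(np.float32)


# Round-trip a random weight matrix and report the reconstruction error.
w = np.random.randn(4096, 4096).astype(np.float32)
q, s, m = quantize_int4(w)
err = np.abs(dequantize_int4(q, s, m).reshape(w.shape) - w).mean()
print(f"mean abs error: {err:.4f}")
```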
[04] Experimental Results
1. How does the performance of MiniCPM-V series compare to other open-source and proprietary models? MiniCPM-Llama3-V 2.5 outperforms strong open-source models like Idefics2-8B and even larger models like Cambrian-34B on the comprehensive OpenCompass benchmark. It also achieves better performance than powerful proprietary models like GPT-4V-1106 and Gemini Pro, with significantly fewer parameters.
2. What are the key capabilities of the MiniCPM-V series? The MiniCPM-V series exhibits strong OCR capabilities, high-resolution image perception, trustworthy behavior with low hallucination rates, and multilingual support for over 30 languages.
3. How does the smaller MiniCPM-V 2.0 model perform compared to other 2B~3B models? MiniCPM-V 2.0 with 2B parameters achieves significantly better performance compared to other 2B~3B models, and is even comparable to Llama3-based 8B MLLMs, demonstrating the effectiveness of the techniques used in the MiniCPM-V series.