From Unimodals to Multimodality: DIY Techniques for Building Foundational Models
Abstract
The article discusses successful, low-cost approaches to creating multimodal AI models by leveraging pre-trained unimodal models. It covers three main techniques: prompt adaptation, intermediate module training, and adapter mixture. The article also emphasizes the importance of high-quality curated data for enhancing the performance of these multimodal models.
Q&A
[01] Parameter-Efficient Fine-Tuning (PEFT)
1. What are the three main categories of PEFT approaches discussed in the article?
- Adapters: Small modules inserted into a pre-trained model, with only the adapters being trained during fine-tuning.
- LoRA: Injects trainable low-rank decomposition matrices into the model to approximate weight updates, significantly reducing the number of trainable parameters.
- P*-tuning (prefix-tuning, prompt tuning): Prepends a set of learnable prefix vectors or tokens to the input embedding, and only these "soft prompts" are trained during fine-tuning. (Minimal sketches of all three approaches follow this list.)
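For concreteness, here is a minimal PyTorch sketch of the three families: a bottleneck adapter, a LoRA-wrapped linear layer, and a soft prompt. Class names, dimensions, and initialization choices are illustrative assumptions, not the exact implementations of any cited paper; in each case only the newly introduced parameters receive gradients while the pre-trained backbone stays frozen.

```python
import torch
import torch.nn as nn

D_MODEL, RANK, N_PROMPT = 768, 8, 20  # illustrative sizes only


class BottleneckAdapter(nn.Module):
    """Adapter: a small down-project / nonlinearity / up-project block inserted
    into a frozen transformer layer; only this block receives gradients."""

    def __init__(self, d_model: int = D_MODEL, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))  # residual connection


class LoRALinear(nn.Module):
    """LoRA: approximate the weight update of a frozen linear layer with a
    low-rank product B @ A; only A and B are trained."""

    def __init__(self, base: nn.Linear, rank: int = RANK, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


class SoftPrompt(nn.Module):
    """P*-tuning: learnable prompt vectors prepended to the input embeddings
    while the backbone stays frozen."""

    def __init__(self, d_model: int = D_MODEL, n_prompt: int = N_PROMPT):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.01)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        batch = token_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)


# Example: wrap one layer of a hypothetical frozen backbone with LoRA.
lora_layer = LoRALinear(nn.Linear(D_MODEL, D_MODEL))
```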
2. What are the strengths and limitations of each PEFT technique?
- Adapters: Add a small number of parameters, can capture complex task-specific information, but may complicate the optimization process and lead to longer training times.
- LoRA: Highly efficient and scalable with very large models, but its adaptation is constrained to what can be expressed within the low-rank structure.
- P*-tuning: Extremely parameter-efficient, but may not be able to capture complex task-specific information as effectively as other methods.
[02] Prompt Adaptation
1. How does LLaMA-Adapter integrate visual information using a pre-trained visual encoder? LLaMA-Adapter introduces learnable adaptation prompts and integrates visual information by projecting global visual features from a pre-trained visual encoder (e.g., CLIP) into the dimension of the LLM's adaptation prompts.
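A minimal sketch of that projection step, assuming a 512-dimensional CLIP global image feature, a 4096-dimensional LLaMA hidden size, and 10 adaptation prompts (all assumed sizes); the zero-initialized gated attention that LLaMA-Adapter uses to inject these prompts is omitted here.

```python
import torch
import torch.nn as nn

CLIP_DIM, LLM_DIM, N_PROMPTS = 512, 4096, 10  # assumed sizes


class VisualAdaptationPrompt(nn.Module):
    """Project a global visual feature into the LLM's hidden dimension and add
    it to learnable adaptation prompts used inside the adapted layers."""

    def __init__(self):
        super().__init__()
        self.adaptation_prompt = nn.Parameter(torch.zeros(N_PROMPTS, LLM_DIM))
        self.visual_proj = nn.Linear(CLIP_DIM, LLM_DIM)

    def forward(self, clip_global_feature: torch.Tensor) -> torch.Tensor:
        # clip_global_feature: (batch, CLIP_DIM), e.g. CLIP's pooled image embedding
        visual = self.visual_proj(clip_global_feature)           # (batch, LLM_DIM)
        # broadcast-add the visual signal onto every adaptation prompt
        return self.adaptation_prompt.unsqueeze(0) + visual.unsqueeze(1)
```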
2. What are the key improvements in LLaMA-Adapter V2 over the original LLaMA-Adapter?
- Introduces more learnable parameters by unfreezing all the normalization layers in LLaMA and adding a learnable bias and scale factor to all linear layers (see the sketch after this list).
- Feeds visual tokens into the early layers of the language model, while the adaptation prompts are added to the top layers.
- Employs a joint training paradigm for both image-text captioning data and language-only instruction data.
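A rough sketch of those extra trainable parameters in the spirit of V2: a frozen linear layer wrapped with a learnable scale and bias, plus a helper that unfreezes normalization layers. Names and structure are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class ScaledBiasLinear(nn.Module):
    """Wrap a frozen pre-trained linear layer with a learnable per-layer scale
    and bias, which become additional tunable parameters."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(base.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.base(x) + self.bias


def unfreeze_norm_layers(model: nn.Module) -> None:
    """Make normalization layers trainable while the rest of the model stays
    frozen (LLaMA's RMSNorm would be matched by its own class in practice)."""
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad = True
```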
[03] Intermediate Module Training
1. How do MiniGPT-4 and LLaVA connect the vision encoder and language model? Both models use a single learnable linear projection layer to align the visual encoder (ViT-G/14 for MiniGPT-4, ViT-L/14 for LLaVA) with the language model (Vicuna).
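The "intermediate module" in both cases is essentially a single linear layer. A sketch with assumed feature sizes (1024-d ViT patch features, 4096-d Vicuna hidden size):

```python
import torch
import torch.nn as nn

VIT_DIM, LLM_DIM = 1024, 4096  # assumed sizes for illustration

# The entire trainable bridge: one linear projection that maps visual patch
# features into the language model's embedding space.
visual_projection = nn.Linear(VIT_DIM, LLM_DIM)


def project_image_tokens(vit_features: torch.Tensor) -> torch.Tensor:
    """vit_features: (batch, num_patches, VIT_DIM) from a frozen vision encoder.
    Returns (batch, num_patches, LLM_DIM) visual tokens that can be concatenated
    with text embeddings before being fed to the frozen LLM."""
    return visual_projection(vit_features)
```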
2. What are the two-stage training processes for MiniGPT-4 and LLaVA?
- MiniGPT-4: (1) Pretrain the projection layer on a large dataset of aligned image-text pairs, (2) Fine-tune the linear projection layer with a smaller, high-quality dataset.
- LLaVA: (1) Train the projection layer on a large dataset of image-text pairs, (2) Fine-tune the pre-trained projection layer and LLM weights using a high-quality generated dataset of language-image instruction-following data. (A stage-by-stage freezing sketch follows this list.)
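A schematic of which components receive gradients in each stage, following the LLaVA-style recipe above (MiniGPT-4 instead keeps the LLM frozen in both stages and tunes only the projection); the function names are illustrative assumptions.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Toggle requires_grad for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(stage: int, vision_encoder: nn.Module,
                    projection: nn.Module, llm: nn.Module) -> None:
    """Stage 1: only the projection layer is trained (alignment on large
    image-text pair data). Stage 2: the projection and the LLM are fine-tuned
    on instruction-following data; the vision encoder stays frozen throughout."""
    set_trainable(vision_encoder, False)
    set_trainable(projection, True)
    set_trainable(llm, stage == 2)
```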
[04] Adapter Mixture
1. What is the key idea behind the Mixture-of-Modality Adapters (MMA) proposed in Cheap&Quick? MMA introduces a learnable token 't' as the modality selector token, which indicates the modality of the input features (unimodal or multimodal) and tells the router module how to combine the outputs of the learned adapters.
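A rough sketch of the routing idea: two lightweight adapters whose outputs are mixed by weights predicted from the modality selector token t. The adapter shape, the two-way router, and the residual form are illustrative assumptions; the paper's exact formulation differs in its details.

```python
import torch
import torch.nn as nn

D_MODEL = 4096  # assumed hidden size


class MixtureOfModalityAdapters(nn.Module):
    """Two lightweight adapters plus a router whose mixing weights are
    predicted from a learnable modality selector token t, so unimodal and
    multimodal inputs can take different adaptation paths."""

    def __init__(self, d_model: int = D_MODEL, bottleneck: int = 32):
        super().__init__()
        self.text_adapter = nn.Sequential(
            nn.Linear(d_model, bottleneck), nn.GELU(), nn.Linear(bottleneck, d_model))
        self.multimodal_adapter = nn.Sequential(
            nn.Linear(d_model, bottleneck), nn.GELU(), nn.Linear(bottleneck, d_model))
        self.router = nn.Linear(d_model, 2)  # selector token -> 2 mixing weights

    def forward(self, h: torch.Tensor, modality_token: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model); modality_token t: (batch, d_model)
        w = torch.softmax(self.router(modality_token), dim=-1)  # (batch, 2)
        mixed = (w[:, 0, None, None] * self.text_adapter(h)
                 + w[:, 1, None, None] * self.multimodal_adapter(h))
        return h + mixed  # residual connection
```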
2. How does Cheap&Quick connect LLaMA and CLIP-ViT using MMA? Cheap&Quick inserts MMA into both ViT and LLaMA before the multi-head attention modules. Only the adapters and the projection layer (just 3.8M parameters) are trained, on a mixture of text-only and text-image data, significantly reducing training cost while maintaining strong performance on vision-language tasks.
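A small helper in the same spirit: freeze everything except parameters whose names suggest an adapter or projection, then report how many parameters remain trainable. The name-matching heuristic is an assumption for illustration, not how Cheap&Quick identifies its modules.

```python
import torch.nn as nn


def freeze_all_but_adapters(model: nn.Module,
                            trainable_keywords=("adapter", "proj")) -> int:
    """Freeze every parameter whose name does not contain one of the given
    keywords and return the count of trainable parameters."""
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in trainable_keywords)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```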
[05] High-Quality Curated Data
1. How do MiniGPT-4, LLaVA, and Video-ChatGPT generate high-quality multimodal instruction-following data?
- MiniGPT-4: (1) Use the pre-trained model to generate detailed descriptions, (2) Refine the generated descriptions using ChatGPT and manual verification.
- LLaVA: Leverage ChatGPT/GPT-4 to generate multimodal instruction-following data from widely available image-text pair data (a minimal sketch of this step follows this list).
- Video-ChatGPT: (1) Human-assisted annotation, (2) Semi-automatic annotation using pre-trained models.
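As a concrete example of the LLaVA-style step, here is a sketch that asks a chat model to expand a single caption into instruction-following dialogue. It uses the official `openai` Python client; the prompt wording and model choice are assumptions, and the real pipeline also supplies bounding boxes and curated few-shot examples.

```python
from openai import OpenAI  # official openai Python client

client = OpenAI()

SYSTEM = ("You are given the caption of an image. Write a conversation between "
          "a user asking about the image and an assistant answering, as if the "
          "assistant could see the image. Return question/answer pairs only.")


def caption_to_instructions(caption: str) -> str:
    """Turn one image caption into instruction-following dialogue data."""
    response = client.chat.completions.create(
        model="gpt-4",  # model choice is an assumption
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": f"Caption: {caption}"}],
    )
    return response.choices[0].message.content
```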
2. How does the MIMIC-IT dataset demonstrate the importance of high-quality data? MIMIC-IT provides 2.8 million multimodal instruction-response pairs, which were used to fine-tune OpenFlamingo. The resulting model outperformed the base OpenFlamingo, demonstrating superior in-context and zero-shot learning capabilities.