PaliGemma: A versatile 3B VLM for transfer
Abstract
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective for transfer learning. PaliGemma achieves strong performance on a wide variety of open-world tasks, including standard VLM benchmarks as well as more specialized tasks such as remote-sensing and segmentation.
Q&A
[01] Introduction
1. What is PaLI? PaLI is a series of state-of-the-art vision-language models, starting with the first PaLI [22] and continuing with PaLI-X [23] and PaLM-E [35], which pushed performance further by scaling up both the vision and language components.
2. What is Gemma? Gemma is a family of open, auto-regressive, decoder-only large language models built from the same research and technology used to create the Gemini [7] models. PaliGemma uses the pretrained 2-billion-parameter version of Gemma.
3. What is the main goal of PaliGemma? The main goal of PaliGemma is to provide a versatile base VLM that reaches state-of-the-art results not only on standard benchmarks like COCO captions and VQAv2, but also on more exotic tasks like Remote-Sensing VQA, TallyQA, video captioning and QA, and referring expression segmentation.
[02] Related Work
1. What are the different generations of vision-language models?
- The first generation, spearheaded by CLIP [90] and ALIGN [47], extends large-scale classification pretraining to leverage web data without human labeling.
- The second generation, akin to T5 [91], unifies captioning and question-answering tasks via generative encoder-decoder modeling.
- The most recent works perform "instruction tuning" to make the raw model more user-friendly.
2. How does PaliGemma fit into this landscape? PaliGemma is an open base VLM without instruction tuning, aiming to answer questions about what really matters in VLM pretraining and transfer.
[03] Model
1. What is the high-level architecture of PaliGemma? PaliGemma takes an image and a textual description (prompt/question) as input, and autoregressively generates a text string (answer/output). This flexible API covers many standard tasks like classification, captioning, VQA, as well as more complex structured outputs like detection and segmentation.
2. What are the main components of PaliGemma's architecture? PaliGemma consists of the following components (a minimal code sketch follows this list):
- A SigLIP-So400m vision encoder
- A linear projection layer to map the vision tokens to the same dimensions as the Gemma-2B language model
- The Gemma-2B decoder-only language model
3. How does PaliGemma handle different input resolutions? PaliGemma provides three different checkpoints, trained at 224px, 448px, and 896px resolutions, to handle tasks that benefit from higher image resolution.
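To make the component list and the resolution options concrete, here is a minimal PyTorch-style sketch of the layout described above. It is an illustration under stated assumptions, not the reference implementation: vision_encoder and language_model are placeholders standing in for SigLIP-So400m and the Gemma-2B decoder, and the default widths (1152 for SigLIP-So400m, 2048 for Gemma-2B) merely size the projection.

```python
# Minimal sketch of the PaliGemma layout: vision encoder -> linear projection
# -> decoder-only language model. vision_encoder and language_model are
# placeholders, not SigLIP/Gemma themselves; widths are illustrative defaults.
import torch
import torch.nn as nn

class PaliGemmaSketch(nn.Module):
    def __init__(self, vision_encoder, language_model,
                 vision_dim=1152, lm_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder             # stand-in for SigLIP-So400m
        self.projection = nn.Linear(vision_dim, lm_dim)  # map vision tokens to LM width
        self.language_model = language_model             # stand-in for the Gemma-2B decoder

    def forward(self, image, text_embeds):
        # With 14px patches: 224px -> 256 image tokens, 448px -> 1024, 896px -> 4096.
        image_tokens = self.vision_encoder(image)        # [B, N, vision_dim]
        image_embeds = self.projection(image_tokens)     # [B, N, lm_dim]
        # Projected image tokens are prepended to the embedded text prompt;
        # the decoder then autoregressively generates the output text.
        sequence = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(sequence)
```

The three released checkpoints differ only in input resolution (and hence in the number of image tokens), not in this overall layout.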
[04] Pretraining
1. What are the different stages of PaliGemma's pretraining?
- Stage0: Unimodal pretraining of the individual vision and language components
- Stage1: Multimodal pretraining on a broad mixture of vision-language tasks
- Stage2: Short continued pretraining at higher image resolutions (448px and 896px)
- Stage3: Transfer to individual tasks or a "mix" of tasks
2. What is unique about PaliGemma's multimodal pretraining (Stage1)? Unlike common practice, PaliGemma does not freeze the image encoder during Stage1. Instead, it uses a slow linear warm-up of the image encoder's learning rate to protect it from destructive gradients coming from the initially unaligned language model (a schedule sketch follows this list).
3. How does PaliGemma's pretraining task mixture differ from previous work? PaliGemma's pretraining tasks are designed to force the model to acquire a broad range of "skills", rather than focusing on tasks that are user-friendly out of the box. The model is then expected to quickly rewire itself to the specific task during transfer.
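One concrete piece of Stage1 is the slow linear warm-up of the image encoder's learning rate mentioned in item 2. Below is one way to express such a schedule as a plain function; the warm-up length and peak learning rate are placeholder values for illustration, not the paper's settings.

```python
# Slow linear warm-up for the image encoder's learning rate in Stage1, shielding
# the encoder from gradients of the not-yet-aligned language model early on.
# warmup_steps and peak_lr are illustrative placeholders, not the paper's values.
def image_encoder_lr(step: int, warmup_steps: int = 10_000,
                     peak_lr: float = 1e-5) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # ramp linearly from 0
    return peak_lr  # afterwards follow the main schedule (e.g. a decay phase)
```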
[05] Results
1. How does PaliGemma perform across the evaluated tasks? PaliGemma achieves state-of-the-art results not only on standard benchmarks like COCO captions and VQAv2, but also on more specialized tasks like Remote-Sensing VQA, TallyQA, video captioning and QA, and referring expression segmentation.
2. How does PaliGemma compare to larger VLMs in terms of performance? PaliGemma, with less than 3 billion total parameters, matches the performance of much larger VLMs like PaLI-X (22B+32B) and PaLM-E (22B+540B) across a wide range of tasks.
[06] Ablations
1. What did the authors find about the importance of multimodal pretraining duration? Shorter pretraining hurts performance, and skipping the multimodal Stage1 pretraining entirely is the worst setting. However, a 10x shorter Stage1 (100M examples) still provides good results on most tasks.
2. How did the authors evaluate the impact of different architectural choices? The authors ablated design choices like causal masking, learning objective, and initialization of new tokens, finding that PaliGemma's choices are effective.
3. What did the authors find about freezing components during pretraining? Not freezing any part of the model, including the image encoder, during Stage1 pretraining is advantageous compared to common practice.
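To make the freezing ablation concrete, the snippet below toggles between training everything (PaliGemma's Stage1 choice) and freezing the image encoder (the common practice it is compared against). The vision_encoder attribute name is reused from the sketch in the Model section and is an assumption, not the authors' code.

```python
# Toggle for the freezing ablation: train all parameters (PaliGemma's choice)
# vs. freeze the image encoder (common practice). model.vision_encoder is the
# placeholder attribute from the earlier sketch, not the authors' code.
def set_trainable(model, freeze_image_encoder: bool) -> None:
    for p in model.parameters():
        p.requires_grad = True                   # default: everything trains
    if freeze_image_encoder:
        for p in model.vision_encoder.parameters():
            p.requires_grad = False              # reproduce the "frozen encoder" setting
```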
[07] Transferability
1. How repeatable are the transfer results? The transfer results are highly repeatable, with small standard deviations across multiple reruns from the same pretrained checkpoint.
2. How sensitive are the transfer results to hyperparameters? A simple and single hyperparameter setup works well for the majority of tasks, with a few exceptions that benefit from more extensive hyperparameter tuning.
3. How many examples are needed to fine-tune PaliGemma on a new task? PaliGemma can achieve good results on most tasks with as few as 256-4096 fine-tuning examples, demonstrating its strong transferability.
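Such a transfer can be written as a short full-model fine-tuning loop. The sketch below assumes the placeholder model from the Model section and a dataloader yielding (image, prompt embeddings, target token ids); the optimizer and hyperparameters are placeholders rather than the paper's transfer recipe, and the shift-by-one / prefix-masking details of the real loss are omitted.

```python
import torch
import torch.nn.functional as F

# Minimal transfer sketch: fine-tune the whole model on a small task dataset
# (a few hundred to a few thousand examples). Hyperparameters are placeholders;
# the exact loss construction (token shifting, prefix masking) is simplified.
def transfer(model, dataloader, steps: int = 1_000, lr: float = 1e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step, (image, text_embeds, target_ids) in enumerate(dataloader):
        logits = model(image, text_embeds)                 # [B, T, vocab]
        tgt_len = target_ids.shape[1]
        loss = F.cross_entropy(                            # supervise only the answer tokens
            logits[:, -tgt_len:].transpose(1, 2), target_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step + 1 >= steps:
            break
    return model
```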
[08] Conclusion
1. What are the key takeaways about PaliGemma? PaliGemma is a new, small, open base VLM that provides state-of-the-art performance across a wide variety of benchmarks, demonstrating that VLMs on the "smaller" side can be highly effective. The authors hope PaliGemma serves as a useful starting point for further research in instruction tuning and specific applications.