Chameleon: Mixed-Modal Early-Fusion Foundation Models
Abstract
Chameleon is a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities: it achieves state-of-the-art performance on image captioning, outperforms Llama-2 on text-only tasks while remaining competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model.
Q&A
[01] Pre-Training
1. How does Chameleon represent images and text? Chameleon represents images, in addition to text, as a series of discrete tokens and takes advantage of the scaling properties of auto-regressive Transformers.
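A minimal, illustrative sketch of this idea follows; it is not the paper's actual tokenizer. Image patch embeddings are quantized against a VQ-style codebook into discrete ids, shifted past the text vocabulary, and interleaved with text tokens in a single sequence. All names and sizes here (toy_text_tokenize, toy_image_tokenize, CODEBOOK, the BOI/EOI markers) are assumptions for illustration.

```python
# Illustrative only: a toy byte-level "text tokenizer" and a toy VQ-style
# "image tokenizer" that map both modalities into one discrete vocabulary.
import numpy as np

VOCAB_TEXT = 256                       # toy byte-level text vocabulary
CODEBOOK_SIZE = 8192                   # size of the (stand-in) image codebook
BOI = VOCAB_TEXT + CODEBOOK_SIZE       # begin-of-image marker (assumed id)
EOI = BOI + 1                          # end-of-image marker (assumed id)

rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(CODEBOOK_SIZE, 16))   # stand-in for a learned VQ codebook

def toy_text_tokenize(text: str) -> list[int]:
    return list(text.encode("utf-8"))             # byte ids as a stand-in for BPE

def toy_image_tokenize(patches: np.ndarray) -> list[int]:
    """Quantize each patch embedding to its nearest codebook entry (VQ-style)."""
    d = ((patches ** 2).sum(-1, keepdims=True)
         - 2 * patches @ CODEBOOK.T
         + (CODEBOOK ** 2).sum(-1))
    return [VOCAB_TEXT + int(i) for i in d.argmin(axis=1)]  # shift into the image-id range

# One mixed-modal training sequence: text tokens, then an image span delimited by BOI/EOI.
patches = rng.normal(size=(64, 16))               # stand-in patch embeddings for one image
sequence = toy_text_tokenize("A photo of ") + [BOI] + toy_image_tokenize(patches) + [EOI]
print(len(sequence), sequence[:8])
```

Because both modalities share one token space, a single autoregressive Transformer can be trained on such sequences end to end.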
2. What are the two stages of pre-training for Chameleon? The first stage uses a data mixture of large-scale unsupervised text-only, text-image, and interleaved text-image datasets. The second stage reduces the weight of the first-stage data by 50% and mixes in higher-quality datasets while maintaining a similar proportion of image and text tokens.
3. What architectural and optimization techniques were used to achieve stability when scaling Chameleon models above 8B parameters and 1T tokens? Key techniques include (a minimal sketch of two of them follows this list):
- Query-key normalization to control norm growth in the attention mechanism
- Revised placement of layer norms within the transformer block
- Use of z-loss regularization to address logit shift in the final softmax
- Careful tuning of optimization hyperparameters like learning rate, batch size, and gradient clipping
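The sketch below is a minimal PyTorch-style illustration of two of these techniques, query-key normalization and z-loss regularization, under assumed dimensions and an assumed z-loss coefficient; it is not the paper's implementation, and the names (QKNormAttention, z_loss) are mine.

```python
# Minimal sketch of query-key normalization and z-loss; shapes and the
# z-loss coefficient are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with LayerNorm applied to queries and keys to bound their norms."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(self.head_dim)   # query-key normalization:
        self.k_norm = nn.LayerNorm(self.head_dim)   # normalize q and k before the dot product

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, T, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(B, T, self.n_heads, self.head_dim)).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(B, T, D))

def z_loss(logits: torch.Tensor, coeff: float = 1e-5) -> torch.Tensor:
    """Penalize the log-partition term logsumexp(logits) to keep final-softmax logits bounded."""
    return coeff * torch.logsumexp(logits, dim=-1).pow(2).mean()

# Usage: add the z-loss term to the usual cross-entropy objective.
x = torch.randn(2, 16, 512)
logits = torch.randn(2, 16, 65536)          # a vocabulary covering both text and image tokens
targets = torch.randint(0, 65536, (2, 16))
loss = F.cross_entropy(logits.view(-1, 65536), targets.view(-1)) + z_loss(logits)
print(QKNormAttention(512, 8)(x).shape, loss.item())
```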
[02] Alignment
1. What types of datasets were used in the supervised fine-tuning (SFT) stage? The SFT datasets include Text, Code, Visual Chat, Image Generation, Interleaved Text/Image Generation, and Safety data.
2. How was data balancing handled during the SFT stage? Balancing modalities within the SFT data was found to be important to prevent the model from learning an unconditional prior toward generating a specific modality.
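A minimal sketch of one way such balancing could be implemented; the dataset names and mixture weights below are illustrative assumptions, not the paper's recipe.

```python
# Illustrative modality-balanced sampling for an SFT mixture.
import random

datasets = {
    "text": ["t1", "t2", "t3"],            # stand-ins for real SFT examples
    "visual_chat": ["v1", "v2"],
    "image_generation": ["g1", "g2", "g3"],
}
weights = {"text": 0.4, "visual_chat": 0.3, "image_generation": 0.3}  # assumed proportions

def sample_batch(batch_size: int, seed: int = 0) -> list[str]:
    """Draw each example's source modality according to the target mixture,
    so no single modality dominates and induces an unconditional modality prior."""
    rng = random.Random(seed)
    names, probs = zip(*weights.items())
    return [rng.choice(datasets[rng.choices(names, probs)[0]]) for _ in range(batch_size)]

print(sample_batch(8))
```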
3. What optimization techniques were used during the SFT stage? The SFT stage used a cosine learning rate schedule, weight decay, dropout, and z-loss regularization.
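A minimal PyTorch-style sketch of such an optimization setup (dropout in the model, AdamW with weight decay, a cosine learning-rate schedule); the model and hyperparameter values are illustrative assumptions, not the paper's.

```python
# Illustrative SFT optimization setup; values are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.Dropout(p=0.1), nn.Linear(512, 512))  # dropout
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)        # weight decay
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)      # cosine LR decay

for step in range(1_000):
    loss = model(torch.randn(4, 512)).pow(2).mean()   # stand-in loss; a z-loss term would be added here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```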
[03] Human Evaluations and Safety Testing
1. What was the purpose of the human evaluation experiment? The human evaluation experiment was designed to measure the quality of Chameleon's mixed-modal, long-form responses to open-ended prompts, as existing benchmarks may not fully capture this capability.
2. How did Chameleon perform compared to the baselines (Gemini-Pro and GPT-4V) in the human evaluation? Chameleon-34B substantially outperformed the baselines, achieving a 60.4% preference rate against Gemini-Pro and a 51.6% preference rate against GPT-4V in pairwise comparisons.
3. What were the key findings from the safety testing of Chameleon? The safety testing showed that an overwhelming majority of Chameleon's responses were considered safe, with only 0.39% and 0.095% unsafe responses for the 7B and 34B models respectively. Further interactive safety testing also demonstrated significant protection against malicious prompts.
[04] Benchmark Evaluations
1. How did Chameleon perform on text-only benchmarks compared to other state-of-the-art language models? Chameleon-7B and Chameleon-34B were competitive with the corresponding Llama-2 models, with Chameleon-34B even outperforming Llama-2 70B on commonsense reasoning and reading comprehension tasks, and performing on par with Mixtral 8x7B.
2. How did Chameleon perform on image captioning and visual question answering benchmarks? On visual question answering and image captioning benchmarks, Chameleon-34B achieved state-of-the-art performance, outperforming models like Flamingo, IDEFICS and Llava-1.5.
3. What unique capabilities does Chameleon demonstrate compared to existing models? Chameleon is able to perform non-trivial image generation, while also maintaining competitive performance on text-only tasks, all within a single model. This represents a significant step towards realizing the vision of unified foundation models capable of flexibly reasoning over and generating multimodal content.