Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Abstract
The paper presents Show-o, a unified transformer that handles both multimodal understanding and generation. Show-o combines autoregressive and discrete diffusion modeling within a single transformer architecture, allowing it to adaptively process inputs and outputs of diverse and mixed modalities. Across a range of benchmarks, the unified model matches or surpasses existing models tailored specifically for understanding or generation, highlighting its potential as a next-generation foundation model.
Q&A
[01] Methodology
1. How does Show-o unify autoregressive and diffusion modeling within a single transformer? Show-o represents text as discrete tokens and models them autoregressively, as in large language models. Images are likewise tokenized into discrete tokens, which are modeled with discrete denoising diffusion (mask-and-predict) rather than continuous representations. Both modeling schemes therefore operate on discrete tokens inside one unified transformer (a sketch of the mixed attention pattern this implies appears after this list).
2. How does Show-o handle diverse input data and task variations? Show-o employs a text tokenizer and an image tokenizer to encode inputs into discrete tokens, together with a unified prompting strategy that formats these tokens into structured input sequences, enabling the model to handle various vision-language tasks seamlessly (an illustrative prompting sketch appears after this list).
3. What are the two main training objectives of Show-o? Show-o is trained with two objectives: 1) Next Token Prediction (NTP), the autoregressive objective on text tokens, and 2) Mask Token Prediction (MTP), the discrete-diffusion objective on image tokens (a toy sketch combining the two losses appears after this list).
4. How does Show-o's training pipeline progressively train the model? Show-o's training pipeline has three stages: 1) pre-training the image token embeddings and learning pixel dependencies, 2) aligning image and text for multimodal understanding and generation, and 3) fine-tuning on high-quality data.
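The following is a minimal sketch of the mixed attention pattern mentioned in item 1: causal attention over text tokens combined with full (bidirectional) attention among image tokens, all in one sequence. The mask construction and tensor layout are illustrative assumptions, not Show-o's actual implementation.

```python
import torch

def mixed_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Illustrative attention mask: text positions attend causally (to the past),
    while image positions additionally attend to every other image position.
    `is_image` is a boolean vector of length L marking image-token positions.
    This is a sketch of the idea, not Show-o's actual code."""
    L = is_image.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))      # autoregressive text
    image_full = is_image.unsqueeze(0) & is_image.unsqueeze(1)   # bidirectional image block
    return causal | image_full

# Example: 4 text tokens followed by 3 image tokens
print(mixed_attention_mask(torch.tensor([False] * 4 + [True] * 3)).int())
```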
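Next, a sketch of the unified prompting idea from item 2: text and image token spans are wrapped with special tokens and prefixed with a task token so a single sequence format covers both understanding and generation. The token names, IDs, and ordering below are assumptions for illustration and may differ from the paper's exact vocabulary.

```python
def build_sequence(task: str, text_ids: list, image_ids: list, sp: dict) -> list:
    """Illustrative unified prompting: a task token, then text and image spans
    wrapped with begin/end special tokens. Names and IDs are hypothetical."""
    text = [sp["SOT"], *text_ids, sp["EOT"]]
    image = [sp["SOI"], *image_ids, sp["EOI"]]
    if task == "t2i":   # text-to-image generation: condition on text, produce image tokens
        return [sp["T2I"], *text, *image]
    if task == "mmu":   # multimodal understanding: condition on image, produce answer text
        return [sp["MMU"], *image, *text]
    raise ValueError(f"unknown task: {task}")

# Hypothetical special-token IDs placed after the regular vocabulary
sp = {name: 50000 + i for i, name in enumerate(["T2I", "MMU", "SOT", "EOT", "SOI", "EOI"])}
print(build_sequence("t2i", text_ids=[11, 12], image_ids=[901, 902, 903], sp=sp))
```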
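Finally, a toy sketch of the two objectives in item 3: NTP is a next-token cross-entropy on text positions, and MTP is a cross-entropy on image positions whose tokens were replaced by a [MASK] token. The shapes, masking bookkeeping, and the relative weighting of the two terms are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ntp_mtp_loss(logits, targets, is_text, is_masked_image, alpha=1.0):
    """Toy combined objective (illustrative, not the paper's exact loss).
    logits: (L, V) model outputs; targets: (L,) ground-truth token ids;
    is_text / is_masked_image: boolean (L,) position masks."""
    # NTP: position t predicts token t+1, so shift by one over text positions
    ntp_pos = is_text[:-1] & is_text[1:]
    ntp = F.cross_entropy(logits[:-1][ntp_pos], targets[1:][ntp_pos])
    # MTP: masked image positions predict their original token ids (no shift)
    mtp = F.cross_entropy(logits[is_masked_image], targets[is_masked_image])
    return mtp + alpha * ntp  # relative weighting is an assumption

# Toy batch: 4 text tokens followed by 4 image tokens, two of which were masked
L, V = 8, 32
logits, targets = torch.randn(L, V), torch.randint(0, V, (L,))
is_text = torch.tensor([True] * 4 + [False] * 4)
is_masked_image = torch.tensor([False] * 4 + [True, False, True, False])
print(ntp_mtp_loss(logits, targets, is_text, is_masked_image))
```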
[02] Experiments
1. How does Show-o's performance compare to individual models on multimodal understanding tasks? Show-o performs on par with, and in some cases better than, individual models tailored for multimodal understanding, such as LLaVA and InstructBLIP, across benchmarks including POPE, MME, Flickr30k, VQAv2, and GQA.
2. How does Show-o's text-to-image generation capability compare to other models? On the GenEval benchmark, Show-o outperforms generation-only models of similar size and achieves comparable performance to larger models like DALL-E 2 and SD3, despite being a unified model handling both understanding and generation.
3. What are the advantages of Show-o's unified modeling approach compared to autoregressive image generation? Because image tokens are produced by iterative mask-and-predict denoising rather than one token at a time, Show-o requires roughly 20 times fewer sampling steps to generate an image than autoregressive models, demonstrating inherent potential for acceleration (a decoding sketch appears after this list).
4. What are the key insights from the ablation studies on different input representations for multimodal understanding? The ablation studies show that using continuous image representations from pre-trained encoders like CLIP-ViT leads to better multimodal understanding performance compared to using discrete image tokens. However, the unified pre-training approach can still boost the understanding capabilities when using discrete image tokens.
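To make the step-count comparison in item 3 concrete, below is a minimal mask-and-predict decoding loop of the kind used by discrete-diffusion image generators: the model is called once per denoising step and reveals many tokens in parallel, whereas an autoregressive decoder is called once per token (e.g., 1024 calls for a 32x32 token grid versus about 50 steps, roughly a 20x reduction). The cosine schedule and confidence-based unmasking are common choices assumed here for illustration, not a description of Show-o's exact sampler.

```python
import math
import torch

def mask_predict_decode(model, seq_len=16, num_steps=8, mask_id=32):
    """Illustrative mask-and-predict decoding: start from all-[MASK] image tokens
    and reveal the most confident predictions over num_steps model calls."""
    tokens = torch.full((seq_len,), mask_id)
    for step in range(1, num_steps + 1):
        logits = model(tokens)                       # one forward pass per step
        conf, pred = logits.softmax(-1).max(-1)
        conf = torch.where(tokens == mask_id, conf, torch.tensor(float("inf")))
        # cosine schedule: total number of tokens that should be revealed by now
        n_reveal = max(1, round(seq_len * (1 - math.cos(math.pi / 2 * step / num_steps))))
        top = conf.topk(n_reveal).indices
        tokens[top] = torch.where(tokens[top] == mask_id, pred[top], tokens[top])
    return tokens

# Stand-in "model": random logits over a 32-token codebook (assumption for the demo)
print(mask_predict_decode(lambda t: torch.randn(t.shape[0], 32)))
```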