MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
Abstract
The article introduces MARS, a novel auto-regressive framework that retains the capabilities of pre-trained Large Language Models (LLMs) while incorporating exceptional text-to-image (T2I) generation abilities. The key contributions are:
- The design of the Semantic Vision-Language Integration Expert (SemVIE) module, which seamlessly integrates the pre-trained LLM with a trainable visual expert, preserving the NLP capabilities while endowing the model with advanced visual understanding.
- A multi-stage refinement training strategy that significantly enhances MARS's robust instruction-following capability and its ability to generate high-quality images with rich details.
- MARS demonstrates strong performance on various benchmarks, including MS-COCO, T2I-CompBench, and human evaluations, while requiring only 9% of the training budget of Stable Diffusion v1.5.
- MARS possesses bilingual generation capabilities, handling both English and Chinese language prompts, and the flexibility to perform joint image and text generation tasks.
Q&A
[01] Overall Framework
1. What is the overall framework of MARS? MARS is a unified framework that combines pre-trained Large Language Models (LLMs) with visual generation capabilities. It consists of distinct yet harmonized visual and linguistic expert models: the linguistic module leverages a pre-trained LLM (e.g., Qwen-7B), while the visual counterpart is initialized alongside it and trained for generation.
2. How does MARS preserve the NLP capabilities of the pre-trained LLM? MARS incorporates the Semantic Vision-Language Integration Expert (SemVIE) module, which adds parallel visual experts to the attention blocks of the pre-trained LLM. This allows MARS to harness the powerful language interpretation abilities of the LLM while also equipping the model with advanced visual generation and comprehension abilities.
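The "parallel visual experts in the attention blocks" idea can be sketched roughly as follows. This is a toy single-head illustration, not the paper's implementation: the `attention_moe` function, the projection tensors, and the frozen/trainable split are all illustrative assumptions. Each token picks its Q/K/V projections by modality, while a single shared attention mixes the full sequence so text and image tokens still attend to each other.

```python
import numpy as np

def attention_moe(tokens, visual_mask, proj_text, proj_vis):
    """Sketch of an Attention-MoE layer: Q/K/V projections are chosen per
    token by modality, then one shared attention mixes all tokens."""
    d = tokens.shape[1]
    qkv = np.empty((3, *tokens.shape))
    for i in range(3):  # 0: query, 1: key, 2: value
        qkv[i, ~visual_mask] = tokens[~visual_mask] @ proj_text[i]  # text expert (frozen LLM)
        qkv[i, visual_mask] = tokens[visual_mask] @ proj_vis[i]     # visual expert (trainable)
    q, k, v = qkv
    scores = q @ k.T / np.sqrt(d)                       # shared attention over the
    w = np.exp(scores - scores.max(-1, keepdims=True))  # full mixed sequence
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
d = 8
proj_text = rng.standard_normal((3, d, d))
proj_vis = rng.standard_normal((3, d, d))
tokens = rng.standard_normal((5, d))
visual_mask = np.array([False, False, True, True, True])  # 2 text + 3 image tokens
mixed = attention_moe(tokens, visual_mask, proj_text, proj_vis)
```

Because only the visual projections would be trained, gradients leave the original text pathway untouched, which is one plausible way the LLM's NLP behavior is preserved.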
3. What are the key advantages of MARS's architecture? MARS amplifies the flexibility of auto-regressive methods for T2I generation and joint image-text synthesis, with the potential for expansion to any-to-any tasks. The framework's performance is verified across various evaluative measures, including the MS-COCO benchmark, T2I-CompBench, and human evaluation.
[02] Semantic Vision-Language Integration Expert (SemVIE)
1. How does the SemVIE module work? The SemVIE module is a specialized multi-modal Mixture of Experts (mm-MoE) designed to handle both visual and semantic tokens. It consists of Attention-MoE and Feed-Forward Network (FFN)-MoE modules, which are strategically situated following each layer normalization step within the transformer modules. The routing mechanism allocates each input token to the corresponding expert model best equipped for its processing.
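The FFN-MoE routing described above reduces, in the simplest reading, to a hard modality-based dispatch: each token goes to exactly one expert. The sketch below assumes this hard routing and toy ReLU experts (`ffn_moe`, `W_text`, `W_vis` are hypothetical names, not from the paper).

```python
import numpy as np

def ffn_moe(tokens, visual_mask, text_ffn, visual_ffn):
    """FFN-MoE sketch: a hard router sends each token to the expert matching
    its modality (text tokens -> LLM FFN, image tokens -> visual FFN)."""
    out = np.empty_like(tokens)
    out[~visual_mask] = text_ffn(tokens[~visual_mask])
    out[visual_mask] = visual_ffn(tokens[visual_mask])
    return out

rng = np.random.default_rng(1)
d = 8
W_text = rng.standard_normal((d, d))  # stand-in for the frozen text expert
W_vis = rng.standard_normal((d, d))   # stand-in for the trainable visual expert
text_ffn = lambda x: np.maximum(x @ W_text, 0.0)
visual_ffn = lambda x: np.maximum(x @ W_vis, 0.0)

tokens = rng.standard_normal((6, d))
visual_mask = np.array([False, False, True, True, True, True])  # 2 text + 4 image tokens
routed = ffn_moe(tokens, visual_mask, text_ffn, visual_ffn)
```

A hard modality router like this needs no learned gating network, since the modality of every token is known from the tokenizer; a learned top-k router would be the natural generalization.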
2. How does the SemVIE module facilitate the integration of textual and visual modalities? The SemVIE module enables a comprehensive and incremental interplay between the textual and visual modalities across every layer of the model, fostering deep integration that yields images closely aligned with their textual descriptors. This integration capitalizes on the profound linguistic insights afforded by the pre-trained LLM, leveraging the advanced language comprehension capabilities to enrich visual understanding.
[03] Multi-Stage Refinement Training
1. What are the key stages of the multi-stage refinement training strategy? The multi-stage training strategy consists of three stages:
- Stage I: Pre-training for Text-to-Image Alignment
- Stage II: High-Quality Data Alignment
- Stage III: High-Resolution Refinement
2. How does each stage contribute to the overall performance of MARS?
- Stage I optimizes MARS for text-to-image generation and image captioning tasks, utilizing an auto-regressive approach.
- Stage II further advances the fidelity of image synthesis by employing a dataset with meticulously curated text-image pairs.
- Stage III utilizes a cascading super-resolution strategy to enhance the resolution of the generated images, employing the Next K-Token Prediction (NKTP) method.
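The core idea of Next K-Token Prediction is that each forward pass emits k visual tokens instead of one, cutting the number of decoding steps for the large high-resolution token grid. A minimal greedy-decoding sketch of that loop, with a toy stand-in model (the `next_k_token_decode` helper and `toy_model` are hypothetical, not the paper's code):

```python
def next_k_token_decode(step_fn, prompt, total, k):
    """NKTP-style decoding sketch: each call to step_fn yields k next tokens,
    so generating `total` tokens takes ceil(total / k) passes instead of `total`."""
    seq = list(prompt)
    while len(seq) - len(prompt) < total:
        seq.extend(step_fn(seq)[:k])  # append k tokens per forward pass
    return seq[len(prompt):len(prompt) + total]

# Toy stand-in model: the k next tokens simply continue an integer sequence.
toy_model = lambda seq: [seq[-1] + i + 1 for i in range(4)]
out = next_k_token_decode(toy_model, prompt=[0], total=8, k=4)
# out == [1, 2, 3, 4, 5, 6, 7, 8], produced in 2 passes instead of 8
```

In the cascading super-resolution setting this matters because the token count grows quadratically with resolution, so reducing passes per image is the main lever on decoding cost.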
3. What are the benefits of the multi-stage training strategy? The multi-stage training strategy incrementally refines the correlation between textual prompts and visual outputs, allowing MARS to generate images that not only reflect the text's intent but also display a depth of detail akin to photorealistic representations.
[04] Multilingual Generation Capabilities
1. How does MARS handle multilingual text-to-image generation? MARS is built upon the Qwen architecture, which is designed to support multiple languages, including both English and Chinese. During the training phase, MARS incorporated a small yet significant proportion of Chinese in-house data, enabling it to effectively interpret concepts across linguistic boundaries and generate coherent visual content for both English and Chinese prompts.
2. What are the key advantages of MARS's multilingual generation capabilities? MARS's ability to seamlessly handle both English and Chinese prompts, together with its capacity for joint image and text generation, demonstrates its potential for versatile applications beyond text-to-image synthesis.