SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Abstract
The article presents SEED-X, a versatile multimodal foundation model designed to serve as a range of AI assistants in real-world scenarios. SEED-X is enhanced with two key features: (1) the ability to comprehend images of arbitrary sizes and aspect ratios, and (2) multi-granularity image generation, spanning both high-level instructional image generation and low-level image manipulation. These features allow SEED-X to be effectively instruction-tuned for diverse applications across different domains.
Q&A
[01] Visual Tokenization and De-tokenization
1. How does SEED-X's visual tokenization and de-tokenization work?
- SEED-X uses a pre-trained ViT as the visual tokenizer to obtain image features.
- In the first stage, a visual de-tokenizer is pre-trained to decode realistic images from the ViT features.
- In the second stage, the visual de-tokenizer is further fine-tuned to take an extra condition image as input, allowing it to preserve fine-grained details of the input image for image manipulation tasks.
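The two-stage pipeline above can be sketched in a toy form. This is not SEED-X's actual architecture (the real de-tokenizer is a learned generative decoder); it is a minimal illustration, with hypothetical shapes and a stand-in linear decoder, of how stage 1 decodes from ViT features alone while stage 2 adds a condition image to reuse fine-grained detail:

```python
import numpy as np

def vit_features(image):
    """Stand-in for a frozen pre-trained ViT tokenizer:
    split an (H, W) image into 4x4 patches and mean-pool each patch."""
    h, w = image.shape
    patches = image.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))
    return patches.reshape(-1)  # flat feature vector

class DeTokenizer:
    """Toy linear decoder mapping ViT features back to pixels.

    Stage 1 trains `decode` from features alone; stage 2 additionally
    takes a condition image whose pixels are blended in, standing in
    for the fine-grained detail preserved during image manipulation."""

    def __init__(self, n_feat, n_pix, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_pix, n_feat)) * 0.01

    def decode(self, feats, condition=None):
        out = self.W @ feats                  # stage-1 path: features only
        if condition is not None:             # stage-2 path: extra condition image
            out = out + condition.reshape(-1)
        return out
```

The point of the sketch is the interface: the de-tokenizer never sees the MLLM directly, only ViT features, which is what decouples the two training stages.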
2. What are the benefits of using a pre-trained ViT as the visual tokenizer?
- The ViT features serve as a bridge to decouple the training of the visual (de-)tokenizer and the multimodal language model (MLLM) in SEED-X.
- This architecture enables effective high-quality image generation, which is a crucial capability for applying multimodal models in real-world scenarios.
[02] Dynamic Resolution Image Encoding
1. How does SEED-X handle input images of arbitrary sizes and aspect ratios?
- SEED-X's dynamic resolution image encoding divides the input image into a grid of sub-images.
- Extrapolatable 2D positional embeddings are added to the ViT features of each sub-image to provide the MLLM with positional information.
- This allows the MLLM to process images of any resolution, even if the specific resolution was not encountered during training.
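The grid-splitting and positional-embedding steps can be illustrated with a short sketch. The tile size, padding strategy, and sinusoidal embedding here are assumptions for illustration, not SEED-X's exact scheme; the sinusoidal form is one common choice that extrapolates to grid positions never seen in training:

```python
import numpy as np

def split_into_grid(image, tile=224):
    """Pad an (H, W, C) image up to multiples of `tile`, then split it
    into a grid of sub-images, returning the tiles and their grid coords."""
    h, w, c = image.shape
    gh, gw = -(-h // tile), -(-w // tile)        # ceil division
    padded = np.zeros((gh * tile, gw * tile, c), dtype=image.dtype)
    padded[:h, :w] = image
    tiles = (padded
             .reshape(gh, tile, gw, tile, c)
             .transpose(0, 2, 1, 3, 4))          # (gh, gw, tile, tile, c)
    coords = [(r, col) for r in range(gh) for col in range(gw)]
    return tiles.reshape(gh * gw, tile, tile, c), coords

def pos_embedding_2d(row, col, dim):
    """Sinusoidal 2D positional embedding for a sub-image's grid position;
    being a closed-form function of (row, col), it extrapolates to grids
    larger than any seen during training."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half // 2) / (half // 2)))
    enc = lambda p: np.concatenate([np.sin(p * freqs), np.cos(p * freqs)])
    return np.concatenate([enc(row), enc(col)])
```

Each sub-image would then be encoded by the ViT, with its `pos_embedding_2d(row, col, dim)` added to the resulting features before they reach the MLLM.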
2. What are the advantages of dynamic resolution image encoding?
- It enables SEED-X to handle input images of arbitrary sizes and aspect ratios, avoiding the loss of fine-grained information that can occur when resizing images to a fixed pre-defined resolution.
- This is an important capability for real-world applications where the input images may have diverse resolutions and aspect ratios.
[03] Multimodal Pre-training and Instruction Tuning
1. What datasets were used for SEED-X's multimodal pre-training?
- The pre-training data includes image-caption pairs, grounded image-text data, interleaved image-text data, OCR data, and pure text data.
- The images from LAION-COCO and SAM were re-captioned to improve both image comprehension and generation.
2. How was SEED-X instruction-tuned for different real-world applications?
- SEED-X was fine-tuned using a LoRA module on various public and in-house datasets covering image editing, text-rich QA, grounded and referencing QA, conversational interactions, slide generation, and storytelling.
- This resulted in specialized instruction-tuned models like SEED-X-Edit, SEED-X-PPT, SEED-X-Story, and SEED-X-Try-on, each tailored for specific real-world tasks.
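LoRA fine-tuning, as used above, keeps the pre-trained weights frozen and learns only a low-rank update. A minimal numpy sketch of the idea (illustrative, not SEED-X's training code; rank and scaling values are arbitrary):

```python
import numpy as np

class LoRALinear:
    """Frozen pre-trained weight W plus a trainable low-rank update
    (alpha / r) * B @ A, as in LoRA. Only A and B would be trained."""

    def __init__(self, W, r=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                   # frozen
        self.A = rng.standard_normal((r, d_in)) * 0.01
        self.B = np.zeros((d_out, r))                # zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        # Effective weight = W + scaled low-rank correction.
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because `B` starts at zero, the model initially behaves exactly like the frozen base, and each specialized variant (editing, slides, storytelling, try-on) only needs to store its own small `A`/`B` matrices.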
[04] Evaluation and Applications
1. How did SEED-X perform on multimodal benchmarks?
- SEED-X-I achieved competitive performance on image comprehension and generation tasks compared to other state-of-the-art multimodal models.
- It demonstrated the ability to effectively handle images of arbitrary sizes and ratios, as well as generate high-quality images aligned with the given instructions.
2. What real-world applications were showcased for SEED-X?
- SEED-X can function as various multimodal AI assistants, such as an interactive designer generating images while illustrating creative intent, and a knowledgeable personal assistant comprehending images and providing relevant suggestions.
- The instruction-tuned models demonstrated capabilities in high-level instructional image generation, low-level image manipulation, text-rich comprehension, mathematical reasoning, and diagram understanding.