
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Abstract
The article discusses the development of a multimodal image generation model called MUMU, which can generate images from prompts that include both text and reference images. The key points are:
- MUMU is built by replacing the text encoder of the Stable Diffusion XL (SDXL) diffusion model with a vision-language model called Idefics2.
- The model is trained on a dataset of text-image pairs, where image crops corresponding to words in the captions are inserted into the prompts.
- MUMU can harmonize diverse conditioning images from different inputs into a coherent output, and can also perform some style transfer.
- The article evaluates MUMU's strengths and weaknesses, and discusses potential future research directions.
Q&A
[01] Dataset Construction
1. How was the multimodal training dataset for MUMU constructed?
- The dataset was bootstrapped from existing text-image data by using object detection to extract image crops corresponding to words in the image captions (a minimal sketch follows this answer).
- The dataset consists of both synthetic images generated with SDXL and high-quality publicly available images, with the image crops inserted into the text prompts.
- To encourage the model to disentangle content and style, each content prompt was paired with multiple different styles.
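To make the bootstrapping step concrete, here is a minimal sketch of pairing caption words with detected crops and interleaving them into a multimodal prompt. The `open_vocab_detect` helper and the `MultimodalExample` container are hypothetical stand-ins for illustration, not the authors' pipeline.

```python
# Minimal sketch of bootstrapping a multimodal example from a (caption, image)
# pair. `open_vocab_detect` is a hypothetical stand-in for an open-vocabulary
# object detector; the real pipeline and data structures may differ.
from dataclasses import dataclass
from typing import List, Tuple, Union
from PIL import Image

@dataclass
class MultimodalExample:
    target: Image.Image                    # image the diffusion model learns to produce
    prompt: List[Union[str, Image.Image]]  # caption words interleaved with image crops

def open_vocab_detect(image: Image.Image, phrase: str) -> List[Tuple[int, int, int, int]]:
    """Stand-in detector: return (x0, y0, x1, y1) boxes for regions matching `phrase`."""
    return [(0, 0, image.width, image.height)]  # placeholder: the whole image

def build_example(image: Image.Image, caption: str, crop_words: List[str]) -> MultimodalExample:
    prompt: List[Union[str, Image.Image]] = []
    for word in caption.split():
        prompt.append(word)
        # When a caption word names a detectable object, insert its crop right
        # after it, so training prompts look like "text ... <crop> ... text".
        if word.lower() in crop_words:
            boxes = open_vocab_detect(image, word)
            if boxes:
                prompt.append(image.crop(boxes[0]))
    return MultimodalExample(target=image, prompt=prompt)
```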
2. What were some of the key choices made in constructing the dataset?
- The authors biased the dataset to include more images with people, and replaced some full-person crops with just the person's head to get more high-resolution face training data.
- The authors also experimented with including additional conditioning information, such as Canny edges, depth maps, and sketches of the full input image (see the edge-map sketch below).
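As a rough illustration of one such conditioning channel, the snippet below derives a Canny edge map from a training image with OpenCV. The threshold values are placeholders, and depth maps or sketches would come from separate models not specified here.

```python
# Rough illustration of deriving an extra conditioning image (Canny edges)
# from a training image. Threshold values are illustrative only.
import cv2
import numpy as np
from PIL import Image

def canny_condition(image: Image.Image, low: int = 100, high: int = 200) -> Image.Image:
    """Return a Canny edge map of `image` as a 3-channel PIL image."""
    gray = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, low, high)                  # single-channel uint8 edge map
    return Image.fromarray(np.stack([edges] * 3, axis=-1))
```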
[02] MUMU Architecture
1. What are the key components of the MUMU architecture?
- MUMU uses the SDXL diffusion model as its base but replaces the CLIP text encoder with the Idefics2 vision-language model (a conditioning sketch appears at the end of this section).
- Idefics2 is composed of a vision transformer, a perceiver transformer, and a large vision-language model transformer.
- The authors remove (ablate) the perceiver transformer and instead use a larger number of tokens per image, finding that this improves detail preservation.
2. How does MUMU differ from prior work on adding image conditioning to diffusion models?
- Rather than using separate encoders for different types of conditioning, MUMU uses a single multimodal encoder (Idefics2) that can handle both text and image inputs.
- This allows MUMU to directly place conditioning images into the generated output, and to harmonize diverse conditioning images from different sources.
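A minimal PyTorch sketch of this design is shown below: hidden states from the vision-language encoder are projected to the UNet's cross-attention width and passed in place of the CLIP text-encoder output. The module name and the dimensions (4096 → 2048) are assumptions for illustration, not the authors' implementation.

```python
# Sketch (not the authors' code): conditioning an SDXL-style UNet on hidden
# states from a vision-language encoder instead of CLIP text embeddings.
import torch
import torch.nn as nn

class MultimodalConditioner(nn.Module):
    """Projects interleaved text/image tokens from a VLM to the UNet's cross-attention width."""

    def __init__(self, vlm_hidden: int = 4096, unet_cross_dim: int = 2048):
        super().__init__()
        # Linear adapter from the VLM hidden size to the UNet cross-attention
        # dimension; both sizes here are illustrative, not from the paper.
        self.proj = nn.Linear(vlm_hidden, unet_cross_dim)

    def forward(self, vlm_hidden_states: torch.Tensor) -> torch.Tensor:
        # vlm_hidden_states: (batch, seq_len, vlm_hidden), e.g. the final-layer
        # hidden states of the vision-language model over the multimodal prompt.
        return self.proj(vlm_hidden_states)

# Usage: the projected sequence stands in for the CLIP text-encoder output that
# the SDXL UNet normally receives as `encoder_hidden_states`.
conditioner = MultimodalConditioner()
vlm_out = torch.randn(1, 512, 4096)               # placeholder VLM hidden states
encoder_hidden_states = conditioner(vlm_out)      # shape (1, 512, 2048)
```

Note that SDXL also consumes pooled text embeddings alongside the token sequence; how MUMU handles that pathway is not covered by this sketch.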
[03] Evaluation and Findings
1. What are some of the key strengths of the MUMU model?
- MUMU can harmonize diverse conditioning images from different inputs into a coherent output, such as generating a person in a cartoon style or a person riding a scooter.
- MUMU can also perform some style transfer, though it struggles more with translating human faces into abstract styles.
- MUMU's multimodal conditioning allows it to outperform a text-only approach (ChatGPT-4 + DALL-E 3) in preserving details from the input images.
2. What are some of the limitations and areas for improvement with MUMU?
- MUMU struggles with consistency of small details like faces and clothing, and has slightly worse text adherence than the base SDXL model.
- The authors hypothesize that many of these issues could be improved by scaling up the model and dataset, but also identify potential architectural changes like exploring alternative image tokenization methods.
- Evaluating and measuring the model's performance on multimodal tasks is also identified as an area needing further research.