Generative Photomontage
Abstract
The paper proposes a framework for creating desired images by compositing parts of multiple generated images, forming a "Generative Photomontage". Given a stack of images generated by ControlNet from the same input condition with different seeds, users select desired parts from the generated results using a brush-stroke interface. The authors introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. This lets users reach their desired result by combining the best parts of multiple generated images.
Q&A
[01] Text-to-Image Generation
1. What are the key challenges with text-to-image generation models?
- Text-to-image models may not achieve exactly what a user envisions, due to the ambiguity of mapping from a lower-dimensional input space (e.g., text, sketch) to the high-dimensional pixel space.
- It is often challenging to achieve a single image that includes everything the user wants, as the user may like different parts from different generated results.
- While adding various conditions to text-to-image models can provide greater user control, the process is still akin to a "dice roll" where the model generates a range of outputs that differ in lighting, appearance, and backgrounds.
2. How does the proposed framework address these challenges?
- The framework treats the model's generated images as intermediate outputs and allows users to select and composite desired parts from different images.
- This gives users fine-grained control over the final output and significantly increases the likelihood of achieving their desired result, compared to the trial-and-error process of re-rolling the dice.
[02] Proposed Method
1. What are the key steps of the proposed method?
- Start with a stack of images generated by ControlNet using the same input condition and different seeds (see the generation sketch after this list).
- Allow users to select desired parts from the generated results using brush strokes.
- Perform a multi-label graph-based optimization in diffusion feature space to segment the generated images into regions.
- Composite the segmented regions during a final denoising process using a new feature injection and mixing method.
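To make the first step concrete, a stack like this can be produced with off-the-shelf tooling. Below is a minimal sketch using Hugging Face diffusers with a Canny-edge ControlNet; the checkpoint names, prompt, edge-map path, and number of seeds are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Assumed checkpoints and settings; the paper's exact configuration may differ.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

condition = load_image("edge_map.png")   # hypothetical shared input condition
prompt = "a cozy cottage at sunset"      # hypothetical prompt

# Same condition, different seeds -> a stack of structurally aligned images.
stack = []
for seed in range(8):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    stack.append(pipe(prompt, image=condition, generator=generator).images[0])
```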
2. How does the method leverage the shared spatial structures across the generated images?
- The images share common spatial structures from the same input condition, which the method leverages for the composition process.
- The graph-based optimization in diffusion feature space groups regions with similar diffusion features while satisfying the user's input strokes (a simplified sketch follows below).
- The feature injection and mixing method then composites the segmented regions harmoniously (see the sketch at the end of this section).
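As a concrete illustration of this optimization, here is a simplified two-label sketch (the paper uses a multi-label formulation over the whole stack): user strokes act as hard unary constraints, and pairwise weights make it cheap to cut where neighboring diffusion features differ. It uses the PyMaxflow library; the per-pixel feature map and stroke masks are assumed to be precomputed.

```python
import numpy as np
import maxflow  # PyMaxflow

def two_label_graph_cut(features, strokes_a, strokes_b, smoothness=50.0):
    """Assign each pixel to source image A or source image B.

    features  : (H, W, C) per-pixel diffusion features (assumed precomputed,
                e.g., upsampled UNet decoder features).
    strokes_a : (H, W) bool mask of user strokes selecting image A.
    strokes_b : (H, W) bool mask of user strokes selecting image B.
    """
    h, w, _ = features.shape
    g = maxflow.Graph[float]()
    node_ids = g.add_grid_nodes((h, w))

    # Pairwise terms: cutting between similar features is expensive, so the
    # boundary is pushed toward places where the features change.
    diff_x = np.linalg.norm(features[:, 1:] - features[:, :-1], axis=-1)
    diff_y = np.linalg.norm(features[1:, :] - features[:-1, :], axis=-1)
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                cap = smoothness * np.exp(-diff_x[y, x])
                g.add_edge(node_ids[y, x], node_ids[y, x + 1], cap, cap)
            if y + 1 < h:
                cap = smoothness * np.exp(-diff_y[y, x])
                g.add_edge(node_ids[y, x], node_ids[y + 1, x], cap, cap)

    # Unary terms: brush strokes are hard constraints.
    inf = 1e9
    for y in range(h):
        for x in range(w):
            if strokes_a[y, x]:
                g.add_tedge(node_ids[y, x], inf, 0.0)
            elif strokes_b[y, x]:
                g.add_tedge(node_ids[y, x], 0.0, inf)

    g.maxflow()
    return g.get_grid_segments(node_ids)  # True where a pixel is assigned to B
```

A multi-label version (one label per selected source image) is typically solved with move-making algorithms such as alpha-expansion; the two-label case above is only meant to show how strokes and feature similarity enter the energy.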
3. What are the key advantages of the proposed approach?
- User interaction: It strikes a balance between exploration (using the model's generative capabilities) and control (allowing users to select and composite desired parts).
- Artifact correction: Users can replace undesired regions with more visually appealing regions from other images to build towards their desired result.
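The feature injection and mixing step referenced above can be pictured as replacing intermediate UNet features with a mask-weighted blend during one final denoising pass. The sketch below shows the general idea with a PyTorch forward hook on a diffusers UNet block; the chosen block, tensor shapes, and feature-caching scheme are assumptions for illustration, not the paper's exact injection points.

```python
import torch
import torch.nn.functional as F

def make_blending_hook(cached_features, masks):
    """Replace a block's output with a mask-weighted blend of cached features.

    cached_features : list of tensors, one per source image, each (B, C, h, w),
                      recorded at this block during earlier denoising passes.
    masks           : list of (H, W) float tensors from the segmentation step,
                      assumed to sum to 1 at every pixel and to live on the
                      same device/dtype as the features.
    """
    def hook(module, inputs, output):
        blended = torch.zeros_like(output)
        for feat, mask in zip(cached_features, masks):
            # Downsample the pixel-space mask to this block's resolution.
            m = F.interpolate(mask[None, None], size=output.shape[-2:], mode="bilinear")
            blended = blended + feat * m
        return blended  # a returned value replaces the module's output
    return hook

# Hypothetical usage on one decoder block of a diffusers UNet:
# handle = pipe.unet.up_blocks[1].register_forward_hook(
#     make_blending_hook(cached_features, masks))
# ...run the final denoising pass to produce the composite...
# handle.remove()
```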
[03] Results and Evaluation
1. What are the main applications demonstrated in the results?
- Appearance mixing: Combining different components (e.g., roofs, windows, colors) to create new designs or artistic compositions.
- Shape and artifact correction: Replacing incorrectly generated shapes or artifacts with desired parts from other images.
- Prompt alignment: Compositing multiple images generated from simpler prompts to better match a complex prompt.
2. How does the method perform compared to baselines?
- Quantitative evaluation shows the method achieves the highest PSNR and second-lowest LPIPS loss compared to various baselines (see the metric sketch after this list).
- The method's seam gradient scores are within the range of the input image stack, indicating seamless blending.
- User studies demonstrate that the method outperforms baselines in terms of blending quality, while being comparable in realism.
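For reference, PSNR and LPIPS between a composite and a target image can be computed with standard packages. Below is a minimal sketch using scikit-image and the lpips package; the file paths are placeholders.

```python
import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio

def load_rgb(path):
    return np.array(Image.open(path).convert("RGB"))

composite = load_rgb("composite.png")   # placeholder paths
reference = load_rgb("reference.png")

# PSNR on uint8 images (higher is better).
psnr = peak_signal_noise_ratio(reference, composite, data_range=255)

# LPIPS expects NCHW tensors scaled to [-1, 1] (lower is better).
to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
loss_fn = lpips.LPIPS(net="alex")
lpips_dist = loss_fn(to_tensor(composite), to_tensor(reference)).item()

print(f"PSNR: {psnr:.2f} dB, LPIPS: {lpips_dist:.4f}")
```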
3. What are the limitations of the proposed approach?
- If the input images differ significantly in scene structure, the method may rely more on user input to create a valid composite.
- The graph-cut parameters are chosen empirically, and for objects with curved outlines, additional user strokes may be required to obtain a finer boundary.