Generative Photomontage
Abstract
The paper proposes a framework for creating desired images by compositing parts of multiple generated images, forming a "Generative Photomontage". Given a stack of images generated by ControlNet from the same input condition with different seeds, users select desired parts from the generated results using a brush-stroke interface. The authors introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. This lets users reach their desired result by combining the best parts of multiple generated images.
Q&A
[01] Text-to-Image Generation
1. What are the key challenges with text-to-image generation models?
- Text-to-image models may not achieve exactly what a user envisions, due to the ambiguity of mapping from a lower-dimensional input space (e.g., text, sketch) to the high-dimensional pixel space.
- It is often challenging to achieve a single image that includes everything the user wants, as the user may like different parts from different generated results.
- While adding various conditions to text-to-image models can provide greater user control, the process is still akin to a "dice roll" where the model generates a range of outputs that differ in lighting, appearance, and backgrounds.
2. How does the proposed framework address these challenges?
- The framework treats the model's generated images as intermediate outputs and allows users to select and composite desired parts from different images.
- This gives users fine-grained control over the final output and significantly increases the likelihood of achieving their desired result, compared to the trial-and-error process of re-rolling the dice.
[02] Proposed Method
1. What are the key steps of the proposed method?
- Start with a stack of images generated by ControlNet using the same input condition and different seeds (see the generation sketch after this list).
- Allow users to select desired parts from the generated results using brush strokes.
- Perform a multi-label graph-based optimization in diffusion feature space to segment the generated images into regions.
- Composite the segmented regions during a final denoising process using a new feature injection and mixing method.
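To make the first step concrete, a stack like this can be produced with off-the-shelf tooling. Below is a minimal sketch using Hugging Face diffusers with a Canny-edge ControlNet; the checkpoint names, prompt, edge-map path, and number of seeds are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Assumed checkpoints and settings; the paper's exact configuration may differ.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

condition = load_image("edge_map.png")   # hypothetical shared input condition
prompt = "a cozy cottage at sunset"      # hypothetical prompt

# Same condition, different seeds -> a stack of structurally aligned images.
stack = []
for seed in range(8):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    stack.append(pipe(prompt, image=condition, generator=generator).images[0])
```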
2. How does the method leverage the shared spatial structures across the generated images?
- The images share common spatial structures from the same input condition, which the method leverages for the composition process.
- The graph-based optimization in diffusion feature space groups regions with similar diffusion features while satisfying the user's input strokes (a simplified sketch follows below).
- The feature injection and mixing method then composites the segmented regions harmoniously (see the sketch at the end of this section).
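As a concrete illustration of this optimization, here is a simplified two-label sketch (the paper uses a multi-label formulation over the whole stack): user strokes act as hard unary constraints, and pairwise weights make it cheap to cut where neighboring diffusion features differ. It uses the PyMaxflow library; the per-pixel feature map and stroke masks are assumed to be precomputed.

```python
import numpy as np
import maxflow  # PyMaxflow

def two_label_graph_cut(features, strokes_a, strokes_b, smoothness=50.0):
    """Assign each pixel to source image A or source image B.

    features  : (H, W, C) per-pixel diffusion features (assumed precomputed,
                e.g., upsampled UNet decoder features).
    strokes_a : (H, W) bool mask of user strokes selecting image A.
    strokes_b : (H, W) bool mask of user strokes selecting image B.
    """
    h, w, _ = features.shape
    g = maxflow.Graph[float]()
    node_ids = g.add_grid_nodes((h, w))

    # Pairwise terms: cutting between similar features is expensive, so the
    # boundary is pushed toward places where the features change.
    diff_x = np.linalg.norm(features[:, 1:] - features[:, :-1], axis=-1)
    diff_y = np.linalg.norm(features[1:, :] - features[:-1, :], axis=-1)
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                cap = smoothness * np.exp(-diff_x[y, x])
                g.add_edge(node_ids[y, x], node_ids[y, x + 1], cap, cap)
            if y + 1 < h:
                cap = smoothness * np.exp(-diff_y[y, x])
                g.add_edge(node_ids[y, x], node_ids[y + 1, x], cap, cap)

    # Unary terms: brush strokes are hard constraints.
    inf = 1e9
    for y in range(h):
        for x in range(w):
            if strokes_a[y, x]:
                g.add_tedge(node_ids[y, x], inf, 0.0)
            elif strokes_b[y, x]:
                g.add_tedge(node_ids[y, x], 0.0, inf)

    g.maxflow()
    return g.get_grid_segments(node_ids)  # True where a pixel is assigned to B
```

A multi-label version (one label per selected source image) is typically solved with move-making algorithms such as alpha-expansion; the two-label case above is only meant to show how strokes and feature similarity enter the energy.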
3. What are the key advantages of the proposed approach?
- User interaction: It strikes a balance between exploration (using the model's generative capabilities) and control (allowing users to select and composite desired parts).
- Artifact correction: Users can replace undesired regions with more visually appealing regions from other images to build towards their desired result.
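The feature injection and mixing step referenced above can be pictured as replacing intermediate UNet features with a mask-weighted blend during one final denoising pass. The sketch below shows the general idea with a PyTorch forward hook on a diffusers UNet block; the chosen block, tensor shapes, and feature-caching scheme are assumptions for illustration, not the paper's exact injection points.

```python
import torch
import torch.nn.functional as F

def make_blending_hook(cached_features, masks):
    """Replace a block's output with a mask-weighted blend of cached features.

    cached_features : list of tensors, one per source image, each (B, C, h, w),
                      recorded at this block during earlier denoising passes.
    masks           : list of (H, W) float tensors from the segmentation step,
                      assumed to sum to 1 at every pixel and to live on the
                      same device/dtype as the features.
    """
    def hook(module, inputs, output):
        blended = torch.zeros_like(output)
        for feat, mask in zip(cached_features, masks):
            # Downsample the pixel-space mask to this block's resolution.
            m = F.interpolate(mask[None, None], size=output.shape[-2:], mode="bilinear")
            blended = blended + feat * m
        return blended  # a returned value replaces the module's output
    return hook

# Hypothetical usage on one decoder block of a diffusers UNet:
# handle = pipe.unet.up_blocks[1].register_forward_hook(
#     make_blending_hook(cached_features, masks))
# ...run the final denoising pass to produce the composite...
# handle.remove()
```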
[03] Results and Evaluation
1. What are the main applications demonstrated in the results?
- Appearance mixing: Combining different components (e.g., roofs, windows, colors) to create new designs or artistic compositions.
- Shape and artifact correction: Replacing incorrectly generated shapes or artifacts with desired parts from other images.
- Prompt alignment: Compositing multiple images generated from simpler prompts to better match a complex prompt.
2. How does the method perform compared to baselines?
- Quantitative evaluation shows the method achieves the highest PSNR and second-lowest LPIPS loss compared to various baselines (see the metric sketch after this list).
- The method's seam gradient scores are within the range of the input image stack, indicating seamless blending.
- User studies demonstrate that the method outperforms baselines in terms of blending quality, while being comparable in realism.
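For reference, PSNR and LPIPS between a composite and a target image can be computed with standard packages. Below is a minimal sketch using scikit-image and the lpips package; the file paths are placeholders.

```python
import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio

def load_rgb(path):
    return np.array(Image.open(path).convert("RGB"))

composite = load_rgb("composite.png")   # placeholder paths
reference = load_rgb("reference.png")

# PSNR on uint8 images (higher is better).
psnr = peak_signal_noise_ratio(reference, composite, data_range=255)

# LPIPS expects NCHW tensors scaled to [-1, 1] (lower is better).
to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
loss_fn = lpips.LPIPS(net="alex")
lpips_dist = loss_fn(to_tensor(composite), to_tensor(reference)).item()

print(f"PSNR: {psnr:.2f} dB, LPIPS: {lpips_dist:.4f}")
```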
3. What are the limitations of the proposed approach?
- If the input images differ significantly in scene structure, the method may rely more on user input to create a valid composite.
- The graph-cut parameters are chosen empirically, and for objects with curved outlines, additional user strokes may be required to obtain a finer boundary.