Artist: Aesthetically Controllable Text-Driven Stylization without Training
Abstract
The paper introduces a training-free approach called "Artist" that disentangles content and style generation in the diffusion process to achieve fine-grained and aesthetically-controlled text-driven image stylization. The key insights are:
- Disentangling the denoising of content and style into separate diffusion processes while sharing information between them.
- Proposing simple yet effective content and style control methods that suppress style-irrelevant content generation, resulting in harmonious stylization results.
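The two bullet points above can be pictured as two synchronized denoising loops that start from the same noise and exchange information at every step. The sketch below is a toy scalar illustration of that idea, not the paper's implementation: the real method operates on U-Net latents, and all function names and constants here are hypothetical.

```python
# Toy sketch of dual-trajectory denoising: a content process and a style
# process run in parallel from the same initial noise, with the content
# trajectory's features shared into the style trajectory each step.
# Latents are scalars and denoise_step is a stand-in for a U-Net call.

def denoise_step(latent, target, shared_feature=0.0):
    """One toy denoising step: pull the latent toward the prompt target
    while mixing in a feature shared from the other trajectory."""
    return 0.9 * latent + 0.1 * target + 0.05 * shared_feature

def dual_trajectory_denoise(z_T, steps=50):
    z_content = z_style = z_T          # both trajectories start from the same noise
    for _ in range(steps):
        feat = z_content               # content features are shared one-way
        z_content = denoise_step(z_content, target=1.0)
        z_style = denoise_step(z_style, target=2.0, shared_feature=feat)
    return z_content, z_style

content, style = dual_trajectory_denoise(z_T=5.0)
```

Because each trajectory has its own update rule, the style process can be driven as hard as desired without dragging the content process along with it, which is the "no trade-off" property the abstract claims.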
Q&A
[01] Characterizing Content and Style in Diffusion Process
1. How does the paper characterize the generation of content and style during the diffusion process? The paper provides an intuitive yet practical analysis of how content and style emerge during the diffusion process. The analysis reveals an inherent challenge in controlling them within a single diffusion trajectory: content modification grows roughly linearly over denoising time, while stylization strength grows quadratically. This mismatch demonstrates the entanglement of content and style in a single diffusion process.
2. What is the key insight behind the proposed approach to address this challenge? The key insight is to disentangle the denoising of content and style into separate diffusion processes while sharing information between them. This allows the strength of style and content generation to each be fully attained in its own dedicated process, without trading one off against the other.
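The linear-versus-quadratic claim above can be illustrated with a toy accumulation argument. This is not the paper's derivation; it is a hedged numerical sketch under the assumption that each denoising step contributes a constant amount of content change but an amount of style change that grows with the step index, so the accumulated quantities are linear and quadratic in time, respectively.

```python
# Toy illustration (not the paper's math): constant per-step content change
# vs. linearly growing per-step style change, accumulated over denoising time.
import itertools

steps = 50
content_per_step = [1.0] * steps                 # constant content increment
style_per_step = [float(t) for t in range(steps)]  # style increment grows with t

content_drift = list(itertools.accumulate(content_per_step))  # ~ t  (linear)
style_strength = list(itertools.accumulate(style_per_step))   # ~ t^2/2 (quadratic)
```

The takeaway matches the Q&A answer: any single schedule that tunes one quantity to a desired level over-shoots or under-shoots the other, which motivates splitting them into two trajectories.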
[02] Proposed Content and Style Control Methods
1. What are the two main components of the proposed content and style control methods? The two main components are:
- Content control: Using the hidden features inside the ResBlock, rather than its output, to better disentangle content from style, enabling fine-grained control over content preservation.
- Style control: Decomposing style generation into content-aware and content-independent components, where content-aware style generation is achieved by injecting the cross-attention query from the content process into the style process.
2. How do these control methods help achieve aesthetically fine-grained stylization? The proposed content and style control methods suppress style-irrelevant content generation and enable flexible control over the degree of content abstractness, resulting in harmonious stylization results that align well with the given style prompts.
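The two injection points described above can be sketched as toy feature-injection hooks. This is a hypothetical reduction of the mechanism: the real method hooks U-Net ResBlocks and cross-attention layers, while here each module is a scalar callable that can either compute its intermediate feature normally or accept one injected from the content trajectory.

```python
# Toy sketch of the two control mechanisms: the content pass records an
# intermediate feature, and the style pass reuses it in place of its own.
# All classes and constants are illustrative stand-ins, not the paper's code.

class ToyResBlock:
    """Content control: the hidden feature (before the residual add) can be
    overridden by the one recorded from the content trajectory."""
    def forward(self, x, injected_hidden=None):
        hidden = 2.0 * x if injected_hidden is None else injected_hidden
        return hidden + x, hidden      # (output, hidden feature to record)

class ToyCrossAttention:
    """Style control: the attention query can be overridden by the query
    recorded from the content trajectory (content-aware style)."""
    def forward(self, x, injected_query=None):
        query = x + 1.0 if injected_query is None else injected_query
        return 0.5 * query, query      # (output, query to record)

res, attn = ToyResBlock(), ToyCrossAttention()

# Content pass: run normally and record the intermediate features.
out_c, hidden_c = res.forward(3.0)
out_c2, query_c = attn.forward(3.0)

# Style pass: inject the recorded content features into the style trajectory.
out_s, _ = res.forward(0.5, injected_hidden=hidden_c)
out_s2, _ = attn.forward(0.5, injected_query=query_c)
```

The design point is that injection replaces only the style-irrelevant intermediate (the ResBlock hidden feature or the attention query), so the style trajectory keeps its own stylization capacity while inheriting content structure.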
[03] Evaluation and Comparison
1. What are the key metrics used to evaluate the text-driven stylization performance? The paper introduces the use of Visual-Language Models (VLMs) as aesthetic-level metrics, including Content-Aware Style Alignment, Style-Aware Content Alignment, and Aesthetic Score. These metrics aim to holistically evaluate the content preservation and style strength alignment with human aesthetic preferences.
2. How does the proposed Artist method perform compared to existing methods? Extensive experiments demonstrate that the proposed Artist method outperforms previous methods in achieving the best balance between stylization strength and content preservation, as evidenced by the superior performance on both conventional metrics (e.g., LPIPS, CLIP Alignment) and the introduced aesthetic-level VLM-based metrics.
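The VLM-based evaluation described above amounts to prompting a vision-language model as an aesthetic judge. The sketch below shows one plausible way to build such judging prompts for the three metric names listed in the Q&A; the exact prompt wording and scoring scale used in the paper are not reproduced here, and `build_vlm_prompt` is a hypothetical helper.

```python
# Hedged sketch of VLM-as-judge prompt construction for the three
# aesthetic-level metrics named in the evaluation. The templates are
# illustrative assumptions, not the paper's actual prompts.

def build_vlm_prompt(metric, content_prompt="", style_prompt=""):
    templates = {
        "content_aware_style_alignment": (
            f"Rate from 1 to 10 how well the style '{style_prompt}' is "
            f"expressed in the image while respecting its original content."),
        "style_aware_content_alignment": (
            f"Rate from 1 to 10 how well the content '{content_prompt}' is "
            f"preserved in the image under the applied style."),
        "aesthetic_score": (
            "Rate from 1 to 10 the overall aesthetic quality of the image."),
    }
    return templates[metric]

prompt = build_vlm_prompt("content_aware_style_alignment",
                          content_prompt="a cat on a sofa",
                          style_prompt="watercolor painting")
```

Each prompt would be sent to a VLM together with the stylized image (and the source image where relevant), and the returned score averaged over the test set.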