Stable Diffusion 2 Is the First Artist-Friendly AI Art Model
Abstract
The article discusses the release of Stable Diffusion 2, the latest version of the open-source generative AI model from Stability.ai. It covers the key technical improvements in Stable Diffusion 2, the controversies around the changes made to the text/image encoder, and the broader implications for the generative AI landscape.
Q&A
[01] Stable Diffusion 2 Overview
1. What are the key technical improvements in Stable Diffusion 2 compared to previous versions?
- The baseline Stable Diffusion 2.0-base model is trained on an aesthetic subset of the LAION-5B dataset and generates 512x512 images
- Stable Diffusion 2.0-v generates images at a higher default resolution of 768x768
- Depth2img is a depth-to-image model that infers the depth of an input image and conditions generation on it, preserving structure and coherence in image-to-image transformations
- The upscaler model enhances the resolution 4x (e.g. from 512x512 to 2048x2048)
- A text-guided inpainting model allows semantically replacing parts of the original image; a usage sketch of these variants follows below
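The sketch below shows, using the Hugging Face diffusers library, roughly how these variants can be loaded and combined. It is a minimal illustration under stated assumptions: the model IDs are the public stabilityai checkpoints on the Hugging Face Hub, exact call signatures vary across diffusers versions, and loading every pipeline at once needs substantial GPU memory.

```python
import torch
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionDepth2ImgPipeline,
    StableDiffusionUpscalePipeline,
    StableDiffusionInpaintPipeline,
)

device = "cuda"

# 2.0-base: trained on an aesthetic subset of LAION-5B, native 512x512 output.
base = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to(device)
img_512 = base("a watercolor landscape at dusk", height=512, width=512).images[0]

# 2.0-v: the higher-resolution model with a native 768x768 output.
sd2_v = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to(device)
img_768 = sd2_v("a watercolor landscape at dusk", height=768, width=768).images[0]

# Depth2img: conditions generation on the estimated depth of an input image,
# which is what preserves structure across image-to-image transformations.
depth = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to(device)
restyled = depth(prompt="an oil painting", image=img_512, strength=0.7).images[0]

# Upscaler: 4x super-resolution, e.g. 512x512 -> 2048x2048.
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to(device)
img_2048 = upscaler(prompt="a watercolor landscape at dusk", image=img_512).images[0]

# Inpainting: semantically replaces the masked region of the original image.
# `mask` would be a black-and-white PIL image, white marking the area to redraw.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to(device)
# patched = inpaint(prompt="a red balloon", image=img_512, mask_image=mask).images[0]
```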
2. Why did Stability.ai decide to replace the CLIP encoder with their own OpenCLIP encoder?
- Stability.ai wanted an encoder trained on an openly available dataset (a LAION-5B subset) rather than OpenAI's CLIP, whose training dataset has never been made public
- This allows developers and users to better understand what the encoder has learned and how it works
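Because the weights and training data are open, the encoder can be loaded and probed directly. A minimal sketch with the open_clip library follows; the "laion2b_s32b_b79k" checkpoint tag is an assumption (it is the public LAION-trained ViT-H/14, the family Stable Diffusion 2 builds on) and may not match byte-for-byte what Stability.ai ships.

```python
import torch
import open_clip

# Load a LAION-trained OpenCLIP encoder. The checkpoint tag is an assumption:
# it is the public ViT-H/14 trained on a LAION subset, the same family that
# Stable Diffusion 2 uses as its text encoder.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["a painting of a fox in the snow"])
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 1024]) for ViT-H/14
```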
3. What are the implications of using OpenCLIP instead of CLIP?
- The things Stable Diffusion 2 "knows" are different from what Stable Diffusion 1, DALL-E, and Midjourney "know" due to the different training data
- This means that prompt techniques and heuristics that worked for earlier versions may not work as well for Stable Diffusion 2
- However, Stability.ai claims that OpenCLIP has learned the same concepts better, even if differently
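One way to make this concrete: open_clip can load both OpenAI's original CLIP weights and the LAION-trained OpenCLIP weights, so the same prompt pair can be compared in each encoder's text space to see that the learned spaces differ. The sketch below is illustrative only; the prompt pair is hypothetical, not taken from the article.

```python
import torch
import torch.nn.functional as F
import open_clip

def prompt_similarity(model_name: str, pretrained: str, prompts: list[str]) -> float:
    """Cosine similarity between two prompts in one encoder's text space."""
    model, _, _ = open_clip.create_model_and_transforms(model_name, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(model_name)
    with torch.no_grad():
        feats = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)
    return (feats[0] @ feats[1]).item()

prompts = ["trending on artstation", "highly detailed digital painting"]

# OpenAI's original CLIP weights (ViT-L/14), the text encoder behind SD 1.x.
sim_clip = prompt_similarity("ViT-L-14", "openai", prompts)
# LAION-trained OpenCLIP weights (ViT-H/14), the family behind SD 2.
sim_openclip = prompt_similarity("ViT-H-14", "laion2b_s32b_b79k", prompts)

print(f"CLIP: {sim_clip:.3f}  vs  OpenCLIP: {sim_openclip:.3f}")
```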
[02] Controversies around Stable Diffusion 2
1. What are the main criticisms from users regarding the changes in Stable Diffusion 2?
- Stability.ai removed most NSFW content, celebrity images, and the names of famous (modern) artists from the training data
- This means users can no longer generate images "in the style of" specific artists, which was a popular use case for Stable Diffusion 1
- Many users consider these changes a "step back" and a "regression" in the capabilities of Stable Diffusion
2. Why did Stability.ai make these changes to the training data?
- Stability.ai is likely trying to comply with copyright law and reduce its reliance on the legally dubious practice of scraping artists' work from the internet without permission
- This is similar to the issues faced by companies like Microsoft, GitHub, and OpenAI with the Copilot lawsuit
3. How does the article's author view the tradeoffs between user frustration and artist protection?
- The author acknowledges that Stable Diffusion 2 is objectively more limited in its ability to generate art compared to previous versions or competitors like Midjourney
- However, the author believes Stability.ai had a valid reason to make these changes to avoid potential lawsuits and better protect the rights of artists whose work was used in the training data
- The author argues that the AI community should have more open-minded and respectful conversations about balancing the needs of users and artists in the context of generative AI