Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models
Abstract
The paper addresses the problem of multi-object 3D pose control in image diffusion models. It proposes "Neural Assets": per-object latent representations with consistent 3D appearance but variable 3D pose. Neural Assets are trained by extracting visual representations of an object from one frame of a video and reconstructing the object's appearance in a different frame, conditioned on the corresponding 3D bounding boxes; this objective disentangles appearance features from pose features. The paper demonstrates that Neural Assets enable fine-grained 3D pose and placement control of individual objects in a scene, as well as compositional scene generation, such as swapping backgrounds and transferring objects across scenes.
Q&A
[01] Neural Assets
1. What are Neural Assets and how are they used? Neural Assets are per-object latent representations that disentangle an object's 3D appearance from its 3D pose. They are trained by extracting visual features of an object from one frame of a video and reconstructing the object's appearance in a different frame, conditioned on its 3D bounding box in that target frame. Because appearance comes from one frame and pose from another, the model is forced to learn an appearance code that stays consistent across viewpoints and a pose code that tracks the 3D bounding box.
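A minimal sketch of this frame-pair setup may help. The module below is illustrative, assuming a PyTorch-style codebase; `NeuralAssetEncoder`, the feature dimensions, and the 8-corner box parameterization are assumptions for exposition, not the paper's released code.

```python
import torch
import torch.nn as nn

class NeuralAssetEncoder(nn.Module):
    """Builds one token per object from source-frame appearance and target-frame pose."""

    def __init__(self, feature_dim: int = 768, token_dim: int = 768):
        super().__init__()
        # Appearance branch: project pooled visual-backbone features of the object.
        self.appearance_proj = nn.Linear(feature_dim, token_dim)
        # Pose branch: embed the 3D bounding box (8 corners x 3 coordinates = 24 values).
        self.pose_mlp = nn.Sequential(
            nn.Linear(24, token_dim), nn.GELU(), nn.Linear(token_dim, token_dim)
        )

    def forward(self, src_features: torch.Tensor, tgt_boxes: torch.Tensor) -> torch.Tensor:
        # src_features: (B, N, feature_dim) per-object features from the SOURCE frame.
        # tgt_boxes:    (B, N, 8, 3) box corners from the TARGET frame.
        appearance = self.appearance_proj(src_features)      # (B, N, token_dim)
        pose = self.pose_mlp(tgt_boxes.flatten(-2))          # (B, N, token_dim)
        # Concatenating the two halves yields one Neural Asset token per object.
        return torch.cat([appearance, pose], dim=-1)         # (B, N, 2 * token_dim)
```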
2. How do Neural Assets enable 3D-aware object editing? By encoding both the visual appearance and the 3D pose of each object, Neural Assets expose an interface for fine-grained control over the 3D placement, rotation, and occlusion of individual objects in a scene. Editing an object's 3D bounding box coordinates at inference time translates, rotates, or resizes it while preserving its visual appearance.
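As a concrete illustration, a rotation edit reduces to transforming the box corners before they are encoded. The helper below is a hypothetical sketch, not the paper's API:

```python
import numpy as np

def rotate_box(corners: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Rotate an (8, 3) array of box corners about the vertical axis through the box center."""
    theta = np.deg2rad(yaw_deg)
    rot = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta),  np.cos(theta), 0.0],
        [0.0,            0.0,           1.0],
    ])
    center = corners.mean(axis=0)
    return (corners - center) @ rot.T + center
```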
3. How are Neural Assets integrated with a text-to-image diffusion model? The paper tokenizes the Neural Assets (appearance and pose representations) and feeds them to a pre-trained text-to-image diffusion model, such as Stable Diffusion, as a sequence of conditioning tokens in place of the usual text embeddings. This reuses the existing cross-attention architecture while enabling 3D-aware object control.
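In code, the conditioning interface could look like the following sketch. `build_conditioning` and the objects-first, background-last token layout are assumptions for illustration (the paper does model the background with its own token, per the ablation discussed below):

```python
import torch

def build_conditioning(assets: torch.Tensor, background: torch.Tensor) -> torch.Tensor:
    """Concatenate per-object asset tokens with a background token.

    assets:     (B, N, D) one token per object (appearance + pose)
    background: (B, 1, D) a separate token for the scene background
    """
    return torch.cat([assets, background], dim=1)  # (B, N + 1, D)

# The sequence is then passed where text embeddings normally go, e.g.:
#   noise_pred = unet(noisy_latents, timestep, encoder_hidden_states=cond_tokens)
```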
[02] Experimental Results
1. How does the Neural Assets model perform compared to baselines on 3D object editing tasks? The paper evaluates the model on both synthetic (OBJect, MOVi-E) and real-world (Objectron, Waymo Open) datasets. The results show that the Neural Assets model significantly outperforms baselines like 3DIT and a chained approach, achieving state-of-the-art performance on metrics such as PSNR, SSIM, LPIPS, and DINO feature similarity.
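For reference, these image metrics are typically computed as in the following sketch, using the `torchmetrics` and `lpips` packages. This mirrors common evaluation practice rather than the paper's exact code; the DINO similarity (cosine similarity between DINO features of predicted and ground-truth images) is omitted for brevity.

```python
import torch
import lpips
from torchmetrics.functional import (
    peak_signal_noise_ratio,
    structural_similarity_index_measure,
)

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance, lower is better

def score(pred: torch.Tensor, target: torch.Tensor) -> dict:
    """pred / target: (B, 3, H, W) images with values in [0, 1]."""
    return {
        "psnr": peak_signal_noise_ratio(pred, target, data_range=1.0).item(),
        "ssim": structural_similarity_index_measure(pred, target, data_range=1.0).item(),
        # LPIPS expects inputs scaled to [-1, 1].
        "lpips": lpips_fn(pred * 2 - 1, target * 2 - 1).mean().item(),
    }
```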
2. What practical applications does the Neural Assets model support? The paper demonstrates that the Neural Assets model can enable a variety of 3D-aware scene editing capabilities, including translating, rotating, and resizing individual objects, as well as more complex operations like swapping backgrounds and transferring objects across scenes.
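To make the compositional edits concrete, here is a hedged illustration of how background swapping could be expressed at the token level, reusing the hypothetical objects-first, background-last layout from the earlier sketch:

```python
import torch

def swap_background(cond_a: torch.Tensor, cond_b: torch.Tensor) -> torch.Tensor:
    """Regenerate scene A with scene B's background.

    Both inputs are (B, N + 1, D) conditioning sequences with the
    background token last, as in the earlier sketch.
    """
    cond = cond_a.clone()
    cond[:, -1] = cond_b[:, -1]  # overwrite A's background token with B's
    return cond
```

Object transfer works analogously: copy an object's appearance token from one scene's sequence into another's, paired with a new 3D bounding box in the target scene.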
3. How does the design of the Neural Assets representation impact the model's performance? The paper conducts an ablation study to analyze the importance of different design choices, such as the visual encoder, background modeling, and training strategy. The results show that using a DINO-based visual encoder, modeling the background separately, and training on paired video frames are all crucial for achieving the best performance.
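As context for the encoder ablation, DINO features can be obtained from the public torch.hub checkpoint as sketched below. The crop-and-encode interface is a simplification; how the paper pools per-object features from the backbone may differ.

```python
import torch

# Load the public DINO ViT-S/16 checkpoint from torch.hub.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

@torch.no_grad()
def object_features(crops: torch.Tensor) -> torch.Tensor:
    """crops: (N, 3, 224, 224) object crops, ImageNet-normalized.

    Returns (N, 384) CLS-token features for ViT-S/16.
    """
    return dino(crops)
```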