Oral Session

Orals 6B Image & Video Synthesis

Summit Flex Hall AB


Fri 21 Jun 1 p.m. PDT — 2:30 p.m. PDT

Overflow in Signature Room on the 5th Floor in Summit

Fri 21 June 13:00 - 13:18 PDT

Oral #1
Alchemist: Parametric Control of Material Properties with Diffusion Models

Prafull Sharma · Varun Jampani · Yuanzhen Li · Xuhui Jia · Dmitry Lagun · Fredo Durand · William Freeman · Mark Matthews

We propose a method to control material attributes of objects like roughness, metallic, albedo, and transparency in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism, employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes, we generated an object-centric synthetic dataset with physically-based materials. Fine-tuning a modified pre-trained text-to-image model on this synthetic dataset enables us to edit material properties in real-world images while preserving all other attributes. We show the potential application of our model to material-edited NeRFs.

Fri 21 June 13:18 - 13:36 PDT

Oral #2
Generative Image Dynamics

Zhengqi Li · Richard Tucker · Noah Snavely · Aleksander Holynski

We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics of objects such as trees, flowers, candles, and clothes swaying in the wind. We model dense, long-term motion in the Fourier domain as spectral volumes, which we find are well-suited to prediction with diffusion models. Given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, the predicted motion representation can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to realistically interact with objects in a real picture by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics.
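As a rough illustration of the step "a spectral volume ... can be converted into a motion texture", the sketch below applies an inverse real FFT along the frequency axis. The array layout (frequency, height, width, x/y), the function name `spectral_volume_to_motion`, and the toy data are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def spectral_volume_to_motion(spectral_volume, num_frames):
    """Convert per-pixel Fourier coefficients into per-frame displacements.

    spectral_volume: complex array of shape (K, H, W, 2) holding the K
        lowest-frequency coefficients of each pixel's x/y trajectory
        (assumed layout, for illustration only).
    Returns a real array of shape (num_frames, H, W, 2).
    """
    # Inverse real FFT along the frequency axis; frequencies that are
    # not provided are treated as zero.
    return np.fft.irfft(spectral_volume, n=num_frames, axis=0)

# Toy data standing in for trajectories extracted from video.
rng = np.random.default_rng(0)
trajectories = rng.standard_normal((16, 4, 4, 2))

full_volume = np.fft.rfft(trajectories, axis=0)  # shape (9, 4, 4, 2)
low_freq_volume = full_volume[:4]                # keep only 4 low frequencies

motion = spectral_volume_to_motion(low_freq_volume, num_frames=16)
recovered = spectral_volume_to_motion(full_volume, num_frames=16)
```

Because an inverse discrete Fourier transform is periodic in time, a motion texture produced this way repeats after `num_frames` frames, which fits the looping-video application mentioned in the abstract.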

Fri 21 June 13:36 - 13:54 PDT

Oral #3
Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

Daniel Geng · Inbum Park · Andrew Owens

We consider the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip. We present a conceptually simple, zero-shot method to do so based on diffusion. For every diffusion step we estimate the noise from different views of a noisy image, combine the noise estimates, and perform a step of the reverse diffusion process. A theoretical analysis shows that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram, which includes images that change appearance upon a rotation or a flip, but also upon more exotic pixel permutations such as a jigsaw rearrangement. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method.
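The per-step procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: `predict_noise` is a hypothetical stand-in for a diffusion model's noise predictor, the estimates are mapped back with each view's inverse and combined by a plain average, and the reverse diffusion step itself is omitted. None of these details are confirmed by the abstract.

```python
import numpy as np

def combined_noise_estimate(noisy_image, views, inverse_views, predict_noise):
    """Estimate noise under several views and combine the estimates."""
    estimates = []
    for view, inverse in zip(views, inverse_views):
        eps = predict_noise(view(noisy_image))  # noise estimate in the viewed frame
        estimates.append(inverse(eps))          # map it back to the original frame
    # Assumption: combine by averaging.
    return np.mean(estimates, axis=0)

# Two views: the identity and a vertical flip (a flip is its own inverse).
views = [lambda x: x, np.flipud]
inverse_views = [lambda x: x, np.flipud]

# Hypothetical stand-in for a trained noise predictor.
predict_noise = lambda x: 0.1 * x

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
combined = combined_noise_estimate(image, views, inverse_views, predict_noise)

# A pixel permutation written as a matrix P is orthogonal: P^T P = I.
P = np.eye(6)[rng.permutation(6)]
is_orthogonal = np.allclose(P.T @ P, np.eye(6))
```

The last lines check the property the abstract relies on: permutation matrices, which cover flips, 90-degree rotations, and jigsaw rearrangements of pixels, are orthogonal.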

Fri 21 June 13:54 - 14:12 PDT

Oral #4
MonoHair: High-Fidelity Hair Modeling from a Monocular Video

Keyu Wu · Lingchen Yang · Zhiyi Kuang · Yao Feng · Xutao Han · Yuefan Shen · Hongbo Fu · Kun Zhou · Youyi Zheng

Undoubtedly, high-fidelity 3D hair is crucial for achieving realism, artistic expression, and immersion in computer graphics. While existing 3D hair modeling methods have achieved impressive performance, the challenge of achieving high-quality hair reconstruction persists: they either require strict capture conditions, making practical applications difficult, or heavily rely on learned prior data, obscuring fine-grained details in images. To address these challenges, we propose MonoHair, a generic framework to achieve high-fidelity hair reconstruction from a monocular video, without specific requirements for environments. Our approach bifurcates the hair modeling process into two main stages: precise exterior reconstruction and interior structure inference. The exterior is meticulously crafted using our Patch-based Multi-View Optimization (PMVO). This method strategically collects and integrates hair information from multiple views, independent of prior data, to produce a high-fidelity exterior 3D line map. This map not only captures intricate details but also facilitates the inference of the hair’s inner structure. For the interior, we employ a data-driven, multi-view 3D hair reconstruction method. This method utilizes 2D structural renderings derived from the reconstructed exterior, mirroring the synthetic 2D inputs used during training. This alignment effectively bridges the domain gap between our training data and real-world data, thereby enhancing the accuracy and reliability of our interior structure inference. Lastly, we generate a strand model and resolve the directional ambiguity by our hair growth algorithm. Our experiments demonstrate that our method exhibits robustness across diverse hairstyles and achieves state-of-the-art performance. For more results, please refer to our project page.

Fri 21 June 14:12 - 14:30 PDT

Oral #5
Analyzing and Improving the Training Dynamics of Diffusion Models

Tero Karras · Miika Aittala · Jaakko Lehtinen · Janne Hellsten · Timo Aila · Samuli Laine

Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
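To give a feel for what preserving activation magnitudes "on expectation" can mean, the sketch below normalizes each weight row of a linear layer to unit length, so that unit-variance inputs produce roughly unit-variance outputs regardless of how large the raw weights are. This is only an illustration of the general idea under simple assumptions (independent, zero-mean, unit-variance inputs); the function names are hypothetical and this is not the paper's layer design.

```python
import numpy as np

def normalize_rows(weight, eps=1e-8):
    """Scale each output unit's weight vector to unit L2 norm."""
    norms = np.linalg.norm(weight, axis=1, keepdims=True)
    return weight / (norms + eps)

def magnitude_preserving_linear(x, weight):
    """Linear layer that uses row-normalized weights (illustrative only)."""
    return x @ normalize_rows(weight).T

rng = np.random.default_rng(0)
x = rng.standard_normal((20000, 64))          # unit-variance inputs
weight = 5.0 * rng.standard_normal((32, 64))  # deliberately large raw weights

plain_std = (x @ weight.T).std()              # grows with the weight scale
preserved_std = magnitude_preserving_linear(x, weight).std()  # stays near 1
```

Comparing `plain_std` with `preserved_std` shows the effect: the unnormalized layer amplifies activations in proportion to the weight scale, while the normalized layer keeps their standard deviation close to 1.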