

Oral Session

Oral Session 1A: Image and Video Synthesis

Fri 13 June 7:00 - 8:15 PDT

Fri 13 June 7:00 - 7:15 PDT

Motion Prompting: Controlling Video Generation with Motion Trajectories

Daniel Geng · Charles Herrmann · Junhwa Hur · Forrester Cole · Serena Zhang · Tobias Pfaff · Tatiana Lopez-Guevara · Yusuf Aytar · Michael Rubinstein · Chen Sun · Oliver Wang · Andrew Owens · Deqing Sun

Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility, we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance.
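The core idea is to condition the video model on point trajectories rather than text alone. As a rough illustration of how sparse tracks could be rasterized into a conditioning tensor, here is a minimal sketch; the tensor layout and the helper name encode_motion_prompt are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def encode_motion_prompt(tracks, T, H, W):
    """Rasterize sparse point tracks into a conditioning volume.

    tracks: list of (T, 2) arrays of (x, y) positions, with NaN where a
            track is not specified (temporally sparse prompts).
    Returns a (T, H, W, 2) array holding the displacement to the next frame
    at tracked locations and zeros elsewhere. This layout is an illustrative
    assumption, not the paper's exact encoding.
    """
    cond = np.zeros((T, H, W, 2), dtype=np.float32)
    for tr in tracks:
        for t in range(T - 1):
            if np.isnan(tr[t]).any() or np.isnan(tr[t + 1]).any():
                continue  # track not specified at this time step
            x, y = np.round(tr[t]).astype(int)
            if 0 <= x < W and 0 <= y < H:
                cond[t, y, x] = tr[t + 1] - tr[t]  # displacement vector
    return cond

# Example: a single track dragging a point to the right over 16 frames.
track = np.stack([np.linspace(32, 96, 16), np.full(16, 64)], axis=1)
prompt = encode_motion_prompt([track], T=16, H=128, W=128)
print(prompt.shape)  # (16, 128, 128, 2)
```

Temporal sparsity falls out naturally in this view: frames where a track is unspecified simply contribute nothing to the conditioning volume.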

Fri 13 June 7:15 - 7:30 PDT

Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

Ryan Burgert · Yuancheng Xu · Wenqi Xian · Oliver Pilarski · Pascal Clausen · Mingming He · Li Ma · Yitong Deng · Lingxiao Li · Mohsen Mousavi · Michael Ryoo · Paul Debevec · Ning Yu

Generative modeling aims to transform chaotic noise into structured outputs that align with training data distributions. In this work, we enhance video diffusion generative models by introducing motion control as a structured component within latent space sampling. Specifically, we propose a novel real-time noise warping method that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, enabling fine-grained motion control independent of model architecture and guidance type. We fine-tune modern video diffusion base models and provide a unified paradigm for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. By leveraging a real-time noise-warping algorithm that preserves spatial Gaussianity while efficiently maintaining temporal consistency, we enable flexible and diverse motion control applications with minimal trade-offs in pixel quality and temporal coherence. Extensive experiments and user studies demonstrate the advantages of our method in terms of visual quality, motion controllability, and temporal consistency, making it a robust and scalable solution for motion-controllable video synthesis.
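At the heart of the method is replacing independent per-frame Gaussian noise with noise warped along optical flow, so the noise is temporally correlated while each frame stays spatially Gaussian. The sketch below conveys that idea under simplifying assumptions; the per-frame re-whitening is a crude stand-in for the paper's Gaussianity-preserving warping, and the actual algorithm is designed to run in real time.

```python
import torch
import torch.nn.functional as F

def warp_noise_sequence(flows, shape, generator=None):
    """Generate temporally correlated latent noise by warping the previous
    frame's noise along optical flow.

    flows: (T-1, 2, H, W) backward flow fields in pixel units.
    shape: (C, H, W) of a single noise frame.
    Returns (T, C, H, W) noise that follows the flow over time.
    Simplified illustration only: re-whitening each frame stands in for the
    paper's Gaussianity-preserving noise warping.
    """
    C, H, W = shape
    T = flows.shape[0] + 1
    noise = [torch.randn(C, H, W, generator=generator)]
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float()          # (H, W, 2)
    for t in range(T - 1):
        flow = flows[t].permute(1, 2, 0)                  # (H, W, 2)
        grid = base + flow                                # where each pixel samples from
        # Normalize to [-1, 1] for grid_sample.
        grid = torch.stack([grid[..., 0] / (W - 1),
                            grid[..., 1] / (H - 1)], dim=-1) * 2 - 1
        warped = F.grid_sample(noise[-1][None], grid[None], mode="bilinear",
                               padding_mode="reflection", align_corners=True)[0]
        # Re-whiten so each frame stays approximately unit Gaussian.
        warped = (warped - warped.mean()) / (warped.std() + 1e-6)
        noise.append(warped)
    return torch.stack(noise)
```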

Fri 13 June 7:30 - 7:45 PDT

LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping

Pascal Chang · Sergio Sancho · Jingwei Tang · Markus Gross · Vinicius C. Azevedo

Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce Laplacian Pyramid Warping, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams [Geng et al. 2024] to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.
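Laplacian Pyramid Warping applies the spatial warp band by band rather than to the full-resolution image at once, which limits the aliasing a single warp of high-frequency content would introduce. Below is a minimal sketch of that general recipe using OpenCV; the paper's frequency-aware treatment is more involved, and the function names here are illustrative.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Decompose an image into a Laplacian pyramid (low-res residual last)."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)
        cur = down
    pyr.append(cur)
    return pyr

def warp_pyramid(pyr, map_x, map_y):
    """Warp each frequency band with a per-level rescaled coordinate map.

    map_x, map_y: float32 source-coordinate maps at the finest resolution.
    """
    out = []
    for band in pyr:
        sx = band.shape[1] / map_x.shape[1]   # rescale the map to this level
        sy = band.shape[0] / map_x.shape[0]
        mx = cv2.resize(map_x, (band.shape[1], band.shape[0])) * sx
        my = cv2.resize(map_y, (band.shape[1], band.shape[0])) * sy
        out.append(cv2.remap(band, mx, my, interpolation=cv2.INTER_LINEAR,
                             borderMode=cv2.BORDER_REFLECT))
    return out

def collapse(pyr):
    """Reconstruct the image from the (warped) pyramid."""
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(band.shape[1], band.shape[0])) + band
    return cur
```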

Fri 13 June 7:45 - 8:00 PDT

Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space

Yifan Zhou · Zeqi Xiao · Shuai Yang · Xingang Pan

Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation.
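The equivariance loss can be read as penalizing the mismatch between "shift then process" and "process then shift" for sub-pixel shifts. The sketch below shows a generic version of such a loss using FFT-based fractional shifts; it is a simplified, assumption-laden illustration (for instance, it ignores the stride between pixel space and latent space), not the paper's exact formulation.

```python
import torch

def fractional_shift(x, dx, dy):
    """Circularly shift a (B, C, H, W) tensor by a sub-pixel amount using the
    Fourier shift theorem (ideal sinc interpolation)."""
    B, C, H, W = x.shape
    fy = torch.fft.fftfreq(H, device=x.device).view(1, 1, H, 1)
    fx = torch.fft.fftfreq(W, device=x.device).view(1, 1, 1, W)
    phase = torch.exp(-2j * torch.pi * (fx * dx + fy * dy))
    return torch.fft.ifft2(torch.fft.fft2(x) * phase).real

def shift_equivariance_loss(f, x, max_shift=2.0):
    """Penalize f for not commuting with a random fractional shift.

    f: any image-to-image module that preserves spatial resolution
       (a module with stride s would see its output shift by d/s;
       that rescaling is omitted here for brevity).
    """
    dx, dy = (torch.rand(2) * 2 - 1) * max_shift
    y_shift_first = f(fractional_shift(x, dx.item(), dy.item()))
    y_shift_last = fractional_shift(f(x), dx.item(), dy.item())
    return torch.mean((y_shift_first - y_shift_last) ** 2)
```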

Fri 13 June 8:00 - 8:15 PDT

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Ziqi Pang · Tianyuan Zhang · Fujun Luan · Yunze Man · Hao Tan · Kai Zhang · William Freeman · Yu-Xiong Wang

We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enabling random order is to insert a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation -- RandAR achieves performance comparable to its conventional raster-order counterpart. More importantly, decoder-only transformers trained on random orders acquire new capabilities. To address the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying a 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting, and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios.
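The position-instruction-token design is simple to sketch: for each training image, the sequence interleaves a token naming the next spatial location with the image token at that location, in a freshly randomized order. The token-id conventions below (offsetting position tokens past the image codebook) are illustrative assumptions, not RandAR's exact vocabulary layout.

```python
import torch

def build_randar_sequence(image_tokens, codebook_size, generator=None):
    """Interleave position-instruction tokens with image tokens in random order.

    image_tokens: (N,) discrete token ids for an image in raster order.
    Position tokens are offset past the image codebook so the two vocabularies
    do not collide (an illustrative convention, not RandAR's exact one).
    Returns a (2N,) sequence: [pos_p0, tok_p0, pos_p1, tok_p1, ...].
    """
    N = image_tokens.shape[0]
    order = torch.randperm(N, generator=generator)               # random generation order
    pos_tokens = codebook_size + order                           # one id per spatial slot
    seq = torch.stack([pos_tokens, image_tokens[order]], dim=1)  # (N, 2)
    return seq.reshape(-1)                                       # interleaved (2N,)

# Example: a 16x16 token grid with a 1024-entry codebook.
tokens = torch.randint(0, 1024, (256,))
seq = build_randar_sequence(tokens, codebook_size=1024)
print(seq.shape)  # torch.Size([512])
```

Querying arbitrary subsets of locations through such interleaving is presumably what underlies the zero-shot inpainting, outpainting, and parallel-decoding capabilities described in the abstract.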