Oral Session

Oral Session 4A: Image and Video Synthesis

Sat 14 Jun 11 a.m. PDT — 12:15 p.m. PDT

Sat 14 June 11:00 - 11:15 PDT

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Jingfeng Yao · Bin Yang · Xinggang Wang

Latent diffusion models (LDMs) with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: increasing the per-token feature dimension in visual tokenizers improves reconstruction quality but requires substantially larger diffusion models and extended training time to maintain generation performance. This results in prohibitively high computational costs, making high-dimensional tokenizers impractical. In this paper, we argue that this limitation stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces, and we address it by aligning the latent space with pre-trained vision foundation models. Our VA-VAE (Vision foundation model Aligned Variational AutoEncoder) expands the Pareto frontier of visual tokenizers, enabling 2.7$\times$ faster Diffusion Transformer (DiT) convergence in high-dimensional latent spaces. To further validate our approach, we optimize a DiT baseline, referred to as LightningDiT, achieving superior performance on class-conditional generation with only 6% of the original training epochs. The integrated system demonstrates the effectiveness of VA-VAE, achieving 0.28 rFID and 1.73 gFID on ImageNet-256 generation in 400 epochs, outperforming the original DiT's 0.71 rFID and 2.27 gFID in 1400 epochs, without more complex designs. To our knowledge, this marks the first latent diffusion system to achieve both superior generation and reconstruction without increasing training costs. Our code and weights will be open-sourced.
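For a concrete picture of the alignment idea, below is a minimal PyTorch sketch of how a tokenizer's latents could be tied to a frozen vision foundation model through a learned projection and a cosine-similarity loss. The module name, dimensions, loss form, and weighting are illustrative assumptions, not the authors' exact VA-VAE formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAlignmentLoss(nn.Module):
    """Hypothetical alignment loss: project tokenizer latents into the foundation
    model's feature space and maximize cosine similarity with its frozen features."""

    def __init__(self, latent_dim: int, foundation_dim: int):
        super().__init__()
        self.proj = nn.Linear(latent_dim, foundation_dim)

    def forward(self, latents: torch.Tensor, foundation_feats: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, latent_dim) tokens from the VAE encoder
        # foundation_feats: (B, N, foundation_dim) frozen patch features (e.g. from a DINO-style model)
        z = F.normalize(self.proj(latents), dim=-1)
        f = F.normalize(foundation_feats.detach(), dim=-1)
        return (1.0 - (z * f).sum(dim=-1)).mean()  # 1 - cosine similarity, averaged over tokens

# usage sketch (hypothetical weighting):
# total_loss = recon_loss + kl_loss + 0.1 * LatentAlignmentLoss(32, 768)(vae_tokens, frozen_feats)
```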

Sat 14 June 11:15 - 11:30 PDT

Language-Guided Image Tokenization for Generation

Kaiwen Zha · Lijun Yu · Alireza Fathi · David A. Ross · Cordelia Schmid · Dina Katabi · Xiuye Gu

Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to a conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2\% and 48.1\% on the ImageNet 256$\times$256 and 512$\times$512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3\% and 34.3\% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5$\times$ inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing off-the-shelf text captions in tokenization.
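As a rough illustration of text-conditioned tokenization, the sketch below compresses image patches into a small set of learned latent tokens while letting the encoder attend to caption embeddings from a frozen text encoder. The class name, layer sizes, and the use of a vanilla nn.TransformerEncoder are assumptions for illustration, not TexTok's actual architecture.

```python
import torch
import torch.nn as nn

class TextConditionedTokenizer(nn.Module):
    """Toy sketch of language-guided tokenization: learned latent tokens attend to
    image patches and caption embeddings, and only the latents are kept as the code."""

    def __init__(self, patch_dim: int = 768, text_dim: int = 768,
                 num_latents: int = 32, depth: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, patch_dim) * 0.02)
        self.text_proj = nn.Linear(text_dim, patch_dim)
        layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # patches: (B, P, patch_dim) image patch embeddings
        # text_emb: (B, T, text_dim) caption embeddings from a frozen text encoder
        B = patches.size(0)
        lat = self.latents.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([lat, patches, self.text_proj(text_emb)], dim=1)
        x = self.encoder(x)
        return x[:, : self.latents.size(0)]  # compressed image code: only the latent tokens
```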

Sat 14 June 11:30 - 11:45 PDT

DreamRelation: Bridging Customization and Relation Generation

Qingyu Shi · Lu Qi · Jianzong Wu · Jinbin Bai · Jingbo Wang · Yunhai Tong · Xiangtai Li

Customized image generation is essential for delivering personalized content based on user-provided prompts, enabling large-scale text-to-image diffusion models to better align with individual needs. However, existing models often neglect the relationships between customized objects in generated images. This work addresses that gap by focusing on relation-aware customized image generation, which seeks to preserve the identities from image prompts while maintaining the predicate relations specified in text prompts. Specifically, we introduce DreamRelation, a framework that disentangles identity and relation learning using a carefully curated dataset. Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation. We then propose two key modules to tackle the two main challenges: generating accurate and natural relations, especially when significant pose adjustments are required, and avoiding object confusion in cases of overlap. First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships. Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases. Extensive results on our proposed benchmarks demonstrate the superiority of DreamRelation in generating precise relations while preserving object identities across a diverse set of objects and relations. The source code and trained models will be made publicly available.
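The keypoint matching loss is described only at a high level; as one plausible, hypothetical reading, the sketch below converts per-keypoint attention maps into 2-D locations with a soft-argmax and penalizes their distance to target keypoints. The function signature and the soft-argmax formulation are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def keypoint_matching_loss(attn_maps: torch.Tensor, target_kpts: torch.Tensor) -> torch.Tensor:
    """Illustrative keypoint-matching loss (assumed form, not DreamRelation's exact loss).
    attn_maps: (B, K, H, W) unnormalized per-keypoint maps; target_kpts: (B, K, 2) in [0, 1]."""
    B, K, H, W = attn_maps.shape
    probs = attn_maps.flatten(2).softmax(dim=-1).view(B, K, H, W)
    ys = torch.linspace(0, 1, H, device=attn_maps.device)
    xs = torch.linspace(0, 1, W, device=attn_maps.device)
    pred_y = (probs.sum(dim=3) * ys).sum(dim=2)   # expected row coordinate, (B, K)
    pred_x = (probs.sum(dim=2) * xs).sum(dim=2)   # expected column coordinate, (B, K)
    pred = torch.stack([pred_x, pred_y], dim=-1)  # (B, K, 2) soft-argmax keypoints
    return F.mse_loss(pred, target_kpts)
```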

Sat 14 June 11:45 - 12:00 PDT

Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Jian Han · Jinlai Liu · Yi Jiang · Bin Yan · Yuqi Zhang · Zehuan Yuan · Bingyue Peng · Xiaobing Liu

We present Infinity, a bitwise visual autoregressive model capable of generating high-resolution, photorealistic images following language instructions. Infinity refactors visual autoregressive modeling under a bitwise token-prediction framework with an infinite-vocabulary classifier and a bitwise self-correction mechanism. By theoretically expanding the tokenizer vocabulary toward infinity within the Transformer, our method unlocks far stronger scaling capabilities than vanilla VAR. Extensive experiments indicate that Infinity outperforms autoregressive text-to-image models by large margins and matches or surpasses leading diffusion models. Without extra optimization, Infinity generates a 1024$\times$1024 image in 0.8s, 2.6$\times$ faster than SD3-Medium, making it the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation.
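To make the "infinite-vocabulary classifier" idea concrete: instead of a softmax over an exponentially large codebook, each token's code can be predicted bit by bit with independent binary logits, so the head's cost grows linearly in the number of bits while the effective vocabulary grows as 2^d. The sketch below is a minimal illustration under that reading; the class name, dimensions, and loss form are placeholders, not Infinity's exact head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitwiseHead(nn.Module):
    """Illustrative bitwise prediction head: d independent binary logits per token
    stand in for a softmax over a 2**d-entry vocabulary."""

    def __init__(self, hidden_dim: int = 1024, num_bits: int = 32):
        super().__init__()
        self.to_bits = nn.Linear(hidden_dim, num_bits)

    def forward(self, h: torch.Tensor, target_bits: torch.Tensor) -> torch.Tensor:
        # h: (B, N, hidden_dim) transformer outputs; target_bits: (B, N, num_bits) in {0, 1}
        logits = self.to_bits(h)
        return F.binary_cross_entropy_with_logits(logits, target_bits.float())
```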

Sat 14 June 12:00 - 12:15 PDT

Autoregressive Distillation of Diffusion Transformers

Yeongmin Kim · Sotiris Anagnostidis · Yuming Du · Edgar Schoenfeld · Jonas Kohler · Markos Georgopoulos · Albert Pumarola · Ali Thabet · Artsiom Sanakoyeu

Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability to high resolutions. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions of probability-flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised sample as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding a token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in the lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD on class-conditional generation on ImageNet and on text-to-image (T2I) synthesis. Our model achieves a $5\times$ reduction in FID degradation compared to baseline methods while requiring only 1.1\% extra FLOPs on ImageNet-256. Moreover, ARD reaches an FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms publicly available 1024p text-to-image distilled models in prompt-adherence score with a minimal drop in FID compared to the teacher.
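As an illustration of block-wise causal attention over the trajectory history, the helper below builds a mask in which tokens belonging to ODE-trajectory step t may attend to all tokens of steps ≤ t but not to later steps, using the True-means-blocked convention of nn.MultiheadAttention's attn_mask. Step and token counts are placeholders, not ARD's configuration.

```python
import torch

def block_causal_mask(num_steps: int, tokens_per_step: int) -> torch.Tensor:
    """Illustrative block-wise causal mask over trajectory steps.
    Returns a (S, S) boolean mask where True marks positions that are NOT allowed
    to attend (the convention expected by nn.MultiheadAttention's attn_mask)."""
    step_idx = torch.arange(num_steps).repeat_interleave(tokens_per_step)  # step id per token, (S,)
    # query at row i is blocked from key at column j whenever key's step is later than query's step
    return step_idx.unsqueeze(0) > step_idx.unsqueeze(1)

# e.g. a mask for 4 trajectory steps of 256 tokens each:
# mask = block_causal_mask(4, 256)  # shape (1024, 1024)
```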