Poster Session 2
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen · Yunhao Gou · Runhui Huang · Zhili Liu · Daxin Tan · Jing Xu · Chunwei Wang · Yi Zhu · yihan zeng · Kuo Yang · Dingdong WANG · Kun Xiang · Haoyuan Li · Haoli Bai · Jianhua Han · Xiao-Hui Li · Weike Jin · Nian Xie · Yu Zhang · James Kwok · Hengshuang Zhao · Xiaodan Liang · Dit-Yan Yeung · Xiao Chen · Zhenguo Li · Wei Zhang · Qun Liu · Lanqing Hong · Lu Hou · Hang Xu
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited or entirely absent vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech abilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible speech style control, including emotions and pitches. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.
Let's Chorus: Partner-aware Hybrid Song-Driven 3D Head Animation
Xiumei Xie · Zikai Huang · Wenhao Xu · Peng Xiao · Xuemiao Xu · Huaidong Zhang
Singing is a vital form of human emotional expression and social interaction, distinguished from speech by its richer emotional nuances and freer expressive style. Investigating 3D facial animation driven by singing therefore holds significant research value. Our work focuses on 3D singing facial animation driven by mixed singing audio, and to the best of our knowledge, no prior studies have explored this area. The absence of existing 3D singing datasets poses an additional challenge. To address this, we collect a novel audiovisual dataset, ChorusHead, which features synchronized mixed vocal audio and pseudo-3D FLAME motions for chorus singing. In addition, we propose a partner-aware 3D chorus head generation framework driven by mixed audio inputs. The framework extracts emotional features from the background music, models dependencies between singers, and represents head motion in a latent space learned by a variational autoencoder (VAE), enabling diverse interactive head animation generation. Extensive experimental results demonstrate that our approach effectively generates 3D facial animations of interacting singers, achieving notable improvements in realism and strong robustness to background music interference.
KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
Antoni Bigata Casademunt · Michał Stypułkowski · Rodrigo Mira · Stella Bounareli · Konstantinos Vougioukas · Zoe Landgraf · Nikita Drobyshev · Maciej Zieba · Stavros Petridis · Maja Pantic
Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external spatial control, increasing long-term consistency but compromising the naturalness of motion. We propose KeyFace, a novel two-stage diffusion-based framework, to address these issues. In the first stage, keyframes are generated at a low frame rate, conditioned on audio input and an identity frame, to capture essential facial expressions and movements over extended periods of time. In the second stage, an interpolation model fills in the gaps between keyframes, ensuring smooth transitions and temporal coherence. To further enhance realism, we incorporate continuous emotion representations and handle a wide range of non-speech vocalizations (NSVs), such as laughter and sighs. We also introduce two new evaluation metrics for assessing lip synchronization and NSV generation. Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations, successfully encompassing NSVs and continuous emotions.
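A minimal sketch of the two-stage keyframe-then-interpolation idea described above, with `keyframe_model` and `interp_model` as hypothetical stand-ins for the two diffusion stages (their interfaces are assumptions, not the paper's API):

```python
import numpy as np

def animate_long_sequence(audio_features, identity_frame,
                          keyframe_model, interp_model,
                          fps_full=25, fps_key=5):
    """Two-stage animation: sparse keyframes first, then gap interpolation."""
    n_frames = len(audio_features)                 # one feature vector per output frame
    stride = fps_full // fps_key                   # e.g. every 5th frame is a keyframe
    key_idx = np.unique(np.append(np.arange(0, n_frames, stride), n_frames - 1))

    # Stage 1: keyframes at a low frame rate, conditioned on audio + identity frame.
    keyframes = keyframe_model(audio_features[key_idx], identity_frame)

    # Stage 2: fill each gap between consecutive keyframes with the interpolation model.
    frames = [None] * n_frames
    for a, b, ka, kb in zip(key_idx[:-1], key_idx[1:], keyframes[:-1], keyframes[1:]):
        segment = interp_model(ka, kb, audio_features[a:b + 1])
        frames[a:b + 1] = list(segment)
    return frames
```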
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
Rang Meng · Xingyu Zhang · Yuming Li · Chenguang Ma
Recent work on human animation usually involves audio, pose, or movement-map conditions, thereby achieving vivid animation quality. However, these methods often face practical challenges due to extra control conditions, cumbersome condition-injection modules, or being limited to driving the head region. Hence, we ask whether it is possible to achieve striking half-body human animation while simplifying unnecessary conditions. To this end, we propose a half-body human animation method, dubbed EchoMimicV2, that leverages a novel Audio-Pose Dynamic Harmonization strategy, including Pose Sampling and Audio Diffusion, to enhance half-body details and facial and gestural expressiveness while reducing condition redundancy. To compensate for the scarcity of half-body data, we utilize Head Partial Attention to seamlessly accommodate headshot data into our training framework; it can be omitted during inference, providing a free lunch for animation. Furthermore, we design a Phase-specific Denoising Loss to guide motion, detail, and low-level quality in their respective phases. We also present a novel benchmark for evaluating the effectiveness of half-body human animation. Extensive experiments and analyses demonstrate that EchoMimicV2 surpasses existing methods in both quantitative and qualitative evaluations. The source code and test dataset will be open-sourced.
X-Dyna: Expressive Dynamic Human Image Animation
Di Chang · Hongyi Xu · You Xie · Yipeng Gao · Zhengfei Kuang · Shengqu Cai · Chenxu Zhang · Guoxian Song · Chao Wang · Yichun Shi · Zeyuan Chen · Shijie Zhou · Linjie Luo · Gordon Wetzstein · Mohammad Soleymani
We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, which generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key factors underlying the loss of dynamic details, enhancing the lifelike quality of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules to synthesize fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations.
Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset
Yiqun Mei · Mingming He · Li Ma · Julien Philip · Wenqi Xian · David M George · Xueming Yu · Gabriel Dedic · Ahmet Levent Taşel · Ning Yu · Vishal M. Patel · Paul Debevec
Video portrait relighting remains challenging because the results need to be both photorealistic and temporally stable. This typically requires a strong model design that can capture complex facial reflections as well as intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT) data. In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. On the model side, we design a new conditional video diffusion model built upon a state-of-the-art pre-trained video diffusion model, alongside a new lighting injection mechanism to enable precise control. This way we leverage strong spatial and temporal generative capability to generate plausible solutions to the ill-posed relighting problem. Our technique uses a hybrid dataset consisting of static-expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling. This avoids the need to acquire paired video data under different lighting conditions. Our extensive experiments show that our model produces state-of-the-art results in terms of both photorealism and temporal consistency.
Monocular and Generalizable Gaussian Talking Head Animation
Shengjie Gong · Haojie Li · Jiapeng Tang · Dongming Hu · Shuangping Huang · Hao Chen · Tianshui Chen · Zhuoman Liu
In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires only monocular datasets and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address these challenges, MGGTalk explores depth information and facial symmetry characteristics to supplement both geometric and appearance features. Initially, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering techniques to ensure complete and precise position parameters for 3DGS. Subsequently, we adopt a two-stage strategy with symmetric priors for predicting the remaining 3DGS parameters. We begin by predicting Gaussian parameters for the visible facial regions of the source image. These parameters are subsequently utilized to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.
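As a rough illustration of the symmetry operation mentioned above, the sketch below mirrors an aligned facial point cloud across an assumed sagittal plane at x = 0; the alignment, tolerance, and duplicate filtering are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def symmetric_completion(points, tol=1e-3):
    """Mirror a roughly aligned facial point cloud across its sagittal plane.

    Assumes the face has been rigidly aligned so that the symmetry plane is x = 0.
    """
    mirrored = points * np.array([-1.0, 1.0, 1.0])   # flip the x coordinate
    merged = np.concatenate([points, mirrored], axis=0)
    # Simple duplicate filtering via voxel hashing at resolution `tol`.
    keys = np.round(merged / tol).astype(np.int64)
    _, unique_idx = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(unique_idx)]
```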
FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video
Jiawei Zhang · Zijian Wu · Zhiyang Liang · Yicheng Gong · Dongfang Hu · Yao Yao · Xun Cao · Hao Zhu
Reconstructing high-fidelity, animatable 3D head avatars from effortlessly captured monocular videos is a pivotal yet formidable challenge. Although significant progress has been made in rendering performance and manipulation capabilities, notable challenges remain, including incomplete reconstruction and inefficient Gaussian representation. To address these challenges, we introduce FATE — a novel method for reconstructing an editable full-head avatar from a single monocular video. FATE integrates a sampling-based densification strategy to ensure optimal positional distribution of points, improving rendering efficiency. A neural baking technique is introduced to convert discrete Gaussian representations into continuous attribute maps, facilitating intuitive appearance editing. Furthermore, we propose a universal completion framework to recover non-frontal appearance, culminating in a 360$^\circ$-renderable 3D head avatar. FATE outperforms previous approaches in both qualitative and quantitative evaluations, achieving state-of-the-art performance. To the best of our knowledge, FATE is the first animatable and 360$^\circ$ full-head monocular reconstruction method for a 3D head avatar. The code will be publicly released upon publication.
CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
Felix Taubner · Ruihang Zhang · Mathieu Tuli · David B. Lindell
Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints — for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but their visual fidelity still lags behind that of multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.
GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion
Jiapeng Tang · Davide Davoli · Tobias Kirschstein · Liam Schoneveld · Matthias Nießner
We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices like smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging due to limited observations, which leaves unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model, leveraging its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from FLAME-based head reconstruction, which provides pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms the previous state-of-the-art methods in novel view synthesis by a 5.34\% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.
Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
Chen Guo · Junxuan Li · Yash Kant · Yaser Sheikh · Shunsuke Saito · Chen Cao
We present Vid2Avatar-Pro, a method to create photorealistic and animatable 3D human avatars from monocular in-the-wild videos. Building a high-quality avatar that supports animation with diverse poses from a monocular video is challenging because the observation of pose diversity and viewpoints is inherently limited. The lack of pose variations typically leads to poor generalization to novel poses, and avatars can easily overfit to limited input viewpoints, producing artifacts and distortions from other views. In this work, we address these limitations by leveraging a universal prior model (UPM) learned from a large corpus of multi-view clothed human performance capture data. We build our representation on top of expressive 3D Gaussians with canonical front and back maps shared across identities. Once the UPM is learned to accurately reproduce the large-scale multi-view human images, we fine-tune the model with an in-the-wild video via inverse rendering to obtain a personalized photorealistic human avatar that can be faithfully animated to novel human motions and rendered from novel views. The experiments show that our approach based on the learned universal prior sets a new state-of-the-art in monocular avatar reconstruction by substantially outperforming existing approaches relying only on heuristic regularization or a shape prior of minimally clothed bodies (e.g., SMPL) on publicly available datasets.
SinGS: Animatable Single-Image Human Gaussian Splats with Kinematic Priors
Yufan Wu · Xuanhong Chen · Wen Li · Shunran Jia · Hualiang Wei · Kairui Feng · Jialiang CHEN · Yuhan Li · Ang He · Weimin Zhang · Bingbing Ni · Wenjun Zhang
Despite significant advances in accurately estimating geometry in contemporary single-image 3D human reconstruction, creating a high-quality, efficient, and animatable 3D avatar remains an open challenge. Two key obstacles persist: incomplete observation and inconsistent 3D priors. To address these challenges, we propose SinGS, aiming to achieve high-quality and efficient animatable 3D avatar reconstruction. At the heart of SinGS are two key components: Kinematic Human Diffusion (KHD) and compact 3D distillation. The former is a foundational human model that samples within pose space to generate a highly 3D-consistent and high-quality sequence of human images, inferring unseen viewpoints and providing kinematic priors. The latter is a system that reconstructs a compact, high-quality 3D avatar even under imperfect priors, achieved through a novel regional Laplacian that enables precise and compact assembly of 3D primitives. Extensive experiments demonstrate that SinGS enables lifelike, animatable human reconstructions, maintaining both high quality and inference efficiency.
EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting
Suzhen Wang · Weijie Chen · Wei Zhang · Minda Zhao · Lincheng Li · Rongsheng Zhang · Zhipeng Hu · Xin Yu
Character customization, or ``face crafting,'' is a vital feature in role-playing games (RPGs) that enhances player engagement by allowing the creation of personalized avatars. Existing automated methods often face challenges in maintaining style consistency and adaptability across different game engines due to their reliance on specific image domains. We introduce EasyCraft, an innovative end-to-end feedforward framework that automates character crafting by integrating both text and image inputs. Our approach utilizes a translator capable of converting facial images of any style into crafting parameters. To develop this translator, we first establish a unified feature distribution in the translator's image encoder through self-supervised learning on a large-scale dataset. Subsequently, we learn the mapping from the feature distribution to the crafting parameters specific to a game engine. By integrating text-to-image techniques with our translator, EasyCraft also supports precise, text-based character crafting. EasyCraft's integration of diverse inputs significantly enhances the versatility and accuracy of avatar creation. Extensive experiments on two RPG games demonstrate the effectiveness of our method, achieving state-of-the-art results and facilitating adaptability across various avatar engines.
RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos
Yuxin Yao · Zhi Deng · Junhui Hou
This paper considers the problem of modeling articulated objects captured in 2D videos to enable novel view synthesis, while also being easily editable, drivable, and reposable. To tackle this challenging problem, we propose RigGS, a new paradigm that leverages 3D Gaussian representation and skeleton-based motion representation to model dynamic objects without utilizing additional template priors. Specifically, we first propose skeleton-aware node-controlled deformation, which deforms a canonical 3D Gaussian representation over time to initialize the modeling process, producing candidate skeleton nodes that are further simplified into a sparse 3D skeleton according to their motion and semantic information. Subsequently, based on the resulting skeleton, we design learnable skin deformations and pose-dependent detailed deformations, thereby easily deforming the 3D Gaussian representation to generate new actions and further render high-quality images from novel views. Extensive experiments demonstrate that our method can easily generate realistic new actions for objects and achieve high-quality rendering. The source code will be publicly available.
Learning Person-Specific Animatable Face Models from In-the-Wild Images via a Shared Base Model
Yuxiang Mao · Zhenfeng Fan · Zhijie Zhang · Zhiheng Zhang · Shihong Xia
Training a generic 3D face reconstruction model in a self-supervised manner using large-scale, in-the-wild 2D face image datasets enhances robustness to varying lighting conditions and occlusions while allowing the model to capture animatable wrinkle details across diverse facial expressions. However, a generic model often fails to adequately represent the unique characteristics of specific individuals. In this paper, we propose a method to train a generic base model and then transfer it to yield person-specific models by integrating lightweight adapters within the large-parameter ViT-MAE base model. These person-specific models excel at capturing individual facial shapes and detailed features while preserving the robustness and prior knowledge of detail variations from the base model. During training, we introduce a silhouette vertex re-projection loss to address boundary "landmark marching" issues on the 3D face caused by pose variations. Additionally, we employ an innovative teacher-student loss to leverage the inherent strengths of UNet in feature boundary localization for training our detail MAE. Quantitative and qualitative experiments demonstrate that our approach achieves state-of-the-art performance in face alignment, detail accuracy, and richness. The code will be released to the public upon the acceptance of this paper.
ControlFace: Harnessing Facial Parametric Control for Face Rigging
Wooseok Jang · Youngjun Hong · Geonho Cha · Seungryong Kim
Manipulating facial images to meet specific controls such as pose, expression, and lighting, a task also referred to as face rigging, is a complex problem in computer vision. Existing methods are limited by their reliance on image datasets, which necessitates individual-specific fine-tuning and limits their ability to retain fine-grained identity and semantic details, reducing practical usability. To overcome these limitations, we introduce ControlFace, a novel face rigging method conditioned on 3DMM renderings that enables flexible, high-fidelity control. ControlFace employs dual-branch U-Nets: one, referred to as FaceNet, captures identity and fine details, while the other focuses on generation. To enhance control precision, a control mixer module encodes the correlated features between the target-aligned control and the reference-aligned control, and a novel guidance method, reference control guidance, steers the generation process for better control adherence. By training on a facial video dataset, we fully utilize FaceNet’s rich representations while ensuring control adherence. Extensive experiments demonstrate ControlFace’s superior performance in identity preservation and control precision, highlighting its practicality. Code and pre-trained weights will be publicly available.
HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion
Yifang Xu · BenXiang Zhai · Yunzhuo Sun · Ming Li · Yang Li · Sidan Du
Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses state-of-the-art approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works. Our code is available at -.
DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image
Hyeongjin Nam · Donghwan Kim · Jeongtaek Oh · Kyoung Mu Lee
Most existing methods for 3D clothed human reconstruction from a single image treat the clothed human as a single object without separating cloth from body. In this work, we present DeClotH, a novel framework that separately reconstructs the 3D cloth and human body from a single image. This task has not yet been explored because of the extreme occlusion between cloth and the human body, which makes it complicated to infer the overall geometries and textures. Furthermore, while recent 3D human reconstruction methods show impressive results by leveraging text-to-image diffusion models, naively adopting such a strategy for this problem often provides incorrect guidance, especially in reconstructing 3D cloth. To address these challenges, we propose two core designs in our framework. First, to alleviate the occlusion issue, we leverage 3D template models of both the cloth and the human body as regularizations, which provide strong priors and prevent erroneous reconstruction caused by occlusion. Second, we leverage a cloth diffusion model specifically devised to provide contextual information about cloth appearance for reconstructing 3D cloth. Qualitative and quantitative experiments demonstrate that our proposed approaches are highly effective in reconstructing 3D cloth and the human body. The codes will be released.
Disentangled Pose and Appearance Guidance for Multi-Pose Generation
Tengfei Xiao · Yue Wu · Yuelong Li · Can Qin · Maoguo Gong · Qiguang Miao · Wenping Ma
Human pose generation is a complex task due to the non-rigid and highly variable nature of human body structures and appearances. Existing methods overlook the fundamental differences between the spatial transformation of poses and texture generation for appearance, making them prone to overfitting. To this end, we propose a multi-pose generation framework driven by disentangled pose and appearance guidance. Our approach comprises a Global-aware Pose Generation module that iteratively generates pose embeddings, enabling effective control over non-rigid body deformations. Additionally, we introduce the Global-aware Transformer Decoder, which leverages similarity queries and attention mechanisms to achieve spatial transformations and enhance pose consistency through a Global-aware block. In the appearance generation phase, we condition a diffusion model on the pose embeddings produced in the initial stage and introduce an appearance adapter that extracts high-level contextual semantic information from multi-scale features, enabling further refinement of pose appearance textures and providing appearance guidance. Extensive experiments on the UBC Fashion and TikTok datasets demonstrate that our framework achieves state-of-the-art results in both quality and fidelity, establishing it as a powerful approach for complex pose generation tasks.
Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction
Yuanbo Wang · Zhaoxuan Zhang · Jiajin Qiu · Dilong Sun · Zhengyu Meng · Xiaopeng Wei · Xin Yang
Diffusion models have made breakthroughs in 3D generation tasks in recent years. However, current 3D diffusion models focus on reconstructing the target shape from images or a set of partial observations. While excelling at global context understanding, they struggle to capture the local details of complex shapes and are limited by occlusion and lighting conditions. To overcome these limitations, we utilize tactile images to capture local 3D shape information and propose a Touch2Shape model, which leverages a touch-conditioned diffusion model to explore and reconstruct the target shape from touch. For shape reconstruction, we develop a touch embedding module that conditions the diffusion model to create a compact representation and a touch shape fusion module to refine the reconstructed shape. For shape exploration, we combine the diffusion model with reinforcement learning to train a shape exploration policy. This involves using the latent vector generated by the diffusion model to guide touch exploration policy training through a novel reward design. Experiments validate that our method significantly improves reconstruction quality through both qualitative and quantitative analysis, and our touch exploration policy further boosts reconstruction performance.
MangaNinja: Line Art Colorization with Precise Reference Following
Zhiheng Liu · Ka Leong Cheng · Xi Chen · Jie Xiao · Hao Ouyang · Kai Zhu · Yu Liu · Yujun Shen · Qifeng Chen · Ping Luo
Derived from diffusion models, MangaNinja specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases (e.g., extreme poses and shadows), cross-character colorization, multi-reference harmonization, etc., beyond the reach of existing algorithms.
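A minimal sketch of what a patch-shuffling step could look like, assuming non-overlapping square patches of a reference image are randomly permuted so the model must rely on learned correspondences rather than global layout; the patch size and exact scheme are assumptions, not the paper's implementation.

```python
import numpy as np

def shuffle_patches(image, patch_size=32, rng=None):
    """Randomly permute non-overlapping patches of an (H, W, C) reference image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    img = image[:gh * patch_size, :gw * patch_size]            # crop to a full grid
    patches = img.reshape(gh, patch_size, gw, patch_size, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch_size, patch_size, c)
    patches = patches[rng.permutation(gh * gw)]                # shuffle patch order
    out = patches.reshape(gh, gw, patch_size, patch_size, c).transpose(0, 2, 1, 3, 4)
    return out.reshape(gh * patch_size, gw * patch_size, c)
```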
HVI: A New Color Space for Low-light Image Enhancement
Qingsen Yan · Yixu Feng · Cheng Zhang · Guansong Pang · Kangbiao Shi · Peng Wu · Wei Dong · Jinqiu Sun · Yanning Zhang
Low-Light Image Enhancement (LLIE) is a crucial computer vision task that aims to restore detailed visual information from corrupted low-light images. Many existing LLIE methods are based on the standard RGB (sRGB) space, which often produces color bias and brightness artifacts due to the inherently high color sensitivity of sRGB. While converting images to the Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. The former enforces small distances for red coordinates to remove the red artifacts, while the latter compresses the low-light regions to remove the black artifacts. To fully leverage the chromatic and intensity information, a novel Color and Intensity Decoupling Network (CIDNet) is further introduced to learn an accurate photometric mapping function under different lighting conditions in the HVI space. Comprehensive results from benchmark and ablation experiments show that the proposed HVI color space with CIDNet outperforms state-of-the-art methods on 10 datasets.
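For intuition, the sketch below maps RGB pixels to a polar horizontal/vertical/intensity form in which dark pixels collapse toward the origin of the color plane; it is a simplified illustration of the polarized HS idea only and omits the paper's learnable intensity component, so it should not be read as the exact HVI definition.

```python
import numpy as np
import colorsys

def polarized_hs_map(rgb):
    """Map an (H, W, 3) RGB image in [0, 1] to a polar (horizontal, vertical, intensity) form."""
    h, w, _ = rgb.shape
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in rgb.reshape(-1, 3)])
    hue, sat, val = hsv[:, 0], hsv[:, 1], hsv[:, 2]
    radius = sat * val                              # shrink the radius in dark regions
    horizontal = radius * np.cos(2 * np.pi * hue)   # hue/saturation as Cartesian coordinates
    vertical = radius * np.sin(2 * np.pi * hue)
    return np.stack([horizontal, vertical, val], axis=-1).reshape(h, w, 3)
```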
Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation
Tianfu Wang · Mingyang Xie · Haoming Cai · Sachin Shah · Christopher Metzler
Transparent surfaces, such as glass, create complex reflections that obscure images, posing challenges for various computer vision applications. We present Flash-Split, a robust, two-stage flash-based approach for separating transmitted and reflected light using a single pair of flash/no-flash images, even if they are misaligned. Stage 1 performs separation in latent space via a dual-branch diffusion model, reducing the need for alignment by encoding both transmission and reflection with cross-attention. Stage 2 enhances image sharpness through conditional decoding, blending separated low-frequency and original high-frequency information. Flash-Split achieves state-of-the-art results in single-view reflection separation, validated on synthetic and real-world data, and outperforms existing methods in challenging scenarios.
Noise Modeling in One Hour: Minimizing Preparation Efforts for Self-supervised Low-Light RAW Image Denoising
Feiran Li · Haiyang Jiang · Daisuke Iso
Noise synthesis is a promising solution for addressing the data shortage problem in data-driven low-light RAW image denoising. However, accurate noise synthesis methods often necessitate labor-intensive calibration and profiling procedures during preparation, preventing them from being adopted in practice at scale. This work introduces a practically simple noise synthesis pipeline based on detailed analyses of noise properties and extensive justification of widespread techniques. Compared to other approaches, our proposed pipeline eliminates the cumbersome system gain calibration and signal-independent noise profiling steps, reducing the preparation time for noise synthesis from days to hours. Meanwhile, our method exhibits strong denoising performance, showing up to a 0.54 dB PSNR improvement over the current state-of-the-art noise synthesis technique.
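For context, the sketch below shows the conventional Poisson-Gaussian (shot plus read) noise synthesis model that such pipelines typically start from; the parameter values are placeholders, and the paper's contribution is precisely to avoid the per-camera calibration this kind of model normally requires.

```python
import numpy as np

def synthesize_lowlight_raw(clean_raw, gain=4.0, read_std=2.0, ratio=100,
                            black_level=64, white_level=1023, rng=None):
    """Synthesize a noisy low-light RAW frame from a clean one (conventional baseline model)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = (clean_raw - black_level) / ratio              # darken by the exposure ratio
    photons = np.clip(signal, 0, None) / gain               # convert digital numbers to photon counts
    shot = rng.poisson(photons) * gain                      # shot noise (signal-dependent)
    read = rng.normal(0.0, read_std, size=clean_raw.shape)  # read noise (signal-independent)
    noisy = shot + read + black_level
    return np.clip(noisy, 0, white_level)
```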
Quad-Pixel Image Defocus Deblurring: A New Benchmark and Model
Hang Chen · Yin Xie · Xiaoxiu Peng · Lihu Sun · Wenkai Su · Xiaodong Yang · Chengming Liu
Defocus deblurring is a challenging task due to the spatially varying blur. Recent works have shown impressive results in data-driven approaches using dual-pixel (DP) sensors. Quad-pixel (QP) sensors represent an advanced evolution of DP sensors, providing four distinct sub-aperture views in contrast to only two views offered by DP sensors. However, research on QP-based defocus deblurring is scarce. In this paper, we propose a novel end-to-end learning-based approach for defocus deblurring that leverages QP data. To achieve this, we design a QP defocus and all-in-focus image pair acquisition method and provide a QP Defocus Deblurring (QPDD) dataset containing 4,935 image pairs. We then introduce a Local-gate assisted Mamba Network (LMNet), which includes a two-branch encoder and a Simple Fusion Module (SFM) to fully utilize features of sub-aperture views. In particular, our LMNet incorporates a Local-gate assisted Mamba Block (LAMB) that mitigates local pixel forgetting and channel redundancy within Mamba, and effectively captures global and local dependencies. By extending the defocus deblurring task from a DP-based to a QP-based approach, we demonstrate significant improvements in restoring sharp images. Comprehensive experimental evaluations further indicate that our approach outperforms state-of-the-art methods.
ScribbleLight: Single Image Indoor Relighting with Scribbles
Jun Myeong Choi · Annie N. Wang · Pieter Peers · Anand Bhattad · Roni Sengupta
Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging due to complex illumination interactions between multiple lights and cluttered objects featuring a large variety in geometrical and material complexity. Recently, generative models have been successfully applied to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelty is an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight's ability to create different lighting effects (e.g., turning lights on/off, adding highlights, casting shadows, or adding indirect lighting from unseen lights) from sparse scribble annotations.
Hearing Anywhere in Any Environment
Xiulong Liu · Anurag Kumar · Paul Calamia · Sebastia Vicenc Amengual Gari · Calvin Murdock · Ishwarya Ananthabhotla · Philip W Robinson · Eli Shlizerman · Vamsi Krishna Ithapu · Ruohan Gao
In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geometries and surface materials. We aim to develop a unified model capable of reconstructing the spatial acoustic experience of any environment with minimal additional measurements. To this end, we present xRIR, a framework for cross-room RIR prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with a RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. To evaluate our method, we introduce AcousticRooms, a new dataset featuring high-fidelity simulation of over 300,000 RIRs from 260 rooms. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.
EnvGS: Modeling View-Dependent Appearance with Environment Gaussian
Tao Xie · Xi Chen · Zhen Xu · Yiman Xie · Yudong Jin · Yujun Shen · Sida Peng · Hujun Bao · Xiaowei Zhou
Reconstructing complex reflections in real-world scenes from 2D images is essential for achieving photorealistic novel view synthesis. Existing methods that utilize environment maps to model reflections from distant lighting often struggle with high-frequency reflection details and fail to account for near-field reflections. In this work, we introduce EnvGS, a novel environment representation that employs a set of Gaussian primitives as an explicit 3D model for capturing reflections. Our approach models all reflections, regardless of their distance, with a unified set of Gaussian primitives, effectively representing high-frequency reflections from both near and distant light sources. To efficiently render these environment Gaussians, we developed a ray-tracing-based renderer that leverages the GPU's RT cores for fast rendering. This allows us to jointly optimize our model for high-quality reconstruction while maintaining real-time rendering speeds. Results from multiple real-world and synthetic datasets demonstrate that our method produces significantly more detailed reflections, achieving the best rendering quality in real-time novel view synthesis. The code will be released upon acceptance.
Geometry Field Splatting with Gaussian Surfels
Kaiwen Jiang · Venkataram Sivaram · Cheng Peng · Ravi Ramamoorthi
Geometric reconstruction of opaque surfaces from images is a longstanding challenge in computer vision, with renewed interest from volumetric view synthesis algorithms using radiance fields. We leverage the geometry field proposed in recent work for stochastic opaque surfaces, which can then be converted to volume densities. We adapt Gaussian kernels or surfels to splat the geometry field rather than the volume, enabling precise reconstruction of opaque solids. Our first contribution is to derive an efficient and almost exact differentiable rendering algorithm for geometry fields parameterized by Gaussian surfels, while removing current approximations that involve Taylor series and neglect self-attenuation. Next, we address the discontinuous loss landscape when surfels cluster near geometry, showing how to guarantee that the rendered color is a continuous function of the colors of the kernels, irrespective of ordering. Finally, we use latent representations with spherical harmonics encoded reflection vectors rather than spherical harmonics encoded colors to better address specular surfaces. We demonstrate significant improvement in the quality of reconstructed 3D surfaces on widely-used datasets.
Locally Orderless Images for Optimization in Differentiable Rendering
Ishit Mehta · Manmohan Chandraker · Ravi Ramamoorthi
Problems in differentiable rendering often involve optimizing scene parameters that cause motion in image space. The gradients for such parameters tend to be sparse, leading to poor convergence. While existing methods address this sparsity through proxy gradients such as topological derivatives or Lagrangian derivatives, they make simplifying assumptions about rendering. Multi-resolution image pyramids offer an alternative approach but prove unreliable in practice. We introduce a method that uses locally orderless images --- where each pixel maps to a histogram of intensities that preserves local variations in appearance. Using an inverse rendering objective that minimizes histogram distance, our method extends support for sparsely defined image gradients and recovers optimal parameters. We validate our method on various inverse problems using both synthetic and real data.
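A minimal sketch of the locally orderless idea under simple assumptions (soft intensity binning, box-shaped spatial pooling, an L1 histogram distance); the bin count, bandwidth, and window size here are illustrative choices, not the paper's settings.

```python
import numpy as np

def soft_local_histograms(image, bins=16, sigma=0.05, window=7):
    """Replace each pixel of a grayscale image with a soft histogram of its neighborhood."""
    centers = (np.arange(bins) + 0.5) / bins
    # Soft assignment of every pixel to every intensity bin.
    weights = np.exp(-0.5 * ((image[..., None] - centers) / sigma) ** 2)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Box-filter the per-bin maps to pool over a local spatial window.
    pad = window // 2
    padded = np.pad(weights, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    hist = np.zeros_like(weights)
    for dy in range(window):
        for dx in range(window):
            hist += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return hist / (window * window)

def histogram_distance(h1, h2):
    """L1 distance between locally orderless images, usable as an inverse-rendering objective."""
    return np.abs(h1 - h2).mean()
```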
Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes
JunYong Choi · Min-Cheol Sagong · SeokYeong Lee · Seung-Won Jung · Ig-Jae Kim · Junghyun Cho
We propose a diffusion-based inverse rendering framework that decomposes a single RGB image into geometry, material, and lighting. Inverse rendering is inherently ill-posed, making it difficult to predict a single accurate solution. To address this challenge, recent generative model-based methods aim to present a range of possible solutions. However, the objectives of finding a single accurate solution and generating diverse possible solutions can often be conflicting. In this paper, we propose a channel-wise noise scheduling approach that allows a single diffusion model architecture to achieve two conflicting objectives. The resulting two diffusion models, trained with different channel-wise noise schedules, can predict a single highly accurate solution and present multiple possible solutions. Additionally, we introduce a technique to handle high-dimensional per-pixel environment maps in diffusion models. The experimental results on both synthetic and real-world datasets demonstrate the superiority of our two models and their complementary nature, highlighting the importance of considering both accuracy and diversity in inverse rendering.
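A generic sketch of channel-wise noise scheduling in the forward diffusion process, where each output channel carries its own cumulative-alpha schedule; the channel grouping and the schedules themselves are assumptions, not the paper's exact design.

```python
import torch

def forward_diffuse_channelwise(x0, t, alphas_cumprod):
    """Forward diffusion with a separate noise schedule per channel.

    x0: (B, C, H, W) clean maps (e.g. albedo, normal, roughness, lighting channels).
    alphas_cumprod: (C, T) cumulative alphas, one schedule per channel.
    """
    # Gather the cumulative alpha for timestep t of every channel: (C,) -> (1, C, 1, 1).
    a_bar = alphas_cumprod[:, t].view(1, -1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise
```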
Feature-Preserving Mesh Decimation for Normal Integration
Moritz Heep · Sven Behnke · Eduard Zell
High-resolution normal maps, e.g. from photometric stereo, capture astonishing surface details. Since normals are estimated pixel-by-pixel, the subsequent normal integration is typically performed on the regular pixel grid. Consequently, resolving structures half the size requires doubling both image height and width. To avoid this quadratic growth with increasing image size, recent work has suggested approximating the pixel grid with lower-resolution 2D triangle meshes and integrating the triangle mesh afterwards. In our work, we analyse the local shape structure to increase the compression ratio and precision of 2D meshes. Previous work generated only isotropic meshes with equilateral triangles. Instead, our mesh decimation method leverages anisotropy by placing edges and vertices along ridges and furrows, thereby achieving a more accurate approximation. Our main contribution is the derivation of the renowned quadric error measure from mesh decimation for screen space applications and its combination with optimal Delaunay triangulations.
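For reference, the classic Garland-Heckbert quadric error measure that the abstract builds on can be sketched as follows; the paper's screen-space derivation and its combination with optimal Delaunay triangulations are not reproduced here.

```python
import numpy as np

def face_quadric(v0, v1, v2):
    """Fundamental quadric K = p p^T for the plane of one triangle (Garland-Heckbert)."""
    n = np.cross(v1 - v0, v2 - v0)
    n = n / (np.linalg.norm(n) + 1e-12)
    d = -np.dot(n, v0)
    p = np.append(n, d)                 # plane as [a, b, c, d] with ax + by + cz + d = 0
    return np.outer(p, p)

def vertex_error(Q, v):
    """Quadric error of placing a vertex at v: sum of squared distances to the accumulated planes."""
    vh = np.append(v, 1.0)
    return float(vh @ Q @ vh)
```

A vertex accumulates the quadrics of its incident faces, and edge collapses are ordered by the error of the optimal merged vertex position.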
SGCR: Spherical Gaussians for Efficient 3D Curve Reconstruction
Xinran Yang · Donghao Ji · Yuanqi Li · Jie Guo · Yanwen Guo · Junyuan Xie
Neural rendering techniques have made substantial progress in generating photo-realistic 3D scenes. The latest 3D Gaussian Splatting technique has achieved high-quality novel view synthesis as well as fast rendering speed. However, 3D Gaussians lack proficiency in defining accurate 3D geometric structures despite their explicit primitive representation, because Gaussian attributes are primarily tailored and fine-tuned for rendering diverse 2D images through their anisotropic nature. To pave the way for efficient 3D reconstruction, we present Spherical Gaussians, a simple and effective representation for 3D geometric boundaries, from which we can directly reconstruct 3D feature curves from a set of calibrated multi-view images. Spherical Gaussians are optimized from a grid initialization with a view-based rendering loss, where a 2D edge map is rendered at a specific view and then compared to the ground-truth edge map extracted from the corresponding image, without the need for any 3D guidance or supervision. Since Spherical Gaussians serve as an intermediate robust edge representation, we further introduce a novel optimization-based algorithm, SGCR, to directly extract accurate parametric curves from aligned Spherical Gaussians. We demonstrate that SGCR outperforms existing state-of-the-art methods in 3D edge reconstruction while enjoying great efficiency.
AMR-Transformer: Enabling Efficient Long-range Interaction for Complex Neural Fluid Simulation
Zeyi Xu · Jinfan Liu · Kuangxu Chen · Ye Chen · Zhangli Hu · Bingbing Ni
Accurately and efficiently simulating complex fluid dynamics is a challenging task that has traditionally relied on computationally intensive methods. Neural network-based approaches, such as convolutional and graph neural networks, have partially alleviated this burden by enabling efficient local feature extraction. However, they struggle to capture long-range dependencies due to limited receptive fields, and Transformer-based models, while providing global context, incur prohibitive computational costs. To tackle these challenges, we propose AMR-Transformer, an efficient and accurate neural CFD-solving pipeline that integrates a novel adaptive mesh refinement scheme with a Navier-Stokes constraint-aware fast pruning module. This design encourages long-range interactions between simulation cells and facilitates the modeling of global fluid wave patterns, such as turbulence and shockwaves. Experiments show that our approach achieves significant gains in efficiency while preserving critical details, making it suitable for high-resolution physical simulations with long-range dependencies. On CFDBench, PDEBench and a new shockwave dataset, our pipeline demonstrates up to an order-of-magnitude improvement in accuracy over baseline models. Additionally, compared to ViT, our approach reduces FLOPs by up to 60 times.
MaRI: Material Retrieval Integration across Domains
Jianhui Wang · Zhifei Yang · Yangfan He · Huixiong Zhang · Yuxuan Chen · Jingwei Huang
Accurate material retrieval is critical for creating realistic 3D assets. Existing methods rely on datasets that capture shape-invariant and lighting-varied representations of materials, which are scarce and face challenges due to limited diversity and inadequate real-world generalization. Most current approaches adopt traditional image search techniques, which fall short of capturing the unique properties of material spaces, leading to suboptimal performance in retrieval tasks. To address these challenges, we introduce MaRI, a framework designed to bridge the feature space gap between synthetic and real-world materials. MaRI constructs a shared embedding space that harmonizes visual and material attributes through a contrastive learning strategy, jointly training an image encoder and a material encoder to bring similar materials and images closer while separating dissimilar pairs within the feature space. To support this, we construct a comprehensive dataset comprising high-quality synthetic materials rendered with controlled shape variations and diverse lighting conditions, along with real-world materials processed and standardized using material transfer techniques. Extensive experiments demonstrate the superior performance, accuracy, and generalization capabilities of MaRI across diverse and complex material retrieval tasks, outperforming existing methods.
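A minimal sketch of a symmetric image-material contrastive objective in the spirit described above (CLIP-style InfoNCE); the temperature and loss details are assumptions, not MaRI's exact formulation.

```python
import torch
import torch.nn.functional as F

def image_material_contrastive_loss(image_emb, material_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/material embeddings.

    Pulls each image toward its paired material and pushes it away from the other
    materials in the batch, and vice versa.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    material_emb = F.normalize(material_emb, dim=-1)
    logits = image_emb @ material_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2m = F.cross_entropy(logits, targets)             # match images to materials
    loss_m2i = F.cross_entropy(logits.t(), targets)         # and materials to images
    return 0.5 * (loss_i2m + loss_m2i)
```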
Spherical Manifold Guided Diffusion Model for Panoramic Image Generation
Xiancheng Sun · Mai Xu · Shengxi Li · Senmao Ma · Xin Deng · Lai Jiang · Shen gang
Panoramic images play a pivotal role in emerging virtual reality and augmented reality scenarios; however, generating panoramic images is challenging due to the intrinsic spherical geometry and the spherical distortions caused by equirectangular projection (ERP). To address this, we start from the very basics of the spherical manifold of panoramic images and propose a novel spherical manifold convolution (SMConv) on the $S^2$ manifold. Based on the SMConv operation, we propose a spherical manifold guided diffusion (SMGD) model for text-conditioned panoramic image generation, which can well accommodate the spherical geometry during generation. We further develop a novel evaluation method that calculates a grouped Fréchet inception distance (FID) on cube-map projections, which better reflects the quality of generated panoramic images than existing methods that randomly crop ERP-distorted content. Experimental results demonstrate that our SMGD model achieves state-of-the-art generation quality and accuracy while retaining the shortest sampling time in the text-conditioned panoramic image generation task.
MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation
Zilong Chen · Yikai Wang · Wenqiang Sun · Feng Wang · Yiwen Chen · Huaping Liu
In this paper, we introduce MeshGen, an advanced image-to-3D pipeline that generates high-quality 3D meshes with detailed geometry and physically based rendering (PBR) textures. Addressing the challenges faced by existing 3D native diffusion models, such as suboptimal auto-encoder performance, limited controllability, poor generalization, and inconsistent image-based PBR texturing, MeshGen employs several key innovations to overcome these limitations. We pioneer a render-enhanced point-to-shape auto-encoder that compresses meshes into a compact latent space, by designing perceptual optimization with ray-based regularization. This ensures that the 3D shapes are accurately represented and reconstructed to preserve geometric details within the latent space. To address data scarcity and image-shape misalignment, we further propose geometric augmentation and generative rendering augmentation techniques, which enhance the model's controllability and generalization ability, allowing it to perform well even with limited public datasets. For texture generation, MeshGen employs a reference attention-based multi-view ControlNet for consistent appearance synthesis. This is further complemented by our multi-view PBR decomposer that estimates PBR components and a UV inpainter that fills invisible areas, ensuring a seamless and consistent texture across the 3D mesh. Our extensive experiments demonstrate that MeshGen largely outperforms previous methods in both shape and texture generation, setting a new standard for the quality of 3D meshes generated with PBR textures.
RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects
Soumyaratna Debnath · Ashish Tiwari · Kaustubh Sadekar · Shanmuganathan Raman
Recent advancements in learning-based methods have opened new avenues for exploring and interpreting various artistic forms, such as shadow art, origami, and sketch art, through computational models. One notable visual art form is 3D Anamorphic Art, in which a specific ensemble of arbitrarily shaped 3D objects creates a realistic and meaningful expression when observed from a particular viewpoint and loses its coherence over the other viewpoints. In this work, we build on insights from 3D Anamorphic Art to address the task of 3D object arrangement. We introduce RASP - a differentiable-rendering-based framework to arrange arbitrarily shaped 3D objects within a bounded volume (or container) via shadow-guided optimization aiming towards minimal inter-object spacing and near-maximal occupancy. Here, the projected silhouette of the 3D object arrangement serves as its shadow. Furthermore, we propose a novel SDF-based formulation to handle inter-object intersection and container extrusion. Since the 3D objects can also be "parts" of another 3D object, we demonstrate how RASP can be extended to part assembly alongside object packing. Finally, we present artistic illustrations of multi-view anamorphic art, achieving meaningful expressions from multiple viewpoints within a single ensemble.
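As a rough illustration of how an SDF can penalize inter-object intersection during packing, the sketch below penalizes surface samples of one object that fall inside another; `sdf_a` and the margin are hypothetical, and the paper's actual formulation may differ.

```python
import torch

def intersection_penalty(sdf_a, points_b, margin=0.0):
    """Penalize surface samples of object B that fall inside object A.

    sdf_a: callable mapping (N, 3) points to signed distances (negative inside A).
    points_b: (N, 3) surface samples of object B in the shared/container frame.
    """
    d = sdf_a(points_b)                      # (N,) signed distances of B's samples w.r.t. A
    return torch.relu(margin - d).mean()     # nonzero only where samples penetrate A (or its margin)
```

The same penalty, evaluated against the container's complement SDF, can discourage extrusion outside the bounded volume.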
Twinner: Shining Light on Digital Twins in a Few Snaps
Jesus Zarzar · Tom Monnier · Roman Shapovalov · Andrea Vedaldi · David Novotny
We present the first large reconstruction model, Twinner, capable of recovering a scene's illumination as well as an object's geometry and material properties from only a few posed images. Twinner is based on the Large Reconstruction Model and innovates in three key ways: 1) We introduce a memory-efficient voxel-grid transformer whose memory scales only quadratically with the size of the voxel grid. 2) To deal with the scarcity of high-quality ground-truth PBR-shaded models, we introduce a large fully-synthetic dataset of procedurally generated PBR-textured objects lit with varied illumination. 3) To narrow the synthetic-to-real gap, we finetune the model on real-world datasets by means of a differentiable physically-based shading model, eschewing the need for ground-truth illumination or material properties, which are challenging to obtain in real life. We demonstrate the efficacy of our model on the real-world StanfordORB benchmark where, given few input views, we achieve reconstruction quality significantly superior to existing feed-forward reconstruction networks and comparable to significantly slower per-scene optimization methods.
CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
Weiyu Li · Jiarui Liu · Hongyu Yan · Rui Chen · Yixun Liang · Xuelin Chen · Ping Tan · Xiaoxiao Long
We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, self-occlusion, irregular mesh topologies, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we first introduce a robust data preprocessing pipeline that utilizes visibility checks and winding numbers to maximize the use of existing 3D data. Leveraging this data, we employ a 3D-native DiT model that directly models the distribution of 3D data in latent space, generating coarse geometries with regular mesh topology in seconds. Subsequently, a normal-based geometry refiner enhances local surface details, which can be applied automatically or interactively with user input. Extensive experiments demonstrate that our method achieves high efficacy in producing superior-quality 3D assets compared to existing methods.
Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation
Jiantao Lin · Xin Yang · Meixi Chen · Xu Yingjie · Dongyu Yan · Leyi Wu · Xinli Xu · Lie XU · Shunsi Zhang · Ying-Cong Chen
Diffusion models have achieved great success in generating 2D images. However, the quality and generalizability of 3D content generation remain limited. State-of-the-art methods often require large-scale 3D assets for training, which are challenging to collect. In this work, we introduce Kiss3DGen (Keep It Simple and Straightforward in 3D Generation), an efficient framework for generating, editing, and enhancing 3D objects by repurposing a well-trained 2D image diffusion model for 3D generation. Specifically, we fine-tune a diffusion model to generate a "3D Bundle Image", a tiled representation composed of multi-view images and their corresponding normal maps. The normal maps are then used to reconstruct a 3D mesh, and the multi-view images provide texture mapping, resulting in a complete 3D model. This simple method effectively transforms the 3D generation problem into a 2D image generation task, maximizing the utilization of knowledge in pretrained diffusion models. Furthermore, we demonstrate that our Kiss3DGen model is compatible with various diffusion model techniques, enabling advanced features such as 3D editing, mesh and texture enhancement, etc. Through extensive experiments, we demonstrate the effectiveness of our approach, showcasing its ability to produce high-quality 3D models efficiently.
PartGen: Part-level 3D Generation and Reconstruction with Multi-view Diffusion Models
Minghao Chen · Roman Shapovalov · Iro Laina · Tom Monnier · Jianyuan Wang · David Novotny · Andrea Vedaldi
Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.
FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts
Tongyuan Bai · Wangyuanfan Bai · Dong Chen · Tieru Wu · Manyi Li · Rui Ma
Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow rough language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable knowledge for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs including text descriptions and/or reference images, allowing users to express versatile design intentions. The user inputs are adequately analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. Our MG-DiT not only excels at preserving graph structure but also offers broad applicability to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in terms of both generation quality and controllability in a range of applications.
Reference-Based 3D-Aware Image Editing with Triplanes
Bahri Batuhan Bilecen · Yiğit Yalın · Ning Yu · Aysegul Dundar
Generative Adversarial Networks (GANs) have emerged as powerful tools for high-quality image generation and real image editing by manipulating their latent spaces. Recent advancements in GANs include 3D-aware models such as EG3D, which feature efficient triplane-based architectures capable of reconstructing 3D geometry from single images. However, limited attention has been given to providing an integrated framework for 3D-aware, high-quality, reference-based image editing. This study addresses this gap by exploring and demonstrating the effectiveness of the triplane space for advanced reference-based edits. Our novel approach integrates encoding, automatic localization, spatial disentanglement of triplane features, and fusion learning to achieve the desired edits. Additionally, our framework demonstrates versatility and robustness across various domains, extending its effectiveness to animal face edits, partially stylized edits like cartoon faces, full-body clothing edits, and 360-degree head edits. Our method shows state-of-the-art performance over relevant latent direction, text, and image-guided 2D and 3D-aware diffusion and GAN methods, both qualitatively and quantitatively.
WonderWorld: Interactive 3D Scene Generation from a Single Image
Hong-Xing Yu · Haoyi Duan · Charles Herrmann · William Freeman · Jiajun Wu
We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes with low latency. The major challenge lies in achieving fast generation of 3D scenes. Existing scene generation approaches fall short in speed as they often require (1) progressively generating many views and depth maps, and (2) time-consuming optimization of the scene representations. Our approach does not need multiple views, and it leverages a geometry-based initialization that significantly reduces optimization time. Another challenge is generating coherent geometry that allows all scenes to be connected. We introduce guided depth diffusion, which allows partial conditioning of depth estimation. WonderWorld generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We will release full code and software for reproducibility.
UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping
Aashish Rai · Dilin Wang · Mihir Jain · Nikolaos Sarafianos · Kefan Chen · Srinath Sridhar · Aayush Prakash
3D Gaussian Splatting (3DGS) has demonstrated superior quality in modeling 3D objects and scenes. However, generating 3DGS remains challenging due to their discrete, unstructured, and permutation-invariant nature. In this work, we present a simple yet effective method to overcome these challenges. We utilize spherical mapping to transform 3DGS into a structured 2D representation, termed UVGS. UVGS can be viewed as multi-channel images, with feature dimensions as a concatenation of Gaussian attributes such as position, scale, color, opacity, and rotation. We further find that these heterogeneous features can be compressed into a lower-dimensional (e.g., 3-channel) shared feature space using a carefully designed multi-branch network. The compressed UVGS can be treated as typical RGB images. Remarkably, we discover that typical VAEs trained with latent diffusion models can directly generalize to this new representation without additional training. Our novel representation makes it effortless to leverage foundational 2D models, such as diffusion models, to directly model 3DGS. Additionally, one can simply increase the 2D UV resolution to accommodate more Gaussians, making UVGS a scalable solution compared to typical 3D backbones. This approach immediately unlocks various novel generation applications of 3DGS by inherently utilizing the already developed superior 2D generation capabilities. In our experiments, we demonstrate various unconditional, conditional generation, and inpainting applications of 3DGS based on diffusion models, which were previously non-trivial.
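As a rough illustration of the spherical mapping idea, the following PyTorch sketch rasterizes Gaussian centers onto an equirectangular UV grid and stores the concatenated attributes as image channels; it assumes a scene roughly centered at the origin and resolves collisions naively, so it should be read as a sketch rather than the paper's actual mapping:

    import math
    import torch

    def gaussians_to_uvgs(xyz, attrs, H=256, W=512):
        # xyz:   (N, 3) Gaussian centers, assumed roughly centered at the origin
        # attrs: (N, C) concatenated attributes (position, scale, rotation, opacity, color, ...)
        d = torch.nn.functional.normalize(xyz, dim=-1)
        theta = torch.acos(d[:, 1].clamp(-1.0, 1.0))            # polar angle in [0, pi]
        phi = torch.atan2(d[:, 2], d[:, 0])                     # azimuth in (-pi, pi]
        v = (theta / math.pi * (H - 1)).long()
        u = ((phi / (2 * math.pi) + 0.5) * (W - 1)).long()
        uvgs = torch.zeros(attrs.shape[1], H, W)
        uvgs[:, v, u] = attrs.t()                               # last write wins on collisions
        return uvgs                                             # a (C, H, W) multi-channel "image"

    uvgs = gaussians_to_uvgs(torch.randn(10000, 3), torch.randn(10000, 14))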
3D-GSW: 3D Gaussian Splatting for Robust Watermarking
Youngdong Jang · Hyunje Park · Feng Yang · Heeju Ko · Euijin Choo · Sangpil Kim
As 3D Gaussian Splatting (3D-GS) gains significant attention and its commercial usage increases, the need for watermarking technologies to prevent unauthorized use of the 3D-GS models and rendered images has become increasingly important. In this paper, we introduce a robust watermarking method for 3D-GS that secures ownership of both the model and its rendered images. Our proposed method remains robust against distortions in rendered images and model attacks while maintaining high rendering quality. To achieve these objectives, we present Frequency-Guided Densification (FGD), which removes 3D Gaussians based on their contribution to rendering quality, enhancing real-time rendering and the robustness of the message. FGD utilizes Discrete Fourier Transform to split 3D Gaussians in high-frequency areas, improving rendering quality. Furthermore, we employ a gradient mask for 3D Gaussians and design a wavelet-subband loss to enhance rendering quality. Our experiments show that our method embeds the message in the rendered images invisibly and robustly against various attacks, including model distortion. Our method achieves state-of-the-art performance.
PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting
Alex Hanson · Allen Tu · Vasu Singla · Bethmage Mayuka Jayawardhana · Matthias Zwicker · Tom Goldstein
Recent advancements in novel view synthesis have enabled real-time rendering speeds with high reconstruction accuracy. 3D Gaussian Splatting (3D-GS), a foundational point-based parametric 3D scene representation, models scenes as large sets of 3D Gaussians. However, complex scenes can consist of millions of Gaussians, resulting in high storage and memory requirements that limit the viability of 3D-GS on devices with limited resources. Current techniques for compressing these pretrained models by pruning Gaussians rely on combining heuristics to determine which Gaussians to remove. At high compression ratios, the pruned scenes suffer from heavy degradation of visual fidelity and loss of foreground details. In this paper, we propose a principled sensitivity pruning score that preserves visual fidelity and foreground details at significantly higher compression ratios than existing approaches. It is computed as a second-order approximation of the reconstruction error on the training views with respect to the spatial parameters of each Gaussian. Additionally, we propose a multi-round prune-refine pipeline that can be applied to any pretrained 3D-GS model without changing its training pipeline. After pruning $\mathit{90\%}$ of Gaussians, a substantially higher percentage than previous methods, our PUP 3D-GS pipeline increases average rendering speed by $\mathit{3.56\times}$ while retaining more salient foreground information and achieving higher image quality metrics than existing techniques on scenes from the Mip-NeRF 360, Tanks \& Temples, and Deep Blending datasets.
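One way to instantiate such a second-order score (a sketch only; the paper's exact formulation may differ): let $L$ be the reconstruction error over the training views and $x_i$ the spatial parameters (mean and scale) of Gaussian $i$. A second-order Taylor expansion of $L$ around the converged parameters yields a per-Gaussian sensitivity $U_i \approx \frac{1}{2} x_i^{\top} H_i x_i$, where the Hessian is approximated in Gauss-Newton/Fisher form as $H_i \approx \sum_{p} \nabla_{x_i} r_p \, \nabla_{x_i} r_p^{\top}$ over training pixels $p$ with residuals $r_p$. Gaussians with the smallest $U_i$ perturb the rendered training views the least and are pruned first.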
Gaussian Splatting for Efficient Satellite Image Photogrammetry
Luca Savant Aira · Gabriele Facciolo · Thibaud Ehret
Recently, Gaussian splatting has emerged as a strong alternative to NeRF, demonstrating impressive 3D modeling capabilities while requiring only a fraction of the training and rendering time. In this paper, we show how the standard Gaussian splatting framework can be adapted for remote sensing, retaining its high efficiency. This enables us to achieve state-of-the-art performance in just a few minutes, compared to the day-long optimization required by the best-performing NeRF-based Earth observation methods. The proposed framework incorporates remote-sensing improvements from EO-NeRF, such as radiometric correction and shadow modeling, while introducing novel components, including sparsity, view consistency, and opacity regularizations.
HyperGS: Hyperspectral 3D Gaussian Splatting
Christopher Thirgood · Oscar Mendez · Erin Chao Ling · Jonathan Storey · Simon Hadfield
We introduce HyperGS, a novel framework for Hyperspectral Novel View Synthesis (HNVS), based on a new latent 3D Gaussian Splatting (3DGS) technique. Our approach enables simultaneous spatial and spectral renderings by encoding material properties from multi-view 3D hyperspectral datasets. HyperGS reconstructs high-fidelity views from arbitrary perspectives with improved accuracy and speed, outperforming existing methods. To address the challenges of high-dimensional data, we perform view synthesis in a learned latent space, incorporating a pixel-wise adaptive density function and a pruning technique for increased training stability and efficiency. Additionally, we introduce the first HNVS benchmark, implementing a number of new baselines based on recent SOTA RGB-NVS techniques, alongside the small number of prior works on HNVS. We demonstrate HyperGS's robustness via extensive evaluations of real and simulated hyperspectral scenes. This work marks a significant step forward in efficient multi-view 3D geometry synthesis for hyperspectral modeling.
RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images
Junjin Xiao · Qing Zhang · Yongwei Nie · Lei Zhu · Wei-Shi Zheng
This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views that have little overlap and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between the SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code will be made publicly available.
GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping
Jinfeng Liu · Lingtong Kong · Bo Li · Dan Xu
High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes by leveraging multi-view low dynamic range (LDR) images captured at different exposure levels. Current training paradigms with 3D tone mapping often result in unstable HDR reconstruction, while training with 2D tone mapping reduces the model's capacity to fit LDR images. Additionally, the global tone mapper used in existing methods can impede the learning of both HDR and LDR representations. To address these challenges, we present GaussHDR, which unifies 3D and 2D local tone mapping through 3D Gaussian splatting. Specifically, we design a residual local tone mapper for both 3D and 2D tone mapping that accepts an additional context feature as input. We then propose combining the dual LDR rendering results from both 3D and 2D local tone mapping at the loss level. Finally, recognizing that different scenes may exhibit varying balances between the dual results, we introduce uncertainty learning and use the uncertainties for adaptive modulation. Extensive experiments demonstrate that GaussHDR significantly outperforms state-of-the-art methods in both synthetic and real-world scenarios.
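One common way to let learned uncertainties adaptively modulate two loss branches is heteroscedastic weighting with learnable log-variances; the PyTorch sketch below is only an illustration of this general idea and may differ from the paper's exact formulation:

    import torch

    log_var_3d = torch.zeros(1, requires_grad=True)   # learnable uncertainty, 3D tone-mapping branch
    log_var_2d = torch.zeros(1, requires_grad=True)   # learnable uncertainty, 2D tone-mapping branch

    def combined_ldr_loss(loss_3d, loss_2d):
        # each branch is down-weighted by its learned variance; the log-variance terms
        # keep the uncertainties from growing without bound
        return (torch.exp(-log_var_3d) * loss_3d + log_var_3d
                + torch.exp(-log_var_2d) * loss_2d + log_var_2d).sum()

    combined_ldr_loss(torch.tensor(0.12), torch.tensor(0.08)).backward()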
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views
Antoine Guédon · Tomoki Ichikawa · Kohei Yamashita · Ko Nishino
We present a novel appearance model that simultaneously realizes explicit high-quality 3D surface mesh recovery and photorealistic novel view synthesis from sparse view samples. Our key idea is to model the underlying scene geometry as a Mesh as an Atlas of Charts, which we render with 2D Gaussian surfels (MAtCha Gaussians). MAtCha distills high-frequency scene surface details from an off-the-shelf monocular depth estimator and refines them through Gaussian surfel rendering. The Gaussian surfels are attached to the charts on the fly, achieving both the photorealism of neural volumetric rendering and the crisp geometry of a mesh model, i.e., two seemingly contradicting goals in a single model. At the core of MAtCha lies a novel neural deformation model and a structure loss that preserve the fine surface details distilled from learned monocular depths while addressing their fundamental scale ambiguities. Results of extensive experimental validation demonstrate MAtCha's state-of-the-art quality of surface reconstruction and photorealism on par with top contenders but with a dramatic reduction in the number of input views and computational time. We believe MAtCha will serve as a foundational tool for any visual application in vision, graphics, and robotics that requires explicit geometry in addition to photorealism.
Exploiting Deblurring Networks for Radiance Fields
Haeyun Choi · Heemin Yang · Janghyeok Han · Sunghyun Cho
In this paper, we propose DeepDeblurRF, a novel radiance field deblurring approach that can synthesize high-quality novel views from blurred training views with significantly reduced training time. DeepDeblurRF leverages deep neural network (DNN)-based deblurring modules to benefit from their deblurring performance and computational efficiency. To effectively combine DNN-based deblurring and radiance field construction, we propose a novel radiance field (RF)-guided deblurring and an iterative framework that performs RF-guided deblurring and radiance field construction in an alternating manner. Moreover, DeepDeblurRF is compatible with various scene representations, such as voxel grids and 3D Gaussians, expanding its applicability. We also present BlurRF-Synth, the first large-scale synthetic dataset for training radiance field deblurring frameworks. We conduct extensive experiments on both camera motion blur and defocus blur, demonstrating that DeepDeblurRF achieves state-of-the-art novel-view synthesis quality with significantly reduced training time.
Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
Junfeng Ni · Yu Liu · Ruijie Lu · ZiRui Zhou · Song-Chun Zhu · Yixin Chen · Siyuan Huang
Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating the diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together, these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms state-of-the-art methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization, yielding decomposed object meshes with detailed UV maps.
MET3R: Measuring Multi-View Consistency in Generated Images
Mohammad Asim · Christopher Wewer · Thomas Wimmer · Bernt Schiele · Jan Lenssen
We introduce MET3R, a metric for multi-view consistency in generated images. Large-scale generative models for multi-view image generation are rapidly advancing the field of 3D inference from sparse observations. However, due to the nature of generative modeling, traditional reconstruction metrics are not suitable to measure the quality of generated outputs and metrics that are independent of the sampling procedure are desperately needed. In this work, we specifically address the aspect of consistency between generated multi-view images, which can be evaluated independently of the specific scene. MET3R uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, which are used to warp image contents from one view into the other. Then, feature maps of these images are compared to obtain a similarity score that is invariant to view-dependent effects. Using MET3R, we evaluate the consistency of a large set of previous methods and our own, open, multi-view latent diffusion model.
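A minimal PyTorch sketch of the warp-and-compare idea, assuming the view-1 pointmap is already expressed in view 2's camera frame (e.g. from a feed-forward pointmap network) and view 2's intrinsics are known; the actual metric's feature extraction, upsampling, and weighting are more involved, and all shapes here are hypothetical:

    import torch
    import torch.nn.functional as F

    def consistency_score(pts1_in_cam2, feat1, feat2, K):
        # pts1_in_cam2: (H, W, 3) 3D points of view-1 pixels in view 2's camera frame
        # feat1, feat2: (C, H, W) feature maps of the two views; K: (3, 3) intrinsics of view 2
        H, W = feat1.shape[1:]
        proj = (K @ pts1_in_cam2.reshape(-1, 3).t()).t()              # project into view 2
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,               # normalize to [-1, 1]
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, H, W, 2)
        warped = F.grid_sample(feat2.unsqueeze(0), grid, align_corners=True)[0]
        return F.cosine_similarity(feat1, warped, dim=0).mean()       # higher = more consistent

    H, W, C = 32, 48, 16
    pts = torch.randn(H, W, 3); pts[..., 2] = pts[..., 2].abs() + 2.0   # points in front of the camera
    K = torch.tensor([[40.0, 0.0, W / 2], [0.0, 40.0, H / 2], [0.0, 0.0, 1.0]])
    print(consistency_score(pts, torch.randn(C, H, W), torch.randn(C, H, W), K))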
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
Zhenggang Tang · Yuchen Fan · Dilin Wang · Hongyu Xu · Rakesh Ranjan · Alexander G. Schwing · Zhicheng Yan
Recent sparse view scene reconstruction advances like DUSt3R and MASt3R no longer require camera calibration and camera pose estimation. However, they only process a pair of views at a time to infer pixel-aligned pointmaps. When dealing with more than two views, a combinatorial number of error-prone pairwise reconstructions are usually followed by an expensive global optimization, which often fails to rectify the pairwise reconstruction errors. To handle more views, reduce errors, and improve inference time, we propose the fast single-stage feed-forward network MV-DUSt3R. At its core are multi-view decoder blocks which exchange information across any number of views while considering one reference view. To make our method robust to reference view selection, we further propose MV-DUSt3R+, which employs cross-reference-view blocks to fuse information across different reference view choices. To further enable novel view synthesis, we extend both by adding and jointly training Gaussian splatting heads. Experiments on multi-view stereo reconstruction, multi-view pose estimation, and novel view synthesis confirm that our methods improve significantly upon prior art. Code will be released.
MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model
Chenjie Cao · Chaohui Yu · Shang Liu · Fan Wang · Xiangyang Xue · Yanwei Fu
We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset called MvD-1M, comprising up to 1.6 million scenes, equipped with well-aligned metric depth to train MVGenMaster. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released soon.
ERUPT: Efficient Rendering with Unposed Patch Transformer
Maxim Shugaev · Vincent Chen · Maxim Karrenbach · Kyle Ashley · Bridget Kennedy · Naresh Cuntoor
This work addresses the problem of novel view synthesis in diverse scenes from small collections of RGB images. We propose ERUPT (Efficient Rendering with Unposed Patch Transformer), a state-of-the-art scene reconstruction model capable of efficient scene rendering using unposed imagery. We introduce patch-based querying, in contrast to existing pixel-based queries, to reduce the compute required to render a target view. This makes our model highly efficient both during training and at inference, capable of rendering at 600 fps on commercial hardware. Notably, our model is designed to use a learned latent camera pose, which allows for training using unposed targets in datasets with sparse or inaccurate ground-truth camera poses. We show that our approach can generalize on large real-world data and introduce a new benchmark dataset (MSVS-1M) for latent view synthesis using street-view imagery collected from Mapillary. In contrast to NeRF and Gaussian Splatting, which require dense imagery and precise metadata, ERUPT can render novel views of arbitrary scenes with as few as five unposed input images. ERUPT achieves better rendered image quality than current state-of-the-art methods for unposed image synthesis tasks, reduces labeled data requirements by ~95%, and reduces computational requirements by an order of magnitude, providing efficient novel view synthesis for diverse real-world scenes.
Satellite to GroundScape - Large-scale Consistent Ground View Generation from Satellite Views
Ningli Xu · Rongjun Qin
Generating consistent ground-view images from satellite imagery is challenging, primarily due to the large discrepancies in viewing angles and resolution between satellite and ground-level domains. Previous efforts mainly concentrated on single-view generation, often resulting in inconsistencies across neighboring ground views. In this work, we propose a novel cross-view synthesis approach designed to overcome these challenges by ensuring consistency across ground-view images generated from satellite views. Our method, based on a fixed latent diffusion model, introduces two conditioning modules: satellite-guided denoising, which extracts high-level scene layout to guide the denoising process, and satellite-temporal denoising, which captures camera motion to maintain consistency across multiple generated views. We further contribute a large-scale satellite-ground dataset containing over 100,000 perspective pairs to facilitate extensive ground scene or video generation. Experimental results demonstrate that our approach outperforms existing methods on perceptual and temporal metrics, achieving high photorealism and consistency in multi-view outputs.
GenFusion: Closing the Loop between Reconstruction and Generation via Videos
Sibo Wu · Congrong Xu · Binbin Huang · Andreas Geiger · Anpei Chen
Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse view and masked input, validates the effectiveness of our approach.
Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model
Shengjun Zhang · Jinzhao Li · Xin Fei · Hao Liu · Yueqi Duan
In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features with a receptive field that spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as pixel-level momentum for a video generated directly without momentum, enabling better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate both high-fidelity and consistent novel views. We further finetune the global Gaussian representations with enhanced frames and render new frames for momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
Multi-subject Open-set Personalization in Video Generation
Tsai-Shien Chen · Aliaksandr Siarohin · Willi Menapace · Yuwei Fang · Kwot Sin Lee · Ivan Skorokhodov · Kfir Aberman · Jun-Yan Zhu · Ming-Hsuan Yang · Sergey Tulyakov
Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist—a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
Generative Gaussian Splatting for Unbounded 3D City Generation
Haozhe Xie · Zhaoxi Chen · Fangzhou Hong · Ziwei Liu
3D city generation with NeRF-based methods shows promising generation results but is computationally inefficient. Recently, 3D Gaussian splatting (3D-GS) has emerged as a highly efficient alternative for object-level 3D generation. However, adapting 3D-GS from finite-scale 3D objects and humans to infinite-scale 3D cities is non-trivial. Unbounded 3D city generation entails significant storage overhead (out-of-memory issues), arising from the need to expand points to billions, often demanding hundreds of gigabytes of VRAM for a city scene spanning 10 km^2. In this paper, we propose GaussianCity, a generative Gaussian splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. Our key insights are two-fold: 1) Compact 3D Scene Representation: We introduce BEV-Point as a highly compact intermediate representation, ensuring that the growth in VRAM usage for unbounded scenes remains constant, thus enabling unbounded city generation. 2) Spatial-aware Gaussian Attribute Decoder: We present a spatial-aware BEV-Point decoder to produce 3D Gaussian attributes, which leverages Point Serializer to integrate the structural and contextual characteristics of BEV points. Extensive experiments demonstrate that GaussianCity achieves state-of-the-art results in both drone-view and street-view 3D city generation. Notably, compared to CityDreamer, GaussianCity exhibits superior performance with a speedup of 60 times (10.72 FPS vs. 0.18 FPS).
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Xuanchi Ren · Tianchang Shen · Jiahui Huang · Huan Ling · Yifan Lu · Merlin Nimier-David · Thomas Müller · Alexander Keller · Sanja Fidler · Jun Gao
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video.
Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs
Yingji Zhong · Zhihao Li · Dave Zhenyu Chen · Lanqing Hong · Dan Xu
Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes with sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose to use a reconstruction-by-generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that do not fully benefit subsequent 3DGS modeling. To address the challenge of inconsistency, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences. This guidance is training-free and does not require any fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method. It effectively identifies regions that are outside the field of view and occluded. We further design a scheme tailored for 3DGS optimization with generated sequences. Experiments demonstrate that our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
Jinxiu Liu · Shaoheng Lin · Yinxiao Li · Ming-Hsuan Yang
Generating 360° panoramic and dynamic world content is important for spatial intelligence. The increasing demand for immersive AR/VR applications has heightened the need to generate high-quality scene-level dynamic content. However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose DynamicScaler, which addresses these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. Specifically, we introduce an Offset Shifting Denoiser, which enables efficient, synchronous, and coherent denoising of panoramic dynamic scenes using a fixed-resolution diffusion model applied through a seamlessly rotating window, ensuring smooth boundary transitions and consistency across the entire panoramic space while accommodating varying resolutions and aspect ratios. Additionally, we employ a Global Motion Guidance mechanism to ensure both local detail fidelity and global motion continuity. Extensive experiments demonstrate that our method achieves superior content and motion quality in panoramic scene-level video generation, offering a training-free, efficient, and scalable solution for immersive dynamic scene creation with only 11GB VRAM (Video Random Access Memory) regardless of the output video resolution.
LIM: Large Interpolator Model for Dynamic Reconstruction
Remy Sabathier · Niloy J. Mitra · David Novotny
Reconstructing dynamic assets from video data is central to many computer vision and graphics tasks. Existing 4D reconstruction approaches are limited by category-specific models or slow optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times $t_0$ and $t_1$, LIM produces a deformed shape at any continuous time $t\in[t_0,t_1]$, delivering high-quality interpolations in seconds (per frame). Furthermore, LIM allows explicit mesh tracking across time, producing a consistently UV-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
EgoLM: Multi-Modal Language Model of Egocentric Motions
Fangzhou Hong · Vladimir Guzov · Hyo Jin Kim · Yuting Ye · Richard Newcombe · Ziwei Liu · Lingni Ma
As wearable devices become more prevalent, understanding the user's motion is crucial for improving contextual AI systems. We introduce EgoLM, a versatile framework designed for egocentric motion understanding using multi-modal data. EgoLM integrates the rich contextual information from egocentric videos and motion sensors afforded by wearable devices. It also combines dense supervision signals from motion and language, leveraging the vast knowledge encoded in pre-trained large language models (LLMs). EgoLM models the joint distribution of egocentric motions and natural language using LLMs, conditioned on observations from egocentric videos and motion sensors. It unifies a range of motion understanding tasks, including motion narration from video or motion data, as well as motion generation from text or sparse sensor data. Unique to wearable devices, it also enables a novel task to generate text descriptions from sparse sensors. Through extensive experiments, we validate the effectiveness of EgoLM in addressing the challenges of under-constrained egocentric motion learning, and demonstrate its capability as a generalist model through a variety of applications.
MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds
Jiahui Lei · Yijia Weng · Adam W Harley · Leonidas Guibas · Kostas Daniilidis
We introduce 4D Motion Scaffolds (MoSca), a modern 4D reconstruction system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. To address such a challenging and ill-posed inverse problem, we leverage prior knowledge from foundational vision models and lift the video data to a novel Motion Scaffold (MoSca) representation, which compactly and smoothly encodes the underlying motions/deformations. The scene geometry and appearance are then disentangled from the deformation field and are encoded by globally fusing the Gaussians anchored onto the MoSca and optimized via Gaussian Splatting. Additionally, camera focal length and poses can be solved using bundle adjustment without the need for any other pose estimation tools. Experiments demonstrate state-of-the-art performance on dynamic rendering benchmarks and its effectiveness on real videos.
PhysGen3D: Crafting a Miniature Interactive World from a Single Image
Boyuan Chen · Hanxiao Jiang · Shaowei Liu · Saurabh Gupta · Yunzhu Li · Hao Zhao · Shenlong Wang
Envisioning physically plausible outcomes from a single image requires a deep understanding of the world's dynamics. To address this, we introduce PhysGen3D, a novel framework that transforms a single image into an amodal, camera-centric, interactive 3D scene. By combining advanced image-based geometric and semantic understanding with physics-based simulation, PhysGen3D creates an interactive 3D world from a static image, enabling us to "imagine" and simulate future scenarios based on user input. At its core, PhysGen3D estimates 3D shapes, poses, physical and lighting properties of objects, thereby capturing essential physical attributes that drive realistic object interactions. This framework allows users to specify precise initial conditions, such as object speed or material properties, for enhanced control over generated video outcomes. We evaluate PhysGen3D's performance against closed-source state-of-the-art (SOTA) image-to-video models, including Pika, Kling, and Gen-3, showing PhysGen3D's capacity to generate videos with realistic physics while offering greater flexibility and fine-grained control. Our results show that PhysGen3D achieves a unique balance of photorealism, physical plausibility, and user-driven interactivity, opening new possibilities for generating dynamic, physics-grounded video from an image.
Link to the Past: Temporal Propagation for Fast 3D Human Reconstruction from Monocular Video
Marchellus Matthew · Nadhira Noor · In Kyu Park
Fast 3D clothed human reconstruction from monocular video remains a significant challenge in computer vision, particularly in balancing computational efficiency with reconstruction quality. Current approaches are either focused on static image reconstruction but too computationally intensive, or achieve high quality through per-video optimization that requires minutes to hours of processing, making them unsuitable for real-time applications. To this end, we present TemPoFast3D, a novel method that leverages temporal coherency of human appearance to reduce redundant computation while maintaining reconstruction quality. Our approach is a "plug-and-play" solution that uniquely transforms pixel-aligned reconstruction networks to handle continuous video streams by maintaining and refining a canonical appearance representation through efficient coordinate mapping. Extensive experiments demonstrate that TemPoFast3D matches or exceeds state-of-the-art methods across standard metrics while providing high-quality textured reconstruction across diverse poses and appearances, with a maximum speed of 12 FPS.
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Changan Chen · Juze Zhang · Shrinidhi Kowshika Lakshmikanth · Yusu Fang · Ruizhi Shao · Gordon Wetzstein · Li Fei-Fei · Ehsan Adeli
Human communication is inherently multimodal, involving a combination of verbal and non-verbal cues such as speech, facial expressions, and body gestures. Modeling these behaviors is essential for understanding human interaction and for creating virtual characters that can communicate naturally in applications like games, films, and virtual reality. However, existing motion generation models are typically limited to specific input modalities—either speech, text, or motion data—and cannot fully leverage the diversity of available data. In this paper, we propose a novel framework that unifies verbal and non-verbal language using multimodal language models for human motion understanding and generation. This model is flexible in taking text, speech, and motion or any combination of them as input and output. Coupled with our novel pre-training strategy, our model not only achieves state-of-the-art performance on co-speech gesture generation but also requires much less data for training. Our unified model also unlocks an array of novel tasks such as editable gesture generation and emotion prediction from motion. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications, and language models offer a powerful approach to achieving this goal.
Towards Explainable and Unprecedented Accuracy in Matching Challenging Finger Crease Patterns
Zhenyu Zhou · Chengdong Dong · Ajay Kumar
The primary obstacle in realizing the full potential of finger crease biometrics is the accurate identification of deformed knuckle patterns, often resulting from completely contactless imaging. Current methods struggle significantly with this task, yet accurate matching is crucial for applications ranging from forensic investigations, such as child abuse cases, to surveillance and mobile security. To address this challenge, our study introduces the largest publicly available dataset of deformed knuckle patterns, comprising 805,768 images from 351 subjects. We also propose a novel framework to accurately match knuckle patterns, even under severe pose deformations, by recovering interpretable knuckle crease keypoint feature templates. These templates can dynamically uncover graph structure and feature similarity among the matched correspondences. Our experiments, using the most challenging protocols, show that our framework significantly outperforms existing methods in matching such knuckle images. For the first time, we present and evaluate a theoretical model to estimate the uniqueness of 2D finger knuckle patterns, providing a more interpretable and accurate measure of distinctiveness, which is invaluable for forensic examiners in prosecuting suspects.
One-Step Event-Driven High-Speed Autofocus
Yuhan Bao · Shaohua Gao · Wenyong Li · Kaiwei Wang
High-speed autofocus in extreme scenes remains a significant challenge. Traditional methods rely on repeated sampling around the focus position, resulting in "focus hunting". Event-driven methods have advanced focusing speed and improved performance in low-light conditions; however, current approaches still require at least one lengthy round of "focus hunting", involving the collection of a complete focus stack. We introduce the Event Laplacian Product (ELP) focus detection function, which combines event data with grayscale Laplacian information, redefining focus search as a detection task. This innovation enables the first one-step event-driven autofocus, cutting focusing time by up to two-thirds and reducing focusing error by 24 times on the DAVIS346 dataset and 22 times on the EVK4 dataset. Additionally, we present an autofocus pipeline tailored for event-only cameras, achieving accurate results across a range of challenging motion and lighting conditions. All datasets and code will be made publicly available.
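To make the event-times-Laplacian idea concrete, here is a hypothetical focus-measure sketch in Python; it simply multiplies accumulated event activity with the grayscale Laplacian magnitude and is not the paper's exact ELP definition:

    import numpy as np
    from scipy.ndimage import laplace

    def event_laplacian_focus(event_frame, gray):
        # event_frame: (H, W) per-pixel event counts accumulated over a short time window
        # gray:        (H, W) grayscale frame captured at roughly the same instant
        # sharper focus -> stronger edges -> larger event activity and Laplacian response
        return float(np.sum(np.abs(event_frame) * np.abs(laplace(gray.astype(np.float64)))))

    # synthetic data for illustration only; in practice such a measure would be monitored
    # online during a single lens sweep, stopping at its peak instead of storing a focus stack
    rng = np.random.default_rng(0)
    m = event_laplacian_focus(rng.poisson(1.0, (64, 64)), rng.random((64, 64)))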
Self-Supervised Learning for Color Spike Camera Reconstruction
Yanchen Dong · Ruiqin Xiong · Xiaopeng Fan · Zhaofei Yu · Yonghong Tian · Tiejun Huang
A spike camera is a kind of neuromorphic camera with ultra-high temporal resolution that can capture dynamic scenes by continuously firing spike signals. To capture color information, a color filter array (CFA) is employed on the sensor of the spike camera, resulting in Bayer-pattern spike streams. How to restore high-quality color images from the binary spike signals remains challenging. In this paper, we propose a motion-guided reconstruction method for spike cameras with CFA, utilizing color layout and estimated motion information. Specifically, we develop a joint motion estimation pipeline for the Bayer-pattern spike stream, exploiting the motion consistency of channels. We propose to estimate the missing pixels of each color channel according to temporally neighboring pixels of the corresponding color along the motion trajectory. As the spike signals are read out at discrete time points, there is quantization noise that impacts the image quality. Thus, we analyze the correlation of the noise in the spatial and temporal domains and propose a self-supervised network utilizing a masked spike encoder to handle the noise. Experimental results on real-world captured Bayer-pattern spike streams show that our method can restore color images with better visual quality compared with state-of-the-art methods. All the source codes and datasets will be publicly available.
PS-EIP: Robust Photometric Stereo Based on Event Interval Profile
Kazuma Kitazawa · Takahito Aoto · Satoshi Ikehata · Tsuyoshi Takatani
Recently, the energy-efficient photometric stereo method using an event camera (EventPS) has been proposed to recover surface normals from events triggered by changes in logarithmic Lambertian reflections under a moving directional light source. However, EventPS treats each event interval independently, making it sensitive to noise, shadows, and non-Lambertian reflections. This paper proposes Photometric Stereo based on Event Interval Profile (PS-EIP), a robust method that recovers pixelwise surface normals from a time-series profile of event intervals. By exploiting the continuity of the profile and introducing an outlier detection method based on profile shape, our approach enhances robustness against outliers from shadows and specular reflections. Experiments using real event data from 3D-printed objects demonstrate that PS-EIP significantly improves robustness to outliers compared to EventPS's deep-learning variant, EventPS-FCN, without relying on deep learning.
Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses
Yongfan Liu · Hyoukjun Kwon
Stereo depth estimation is a fundamental component in augmented reality (AR) applications. Although AR applications require very low latency for real-time operation, traditional depth estimation models often rely on time-consuming preprocessing steps such as rectification to achieve high accuracy. In addition, non-standard ML operators such as cost volume construction incur significant latency, which is aggravated on compute-constrained mobile platforms. We therefore develop hardware-friendly alternatives to the costly cost volume and preprocessing, and design two new models based on them, MultiHeadDepth and HomoDepth. Our approach replaces the cost volume with a new group-pointwise convolution-based operator and an approximation of cosine similarity based on layernorm and dot product. For online stereo rectification (preprocessing), we introduce a homography matrix prediction network with a rectification positional encoding (RPE), which delivers both low latency and robustness to unrectified images, eliminating the need for preprocessing. Our MultiHeadDepth, which includes the optimized cost volume, provides 11.8-30.3% improvements in accuracy and 22.9-25.2% reduction in latency compared to a state-of-the-art depth estimation model for AR glasses from industry. Our HomoDepth, which adds the optimized preprocessing (homography + RPE) on top of MultiHeadDepth, can process unrectified images and reduces the end-to-end latency by 44.5%. We adopt a multi-task learning framework to handle misaligned stereo inputs in HomoDepth, which reduces the AbsRel error by 10.0-24.3%. The results demonstrate the efficacy of our approaches in achieving high model performance with low latency, marking a step toward practical depth estimation on future AR devices.
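The layernorm-plus-dot-product approximation of cosine similarity mentioned above can be sketched as follows; this is a minimal PyTorch illustration of the idea rather than the operator as implemented in the models:

    import torch
    import torch.nn.functional as F

    def approx_cosine_similarity(a, b):
        # a, b: (..., D) feature vectors. LayerNorm maps each vector to zero mean and
        # (near-)unit variance, so the dot product of the normalized vectors, divided by D,
        # approximates the cosine similarity of the mean-centered features.
        d = a.shape[-1]
        return (F.layer_norm(a, a.shape[-1:]) * F.layer_norm(b, b.shape[-1:])).sum(-1) / d

    a, b = torch.randn(4, 64), torch.randn(4, 64)
    print(approx_cosine_similarity(a, b))
    print(F.cosine_similarity(a - a.mean(-1, keepdim=True), b - b.mean(-1, keepdim=True), dim=-1))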
Scalable Autoregressive Monocular Depth Estimation
Jinhong Wang · Jintai Chen · Jian liu · Dongqi Tang · Wentong Li · Weiqiang Wang · Danny Chen · Jian Wu
This paper proposes a new autoregressive model as an effective and scalable monocular depth estimator. Our idea is simple: we tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats depth maps of different resolutions as a set of tokens, and conducts the low-to-high resolution autoregressive objective with a patch-wise causal mask. Second, our DAR recursively discretizes the entire depth range into more compact intervals, and attains the coarse-to-fine granularity autoregressive objective in an ordinal-regression manner. By coupling these two autoregressive objectives, our DAR establishes a new state of the art (SOTA) on KITTI and NYU Depth v2 by clear margins. Further, our scalable approach allows us to scale the model up to 2.0B parameters and achieve the best RMSE of 1.799 on the KITTI dataset (a 5% improvement) compared to 1.896 by the current SOTA (Depth Anything). DAR further showcases zero-shot generalization ability on unseen datasets. These results suggest that DAR yields superior performance with an autoregressive prediction paradigm, providing a promising approach to equip modern autoregressive large models (e.g., GPT-4o) with depth estimation capabilities.
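The coarse-to-fine discretization of the depth range can be illustrated with a small sketch; the bin count and the predicted tokens below are hypothetical, and the paper's exact tokenization differs in detail:

    import numpy as np

    def refine_depth_interval(d_min, d_max, token, n_bins=16):
        # one coarse-to-fine step: split the current depth interval into n_bins ordinal bins
        # and descend into the bin selected by the predicted token
        edges = np.linspace(d_min, d_max, n_bins + 1)
        return edges[token], edges[token + 1]

    interval = (0.0, 80.0)                      # e.g. the full depth range in metres
    for predicted_token in (3, 7, 12):          # tokens a trained model would emit per stage
        interval = refine_depth_interval(*interval, predicted_token)
    print(interval)                             # three stages reach roughly 80 / 16**3 = 0.02 m resolution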
Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
Haoyu Guo · He Zhu · Sida Peng · Haotong Lin · Yunzhi Yan · Tao Xie · Wenguan Wang · Xiaowei Zhou · Hujun Bao
This paper aims to reconstruct the scene geometry from multi-view images with strong robustness and high quality. Previous learning-based methods incorporate neural networks into the multi-view stereo matching and have shown impressive reconstruction results. However, due to the reliance on matching across input images, they typically suffer from high GPU memory consumption and tend to fail in sparse-view scenarios. To overcome this problem, we develop a new pipeline, named Murre, for multi-view geometry reconstruction of 3D scenes based on SfM-guided monocular depth estimation. For input images, Murre first recovers the SfM point cloud that captures the global scene structure, and then uses it to guide a conditional diffusion model to produce multi-view metric depth maps for the final TSDF fusion. By predicting the depth map from a single image, Murre bypasses the multi-view matching step and naturally resolves the issues of previous MVS-based methods. In addition, the diffusion-based model can easily leverage the powerful priors of 2D foundation models, achieving good generalization ability across diverse real-world scenes. To obtain multi-view consistent depth maps, our key design is providing effective guidance to the diffusion model through the SfM point cloud, which is a condensed form of multi-view information, highlighting the scene's salient structure, and can be readily transformed into point maps to drive the image-space estimation process. We evaluate the reconstruction quality of Murre on various types of real-world datasets including indoor, streetscape, and aerial scenes, surpassing state-of-the-art MVS-based and implicit neural reconstruction-based methods. The code will be released for reproducibility.
FoundationStereo: Zero-Shot Stereo Matching
Bowen Wen · Matthew Trepte · Oluwaseun Joseph Aribido · Jan Kautz · Orazio Gallo · Stan Birchfield
Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization — a hallmark of foundation models in other computer vision tasks — remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation.
MonSter: Marry Monodepth to Stereo Unleashes Power
JunDa Cheng · Longliang Liu · Gangwei Xu · Xianqi Wang · Zhaoxing Zhang · Yong Deng · Jinliang Zang · Yurui Chen · zhipeng cai · Xin Yang
Stereo matching recovers depth from image correspondences. Existing methods struggle to handle ill-posed regions with limited matching cues, such as occlusions and textureless areas. To address this, we propose MonSter, a novel method that leverages the complementary strengths of monocular depth estimation and stereo matching. MonSter integrates monocular depth and stereo matching into a dual-branch architecture to iteratively improve each other. Confidence-based guidance adaptively selects reliable stereo cues for monodepth scale-shift recovery, and utilizes explicit monocular depth priors to enhance stereo matching at ill-posed regions. Such iterative mutual enhancement enables MonSter to evolve monodepth priors from coarse object-level structures to pixel-level geometry, fully unlocking the potential of stereo matching. As shown in Fig. 2, MonSter ranks 1st across the five most commonly used leaderboards --- SceneFlow, KITTI 2012, KITTI 2015, Middlebury, and ETH3D --- achieving up to 49.5% improvement over the previous best method (Bad 1.0 on ETH3D). Comprehensive analysis verifies the effectiveness of MonSter in ill-posed regions. In terms of zero-shot generalization, MonSter significantly and consistently outperforms state-of-the-art methods across the board. Code will be released upon acceptance.
Dual Exposure Stereo for Extended Dynamic Range 3D Imaging
Juhyung Choi · Jinneyong Kim · Seokjun Choi · Jinwoo Lee · Samuel Brucker · Mario Bijelic · Felix Heide · Seung-Hwan Baek
Achieving robust stereo 3D imaging under diverse illumination conditions is an important yet challenging task, largely due to the limited dynamic ranges (DRs) of cameras, which are significantly smaller than real-world DR. As a result, the accuracy of existing stereo depth estimation methods is often compromised by under- or over-exposed images. In this work, we introduce dual-exposure stereo for extended dynamic range 3D imaging. We develop an automatic dual-exposure control method that adjusts the two exposures, diverging them when the scene DR exceeds the camera DR, thereby capturing information over a broader DR. From the captured dual-exposure stereo images, we estimate depth with a motion-aware dual-exposure stereo depth network. To validate our proposed method, we develop a robot-vision system, collect real-world stereo video datasets, and generate a synthetic dataset. Our approach outperforms traditional exposure control and depth estimation methods.
Adapting Dense Matching for Homography Estimation with Grid-based Acceleration
Kaining Zhang · Yuxin Deng · Jiayi Ma · Paolo Favaro
Current deep homography estimation methods are constrained to processing low-resolution image pairs due to network architecture and computational limitations. For high-resolution images, downsampling is often required, which can greatly degrade estimation accuracy. In contrast, traditional methods, which match pixels and compute the homography from correspondences, offer greater resolution flexibility. In this work, we therefore return to the correspondence-based formulation of homography estimation. Specifically, we propose GFNet, a Grid Flow regression Network that adapts the high-accuracy dense matching framework for homography estimation and improves efficiency with a grid-based strategy: exploiting the global smoothness of a homography, flow is estimated only over a coarse grid. We demonstrate the effectiveness of GFNet through a wide range of experiments on multiple datasets, including the common-scene MSCOCO, the multimodal VIS-IR and GoogleMap datasets, and the dynamic-scene VIRAT. On 448$\times$448 GoogleMap, GFNet achieves an improvement of +9.9\% in AUC@3 while reducing MACs by $\sim$47\% compared to the SOTA dense matching method. Additionally, it shows a 1.7$\times$ improvement in AUC@3 over the SOTA deep homography method.
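To illustrate the final step of a correspondence-based pipeline like GFNet's, the sketch below recovers a homography from grid-point flow with the standard DLT; the grid/flow interface is assumed, not taken from the paper.

```python
import numpy as np

def homography_from_grid_flow(grid_pts, flow):
    """Recover a homography from flow estimated on a coarse grid (DLT).

    grid_pts: (N, 2) source grid coordinates; flow: (N, 2) estimated displacements,
    so target points are grid_pts + flow. Requires N >= 4 non-degenerate points.
    """
    src = grid_pts.astype(np.float64)
    dst = src + flow.astype(np.float64)

    # Build the standard 2N x 9 DLT system A h = 0.
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows)

    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                      # normalize so H[2, 2] = 1
```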
ProtoDepth: Unsupervised Continual Depth Completion with Prototypes
Patrick Rim · Hyoungseob Park · Suchisrit Gangopadhyay · Ziyao Zeng · Younjoon Chung · Alex Wong
We present ProtoDepth, a novel prototype-based approach for continual learning of unsupervised depth completion, the multimodal 3D reconstruction task of predicting dense depth maps from RGB images and sparse point clouds. The unsupervised learning paradigm is well-suited for continual learning, as ground truth is not needed. However, when training on new non-stationary distributions, depth completion models will catastrophically forget previously learned information. We address forgetting by learning prototype sets that adapt the latent features of a frozen pretrained model to new domains. Since the original weights are not modified, ProtoDepth does not forget when test-time domain identity is known. To extend ProtoDepth to the challenging setting where the test-time domain identity is withheld, we propose to learn domain descriptors that enable the model to select the appropriate prototype set for inference. We evaluate ProtoDepth on benchmark dataset sequences, where we reduce forgetting compared to baselines by 52.2% for indoor and 53.2% for outdoor to achieve the state of the art.
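A toy sketch of how a prototype set could be selected from learned domain descriptors at test time when the domain identity is withheld; nearest-descriptor matching by cosine similarity is an assumption for illustration, not necessarily ProtoDepth's exact rule.

```python
import torch

def select_prototype_set(test_feature, domain_descriptors, prototype_sets):
    """Pick the prototype set whose domain descriptor best matches the test sample.

    test_feature: (D,) latent feature of the test sample (from the frozen encoder).
    domain_descriptors: (K, D) one learned descriptor per training domain.
    prototype_sets: list of K prototype tensors, one per domain.
    """
    # Cosine similarity between the test feature and every domain descriptor.
    sims = torch.nn.functional.cosine_similarity(
        test_feature[None, :], domain_descriptors, dim=1)
    domain_id = int(torch.argmax(sims))
    return prototype_sets[domain_id], domain_id
```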
VGGT: Visual Geometry Grounded Transformer
Jianyuan Wang · Minghao Chen · Nikita Karaev · Andrea Vedaldi · Christian Rupprecht · David Novotny
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, such as camera poses, point maps, depth maps, and 3D point tracks, from a few or even hundreds of its views. Unlike recent alternatives, VGGT does not need visual geometry optimization techniques to refine the results in post-processing, obtaining all quantities of interest directly. This approach is simple and more efficient, reconstructing hundreds of images in seconds. We train VGGT on a large number of publicly available datasets with 3D annotations and demonstrate its ability to achieve state-of-the-art results in multiple 3D tasks, including camera pose estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. This is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. We extensively evaluate our method on unseen datasets to demonstrate its superior performance. We will release the code and trained model.
DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion
Qitao Zhao · Amy Lin · Jeff Tan · Jason Y. Zhang · Deva Ramanan · Shubham Tulsiani
Current Structure-from-Motion (SfM) methods often adopt a two-stage pipeline involving learned or geometric pairwise reasoning followed by a global optimization. We instead propose a data-driven multi-view reasoning approach that directly infers cameras and 3D geometry from multi-view images. Our proposed framework, DiffusionSfM, parametrizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame, and learns a transformer-based denoising diffusion model to predict these from multi-view input. We develop mechanisms to overcome practical challenges in training diffusion models with missing data and unbounded scene coordinates, and demonstrate that DiffusionSfM allows accurate prediction of 3D and cameras. We empirically validate our approach on challenging real world data and find that DiffusionSfM improves over prior classical and learning-based methods, while also naturally modeling uncertainty and allowing external guidance to be incorporated in inference.
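For intuition on the ray origin-and-endpoint parameterization, the sketch below converts a depth map and camera pose into per-pixel origins and endpoints in a global frame; names and conventions are illustrative, not the paper's code.

```python
import numpy as np

def rays_from_depth(depth, K, cam_to_world):
    """Convert a depth map plus camera pose into per-pixel ray origins and endpoints.

    depth: (H, W); K: (3, 3) intrinsics; cam_to_world: (4, 4) camera-to-world pose.
    Returns origins, endpoints of shape (H, W, 3) in a global (world) frame.
    """
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=np.float64)], axis=-1)

    # Back-project pixels to camera-space points at the given depth.
    cam_pts = (pix @ np.linalg.inv(K).T) * depth[..., None]

    # Move to world coordinates; origins are the camera center for every pixel.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    endpoints = cam_pts @ R.T + t
    origins = np.broadcast_to(t, endpoints.shape).copy()
    return origins, endpoints
```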
Descriptor-In-Pixel : Point-Feature Tracking For Pixel Processor Arrays
Laurie Bose · Piotr Dudek · Jianing Chen
This paper presents a novel approach for joint point-feature detection and tracking, specifically designed for Pixel Processor Array (PPA) sensors. Instead of standard pixels, PPA sensors consist of thousands of "pixel-processors", enabling massively parallel computation on visual data at the point of light capture. Our approach performs all computation within these pixel-processors, meaning no raw image data need ever leave the sensor. Instead, sensor output can be reduced to merely the locations of tracked features and the descriptors of newly initialized features, minimizing data transfer between the sensor and external processing. To achieve this, we store feature descriptors inside every pixel-processor, adjusting the layout of these descriptors every frame. The PPA's architecture enables us to compute the response of every stored descriptor in parallel. This "response map" is used for both detection and tracking of point-features across the pixel-processor array. The approach is very fast: our implementation on the SCAMP-7 PPA prototype runs at over 3000 FPS (frames per second), tracking point-features reliably even under violent motion. This is the first work performing point-feature detection and tracking entirely "in-pixel".
ScaleLSD: Scalable Deep Line Segment Detection Streamlined
Zeran Ke · Bin Tan · Xianwei Zheng · Yujun Shen · Tianfu Wu · Nan Xue
This paper studies Line Segment Detection (LSD) for characterizing line geometry in images, with the aim of learning a domain-agnostic, robust LSD model that works well on any natural image. Focusing on scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for curating line geometry at scale from over 10M unlabeled real-world images. ScaleLSD detects substantially more line segments from natural images than even the pioneering non-deep LSD approach, yielding a more complete and accurate geometric characterization of images by line segments. Experimentally, ScaleLSD is comprehensively evaluated under the zero-shot protocol on detection performance, single-view 3D geometry estimation, two-view line segment matching, and multi-view 3D line mapping, with excellent performance throughout. Based on this thorough evaluation, ScaleLSD is observed to be the first deep approach that outperforms the pioneering non-deep LSD in all aspects we have tested, significantly expanding and reinforcing the versatility of the line geometry of images.
MEGA: Masked Generative Autoencoder for Human Mesh Recovery
Guénolé Fiche · Simon Leglaive · Xavier Alameda-Pineda · Francesc Moreno-Noguer
Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches. Code and trained models will be released upon acceptance.
Reconstructing Humans with a Biomechanically Accurate Skeleton
Yan Xia · Xiaowei Zhou · Etienne Vouga · Qixing Huang · Georgios Pavlakos
In this paper, we introduce a method for reconstructing humans in 3D from a single image using a biomechanically accurate skeleton model. To achieve this, we train a transformer that takes an image as input and estimates the parameters of the model. Due to the lack of training data for this task, we build a pipeline to generate pseudo ground truth data and implement a training procedure that iteratively refines these pseudo labels for improved accuracy. Compared to state-of-the-art methods in 3D human pose estimation, our model achieves competitive performance on standard benchmarks, while it significantly outperforms them in settings with extreme 3D poses and viewpoints. This result highlights the benefits of using a biomechanical skeleton with realistic degrees of freedom for robust pose estimation. Additionally, we show that previous models frequently violate joint angle limits, leading to unnatural rotations. In contrast, our approach leverages the biomechanically plausible degrees of freedom leading to more realistic joint rotation estimates. We validate our approach across multiple human pose estimation benchmarks. We will make all code, models and data publicly available upon publication.
EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching
Dongki Jung · Jaehoon Choi · Yonghan Lee · Somi Jeong · Taejae Lee · Dinesh Manocha · Suyong Yeon
We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), specifically designed for omnidirectional images. Equirectangular projection (ERP) images, with their large fields of view, are particularly suited for dense matching techniques that aim to establish comprehensive correspondences across images. However, ERP images are subject to significant distortions, which we address by leveraging the spherical camera model and geodesic flow refinement in the dense matching method. To further mitigate these distortions, we propose spherical positional embeddings based on 3D Cartesian coordinates of the feature grid. Additionally, our method incorporates bidirectional transformations between spherical and Cartesian coordinate systems during refinement, utilizing a unit sphere to improve matching performance. We demonstrate that our proposed method achieves notable performance enhancements, with improvements of +26.72 and +42.62 in AUC@5° on the Matterport3D and Stanford2D3D datasets. Project page: https://anonymous000edm.github.io/
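A small sketch of the geometric idea behind spherical positional embeddings: mapping an ERP feature grid onto unit-sphere 3D coordinates; the exact embedding used in EDM may differ.

```python
import numpy as np

def erp_grid_to_unit_sphere(height, width):
    """Map an equirectangular (ERP) feature grid to 3D points on the unit sphere.

    Each pixel (row, col) corresponds to a latitude/longitude pair; returning the
    Cartesian coordinates gives a distortion-aware signal for positional embeddings.
    """
    rows = (np.arange(height) + 0.5) / height          # [0, 1)
    cols = (np.arange(width) + 0.5) / width            # [0, 1)
    lat = (0.5 - rows) * np.pi                         # +pi/2 at the top, -pi/2 at the bottom
    lon = (cols - 0.5) * 2.0 * np.pi                   # -pi .. pi

    lat, lon = np.meshgrid(lat, lon, indexing="ij")
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)                # (H, W, 3) unit vectors
```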
Feat2GS: Probing Visual Foundation Models with Gaussian Splatting
Yue Chen · Xingyu Chen · Anpei Chen · Gerard Pons-Moll · Yuliang Xiu
Given that visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images, a natural question arises: how well do they understand the 3D world? With the differences in architecture and training protocols (i.e., objectives, proxy tasks), a unified framework to fairly and comprehensively probe their 3D awareness is urgently needed. Existing works on 3D probing suggest single-view 2.5D estimation (e.g., depth and normal) or two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately, these tasks ignore texture awareness and require 3D data as ground truth, which limits the scale and diversity of their evaluation sets. To address these issues, we introduce Feat2GS, which reads out 3D Gaussian attributes from VFM features extracted from unposed images. This allows us to probe 3D awareness for geometry and texture without requiring 3D data. Additionally, the disentanglement of 3DGS parameters, geometry (x, α, Σ) and texture (c), enables separate analysis of texture and geometry awareness. Under Feat2GS, we conduct extensive experiments to probe the 3D awareness of several VFMs and investigate the ingredients that lead to a 3D-aware VFM. Building on these findings, we develop several variants that achieve state-of-the-art performance across diverse datasets. This makes Feat2GS useful for probing VFMs, and a simple-yet-effective baseline for novel-view synthesis. Code and data will be made available.
FG^2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching
Zimin Xia · Alex Alahi
We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points to a Bird's-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state-of-the-art, our method reduces the mean localization error by 42% on the VIGOR dataset. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the ground truth camera pose.
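The relative-pose step can be illustrated with a standard weighted 2D Procrustes (Kabsch) solve on matched BEV points, which yields a 3-DoF pose (yaw plus planar translation); this is a generic solver, not the paper's implementation.

```python
import numpy as np

def procrustes_2d(ground_pts, aerial_pts, weights=None):
    """Weighted rigid (rotation + translation) alignment of matched 2D point sets.

    ground_pts, aerial_pts: (N, 2) matched BEV points; weights: optional (N,).
    Returns R (2x2) and t (2,) such that aerial ≈ ground @ R.T + t, i.e. the
    3-DoF pose of the ground view expressed in the aerial frame.
    """
    if weights is None:
        weights = np.ones(len(ground_pts))
    w = weights / weights.sum()

    # Subtract weighted centroids.
    mu_g = (w[:, None] * ground_pts).sum(axis=0)
    mu_a = (w[:, None] * aerial_pts).sum(axis=0)
    G, A = ground_pts - mu_g, aerial_pts - mu_a

    # SVD of the weighted cross-covariance gives the optimal rotation (Kabsch).
    H = (w[:, None] * G).T @ A
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_a - R @ mu_g
    return R, t
```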
Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent
Philip Doldo · Derek Everett · Amol Khanna · Andre T Nguyen · Edward Raff
Projected Gradient Descent (PGD) under the $L_\infty$ ball has become one of the de facto methods used in adversarial robustness evaluation for computer vision (CV) due to its reliability and efficacy, making it a strong and easy-to-implement iterative baseline. However, PGD is computationally demanding to apply, especially when thousands of iterations are the current best-practice recommendation to generate an adversarial example for a single image. In this work, we introduce a simple novel method for early termination of PGD based on cycle detection, exploiting the geometry of how PGD is implemented in practice, and show that it can produce large speedup factors while providing the *exact* same estimate of model robustness as standard PGD. This method substantially speeds up PGD without sacrificing any attack strength, enabling evaluations of robustness that were previously computationally intractable.
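A hedged sketch of the core idea, early termination of $L_\infty$ PGD once an iterate repeats; the hash-based cycle test here is a generic stand-in, not necessarily the paper's exact detection mechanism.

```python
import torch

def pgd_with_cycle_detection(model, x, y, eps=8/255, alpha=2/255, max_iters=1000):
    """L-infinity PGD that terminates early once an iterate repeats (a cycle).

    Sign-gradient steps followed by projection can enter short cycles; once the
    same perturbation reappears, further iterations cannot change the outcome.
    """
    loss_fn = torch.nn.CrossEntropyLoss()
    delta = torch.zeros_like(x)
    seen = set()

    for _ in range(max_iters):
        delta.requires_grad_(True)
        loss = loss_fn(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]

        with torch.no_grad():
            delta = delta + alpha * grad.sign()          # ascent step on the loss
            delta = delta.clamp(-eps, eps)               # project into the L-inf ball
            delta = (x + delta).clamp(0.0, 1.0) - x      # keep x + delta a valid image

            # Cheap cycle test: hash the quantized iterate and stop on a repeat.
            key = delta.mul(1e6).round().to(torch.int64).cpu().numpy().tobytes()
            if key in seen:
                break
            seen.add(key)
    return (x + delta).detach()
```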
Self-Supervised Spatial Correspondence Across Modalities
Ayush Shrivastava · Andrew Owens
We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model finds which pairs of pixels within them correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle consistent feature representations for cross-modal and intra-modal matching. The resulting model is simple and has no explicit photoconsistency assumptions. It can be trained entirely using unlabeled data, without need for any spatially aligned multimodal image pairs. We evaluate our method on challenging RGB-to-depth and RGB-to-thermal matching problems (and vice versa), where we find that it obtains strong performance.
RDD: Robust Feature Detector and Descriptor using Deformable Transformer
Gonglin Chen · Tianwen Fu · Haiwei Chen · Wenbin Teng · Hanyuan Xiao · Yajie Zhao
As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present RDD, a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we introduce a novel refinement module for semi-dense matching, which does not rely on fine-level features. Our proposed methods outperform all state-of-the-art keypoint detection/description methods in feature matching, pose estimation, and visual localization tasks. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being a novel Air-to-Ground benchmark — an evaluation setting which has gained popularity in recent years for 3D reconstruction at different altitudes. Our code and benchmarks will be released upon acceptance of the paper.
Dense-SfM: Structure from Motion with Dense Consistent Matching
JongMin Lee · Sungjoo Yoo
We present Dense-SfM, a novel Structure from Motion (SfM) framework designed for dense and accurate 3D reconstruction from multi-view images. Sparse keypoint matching, which traditional SfM methods often rely on, limits both accuracy and point density, especially in texture-less areas. Dense-SfM addresses this limitation by integrating dense matching with a Gaussian Splatting (GS) based track extension which gives more consistent, longer feature tracks. To further improve reconstruction accuracy, Dense-SfM is equipped with a multi-view kernelized matching module leveraging transformer and Gaussian Process architectures, for robust track refinement across multi-views. Evaluations on the ETH3D and Texture-Poor SfM datasets show that Dense-SfM offers significant improvements in accuracy and density over state-of-the-art methods.
HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery
Yuto Matsubara · Ko Nishino
We introduce a novel method for human shape and pose recovery that can fully leverage multiple static views. We target fixed-multiview people monitoring, including elderly care and safety monitoring, in which calibrated cameras can be installed at the corners of a room or an open space but whose configuration may vary depending on the environment. Our key idea is to formulate this as neural optimization. We achieve this with HeatFormer, a neural optimizer that iteratively refines the SMPL parameters given multiview images and is fundamentally agnostic to the configuration of views. HeatFormer realizes this SMPL parameter estimation as heatmap generation and alignment with a novel transformer encoder and decoder. We demonstrate the effectiveness of HeatFormer, including its accuracy, robustness to occlusion, and generalizability, through an extensive set of experiments. We believe HeatFormer can play a key role in passive human behavior modeling.
DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction
Ben Kaye · Tomas Jakab · Shangzhe Wu · Christian Rupprecht · Andrea Vedaldi
The choice of data representation is a key factor in the success of deep learning in geometric tasks. For instance, DUSt3R has recently introduced the concept of viewpoint-invariant point maps, generalizing depth prediction and showing that all the key problems in the 3D reconstruction of static scenes can be reduced to predicting such point maps. In this paper, we develop an analogous concept for a very different problem: the reconstruction of the 3D shape and pose of deformable objects. To this end, we introduce Dual Point Maps (DualPM), where a pair of point maps is extracted from the same image, one associating pixels to their 3D locations on the object, and the other to a canonical version of the object at rest pose. We also extend point maps to amodal reconstruction, seeing through self-occlusions to obtain the complete shape of the object. We show that 3D reconstruction and 3D pose estimation reduce to the prediction of DualPMs. We demonstrate empirically that this representation is a good target for a deep network to predict; specifically, we consider modeling horses, showing that DualPMs can be trained purely on 3D synthetic data consisting of a single model of a horse, while generalizing very well to real images. With this, we improve on previous methods for the 3D analysis and reconstruction of this type of object by a large margin.
iG-6DoF: Model-free 6DoF Pose Estimation for Unseen Object via Iterative 3D Gaussian Splatting
Tuo Cao · Fei LUO · Jiongming Qin · Yu Jiang · Yusen Wang · Chunxia Xiao
Traditional pose estimation methods often rely on precise 3D models or additional data such as depth and normals, limiting their generalization, especially when objects undergo large translations or rotations. We propose iG-6DoF, a novel model-free 6D pose estimation method based on iterative 3D Gaussian Splatting (3DGS) for unseen objects. It first estimates an initial pose by leveraging multi-scale data augmentation and rotation-equivariant features to form a better pose hypothesis from a set of candidates. Then, an iterative 3DGS procedure repeatedly renders the object and compares the rendered image with the input image to progressively improve pose estimation accuracy. The proposed end-to-end network consists of an object detector, a multi-scale rotation-equivariant-feature-based initial pose estimator, and a coarse-to-fine pose refiner. This combination allows our method to focus on the target object in a complex scene and to handle large motion and weak textures. We conduct extensive experiments on benchmark and real datasets. Our method achieves state-of-the-art results on the LINEMOD, OnePose-LowTexture, and GenMOP datasets, demonstrating its strong generalization to unseen objects and robustness across various scenes.
RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects
Jaeguk Kim · Jaewoo Park · Keuntek Lee · Nam Ik Cho
Estimating the 6D pose of unseen objects from monocular RGB images remains a challenging problem, especially due to the lack of prior object-specific knowledge. To tackle this issue, we propose RefPose, an innovative approach to object pose estimation that leverages a reference image and geometric correspondence as guidance. RefPose first predicts an initial pose by using object templates to render the reference image and establish the geometric correspondence needed for the refinement stage. During the refinement stage, RefPose estimates the geometric correspondence of the query based on the generated references and iteratively refines the pose through a render-and-compare approach. To enhance this estimation, we introduce a correlation volume-guided attention mechanism that effectively captures correlations between the query and reference images. Unlike traditional methods that depend on pre-defined object models, RefPose dynamically adapts to new object shapes by leveraging a reference image and geometric correspondence. This results in robust performance across previously unseen objects. Extensive evaluation on the BOP benchmark datasets shows that RefPose achieves state-of-the-art results while maintaining a competitive runtime.
One2Any: One-Reference 6D Pose Estimation for Any Object
Mengya Liu · Siyuan Li · Ajad Chhatkuli · Prune Truong · Luc Van Gool · Federico Tombari
6D object pose estimation remains challenging for many applications due to dependencies on complete 3D models, multi-view images, or training limited to specific object categories. These requirements make generalization to novel objects difficult for which neither 3D models nor multi-view images may be available. To address this, we propose a novel method One2Any that estimates the relative 6-degrees of freedom (DOF) object pose using only a single reference-single query RGB-D image, without prior knowledge of its 3D model, multi-view data, or category constraints. We treat object pose estimation as an encoding-decoding process: first, we obtain a comprehensive Reference Object Pose Embedding (ROPE) that encodes an object’s shape, orientation, and texture from a single reference view. Using this embedding, a U-Net-based pose decoding module produces Reference Object Coordinate (ROC) for new views, enabling fast and accurate pose estimation. This simple encoding-decoding framework allows our model to be trained on any pair-wise pose data, enabling large-scale training and demonstrating great scalability. Experiments on multiple benchmark datasets demonstrate that our model generalizes well to novel objects, achieving state-of-the-art accuracy and robustness even rivaling methods that require multi-view or CAD inputs, at a fraction of compute.
Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
Leonhard Sommer · Olaf Dünkel · Christian Theobalt · Adam Kortylewski
3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category. Given a single test image, 3DMMs can be used to solve various tasks, such as predicting the 3D shape, pose, semantic correspondence, and instance segmentation of an object. Unfortunately, 3DMMs are only available for very few object categories that are of particular interest, like faces or human bodies, as they require a demanding 3D data acquisition and category-specific training process. In contrast, we introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self-supervised manner from a collection of object-centric videos. For this purpose, our model represents objects as a learned 3D template mesh and a deformation field that is parameterized as an image-conditioned neural network. Different from prior works, Common3D represents the object appearance with neural features instead of RGB colors, which enables the learning of more generalizable representations through an abstraction from pixel intensities. Importantly, we train the appearance features using a contrastive objective by exploiting the correspondences defined through the deformable template mesh. This leads to higher quality correspondence features compared to related works and a significantly improved model performance at estimating 3D object pose and semantic correspondence. Common3D is the first completely self-supervised method that can solve various vision tasks in a zero-shot manner.
ESCAPE: Equivariant Shape Completion via Anchor Point Encoding
Burak Bekci · Nassir Navab · Federico Tombari · Mahdi Saleh
Shape completion, a crucial task in 3D computer vision, involves predicting and filling in the missing regions of scanned or partially observed objects. Current methods expect a known pose or canonical coordinates and do not perform well under varying rotations, limiting their real-world applicability. We introduce \textbf{ESCAPE} (Equivariant Shape Completion via Anchor Point Encoding), a novel framework designed to achieve rotation-equivariant shape completion. Our approach employs a distinctive encoding strategy: it selects anchor points from a shape and represents every point by its distances to these anchor points. This enables the model to capture a consistent, rotation-equivariant understanding of the object’s geometry. ESCAPE leverages a transformer architecture to encode and decode the distance transformations, ensuring that generated shape completions remain accurate and equivariant under rotational transformations. Subsequently, we perform optimization to recover the predicted shapes from the encodings. Experimental evaluations demonstrate that ESCAPE achieves robust, high-quality reconstructions under arbitrary rotations and translations, showcasing its effectiveness in real-world applications without additional pose estimation modules.
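A minimal sketch of an anchor-point distance encoding: since pairwise distances are preserved by rigid transforms, the per-point distance matrix is invariant to rotation and translation. Farthest-point anchor selection here is an assumption for illustration; the paper's anchor choice may differ.

```python
import numpy as np

def anchor_distance_encoding(points, num_anchors=8, seed=0):
    """Encode a point cloud by per-point distances to a set of anchor points.

    points: (N, 3) array. Returns an (N, num_anchors) distance matrix that is
    unchanged by any rigid transform applied to the whole cloud.
    """
    rng = np.random.default_rng(seed)
    anchors = [points[rng.integers(len(points))]]
    for _ in range(num_anchors - 1):
        d = np.min(
            np.linalg.norm(points[:, None, :] - np.asarray(anchors)[None], axis=-1),
            axis=1)
        anchors.append(points[np.argmax(d)])     # farthest point from current anchors
    anchors = np.asarray(anchors)                # (A, 3)

    # (N, A) matrix of distances: the rotation-invariant representation.
    return np.linalg.norm(points[:, None, :] - anchors[None], axis=-1)
```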
Open-World Amodal Appearance Completion
Jiayang Ao · Yanbei Jiang · Qiuhong Ke · Krista A. Ehinger
Understanding and reconstructing occluded objects is a challenging problem, especially in open-world scenarios where categories and contexts are diverse and unpredictable. Traditional methods, however, are typically restricted to closed sets of object categories, limiting their use in complex, open-world scenes. We introduce Open-World Amodal Appearance Completion, a training-free framework that expands amodal completion capabilities by accepting flexible text queries as input. Our approach generalizes to arbitrary objects specified by both direct terms and abstract queries. We term this capability reasoning amodal completion, where the system reconstructs the full appearance of the queried object based on the provided image and language query. Our framework unifies segmentation, occlusion analysis, and inpainting to handle complex occlusions and generates completed objects as RGBA elements, enabling seamless integration into applications such as 3D reconstruction and image editing. Extensive evaluations demonstrate the effectiveness of our approach in generalizing to novel objects and occlusions, establishing a new benchmark for amodal completion in open-world settings. The code and datasets will be released after paper acceptance.
Exploring Historical Information for RGBE Visual Tracking with Mamba
Chuanyu Sun · Jiqing Zhang · Yang Wang · Huilin Ge · qianchen xia · Baocai Yin · Xin Yang
Combining the advantages of conventional and event cameras for robust visual tracking has drawn extensive interest. However, existing tracking approaches rely heavily on complex cross-modal fusion modules, leading to higher computational complexity and training challenges. Moreover, these methods generally ignore the effective integration of historical information, which is crucial for grasping changes in the target's appearance and motion trends. Given the recent advancements in Mamba's long-range modeling and linear complexity, we explore its potential for addressing these issues in RGBE tracking. Specifically, we first propose an efficient Mamba-based fusion module, which utilizes a simple gate-based interaction scheme to achieve effective modality-selective fusion. This module can be seamlessly integrated into the encoding layer of prevalent Transformer-based backbones. Moreover, we present a novel historical decoder that leverages Mamba's advanced long-sequence modeling to effectively capture target appearance changes with autoregressive queries. Extensive experiments show that our proposed approach achieves state-of-the-art performance on multiple challenging short-term and long-term RGBE benchmarks. In addition, the effectiveness of each key Mamba-based component of our approach is evidenced by a thorough ablation study.
EBS-EKF: Accurate and High Frequency Event-based Star Tracking
Albert Reed · Connor Hashemi · Dennis Melamed · Nitesh Menon · Keigo Hirakawa · Scott McCloskey
Event-based sensors (EBS) are a promising new technology for star tracking due to their low latency and power efficiency, but prior work has thus far been evaluated exclusively in simulation with simplified signal models. We propose a novel algorithm for event-based star tracking, grounded in an analysis of the EBS circuit and an extended Kalman filter (EKF). We quantitatively evaluate our method using real night sky data, comparing its results with those from a space-ready active-pixel sensor (APS) star tracker. We demonstrate that our method is an order-of-magnitude more accurate than existing methods due to improved signal modeling and state estimation, while providing more frequent updates and greater motion tolerance than conventional APS trackers. We provide all code and the first dataset of events synchronized with APS solutions.
MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors
Fanqi Pu · Yifan Wang · Jiru Deng · Wenming Yang
Perspective projection has been extensively utilized in monocular 3D object detection methods. It introduces geometric priors from 2D bounding boxes and 3D object dimensions to reduce the uncertainty of depth estimation. However, due to depth errors originating from the object's visual surface, the height of the bounding box often fails to represent the actual projected central height, which undermines the effectiveness of geometric depth. Direct prediction for the projected height unavoidably results in a loss of 2D priors, while multi-depth prediction with complex branches does not fully leverage geometric depth. This paper presents a Transformer-based monocular 3D object detection method called MonoDGP, which adopts perspective-invariant geometry errors to modify the projection formula. We also try to systematically discuss and explain the mechanisms and efficacy behind geometry errors, which serve as a simple but effective alternative to multi-depth prediction. Additionally, MonoDGP decouples the depth-guided decoder and constructs a 2D decoder only dependent on visual features, providing 2D priors and initializing object queries without the disturbance of 3D detection. To further optimize and fine-tune input tokens of the transformer decoder, we also introduce a Region Segmentation Head (RSH) that generates enhanced features and segment embeddings. Our monocular method demonstrates state-of-the-art performance on the KITTI benchmark without extra data. Code is available in the supplementary materials and will be publicly released after the review.
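For context, the geometric depth relation behind such methods is the pinhole identity z = f_y * H / h; the sketch below applies it with a placeholder correction term standing in for the paper's predicted geometry error, whose exact form is not reproduced here.

```python
import numpy as np

def geometric_depth(focal_y, object_height_3d, projected_height_px, height_error_px=0.0):
    """Pinhole geometric depth from an object's 3D height and its projected height.

    z = f_y * H / h. The abstract argues that the 2D box height is a biased
    estimate of the true projected central height; height_error_px is a
    hypothetical correction term standing in for a predicted geometry error.
    """
    corrected_height = projected_height_px + height_error_px
    return focal_y * object_height_3d / np.maximum(corrected_height, 1e-6)
```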
MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection
Rishubh Parihar · Srinjay Sarkar · Sarthak Vora · Jogendra Kundu Kundu · R. Venkatesh Babu
Current monocular 3D detectors are held back by the limited diversity and scale of real-world datasets. While data augmentation certainly helps, it is particularly difficult to generate realistic scene-aware augmented data for outdoor settings. Most current approaches to synthetic data generation focus on realistic object appearance through improved rendering techniques. However, we show that where and how objects are positioned is just as crucial for training effective 3D monocular detectors. The key obstacle lies in automatically determining realistic object placement parameters, including position, dimensions, and directional alignment, when introducing synthetic objects into actual scenes. To address this, we introduce MonoPlace3D, a novel system that considers the 3D scene content to create realistic augmentations. Specifically, given a background scene, MonoPlace3D learns a distribution over plausible 3D bounding boxes. Subsequently, we render realistic objects and place them according to locations sampled from the learned distribution. Our comprehensive evaluation on two standard datasets, KITTI and NuScenes, demonstrates that MonoPlace3D significantly improves the accuracy of multiple existing monocular 3D detectors while being highly data-efficient.
4Deform: Neural Surface Deformation for Robust Shape Interpolation
Lu Sang · Zehranaz Canfes · Dongliang Cao · Riccardo Marin · Florian Bernard · Daniel Cremers
Generating realistic intermediate shapes between non-rigidly deformed shapes is a challenging task in computer vision, especially with unstructured data (e.g., point clouds), where temporal consistency across frames is lacking and topologies are changing. Most interpolation methods are designed for structured data (i.e., meshes) and do not apply to real-world point clouds. In contrast, our approach leverages a neural implicit representation (NIR) to enable shape deformation with free topology changes. Unlike previous mesh-based methods, which learn vertex-based deformation fields, our method learns a continuous velocity field in Euclidean space, making it suitable for less structured data such as point clouds. Additionally, our method does not require intermediate-shape supervision during training; instead, we incorporate physical and geometrical constraints to regularize the velocity field. We reconstruct intermediate surfaces using a modified level-set equation, directly linking our NIR with the velocity field. Experiments show that our method significantly outperforms previous NIR approaches across various scenarios (e.g., noisy, partial, topology-changing, non-isometric shapes) and, for the first time, enables new applications such as 4D Kinect sequence upsampling and real-world high-resolution mesh deformation.
Toward Robust Neural Reconstruction from Sparse Point Sets
Amine Ouasfi · Shubhendu Jena · Eric Marchand · Adnane Boukhayma
We consider the challenging problem of learning Signed Distance Functions (SDFs) from sparse and noisy 3D point clouds. In contrast to recent methods that depend on smoothness priors, our method, rooted in a distributionally robust optimization (DRO) framework, incorporates a regularization term that leverages samples from the uncertainty regions of the model to improve the learned SDFs. Thanks to tractable dual formulations, we show that this framework enables a stable and efficient optimization of SDFs in the absence of ground-truth supervision. Using a variety of synthetic and real data evaluations from different modalities, we show that our DRO-based learning framework can improve SDF learning with respect to baselines and the state of the art.
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
Ruicheng Wang · Sicheng Xu · Cassie Lee Dai · Jianfeng XIANG · Yu Deng · Xin Tong · Jiaolong Yang
We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervision techniques that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view.
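A simple closed-form scale-and-shift alignment of an affine-invariant point map to metric ground truth, of the kind such supervision relies on; this plain least-squares version is illustrative and not the paper's robust, optimal solver.

```python
import numpy as np

def align_scale_shift(pred_points, gt_points, mask=None):
    """Closed-form alignment of an affine-invariant prediction to metric ground truth.

    Solves min over scalar s and 3D shift t of sum_i ||s * p_i + t - q_i||^2.
    pred_points, gt_points: (N, 3); mask: optional boolean (N,) of valid pixels.
    """
    if mask is not None:
        pred_points, gt_points = pred_points[mask], gt_points[mask]

    mu_p = pred_points.mean(axis=0)
    mu_q = gt_points.mean(axis=0)
    p_c = pred_points - mu_p
    q_c = gt_points - mu_q

    s = (p_c * q_c).sum() / (p_c * p_c).sum()            # optimal scalar scale
    t = mu_q - s * mu_p                                   # optimal 3D shift
    return s, t
```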
ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points
Qirui Huang · Runze Zhang · Kangjun Liu · Minglun Gong · Hao Zhang · Hui Huang
We introduce ArcPro, a novel learning framework built on architectural programs to recover structured 3D abstractions from highly sparse and low-quality point clouds. Specifically, we design a domain-specific language (DSL) to hierarchically represent building structures as a program, which can be efficiently converted into a mesh. We bridge feedforward and inverse procedural modeling by using a feedforward process for training data synthesis, allowing the network to make reverse predictions. We train an encoder-decoder on the points-program pairs to establish a mapping from unstructured point clouds to architectural programs, where a 3D convolutional encoder extracts point cloud features and a transformer decoder autoregressively predicts the programs in a tokenized form. Inference by our method is highly efficient and produces plausible and faithful 3D abstractions. Comprehensive experiments demonstrate that ArcPro outperforms both traditional architectural proxy reconstruction and learning-based abstraction methods. We further explore its potential when working with multi-view image and natural language inputs.
ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
Johan Edstedt · André Mateus · Alberto Jaenal
Structure-from-Motion (SfM) is the task of estimating 3D structure and camera poses from images. We define Collaborative SfM (ColabSfM) as sharing distributed SfM reconstructions. Sharing maps requires estimating a joint reference frame, which is typically referred to as registration. However, there is a lack of scalable methods and training datasets for registering SfM reconstructions. In this paper, we tackle this challenge by proposing the scalable task of point cloud registration for SfM reconstructions. We find that current registration methods cannot register SfM point clouds when trained on existing datasets. To this end, we propose a SfM registration dataset generation pipeline, leveraging partial reconstructions from synthetically generated camera trajectories for each scene. Finally, we propose a simple but impactful neural refiner on top of the SotA registration method RoITr that yields significant improvements, which we call RefineRoITr. Our extensive experimental evaluation shows that our proposed pipeline and model enable ColabSfM. Code and data will be made available.
MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning
Xu Han · Yuan Tang · Jinfeng Xu · Xianzhi Li
We introduce Monarch Sparse Tuning (MoST), the first reparameterization-based parameter-efficient fine-tuning (PEFT) method for 3D representation learning. Unlike existing adapter-based and prompt-tuning 3D PEFT methods, MoST introduces no additional inference overhead and is compatible with any 3D representation learning backbone. At its core, we tailor a new family of structured matrices for 3D point clouds, Point Monarch, which can capture local geometric features of irregular points while offering high expressiveness. MoST reparameterizes the dense update weight matrices as our sparse Point Monarch matrices, significantly reducing parameters while retaining strong performance. Experiments on various backbones show that MoST is simple, effective, and highly generalizable. It captures local features in point clouds, achieving state-of-the-art results on multiple benchmarks, such as 97.5% accuracy on ScanObjectNN (PB50RS) and 96.2% on ModelNet40 classification, and it can also be combined with other matrix decompositions (e.g., low-rank, Kronecker) to further reduce parameters.
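For readers unfamiliar with Monarch-structured matrices, the sketch below applies a generic Monarch-style product, two block-diagonal factors interleaved with a stride permutation; it is an illustrative sketch and not the paper's Point Monarch parameterization.

```python
import torch

def monarch_matvec(x, blocks1, blocks2):
    """Apply a Monarch-style structured matrix to x.

    A Monarch matrix on dimension n = m * m is a product of two block-diagonal
    factors interleaved with a fixed stride permutation, so it stores roughly
    2 * m^3 = 2 * n^1.5 parameters instead of n^2 for a dense matrix.

    x: (..., m*m); blocks1, blocks2: (m, m, m), i.e. m blocks of size m x m each.
    """
    m = blocks1.shape[0]
    h = x.reshape(*x.shape[:-1], m, m)                      # split into m groups of size m
    h = torch.einsum("bij,...bj->...bi", blocks1, h)        # first block-diagonal factor
    h = h.transpose(-1, -2)                                 # stride permutation
    h = torch.einsum("bij,...bj->...bi", blocks2, h)        # second block-diagonal factor
    h = h.transpose(-1, -2)                                 # undo the permutation
    return h.reshape(*x.shape[:-1], m * m)
```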
Flash3D: Super-scaling Point Transformers through Joint Hardware-Geometry Locality
Liyan Chen · Gregory P. Meyer · Zaiwei Zhang · Eric M. Wolff · Paul Vernaza
Recent efforts recognize the power of scale in 3D learning (e.g. PTv3) and attention mechanisms (e.g. FlashAttention). However, current point cloud backbones fail to holistically unify geometric locality, attention mechanisms, and GPU architectures in one view. In this paper, we introduce $\textbf{Flash3D Transformer}$, which aligns geometric locality and GPU tiling through a principled locality mechanism based on Perfect Spatial Hashing (PSH). The common alignment with GPU tiling naturally fuses our PSH locality mechanism with FlashAttention at negligible extra cost. This mechanism affords flexible design choices throughout the backbone that result in superior downstream task results. Flash3D outperforms state-of-the-art PTv3 results on benchmark datasets, delivering a 2.25x speed increase and 2.4x memory efficiency boost. This efficiency enables scaling to wider attention scopes and larger models without additional overhead. Such scaling allows Flash3D to achieve even higher task accuracies than PTv3 under the same compute budget.
PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning
Song Wang · Xiaolu Liu · Lingdong Kong · Jianyun Xu · Chunyong Hu · Gongfan Fang · Wentong Li · Jianke Zhu · Xinchao Wang
Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. The experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Code will be publicly available.
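A minimal LoRA linear layer of the kind embedded in a transformer's parameter-heavy projections; the module name and the scaling convention are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: W x + (B A) x.

    Only A and B (rank r) are trained; at deployment the update can be merged
    into W, so no extra inference cost is incurred.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                 # common LoRA scaling convention

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```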
MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing
Feifei Shao · Ping Liu · Zhao Wang · Yawei Luo · Hongwei Wang · Jun Xiao
Point cloud processing (PCP) encompasses tasks like reconstruction, denoising, registration, and segmentation, each often requiring specialized models to address unique task characteristics. While in-context learning (ICL) has shown promise across tasks by using a single model with task-specific demonstration prompts, its application to PCP reveals significant limitations. We identify inter-task and intra-task sensitivity issues in current ICL methods for PCP, which we attribute to inflexible sampling strategies lacking context adaptation at the point and prompt levels. To address these challenges, we propose MICAS, an advanced ICL framework featuring a multi-grained adaptive sampling mechanism tailored for PCP. MICAS introduces two core components: task-adaptive point sampling, which leverages inter-task cues for point-level sampling, and query-specific prompt sampling, which selects optimal prompts per query to mitigate intra-task sensitivity. To our knowledge, this is the first approach to introduce adaptive sampling tailored to the unique requirements of point clouds within an ICL framework. Extensive experiments show that MICAS not only efficiently handles various PCP tasks but also significantly outperforms existing methods. Notably, it achieves a remarkable $4.1\%$ improvement in the part segmentation task and delivers consistent gains across various PCP applications.
LidarGait++: Learning Local Features and Size Awareness from LiDAR Point Clouds for 3D Gait Recognition
Chuanfu Shen · Rui Wang · Lixin Duan · Shiqi Yu
Point clouds have gained growing interest in gait recognition. However, current methods, which typically convert point clouds into 3D voxels, often fail to extract essential gait-specific features. In this paper, we explore gait recognition within 3D point clouds from the perspectives of architectural designs and gait representation modeling. We indicate the significance of local and body size features in 3D gait recognition and introduce S$^3$GaitNet, a novel framework combining advanced local representation learning techniques with a novel size-aware learning mechanism. Specifically, S$^3$GaitNet utilizes Set Abstraction (SA) layer and Pyramid Point Pooling (P$^3$) layer for learning locally fine-grained gait representations from 3D point clouds directly. Both the SA and P$^3$ layers can be further enhanced with size-aware learning to make the model aware of the actual size of the subjects. In the end, S$^3$GaitNet not only outperforms current state-of-the-art methods, but it also consistently demonstrates robust performance and great generalizability on two benchmarks. Our extensive experiments validate the effectiveness of size and local features in gait recognition. The code will be publicly available.
CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework
Yanlong Xu · Haoxuan Qu · Jun Liu · Wenxiao Zhang · Xun Yang
The goal of point cloud localization based on linguistic description is to identify a 3D position from a textual description in large urban environments, with potential applications in various fields, such as determining the location for vehicle pickup or goods delivery. Ideally, for a textual description and its corresponding 3D location, the objects around the 3D location would be fully described in the text. However, in practical scenarios, e.g., vehicle pickup, passengers usually describe only the most significant and nearby parts of the surroundings instead of the entire environment. In response to this $\textbf{partially relevant}$ challenge, we propose $\textbf{CMMLoc}$, an uncertainty-aware $\textbf{C}$auchy-$\textbf{M}$ixture-$\textbf{M}$odel ($\textbf{CMM}$) based framework for text-to-point-cloud $\textbf{Loc}$alization. To model the uncertain semantic relations between text and point cloud, we integrate CMM constraints as a prior during the interaction between the two modalities. We further design a spatial consolidation scheme to enable adaptive aggregation of different 3D objects with varying receptive fields. To achieve precise localization, we propose a cardinal direction integration module alongside a modality pre-alignment strategy, helping capture the spatial relationships among objects and bringing the 3D objects closer to the text modality. Comprehensive experiments validate that CMMLoc outperforms existing methods, achieving state-of-the-art results on the KITTI360Pose dataset. Codes are available in this anonymous GitHub repository https://github.com/anonymous0819/CMMLoc.
HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views
Ethan Griffiths · Maryam Haghighat · Simon Denman · Clinton Fookes · Milad Ramezani
We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based TransFormer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. Leveraging an octree-based structure, we propose a multi-scale attention mechanism that captures spatial and semantic features across granularities. To address the variable density of point distributions from common spinning lidar, we present cylindrical octree attention windows to better reflect the underlying distribution during attention. We introduce relay tokens to enable efficient global-local interactions and multi-scale representation learning at reduced computational cost. Our pyramid attentional pooling then synthesises a robust global descriptor for end-to-end place recognition in challenging environments. In addition, we introduce our novel dataset: In-house, a 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests. Point clouds in In-house contain representational gaps and distinctive attributes such as varying point densities and noise patterns, making it a challenging benchmark for cross-view localisation in the wild. Our results demonstrate that HOTFormerLoc achieves a top-1 average recall improvement of 5.5\% -- 11.5\% on the In-house benchmark. Furthermore, it consistently outperforms SOTA 3D place recognition methods, with an average performance gain of 5.8\% on well-established urban and forest datasets. The code and In-house benchmark will be made available upon acceptance.
ForestLPR: LiDAR Place Recognition in Forests Attentioning Multiple BEV Density Images
Yanqing Shen · Turcan Tuna · Marco Hutter · Cesar Cadena · Nanning Zheng
Place recognition is essential to maintain global consistency in large-scale localization systems. While research in urban environments has progressed significantly using LiDARs or cameras, applications in natural forest-like environments remain largely underexplored. Furthermore, forests present particular challenges due to high self-similarity and substantial variations in vegetation growth over time. In this work, we propose a robust LiDAR-based place recognition method for natural forests, ForestLPR. We hypothesize that a set of cross-sectional images of the forest’s geometry at different heights contains the information needed to recognize revisiting a place. The cross-sectional images are represented by bird’s-eye view (BEV) density images of horizontal slices of the point cloud at different heights. Our approach utilizes a visual transformer as the shared backbone to produce sets of local descriptors and introduces a multi-BEV interaction module to attend to information at different heights adaptively. It is followed by an aggregation layer that produces a rotation-invariant place descriptor. We evaluated the efficacy of our method extensively on real-world data from public benchmarks as well as robotic datasets and compared it against the state-of-the-art (SOTA) methods. The results indicate that ForestLPR has consistently good performance on all evaluations and achieves an average increase of 7.38\% and 9.11\% on Recall@1 over the closest competitor on intra-sequence loop closure detection and inter-sequence re-localization, respectively, validating our hypothesis.
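A small sketch of rasterizing one height slice of a LiDAR point cloud into a BEV density image, the basic ingredient of the multi-BEV input described above; grid size and extent are arbitrary illustrative values.

```python
import numpy as np

def bev_density_image(points, z_min, z_max, grid_size=0.5, extent=50.0):
    """Rasterize one horizontal slice of a point cloud into a BEV density image.

    points: (N, 3) LiDAR points; the slice keeps points with z in [z_min, z_max).
    Each BEV cell counts the points falling into it; stacking several slices at
    different heights gives a multi-BEV representation.
    """
    sl = points[(points[:, 2] >= z_min) & (points[:, 2] < z_max)]
    size = int(2 * extent / grid_size)

    ix = np.floor((sl[:, 0] + extent) / grid_size).astype(int)
    iy = np.floor((sl[:, 1] + extent) / grid_size).astype(int)
    keep = (ix >= 0) & (ix < size) & (iy >= 0) & (iy < size)

    density = np.zeros((size, size), dtype=np.float32)
    np.add.at(density, (iy[keep], ix[keep]), 1.0)    # accumulate point counts per cell
    return density / max(density.max(), 1.0)         # normalize to [0, 1]
```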
PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds
Barza Nisar · Steven L. Waslander
Self-supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry-sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. Further, no methods exist that exhibit transferable performance across different LiDAR beam patterns. We propose PSA-SSL, a novel extension to point cloud SSL that learns LiDAR pattern-agnostic and object pose and size-aware (PSA) features. Our approach defines 1) a self-supervised bounding box regression pretext task, which retains object pose and size information, and 2) a LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor-agnostic features. Our experiments demonstrate that with a single pre-trained model, our light-weight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels (up to \textbf{+2.66 mIoU} for DepthContrast and \textbf{+2.39 mIoU} for SegContrast) across popular autonomous driving datasets (Waymo, NuScenes, SemanticKitti). Moreover, our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation and shows improvements on 3D object detection when transferring to different LiDAR sensors. Our code will be released on (anonymized link).
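A toy version of a LiDAR beam-pattern augmentation: binning points by elevation angle and keeping a subset of beams to mimic a sparser sensor. This is an illustrative approximation; real implementations typically use the sensor's ring indices, and the paper's augmentation may differ.

```python
import numpy as np

def beam_pattern_augmentation(points, num_source_beams=64, num_target_beams=32):
    """Simulate a sparser LiDAR by keeping only a subset of elevation beams.

    Points are binned into beams by elevation angle, then every k-th beam is
    kept (e.g. 64 -> 32), which mimics a different sensor's scan pattern and
    encourages beam-pattern-agnostic features.
    """
    elevation = np.arctan2(points[:, 2], np.linalg.norm(points[:, :2], axis=1))

    # Quantize elevation into the source sensor's beam indices.
    lo, hi = elevation.min(), elevation.max()
    beam_idx = np.clip(
        ((elevation - lo) / (hi - lo + 1e-9) * num_source_beams).astype(int),
        0, num_source_beams - 1)

    stride = max(num_source_beams // num_target_beams, 1)
    return points[beam_idx % stride == 0]
```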
LightLoc: Learning Outdoor LiDAR Localization at Light Speed
Wen Li · Chen Liu · Shangshu Yu · dq Liu · Yin Zhou · Siqi Shen · Chenglu Wen · Cheng Wang
Scene Coordinate Regression (SCR) achieves impressive results in outdoor LiDAR localization but requires days of training. Since training needs to be repeated for each new scene, long training times make these methods impractical for applications requiring time-sensitive system upgrades, such as autonomous driving, drones, robotics, etc. We identify large coverage areas and vast amounts of data in large-scale outdoor scenes as key challenges that limit fast training. In this paper, we propose LightLoc, the first method capable of efficiently learning localization in a new scene at light speed. Beyond freezing the scene-agnostic feature backbone and training only the scene-specific prediction heads, we introduce two novel techniques to address these challenges. First, we introduce sample classification guidance to assist regression learning, reducing ambiguity from similar samples and improving training efficiency. Second, we propose a redundant sample downsampling technique to remove well-learned frames during training, reducing training time without compromising accuracy. In addition, the fast training and confidence estimation characteristics of sample classification enable its integration into SLAM, effectively eliminating error accumulation. Extensive experiments on large-scale outdoor datasets demonstrate that LightLoc achieves state-of-the-art performance with just 1 hour of training—50× faster than existing methods. Our code will be made available upon acceptance.
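The redundant sample downsampling idea can be illustrated with a small bookkeeping class; this is a hypothetical simplification (an exponential moving average of per-frame loss with a fixed threshold), not the paper's exact criterion.

```python
import numpy as np

class RedundantSampleDownsampler:
    """Drop training frames once their regression error stays consistently low."""

    def __init__(self, num_frames, threshold=0.05, momentum=0.9):
        self.ema = np.full(num_frames, np.inf)         # running loss per frame
        self.active = np.ones(num_frames, dtype=bool)  # frames still in the training pool
        self.threshold = threshold
        self.momentum = momentum

    def update(self, frame_ids, losses):
        for i, loss in zip(frame_ids, losses):
            prev = self.ema[i]
            self.ema[i] = loss if np.isinf(prev) else self.momentum * prev + (1 - self.momentum) * loss
        self.active[self.ema < self.threshold] = False  # well-learned frames leave the pool

    def remaining_frames(self):
        return np.flatnonzero(self.active)
```

Shrinking the pool as training progresses is what cuts the epoch cost without discarding frames the model still struggles on.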
No Thing, Nothing: Highlighting Safety-Critical Classes for Robust LiDAR Semantic Segmentation in Adverse Weather
Junsung Park · HwiJeong Lee · Inha Kang · Hyunjung Shim
Existing domain generalization methods for LiDAR semantic segmentation under adverse weather struggle to accurately predict "things" categories compared to "stuff" categories. In typical driving scenes, "things" categories can be dynamic and associated with higher collision risks, making them crucial for safe navigation and planning. Recognizing the importance of "things" categories, we identify their performance drop as a serious bottleneck in existing approaches. We observe that adverse weather induces both degradation of semantic-level features and corruption of local features, leading to a misprediction of "things" as "stuff". To address semantic-level feature corruption, we bind each point feature to its superclass, preventing the misprediction of "things" classes into visually dissimilar categories. Additionally, to enhance robustness against local corruption caused by adverse weather, we define each LiDAR beam as a local region and propose a regularization term that aligns the clean data with its corrupted counterpart in feature space. Our method achieves state-of-the-art performance with a +2.6 mIoU gain on the SemanticKITTI-to-SemanticSTF benchmark and +7.9 mIoU on the SemanticPOSS-to-SemanticSTF benchmark. Notably, our method achieves a +4.8 and +7.9 mIoU improvement on "things" classes, respectively, highlighting its effectiveness.
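As a rough illustration of the beam-level regularization, the sketch below averages per-point features within each beam and aligns the clean scan with its weather-corrupted copy using a cosine term; it assumes the two scans are paired point-for-point, which is a simplification of the actual setup.

```python
import torch
import torch.nn.functional as F

def beam_alignment_loss(feat_clean, feat_corrupt, beam_id, num_beams):
    """Align per-beam mean features of a clean scan and its corrupted counterpart.

    feat_clean, feat_corrupt: (N, C) per-point features from the same backbone.
    beam_id: (N,) integer beam index of each point.
    """
    one_hot = F.one_hot(beam_id, num_beams).float()        # (N, B)
    counts = one_hot.sum(dim=0).clamp(min=1).unsqueeze(1)  # points per beam, (B, 1)
    mean_clean = one_hot.t() @ feat_clean / counts         # (B, C) beam-averaged features
    mean_corrupt = one_hot.t() @ feat_corrupt / counts
    return 1.0 - F.cosine_similarity(mean_clean, mean_corrupt, dim=1).mean()
```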
RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network
Van-Tin Luu · Yong-Lin Cai · Vu-Hoang Tran · Wei-Chen Chiu · Yi-Ting Chen · Ching-Chun Huang
This paper presents a groundbreaking approach - the first online automatic geometric calibration method for radar and camera systems. Given the significant data sparsity and measurement uncertainty in radar height data, achieving automatic calibration during system operation has long been a challenge. To address the sparsity issue, we propose a Dual-Perspective representation that gathers features from both frontal and bird’s-eye views. The frontal view contains rich but sensitive height information, whereas the bird’s-eye view provides robust features against height uncertainty. We thereby propose a novel Selective Fusion Mechanism to identify and fuse reliable features from both perspectives, reducing the effect of height uncertainty. Moreover, for each view, we incorporate a Multi-Modal Cross-Attention Mechanism to explicitly find location correspondences through cross-modal matching. During the training phase, we also design a Noise-Resistant Matcher to provide better supervision and enhance the robustness of the matching mechanism against sparsity and height uncertainty. Our experimental results, tested on the nuScenes dataset, demonstrate that our method significantly outperforms previous radar-camera auto-calibration methods, as well as existing state-of-the-art LiDAR-camera calibration techniques, establishing a new benchmark for future research.
Pseudo Visible Feature Fine-Grained Fusion for Thermal Object Detection
Ting Li · Mao Ye · Tianwen Wu · Nianxin Li · Shuaifeng Li · Song Tang · Luping Ji
Thermal object detection is a critical task in various fields, such as surveillance and autonomous driving. Current state-of-the-art (SOTA) models always leverage a prior Thermal-To-Visible (T2V) translation model to obtain visible spectrum information, followed by a cross-modality aggregation module to fuse information from both modalities. However, this fusion approach does not fully exploit the complementary visible spectrum information beneficial for thermal detection. To address this issue, we propose a novel cross-modal fusion method called Pseudo Visible Feature Fine-Grained Fusion (PFGF). Unlike previous cross-modal fusion methods, our approach explicitly models high-level relationships between cross-modal data, effectively fusing information of different granularities. Specifically, a graph is constructed with nodes generated from features at different levels of the backbone, along with pseudo-visual latent features produced by the T2V model. Each feature corresponds to a subgraph. An Inter-Mamba block is proposed to perform cross-modality fusion between nodes at the lowest scale, while a Cascade Knowledge Integration (CKI) strategy is used to fuse low-scale fused information to high-scale subgraphs in a cascade manner. After several iterations of graph node updating, each subgraph outputs an aggregated feature to the detection head. Experimental results demonstrate that our method achieves SOTA detection performance and is more efficient. Code, data, and models will be released upon publication of this paper.
Resilient Sensor Fusion Under Adverse Sensor Failures via Multi-Modal Expert Fusion
Konyul Park · Yecheol Kim · Daehun Kim · Jun Won Choi
Modern autonomous driving perception systems utilize complementary multi-modal sensors, such as LiDAR and cameras. Although sensor fusion architectures enhance performance in challenging environments, they still suffer significant performance drops under severe sensor failures, such as LiDAR beam reduction, LiDAR drop, limited field of view, camera drop, and occlusion. This limitation stems from inter-modality dependencies in current sensor fusion frameworks. In this study, we introduce an efficient and robust LiDAR-camera 3D object detector, referred to as Immortal, which can achieve robust performance through a mixture-of-experts approach. Our Immortal fully decouples modality dependencies using three parallel expert decoders, which use camera features, LiDAR features, or a combination of both to decode object queries, respectively. We propose the Mixture of Modal Experts (MoME) framework, where each query is decoded selectively using one of three expert decoders. MoME utilizes an Adaptive Query Router (AQR) to select the most appropriate expert decoder for each query based on the quality of camera and LiDAR features. This ensures that each query is processed by the best-suited expert, resulting in robust performance across diverse sensor failure scenarios. We evaluated the performance of Immortal on the nuScenes-R benchmark. Our Immortal achieved state-of-the-art performance in extreme weather and sensor failure conditions, significantly outperforming the existing models across various sensor failure scenarios.
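A minimal sketch of the routing idea follows: a small scoring head assigns each object query to one of three expert decoders, and only that expert processes it. Scoring directly from the query features and routing with a hard argmax are simplifications; in practice the router would need its own supervision, since a hard argmax alone passes no gradient to the gate.

```python
import torch
import torch.nn as nn

class AdaptiveQueryRouter(nn.Module):
    """Route each object query to one of several expert decoders (e.g. camera / LiDAR / fused)."""

    def __init__(self, dim, num_experts=3):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_experts))

    def forward(self, queries, experts):
        # queries: (Q, C); experts: list of modules, each mapping (q, C) -> (q, C)
        logits = self.score(queries)            # (Q, E) routing scores
        choice = logits.argmax(dim=-1)          # hard routing: one expert per query
        out = torch.zeros_like(queries)
        for e, expert in enumerate(experts):
            picked = choice == e
            if picked.any():
                out[picked] = expert(queries[picked])
        return out, choice
```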
Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking
chaocan xue · Bineng Zhong · Qihua Liang · Yaozong Zheng · Ning Li · Yuanliang Xue · Shuxiang Song
Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking, where efficiency is paramount. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize redundant ViT layers dynamically. Our approach disables a large number of representation-similar layers and selectively retains only a single optimal layer among them to alleviate the precision drop. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach, and show that our SGLATrack achieves a state-of-the-art real-time speed while maintaining competitive tracking precision.
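The layer adaptation can be sketched as a calibration pass that keeps a layer only when its output differs enough from the last kept layer; the cosine threshold and the use of pooled per-layer features are illustrative assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def select_layers_by_similarity(layer_outputs, threshold=0.95):
    """Return indices of ViT layers to keep, skipping representation-similar ones.

    layer_outputs: list of per-layer token features (e.g. pooled over a calibration
    set), one tensor per transformer layer.
    """
    keep = [0]
    for i in range(1, len(layer_outputs)):
        ref = layer_outputs[keep[-1]].flatten()
        cur = layer_outputs[i].flatten()
        sim = torch.dot(ref, cur) / (ref.norm() * cur.norm() + 1e-8)
        if sim < threshold:      # representation changed enough, so keep this layer
            keep.append(i)
    return keep
```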
MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM
Vladimir Yugay · Theo Gevers · Martin R. Oswald
Simultaneous localization and mapping (SLAM) systems with novel view synthesis capabilities are widely used in computer vision, with applications in augmented reality, robotics, and autonomous driving. However, existing approaches are limited to single-agent operation. Recent work has addressed this problem using a distributed neural scene representation. Unfortunately, existing methods are slow, cannot accurately render real-world data, are restricted to two agents, and have limited tracking accuracy. In contrast, we propose a rigidly deformable 3D Gaussian-based scene representation that dramatically speeds up the system. However, improving tracking accuracy and reconstructing a globally consistent map from multiple agents remains challenging due to trajectory drift and discrepancies across agents' observations. Therefore, we propose new tracking and map-merging mechanisms and integrate loop closure in the Gaussian-based SLAM pipeline. We evaluate MAGiC-SLAM on synthetic and real-world datasets and find it more accurate and faster than the state of the art.
SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction
ZaiPeng Duan · Xuzhong Hu · Pei An · Junfeng Ding · Jie Zhan · Chenxu Dang · Yunbiao Xu · Jie Ma
Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/
VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction
Ziyue Zhu · Shenlong Wang · Jin Xie · Jiang-Jiang Liu · Jingdong Wang · Jian Yang
Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow, a task that presents significant challenges due to specific difficulties, e.g., occlusions and unbalanced dynamic environments. In this paper, we analyze these challenges and their underlying causes. To address them, we propose a novel regularization framework called VoxelSplat. This framework leverages recent developments in 3D Gaussian Splatting to enhance model performance in two key ways: (i) Enhanced Semantics Supervision through 2D Projection: During training, our method decodes sparse semantic 3D Gaussians from 3D representations and projects them onto the 2D camera view. This provides additional supervision signals in the camera-visible space, allowing 2D labels to improve the learning of 3D semantics. (ii) Scene Flow Learning: Our framework uses the predicted scene flow to model the motion of Gaussians, and is thus able to learn the scene flow of moving objects in a self-supervised manner using the labels of adjacent frames. Our method can be seamlessly integrated into various existing occupancy models, enhancing performance without increasing inference time. Extensive experiments on benchmark datasets demonstrate the effectiveness of VoxelSplat in improving the accuracy of both semantic occupancy and scene flow estimation.
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
Sicheng Zuo · Wenzhao Zheng · Yuanhui Huang · Jie Zhou · Jiwen Lu
3D occupancy prediction is important for autonomous driving due to its comprehensive perception of the surroundings. To incorporate sequential inputs, most existing methods fuse representations from previous frames to infer the current 3D occupancy. However, they fail to consider the continuity of driving scenarios and ignore the strong prior provided by the evolution of 3D scenes (e.g., only dynamic objects move). In this paper, we propose a world-model-based framework to exploit scene evolutions for perception. We reformulate 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input. We decompose the scene evolution into three factors: 1) ego motion alignment of static scenes; 2) local movements of dynamic objects; and 3) completion of newly-observed scenes. We then employ a Gaussian world model (GaussianWorld) to explicitly exploit these priors and infer the scene evolutions in the 3D Gaussian space considering the current RGB observation. We evaluate the effectiveness of our framework on the widely used nuScenes dataset. Our GaussianWorld improves the performance of the single-frame counterpart by over 2\% in mIoU without introducing additional computations.
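The first of the three factors, ego-motion alignment of the static scene, amounts to warping the previous frame's Gaussians into the current ego frame, as in the sketch below (means and orientation matrices only); per-object motion for dynamic Gaussians and the completion of newly observed regions are left to the learned model.

```python
import torch

def align_static_gaussians(means, rotations, T_prev_to_curr):
    """Warp static 3D Gaussians from the previous ego frame into the current one.

    means: (G, 3) Gaussian centers; rotations: (G, 3, 3) orientation matrices;
    T_prev_to_curr: (4, 4) homogeneous ego-motion transform.
    """
    R, t = T_prev_to_curr[:3, :3], T_prev_to_curr[:3, 3]
    new_means = means @ R.t() + t               # rotate then translate the centers
    new_rotations = R.unsqueeze(0) @ rotations  # rotate each Gaussian's orientation
    return new_means, new_rotations
```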
DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes
Chensheng Peng · Chengwei Zhang · Yixiao Wang · Chenfeng Xu · Yichen Xie · Wenzhao Zheng · Kurt Keutzer · Masayoshi Tomizuka · Wei Zhan
We present DeSiRe-GS, a self-supervised Gaussian splatting representation, enabling effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Our approach employs a two-stage optimization pipeline of dynamic street Gaussians. In the first stage, we extract 2D motion masks based on the observation that 3D Gaussian Splatting inherently can reconstruct only the static regions in dynamic environments. These extracted 2D motion priors are then mapped into the Gaussian space in a differentiable manner, leveraging an efficient formulation of dynamic Gaussians in the second stage. Combined with the introduced geometric regularizations, our method is able to address the overfitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in the air. Furthermore, we introduce temporal cross-view consistency to ensure coherence across time and viewpoints, resulting in high-quality surface reconstruction. Comprehensive experiments demonstrate the efficiency and effectiveness of DeSiRe-GS, surpassing prior self-supervised methods and achieving accuracy comparable to methods relying on external 3D bounding box annotations.
JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
Runjian Chen · Wenqi Shao · Bo Zhang · Shaoshuai Shi · Li Jiang · Ping Luo
Deep-learning-based autonomous driving (AD) perception paints a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. 3D real-world data is notoriously time-and-energy-consuming to annotate and lacks corner cases like rare traffic participants. On the contrary, in simulators like CARLA, generating labeled LiDAR point clouds with corner cases is a piece of cake. However, introducing synthetic point clouds to improve real perception is non-trivial. This stems from two challenges: 1) sample efficiency of simulation datasets and 2) simulation-to-real gaps. To overcome both challenges, we propose a plug-and-play method called JiSAM, shorthand for Jittering augmentation, domain-aware backbone and memory-based Sectorized AlignMent. In extensive experiments conducted on the famous AD dataset nuScenes, we demonstrate that, with a SOTA 3D object detector, JiSAM is able to utilize the simulation data and labels on only $2.5\%$ of the available real data to achieve performance comparable to models trained on all real data. Additionally, JiSAM achieves more than 15 mAP on objects not labeled in the real training set. We will release models and codes.
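The jittering component is essentially clipped Gaussian noise on simulated points, as in the sketch below; the noise scale and clipping bound are illustrative values, not the paper's settings.

```python
import numpy as np

def jitter_simulated_points(points, sigma=0.02, clip=0.05, rng=None):
    """Perturb overly clean simulated LiDAR points so they resemble noisy real scans.

    points: (N, 3+) array whose first three columns are xyz; extra columns
    (e.g. intensity) are left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = np.clip(rng.normal(0.0, sigma, size=points[:, :3].shape), -clip, clip)
    jittered = points.copy()
    jittered[:, :3] += noise
    return jittered
```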
Rethinking Lanes and Points in Complex Scenarios for Monocular 3D Lane Detection
Yifan Chang · Junjie Huang · Xiaofeng Wang · Yun Ye · Zhujin LIANG · Yi Shan · Dalong Du · Xingang Wang
Monocular 3D lane detection is a fundamental task in autonomous driving. Although sparse-point methods lower computational load and maintain high accuracy in complex lane geometries, current methods fail to fully leverage the geometric structure of lanes in both lane geometry representations and model design. In lane geometry representations, we present a theoretical analysis alongside experimental validation to verify that current sparse lane representation methods contain inherent flaws, resulting in potential errors of up to 20 meters, which raise significant safety concerns for driving. To address this issue, we propose a novel patching strategy to completely represent the full lane structure. To enable existing models to match this strategy, we introduce the EndPoint head (EP-head), which adds a patching distance to endpoints. The EP-head enables the model to predict more complete lane representations even with fewer preset points, effectively addressing existing limitations and paving the way for models that are faster and require fewer parameters in the future. In model design, to enhance the model's perception of lane structures, we propose the PointLane attention (PL-attention), which incorporates prior geometric knowledge into the attention mechanism. Extensive experiments demonstrate the effectiveness of the proposed methods on various state-of-the-art models. For instance, our methods enhance Persformer by 4.4 points, Anchor3DLane by 3.2 points, and LATR by 2.8 points on the overall F1-score. The code will be released upon publication.
SceneCrafter: Controllable Multi-View Driving Scene Editing
Zehao Zhu · Yuliang Zou · Chiyu “Max” Jiang · Bo Sun · Vincent Casser · XIUKUN HUANG · Jiahao Wang · Zhenpei Yang · Ruiqi Gao · Leonidas Guibas · Mingxing Tan · Dragomir Anguelov
Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty inspiring confidence in the relevance of their outcomes. Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning "empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through a novel masked training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Xinyuan Chang · Maixuan Xue · Xinran Liu · Zheng Pan · Xing Wei
Ensuring adherence to traffic sign regulations is essential for both human and autonomous vehicle navigation. Current online mapping solutions often prioritize the construction of the geometric and connectivity layers of HD maps while overlooking the traffic regulation layer within HD maps. Addressing this gap, we introduce MapDR, a novel dataset designed for the extraction of Driving Rules from traffic signs and their association with vectorized, locally perceived HD Maps. MapDR features over $10,000$ annotated video clips that capture the intricate correlation between traffic sign regulations and lanes. Built upon this benchmark and the newly defined task of integrating traffic regulations into online HD maps, we provide modular and end-to-end solutions: VLE-MEE and RuleVLM, offering a strong baseline for advancing autonomous driving technology. It fills a critical gap in the integration of traffic sign rules, contributing to the development of reliable autonomous driving systems.
CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization
Junhao Xu · Yanan Zhang · Zhi Cai · Di Huang
Multi-agent collaborative perception enhances perceptual capabilities by utilizing information from multiple agents and is considered a fundamental solution to the problem of weak single-vehicle perception in autonomous driving. However, existing collaborative perception methods face a dilemma between communication efficiency and perception accuracy. To address this issue, we propose a novel communication-efficient collaborative perception framework based on supply-demand awareness and intermediate-late hybridization, dubbed $\texttt{CoSDH}$. By modeling the supply-demand relationship between agents, the framework refines the selection of collaboration regions, reducing unnecessary communication cost while maintaining accuracy. In addition, we innovatively introduce the intermediate-late hybrid collaboration mode, where late-stage collaboration compensates for the performance degradation in collaborative perception under low communication bandwidth. Extensive experiments on multiple datasets, including both simulated and real-world scenarios, demonstrate that $\texttt{CoSDH}$ achieves state-of-the-art detection accuracy and optimal bandwidth trade-offs, delivering superior detection precision under real communication bandwidths, thus proving its effectiveness and practical applicability. The code and model will be publicly available.
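The supply-demand idea can be pictured as a per-cell mask over the BEV grid: transmit features only where the requesting agent is uncertain and the supplying agent is confident. The sketch below is a hypothetical simplification with illustrative thresholds.

```python
import torch

def supply_demand_region_mask(ego_conf, helper_conf, demand_thresh=0.3, supply_thresh=0.6):
    """Select BEV cells worth transmitting under a supply-demand view of collaboration.

    ego_conf, helper_conf: (H, W) detection-confidence maps of the requesting and
    supplying agents, respectively.
    """
    demand = ego_conf < demand_thresh      # ego agent is unsure here
    supply = helper_conf > supply_thresh   # helper has something reliable to offer
    return demand & supply                 # boolean (H, W) mask of regions to communicate
```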
Generating Multimodal Driving Scenes via Next-Scene Prediction
Yanhao Wu · Haoyang Zhang · Tianwei Lin · Alan Huang · Shujie Luo · Rui Wu · Congpei Qiu · Wei Ke · Tong Zhang
Generative models in Autonomous Driving (AD) enable diverse scenario creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenarios for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To keep the map and ego-action modalities coherent, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation to the map based on the ego-action. Our framework effectively generates complex, realistic driving scenarios over extended sequences, ensuring multimodal consistency and offering fine-grained control over scenario elements. Visualization of generated multimodal driving scenes can be found in the supplementary materials.
Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning
Bozhou Zhang · Nan Song · Xin Jin · Li Zhang
End-to-end autonomous driving unifies tasks in a differentiable framework, enabling planning-oriented optimization and attracting growing attention. Current methods aggregate historical information either through dense historical bird’s-eye-view (BEV) features or by querying a sparse memory bank, following paradigms inherited from detection. However, we argue that these paradigms either omit historical information in motion planning or fail to align with its multi-step nature, which requires predicting or planning multiple future time steps. In line with the philosophy that "the future is a continuation of the past", we propose BridgeAD, which reformulates motion and planning queries as multi-step queries to differentiate the queries for each future time step. This design enables the effective use of historical prediction and planning by applying them to the appropriate parts of the end-to-end system based on the time steps, which improves both perception and motion planning. Specifically, historical queries for the current frame are combined with perception, while queries for future frames are integrated with motion planning. In this way, we bridge the gap between past and future by aggregating historical insights at every time step, enhancing the overall coherence and accuracy of the end-to-end autonomous driving pipeline. Extensive experiments on the nuScenes dataset in both open-loop and closed-loop settings demonstrate that BridgeAD achieves state-of-the-art performance. We will make our code and models publicly available.
MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
Wenzhuo Liu · Wenshuo Wang · Yicheng Qiao · Qiannan Guo · Jiayin Zhu · Pengfei Li · Zilong Chen · Huiming Yang · Zhiwei Li · Lening Wang · Tiao Tan · Huaping Liu
Advanced driver assistance systems require a comprehensive understanding of the driver's mental/physical state and traffic context, but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is the multi-axis region attention network to extract global context-sensitive features, and the other is the dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset, using a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks.
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
Xinhao Liu · Jintong Li · Yicheng Jiang · Niranjan Sujay · Zhicheng Yang · Juexiao Zhang · John Abanes · Jing Zhang · Chen Feng
Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings.
Evaluating Vision-Language Models as Evaluators in Path Planning
Mohamed Aghzal · Xiang Yue · Erion Plaku · Ziyu Yao
Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. This has inspired an intriguing question: if these models cannot plan well, can they still contribute to the planning framework as a helpful plan evaluator? In this work, we generalize this question to consider LLMs augmented with visual understanding, i.e., Vision-Language Models (VLMs). We introduce PathEval, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Succeeding in the benchmark requires a VLM to be able to abstract traits of optimal paths from the scenario description, demonstrate precise low-level perception on each path, and integrate this information to decide the better path. Our analysis of state-of-the-art VLMs reveals that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and exhibit mixed performance in integrating the provided information. Yet, their vision component presents a critical bottleneck, with models struggling to perceive low-level details about a path. Our experimental results show that this issue cannot be trivially addressed via end-to-end fine-tuning; rather, task-specific discriminative adaptation of these vision encoders is needed for these VLMs to become effective path evaluators.
Scene Map-based Prompt Tuning for Navigation Instruction Generation
Sheng Fan · Rui Liu · Wenguan Wang · Yi Yang
Navigation instruction generation (NIG), which offers interactive feedback and guidance to humans along a trajectory, is essential for developing embodied agents capable of human-machine communication and collaboration through natural language. Early data-driven methods directly map sequences of RGB frames to route descriptions on limited datasets. While recent approaches leverage Large Language Models (LLMs) to improve NIG, they often overlook the map representation of the navigation environment, which encodes multi-view semantic and topological information along the trajectory. Instead of solely inputting textual descriptions of the map into LLMs, we propose a scene map-based prompt tuning framework for NIG, \textsc{MAPInstructor}, which incorporates map priors for parameter-efficient updating of LLMs. \textsc{MAPInstructor} consists of (i) scene representation encoding, where egocentric observations are projected into 3D voxels for finer-grained scene understanding; (ii) mapping prompt tuning, which integrates the topological map representation of the entire route into the LLM-based decoder; and (iii) landmark uncertainty assessment, which reduces hallucinations in landmark prediction and further enhances instruction generation. Extensive experiments on three navigation datasets (i.e., R2R, REVERIE, RxR) confirm the generalization and effectiveness of our framework. Our code will be released.
Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision
Manon Dampfhoffer · Thomas Mesquida · Damien Joubert · Thomas Dalgaty · Pascal Vivet · Christoph Posch
Event-based cameras asynchronously detect changes in light intensity with high temporal resolution, making them a promising alternative to RGB cameras for low-latency and low-power optical flow estimation. However, state-of-the-art convolutional neural network methods create frames from the event stream, thereby losing the opportunity to exploit events for both sparse computations and low-latency prediction. On the other hand, asynchronous event graph methods could leverage both, but at the cost of avoiding any form of time accumulation, which limits the prediction accuracy. In this paper, we propose to break this accuracy-latency trade-off with a novel architecture combining an asynchronous accumulation-free event branch and a periodic aggregation branch. The periodic branch performs feature aggregations on the event graphs of past data to extract global context information, which improves accuracy without introducing any latency. The solution can predict optical flow per event with a latency of tens of microseconds on asynchronous hardware, which represents a gain of three orders of magnitude with respect to state-of-the-art frame-based methods, with 48x fewer operations per second. We show that the solution can detect rapid motion changes faster than a periodic output. This work proposes, for the first time, an effective solution for ultra-low-latency and low-power optical flow prediction from event cameras.
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Enshen Zhou · Qi Su · Cheng Chi · Zhizheng Zhang · Zhongyuan Wang · Tiejun Huang · Lu Sheng · He Wang
Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.
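To make "VLM-generated code evaluating spatio-temporal constraints" concrete, the snippet below shows the kind of constraint-checking function such a monitor might emit; the tracked element names, distance bound, and time horizon are all hypothetical.

```python
import numpy as np

def monitor_constraint(track, max_distance=0.05, horizon=10):
    """Example constraint: the gripper must stay within `max_distance` of the handle
    over the last `horizon` frames; returns False to trigger failure handling.

    track: dict of per-frame 3D positions of tracked constraint elements, e.g.
    {'gripper': (T, 3) array, 'handle': (T, 3) array}.
    """
    gripper = np.asarray(track['gripper'])[-horizon:]
    handle = np.asarray(track['handle'])[-horizon:]
    distances = np.linalg.norm(gripper - handle, axis=1)
    return bool(np.all(distances <= max_distance))
```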
RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
Raktim Gautam Goswami · Prashanth Krishnamurthy · Yann LeCun · Farshad Khorrami
Vision-based pose estimation of articulated robots with unknown joint angles has applications in collaborative robotics and human-robot interaction tasks. Current frameworks use neural network encoders to extract image features and downstream layers to predict joint angles and robot pose. While images of robots inherently contain rich information about the robot’s physical structures, existing methods often fail to leverage it fully, thereby limiting performance under occlusions and truncations. To address this, we introduce RoboPEPP, a method that fuses information about the robot’s physical model into the encoder using a masking-based self-supervised embedding-predictive architecture. Specifically, we mask the robot’s joints and pre-train an encoder-predictor model to infer the joints’ embeddings from surrounding unmasked regions, enhancing the encoder’s understanding of the robot’s physical model. The pre-trained encoder-predictor pair, along with joint angle and keypoint prediction networks, is then fine-tuned for pose and joint angle estimation. Random masking of input during fine-tuning and keypoint filtering during evaluation further improves robustness. Our method, evaluated on several datasets, achieves the best results in robot pose and joint angle estimation while being the least sensitive to occlusions and requiring the lowest execution time.
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
Weijie Zhou · Manli Tao · Chaoyang Zhao · Haiyun Guo · Honghui Dong · Ming Tang · Jinqiao Wang
Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14\% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1\% performance improvement.
GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation
Ruihai Wu · Ziyu Zhu · Yuran Wang · Yue Chen · Jiarui Wang · Hao Dong
Cluttered garments manipulation poses significant challenges in robotics due to the complex, deformable nature of garments and intricate garment relations. Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, the dense representation modeling the complex space and multi-modal manipulation candidates, with novel designs for the awareness of garment geometry, structure, and inter-object relations. Additionally, we introduce an adaptation module, informed by learned affordance, to reorganize cluttered garments into configurations conducive to manipulation. Our framework demonstrates effectiveness over environments featuring diverse garment types and pile scenarios in both simulation and the real world.
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Jiange Yang · Haoyi Zhu · Yating Wang · Gangshan Wu · Tong He · Limin Wang
Learning from multiple domains is a primary factor that influences the generalization of a single unified robot system. In this paper, we aim to learn the trajectory prediction model by using broad out-of-domain data to improve its performance and generalization ability. The trajectory model is designed to predict any-point trajectories in the current frame given an instruction and can provide detailed control guidance for robotic policy learning. To handle the diverse out-of-domain data distribution, we propose a sparsely-gated MoE (\textbf{Top-1} gating strategy) architecture for the trajectory model, coined as \textbf{Tra-MoE}. The sparse activation design enables a good balance between parameter cooperation and specialization, effectively benefiting from large-scale out-of-domain data while maintaining constant FLOPs per token. In addition, we further introduce an adaptive policy conditioning technique by learning 2D mask representations for predicted trajectories, which are explicitly aligned with image observations to guide action prediction more flexibly. We perform extensive experiments on both simulation and real-world scenarios to verify the effectiveness of Tra-MoE and the adaptive policy conditioning technique. We also conduct a comprehensive empirical study to train Tra-MoE, demonstrating that our Tra-MoE consistently exhibits superior performance compared to the dense baseline model, even when the latter is scaled to match Tra-MoE's parameter count.
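A minimal sketch of a sparsely-gated Top-1 MoE block follows, illustrating why FLOPs per token stay constant: each token is dispatched to exactly one expert. The expert width and count are illustrative, and the auxiliary load-balancing loss usually paired with such gates is omitted.

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Mixture-of-experts feed-forward block with top-1 (hard) routing."""

    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (tokens, dim)
        gate_probs = self.gate(x).softmax(dim=-1)
        weight, choice = gate_probs.max(dim=-1)          # top-1 gate value and expert index
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            picked = choice == e
            if picked.any():
                out[picked] = weight[picked, None] * expert(x[picked])
        return out
```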
AffordDP: Generalizable Diffusion Policy with Transferable Affordance
Shijie Wu · Yihang Zhu · Yunao Huang · Kaizhen Zhu · Jiayuan Gu · Jingyi Yu · Ye Shi · Jingya Wang
Diffusion-based policies have shown impressive performance in robotic manipulation tasks while struggling with out-of-domain distributions. Recent efforts attempted to enhance generalization by improving the visual feature encoding for diffusion policy. However, their generalization is typically limited to the same category with similar appearances. Our key insight is that leveraging affordances—manipulation priors that define "where" and "how" an agent interacts with an object—can substantially enhance generalization to entirely unseen object instances and categories. We introduce the Diffusion Policy with transferable Affordance (AffordDP), designed for generalizable manipulation across novel categories. AffordDP models affordances through 3D contact points and post-contact trajectories, capturing the essential static and dynamic information for complex tasks. The transferable affordance from in-domain data to unseen objects is achieved by estimating a 6D transformation matrix using foundational vision models and point cloud registration techniques. More importantly, we incorporate affordance guidance during diffusion sampling that can refine action sequence generation. This guidance directs the generated action to gradually move towards the desired manipulation for unseen objects while keeping the generated action within the manifold of action space. Experimental results from both simulated and real-world environments demonstrate that AffordDP consistently outperforms previous diffusion-based methods, successfully generalizing to unseen instances and categories where others fail.
Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
Wenke Xia · Ruoxuan Feng · Dong Wang · Di Hu
Building a generalizable self-correction system akin to human cognition is crucial for robots to recover from failures. Despite advancements in Multimodal Large Language Models (MLLMs) that empower robots with the ability to reflect semantically on failures, translating this semantic reflection into "how to correct" fine-grained robotic actions remains a significant challenge. To address this gap, we build the Phoenix framework, which leverages motion instruction as a bridge to connect high-level semantic reflection with low-level robotic action correction. In this motion-based self-reflection framework, we start with a dual-process motion adjustment mechanism with MLLMs to translate the semantic reflection into coarse-grained motion instruction adjustment. To leverage this motion instruction for guiding "how to correct" fine-grained robotic actions, a multi-task motion-conditioned diffusion policy is proposed to integrate visual observations for high-frequency robotic action correction. By combining these two models, we can shift the demand for generalization capability from the low-level manipulation policy to the MLLM-driven motion refinement model and facilitate precise, fine-grained robotic action correction. Utilizing this framework, we further develop a continual learning method to automatically improve the model's capability from interactions with dynamic environments. The experiments conducted in both the RoboMimic simulation and real-world scenarios prove the superior generalization and robustness of our framework across a variety of manipulation tasks.
ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
Kailin Li · Puhao Li · Tengyu Liu · Yuyang Li · Siyuan Huang
Human hands are central to environmental interactions, motivating increasing research on dexterous robotic manipulation. Data-driven embodied AI algorithms demand precise, large-scale, human-like manipulation sequences, which are challenging to obtain with conventional reinforcement learning or real-world teleoperation. This paper introduces ManipTrans, a novel two-stage method for efficiently transferring human bimanual skills to dexterous robotic hands in simulation. ManipTrans first pre-trains a generalist trajectory imitator for mimicking hand motion, then fine-tunes a specialist residual module for each skill under physics-based interaction constraints. This approach decouples motion imitation from physical effects, enabling efficient learning and accurate execution of complex bimanual tasks. Experiments show ManipTrans surpasses state-of-the-art methods in success rate, fidelity, and efficiency. Using ManipTrans, we transfer multiple hand-object datasets to robotic hands, creating DexManipNet, a large-scale dataset featuring previously unexplored tasks like pen capping and bottle unscrewing. DexManipNet includes 3.3K episodes of robotic manipulation and supports easy expansion with our framework, facilitating further policy training for dexterous hands and enabling real-world applications. Both code and dataset will be publicly available.
PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model
Mingju Gao · Yike Pan · Huan-ang Gao · Zongzheng Zhang · Wenyi Li · Hao Dong · Hao Tang · Li Yi · Hao Zhao
As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet-Master, rely on fine-tuning large-scale pre-trained video diffusion models, which are impractical for real-world use due to the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds upon large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states. We enhance the model’s understanding of interaction conditions with a multi-scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine-tuning, we implement a two-stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics. Our code, data, and models will be made publicly available to facilitate future research.
InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing
Jinlu Zhang · Yixin Chen · Zan Wang · Jie Yang · Yizhou Wang · Siyuan Huang
Recent advances in 3D human-centric generation have made significant progress. However, existing methods still struggle with generating novel Human-Object Interactions (HOIs), particularly for open-set objects. We identify three main challenges of this task: precise human object relation reasoning, adaptive affordance parsing for unseen objects, and realistic human pose synthesis that aligns with the description and 3D object geometry. In this work, we propose a novel zero-shot 3D HOI generation framework, leveraging the knowledge from large-scale pretrained models without training from specific datasets. More specifically, we first generate an initial human pose by sampling multiple hypotheses through multi-view SDS based on the input text and object geometry. We then utilize a pre-trained 2D image diffusion model to parse unseen objects and extract contact points, avoiding the limitations imposed by existing 3D asset knowledge. Finally, we introduce a detailed optimization to generate fine-grained, precise and natural interaction, enforcing realistic 3D contact between the involved body parts, including hands in grasp, and 3D object. This is achieved by distilling relational feedback from LLMs to capture detailed human-object relations from the text inputs. Extensive experiments validate the effectiveness of our approach compared to prior work, particularly in terms of the fine-grained nature of interactions and the ability to handle open-set 3D objects.
How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions
Aditya Prakash · Benjamin E Lundell · Dmitry Andreychuk · David Forsyth · Saurabh Gupta · Harpreet S. Sawhney
We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.
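The Interaction Codebook stage boils down to a VQ-VAE-style nearest-neighbour lookup, sketched below; the straight-through gradient estimator and commitment loss used during training are omitted for brevity.

```python
import torch

def quantize_to_codebook(z, codebook):
    """Quantize continuous latents to their nearest codebook entries.

    z: (N, D) continuous latents; codebook: (K, D) learned embedding table.
    Returns the quantized latents and their token indices, which a transformer
    decoder can later predict or retrieve.
    """
    dist = torch.cdist(z, codebook)   # (N, K) pairwise distances
    idx = dist.argmin(dim=-1)         # closest code per latent
    return codebook[idx], idx
```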
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
Liang Pan · Zeshi Yang · Zhiyang Dou · Wenjia Wang · Buzhen Huang · Bo Dai · Taku Komura · Jingbo Wang
Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks.
EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild
Yumeng Liu · Xiaoxiao Long · Zemin Yang · Yuan Liu · Marc Habermann · Christian Theobalt · Yuexin Ma · Wenping Wang
Our work aims to reconstruct hand-object interactions from a single-view image, which is a fundamental but ill-posed task. Unlike methods that reconstruct from videos, multi-view images, or predefined 3D templates, single-view reconstruction faces significant challenges due to inherent ambiguities and occlusions. These challenges are further amplified by the diverse nature of hand poses and the vast variety of object shapes and sizes. Our key insight is that current foundational models for segmentation, inpainting, and 3D reconstruction robustly generalize to in-the-wild images, which could provide strong visual and geometric priors for reconstructing hand-object interactions. Specifically, given a single image, we first design a novel pipeline to estimate the underlying hand pose and object shape using off-the-shelf large models. Furthermore, with the initial reconstruction, we employ a prior-guided optimization scheme, which optimizes hand pose to comply with 3D physical constraints and the 2D input image content. We perform experiments across several datasets and show that our method consistently outperforms baselines and faithfully reconstructs a diverse set of hand-object interactions.
Temporally Consistent Object-Centric Learning by Contrasting Slots
Anna Manasyan · Maximilian Seitzer · Filip Radovic · Georg Martius · Andrii Zadaianchuk
Unsupervised object-centric learning from videos is a promising approach to extract structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be both compositional and temporally consistent. Existing approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. Our method significantly improves the temporal consistency of the learned object-centric representations, yielding more reliable video decompositions that facilitate challenging downstream tasks such as unsupervised object dynamics prediction. Furthermore, the inductive bias added by our loss strongly improves object discovery, leading to state-of-the-art results on both synthetic and real-world datasets, outperforming even weakly-supervised methods that leverage motion masks as additional cues.
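The object-level temporal contrastive loss can be sketched as an InfoNCE objective in which each slot's positive is the corresponding slot in the next frame and all other slots in the batch act as negatives; the slot correspondence across frames and the temperature are simplifying assumptions here, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def slot_temporal_contrastive_loss(slots_t, slots_t1, temperature=0.1):
    """InfoNCE between slots of consecutive frames.

    slots_t, slots_t1: (B, S, D) slot representations at times t and t+1, with slot s
    of frame t+1 treated as the positive for slot s of frame t.
    """
    B, S, D = slots_t.shape
    a = F.normalize(slots_t.reshape(B * S, D), dim=-1)
    b = F.normalize(slots_t1.reshape(B * S, D), dim=-1)
    logits = a @ b.t() / temperature                  # (B*S, B*S) similarity matrix
    targets = torch.arange(B * S, device=a.device)    # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```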
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
Sirui Xu · Dongting Li · Yucheng Zhang · Xiyan Xu · Qi Long · Ziyin Wang · Yunzhi Lu · Shuchang Dong · Hezi Jiang · Akshat Gupta · Yu-Xiong Wang · Liangyan Gui
While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remains challenging due to dataset limitations. These datasets often lack extensive, high-quality text-interaction pair data and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark with key contributions in both dataset and methodology. First, we consolidate 21.81 hours of HOI data from diverse sources, standardizing and enriching them with detailed textual annotations. Second, we propose a unified optimization framework that enhances data quality by minimizing artifacts and restoring hand motions. Leveraging the insight of contact invariance, we preserve human-object relationships while introducing motion variations, thereby expanding the dataset to 30.70 hours. Third, we introduce six tasks to benchmark existing methods and develop a unified HOI generative model based on multi-task learning that achieves state-of-the-art results. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. The dataset will be publicly accessible to support further research in the field.
HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
Prithviraj Banerjee · Sindi Shkodrani · Pierre Moulon · Shreyas Hampali · Shangchen Han · Fan Zhang · Linguang Zhang · Jade Fountain · Edward Miller · Selen Basol · Richard Newcombe · Robert Wang · Jakob Engel · Tomas Hodan
We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.
Estimating Body and Hand Motion in an Ego‑sensed World
Brent Yi · Vickie Ye · Maya Zheng · Yunqi Li · Lea Müller · Georgios Pavlakos · Yi Ma · Jitendra Malik · Angjoo Kanazawa
We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture a device wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve hand estimation: the resulting kinematic and temporal constraints can reduce world-frame errors in single-frame estimates by 40%.
UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units
Huakun Liu · Hiroki Ota · Xin Wei · Yutaro Hirao · Monica Perusquia-Hernandez · Hideaki Uchiyama · Kiyoshi Kiyokawa
Sparse wearable inertial measurement units (IMUs) have gained popularity for estimating 3D human motion. However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online state estimation framework that fuses all sensor inputs for 3D human shape and pose estimation, supported by six integrated, body-worn ultra-wideband (UWB) distance sensors with IMUs. UWB sensors measure inter-node distances to infer spatial relationships, aiding in resolving pose ambiguities and body shape variations when combined with anthropometric data. Unfortunately, IMUs are prone to drift, and UWB sensors are affected by body occlusions. Consequently, we develop a tightly coupled Unscented Kalman Filter (UKF) framework that fuses uncertainties from sensor data and estimated human motion based on individual body shape. The UKF iteratively refines IMU and UWB measurements by aligning them with uncertain human motion constraints in real time, producing optimal estimates for each. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of UMotion in stabilizing sensor data and its improvement over the state of the art in pose accuracy. The code will be available for research purposes.
REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning
Jihyun Lee · Weipeng Xu · Alexander Richard · Shih-En Wei · Shunsuke Saito · Shaojie Bai · Te-Li Wang · Minhyuk Sung · Tae-Kyun Kim · Jason Saragih
We present a novel framework for egocentric 3D motion tracking, focusing on learning high-fidelity social motions from head-mounted camera image inputs. Our approach leverages a motion diffusion model based on cascaded body-hand diffusion denoising for accurate whole-body motion tracking. We build our network upon a modified Transformer-based architecture using windowed relative-temporal attention to achieve better generalizability to arbitrary-length motions. Additionally, we propose a novel exemplar-based identity conditioning method to further boost tracking quality when prior information (i.e., anchor poses) about the target identity is available. In experiments, our framework achieves state-of-the-art whole-body motion tracking accuracy while enabling real-time inference (> 60 FPS) through diffusion distillation. Our supplementary video demonstrates that our approach can estimate significantly more natural social motions from challenging (e.g., occluded or truncated) egocentric input observations compared to existing state-of-the-art methods.
LAL: Enhancing 3D Human Motion Prediction with Latency-aware Auxiliary Learning
Xiaoning Sun · Dong Wei · Huaijiang Sun · Shengxiang Hu
Making accurate predictions of human motion based on historical observations is a crucial technology for robots collaborating with humans. Existing human motion prediction methods are all built under an ideal assumption that robots can react instantaneously, which ignores the time delay introduced by data processing and analysis and by future reaction planning -- jointly known as ``response latency''. Consequently, the predictions made within this latency period become meaningless for practical use, as part of the time has passed and the corresponding real motions have already occurred before the robot delivers its reaction. In this paper, we argue that this seemingly meaningless prediction period can nevertheless be leveraged to significantly enhance prediction accuracy. We propose LAL, a Latency-aware Auxiliary Learning framework, which shifts the existing ``instantaneous reaction'' convention to a new motion prediction paradigm with both latency compatibility and utility. The framework consists of two branches handling different tasks: the primary branch learns to directly predict the valid target (excluding the initial latency period) based on the observation, while the auxiliary branch learns the same target based on a reformed observation that additionally incorporates latency data. A direct and effective form of auxiliary feature sharing is enforced by our tailored consistency loss, which gradually integrates auxiliary latency insights into the primary prediction branch. An alignment method based on estimated feature statistics is presented as an optional step for primary-branch refinement. Experiments show that LAL achieves significant improvements in prediction accuracy without additional time consumption during testing.
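As a rough illustration of the two-branch idea, the toy sketch below pairs a primary branch (observation only) with an auxiliary branch (observation plus latency-period motion) and ties them with a feature consistency term; the MLP backbone, dimensions, and equal loss weighting are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatencyAwareAuxiliary(nn.Module):
    """Toy latency-aware auxiliary learner: both branches predict the same
    valid target, but only the auxiliary branch sees the latency-period
    motion; a consistency loss pulls the primary features toward it."""
    def __init__(self, obs_dim, latency_dim, target_dim, hidden=256):
        super().__init__()
        self.primary_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.aux_enc = nn.Sequential(nn.Linear(obs_dim + latency_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, target_dim)      # shared prediction head

    def forward(self, obs, latency, target):
        f_primary = self.primary_enc(obs)
        f_aux = self.aux_enc(torch.cat([obs, latency], dim=-1))
        pred_p, pred_a = self.head(f_primary), self.head(f_aux)
        task_loss = ((pred_p - target) ** 2).mean() + ((pred_a - target) ** 2).mean()
        # Consistency: integrate latency insights into the primary branch.
        consistency = ((f_primary - f_aux.detach()) ** 2).mean()
        return task_loss + consistency
```

At test time only the primary branch would run, so no extra latency data or computation is required.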
LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos
Daniel Etaat · Dvij Rajesh Kalaria · Nima Rahmanian · Shankar Sastry
Table tennis is more than a game of physical skill. Competitive players overcome the challenging high-speed and highly dynamic environment by anticipating their opponent’s intent – buying themselves the necessary time to respond. In this work, we take one step towards designing such an agent. Previous works have developed systems capable of real-time table tennis gameplay, though they often do not leverage anticipation; the works that do forecast opponent actions are limited by dataset size and variety. Our paper contributes (1) a scalable system for reconstructing monocular video of table tennis matches in 3D, (2) a real-time Transformer-based model for anticipating opponent actions, and (3) an uncertainty-aware anticipatory controller for a KUKA robot arm. We demonstrate in simulation that our policy improves the ball return rate on high-speed hits from 44.3% to 61.1% compared to a baseline non-anticipatory policy.
Pose Priors from Language Models
Sanjay Subramanian · Evonne Ng · Lea Müller · Dan Klein · Shiry Ginosar · Trevor Darrell
We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization. Despite its simplicity, our method produces surprisingly compelling pose reconstructions of people in close contact, correctly capturing the semantics of the social and physical interactions. We demonstrate that our method rivals more complex state-of-the-art approaches that require expensive human annotation of contact points and training specialized models. Moreover, unlike previous approaches, our method provides a unified framework for resolving self-contact and person-to-person contact.
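A minimal sketch of how language-derived contact statements can become a differentiable objective is shown below; the joint indices, margin, and hinge form are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def contact_loss(joints_a, joints_b, contact_pairs, margin=0.0):
    """Turn LMM-style contact statements ("A's right hand touches B's left
    shoulder") into a loss: minimize the distance between the named joints.
    joints_a, joints_b: (J, 3) 3D joint positions of the two people.
    contact_pairs: list of (index_in_a, index_in_b) joint index pairs,
    hypothetical mapping derived from the text descriptors."""
    loss = 0.0
    for idx_a, idx_b in contact_pairs:
        dist = torch.norm(joints_a[idx_a] - joints_b[idx_b])
        loss = loss + torch.clamp(dist - margin, min=0.0)   # hinge on the distance
    return loss / max(len(contact_pairs), 1)
```

Such a term would be added to the usual 2D reprojection and pose-prior objectives during optimization.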
HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models
Mingzhen Huang · Fu-Jen Chu · Bugra Tekin · Kevin Liang · Haoyu Ma · Weiyao Wang · Xingyu Chen · Pierre Gleize · Hongfei Xue · Siwei Lyu · Kris Kitani · Matt Feiszli · Hao Tang
We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.
HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction
Yuan Wang · Yali Li · Lixiang Li · Shengjin Wang
While flourishing developments have been witnessed in text-to-motion generation, synthesizing physically realistic, controllable, language-conditioned Human Scene Interactions (HSI) remains a relatively underexplored landscape. Current HSI methods naively rely on conditional Variational AutoEncoders (cVAEs) and diffusion models. They are typically associated with \textbf{limited modalities of control signals} and \textbf{task-specific framework designs}, leading to inflexible adaptation across interaction scenarios and motions unfaithful to their descriptions in diverse 3D physical environments. In this paper, we propose HSI-GPT, a General-Purpose \textbf{Large Scene-Motion-Language Model} that applies the ``next-token prediction'' paradigm of Large Language Models to the HSI domain. HSI-GPT not only exhibits remarkable flexibility to accommodate diverse control signals (3D scenes, textual commands, key-frame poses, as well as scene affordances), but also seamlessly supports various HSI-related tasks (\textit{e.g.}, multi-modal controlled HSI generation, HSI understanding, and general motion completion in 3D scenes). First, HSI-GPT quantizes textual descriptions and human motions into discrete, LLM-interpretable tokens with multi-modal tokenizers. Inspired by multi-modal learning, we develop a recipe for aligning mixed-modality tokens into the shared embedding space of LLMs. These interaction tokens are then organized into unified instruction-following prompts, allowing HSI-GPT to be fine-tuned on prompt-based question-and-answer tasks. Extensive experiments and visualizations validate that our general-purpose HSI-GPT model delivers exceptional performance across multiple HSI-related tasks.
SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing
Seokhyeon Hong · Chaelin Kim · Serin Yoon · Junghyun Nam · Sihun Cha · Junyong Noh
Text-driven motion generation has been significantly advanced with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce a skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable the first zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation.
TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation
Yabiao Wang · Shuo Wang · Jiangning Zhang · Ke Fan · Jiafu Wu · Xuezhucun Xue · Yong Liu
Human-human motion generation is essential for understanding humans as social beings. Current methods fall into two main categories: single-person-based methods and separate modeling-based methods. To delve into this field, we abstract the overall generation process into a general framework MetaMotion, which consists of two phases: temporal modeling and interaction mixing. For temporal modeling, the single-person-based methods concatenate two people into a single one directly, while the separate modeling-based methods skip the modeling of interaction sequences. The inadequate modeling described above results in sub-optimal performance and redundant model parameters. In this paper, we introduce TIMotion (Temporal and Interactive Modeling), an efficient and effective framework for human-human motion generation. Specifically, we first propose Causal Interactive Injection to model two separate sequences as a single causal sequence, leveraging their temporal and causal properties. Then we present Role-Evolving Scanning to adjust to the change in the active and passive roles throughout the interaction. Finally, to generate smoother and more rational motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on InterHuman and InterX demonstrate that our method achieves superior performance. The project code will be released upon acceptance.
HuMoCon: Concept Discovery for Human Motion Understanding
Qihang Fang · Chengcheng Tang · Bugra Tekin · Shugao Ma · Yanchao Yang
We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, together with a velocity reconstruction mechanism to enhance high-frequency feature expression and mitigate temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding.
ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
Jiayi Gao · Zijin Yin · Changcheng Hua · Yuxin Peng · Kongming Liang · Zhanyu Ma · Jun Guo · Yang Liu
The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current techniques struggle to preserve the diversity and accuracy of motion while transferring it to subjects with varying shapes. Additionally, they fail to handle videos with multiple subjects, making it difficult to disentangle and control motion transfer to specific instances. To overcome these challenges, we introduce ConMo, a training-free framework that disentangles and recomposes the motions of subjects and camera movements. Unlike existing methods, ConMo separates individual subject motions and background motions from compound motion trajectories in complex videos, allowing for precise control and flexible manipulation, such as resizing and repositioning. This leads to more accurate motion transfer across subjects of different shapes and better handling of videos with multiple subjects. Extensive experiments demonstrate that ConMo effectively achieves the aforementioned applications and surpasses existing state-of-the-art methods in terms of motion fidelity and temporal consistency. The code will be published.
DreamTrack: Dreaming the Future for Multimodal Visual Object Tracking
Mingzhe Guo · Weiping Tan · Wenyu Ran · Liping Jing · Zhipeng Zhang
Aiming to achieve class-agnostic perception in visual object tracking, current trackers commonly formulate tracking as a one-shot detection problem with the template-matching architecture. Despite the success, severe environmental variations in long-term tracking raise challenges to generalizing the tracker in novel situations. Temporal trackers attempt to address this by preserving the time-validity of target information through historical predictions, e.g., by updating the template. However, solely transmitting the previous observations instead of learning from them leads to an inferior capability of understanding the tracking scenario from past experience, which is critical for generalization to new frames. To address this issue, we reformulate temporal learning in visual tracking as a History-to-Future process and propose a novel tracking framework, DreamTrack. Our DreamTrack learns the temporal dynamics from past observations to dream the future variations of the environment, which boosts generalization with the extended future information from history. Considering the uncertainty of future variation, multimodal prediction is designed to infer the target trajectory of each possible future situation. Without bells and whistles, DreamTrack achieves leading performance with real-time inference speed. In particular, DreamTrack obtains SUC scores of 76.6%/87.9% on LaSOT/TrackingNet, surpassing all recent SOTA trackers.
Seurat: From Moving Points to Depth
Seokju Cho · Gabriel Huang · Seungryong Kim · Joon-Young Lee
Accurate depth estimation from monocular videos remains challenging due to ambiguities inherent in single-view geometry, as crucial depth cues like stereopsis are absent. However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Specifically, we use off-the-shelf point tracking models to capture 2D trajectories. Then, our approach employs spatial and temporal transformers to process these trajectories and directly infer depth changes over time. Evaluated on the TAPVid-3D benchmark, our method demonstrates robust zero-shot performance, generalizing effectively from synthetic to real-world datasets. Results indicate that our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.
CH3Depth: Efficient and Flexible Depth Foundation Model with Flow Matching
Jiaqi Li · Yiran Wang · Jinghong Zheng · Junrui Zhang · Liao Shen · Tianqi Liu · Zhiguo Cao
Depth estimation is a fundamental task in 3D vision. An ideal depth estimation model is expected to embrace meticulous detail, temporal consistency, and high efficiency. Although existing foundation models can perform well in certain specific aspects, most of them fall short of fulfilling all the above requirements simultaneously. In this paper, we present CH$_3$Depth, an efficient and flexible model for depth estimation with flow matching to address this challenge. Specifically, 1) we reframe the optimization objective of flow matching as a scale-variable velocity field to improve accuracy. 2) To enhance efficiency, we propose non-uniform sampling to achieve better prediction with fewer sampling steps. 3) We design the Latent Temporal Stabilizer (LTS) to enhance temporal consistency by aggregating latent codes of adjacent frames, enabling our method to be lightweight and compatible with video depth estimation. CH$_3$Depth achieves state-of-the-art performance in zero-shot evaluations across multiple image and video datasets, excelling in prediction accuracy, efficiency, and temporal consistency, highlighting its potential as the next foundation model for depth estimation.
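For orientation, the snippet below shows a generic conditional flow-matching training step on depth latents (constant-velocity target, uniform timesteps); CH$_3$Depth's scale-variable velocity field and non-uniform sampling would modify this baseline, and the model call signature is a hypothetical placeholder.

```python
import torch

def flow_matching_step(model, image_feat, depth_latent):
    """One plain conditional flow-matching step: sample t ~ U(0, 1),
    interpolate x_t between noise and the clean depth latent, and regress
    the constant velocity (depth - noise). depth_latent: (B, C, H, W)."""
    noise = torch.randn_like(depth_latent)
    t = torch.rand(depth_latent.shape[0], 1, 1, 1, device=depth_latent.device)
    x_t = (1 - t) * noise + t * depth_latent
    target_velocity = depth_latent - noise
    pred_velocity = model(x_t, t.flatten(), image_feat)   # hypothetical signature
    return ((pred_velocity - target_velocity) ** 2).mean()
```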
Video Depth without Video Models
Bingxin Ke · Dominik Narnhofer · Shengyu Huang · Lei Ke · Torben Peters · Katerina Fragkiadaki · Anton Obukhov · Konrad Schindler
Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for their fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call \textbf{RollingDepth}, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets, and (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models.
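The registration idea can be illustrated with a closed-form scale-and-shift alignment between a new depth snippet and the already-registered video on their overlapping frames; RollingDepth's actual algorithm is optimization-based and more robust, so treat this as a simplified stand-in.

```python
import numpy as np

def align_scale_shift(src, ref, mask=None):
    """Least-squares scale s and shift b so that s * src + b ≈ ref on the
    overlapping (valid) pixels; src and ref are depth maps of the shared frames."""
    x = src.ravel() if mask is None else src[mask]
    y = ref.ravel() if mask is None else ref[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * src + b, (s, b)

# Chaining sketch: align each new snippet to the already-registered video on
# their shared frames, then average the predictions within the overlap.
```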
BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions
Wonyong Seo · Jihyong Oh · Munchurl Kim
Existing video frame interpolation (VFI) models tend to suffer from time-to-location ambiguity when trained with videos of non-uniform motions, such as accelerating, decelerating, and changing directions, which often yields blurred interpolated frames. In this paper, we propose (i) a novel motion description map, the Bidirectional Motion field (BiM), to effectively describe non-uniform motions; (ii) a BiM-guided Flow Net (BiMFN) with a Content-Aware Upsampling Network (CAUN) for precise optical flow estimation; and (iii) Knowledge Distillation for VFI-centric Flow supervision (KDVCF) to supervise the motion estimation of the VFI model with VFI-centric teacher flows. The proposed VFI model is called the Bidirectional Motion field-guided VFI (BiM-VFI) model. Extensive experiments show that our BiM-VFI model significantly surpasses recent state-of-the-art VFI methods by 26\% and 45\% improvements in LPIPS and STLPIPS respectively, yielding interpolated frames with much fewer blurs at arbitrary time instances.
Autoregressive Sequential Pretraining for Visual Tracking
Shiyi Liang · Yifan Bai · Yihong Gong · Xing Wei
Recent advancements in visual object tracking have shifted towards a sequential generation paradigm, where object deformation and motion exhibit strong temporal dependencies. Despite the importance of these dependencies, widely adopted image-level pretrained backbones barely capture the dynamics of consecutive video frames, which are the essence of tracking. Thus, we propose AutoRegressive Generative Pretraining (ARG), an unsupervised spatio-temporal learner, via generating the evolution of object appearance and motion in video sequences. Our method leverages a diffusion model to autoregressively generate the future frame appearance, conditioned on historical embeddings extracted by a general encoder. Furthermore, to ensure trajectory coherence, the same encoder is employed to learn trajectory consistency by generating coordinate sequences in a reverse autoregressive fashion, a process we term back-tracking. We then integrate the pretrained ARG into ARTrackV2, creating ARGTrack, which is further fine-tuned for tracking tasks. ARGTrack achieves state-of-the-art performance across multiple benchmarks, becoming the first tracker to surpass 80% AO on GOT-10k, while maintaining high efficiency. These results demonstrate the effectiveness of our approach in capturing temporal dependencies for continuous video tracking. The code will be released soon.
IM-Zero: Instance-level Motion Controllable Video Generation in a Zero-shot Manner
Yuyang Huang · Yabo Chen · Li Ding · Xiaopeng Zhang · Wenrui Dai · Junni Zou · Hongkai Xiong · Qi Tian
Controllability of video generation has recently drawn attention alongside the quality of generated videos. The main challenge in controllable video generation is to synthesize videos based on user-specified instance spatial locations and movement trajectories. However, existing methods face a trade-off among resource consumption, generation quality, and user controllability. Although zero-shot video generation is an efficient alternative to prohibitively expensive training-based video generation, existing zero-shot methods cannot generate high-quality, motion-consistent videos under the control of layouts and movement trajectories. In this paper, we propose a novel zero-shot method named IM-Zero that improves instance-level motion-controllable video generation with enhanced control accuracy, motion consistency, and richness of detail to address this problem. Specifically, we first present a motion generation stage that extracts motion and textural guidance from keyframe candidates produced by a pre-trained grounded text-to-image model to generate the desired coarse motion video. Subsequently, we develop a video refinement stage that injects the motion priors of pre-trained text-to-video models and the detail priors of pre-trained text-to-image models into the latents of the coarse motion videos to further enhance video motion consistency and richness of detail. To the best of our knowledge, IM-Zero is the first to simultaneously achieve high-quality video generation and allow control of both layouts and movement trajectories in a zero-shot manner. Extensive experiments demonstrate that IM-Zero outperforms existing methods in terms of video quality, inter-frame consistency, and the alignment of location and trajectory. Furthermore, compared with existing methods, IM-Zero offers additional versatility in video generation, including motion control of subparts within instances, finer control over instance shapes via masks, and more difficult motion transfer tasks that customize fine-grained motion patterns through reference videos, as well as high-quality text-to-video generation.
Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
Hyeonho Jeong · Chun-Hao P. Huang · Jong Chul Ye · Niloy J. Mitra · Duygu Ceylan
While recent foundational video generators produce visually rich output, they still struggle with appearance drift, where objects gradually degrade or change inconsistently across frames, breaking visual coherence. We hypothesize that this is because there is no explicit supervision in terms of spatial tracking at the feature level. We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. Track4Gen merges the video generation and point tracking tasks into a single network by making minimal changes to existing video generation architectures. Using Stable Video Diffusion as a backbone, Track4Gen demonstrates that it is possible to unify video generation and point tracking, which are typically handled as separate tasks. Our extensive evaluations show that Track4Gen effectively reduces appearance drift, resulting in temporally stable and visually coherent video generation. Anonymous project page: https://anonymous1anonymous2.github.io/
Consistent and Controllable Image Animation with Motion Diffusion Models
Xin Ma · Yaohui Wang · Gengyun Jia · Xinyuan Chen · Tien-Tsin Wong · Yuan-Fang Li · Cunjian Chen
Diffusion models have achieved great progress in image animation due to their powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and objects in the input static image) and ensuring motion smoothness in animated video narratives guided by textual prompts still remain challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better image consistency and motion smoothness. In general, we propose two effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals via a motion diffusion model, rather than directly predicting subsequent frames. At the inference stage, a noise refinement technique based on the discrete cosine transform is introduced to mitigate sudden motion changes. These strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.
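The motion-residual idea can be sketched as a simple change of prediction target, as below; the tensor shapes and the residual-to-frame inversion are illustrative assumptions, not Cinemo's exact formulation.

```python
import torch

def to_motion_residuals(video_latents):
    """Represent every frame as its difference from the (static) input-image
    latent, so a model trained on these residuals keeps appearance anchored
    to the conditioning image. video_latents: (B, T, C, H, W); frame 0 is
    assumed to be the input image."""
    first = video_latents[:, :1]
    return video_latents - first                 # frame 0's residual is zero

def from_motion_residuals(residuals, first_frame_latent):
    """Invert the representation: add predicted residuals back to the input latent."""
    return residuals + first_frame_latent.unsqueeze(1)
```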
MatAnyone: Stable Video Matting with Consistent Memory Propagation
Peiqing Yang · Shangchen Zhou · Jixin Zhao · Qingyi Tao · Chen Change Loy
Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To tackle this, we propose MatAnyone, a practical framework designed for target-assigned video matting. Specifically, building on a memory-based framework, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively combines memory from the previous frame. This ensures stable semantic consistency in core regions while maintaining fine details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, further improving matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust, accurate video matting in diverse real-world scenarios, outperforming existing methods. The code and model will be publicly available.
Unboxed: Geometrically and Temporally Consistent Video Outpainting
Zhongrui Yu · Martina Megaro-Boldini · Robert Sumner · Abdelaziz Djelouah
Extending the field of view of video content beyond its original version has many applications: immersive viewing experiences with VR devices, reformatting 4:3 legacy content to today's viewing conditions with wide screens, or simply extending vertically captured phone videos. Many existing works focus on synthesizing the video using generative models only. Despite promising results, this strategy currently seems limited in terms of quality. In this work, we address this problem using two key ideas: 3D-supported outpainting for the static regions of the images, and leveraging a pre-trained video diffusion model to ensure realistic and temporally coherent results, particularly for the dynamic parts. In the first stage, we iterate between image outpainting and updating the 3D scene representation (we use 3D Gaussian Splatting). Then we consider dynamic objects independently per frame and inpaint missing pixels. Finally, we propose a denoising scheme that allows us to maintain known reliable regions and update the dynamic parts to obtain temporally realistic results. We achieve state-of-the-art video outpainting, validated quantitatively and through a user study. We are also able to extend the field of view largely beyond the limits reached by existing methods.
High Dynamic Range Video Compression: A Large-Scale Benchmark Dataset and A Learned Bit-depth Scalable Compression Algorithm
Zhaoyi Tian · Feifeng Wang · Shiwei Wang · Zihao Zhou · Yao Zhu · Liquan Shen
Recently, learned video compression (LVC) has been undergoing a period of rapid development. However, due to the absence of large-scale, high-quality high dynamic range (HDR) video training data, LVC for HDR video remains unexplored. In this paper, we are the first to collect a large-scale HDR video benchmark dataset, named HDRVD2K, featuring a huge quantity of videos, diverse scenes, and multiple motion types. HDRVD2K fills the gap in video training data and facilitates the development of LVC for HDR videos. Based on HDRVD2K, we further propose the first learned bit-depth scalable video compression (LBSVC) network for HDR videos by effectively exploiting bit-depth redundancy between videos of multiple dynamic ranges. To achieve this, we first propose a compression-friendly bit-depth enhancement module (BEM) to effectively predict original HDR videos based on compressed tone-mapped low dynamic range (LDR) videos and a dynamic range prior, instead of reducing redundancy only through spatio-temporal predictions. Our method greatly improves the reconstruction quality and compression performance on HDR videos. Extensive experiments demonstrate the effectiveness of HDRVD2K for learned HDR video compression and the strong compression performance of our proposed LBSVC network.
ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression
Wei Jiang · Junru Li · Kai Zhang · Li zhang
In Learned Video Compression (LVC), improving inter prediction, such as enhancing temporal context mining and mitigating accumulated errors, is crucial for boosting rate-distortion performance. Existing LVCs mainly focus on mining temporal movements while neglecting non-local correlations among frames. Additionally, current contextual video compression models use a single reference frame, which is insufficient for handling complex movements. To address these issues, we propose leveraging non-local correlations across multiple frames to enhance temporal priors, significantly boosting rate-distortion performance. To mitigate error accumulation, we introduce a partial cascaded fine-tuning strategy that supports fine-tuning on full-length sequences with constrained computational resources. This method reduces the train-test mismatch in sequence lengths and significantly decreases accumulated errors. Based on the proposed techniques, we present a video compression scheme, ECVC. Experiments demonstrate that our ECVC achieves state-of-the-art performance, reducing $10.5\%$ and $11.5\%$ more bit-rate than the previous SOTA method DCVC-FM over VTM-13.2 low delay B (LDB) under intra periods (IP) of $32$ and $-1$, respectively.
RivuletMLP: An MLP-based Architecture for Efficient Compressed Video Quality Enhancement
Gang He · Weiran Wang · Guancheng Quan · Shihao Wang · Dajiang Zhou · Yunsong Li
Quality degradation from video compression manifests both spatially along texture edges and temporally with continuous motion changes. Despite recent advances, extracting aligned spatiotemporal information from adjacent frames remains challenging. This is mainly due to limitations in receptive field size and computational complexity, which makes existing methods struggle to efficiently enhance video quality. To address this issue, we propose RivuletMLP, an MLP-based network architecture. Specifically, our framework first employs a Spatiotemporal Dynamic Alignment (STDA) module to adaptively explore and align multi-frame feature information. Subsequently, we introduce two modules for feature reconstruction: a Spatiotemporal Flow Module (SFM) and a Benign Selection Compensation Module (BSCM). The SFM establishes non-local dependencies through an innovative feature permutation mechanism, recovering details while reducing computational cost. Additionally, the BSCM utilizes a collaborative learning strategy in both spatial and frequency domains to alleviate compression-induced inter-frame motion discontinuities. Experimental results demonstrate that RivuletMLP achieves superior computational efficiency while maintaining powerful reconstruction capabilities.
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
Feng Liu · Shiwei Zhang · Xiaofeng Wang · Yujie Wei · Haonan Qiu · Yuzhong Zhao · Yingya Zhang · Qixiang Ye · Fang Wan
As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings to ensure their differences better approximate those of the model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to indicate output caching. Experiments show that TeaCache achieves up to 4.41$\times$ acceleration over Open-Sora-Plan with negligible (-0.07\% Vbench score) degradation of visual quality. Code is enclosed in the supplementary material.
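A toy version of such input-difference-driven caching is sketched below: the accumulated relative change of the timestep-modulated input decides whether the previous denoiser output is reused. The relative-L1 proxy and the threshold are assumptions, and TeaCache's rescaling step is omitted.

```python
import torch

class NaiveTimestepCache:
    """Training-free caching sketch: skip the expensive denoiser call when the
    accumulated estimated output change stays below a threshold."""
    def __init__(self, threshold=0.1):
        self.threshold = threshold
        self.prev_input = None
        self.cached_output = None
        self.accum = 0.0

    def step(self, denoiser, modulated_input, *args):
        if self.prev_input is not None:
            # Relative L1 change of the modulated input, used as a cheap proxy
            # for how much the model output would change at this timestep.
            rel_l1 = ((modulated_input - self.prev_input).abs().mean()
                      / (self.prev_input.abs().mean() + 1e-8)).item()
            self.accum += rel_l1
        if self.cached_output is not None and self.accum < self.threshold:
            out = self.cached_output             # reuse: skip the forward pass
        else:
            out = denoiser(modulated_input, *args)
            self.cached_output, self.accum = out, 0.0
        self.prev_input = modulated_input
        return out
```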
AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
Mingzhen Sun · Weining Wang · Li · Jiawei Liu · Jiahui Sun · Wanquan Feng · Shanshan Lao · SiYu Zhou · Qian HE · Jing Liu
The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, by achieving competitive and state-of-the-art results across four challenging benchmarks.
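The non-decreasing constraint on per-frame corruption timesteps can be made concrete in a few lines, shown below; drawing uniform timesteps and sorting them along the time axis is an assumed sampling scheme for illustration, not necessarily the paper's FoPP or AD scheduler.

```python
import torch

def sample_nondecreasing_timesteps(num_frames, num_steps, batch=1):
    """Sample one diffusion timestep per frame such that timesteps never
    decrease along the time axis, so earlier frames are never noisier than
    later ones (the asynchronous constraint described above)."""
    t = torch.randint(0, num_steps, (batch, num_frames))
    return torch.sort(t, dim=1).values   # frame 0 gets the smallest (cleanest) timestep

print(sample_nondecreasing_timesteps(num_frames=8, num_steps=1000))
```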
Taming Teacher Forcing for Masked Autoregressive Video Generation
Deyu Zhou · Quan Sun · Yuang Peng · Kun Yan · Runpei Dong · Duomin Wang · Zheng Ge · Nan Duan · Xiangyu Zhang
We introduce Masked Autoregressive Video Generation (MAGI), a hybrid framework that combines the strengths of masked and causal modeling paradigms to achieve efficient video generation. MAGI utilizes masked modeling for intra-frame generation and causal modeling to capture temporal dependencies across frames. A key innovation is our Standard Teacher Forcing paradigm, which conditions masked frames on complete observation frames, enabling a smooth transition from token-level to frame-level autoregressive modeling. To mitigate issues like exposure bias, we incorporate targeted training strategies, setting a new benchmark for autoregressive video generation. Extensive experiments demonstrate that MAGI can generate long, coherent video sequences of over 100 frames, even when trained on as few as 16 frames, establishing its potential for scalable, high-quality video generation.
Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution
Shijun Shi · Jing Xu · Lijing Lu · Zhihang Li · Kai Hu
Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves superior perceptual quality compared with state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
Face Forgery Video Detection via Temporal Forgery Cue Unraveling
Zonghui Guo · YingJie Liu · Jie Zhang · Haiyong Zheng · Shiguang Shan
Face Forgery Video Detection ($\textbf{FFVD}$) is a critical yet challenging task in determining whether a digital facial video is authentic or forged. Existing FFVD methods typically focus on isolated spatial or coarsely fused spatiotemporal information, failing to leverage temporal forgery cues thus resulting in unsatisfactory performance. We strive to unravel these cues across three progressive levels: $\textit{momentary anomaly}$, $\textit{gradual inconsistency}$, and $\textit{cumulative distortion}$. Accordingly, we design a consecutive correlate module to capture momentary anomaly cues by correlating interactions among consecutive frames. Then, we devise a future guide module to unravel inconsistency cues by iteratively aggregating historical anomaly cues and gradually propagating them into future frames. Finally, we introduce a historical review module that unravels distortion cues via momentum accumulation from future to historical frames. These three modules form our Temporal Forgery Cue Unraveling ($\textbf{TFCU}$) framework, sequentially highlighting spatial discriminative features by unraveling temporal forgery cues bidirectionally between historical and future frames. Extensive experiments and ablation studies demonstrate the effectiveness of our TFCU method, achieving state-of-the-art performance across diverse unseen datasets and manipulation methods. Code will be made publicly available.
SVFR: A Unified Framework for Generalized Video Face Restoration
Zhiyao Wang · Xu Chen · Chengming Xu · Junwei Zhu · Xiaobin Hu · Jiangning Zhang · Chengjie Wang · Yuqi Liu · Yiyi Zhou · Rongrong Ji
Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video BFR, inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed Stable Video Face Restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage shared feature representation learning among different subtasks. To further enhance the restoration quality and temporal stability, we introduce facial prior learning and self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state of the art in video FR and establishes a new paradigm for generalized video face restoration.
RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark
Xin Zhang · Xue Yang · Yuxuan Li · Jian Yang · Ming-Ming Cheng · Xiang Li
Rotated object detection has made significant progress in optical remote sensing. However, advancements in the Synthetic Aperture Radar (SAR) field lag behind, primarily due to the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, existing weakly supervised models exhibit limited accuracy in predicting the object's angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and show that they share the same shortcoming: they overlook the unit circle constraint inherent in these encodings, easily leading to prediction biases. To address this issue, we propose the Unit Cycle Resolver (UCR), which incorporates a unit circle constraint loss to improve angle prediction accuracy. Our approach can effectively improve the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., the DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the largest multi-class rotated SAR object detection dataset to date. Extensive experiments on both RSAR and optical datasets demonstrate that our UCR enhances angle prediction accuracy.
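To make the resolver discussion concrete, the sketch below shows the (cos, sin) angle encoding, its decoding via atan2, and a simple quadratic penalty enforcing the unit-circle constraint; the exact loss form used by UCR may differ.

```python
import torch

def decode_angle(cos_pred, sin_pred):
    """Recover the angle (in radians) from a (cos, sin) angle encoding."""
    return torch.atan2(sin_pred, cos_pred)

def unit_circle_loss(cos_pred, sin_pred):
    """Penalty keeping the predicted encoding on the unit circle,
    i.e. cos^2 + sin^2 = 1, so the two components stay a valid angle."""
    return ((cos_pred ** 2 + sin_pred ** 2 - 1.0) ** 2).mean()
```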
RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability
Minh Kha Do · Kang Han · Phu Lai · Khoa T. Phan · Wei Xiang
Foundation models for remote sensing have garnered increasing attention for their strong performance across various observation tasks. However, current models lack robustness when managing diverse input types and handling incomplete data in downstream tasks. In this paper, we propose RobSense, a robust multi-modal foundation model for multi-spectral and Synthetic Aperture Radar (SAR) data. RobSense is designed with modular components and pre-trained by a combination of temporal multi-modal and masked autoencoder strategies on a large-scale dataset. Therefore, it can effectively support diverse input types, from static to temporal and uni-modal to multi-modal. To further handle incomplete data, we incorporate two uni-modal latent reconstructors to recover rich representations from incomplete inputs, addressing variability in spectral bands and temporal sequence irregularities. Extensive experiments show that RobSense consistently outperforms state-of-the-art baselines on the complete dataset across four input types for segmentation and classification tasks. Furthermore, the proposed model outperforms the baselines by considerably larger margins as the missing rate increases on incomplete datasets.
A Selective Re-learning Mechanism for Hyperspectral Fusion Imaging
Yuanye Liu · jinyang liu · Renwei Dian · Shutao Li
Hyperspectral fusion imaging is challenged by high computational cost due to the abundant spectral information. We find that pixels in regions with smooth spatial-spectral structure can be reconstructed well using a shallow network, while only those in regions with complex spatial-spectral structure require a deeper network. However, existing methods process all pixels uniformly, which ignores this property. To leverage this property, we propose a Selective Re-learning Fusion Network (SRLF) that initially extracts features from all pixels uniformly and then selectively refines distorted feature points. Specifically, SRLF first employs a Preliminary Fusion Module with robust global modeling capability to generate a preliminary fusion feature. Afterward, it applies a Selective Re-learning Module to focus on improving distorted feature points in the preliminary fusion feature. To achieve targeted learning, we present a novel Spatial-Spectral Structure-Guided Selective Re-learning Mechanism (SSG-SRL) that integrates the observation model to identify the feature points with spatial or spectral distortions. Only these distorted points are sent to the corresponding re-learning blocks, reducing both computational cost and the risk of overfitting. Finally, we develop an SRLF-Net, composed of multiple cascaded SRLFs, which surpasses multiple state-of-the-art methods on several datasets with minimal computational cost.
A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening
Jie Huang · Haorui Chen · Jiaxuan Ren · Siran Peng · Liang-Jian Deng
Currently, deep learning-based methods for remote sensing pansharpening have advanced rapidly. However, many existing methods struggle to fully leverage feature heterogeneity and redundancy, thereby limiting their effectiveness. To address these challenges across two key dimensions, we introduce a general adaptive dual-level weighting mechanism (ADWM), designed to enhance a wide range of existing deep-learning methods. First, Intra-Feature Weighting (IFW) evaluates correlations among channels within each feature and selectively weights them to reduce redundancy and enhance unique information. Second, Cross-Feature Weighting (CFW) adjusts contributions across layers based on inter-layer correlations, refining the final output by preserving key distinctions across feature depths. This dual-level weighting is efficiently implemented through our proposed Correlation-Aware Covariance Weighting (CACW), which generates weights by utilizing the correlations captured within the covariance matrix. Extensive experiments demonstrate the superior performance of ADWM compared to recent state-of-the-art (SOTA) methods. Furthermore, we validate the effectiveness of our approach through generality experiments, ablation studies, comparison experiments, and detailed visual analysis.
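One possible reading of covariance-based channel weighting is sketched below: per-channel weights are derived from the channel correlation matrix so that highly redundant channels are down-weighted. This is an illustrative construction under that assumption, not the paper's exact CACW.

```python
import torch

def covariance_channel_weights(feat, eps=1e-6):
    """Down-weight channels that are strongly correlated with many others.
    feat: (B, C, H, W) feature map; returns the reweighted feature."""
    B, C, H, W = feat.shape
    x = feat.flatten(2)                            # (B, C, H*W)
    x = x - x.mean(dim=-1, keepdim=True)
    cov = x @ x.transpose(1, 2) / (H * W - 1)      # (B, C, C) channel covariance
    std = torch.sqrt(torch.diagonal(cov, dim1=1, dim2=2) + eps)
    corr = cov / (std.unsqueeze(2) * std.unsqueeze(1))
    redundancy = corr.abs().mean(dim=-1)           # (B, C) mean |correlation| per channel
    weights = torch.softmax(-redundancy, dim=-1) * C
    return feat * weights.view(B, C, 1, 1)
```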
Task-driven Image Fusion with Learnable Fusion Loss
Haowen Bai · Jiangshe Zhang · Zixiang Zhao · Yichen Wu · Lilun Deng · Yukun Cui · Tao Feng · Shuang Xu
Multi-modal image fusion aggregates information from multiple sensor sources, achieving superior visual quality and perceptual characteristics compared to any single source, often enhancing downstream tasks. However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task-driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. Specifically, our fusion loss includes learnable parameters modeled by a neural network called the loss generation module. This module is supervised by the loss of downstream tasks in a meta-learning manner. The learning objective is to minimize the task loss of the fused images, once the fusion module has been optimized by the fusion loss. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing task loss, guiding the fusion process toward the task objectives. TDFusion’s training relies solely on the loss of downstream tasks, making it adaptable to any specific task. It can be applied to any architecture of fusion and task networks. Experiments demonstrate TDFusion’s performance in both fusion and task-related applications, including four public fusion datasets, semantic segmentation, and object detection. The code will be released.
Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining
Guanglu Dong · Tianheng Zheng · Yuanzhouhan Cao · Linbo Qing · Chao Ren
Recently, deep image deraining models based on paired datasets have made remarkable progress. However, they cannot be well applied in real-world applications due to the difficulty of obtaining real paired datasets and the poor generalization performance. In this paper, we propose a novel Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining framework, CSUD, to tackle the aforementioned challenges. During training with unpaired data, CSUD is capable of generating high-quality pseudo clean and rainy image pairs which are used to enhance the performance of the deraining network. Specifically, to preserve more image background details while transferring rain streaks from rainy images to the unpaired clean images, we propose a novel Channel Consistency Loss (CCLoss) by introducing the Channel Consistency Prior (CCP) of rain streaks into the training process, thereby ensuring that the generated pseudo rainy images closely resemble the real ones. Furthermore, we propose a novel Self-Reconstruction (SR) strategy to alleviate the redundant information transfer problem of the generator, further improving the deraining performance and the generalization capability of our method. Extensive experiments on multiple synthetic and real-world datasets demonstrate that the deraining performance of CSUD surpasses other state-of-the-art unsupervised methods and CSUD exhibits superior generalization capability.
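A hedged sketch of a channel-consistency style penalty is given below, assuming the prior that a transferred rain-streak layer should look nearly identical across RGB channels; the loss name and formulation here are illustrative only and the paper's CCLoss may differ in detail.

```python
import torch

def channel_consistency_loss(pseudo_rainy: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    # pseudo_rainy, clean: (B, 3, H, W); their difference approximates the rain layer.
    rain = pseudo_rainy - clean
    r, g, b = rain[:, 0], rain[:, 1], rain[:, 2]
    # Penalize disagreement between the channels of the transferred rain streaks.
    return ((r - g).abs() + (g - b).abs() + (r - b).abs()).mean()

clean = torch.rand(2, 3, 64, 64)
pseudo = clean + 0.2 * torch.rand(2, 1, 64, 64)   # channel-consistent streaks incur ~0 loss
print(channel_consistency_loss(pseudo, clean).item())
```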
OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction
Gehui Li · Bin Chen · Chen Zhao · Lei Zhang · Jian Zhang
Exposure correction is a fundamental problem in computer vision and image processing. Recently, frequency domain-based methods have achieved impressive improvement, yet they still struggle with complex real-world scenarios under extreme exposure conditions. This is due to the local convolutional receptive fields failing to model long-range dependencies in the spectrum, and the non-generative learning paradigm being inadequate for retrieving lost details from severely degraded regions. In this paper, we propose Omnidirectional Spectral Mamba (OSMamba), a novel exposure correction network that incorporates the advantages of state space models and generative diffusion models to address these limitations. Specifically, OSMamba introduces an omnidirectional spectral scanning mechanism that adapts Mamba to the frequency domain to capture comprehensive long-range dependencies in both the amplitude and phase spectra of deep image features, hence enhancing illumination correction and structure recovery. Furthermore, we develop a dual-domain prior generator that learns from well-exposed images to generate a degradation-free diffusion prior containing correct information about severely under- and over-exposed regions for better detail restoration. Extensive experiments on multiple-exposure and mixed-exposure datasets demonstrate that the proposed OSMamba achieves state-of-the-art performance both quantitatively and qualitatively.
MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration
Boyun Li · Haiyu Zhao · Wenxin Wang · Peng Hu · Yuanbiao Gou · Xi Peng
Recent advancements in Mamba have shown promising results in image restoration. These methods typically flatten 2D images into multiple distinct 1D sequences along rows and columns, process each sequence independently using the selective scan operation, and recombine them to form the outputs. However, such a paradigm overlooks two vital aspects: i) the local relationships and spatial continuity inherent in natural images, and ii) the discrepancies among sequences unfolded in totally different ways. To overcome these drawbacks, we explore two problems in Mamba-based restoration methods: i) how to design a scanning strategy preserving both locality and continuity while facilitating restoration, and ii) how to aggregate the distinct sequences unfolded in totally different ways. To address these problems, we propose a novel Mamba-based Image Restoration model (MaIR), which consists of a Nested S-shaped Scanning strategy (NSS) and a Sequence Shuffle Attention block (SSA). Specifically, NSS preserves locality and continuity of the input images through the stripe-based scanning region and the S-shaped scanning path, respectively. SSA aggregates sequences through calculating attention weights within the corresponding channels of different sequences. Thanks to NSS and SSA, MaIR surpasses 40 baselines across 14 challenging datasets, achieving state-of-the-art performance on the tasks of image super-resolution, denoising, deblurring and dehazing. Our codes will be available after acceptance.
Zero-Shot Blind-spot Image Denoising via Implicit Neural Sampling
Yuhui Quan · Tianxiang Zheng · Zhiyuan Ma · Hui Ji
The blind-spot principle has been a widely used tool in zero-shot image denoising but faces challenges with real-world noise that exhibits strong local correlations. Existing methods focus on reducing noise correlation, such as masking or reshuffling, which also weaken the pixel correlations needed for estimating missing pixels, resulting in performance degradation. In this paper, we first present a rigorous analysis of how noise correlation and pixel correlation impact the statistical risk of a linear blind-spot denoiser. We then propose using an implicit neural representation to resample noisy pixels, effectively reducing noise correlation while preserving the essential pixel correlations for successful blind-spot denoising. Extensive experiments show our method surpasses existing zero-shot denoising techniques on real-world noisy images.
Adaptive Dropout: Unleashing Dropout across Layers for Generalizable Image Super-Resolution
Hang Xu · Jie Huang · Wei Yu · Jiangtong Tan · Zhen Zou · Feng Zhao
Blind Super-Resolution (blind SR) aims to enhance the model's generalization ability with unknown degradation, yet it still encounters severe overfitting issues. Some previous methods inspired by dropout, which enhances generalization by regularizing features, have shown promising results in blind SR. Nevertheless, these methods focus solely on regularizing features before the final layer and overlook the need for generalization in features at intermediate layers. Without explicit regularization of features at intermediate layers, the blind SR network struggles to obtain well-generalized feature representations. However, the key challenge is that directly applying dropout to intermediate layers leads to a significant performance drop, which we attribute to the training-testing inconsistency and the cross-layer inconsistency it introduces. Therefore, we propose Adaptive Dropout, a new regularization method for blind SR models, which mitigates the inconsistency and facilitates application across intermediate layers of networks. Specifically, for training-testing inconsistency, we re-design the form of dropout and integrate the features before and after dropout adaptively. For inconsistency in generalization requirements across different layers, we innovatively design an adaptive training strategy to strengthen feature propagation by layer-wise annealing. Experimental results show that our method outperforms all past regularization methods on both synthetic and real-world benchmark datasets, and is also highly effective in other image restoration tasks.
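A minimal sketch of mixing the features before and after dropout with a learnable coefficient is shown below; the single per-layer scalar and its sigmoid parameterization are assumptions for illustration, and the paper's adaptive integration and layer-wise annealing schedule are more involved.

```python
import torch
import torch.nn as nn

class AdaptiveDropout(nn.Module):
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.drop = nn.Dropout2d(p)
        self.alpha = nn.Parameter(torch.tensor(0.0))   # mixing logit, learned per layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.alpha)                   # mixing weight in (0, 1)
        return a * self.drop(x) + (1 - a) * x           # blend dropped and original features

layer = AdaptiveDropout(0.3)
print(layer(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```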
Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks
Haijin Zeng · Xiangming Wang · Yongyong Chen · Jingyong Su · Jie Liu
Dynamic image degradations, including noise, blur and lighting inconsistencies, pose significant challenges in image restoration, often due to sensor limitations or adverse environmental conditions. Existing Deep Unfolding Networks (DUNs) offer stable restoration performance but require manual selection of degradation matrices for each degradation type, limiting their adaptability across diverse scenarios. To address this issue, we propose the Vision-Language-guided Unfolding Network (VLU-Net), a unified DUN framework for handling multiple degradation types simultaneously. VLU-Net leverages a Vision-Language Model (VLM) refined on degraded image-text pairs to align image features with degradation descriptions, selecting the appropriate transform for target degradation. By integrating an automatic VLM-based gradient estimation strategy into the Proximal Gradient Descent (PGD) algorithm, VLU-Net effectively tackles complex multi-degradation restoration tasks while maintaining interpretability. Furthermore, we design a hierarchical feature unfolding structure to enhance the VLU-Net framework, efficiently synthesizing degradation patterns across various levels. VLU-Net is the first all-in-one DUN framework and outperforms current leading one-by-one and all-in-one end-to-end methods by 3.74 dB on the SOTS dehazing dataset and 1.70 dB on the Rain100L deraining dataset.
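For readers unfamiliar with deep unfolding, one generic unfolded proximal gradient step looks like the sketch below. The degradation operator (a stand-in convolution used for both A and its adjoint), the step size, and the small CNN acting as the learned proximal mapping are all illustrative assumptions; VLU-Net additionally estimates the gradient/operator from a VLM, which is not shown here.

```python
import torch
import torch.nn as nn

class UnfoldStep(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.1))     # learnable step size
        self.prox = nn.Sequential(                      # learned proximal operator
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, x, y, A, At):
        # Gradient step on the data term 0.5 * ||A(x) - y||^2, then the learned prox.
        x = x - self.step * At(A(x) - y)
        return self.prox(x)

blur = nn.Conv2d(3, 3, 5, padding=2, bias=False)        # stand-in degradation A (and a crude stand-in for its adjoint)
x = y = torch.randn(1, 3, 32, 32)
print(UnfoldStep()(x, y, blur, blur).shape)             # torch.Size([1, 3, 32, 32])
```

Stacking several such steps, each with its own learned proximal network, yields the cascaded structure typical of DUNs.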
DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution
Xingyuan Li · Zirui Wang · Yang Zou · Zhixin Chen · Jun Ma · Zhiying Jiang · Long Ma · Jinyuan Liu
Infrared imaging is essential for autonomous driving and robotic operations as a supportive modality due to its reliable performance in challenging environments. Despite its popularity, the limitations of infrared cameras, such as low spatial resolution and complex degradations, consistently challenge imaging quality and subsequent visual tasks. Hence, infrared image super-resolution (IISR) has been developed to address this challenge. While recent developments in diffusion models have greatly advanced this field, current methods either ignore the unique modal characteristics of infrared imaging or overlook the machine perception requirements. To bridge these gaps, we propose DifIISR, an infrared image super-resolution diffusion model optimized for visual quality and perceptual performance. Our approach achieves task-based guidance for diffusion by injecting gradients derived from visual and perceptual priors into the noise during the reverse process. Specifically, we introduce an infrared thermal spectrum distribution regulation to preserve visual fidelity, ensuring that the reconstructed infrared images closely align with high-resolution images by matching their frequency components. Subsequently, we incorporate various visual foundational models as the perceptual guidance for downstream visual tasks, infusing generalizable perceptual features beneficial for detection and segmentation. As a result, our approach achieves superior visual results while attaining state-of-the-art downstream task performance.
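An illustrative guided reverse-diffusion update is sketched below; the denoise_step and guidance_loss callables are hypothetical stand-ins, and the gradient is simply subtracted from the intermediate sample, whereas DifIISR injects its thermal-spectrum and perceptual guidance into the noise of the reverse process.

```python
import torch

def guided_reverse_step(x_t, denoise_step, guidance_loss, scale=0.1):
    x_t = x_t.detach().requires_grad_(True)
    x_prev = denoise_step(x_t)                           # ordinary reverse step
    g = torch.autograd.grad(guidance_loss(x_prev), x_t)[0]
    return (x_prev - scale * g).detach()                 # steer the sample with the guidance gradient

# toy usage with stand-in functions
denoise_step = lambda x: 0.9 * x
guidance_loss = lambda x: (x - x.mean()).pow(2).mean()   # e.g. a spectral/perceptual prior
x = torch.randn(1, 3, 16, 16)
print(guided_reverse_step(x, denoise_step, guidance_loss).shape)
```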
Reversing Flow for Image Restoration
Haina Qin · Wenyang Luo · Bing Li · Weiming Hu · libin wang · DanDan Zheng · Jingdong Chen · Ming Yang
Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restoration framework that models the degradation process as a deterministic path using continuous normalizing flows. ResFlow augments the degradation process with an auxiliary process that disambiguates the uncertainty in HQ prediction to enable reversible modeling of the degradation process. ResFlow adopts entropy-preserving flow paths and learns the augmented degradation flow by matching the velocity field. ResFlow significantly improves the performance and speed of image restoration, completing the task in fewer than four sampling steps. Extensive experiments demonstrate that ResFlow achieves state-of-the-art results across various image restoration benchmarks, offering a practical and efficient solution for real-world applications. Our code will be made publicly available.
Navigating Image Restoration with VAR’s Distribution Alignment Prior
Siyang Wang · Naishan Zheng · Jie Huang · Feng Zhao
Generative models trained on extensive high-quality datasets effectively capture the structural and statistical properties of clean images, rendering them powerful priors for transforming degraded features into clean ones in image restoration. VAR, a novel image generative paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach. It progressively captures both global structures and fine-grained details through the autoregressive process, consistent with the multi-scale restoration principle widely acknowledged in the restoration community. Furthermore, we observe that during the image reconstruction process utilizing VAR, scale predictions automatically modulate the input, facilitating the alignment of representations at subsequent scales with the distribution of clean images. To harness VAR's adaptive distribution alignment capability in image restoration tasks, we formulate the multi-scale latent representations within VAR as the restoration prior, thus advancing our carefully designed VarFormer framework. The strategic application of these priors enables our VarFormer to achieve remarkable generalization on unseen tasks while also reducing training computational costs. Extensive experiments underscore that our VarFormer outperforms existing multi-task image restoration methods across various restoration tasks.
Image Quality Assessment: From Human to Machine Preference
Chunyi Li · Yuan Tian · Xiaoyue Ling · Zicheng Zhang · Haodong Duan · Haoning Wu · Ziheng Jia · Xiaohong Liu · Xiongkuo Min · Guo Lu · Weisi Lin · Guangtao Zhai
Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering the huge gap between human and machine visual systems, this paper proposes the topic: Image Quality Assessment for Machine Vision for the first time. Specifically, we (1) defined the subjective preferences of machines, including downstream tasks, test models, and evaluation metrics; (2) established the Machine Preference Database (MPD), which contains 2.25M fine-grained annotations and 30k reference/distorted image pair instances; (3) verified the performance of mainstream IQA algorithms on MPD. Experiments show that current IQA metrics are human-centric and cannot accurately characterize machine preferences. We sincerely hope that MPD can promote the evolution of IQA from human to machine preferences.
DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables
Sidi Yang · Binxiao Huang · Yulun Zhang · Dahai Yu · Yujiu Yang · Ngai Wong
While deep neural networks have revolutionized image denoising capabilities, their deployment on edge devices remains challenging due to substantial computational and memory requirements. To this end, we present DnLUT, an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption. Our key innovation lies in two complementary components: a Pairwise Channel Mixer (PCM) that effectively captures inter-channel correlations and spatial dependencies in parallel, and a novel L-shaped convolution design that maximizes receptive field coverage while minimizing storage overhead. By converting these components into optimized lookup tables post-training, DnLUT achieves remarkable efficiency, requiring only 500KB of storage and 0.1\% of the energy consumption of its CNN counterpart DnCNN, while delivering 20× faster inference. Extensive experiments demonstrate that DnLUT outperforms all existing LUT-based methods by over 1dB in PSNR, establishing a new state-of-the-art in resource-efficient color image denoising.
Detail-Preserving Latent Diffusion for Stable Shadow Removal
Jiamin Xu · Yuxin Zheng · Zelong Li · Chi Wang · Renshu Gu · Weiwei Xu · Gang Xu
Achieving high-quality shadow removal with strong generalizability is challenging in scenes with complex global illumination. Due to the limited diversity in shadow removal datasets, current methods are prone to overfitting training data, often leading to reduced performance on unseen cases. To address this, we leverage the rich visual priors of a pre-trained Stable Diffusion (SD) model and propose a two-stage fine-tuning pipeline to adapt the SD model for stable and efficient shadow removal. In the first stage, we fix the VAE and fine-tune the denoiser in latent space, which yields substantial shadow removal but may lose some high-frequency details. To resolve this, we introduce a second stage, called the detail injection stage. This stage selectively extracts features from the VAE encoder to modulate the decoder, injecting fine details into the final results. Experimental results show that our method outperforms state-of-the-art shadow removal techniques. The cross-dataset evaluation further demonstrates that our method generalizes effectively to unseen data, enhancing the applicability of shadow removal methods.
Shadow Generation Using Diffusion Model with Geometry Prior
Haonan Zhao · Qingyang Liu · Xinhao Tao · Li Niu · Guangtao Zhai
Image composition involves integrating a foreground object into a background image to obtain a composite image. One of the key challenges is to produce a realistic shadow for the inserted foreground object. Recently, diffusion-based methods have shown superior performance compared to GAN-based methods in shadow generation. However, they still struggle to generate shadows with plausible geometry in complex cases. In this paper, we focus on promoting diffusion-based methods by leveraging geometry priors. Specifically, we first predict the rotated bounding box and matched shadow shapes for the foreground shadow. Then, the geometry information of the rotated bounding box and matched shadow shapes is injected into ControlNet to facilitate shadow generation. Extensive experiments on both the DESOBAv2 dataset and real composite images validate the effectiveness of our proposed method.
TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting
Liangbin Xie · Daniil Pakhomov · Zhonghao Wang · Zongze Wu · Ziyan Chen · Yuqian Zhou · Haitian Zheng · Zhifei Zhang · Zhe Lin · Jiantao Zhou · Chao Dong
This paper introduces TurboFill, a fast image inpainting model that augments a few-step text-to-image diffusion model with an inpainting adapter to achieve high-quality and efficient inpainting. While standard diffusion models can generate high quality results, they incur high computational cost. We address this limitation by directly training an inpainting adapter on a few-step distilled text-to-image model, specifically DMD2, with a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which assesses model performance across variable mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill significantly outperforms both existing diffusion-based inpainting models and few-step inpainting methods, setting a new benchmark for practical high-performance inpainting tasks.
Linear Attention Modeling for Learned Image Compression
Donghui Feng · Zhengxue Cheng · Shen Wang · Ronghua Wu · Hongwei Hu · Guo Lu · Li Song
In recent years, learned image compression has made tremendous progress toward impressive coding efficiency. Its coding gain mainly comes from non-linear neural network-based transform and learnable entropy modeling. However, most recent efforts have focused solely on a strong backbone, and few studies consider the low-complexity design. In this paper, we propose LALIC, a linear attention modeling framework for learned image compression. Specifically, we propose to use Bi-RWKV blocks, utilizing the Spatial Mix and Channel Mix modules to achieve more compact feature extraction, and apply the Conv-based Omni-Shift module to adapt to two-dimensional latent representations. Furthermore, we propose a RWKV-based Spatial-Channel ConTeXt model (RWKV-SCCTX), which leverages Bi-RWKV to effectively model the correlation between neighboring features, to further improve the RD performance. To our knowledge, this is the first work to utilize efficient Bi-RWKV models with linear attention for learned image compression. Experimental results demonstrate that our method achieves competitive RD performance, outperforming VTM-9.1 by -14.84\%, -15.20\%, and -17.32\% in BD-rate on the Kodak, Tecnick and CLIC Professional validation datasets.
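For context, the sketch below shows a generic kernelized linear attention, which illustrates the O(N) cost that motivates this line of work; it is not the Bi-RWKV block used by LALIC, and the elu-plus-one feature map is just one common choice.

```python
import torch

def linear_attention(q, k, v):
    # q, k, v: (B, N, d); feature map phi(x) = elu(x) + 1 keeps values positive.
    q = torch.nn.functional.elu(q) + 1
    k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                  # (B, d, d_v): O(N) in sequence length
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)         # no N x N attention matrix is formed

q = k = v = torch.randn(2, 1024, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 1024, 64])
```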
Multirate Neural Image Compression with Adaptive Lattice Vector Quantization
Hao Xu · Xiaolin Wu · Xi Zhang
Recent research has explored integrating lattice vector quantization (LVQ) into learned image compression models. Due to its more efficient Voronoi covering of vector space than scalar quantization (SQ), LVQ achieves better rate-distortion (R-D) performance than SQ, while still retaining the low complexity advantage of SQ. However, existing LVQ-based methods have two shortcomings: 1) the lack of a multirate coding mode, hence the inability to operate at different rates; 2) the use of a fixed lattice basis, hence no adaptation to changing source distributions. To overcome these shortcomings, we propose a novel adaptive LVQ method, which is the first among LVQ-based methods to achieve both rate and domain adaptation. By scaling the lattice basis vector, our method can adjust the density of lattice points to achieve various bit rate targets, achieving superior R-D performance to current SQ-based variable rate models. Additionally, by using a learned invertible linear transformation between two different input domains, we can reshape the predefined lattice cell to better represent the target domain, further improving the R-D performance. To our knowledge, this paper represents the first attempt to propose a unified solution for rate adaptation and domain adaptation through quantizer design.
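A simplified sketch of lattice quantization with a scalable basis is shown below, assuming a generic generator matrix; rounding coordinates is only the exact nearest-point rule for orthogonal bases, and the paper's learned invertible transform for domain adaptation and its specific lattices are omitted.

```python
import torch

def lattice_quantize(x: torch.Tensor, basis: torch.Tensor, scale: float) -> torch.Tensor:
    # x: (N, d) latent vectors; basis: (d, d) lattice generator matrix.
    coords = torch.linalg.solve(basis, x.T).T / scale     # coordinates in the scaled lattice
    return (torch.round(coords) * scale) @ basis.T        # snap to a nearby lattice point

d = 4
basis = torch.eye(d) + 0.1 * torch.randn(d, d)            # stand-in lattice basis
x = torch.randn(8, d)
for s in (0.5, 1.0, 2.0):                                  # larger scale -> coarser lattice -> lower rate
    err = (lattice_quantize(x, basis, s) - x).pow(2).mean()
    print(f"scale={s}: mse={err.item():.3f}")
```

The printed distortion grows with the scale, which is exactly the rate-distortion knob the abstract describes.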
Generative Image Layer Decomposition with Visual Effects
Jinrui Yang · Qing Liu · Yijun Li · Soo Ye Kim · Daniil Pakhomov · Mengwei Ren · Jianming Zhang · Zhe Lin · Cihang Xie · Yuyin Zhou
Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose an image into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose $\textbf{LayerDecomp}$, a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss which enforces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing.
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Dar-Yen Chen · Hmrishav Bandyopadhyay · Kai Zou · Yi-Zhe Song
We introduce NitroFusion, a fundamentally different approach to single-step diffusion that achieves high-quality generation through a dynamic adversarial framework. While one-step methods offer dramatic speed advantages, they typically suffer from quality degradation compared to their multi-step counterparts. Just as a panel of art critics provides comprehensive feedback by specializing in different aspects like composition, color, and technique, our approach maintains a large pool of specialized discriminator heads that collectively guide the generation process. Each discriminator group develops expertise in specific quality aspects at different noise levels, providing diverse feedback that enables high-fidelity one-step generation. Our framework combines: (i) a dynamic discriminator pool with specialized discriminator groups to improve generation quality, (ii) strategic refresh mechanisms to prevent discriminator overfitting, (iii) global-local discriminator heads for multi-scale quality assessment, and (iv) unconditional/conditional training for balanced generation. Additionally, our framework uniquely supports flexible deployment through bottom-up refinement, allowing users to dynamically choose between 1-4 denoising steps with the same model for direct quality-speed trade-offs. Through comprehensive experiments, we demonstrate that NitroFusion significantly outperforms existing single-step methods across multiple evaluation metrics, particularly excelling in preserving fine details and global consistency.
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
Lianghui Zhu · Zilong Huang · Bencheng Liao · Jun Hao Liew · Hanshu Yan · Jiashi Feng · Xinggang Wang
Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic complexity efficiency, especially when handling long sequences. In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attention (GLA) into the 2D diffusion backbone. Specifically, we introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead. We offer two variants, i.e., a plain and a U-shaped architecture, showing superior efficiency and competitive effectiveness. In addition to superior performance over DiT and other sub-quadratic-time diffusion models at $256 \times 256$ resolution, DiG demonstrates greater efficiency than these methods starting from a $512$ resolution. Specifically, DiG-S/2 is $2.5\times$ faster and saves $75.7\%$ GPU memory compared to DiT-S/2 at a $1792$ resolution. Additionally, DiG-XL/2 is $4.2\times$ faster than the Mamba-based model at a $1024$ resolution and $1.8\times$ faster than DiT with FlashAttention-2 at a $2048$ resolution.
Reanimating Images using Neural Representations of Dynamic Stimuli
Jacob Yeung · Andrew Luo · Gabriel Sarch · Margaret Marie Henderson · Deva Ramanan · Michael J. Tarr
While computer vision models have made incredible strides in static image recognition, they still do not match human performance in tasks that require the understanding of complex, dynamic motion. This is notably true for real-world scenarios where embodied agents face complex and motion-rich environments. Our approach leverages state-of-the-art video diffusion models to decouple static image representation from motion generation, enabling us to utilize fMRI brain activity for a deeper understanding of human responses to dynamic visual stimuli. Conversely, we also demonstrate that information about the brain's representation of motion can enhance the prediction of optical flow in artificial systems. Our novel approach leads to four main findings: (1) Visual motion, represented as fine-grained, object-level resolution optical flow, can be decoded from brain activity generated by participants viewing video stimuli; (2) Video encoders outperform image-based models in predicting video-driven brain activity; (3) Brain-decoded motion signals enable realistic video reanimation based only on the initial frame of the video; and (4) We extend prior work to achieve full video decoding from video-driven brain activity. This framework advances our understanding of how the brain represents spatial and temporal information in dynamic visual scenes. Our findings demonstrate the potential of combining brain imaging with video diffusion models for developing more robust and biologically-inspired computer vision systems.
Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training
Lexington Whalen · Zhenbang Du · Haoran You · Chaojian Li · Sixu Li · Yingyan (Celine) Lin
Training diffusion models (DMs) is highly computationally demanding, necessitating multiple forward and backward passes across numerous timesteps. This challenge has motivated the exploration of various efficient DM training techniques. In this paper, we propose $\textbf{EB-Diff-Train}$, a new and orthogonal efficient DM training approach by investigating and leveraging Early-Bird (EB) tickets—sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, which draw on insights from varying importance of different timestep regions/periods. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them through an ensemble during inference for image generation. This approach can significantly reduce training time both spatially and temporally—achieving 2.9$\times$$\sim$5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines—without compromising generative quality. Extensive experiments and ablation studies validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. Our work not only enhances the understanding of DM training dynamics but also significantly improves training efficiency by exploiting temporal and spatial sparsity. All codes and models will be released upon acceptance.
Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis
Yousef Yeganeh · Ioannis Charisiadis · Marta Hasny · Martin Hartenberger · Björn Ommer · Nassir Navab · Azade Farshad · Ehsan Adeli
Scaling by training on large datasets has been shown to enhance the quality and fidelity of image generation and manipulation with diffusion models; however, such large datasets are not always accessible in medical imaging due to cost and privacy issues. This contradicts one of the main applications of such models: producing synthetic samples where real data is scarce. Also, fine-tuning on pre-trained general models has been a challenge due to the distribution shift between the medical domain and the pre-trained models. Here, we propose Latent Drift (LD) for diffusion models, which can be adopted for any fine-tuning method to mitigate the issues faced by the distribution shift, or employed at inference time as a condition. Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation, which is crucial to investigate how parameters such as gender, age, and adding or removing diseases in a patient would alter the medical images. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation. Our results demonstrate significant performance gains in various scenarios when combined with different fine-tuning schemes. The source code of this work will be publicly released upon its acceptance.
Style Quantization for Data-Efficient GAN Training
Jian Wang · Xin Lan · Ji-Zhe Zhou · Yuxin Tian · Jiancheng Lv
Under limited data settings, GANs often struggle to navigate and effectively exploit the input latent space. Consequently, images generated from adjacent variables in a sparse input latent space may exhibit significant discrepancies in realism, leading to suboptimal consistency regularization (CR) outcomes. To address this, we propose \textit{SQ-GAN}, a novel approach that enhances CR by introducing a style space quantization scheme. This method transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, allowing each element to correspond to a specific data point, thereby improving CR performance. Instead of direct quantization, we first map the input latent variables into a less entangled ``style'' space and apply quantization using a learnable codebook. This enables each quantized code to control distinct factors of variation. Additionally, we optimize the optimal transport distance to align the codebook codes with features extracted from the training data by a foundation model, embedding external knowledge into the codebook and establishing a semantically rich vocabulary that properly describes the training dataset. Extensive experiments demonstrate significant improvements in both discriminator robustness and generation quality with our method.
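The codebook quantization step can be sketched as a standard nearest-code lookup with a straight-through estimator, as below; the dimensions are hypothetical, and SQ-GAN's optimal-transport alignment with foundation-model features is not shown.

```python
import torch
import torch.nn as nn

class StyleQuantizer(nn.Module):
    def __init__(self, num_codes: int = 256, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, style: torch.Tensor) -> torch.Tensor:
        # style: (B, dim) mapped style vectors
        dist = torch.cdist(style, self.codebook.weight)       # (B, num_codes) distances
        codes = self.codebook(dist.argmin(dim=1))             # nearest code per style vector
        # Straight-through: the forward pass uses the code, the backward pass flows to `style`.
        return style + (codes - style).detach()

q = StyleQuantizer()
print(q(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```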
Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts
Yu Cao · Zengqun Zhao · Ioannis Patras · Shaogang Gong
Visual artifacts remain a persistent challenge in diffusion models, even with training on massive datasets. Current solutions primarily rely on supervised detectors, yet lack understanding of why these artifacts occur in the first place. We discover that a diffusion generative process undergoes three temporal phases: Profiling, Mutation, and Refinement. We observe that artifacts typically emerge during the Mutation phase, where certain regions exhibit anomalous score dynamics over time, causing abrupt disruptions in the normal evolution pattern. This reveals why existing methods that quantify the spatial uncertainty of the final output, and therefore lack modeling of the temporal dynamics of the diffusion process, have been shown to be inadequate for artifact localization. Based on these insights, we propose a novel method, ASCED (Abnormal Score Correction for Enhancing Diffusion), that detects artifacts by monitoring abnormal score dynamics during the diffusion process, together with a trajectory-aware on-the-fly mitigation strategy that applies appropriate noise generation in the detected areas. Unlike most existing methods that apply post hoc corrections, e.g., by applying a noising-denoising scheme after generation, our mitigation strategy does not involve additional diffusion steps. Extensive experiments demonstrate that our proposed approach effectively reduces artifacts across diverse domains, matching or surpassing existing supervised methods without additional training.
Efficient Personalization of Quantized Diffusion Model without Backpropagation
Hoigi Seo · Wongi Jeong · Kyungryeol Lee · Se Young Chun
Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large memory, possibly due to dequantization for accurate computation of gradients and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization that often must be run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by leveraging zeroth-order optimization on personalization tokens without dequantization, so that it does not require the gradient and activation storage for backpropagation that consumes considerable memory. Since a gradient estimate obtained with zeroth-order optimization is quite noisy for a single or a few images in personalization, we propose to denoise the estimated gradient by projecting it onto a subspace constructed with the past history of the tokens, dubbed Subspace Gradient. In addition, we investigate the influence of the text embedding in image generation, leading to our proposed timestep sampling scheme, dubbed Partial Uniform Timestep Sampling, for sampling with effective diffusion timesteps. Our method achieves comparable performance to prior methods in image and text alignment scores for personalizing Stable Diffusion with only forward passes while reducing training memory demand by up to $8.2\times$.
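A hedged sketch of a zeroth-order update with subspace projection follows; the black-box loss, the finite-difference estimator, and the way the token history spans the subspace are illustrative assumptions, and the paper's exact estimator and timestep sampling differ in detail.

```python
import torch

def zo_gradient(loss_fn, token, eps=1e-3, samples=8):
    # Two-sided finite differences along random directions; no backpropagation needed.
    g = torch.zeros_like(token)
    for _ in range(samples):
        u = torch.randn_like(token)
        g += (loss_fn(token + eps * u) - loss_fn(token - eps * u)) / (2 * eps) * u
    return g / samples

def project_to_subspace(g, history):
    # history: (k, d) past token iterates; keep only the in-subspace gradient component.
    q, _ = torch.linalg.qr(history.T)                        # orthonormal basis of the history span
    return q @ (q.T @ g)

d = 32
token = torch.randn(d)
loss_fn = lambda t: (t - 1.0).pow(2).sum()                   # toy black-box objective
history = torch.stack([token + 0.1 * torch.randn(d) for _ in range(4)])
g = project_to_subspace(zo_gradient(loss_fn, token), history)
token = token - 1e-2 * g                                     # memory-light, backprop-free update
```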
Layered Image Vectorization via Semantic Simplification
Zhenyu Wang · Jianxi Huang · Zhida Sun · Yuanhao Gong · Daniel Cohen-Or · Min Lu
This work presents a progressive image vectorization technique that reconstructs the raster image as layer-wise vectors from semantic-aligned macro structures to finer details. Our approach introduces a new image simplification method leveraging the feature-average effect in the Score Distillation Sampling mechanism, achieving effective visual abstraction from the detailed to coarse. Guided by the sequence of progressive simplified images, we propose a two-stage vectorization process of structural buildup and visual refinement, constructing the vectors in an organized and manageable manner. The resulting vectors are layered and well-aligned with the target image's explicit and implicit semantic structures. Our method demonstrates high performance across a wide range of images. Comparative analysis with existing vectorization methods highlights our technique's superiority in creating vectors with high visual fidelity, and more importantly, achieving higher semantic alignment and more compact layered representation.
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Yiyang Ma · Xingchao Liu · Xiaokang Chen · Wen Liu · Chengyue Wu · Zhiyu Wu · Zizheng Pan · Zhenda Xie · Haowei Zhang · Xingkai Yu · Liang Zhao · Yisong Wang · Jiaying Liu · Chong Ruan
We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches. This work represents a step toward more efficient and versatile vision-language models.
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation
Hui Li · Mingwang Xu · Qingkun Su · Shan Mu · Li jiaye · Kaihui Cheng · Chen Yuxuan · Tan Chen · Mao Ye · Jingdong Wang · Siyu Zhu
Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce \textbf{OpenHumanVid}, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates an obvious improvement in the generation of human-centric videos.
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Minghong Cai · Xiaodong Cun · Xiaoyu Li · Wenze Liu · Zhaoyang Zhang · Yong Zhang · Ying Shan · Xiangyu Yue
Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation, struggling to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to take the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. Besides, we also present MPVBench, a new benchmark specially designed for multi-prompt video generation to evaluate the performance of multi-prompt generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
MotiF: Making Text Count in Image Animation with Motion Focal Loss
Shijie Wang · Samaneh Azadi · Rohit Girdhar · Sai Saketh Rambhatla · Chen Sun · Xi Yin
Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model's learning to the regions with more motion, thereby improving the text alignment and motion generation. We use optical flow to generate a motion heatmap and weight the loss according to the intensity of the motion. This simple objective leads to noticeable improvements and complements existing methods that utilize motion priors as model inputs. Additionally, due to the lack of a diverse benchmark for evaluating TI2V generation, we propose TI2V Bench, a dataset consisting of $320$ image-text pairs for robust evaluation. We present a human evaluation protocol that asks the annotators to select an overall preference between two videos followed by their justifications. Through a comprehensive evaluation, MotiF outperforms nine open-sourced models, achieving an average preference of 72%.
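A minimal sketch of a motion-weighted training loss is given below, assuming a precomputed optical-flow field and a simple normalization of its magnitude into a weight map; the tensor shapes and the exact weighting rule are illustrative rather than the paper's implementation.

```python
import torch

def motion_focal_loss(pred, target, flow):
    # pred, target: (B, C, T, H, W) predicted vs. ground-truth noise/latents
    # flow: (B, 2, T, H, W) optical flow between consecutive frames
    heat = flow.norm(dim=1, keepdim=True)                    # motion magnitude per pixel
    weight = 1.0 + heat / (heat.mean(dim=(-1, -2, -3), keepdim=True) + 1e-6)
    return (weight * (pred - target).pow(2)).mean()          # emphasize high-motion regions

pred = torch.randn(2, 4, 8, 16, 16)
target = torch.randn_like(pred)
flow = torch.randn(2, 2, 8, 16, 16)
print(motion_focal_loss(pred, target, flow).item())
```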
Visual Prompting for One-shot Controllable Video Editing without Inversion
Zhengbo Zhang · Yuxi Zhou · DUO PENG · Joo Lim · Zhigang Tu · De Soh Soh · Lin Geng Foo
One-shot controllable video editing (OCVE) is an important yet challenging task, aiming to propagate user edits that are made---using any image editing tool---on the first frame of a video to all subsequent frames, while ensuring content consistency between edited frames and source frames. To achieve this, prior methods employ DDIM inversion to transform source frames into latent noise, which is then fed into a pre-trained diffusion model, conditioned on the user-edited first frame, to generate the edited video. However, the DDIM inversion process accumulates errors, which hinder the latent noise from accurately reconstructing the source frames, ultimately compromising content consistency in the generated edited frames. To overcome it, our method eliminates the need for DDIM inversion by performing OCVE through a novel perspective based on visual prompting. Furthermore, inspired by consistency models that can perform multi-step consistency sampling to generate a sequence of content-consistent images, we propose a content consistency sampling (CCS) to ensure content consistency between the generated edited frames and the source frames. Moreover, we introduce a temporal-content consistency sampling (TCS) based on Stein Variational Gradient Descent to ensure temporal consistency across the edited frames. Extensive experiments validate the effectiveness of our approach.
Image tiling—the seamless connection of disparate images to create a coherent visual field—is crucial for applications such as texture creation, video game asset development, and digital art. Traditionally, tiles have been constructed manually, a method that poses significant limitations in scalability and flexibility. Recent research has attempted to automate this process using generative models. However, current approaches primarily focus on tiling textures and manipulating models for single-image generation, without inherently supporting the creation of multiple interconnected tiles across diverse domains. This paper presents Tiled Diffusion, a novel approach that extends the capabilities of diffusion models to accommodate the generation of cohesive tiling patterns across various domains of image synthesis that require tiling. Our method supports a wide range of tiling scenarios, from self-tiling to complex many-to-many connections, enabling seamless integration of multiple images. Tiled Diffusion automates the tiling process, eliminating the need for manual intervention and enhancing creative possibilities in various applications, such as seamless tiling of existing images, tiled texture creation, and 360° synthesis.
Evaluating Model Perception of Color Illusions in Photorealistic Scenes
Lingjun Mao · Zineng Tang · Alane Suhr
We study the perception of color illusions by vision-language models. Color illusion, where a person's visual system perceives color differently from actual color, is well-studied in human vision. However, it remains underexplored whether vision-language models (VLMs), trained on large-scale human data, exhibit similar perceptual biases when confronted with such color illusions. We propose an automated framework for generating color illusion images, resulting in RCID (Realistic Color Illusion Dataset), a dataset of 19,000 realistic illusion images. Our experiments show that all studied VLMs exhibit perceptual biases similar to human vision. Finally, we train a model to distinguish both human perception and actual pixel differences.
Charm: The Missing Piece in ViT Fine-Tuning for Image Aesthetic Assessment
Fatemeh Behrad · Tinne Tuytelaars · Johan Wagemans
The capacity of Vision transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1%) using a lightweight ViT backbone. Code and pre-trained models will be made publicly available upon acceptance.
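A simplified sketch of mixed-resolution tokenization is shown below, using patch variance as a stand-in importance score so that detailed regions stay at high resolution while the rest is covered by tokens from a downscaled copy; Charm's actual selection rule, positional-embedding handling, and multi-scale design are more sophisticated.

```python
import torch
import torch.nn.functional as F

def charm_like_tokens(img, patch=16, keep=0.25):
    # img: (B, 3, H, W) with H and W divisible by `patch`
    hi = F.unfold(img, patch, stride=patch).transpose(1, 2)          # high-res patch tokens (B, N, 3*p*p)
    lo = F.interpolate(img, scale_factor=0.5, mode="bilinear", align_corners=False)
    lo = F.unfold(lo, patch, stride=patch).transpose(1, 2)           # coarse tokens over the downscaled image
    k = max(1, int(keep * hi.shape[1]))
    idx = hi.var(dim=-1).topk(k, dim=1).indices                      # keep the most detailed regions at high res
    hi_kept = torch.gather(hi, 1, idx.unsqueeze(-1).expand(-1, -1, hi.shape[-1]))
    return torch.cat([hi_kept, lo], dim=1)                           # shorter fixed-size sequence

tokens = charm_like_tokens(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 98, 768]) instead of 196 full-resolution tokens
```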
Morpheus: Text-Driven 3D Gaussian Splat Shape and Color Stylization
Jamie Wynn · Zawar Qureshi · Jakub Powierza · Jamie Watson · Mohamed Sayed
Exploring real-world spaces using novel-view synthesis is fun, and reimagining those worlds in a different style adds another layer of excitement. Stylized worlds can also be used for downstream tasks where there is limited training data and a need to expand a model's training distribution. Most current novel-view synthesis stylization techniques lack the ability to convincingly change geometry. This is because any geometry change requires increased style strength which is often capped for stylization stability and consistency. In this work, we propose a new autoregressive 3D Gaussian Splatting stylization method. As part of this method, we contribute a new RGBD diffusion model that allows for strength control over appearance and shape stylization. To ensure consistency across stylized frames, we use a combination of novel depth-guided cross attention, feature injection, and a Warp ControlNet conditioned on composite frames for guiding the stylization of new frames. We validate our method via extensive qualitative results, quantitative experiments, and a user study.
Optical-Flow Guided Prompt Optimization for Coherent Video Generation
Hyelin Nam · Jaemin Kim · Dohun Lee · Jong Chul Ye
While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models. Project Page: https://motionprompt.github.io/anon7165/
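The prompt-optimization step can be schematically sketched as below; the generator, optical-flow estimator, and discriminator are all stand-in lambdas so the example stays self-contained, whereas MotionPrompt applies this update with a trained flow discriminator during reverse sampling of a video diffusion model.

```python
import torch

prompt = torch.zeros(1, 4, 32, requires_grad=True)          # learnable token embeddings
opt = torch.optim.Adam([prompt], lr=1e-3)

generate = lambda p: p.mean() * torch.randn(8, 3, 16, 16)   # stand-in: frames depend on the prompt
flow_of = lambda a, b: (b - a).mean(dim=1, keepdim=True)    # stand-in optical-flow estimator
disc = lambda f: torch.sigmoid(f.mean())                    # stand-in flow-realism discriminator

frames = generate(prompt)
i, j = torch.randint(0, 7, (1,)).item(), 7                  # a random frame pair
loss = -torch.log(disc(flow_of(frames[i:i+1], frames[j:j+1])) + 1e-6)
opt.zero_grad(); loss.backward(); opt.step()                # nudge the embeddings toward natural motion
```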
OmniStyle: Filtering High Quality Style Transfer Data at Scale
Ye Wang · Ruiqi Liu · Jiang Lin · Fei Liu · Zili Yi · Yilin Wang · Rui Ma
In this paper, we introduce OmniStyle-1M, a large-scale paired style transfer dataset comprising over one million content-style-stylized image triplets across 1,000 diverse style categories, each enhanced with textual descriptions and instruction prompts. We show that OmniStyle-1M can not only improve the generalization of style transfer models through supervised training but also facilitates precise control over target stylization. Especially, to ensure the quality of the dataset, we introduce OmniFilter, a comprehensive style transfer quality assessment framework, which filters high-quality triplets based on content preservation, style consistency, and aesthetic appeal. Building upon this foundation, we propose OmniStyle, a framework based on the Diffusion Transformer (DiT) architecture designed for high-quality and efficient style transfer. This framework supports both instruction-guided and image-guided style transfer, generating high resolution outputs with exceptional detail. Extensive qualitative and quantitative evaluations demonstrate OmniStyle's superior performance compared to existing approaches, highlighting its efficiency and versatility. OmniStyle-1M and its accompanying methodologies provide a significant contribution to advancing high-quality style transfer, offering a valuable resource for the research community.
Pathways on the Image Manifold: Image Editing via Video Generation
Noam Rotstein · Gal Yona · Daniel Silver · Roy Velich · David Bensaid · Ron Kimmel
Generative image editing has recently seen substantial progress with the emergence of image diffusion models, though critical challenges remain. Models tend to lack fidelity, altering essential aspects of original images and still struggling to accurately follow complex edit instructions. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing benchmarks, demonstrating significant improvements in both edit accuracy and image preservation.
PhyS-EdiT: Physics-aware Semantic Image Editing with Text Description
Ziqi Cai · Shuchen Weng · Yifei Xia · Boxin Shi
Achieving joint control over material properties, lighting, and high-level semantics in images is essential for applications in digital media, advertising, and interactive design. Existing methods often isolate these properties, lacking a cohesive approach to manipulating materials, lighting, and semantics simultaneously. We introduce PhyS-EdiT, a novel diffusion-based model that enables precise control over four critical material properties: roughness, metallicity, albedo, and transparency while integrating lighting and semantic adjustments within a single framework. To facilitate this disentangled control, we present PR-TIPS, a large and diverse synthetic dataset designed to improve the disentanglement of material and lighting effects. PhyS-EdiT incorporates a dual-network architecture and robust training strategies to balance low-level physical realism with high-level semantic coherence, supporting localized and continuous property adjustments. Extensive experiments demonstrate the superiority of PhyS-EdiT in editing both synthetic and real-world images, achieving state-of-the-art performance on material, lighting, and semantic editing tasks.
Stable Flow: Vital Layers for Training-Free Image Editing
Omri Avrahami · Or Patashnik · Ohad Fried · Egor Nemchinov · Kfir Aberman · Dani Lischinski · Daniel Cohen-Or
Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify ``vital layers'' within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications.
Improving Editability in Image Generation with Layer-wise Memory
Daneul Kim · Jaeah Lee · Jaesik Park
Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing, especially with maintaining previous edits while naturally adapting new objects into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements, and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance that leverages memorized latents to maintain scene coherence and Multi-Query Disentanglement in cross-attention that ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.
EditAR: Unified Conditional Generation with Autoregressive Models
Jiteng Mu · Nuno Vasconcelos · Xiaolong Wang
Recent progress in controllable image generation and editing is largely driven by diffusion-based methods. Although diffusion models perform exceptionally well in specific tasks with tailored designs, establishing a unified model is still challenging. In contrast, autoregressive models inherently feature a unified tokenized representation, which simplifies the creation of a single foundational model for various tasks. In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, segmentation-to-image. The model takes both images and instructions as inputs, and predicts the edited image tokens in a vanilla next-token paradigm. To enhance the text-to-image alignment, we further propose to distill the knowledge from foundation models into the autoregressive modeling process. We evaluate its effectiveness across diverse tasks on established benchmarks, showing performance competitive with various state-of-the-art task-specific methods.
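The "vanilla next-token paradigm" mentioned above amounts to standard autoregressive decoding over a unified token sequence; the sketch below illustrates that loop with assumed `tokenize_image`, `detokenize_image`, and `model` interfaces rather than EditAR's actual code.

import torch

@torch.no_grad()
def autoregressive_edit(model, tokenize_image, detokenize_image, image, instruction_ids, num_image_tokens=1024):
    # Place the source-image tokens and the instruction tokens in the context.
    context = torch.cat([tokenize_image(image), instruction_ids], dim=-1)   # shape (1, L)
    generated = []
    for _ in range(num_image_tokens):
        logits = model(torch.cat([context] + generated, dim=-1))[:, -1]     # next-token logits
        next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1)    # sample one image token
        generated.append(next_token)
    return detokenize_image(torch.cat(generated, dim=-1))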
Zero-Shot Styled Text Image Generation, but Make It Autoregressive
Vittorio Pippi · Fabio Quattrini · Silvia Cascianelli · Alessio Tonioni · Rita Cucchiara
Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities, which have developed several solutions, either GAN- or diffusion-based, that achieved promising results. Nonetheless, these strategies fail to generalize to novel styles and have technical constraints, particularly in terms of maximum output length and training efficiency. To overcome these limitations, in this work, we propose a novel framework for text-image generation, dubbed Emuru. Our approach leverages a powerful text-image representation model (a variational autoencoder) combined with an autoregressive Transformer. Our approach enables the generation of styled text images conditioned on textual content and style examples, such as specific fonts or handwriting styles. We train our model solely on a diverse, synthetic dataset of English text rendered in over 100,000 typewritten and calligraphy fonts, which gives it the capability to reproduce unseen styles (both fonts and users' handwriting) in zero-shot. To the best of our knowledge, Emuru is the first autoregressive model for HTG, and the first designed specifically for generalization to novel styles. Moreover, our model generates images without background artifacts, which are easier to use for downstream applications. Extensive evaluation on both typewritten and handwritten, any-length text image generation scenarios demonstrates the effectiveness of our approach.
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
Yu Yuan · Xijun Wang · Yichen Sheng · Prateek Chennuri · Xingguang Zhang · Stanley H. Chan
Image generation today can produce somewhat realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting, such as creating different fields of view using a 24mm lens versus a 70mm lens, the generator will not be able to interpret and generate scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue of bridging the gap between data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions for different camera settings. Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.
Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll and makes it difficult to achieve a single image that captures everything a user wants. In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images, in essence forming a Generative Photomontage. Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. Our method faithfully preserves the user-selected regions while compositing them harmoniously. We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines.
Multi-party Collaborative Attention Control for Image Customization
Han Yang · Chuanguang Yang · Qiuli Wang · Zhulin An · Weilun Feng · Libo Huang · Yongjun Xu
The rapid development of diffusion models has fueled a growing demand for customized image generation. However, current customization methods face several limitations: 1) typically accept either image or text conditions alone; 2) customization in complex visual scenarios often leads to subject leakage or confusion; 3) image-conditioned outputs tend to suffer from inconsistent backgrounds; and 4) high computational costs. To address these issues, this paper introduces Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free approach that enables high-quality image customization under both text and complex visual conditions. Specifically, MCA-Ctrl leverages two key operations within the self-attention layer to coordinate multiple parallel diffusion processes and guide the target image generation. This approach allows MCA-Ctrl to capture the content and appearance of specific subjects while maintaining semantic consistency with the conditional input. Additionally, to mitigate subject leakage and confusion issues common in complex visual scenarios, we introduce a Subject Localization Module that extracts precise subject and editable image layers based on user instructions. Extensive quantitative and human evaluation experiments demonstrate that MCA-Ctrl outperforms previous methods in zero-shot image customization, effectively addressing the aforementioned issues.
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
Yifan Pu · Yiming Zhao · Zhicong Tang · Ruihong Yin · Haoxing Ye · Yuhui Yuan · Dong Chen · Jianmin Bao · Sirui Zhang · Yanbin Wang · Lin Liang · Lijuan Wang · Ji Li · Xiu Li · Zhouhui Lian · Gao Huang · Baining Guo
Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, which is in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation
Lunhao Duan · Shanshan Zhao · Wenjun Yan · Yinglun Li · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang · Mingming Gong · Gui-Song Xia
Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the need for multiple specialized models. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding. Experimental results across a variety of tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based image synthesis, demonstrate the effectiveness of our UNIC-Adapter in unified controllable image generation.
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang · Dacheng Yin · Yizhou Zhou · Fengyun Rao · Wei Zhai · Yang Cao · Zheng-Jun Zha
Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods suffer from loss of image information during the understanding task, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike the discretization line of methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone model by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone model's hidden representation of the image is not limited to the last denoising step. To successfully train our method, we also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals. Extensive evaluations on 18 image understanding benchmarks show that MMAR significantly outperforms most of the existing joint multi-modal models, surpassing the method that employs a pre-trained CLIP vision encoder. Meanwhile, MMAR is able to generate high-quality images. We also show that our method is scalable with larger data and model size.
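The sketch below illustrates the general shape of a light-weight diffusion head that models a continuous image-token distribution conditioned on the backbone's hidden state; the MLP layout and the simple linear noising schedule are illustrative assumptions, not MMAR's actual design.

import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    def __init__(self, token_dim=16, hidden_dim=1024, width=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + hidden_dim + 1, width), nn.SiLU(),
            nn.Linear(width, width), nn.SiLU(),
            nn.Linear(width, token_dim),
        )

    def forward(self, noisy_token, t, backbone_state):
        # Predict the noise added to a continuous image token, conditioned on the
        # auto-regressive backbone's hidden state for that position and the timestep t.
        return self.net(torch.cat([noisy_token, backbone_state, t[:, None]], dim=-1))

def diffusion_head_loss(head, clean_token, backbone_state):
    t = torch.rand(clean_token.shape[0], device=clean_token.device)
    noise = torch.randn_like(clean_token)
    noisy = (1 - t)[:, None] * clean_token + t[:, None] * noise   # assumed linear-noising schedule
    return nn.functional.mse_loss(head(noisy, t, backbone_state), noise)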
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Chaehun Shin · Jooyoung Choi · Heeseung Kim · Sungroh Yoon
Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications.
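The diptych arrangement itself is simple to reproduce: place the reference in the left panel, mask the right panel, and run any text-conditioned inpainting model. In the sketch below, `inpaint` stands in for such a model and its keyword arguments are assumptions, not the authors' interface.

from PIL import Image

def diptych_prompt(reference, prompt, inpaint, size=1024):
    ref = reference.convert("RGB").resize((size, size))
    diptych = Image.new("RGB", (2 * size, size))
    diptych.paste(ref, (0, 0))                       # left panel: reference subject
    mask = Image.new("L", (2 * size, size), 0)
    mask.paste(255, (size, 0, 2 * size, size))       # right panel: region to be inpainted
    full_prompt = f"A diptych with two side-by-side images of the same subject. Right: {prompt}"
    result = inpaint(image=diptych, mask=mask, prompt=full_prompt)
    return result.crop((size, 0, 2 * size, size))    # keep only the generated right panel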
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Jierun Chen · Dongting Hu · Xijie Huang · Huseyin Coskun · Arpit Sahni · Aarush Gupta · Anujraaj Goyal "argo" · Dishani Lahiri · Rajesh Singh · Yerlan Idelbayev · Junli Cao · Yanyu Li · Kwang-Ting Cheng · Mingming Gong · S.-H. Gary Chan · Sergey Tulyakov · Anil Kag · Yanwu Xu · Jian Ren
Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable a few-step generation by integrating adversarial guidance with knowledge distillation. Our model, for the first time, demonstrates the generation of 1024x1024 px images on a mobile device in 1.2 to 2.3 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model with merely 379M parameters surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
Runtao Liu · Haoyu Wu · Zheng Ziqiang · Chen Wei · Yingqing He · Renjie Pi · Qifeng Chen
Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment on pre-trained models. Although Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline by making several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected.
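A simplified view of the pair-construction step described above: score each sampled video on visual quality and text-video alignment, combine the two into a single scalar, and pair the best- and worst-scoring sample per prompt. The scoring callables and the linear weighting are placeholders, not the paper's exact OmniScore definition.

def omni_score(video, prompt, quality_fn, alignment_fn, w=0.5):
    # Combine visual quality and text-video alignment into one preference score.
    return w * quality_fn(video) + (1.0 - w) * alignment_fn(video, prompt)

def build_preference_pairs(samples_per_prompt, quality_fn, alignment_fn):
    pairs = []
    for prompt, videos in samples_per_prompt.items():
        scored = sorted(videos, key=lambda v: omni_score(v, prompt, quality_fn, alignment_fn))
        chosen, rejected = scored[-1], scored[0]
        # Score margin can serve as a per-pair re-weighting signal during DPO training.
        margin = (omni_score(chosen, prompt, quality_fn, alignment_fn)
                  - omni_score(rejected, prompt, quality_fn, alignment_fn))
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected, "weight": margin})
    return pairs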
Personalized Preference Fine-tuning of Diffusion Models
Meihua Dang · Anikait Singh · Linqi Zhou · Stefano Ermon · Jiaming Song
RLHF techniques like DPO can significantly improve the generation quality of text-to-image diffusion models. However, these methods optimize for a single reward that aligns model generation with population-level preferences, neglecting the nuances of individual users' beliefs or values. This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users. Specifically, our approach (1) leverages a vision-language model (VLM) to extract personal preference embeddings from a small set of pairwise preference examples, and then (2) incorporates the embeddings into diffusion models through cross attention. Conditioning on user embeddings, the text-to-image models are fine-tuned with the DPO objective, simultaneously optimizing for alignment with the preferences of multiple users. Empirical results demonstrate that our method effectively optimizes for multiple reward functions and can interpolate between them during inference. In real-world user scenarios, with as few as four preference examples from a new user, our approach achieves an average win rate of 76\% over Stable Cascade, generating images that more accurately reflect specific user preferences.
Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps
Jeeyung Kim · Erfan Esmaeili Fakhabi · Qiang Qiu
In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions it attends to. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt “a black car and a white clock”, the cross-attention maps for “black” and “car” should focus on overlapping regions to depict a black car, while “car” and “clock” should not. Incorrect overlapping in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes the following key observations while investigating this issue in existing text-to-image models: (1) the similarity in text embeddings between different tokens---used as conditioning inputs---can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture syntactic relations already present within text attention maps. As a result, such syntactic relationships can be overlooked in the cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.
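One way to picture the proposed transfer is as a test-time loss that pushes the pairwise spatial overlap of cross-attention maps toward the relation strengths read off the text encoder's self-attention; the shapes and the mean-squared objective below are assumptions for illustration, not the paper's exact formulation.

import torch

def syntactic_alignment_loss(cross_attn, text_self_attn):
    """cross_attn: (T, H, W) per-token cross-attention maps;
    text_self_attn: (T, T) text self-attention, used as target relation strengths."""
    maps = cross_attn.flatten(1)                                   # (T, H*W)
    maps = maps / (maps.sum(dim=-1, keepdim=True) + 1e-8)          # normalize each spatial map
    overlap = maps @ maps.t()                                      # (T, T) pairwise spatial overlap
    target = text_self_attn / (text_self_attn.sum(dim=-1, keepdim=True) + 1e-8)
    return torch.nn.functional.mse_loss(overlap, target)

# At each denoising step, the latent (or the attention inputs) could be nudged by a few
# gradient steps on this loss before sampling continues.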
VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
SeungJu Cha · Kwanyoung Lee · Ye-Chan Kim · Hyunwoo Oh · Dong-Jin Kim
Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.
Learning Visual Generative Priors without Text
Shuailei Ma · Kecheng Zheng · Ying Wei · Wei Wu · Fan Lu · Yifei Zhang · Chen-Wei Xie · Biao Gong · Jiapeng Zhu · Yujun Shen
Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant vision tasks, like image-to-3D and image-to-video. The code and the models will be made public.
Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation
Gianni Franchi · Nacim Belkhir · Dat NGUYEN · Guoxuan Xia · Andrea Pilzer
Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability. In this paper, we are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. Alongside adapting existing approaches designed to measure uncertainty in the image space, we also introduce Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method leveraging Large Vision-Language Models (LVLMs) to better address uncertainties arising from the semantics of the prompt and generated images. PUNC utilizes an LVLM to caption a generated image, and then compares the caption with the original prompt in the more semantically meaningful text space. PUNC also enables the disentanglement of both aleatoric and epistemic uncertainties via precision and recall, which image-space approaches are unable to do. Extensive experiments demonstrate that PUNC outperforms state-of-the-art uncertainty estimation techniques across various settings. Uncertainty quantification in text-to-image generation models can be applied to various applications, including bias detection, copyright protection, and OOD detection. We also introduce a comprehensive dataset of text prompts and generation pairs to foster further research in uncertainty quantification for generative models. Our findings illustrate that PUNC not only achieves competitive performance but also enables novel applications in evaluating and improving the trustworthiness of text-to-image models.
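A stripped-down version of the caption-and-compare idea reads as follows; `caption_fn` (an LVLM captioner) and `embed_fn` (a text embedder) are assumed callables, and the cosine-distance score is only a simplified proxy for the precision/recall analysis described above.

import numpy as np

def prompt_uncertainty(image, prompt, caption_fn, embed_fn):
    caption = caption_fn(image)                        # LVLM description of the generated image
    e_prompt, e_caption = embed_fn(prompt), embed_fn(caption)
    cos = float(np.dot(e_prompt, e_caption)
                / (np.linalg.norm(e_prompt) * np.linalg.norm(e_caption) + 1e-8))
    return 1.0 - cos                                   # higher value = more uncertain generation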
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Wei Chen · Lin Li · Yongqi Yang · Bin Wen · Fan Yang · Tingting Gao · Yu Wu · Long Chen
Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced pre-trained models to ensure the development of sentences, consistency of inserted images, and semantic alignment between them. Various quality evaluation metrics are designed to prove the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate CoMM's effectiveness in significantly enhancing the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior multimodal in-context learning and understanding ability. The dataset and codes will be released.
PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering
Yifan Gao · Zihang Lin · Chuanbin Liu · Min Zhou · Tiezheng Ge · Bo Zheng · Hongtao Xie
Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers. Creating such posters using modern image generation methods is valuable, while the main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters. In this work, we identify the key to precise text rendering as constructing a character-discriminative visual feature as a control signal. Based on this insight, we propose a robust character-wise representation as control and we develop TextRenderNet, which achieves a high text rendering accuracy of over 90\%. Another challenge in poster generation is maintaining the fidelity of user-specific products. We address this by introducing SceneGenNet, an inpainting-based model, and propose subject fidelity feedback learning to further enhance fidelity. Based on TextRenderNet and SceneGenNet, we present PosterMaker, an end-to-end generation framework. To optimize PosterMaker efficiently, we implement a two-stage training strategy that decouples text rendering and background generation learning. Experimental results show that PosterMaker outperforms existing baselines by a remarkable margin, which demonstrates its effectiveness.
Multitwine: Multi-Object Compositing with Text and Layout Control
Gemma Canet Tarrés · Zhe Lin · Zhifei Zhang · He Zhang · Andrew Gilbert · John Collomosse · Soo Ye Kim
We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like 'taking a selfie', our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
Sankalp Sinha · Mohammad Sadil Khan · Muhammad Usama · Shino Sam · Didier Stricker · Sk Aziz Ali · Muhammad Zeshan Afzal
Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of the existing datasets. To address this, we introduce MARVEL-$40$M+, an extensive dataset with $40$ million text annotations for over $8.9$ million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed ($150$-$200$ words) to concise semantic tags ($10$-$20$ words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from source datasets into our annotation pipeline to add domain-specific information in our annotation and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline. We fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate 3D textured meshes within 15s. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of $72.41\%$ by GPT-4 and $73.40\%$ by human evaluators.
PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation
HsiaoYuan Hsu · Yuxin Peng
In poster design, content-aware layout generation is crucial for automatically arranging visual-textual elements on the given image. With limited training data, existing work has focused on image-centric enhancement. However, this neglects the diversity of layouts and fails to cope with shape-variant elements or diverse design intents in generalized settings. To this end, we propose a layout-centric approach that leverages layout knowledge implicit in large language models (LLMs) to create posters for omnifarious purposes, hence the name PosterO. Specifically, it structures layouts from datasets as trees in the SVG language by universal shape, design intent vectorization, and hierarchical node representation. Then, it applies LLMs during inference to predict new layout trees by in-context learning with intent-aligned example selection. After layout trees are generated, we can seamlessly realize them into poster designs by editing the chat with LLMs. Extensive experimental results have demonstrated that PosterO can generate visually appealing layouts for given images, achieving new state-of-the-art performance across various benchmarks. To further explore PosterO's abilities under generalized settings, we built PStylish7, the first dataset with multi-purpose posters and various-shaped elements, further offering a challenging test for advanced research. Code and dataset will be publicly available upon acceptance.
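To illustrate what a layout tree serialized in SVG might look like, here is a minimal sketch; the tag and attribute choices are assumptions rather than PosterO's exact schema.

def layout_to_svg(elements, canvas_w=512, canvas_h=768):
    """elements: list of dicts like {"role": "title", "x": 20, "y": 30, "w": 400, "h": 60}."""
    nodes = [
        f'  <rect class="{e["role"]}" x="{e["x"]}" y="{e["y"]}" width="{e["w"]}" height="{e["h"]}"/>'
        for e in elements
    ]
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{canvas_w}" height="{canvas_h}">\n'
        + "\n".join(nodes)
        + "\n</svg>"
    )

# Example call: layout_to_svg([{"role": "title", "x": 40, "y": 32, "w": 430, "h": 72}])
# The resulting string can be placed in an LLM prompt as an in-context layout example.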
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
Jiawei Lin · Shizhao Sun · Danqing Huang · Ting Liu · Ji Li · Jiang Bian
In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they only focus on certain subtasks and are far from achieving the design composition task; they do not consider the hierarchical information of graphic designs during the generation process. To tackle these issues, we introduce the layered design principle into Large Multimodal Models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their contents. Based on the planning results, it subsequently predicts element attributes that control the design composition in a layer-wise manner, and includes the rendered image of previously generated layers into the context. With this insightful design, LaDeCo decomposes the difficult task into smaller manageable steps, making the generation process smoother and clearer. The experimental results demonstrate the effectiveness of LaDeCo in design composition. Furthermore, we show that LaDeCo enables some interesting applications in graphic design, such as resolution adjustment, element filling, design variation, etc. In addition, it even outperforms the specialized models in some design subtasks without any task-specific training.
AIpparel: A Multimodal Foundation Model for Digital Garments
Kiyohiro Nakayama · Jan Ackermann · Timur Levent Kesdogan · Yang Zheng · Maria Korosteleva · Olga Sorkine-Hornung · Leonidas Guibas · Guandao Yang · Gordon Wetzstein
Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LLMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and it enables novel multimodal garment generation applications such as interactive garment editing.
ChatHuman: Chatting about 3D Humans with Tools
Jing Lin · Yao Feng · Weiyang Liu · Michael J. Black
Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that combines and integrates the skills of these specialized methods. ChatHuman functions as an assistant proficient in utilizing, analyzing, and interacting with tools specific to 3D human tasks, adeptly discussing and resolving related challenges. Built on a Large Language Model (LLM) framework, ChatHuman is trained to autonomously select, apply, and interpret a diverse set of tools in response to user inputs. Adapting LLMs to 3D human tasks presents challenges, including the need for domain-specific knowledge and the ability to interpret complex 3D outputs. The innovative features of ChatHuman include leveraging academic publications to instruct the LLM how to use the tools, employing a retrieval-augmented generation model to create in-context learning examples for managing new tools, and effectively discriminating and integrating tool results to enhance tasks related to 3D humans by transforming specialized 3D outputs into comprehensible formats. Our experiments demonstrate that ChatHuman surpasses existing models in both tool selection accuracy and overall performance across various 3D human tasks, and it supports interactive chatting with users. ChatHuman represents a significant step toward consolidating diverse analytical methods into a unified, robust system for 3D human tasks.
Interpretable Generative Models through Post-hoc Concept Bottlenecks
Akshay R. Kulkarni · Ge Yan · Chung-En Sun · Tuomas Oikarinen · Tsui-Wei Weng
Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, the existing approach to design interpretable generative models based on CBMs is not efficient and scalable, as it requires expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc interpretability and we name them concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our approach enables efficient and scalable training by using generated images and our method can work with minimal to no concept supervision. Our proposed methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average $\sim$25\%) over the prior work, while being 4-15$\times$ faster to train. Finally, we also perform a large-scale user study to validate the interpretability and steerability of our methods.
Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization
lingyun zhang · Yu Xie · Yanwei Fu · Ping Chen
As large-scale diffusion models continue to advance, they excel at producing high-quality images but often generate unwanted content, such as sexually explicit or violent content. Existing methods for concept removal generally guide the image generation process but can unintentionally modify unrelated regions, leading to inconsistencies with the original model. We propose a novel approach for targeted concept replacing in diffusion models, enabling specific concepts to be removed without affecting non-target areas. Our method introduces a dedicated concept localizer for precisely identifying the target concept during the denoising process, trained with few-shot learning to require minimal labeled data. Within the identified region, we introduce a training-free Dual Prompts Cross-Attention (DPCA) module to substitute the target concept, ensuring minimal disruption to surrounding content. We evaluate our method on concept localization precision and replacement efficiency. Experimental results demonstrate that our method achieves superior precision in localizing target concepts and performs coherent concept replacement with minimal impact on non-target areas, outperforming existing approaches.
Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models
Chen Chen · Daochang Liu · Mubarak Shah · Chang Xu
Text-to-image diffusion models have demonstrated remarkable capabilities in creating images highly aligned with user prompts, yet their proclivity for memorizing training set images has sparked concerns about the originality of the generated images and privacy issues, potentially leading to legal complications for both model owners and users, particularly when the memorized images contain proprietary content. Although methods to mitigate these issues have been suggested, enhancing privacy often results in a significant decrease in the utility of the outputs, as indicated by text-alignment scores. To bridge the research gap, we introduce a novel method, PRSS, which refines the classifier-free guidance approach in diffusion models by integrating prompt re-anchoring (PR) to improve privacy and incorporating semantic prompt search (SS) to enhance utility. Extensive experiments across various privacy levels demonstrate that our approach consistently improves the privacy-utility trade-off, establishing a new state-of-the-art.
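For orientation, the sketch below shows where such a modification would sit in standard classifier-free guidance: the conditional branch can be replaced or mixed with a re-anchored prompt embedding before the guided noise is formed. This is a generic CFG skeleton under assumed `eps_model` and embedding inputs, not the PRSS algorithm itself.

import torch

def guided_noise(eps_model, x_t, t, cond_emb, uncond_emb, anchor_emb=None, scale=7.5, mix=0.5):
    eps_uncond = eps_model(x_t, t, uncond_emb)
    # Optionally mix the original prompt embedding with a re-anchored one (assumption).
    cond = cond_emb if anchor_emb is None else mix * cond_emb + (1 - mix) * anchor_emb
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)   # classifier-free guidance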
Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability
Yingdong Shi · Changming Li · Yifan Wang · Yongxiang Zhao · Anqi Pang · Sibei Yang · Jingyi Yu · Kan Ren
Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.
Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models
Sangwon Jang · June Suk Choi · Jaehyeong Jo · Kimin Lee · Sung Ju Hwang
Text-to-image diffusion models have achieved remarkable success in generating high-quality content from text prompts. However, their reliance on publicly available data and the growing trend of data sharing for fine-tuning make these models particularly vulnerable to data poisoning attacks. In this work, we introduce the Silent Branding Attack, a novel data poisoning method that manipulates text-to-image diffusion models to generate images containing specific brand logos or symbols without any text triggers. We find that when certain visual patterns appear repeatedly in the training data, the model learns to reproduce them naturally in its outputs, even without being mentioned in the prompt. Leveraging this, we develop an automated data poisoning algorithm that unobtrusively injects logos into original images, ensuring they blend naturally and remain undetected. Models trained on this poisoned dataset generate images containing logos without degrading image quality or text alignment. We experimentally validate our silent branding attack across two realistic settings on large-scale high-quality image datasets and style personalization datasets, achieving high success rates even without a specific text trigger. Human evaluation and quantitative metrics including logo detection show that our method can stealthily embed logos.
SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
Zilan Wang · Junfeng Guo · Jiacheng Zhu · Yiming Li · Heng Huang · Muhao Chen · Zhengzhong Tu
Recent advances in large-scale text-to-image (T2I) diffusion models have enabled a variety of downstream applications, including style customization, subject-driven personalization, and conditional generation. As T2I models require extensive data and computational resources for training, they constitute highly valued intellectual property (IP) for their legitimate owners, yet this also makes them enticing targets for unauthorized fine-tuning by adversaries seeking to leverage these models for customized, usually profitable applications. Existing IP protection methods for diffusion models generally involve embedding watermark patterns and then verifying ownership through examination of generated outputs, or inspecting the model’s feature space. However, these techniques are inherently ineffective in practical scenarios when the watermarked model undergoes fine-tuning, and the feature space is inaccessible during verification (\ie, black-box setting). The model is prone to forgetting the previously learned watermark knowledge when it adapts to a new task. To address this challenge, we propose SleeperMark, a novel framework designed to embed resilient watermarks into T2I diffusion models. SleeperMark explicitly guides the model to disentangle the watermark information from the semantic concepts it learns, allowing the model to retain the embedded watermark while continuing to be fine-tuned on new downstream tasks. Our extensive experiments demonstrate the effectiveness of SleeperMark across various types of diffusion models, including latent diffusion models (\eg, Stable Diffusion) and pixel diffusion models (\eg, DeepFloyd-IF), showing robustness against downstream fine-tuning and various attacks at both the image and model levels, with minimal impact on the model’s generative capability.
Watermarking One for All: A Robust Watermarking Scheme Against Partial Image Theft
Gaozhi Liu · Silu Cao · Zhenxing Qian · Xinpeng Zhang · Sheng Li · Wanli Peng
The proliferation of digital images on the Internet has provided unprecedented convenience, but also poses significant risks of malicious theft and misuse. Digital watermarking has long been researched as an effective tool for copyright protection. However, it often falls short when addressing partial image theft, a common yet little-researched issue in practical applications. Most existing schemes typically require the entire image as input to extract watermarks. However, in practice, malicious users often steal only a portion of the image to create new content. The stolen portion can have arbitrary shape or content, may be fused with a new background, and may have undergone geometric transformations, making it challenging for current methods to extract the watermark correctly. To address the issues above, we propose WOFA (Watermarking One for All), a robust watermarking scheme against partial image theft. First of all, we define the entire process of partial image theft and construct a dataset accordingly. To gain robustness against partial image theft, we then design a comprehensive distortion layer that incorporates the process of partial image theft and several common distortions in the channel. For easier network convergence, we employ a multi-level network structure on the basis of the commonly used embedder-distortion layer-extractor architecture and adopt a progressive training strategy. Extensive experiments demonstrate our superior performance in the scenario of partial image theft, offering a more reliable solution for protecting digital images against unauthorized use in practice.
Enhancing Facial Privacy Protection via Weakening Diffusion Purification
Ali Salar · Qing Liu · Yingli Tian · Guoying Zhao
The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face recognition (AFR) systems for mass surveillance. Hence, protecting facial privacy against unauthorized AFR systems is essential. Inspired by the generation capability of the emerging diffusion models, recent methods employ diffusion models to generate adversarial face images for privacy protection. However, they suffer from the diffusion purification effect, leading to a low protection success rate (PSR). In this paper, we first propose learning unconditional embeddings to increase the learning capacity for adversarial modifications and then use them to guide the modification of the adversarial latent code to weaken the diffusion purification effect. Moreover, we integrate an identity-preserving structure to maintain structural consistency between the original and generated images, allowing human observers to recognize the generated image as having the same identity as the original. Extensive experiments conducted on two public datasets, i.e., CelebA-HQ and LADN, demonstrate the superiority of our approach. The protected faces generated by our method outperform those produced by existing facial privacy protection approaches in terms of transferability and natural appearance. The code is available at https://anonymous.4open.science/r/CVPR_2025-EACE.
Community Forensics: Using Thousands of Generators to Train Fake Image Detectors
Jeongsoo Park · Andrew Owens
One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models. We argue that the limited diversity of the training data is a major obstacle to addressing this problem, and we propose a new dataset that is significantly larger and more diverse than prior works. As part of creating this dataset, we systematically download thousands of text-to-image latent diffusion models and sample images from them. We also collect images from dozens of popular open source and commercial models. The resulting dataset contains 2.7M images that have been sampled from 4803 different models. These images collectively capture a wide range of scene content, generator architectures, and image processing settings. Using this dataset, we study the generalization abilities of fake image detectors. Our experiments suggest that detection performance improves as the number of models in the training set increases, even when these models have similar architectures. We also find that increasing the diversity of the models improves detection performance, and that our trained detectors generalize better than those trained on other datasets.
Beyond Generation: A Diffusion-based Low-level Feature Extractor for Detecting AI-generated Images
Nan Zhong · Haoyu Chen · Yiran Xu · Zhenxing Qian · Xinpeng Zhang
The prevalence of AI-generated images has evoked concerns regarding the potential misuse of image generation technologies. In response, numerous detection methods aim to identify AI-generated images by analyzing generative artifacts. Unfortunately, most detectors quickly become obsolete with the development of generative models. In this paper, we first design a low-level feature extractor that transforms spatial images into feature space, where different source images exhibit distinct distributions. The pretext task for the feature extractor is to distinguish between images that differ only at the pixel level. This image set comprises the original image as well as versions that have been subjected to varying levels of noise and subsequently denoised using a pre-trained diffusion model. We employ the diffusion model as a denoising tool rather than an image generation tool. Then, we frame the AI-generated image detection task as a one-class classification problem. We estimate the low-level intrinsic feature distribution of real photographic images and identify features that deviate from this distribution as indicators of AI-generated images. We evaluate our method on images from over 20 different generative models, including the GenImage and DRCT-2M datasets. Extensive experiments demonstrate its effectiveness on AI-generated images produced not only by diffusion models but also by GANs, flow-based models, and their variants.
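The one-class formulation can be pictured as fitting a simple density to the low-level features of real photographs and flagging outliers; the Gaussian/Mahalanobis choice below is an illustrative assumption, with the feature extractor treated as a given callable.

import numpy as np

def fit_real_distribution(real_features):
    # real_features: (N, D) low-level features extracted from real photographs.
    mu = real_features.mean(axis=0)
    cov = np.cov(real_features, rowvar=False) + 1e-6 * np.eye(real_features.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(feature, mu, cov_inv):
    d = feature - mu
    return float(d @ cov_inv @ d)   # squared Mahalanobis distance; large = likely AI-generated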
Unveiling Differences in Generative Models: A Scalable Differential Clustering Approach
Jingwei Zhang · Mohammad Jalali · Cheuk Ting Li · Farzan Farnia
An interpretable comparison of generative models requires the identification of sample types produced differently by each of the involved models. While several quantitative scores have been proposed in the literature to rank different generative models, such score-based evaluations do not reveal the nuanced differences between the generative models in capturing various sample types. In this work, we attempt to solve a differential clustering problem to detect sample types expressed differently by two generative models. To solve the differential clustering problem, we propose a method called Fourier-based Identification of Novel Clusters (FINC) to identify modes produced by a generative model with a higher frequency in comparison to a reference distribution. FINC provides a scalable stochastic algorithm based on random Fourier features to estimate the eigenspace of kernel covariance matrices of two generative models and utilize the principal eigendirections to detect the sample types present more dominantly in each model. We demonstrate the application of the FINC method to large-scale computer vision datasets and generative model frameworks. Our numerical results suggest the scalability of the developed Fourier-based method in highlighting the sample types produced with different frequencies by widely-used generative models.
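A bare-bones illustration of the random-Fourier-feature machinery: approximate a Gaussian kernel with random features, form each model's feature covariance, and inspect the top eigendirections of their difference to find sample types one model produces more often. The kernel bandwidth and feature count below are arbitrary assumptions, and this omits the full FINC estimation procedure.

import numpy as np

def rff_map(x, n_features=2000, sigma=10.0, seed=0):
    # Random Fourier features approximating a Gaussian kernel with bandwidth sigma.
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / sigma, size=(x.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

def differential_directions(samples_a, samples_b, top_k=5):
    phi_a, phi_b = rff_map(samples_a), rff_map(samples_b)
    cov_a = phi_a.T @ phi_a / len(phi_a)
    cov_b = phi_b.T @ phi_b / len(phi_b)
    vals, vecs = np.linalg.eigh(cov_a - cov_b)   # directions where model A has more probability mass
    return vals[-top_k:][::-1], vecs[:, -top_k:][:, ::-1]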
The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
Otto Brookes · Maksim Kukushkin · Majid Mirmehdi · Colleen Stephens · Paula Dieguez · Thurston Cleveland Hicks · Sorrel CZ Jones · Kevin C. Lee · Maureen S. McCarthy · Amelia C. Meier · NORMAND E. · Erin G. Wessling · Roman M. Wittig · Kevin Langergraber · Klaus Zuberbühler · Lukas Boesch · Thomas Schmid · Mimi Arandjelovic · Hjalmar S. Kühl · Tilo Burghardt
Computer vision analysis of camera trap video footage is essential for wildlife conservation, as captured behaviours offer some of the earliest indicators of changes in population health. Recently, several high-impact animal behaviour datasets and methods have been introduced to encourage their use; however, the role of behaviour-correlated background information and its significant effect on out-of-distribution generalisation remain unexplored. In response, we present the PanAf-FGBG dataset, featuring 20 hours of wild chimpanzee behaviours, recorded at over 350 individual camera locations. Uniquely, it pairs every video containing a chimpanzee (referred to as a foreground video) with a corresponding background video (with no chimpanzee) from the same camera location. We present two views of the dataset: one with overlapping camera locations and one with disjoint locations. This setup enables, for the first time, direct evaluation of in-distribution and out-of-distribution conditions, and for the impact of backgrounds on behaviour recognition models to be quantified. All clips come with rich behavioural annotations and metadata including unique camera IDs and detailed textual scene descriptions. Additionally, we establish several baselines and present a highly effective latent-space normalisation technique that boosts out-of-distribution performance by +5.42\% mAP for convolutional and +3.75\% mAP for transformer-based models. Finally, we provide an in-depth analysis of the role of backgrounds in out-of-distribution behaviour recognition, including the so far unexplored impact of background durations (i.e., the count of background frames within foreground videos). The full dataset, baseline models, and weights will be available at `anonymous'.
MIRE: Matched Implicit Neural Representations
Dhananjaya Jayasundara · Heng Zhao · Demetrio Labate · Vishal M. Patel
Implicit Neural Representations (INRs) are continuous function learners for conventional digital signal representations. With the aid of positional embeddings and/or exhaustively fine-tuned activation functions, INRs have surpassed many limitations of traditional discrete representations. However, existing works only find a continuous representation for the digital signal by solely using a single, fixed activation function throughout the INR, and it has not yet been explored to match the INR to the given signal. As current INRs are not matched to the signal being represented by the INR, we hypothesize that this approach could restrict the representation power and generalization capabilities of INRs, limiting their broader applicability. A way to match the INR to the signal being represented is through matching the activation of each layer in the sense of minimizing the mean squared loss. In this paper, we introduce MIRE, a method to find the highly matched activation function for each layer in INR through dictionary learning. To showcase the effectiveness of the proposed method, we utilize a dictionary that includes seven activation atoms: Raised Cosines (RC), Root Raised Cosines (RRC), Prolate Spheroidal Wave Function (PSWF), Sinc, Gabor Wavelet, Gaussian, and Sinusoidal. Experimental results demonstrate that MIRE not only significantly improves INR performance across various tasks, such as image representation, image inpainting, 3D shape representation, novel view synthesis, super-resolution, and reliable edge detection, but also eliminates the need for the previously required exhaustive search for activation parameters, which had to be conducted even before INR training could begin.
Learning to Normalize on the SPD Manifold under Bures-Wasserstein Geometry
Rui Wang · Shaocheng Jin · Ziheng Chen · Xiaoqing Luo · Xiaojun Wu
Covariance matrices have proven highly effective across many scientific fields. Since these matrices lie on the Symmetric Positive Definite (SPD) manifold, a Riemannian space with intrinsic non-Euclidean geometry, the primary challenge in representation learning is to respect this underlying geometric structure. Drawing inspiration from the success of Euclidean deep learning, researchers have developed neural networks on the SPD manifold for more faithful covariance embedding learning. A notable advancement in this area is the implementation of Riemannian batch normalization (RBN), which has been shown to improve the performance of SPD network models. Nonetheless, the Riemannian metric underlying the existing RBN might fail to effectively deal with ill-conditioned SPD matrices (ICSM), undermining the effectiveness of RBN. In contrast, the Bures-Wasserstein metric (BWM) demonstrates superior performance under ill-conditioning. In addition, the recently introduced Generalized BWM (GBWM) parameterizes the vanilla BWM via an SPD matrix, allowing for a more nuanced representation of the varied geometries of the SPD manifold. Therefore, we propose a novel RBN algorithm based on the GBW geometry, incorporating a learnable metric parameter. Moreover, the deformation of GBWM by matrix power is also introduced to further enhance the representational capacity of GBWM-based RBN. Experimental results on different datasets validate the effectiveness of our proposed method.
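For reference, the vanilla Bures-Wasserstein distance between SPD matrices $A$ and $B$ that underlies the BWM is the standard

$$ d_{\mathrm{BW}}^{2}(A,B)=\mathrm{tr}(A)+\mathrm{tr}(B)-2\,\mathrm{tr}\big((A^{1/2}\,B\,A^{1/2})^{1/2}\big), $$

and, as stated above, the generalized variant (GBWM) additionally parameterizes this metric by an SPD matrix, which the proposed RBN treats as learnable.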
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery
Sara Al-Emadi · Yin Yang · Ferda Ofli
Object detectors have achieved remarkable performance in many applications; however, these deep learning models are typically designed under the i.i.d. assumption, meaning they are trained and evaluated on data sampled from the same (source) distribution. In real-world deployment, however, target distributions often differ from source data, leading to substantial performance degradation. Domain Generalisation (DG) seeks to bridge this gap by enabling models to generalise to Out-Of-Distribution (OOD) data without access to target distributions during training, enhancing robustness to unseen conditions. In this work, we examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts, focusing particularly on spatial domain shifts. Despite the need, a standardised benchmark dataset specifically designed for assessing object detection under realistic DG scenarios is currently lacking. To address this, we introduce RWDS, a suite of three novel DG benchmarking datasets\footnote{The datasets are available in the Supplementary Materials} that focus on humanitarian and climate change applications. These datasets allow investigation of domain shifts across (i) climate zones and (ii) various disasters and geographic regions. To our knowledge, these are the first DG benchmarking datasets tailored for object detection in real-world, high-impact contexts. We aim for these datasets to serve as valuable resources for evaluating the robustness and generalisation of future object detection models.
Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities
Michele Mazzamuto · Antonino Furnari · Yoichi Sato · Giovanni Maria Farinella
We address the challenge of unsupervised mistake detection in egocentric video of skilled human activities through the analysis of gaze signals. While traditional methods rely on manually labeled mistakes, our approach does not require mistake annotations, hence overcoming the need of domain-specific labeled data. Based on the observation that eye movements closely follow object manipulation activities, we assess to what extent eye-gaze signals can support mistake detection, proposing to identify deviations in attention patterns measured through a gaze tracker with respect to those estimated by a gaze prediction model. Since predicting gaze in video is characterized by high uncertainty, we propose a novel gaze completion task, where eye fixations are predicted from visual observations and partial gaze trajectories, and contribute a novel gaze completion approach which explicitly models correlations between gaze information and local visual tokens. Inconsistencies between predicted and observed gaze trajectories act as an indicator to identify mistakes. Experiments highlight the effectiveness of the proposed approach in different settings, with relative gains up to +14%, +11%, and +5% in EPIC-Tent, HoloAssist and IndustReal respectively, remarkably matching results of supervised approaches without seeing any labels. We further show that gaze-based analysis is particularly useful in the presence of skilled actions, low action execution confidence, and actions requiring hand-eye coordination and object manipulation skills. Our method is ranked first on the HoloAssist Mistake Detection challenge.
Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
Changchang Sun · Gaowen Liu · Charles Fleming · Yan Yan
Recently, conditional diffusion models have gained increasing attention due to their impressive results for cross-modal synthesis. Typically, existing methods aim to achieve strong alignment between the conditioning input and the generated output by training a time-conditioned U-Net augmented with a cross-attention mechanism. In this paper, we focus on the problem of generating music synchronized with rhythmic visual cues of the given dance video. Considering that bi-directional guidance is more beneficial for training a diffusion model, we propose to improve the quality of generated music and its synchronization with dance videos by adopting both positive and negative rhythmic information (PN-Diffusion) as conditions, for which dual diffusion and reverse processes are devised. Different from existing dance-to-music diffusion models, PN-Diffusion consists of a noise prediction objective for positive conditioning and an additional noise prediction objective for negative conditioning to train a sequential multi-modal U-Net structure. To ensure the accurate definition and selection of negative conditioning, we ingeniously leverage temporal correlations between music and dance videos, where positive and negative rhythmic visual cues and motion information are captured by playing dance videos forward and backward, respectively. By subjectively and objectively evaluating input-output correspondence in terms of dance-music beats and the quality of generated music, experimental results on the dance video datasets AIST++ and TikTok demonstrate the superiority of our model over SOTA dance-to-music generation models.
SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding
Mingfei Chen · Israel D. Gebru · Ishwarya Ananthabhotla · Christian Richardt · Dejan Markovic · Steven Krenn · Todd Keebler · Jacob Sandakly · Alexander Richard · Eli Shlizerman
We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint. The method learns the underlying acoustic transfer function that relates the signals acquired at the distributed microphones to the signal at the target viewpoint, using a limited number of known recordings. Unlike existing works, our method does not require constraints or prior knowledge of sound source details. Moreover, our method efficiently adapts to diverse room layouts, reference microphone configurations and unseen environments. To enable this, we introduce a visual-acoustic binding module that learns visual embeddings linked with local acoustic properties from panoramic RGB and depth data. We first leverage these embeddings to optimize the placement of reference microphones in any given scene. During synthesis, we leverage multiple embeddings extracted from reference locations to get adaptive weights for their contribution, conditioned on target viewpoint. We benchmark the task on both publicly available data and real-world settings. We demonstrate significant improvements over existing methods.
Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Sung Jin Um · Dongjin Kim · Sangmin Lee · Jung Uk Kim
The sound source localization task aims to localize the regions of sound-making objects in visual scenes. Recent methods, which rely on simple audio-visual correspondence, often struggle to accurately localize each object in complex environments, such as those with visually similar silent objects. To address these challenges, we propose a novel sound source localization framework, which incorporates detailed contextual information for fine-grained sound source localization. Our approach utilizes Multimodal Large Language Models (MLLMs) to generate detailed information through understanding audio-visual scenes. To effectively incorporate the generated detailed information, we propose two loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. By utilizing these losses, our method effectively performs precise localization through fine-grained audio-visual correspondence. Our extensive experiments on the MUSIC and VGGSound datasets demonstrate significant improvements in both single- and multi-source sound localization. Our code and generated detailed information will be made publicly available.
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
Shaofei Huang · Rui Ling · Tianrui Hui · Hongyu Li · Xu Zhou · Shifeng Zhang · Si Liu · Richang Hong · Meng Wang
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset.
Towards Open-Vocabulary Audio-Visual Event Localization
Jinxing Zhou · Dan Guo · Ruohao Guo · Yuxin Mao · Jingjing Hu · Yiran Zhong · Xiaojun Chang · Meng Wang
The Audio-Visual Event Localization (AVEL) task aims to temporally locate and classify video events that are both audible and visible. Most research in this field assumes a closed-set setting, which restricts these models' ability to handle test data containing event categories absent (unseen) during training. Recently, a few studies have explored AVEL in an open-set setting, enabling the recognition of unseen events as ``unknown'', but without providing category-specific semantics. In this paper, we advance the field by introducing the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) problem, which requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. To address this new task, we propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes (seen:unseen = 46:21), each with manual segment-level annotation. We also establish three evaluation metrics for this task. Moreover, we investigate two baseline approaches, one training-free and one based on a fine-tuning paradigm. Specifically, we utilize the unified multimodal space from the pretrained ImageBind model to extract audio, visual, and textual (event classes) features. The training-free baseline then determines predictions by comparing the consistency of audio-text and visual-text feature similarities. The fine-tuning baseline incorporates lightweight temporal layers to encode temporal relations within the audio and visual modalities, using OV-AVEBench training data for model fine-tuning. We evaluate these baselines on the proposed OV-AVEBench dataset and discuss potential directions for future work in this new field.
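The training-free baseline suggests a simple decision rule; a minimal sketch under the assumption that ImageBind-style audio, visual, and class-text embeddings have already been extracted and L2-normalized (the agreement threshold and function names are illustrative, not the paper's exact rule):

```python
import numpy as np

def predict_segment(audio_emb, visual_emb, text_embs, agree_thresh=0.2):
    """Classify one segment from audio-text and visual-text similarities.

    audio_emb, visual_emb: (D,) unit-norm segment embeddings.
    text_embs: (C, D) unit-norm embeddings of the event-class names.
    Returns the class index if both modalities consistently point to it, else None (background).
    """
    a_sim = text_embs @ audio_emb      # (C,) audio-text cosine similarities
    v_sim = text_embs @ visual_emb     # (C,) visual-text cosine similarities
    fused = (a_sim + v_sim) / 2
    cls = int(np.argmax(fused))
    consistent = np.argmax(a_sim) == np.argmax(v_sim)
    return cls if consistent and fused[cls] > agree_thresh else None
```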
Contextual AD Narration with Interleaved Multimodal Sequence
Hanlin Wang · Zhan Tong · Kecheng Zheng · Yujun Shen · Limin Wang
The Audio Description (AD) task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video content, such as movies. With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name and provide reasonable, contextual descriptions that help the audience understand the storyline of a movie. To achieve this goal, we propose to leverage pre-trained foundation models through a simple and unified framework that generates ADs from an interleaved multimodal sequence as input, termed Uni-AD. To enhance the alignment of features across various modalities at a finer granularity, we introduce a simple and lightweight module that maps video features into the textual feature space. Moreover, we propose a character-refinement module that provides more precise information by identifying the main characters who play more significant roles in the video context. With these designs, we further incorporate contextual information and a contrastive loss into our architecture to generate smoother and more contextual ADs. Experiments on the MAD-eval dataset show that Uni-AD achieves state-of-the-art performance on AD generation, demonstrating the effectiveness of our approach.
Towards Universal Soccer Video Understanding
Jiayuan Rao · Haoning Wu · Hao Jiang · Ya Zhang · Yanfeng Wang · Weidi Xie
As a globally celebrated sport, soccer has attracted widespread interest from fans all over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, and achieve state-of-the-art performance on all of them, substantially outperforming existing models and demonstrating the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.
Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
S P Sharan · Minkyu Choi · Sahil Shah · Harsh Goel · Mohammad Omama · Sandeep P. Chinchali
Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen, and CogVideoX are pushing the boundaries of synthetic video generation, with adoption seen in fields like robotics, autonomous driving, and entertainment. As these models become prevalent, various metrics and benchmarks have emerged to evaluate the quality of the generated videos. However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. Our approach first converts the prompt into a formally defined Temporal Logic (TL) specification and translates the generated video into an automaton representation. Then, it evaluates the text-to-video alignment by formally checking the video automaton against the TL specification. Furthermore, we present a dataset of temporally extended prompts to evaluate state-of-the-art video generation models against our benchmark. We find that NeuS-V correlates with human evaluations over $5 \times$ more strongly than existing metrics. Our evaluation further reveals that current video generation models perform poorly on these temporally complex prompts, highlighting the need for future work in improving text-to-video generation capabilities.
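As a toy illustration of the kind of temporal-logic check described above (not the paper's automaton-based procedure), a sequencing prompt such as "A happens, then B happens" can be verified against per-frame predicate outputs roughly as follows; the predicates are assumed to come from thresholded detectors.

```python
def satisfies_eventually_then(pred_a, pred_b):
    """Toy temporal check: does some frame satisfy A, followed later by a frame satisfying B?

    pred_a, pred_b: lists of booleans, one per frame (e.g., thresholded detector outputs).
    Loosely approximates a specification like F(A and F B) for a two-step prompt.
    """
    for t, a_holds in enumerate(pred_a):
        if a_holds and any(pred_b[t + 1:]):
            return True
    return False
```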
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
Kaiyue Sun · Kaiyi Huang · Xian Liu · Yue Wu · Zihan Xu · Zhenguo Li · Xihui Liu
Text-to-video (T2V) generative models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains unexplored. Previous text-to-video benchmarks also neglect to evaluate this important ability. In this work, we conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation. T2V-CompBench encompasses diverse aspects of compositionality, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. We further carefully design multimodal large language model (MLLM)-based, detection-based, and tracking-based evaluation metrics, which better reflect compositional text-to-video generation quality across the seven proposed categories with 1400 text prompts. The effectiveness of the proposed metrics is verified by correlation with human evaluations. We also benchmark various text-to-video generative models and conduct in-depth analysis across different models and various compositional categories. We find that compositional text-to-video generation is highly challenging for current models, and we hope our attempt can shed light on future research in this direction.
Event-Equalized Dense Video Captioning
Kangyi Wu · Pengna Li · Jingwen Fu · Yizhe Li · Yang Wu · Yuhan Liu · Jinjun Wang · Sanping Zhou
Dense video captioning aims to localize and caption all events in arbitrary untrimmed videos. Although previous methods have achieved appealing results, they still face the issue of temporal bias, i.e., models tend to focus more on events with certain temporal characteristics. Specifically, 1) the temporal distribution of events in training datasets is uneven, so models trained on these datasets pay less attention to out-of-distribution events; 2) long-duration events have more frame features than short ones and thus attract more attention. To address this, we argue that events with varying temporal characteristics should be \textbf{treated equally} when it comes to dense video captioning. Intuitively, different events tend to exhibit distinct visual differences due to varied camera views, backgrounds, or subjects. Inspired by this, we use visual features to form an approximate perception of possible events and pay equal attention to them. In this paper, we introduce a simple but effective framework, called Event-Equalized Dense Video Captioning (E$^2$DVC), to overcome the temporal bias and treat all possible events equally. Specifically, an event perception module (EPM) is proposed to perform uneven clustering on visual frame features to generate pseudo-events. We direct the model's attention to these pseudo-events through a pseudo-event initialization module (PEI). A novel event-enhanced encoder (EEE) is also devised to enhance the model's ability to explore frame-frame and frame-event relationships. Experimental results validate the effectiveness of the proposed method.
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Qiuheng Wang · Yukai Shi · Jiarong Ou · Rui Chen · Ke Lin · Jiahao Wang · Boyuan Jiang · Haotian Yang · Mingwu Zheng · Xin Tao · Fei Yang · Pengfei Wan · Di ZHANG
As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the segmented videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset.
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Fida Mohammad Thoker · Letian Jiang · Chen Zhao · Bernard Ghanem
Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, they are primarily based on reconstructing pixel-level details on natural videos which have substantial temporal correlation, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed as SMILE, by infusing both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns in the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new video self-supervised learning paradigm, capable of learning robust video representations without requiring any video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Wenyi Hong · Yean Cheng · Zhuoyi Yang · Weihan Wang · Lefan Wang · Xiaotao Gu · Shiyu Huang · Yuxiao Dong · Jie Tang
In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability — fine-grained motion comprehension — remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within a limited LLM sequence length, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression, and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs together with TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension.
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Ziyang Luo · Haoning Wu · Dongxu Li · Jing Ma · Mohan Kankanhalli · Junnan Li
Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice question answering in benchmarks such as VideoMME and LongVideoBench, which are prone to lack the depth needed to capture the complex demands of real-world users. To address this limitation—and due to the prohibitive cost and slow pace of human annotation for video tasks—we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a ``gold standard" using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective, and scalable framework for evaluating LMMs in user-centric video analysis. We will release them upon acceptance.
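For context, the standard Elo update that rating systems like the one above build on looks as follows; the abstract does not specify its modification, so this is just the textbook rule with an assumed K-factor.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update after one battle between models A and B.

    score_a: 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    Returns the updated ratings (r_a, r_b).
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # A's expected score
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```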
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Yilun Zhao · Lujing Xie · Haowei Zhang · Guo Gan · Weiyuan Chen · Yitao Long · Tongyan Hu · Zhijian Xu · Chengye Wang · Chuhan Li · Ziyao Shangguan · Yixin Liu · Zhenwen Liang · Zhiyuan Hu · Chen Zhao · Arman Cohan
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls, validating that correct answers cannot be inferred solely from textual cues or answer shortcuts. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. Our evaluation of 20 frontier models highlights a significant performance gap between the leading models and human experts. Through comprehensive error analysis, case studies, and an exploration of retrieval-augmented generation methods, we offer actionable insights to guide future advancements.
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Yunlong Tang · JunJia Guo · Hang Hua · Susan Liang · Mingqian Feng · Xinyang Li · Rui Mao · Chao Huang · Jing Bi · Zeliang Zhang · Pooyan Fazli · Chenliang Xu
The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. Our benchmark will be publicly available for evaluating more models.
VITED: Video Temporal Evidence Distillation
Yujie Lu · Yale Song · Lorenzo Torresani · William Yang Wang · Tushar Nagarajan
We investigate complex video question answering via chain-of-evidence reasoning --- identifying sequences of temporal spans from multiple relevant parts of the video, together with visual evidence within them. Existing models struggle with multi-step reasoning as they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video, with supporting evidence, that maximize the likelihood of answering a given question. We train our model (ViTED) to generate these evidence chains directly, enabling it both to localize evidence windows and to perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long-video QA benchmarks, where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
Yudong Han · Qingpei Guo · Liyuan Pan · Liu Liu · Yu Guan · Ming Yang
The challenge in LLM-based video understanding lies in preserving visual and semantic information in long videos while maintaining a memory-affordable token count. However, redundancy and correspondence in videos have hindered the performance potential of existing methods. Through statistical learning on current datasets, we observe that redundancy occurs in both repeated and answer-irrelevant frames, and the corresponding frames vary with different questions. This suggests the possibility of adopting dynamic encoding to balance detailed video information preservation with token budget reduction. To this end, we propose a dynamic cooperative network, DynFocus, for memory-efficient video encoding. Specifically, it comprises (i) a Dynamic Event Prototype Estimation (DPE) module that dynamically selects meaningful frames for question answering, and (ii) a Compact Cooperative Encoding (CCE) module that encodes meaningful frames with detailed visual appearance and the remaining frames with sketchy perception. We evaluate our method on five publicly available benchmarks, and experimental results consistently demonstrate that our method achieves competitive performance.
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
Yudi Shi · Shangzhe Di · Qirui Chen · Weidi Xie
This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks and address them with specialized vision models; the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple-choice and open-ended benchmarks.
AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction
Yuanbin Man · Ying Huang · Chengming Zhang · Bingzhe Li · Wei Niu · Miao Yin
Advancements in large language models (LLMs) have propelled video understanding tasks by integrating LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent attempts understand long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5\% improvement across multiple tasks on the LVU dataset with a GPU memory consumption reduction of up to 65\%.
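One simple way to picture query-aware memory reduction (a hypothetical sketch, not AdaCM$^2$'s actual mechanism) is to retain only the visual memory tokens most aligned with the current text query; the selection rule and budget below are assumptions.

```python
import torch
import torch.nn.functional as F

def reduce_visual_memory(memory: torch.Tensor, text_query: torch.Tensor, budget: int) -> torch.Tensor:
    """Keep the `budget` visual memory tokens most aligned with the text query.

    memory: (N, D) accumulated visual tokens; text_query: (D,) pooled query embedding.
    """
    mem = F.normalize(memory, dim=-1)
    q = F.normalize(text_query, dim=0)
    scores = mem @ q                                      # (N,) cross-modal relevance
    keep = scores.topk(min(budget, memory.shape[0])).indices
    return memory[keep]                                   # reduced memory, independent of video length
```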
Temporal Alignment-Free Video Matching for Few-shot Action Recognition
SuBeen Lee · WonJun Moon · Hyun Seok Seong · Jae-Pil Heo
Few-Shot Action Recognition (FSAR) aims to train a model with only a few labeled video instances. A key challenge in FSAR is handling divergent narrative trajectories for precise video matching. While the frame- and tuple-level alignment approaches have been promising, their methods heavily rely on pre-defined and length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds. In this work, we introduce a novel TEmporal Alignment-free Matching (TEAM) approach, which eliminates the need for temporal units in action representation and brute-force alignment during matching. Specifically, TEAM represents each video with a fixed set of pattern tokens that capture globally discriminative clues within the video instance regardless of action length or speed, ensuring its flexibility. Furthermore, TEAM is inherently efficient, using token-wise comparisons to measure similarity between videos, unlike existing methods that rely on pairwise comparisons for temporal alignment. Additionally, we propose an adaptation process that identifies and removes common information across novel classes, establishing clear boundaries even between novel categories. Extensive experiments demonstrate the effectiveness of TEAM.
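A rough sketch of what token-wise, alignment-free video comparison could look like, under the assumption that each video is summarized by a fixed set of pattern tokens; the max-then-mean matching rule is an illustrative choice, not necessarily TEAM's.

```python
import torch
import torch.nn.functional as F

def tokenwise_similarity(tokens_q: torch.Tensor, tokens_s: torch.Tensor) -> torch.Tensor:
    """Similarity between a query and a support video via their pattern tokens.

    tokens_q, tokens_s: (K, D) fixed-size sets of pattern tokens (no temporal ordering assumed).
    Each query token is matched to its most similar support token; scores are averaged.
    """
    q = F.normalize(tokens_q, dim=-1)
    s = F.normalize(tokens_s, dim=-1)
    sim = q @ s.T                          # (K, K) cosine similarities, no alignment search
    return sim.max(dim=1).values.mean()    # token-wise matching instead of pairwise temporal alignment
```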
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad · Vibhav Vineet · Yogesh S. Rawat
Despite advancements in multimodal large language models (MLLMs), current approaches struggle in medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling, while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by dedicated memory banks, which enable our proposed Hierarchical Querying transformer (HierarQ) to effectively capture short- and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ’s state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis. All code will be made available upon acceptance.
Video Language Model Pretraining with Spatio-temporal Masking
Yue Wu · Zhaobo Qi · Junshu Sun · Yaowei Wang · Qingming Huang · Shuhui Wang
The development of video-language self-supervised models based on mask learning has significantly advanced downstream video tasks. These models leverage masked reconstruction to facilitate joint learning of visual and linguistic information. However, a recent study reveals that reconstructing image features yields superior downstream performance compared to video feature reconstruction. We hypothesize that this performance gap stems from how masking strategies influence the model's attention to temporal dynamics. To validate this hypothesis, we conduct two sets of experiments demonstrating that alignment between the masked object and the reconstruction target is crucial for effective video-language self-supervised learning. Based on these findings, we propose a spatio-temporal masking strategy (STM) for video-language model pretraining that operates across adjacent frames, together with a decoder that leverages semantic information to enhance the spatio-temporal representations of masked tokens. Through the combination of the masking strategy and the reconstruction decoder, STM forces the model to learn spatio-temporal feature representations more comprehensively. Experimental results across three video understanding downstream tasks validate the superiority of our method.
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Zhihang Liu · Chen-Wei Xie · Pandeng Li · Liming Zhao · Longxiang Tang · Yun Zheng · Chuanbin Liu · Hongtao Xie
Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, visual content does not contribute equally to user instructions, and existing strategies (\eg, average pooling) inevitably lead to the loss of potentially useful information. To tackle this, we propose the \textbf{H}ybrid-level Instruction \textbf{I}njection Strategy for \textbf{C}onditional Token C\textbf{om}pression in MLLMs (HICom), utilizing the instruction as a condition to guide the compression from both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and we apply an attention mechanism to complete the conditional compression. From the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is also preserved for easier understanding by LLMs. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that HICom obtains distinguished video understanding ability with fewer tokens, increasing performance by 3.88\% on average on three multiple-choice QA benchmarks and saving 78.8\% tokens compared with the SOTA method. The code, dataset, and model will be released.
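A bare-bones sketch of instruction-conditioned token compression via cross-attention, under the assumption that the instruction is pooled into a single embedding and added to learnable compression queries; the shapes, pooling, and absence of learned projections are simplifications, not HICom's exact design.

```python
import torch

def conditional_compress(visual_tokens: torch.Tensor,
                         instruction_emb: torch.Tensor,
                         query_tokens: torch.Tensor) -> torch.Tensor:
    """Compress visual tokens into a few instruction-aware tokens via cross-attention.

    visual_tokens: (N, D); instruction_emb: (D,) pooled instruction embedding;
    query_tokens: (M, D) learnable compression tokens with M << N.
    """
    cond_queries = query_tokens + instruction_emb                    # inject the instruction condition
    scale = visual_tokens.shape[-1] ** 0.5
    attn = torch.softmax(cond_queries @ visual_tokens.T / scale, dim=-1)  # (M, N)
    return attn @ visual_tokens                                      # (M, D) compressed, condition-aware tokens
```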
Re-thinking Temporal Search for Long-Form Video Understanding
Jinhui Ye · Zihan Wang · Haosen Sun · Keshigeyan Chandrasegaran · Zane Durante · Cristobal Eyzaguirre · Yonatan Bisk · Juan Carlos Niebles · Ehsan Adeli · Li Fei-Fei · Jiajun Wu · Manling Li
Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: **First**, we formulate temporal search as a **Long Video Haystack** problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create **LV-Haystack**, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only 2.1% temporal F1 score on the LVBench subset.**Next**, inspired by visual search in images, we re-think temporal searching and propose a lightweight keyframe searching framework, $T^*$ , which casts the expensive temporal search as a spatial search problem. $T^*$ leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, $T^*$ significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, $T^*$ improves GPT-4o's performance from 50.5% to **53.1%** and LLaVA-OneVision-72B's performance from 56.5% to **62.4%** on LongVideoBench XL subset. Our PyTorch code, benchmark dataset and models are included in the Supplementary material.
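As a hedged illustration of how a keyframe-selection temporal F1 might be computed (LV-Haystack's exact definition is not given here, and the tolerance window below is an assumption):

```python
def temporal_f1(pred_frames, gt_frames, tol=5):
    """F1 between predicted and ground-truth keyframe indices, with a +/- tol frame tolerance."""
    if not pred_frames or not gt_frames:
        return 0.0
    tp_pred = sum(any(abs(p - g) <= tol for g in gt_frames) for p in pred_frames)  # matched predictions
    tp_gt = sum(any(abs(p - g) <= tol for p in pred_frames) for g in gt_frames)    # covered ground truth
    precision = tp_pred / len(pred_frames)
    recall = tp_gt / len(gt_frames)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```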
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
Hongyu Li · Jinyu Chen · Ziyu Wei · Shaofei Huang · Tianrui Hui · Jialin Gao · Xiaoming Wei · Si Liu
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. LLaVA-ST achieves outstanding performance on 12 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data, and benchmark will be released.
Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Yunseok Jang · Yeda Song · Sungryull Sohn · Lajanugen Logeswaran · Tiange Luo · Dong-Ki Kim · GyungHoon Bae · Honglak Lee
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing agents capable of mobile operating system (mobile OS) navigation. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models trained on MONDAY demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single OS datasets while achieving 21.41\%p better performance on previously unseen mobile OS configurations. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework combines robust OCR-based scene detection (95.04% F1-score), near-perfect UI component detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.
PHGC: Procedural Heterogeneous Graph Completion for Natural Language Task Verification in Egocentric Videos
Xun Jiang · Zhiyi Huang · Xing Xu · Jingkuan Song · Fumin Shen · Heng Tao Shen
Natural Language-based Egocentric Task Verification (NLETV) aims to equip agents with the ability to determine if operation flows of procedural tasks in egocentric videos align with natural language instructions. Describing rules with natural language provides generalizable applications, but also raises cross-modal heterogeneity and hierarchical misalignment challenges. In this paper, we propose a novel approach termed Procedural Heterogeneous Graph Completion (PHGC), which addresses these challenges with heterogeneous graphs representing the logic in rules and operation flows. Specifically, our PHGC method mainly consists of three key components: (1) a Heterogeneous Graph Construction module that defines objective states and operation flows as vertices, with temporal and sequential relations as edges; (2) a Cross-Modal Path Finding module that aligns semantic relations between hierarchical video and text elements; and (3) a Discriminative Entity Representation module that excavates hidden entities, integrating general logical relations and discriminative cues to reveal the final verification results. Additionally, we construct a new dataset called CSV-NL comprising realistic videos. Extensive experiments on the two benchmark datasets covering both digital and physical scenarios, i.e., EgoTV and CSV-NL, demonstrate that our proposed PHGC establishes state-of-the-art performance across different settings. Our implementation is available at https://anonymous.4open.science/r/PHGC-7A1B.
Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers
Yichen Xiao · Shuai Wang · Dehao Zhang · Wenjie Wei · Yimeng Shan · Xiaoli Liu · Yulin Jiang · Malu Zhang
Transformers significantly raise the performance limits across various tasks, spurring research into integrating them into spiking neural networks. However, a notable performance gap remains between existing spiking Transformers and their artificial neural network counterparts. Here, we first analyze the reason for this gap and identify that the dot product ineffectively calculates similarity between spiking Queries (Q) and Keys (K). To address this challenge, we introduce an innovative α-XNOR similarity calculation method tailored for spike trains. α-XNOR similarity redefines the correlation of non-spike pairs as a specific value α, effectively overcoming the limitations of dot-product similarity caused by numerous non-spiking events. Additionally, considering the sparse nature of spike trains where spikes carry more information than non-spikes, the α-XNOR similarity correspondingly highlights the distinct importance of spikes over non-spikes. Extensive experiments demonstrate that our α-XNOR similarity significantly improves performance across different spiking Transformer architectures in various static and neuromorphic datasets. This is the first attempt to develop a spiking self-attention paradigm tailored for the binary characteristics of spike trains.
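A small sketch of the α-XNOR idea as described: spike-spike agreements count fully, non-spike agreements contribute only a value α, and mismatched pairs contribute nothing; the time-averaging normalization is an assumption, not necessarily the paper's.

```python
import torch

def alpha_xnor_similarity(q: torch.Tensor, k: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """alpha-XNOR similarity between binary spike trains q and k of shape (..., T).

    Spike-spike agreements (1,1) contribute 1, non-spike agreements (0,0) contribute alpha,
    and mismatched pairs contribute 0, so abundant non-spiking events no longer dominate
    the score the way they distort a plain dot product.
    """
    both_spike = q * k                      # 1 where both neurons fire
    both_silent = (1 - q) * (1 - k)         # 1 where both are silent
    return (both_spike + alpha * both_silent).mean(dim=-1)
```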
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Andong Deng · Tongjia Chen · Shoubin Yu · Taojiannan Yang · Lincoln Spencer · Yapeng Tian · Ajmal Mian · Mohit Bansal · Chen Chen
In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work focusing on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video clips and 249K object masks that are deliberately designed with 4 question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion reasoning abilities. GROUNDMORE uniquely requires models to generate visual answers, providing a more concrete and visually interpretable response than plain texts. It evaluates models on both spatiotemporal grounding and reasoning, fostering progress on complex challenges in motion-related video reasoning, temporal perception, and pixel-level understanding. Furthermore, we introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA). MORA incorporates the multimodal reasoning ability of the Multimodal LLM, the pixel-level perception capability of the grounding model (SAM), and the temporal perception ability of a lightweight localization head. MORA achieves respectable performance on GROUNDMORE, outperforming the best existing visual grounding baseline model by an average of 21.5% relatively. We hope this novel and challenging task will pave the way for future advancements in robust and general motion understanding via video reasoning segmentation.
V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts
Adnen Abdessaied · Anna Rohrbach · Marcus Rohrbach · Andreas Bulling
We present **V$^2$Dial** – a novel expert-based model specifically geared towards simultaneously handling image and video input data for multimodal conversational tasks. Current multimodal models primarily focus on simpler tasks (e.g., VQA, VideoQA, video-text retrieval) and often neglect the more challenging conversational counterparts, such as video and visual/image dialog. Moreover, work on the two conversational tasks has evolved separately despite their apparent similarities, limiting their applicability potential. To this end, we propose to unify both tasks using a single model that for the first time jointly learns the spatial and temporal features of images and videos by routing them through dedicated experts and aligns them using matching and contrastive learning techniques. Furthermore, we systematically study the domain shift between the two tasks by investigating whether and to what extent these seemingly related tasks can mutually benefit from their respective training data. Extensive evaluations on the widely used video and visual dialog datasets of AVSD and VisDial show that our model achieves new state-of-the-art results across four benchmarks both in zero-shot and fine-tuning settings.
Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation
Rohith Peddi · Saurabh . · Ayush Abhay Shrivastava · Parag Singla · Vibhav Giridhar Gogate
Spatio-Temporal Scene Graphs (STSGs) provide a concise and expressive representation of dynamic scenes by modelling objects and their evolving relationships over time. However, real-world visual relationships often exhibit a long-tailed distribution, causing existing methods for tasks like Video Scene Graph Generation (VidSGG) and Scene Graph Anticipation (SGA) to produce biased scene graphs. To this end, we propose \textbf{ImparTail}, a novel training framework that leverages curriculum learning and loss masking to mitigate bias in the generation and anticipation of spatio-temporal scene graphs. Our approach gradually decreases the dominance of the head relationship classes during training and focuses more on tail classes, leading to more balanced training. Furthermore, we introduce two new tasks—Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation—designed to evaluate the robustness of STSG models against distribution shifts. Extensive experiments on the Action Genome dataset demonstrate that our framework significantly enhances the unbiased performance and robustness of STSG models compared to existing methods.
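The abstract pairs curriculum learning with loss masking; a loose, hypothetical sketch of a head-class loss mask whose keep ratio is annealed over the curriculum is shown below (the head/tail split and schedule are illustrative choices, not ImparTail's).

```python
import torch
import torch.nn.functional as F

def masked_relationship_loss(logits, targets, class_freq, keep_head_frac):
    """Cross-entropy where head-class samples are kept only with probability keep_head_frac.

    logits: (N, C) relationship logits; targets: (N,) labels; class_freq: (C,) training frequencies.
    keep_head_frac would be annealed from 1.0 towards a small value as the curriculum progresses,
    gradually shifting the training signal towards tail relationship classes.
    """
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    is_head = class_freq[targets] > class_freq.median()              # crude head/tail split
    keep = torch.where(is_head,
                       torch.rand_like(per_sample) < keep_head_frac,
                       torch.ones_like(per_sample, dtype=torch.bool))
    return (per_sample * keep).sum() / keep.sum().clamp(min=1)
```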
GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
Lang Lin · Xueyang Yu · Ziqi Pang · Yu-Xiong Wang
This paper proposes a novel framework utilizing multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS). Previous MLLM-based methods commonly struggle with the dilemma between "Ref" and "VOS": they either specialize in understanding a few key frames (global reasoning) or tracking objects on continuous frames (local reasoning), and rely on external VOS or frame selectors to mitigate the other end of the challenge. However, our framework GLUS shows that Global and Local consistency can be Unified into a single video Segmentation MLLM: a set of sparse "context frames" provides global information, while a stream of continuous "query frames" conducts local object tracking. This is further supported by jointly training the MLLM with a pre-trained VOS memory bank to simultaneously digest short-range and long-range temporal information. To improve the information efficiency within the limited context window of MLLMs, we introduce object contrastive learning to distinguish hard false-positive objects and a self-refined framework to identify crucial frames and perform propagation. By collectively integrating these insights, our GLUS delivers a simple yet effective baseline, achieving new state-of-the-art for MLLMs on the MeViS and Ref-Youtube-VOS benchmark.
LiVOS: Light Video Object Segmentation with Gated Linear Matching
Qin Liu · Jianfeng Wang · Zhengyuan Yang · Linjie Li · Kevin Lin · Marc Niethammer · Lijuan Wang
Semi-supervised video object segmentation (VOS) has been largely driven by space-time memory (STM) networks, which store past frame features in a spatiotemporal memory to segment the current frame via softmax attention. However, STM networks face memory limitations due to the quadratic complexity of softmax matching, restricting their applicability as video length and resolution increase. To address this, we propose LiVOS, a lightweight memory network that employs linear matching via linear attention, reformulating memory matching into a recurrent process that reduces the quadratic attention matrix to a constant-size, spatiotemporal-agnostic 2D state. To enhance selectivity, we introduce gated linear matching, where a data-dependent gate matrix is multiplied with the state matrix to control what information to retain or discard. Experiments on diverse benchmarks demonstrated the effectiveness of our method. It achieved 64.8 J&F on MOSE and 85.1 J&F on DAVIS, surpassing all non-STM methods and narrowing the gap with STM-based approaches. For longer and higher-resolution videos, it matched STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU--a previously cost-prohibitive capability--opening the door for long and high-resolution video foundation models.
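A minimal sketch of gated linear matching as described: the quadratic attention matrix is replaced by a constant-size 2D state that accumulates key-value outer products, modulated by a data-dependent gate; the gate's exact parameterization and any normalization in LiVOS are assumptions here.

```python
import torch

def gated_linear_matching(queries, keys, values, gates):
    """Recurrent linear matching with a constant-size state.

    queries, keys: (T, Dk); values: (T, Dv); gates: (T, Dk, Dv) with entries in [0, 1].
    Returns per-frame readouts of shape (T, Dv); the state never grows with sequence length T.
    """
    T, dk = keys.shape
    dv = values.shape[1]
    state = torch.zeros(dk, dv)
    outputs = []
    for t in range(T):
        state = gates[t] * state + torch.outer(keys[t], values[t])  # gate retention, then accumulate
        outputs.append(queries[t] @ state)                          # (Dv,) readout for frame t
    return torch.stack(outputs)
```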
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
Muchao Ye · Weiyang Liu · Pan He
The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehensible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.
Track Any Anomalous Object:A Granular Video Anomaly Detection Pipeline
Yuzhi Huang · Chenxin Li · Haitao Zhang · Zixu Lin · yunlong lin · Hengyu Liu · Wuyang Li · Xinyu Liu · Jiechao Gao · Yue Huang · Xinghao Ding · Yixuan Yuan
Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalies at the frame or object level, they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose an innovative VAD framework called $\underline{T}$rack Any $\underline{A}$nomalous $\underline{O}$bject ($\textit{TAO}$), a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel at each moment, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to subsequent tasks such as image segmentation and video tracking, our method eliminates the need for threshold selection and achieves more precise anomaly localization, even in long and challenging video sequences. Experiments on extensive datasets demonstrate that $\textit{TAO}$ achieves state-of-the-art performance, marking new progress for VAD by providing a practical, granular, and holistic solution.
Context-Enhanced Memory-Refined Transformer for Online Action Detection
Zhanzhong Pang · Fadime Sener · Angela Yao
Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We uncover a training-inference discrepancy in existing OAD methods that hinders learning effectiveness. The training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder that leverages near-future generation to enhance performance. CMeRT improves learning for OAD and achieves state-of-the-art results in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.
Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
Ziyi Liu · Yangcen Liu
Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of snippet-level annotations, leading to a performance and framework gap with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training with noisy labels. Motivated by these challenges, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and Fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate pseudo labels with better quality. Then, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, an uncertainty mask and an iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3, and further demonstrates its extensibility to Point-supervised TAL (PTAL).
Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition
Yang Chen · Jingcai Guo · Song Guo · Dacheng Tao
Zero-shot skeleton action recognition is a non-trivial task that requires robust unseen generalization with prior knowledge from only seen classes and shared semantics. Existing methods typically build the skeleton-semantics interactions through uncontrollable mappings and conspicuous representations, and can therefore hardly capture the intricate and fine-grained relationships needed for effective cross-modal transferability. To address these issues, we propose a novel dy$N$amically $E$volving d$U$al skeleton-semantic syne$R$gistic framework with the guidance of c$O$ntext-aware side informatio$N$ (dubbed $Neuron$), to explore more fine-grained cross-modal correspondence from micro to macro perspectives at both spatial and temporal levels, respectively. Concretely, 1) we first construct the spatial-temporal evolving micro-prototypes and integrate dynamic context-aware side information to capture the intricate and synergistic skeleton-semantic correlations step-by-step, progressively refining cross-modal alignment; and 2) we introduce the spatial compression and temporal memory mechanisms to guide the growth of spatial-temporal micro-prototypes, enabling them to absorb structure-related spatial representations and regularity-dependent temporal patterns. Notably, such processes are analogous to the learning and growth of neurons, equipping the framework with the capacity to generalize to new action categories. Extensive experiments on various benchmark datasets demonstrate the superiority of the proposed method. The code is anonymously available at: \url{https://anonymous.4open.science/r/Neuron}.
MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
Xinqi Liu · Li Zhou · Zikun Zhou · Jianqiu Chen · Zhenyu He
The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block, to capture contextual information for multimodal modeling and adaptive reference feature update. Besides, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.
Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues
Youngjoon Jang · Haran Raajesh · Liliane Momeni · Gul Varol · Andrew Zisserman
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a novel translation model. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show; (ii) pseudo-glosses transcribing the signing; and (iii) translations of previous sentences. These are automatically extracted and input along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems
Song Xia · Yi Yu · Wenhan Yang · MEIWEN DING · Zhuo Chen · Ling-Yu Duan · Alex C. Kot · Xudong Jiang
By locally encoding raw data into intermediate features, collaborative inference enables end users to leverage powerful deep learning models without exposing sensitive raw data to cloud servers. However, recent studies have revealed that these intermediate features may not sufficiently preserve privacy, as information can be leaked and raw data can be reconstructed via model inversion attacks (MIAs). Obfuscation-based methods, such as noise corruption, adversarial representation learning, and information filters, enhance inversion robustness by empirically obfuscating task-irrelevant redundancy. However, methods for quantifying such redundancy remain elusive, and the explicit mathematical relation between this redundancy and worst-case robustness against inversion has not yet been established. To address this, this work first theoretically proves that the conditional entropy of inputs given intermediate features provides a guaranteed lower bound on the reconstruction mean square error (MSE) under any MIA. Then, we derive a differentiable and solvable measure for bounding this conditional entropy based on Gaussian mixture estimation and propose a conditional entropy maximization (CEM) algorithm to enhance inversion robustness. Experimental results on four datasets demonstrate the effectiveness and adaptability of our proposed CEM; without compromising feature utility or computing efficiency, integrating the proposed CEM into obfuscation-based defense mechanisms consistently boosts their inversion robustness, achieving average gains ranging from 12.9% to 48.2%.
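A worked form of the kind of bound the abstract refers to may help fix ideas. The abstract does not state the inequality explicitly; a standard information-theoretic bound of this flavor (our illustration under the usual maximum-entropy argument, not necessarily the paper's exact statement) lower-bounds the per-dimension MSE of any inversion estimate $\hat{x}(z)$ of a $d$-dimensional input $x$ from features $z$ by the conditional differential entropy $h(x \mid z)$:

$$\frac{1}{d}\,\mathbb{E}\big[\lVert x - \hat{x}(z) \rVert^2\big] \;\ge\; \frac{1}{2\pi e}\,\exp\!\Big(\frac{2}{d}\, h(x \mid z)\Big).$$

Under a bound of this form, maximizing $h(x \mid z)$, which is what the CEM objective targets through its Gaussian-mixture estimate, provably raises the reconstruction error achievable by any MIA.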
Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted
Shuaiwei Yuan · Junyu Dong · Yuezun Li
With the advancement of AI generative techniques, Deepfake faces have become incredibly realistic and nearly indistinguishable to the human eye. To counter this, Deepfake detectors have been developed as reliable tools for assessing face authenticity. These detectors are typically built on Deep Neural Networks (DNNs) and trained using third-party datasets. However, this protocol raises a new security risk that can seriously undermine the trustworthiness of Deepfake detectors: once third-party data providers maliciously insert poisoned (corrupted) data, Deepfake detectors trained on these datasets will have ``backdoors'' injected that cause abnormal behavior when presented with samples containing specific triggers. This is a practical concern, as third-party providers may distribute or sell these triggers to malicious users, allowing them to manipulate detector performance and escape accountability. This paper investigates this risk in depth and describes a solution to stealthily infect Deepfake detectors. Specifically, we develop a trigger generator that can synthesize passcode-controlled, semantic-suppression, adaptive, and invisible trigger patterns, ensuring both the stealthiness and effectiveness of these triggers. Then we discuss two poisoning scenarios, dirty-label poisoning and clean-label poisoning, to accomplish the injection of backdoors. Extensive experiments demonstrate the effectiveness, stealthiness, and practicality of our method compared to several baselines.
FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing
Hossein Kashiani · Niloufar Alipour Talemi · Fatemeh Afghah
Deepfake detectors often struggle to generalize to novel forgery types due to biases learned from limited training data. In this paper, we identify a new type of model bias in the frequency domain, termed spectral bias, where detectors overly rely on specific frequency bands, restricting their ability to generalize across unseen forgeries. To address this, we propose FreqDebias, a frequency debiasing framework that mitigates spectral bias through two complementary strategies. First, we introduce a novel Forgery Mixup (Fo-Mixup) augmentation, which dynamically diversifies frequency characteristics of training samples. Second, we incorporate a dual consistency regularization (CR), which enforces both local consistency using class activation maps (CAMs) and global consistency through a von Mises-Fisher (vMF) distribution on a hyperspherical embedding space. This dual CR mitigates over-reliance on certain frequency components by promoting consistent representation learning under both local and global supervision. Extensive experiments show that FreqDebias significantly enhances cross-domain generalization and outperforms state-of-the-art methods in both cross-domain and in-domain settings.
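The abstract does not specify the exact form of Fo-Mixup, so the sketch below shows a generic frequency-domain mixup, mixing the Fourier amplitude spectra of two training faces while keeping one phase, purely to illustrate how the frequency characteristics of training samples can be diversified; the Beta prior and the amplitude/phase split are assumptions, not the paper's augmentation.

import numpy as np

def frequency_mixup(img_a, img_b, lam=None):
    # img_a, img_b: float arrays of the same shape (H, W, C), values in [0, 1].
    if lam is None:
        lam = np.random.beta(1.0, 1.0)                     # mixing coefficient
    fa = np.fft.fft2(img_a, axes=(0, 1))
    fb = np.fft.fft2(img_b, axes=(0, 1))
    amp = lam * np.abs(fa) + (1.0 - lam) * np.abs(fb)      # mixed amplitude spectrum
    phase = np.angle(fa)                                   # keep the phase of img_a
    mixed = np.fft.ifft2(amp * np.exp(1j * phase), axes=(0, 1)).real
    return np.clip(mixed, 0.0, 1.0), lam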
Omni-ID: Holistic Identity Representation Designed for Generative Tasks
Guocheng Qian · Kuan-Chieh Wang · Or Patashnik · Negin Heravi · Daniil Ostashev · Sergey Tulyakov · Daniel Cohen-Or · Kfir Aberman
We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a varied number of unstructured input images into a structured representation, where each entry represents certain global or local identity features. Our approach leverages a few-to-many identity reconstruction training paradigm, where a few images of an individual serve as input to reconstruct multiple target images of the same individual in varied poses and expressions. To train the Omni-ID encoder, we use a multi-decoder framework that leverages the complementary strengths of different decoders during the representation learning phase. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset, a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.
VISTREAM: Improving Computation Efficiency of Visual Streaming Perception via Law-of-Charge-Conservation Inspired Spiking Neural Network
Kang You · Ziling Wei · Jing Yan · Boning Zhang · Qinghai Guo · Yaoyu Zhang · Zhezhi He
Visual streaming perception (VSP) involves online intelligent processing of sequential frames captured by vision sensors, enabling real-time decision-making in applications such as autonomous driving, UAVs, and AR/VR. However, the computational efficiency of VSP on edge devices remains a challenge due to power constraints and the underutilization of temporal dependencies between frames. While spiking neural networks (SNNs) offer biologically inspired event-driven processing with potential energy benefits, their practical advantage over artificial neural networks (ANNs) for VSP tasks remains unproven. In this work, we introduce a novel framework, VISTREAM, which leverages the Law of Charge Conservation (LoCC) property in ST-BIF neurons and a differential encoding (DiffEncode) scheme to optimize SNN inference for VSP. By encoding temporal differences between neighboring frames and eliminating frequent membrane resets, VISTREAM achieves significant computational efficiency while maintaining accuracy equivalent to its ANN counterpart. We provide theoretical proofs of equivalence and validate VISTREAM across diverse VSP tasks, including object detection, tracking, and segmentation, demonstrating substantial energy savings without compromising performance.
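A rough sketch of differential encoding for streaming inference follows. It assumes the encoder is (approximately) linear in its input so that per-frame outputs can be updated additively; it illustrates the DiffEncode idea rather than the VISTREAM implementation, and the interfaces are placeholders.

import torch

def diff_encode_stream(frames, encoder):
    # frames:  iterable of tensors with identical shape
    # encoder: callable mapping an input tensor to an output tensor (assumed ~linear)
    prev = None
    output = None
    for frame in frames:
        delta = frame if prev is None else frame - prev    # temporal difference
        delta_out = encoder(delta)                         # mostly sparse input -> cheap update
        output = delta_out if output is None else output + delta_out
        prev = frame
        yield output                                       # running prediction per frame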
Temporal Separation with Entropy Regularization for Knowledge Distillation in Spiking Neural Networks
Kairong Yu · Chengting Yu · Tianqing Zhang · Xiaochen Zhao · Shu Yang · Hongwei Wang · Qiang Zhang · Qi Xu
Spiking Neural Networks (SNNs), inspired by the human brain, offer significant computational efficiency through discrete spike-based information transfer. Despite their potential to reduce inference energy consumption, a performance gap persists between SNNs and Artificial Neural Networks (ANNs), primarily due to current training methods and inherent model limitations. While recent research has aimed to enhance SNN learning by employing knowledge distillation (KD) from ANN teacher networks, traditional distillation techniques often overlook the distinctive spatiotemporal properties of SNNs, thus failing to fully leverage their advantages. To overcome these challenges, we propose a novel logit distillation method characterized by temporal separation and entropy regularization. This approach improves existing SNN distillation techniques by performing distillation learning on logits across different time steps, rather than merely on aggregated output features. Furthermore, the integration of entropy regularization stabilizes model optimization and further boosts performance. Extensive experimental results indicate that our method surpasses prior SNN distillation strategies, whether based on logit distillation, feature distillation, or a combination of both. The code will be available on GitHub.
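A minimal sketch of per-time-step logit distillation with an entropy term is given below; the temperature, the entropy weight, and even the sign of the entropy term are illustrative choices, not the paper's exact loss.

import torch
import torch.nn.functional as F

def temporal_kd_loss(student_logits, teacher_logits, tau=4.0, ent_weight=0.1):
    # student_logits: (T, B, C) SNN outputs at every time step
    # teacher_logits: (B, C)    ANN outputs, reused as the target at each step
    T = student_logits.shape[0]
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    kd = 0.0
    for t in range(T):                                  # distill each time step separately
        log_p_student = F.log_softmax(student_logits[t] / tau, dim=-1)
        kd = kd + F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau * tau
    kd = kd / T
    # Entropy of the time-averaged student prediction used as a regularizer
    # (its weighting and sign are design choices in this sketch).
    p_mean = F.softmax(student_logits.mean(dim=0), dim=-1)
    entropy = -(p_mean * p_mean.clamp(min=1e-8).log()).sum(dim=-1).mean()
    return kd + ent_weight * entropy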
Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing
Shiyang Zhou · Haijin Zeng · Yunfan Lu · Tong Shao · Ke Tang · Yongyong Chen · Jie Liu · Jingyong Su
Quad Bayer demosaicing is the central challenge for enabling the widespread application of Hybrid Event-based Vision Sensors (HybridEVS). Although existing learning-based methods that leverage long-range dependency modeling have achieved promising results, their complexity severely limits deployment on mobile devices for real-world applications. To address these limitations, we propose a lightweight Mamba-based binary neural network designed for efficient and high-performing demosaicing of HybridEVS RAW images. First, to effectively capture both global and local dependencies, we introduce a hybrid Binarized Mamba-Transformer architecture that combines the strengths of the Mamba and Swin Transformer architectures. Next, to significantly reduce computational complexity, we propose a binarized semantic Mamba (BiS-Mamba), which binarizes all projections while retaining the core Selective Scan in full precision. BiS-Mamba also incorporates additional semantic information to enhance global context and mitigate precision loss. We conduct quantitative and qualitative experiments to demonstrate the effectiveness of BMTNet in both performance and computational efficiency, providing a lightweight demosaicing solution suited for real-world edge devices. Codes and models are available in the supplementary materials.
From Laboratory to Real World: A New Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification
Yan Jiang · Hao Yu · Xu Cheng · Haoyu Chen · Zhaodong Sun · Guoying Zhao
Aiming to match pedestrian images captured under varying lighting conditions, visible-infrared person re-identification (VI-ReID) has drawn intensive research attention and achieved promising results. However, in real-world surveillance contexts, data is distributed across multiple devices/entities, raising privacy and ownership concerns that make existing centralized training impractical for VI-ReID. To tackle these challenges, we propose L2RW, a benchmark that brings VI-ReID closer to real-world applications. The rationale of L2RW is that integrating decentralized training into VI-ReID can address privacy concerns in scenarios with limited data-sharing regulation. Specifically, we design protocols and corresponding algorithms for different privacy sensitivity levels. In our new benchmark, we ensure that model training is done under conditions in which: 1) data from each camera remains completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share the data. In this way, we simulate scenarios with strict privacy constraints which are closer to real-world conditions. Intensive experiments with various server-side federated algorithms are conducted, showing the feasibility of decentralized VI-ReID training. Notably, when evaluated in unseen domains (i.e., new data entities), our L2RW, trained with isolated data (privacy-preserved), achieves performance comparable to state-of-the-art methods trained with shared data (privacy-unrestricted). We hope this work offers a novel research entry for deploying VI-ReID that fits real-world scenarios and can benefit the community. The source code will be made available.
DefMamba: Deformable Visual State Space Model
Leiye Liu · Miao Zhang · Jihao Yin · Tingwei Liu · Wei Ji · Yongri Piao · Huchuan Lu
Recently, state space models (SSM), particularly Mamba, have attracted significant attention from scholars due to their ability to effectively balance computational efficiency and performance. However, most existing visual Mamba methods flatten images into 1D sequences using predefined scan orders, which results in the model being less capable of utilizing the spatial structural information of the image during the feature extraction process. To address this issue, we propose a novel visual foundation model called DefMamba. This model includes a multi-scale backbone structure and deformable mamba (DM) blocks, which dynamically adjust the scanning path to prioritize important information, thus enhancing the capture and processing of relevant input features. By combining a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and to detect changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code will be open-sourced after the paper is published.
ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects
Woojin Lee · Hyugjae Chang · Jaeho Moon · Jaehyup Lee · Munchurl Kim
Weakly supervised Oriented Object Detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox) supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces ABBSPO, a WS-OOD framework that combines adaptive bounding box scaling with symmetry-prior-based orientation prediction. ABBSPO addresses the limitations of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with predicted RBoxes’ minimum circumscribed rectangles, often leading to inaccuracies. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS) that appropriately scales the GT HBoxes to suit the size of each predicted RBox, ensuring more accurate scale predictions for RBoxes; and (ii) a Symmetric Prior Angle (SPA) loss that uses the inherent symmetry of aerial objects for self-supervised learning, addressing the issue in previous methods where learning fails if they consistently make incorrect predictions for all three augmented views (original, rotated, and flipped). Extensive experimental results demonstrate that our ABBSPO achieves state-of-the-art results, outperforming existing methods.
Towards RAW Object Detection in Diverse Conditions
Zhong-Yu Li · Xin Jin · Bo-Yuan Sun · Chun-Le Guo · Ming-Ming Cheng
Existing object detection methods often consider sRGB input, which is compressed from RAW data using an ISP originally designed for visualization. However, such compression might lose crucial information for detection, especially under complex light and weather conditions. We introduce the AODRaw dataset, which offers 7,785 high-resolution real RAW images with 135,601 annotated instances spanning 62 categories, capturing a broad range of indoor and outdoor scenes under 9 distinct light and weather conditions. Based on AODRaw, which supports both RAW and sRGB object detection, we provide a comprehensive benchmark for evaluating current detection methods. We find that sRGB pre-training constrains the potential of RAW object detection due to the domain gap between sRGB and RAW, prompting us to pre-train directly on the RAW domain. However, it is harder for RAW pre-training to learn rich representations than for sRGB pre-training, due to camera noise. To assist RAW pre-training, we distill knowledge from an off-the-shelf model pre-trained on the sRGB domain. As a result, we achieve substantial improvements under diverse and adverse conditions without relying on extra pre-processing modules. The code and dataset will be made publicly available.
Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models
Zhejun Zhang · Peter Karkus · Maximilian Igl · Wenhao Ding · Yuxiao Chen · Boris Ivanovic · Marco Pavone
Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed-loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi-agent policies have recently become the state-of-the-art in traffic simulation. However, they are typically trained through open-loop behavior cloning, and thus suffer from covariate shift when executed in closed-loop during simulation. In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. CAT-K fine-tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT-K fine-tuning enables a small 7M-parameter tokenized traffic simulation policy to outperform a 102M-parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. Our code will be made publicly available.
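The rollout rule itself is simple enough to sketch. The policy and tokenizer.decode interfaces below are hypothetical stand-ins, and the distance is plain Euclidean; the point is only to show how the closed-loop state is advanced with the top-k candidate closest to the logged trajectory, which then serves as the fine-tuning target.

import torch

def cat_k_rollout(policy, tokenizer, gt_trajectory, k=5):
    # policy(state) -> (vocab_size,) logits over motion tokens   (assumed interface)
    # tokenizer.decode(token, state) -> next agent state          (assumed interface)
    state = gt_trajectory[0]
    chosen = []
    for t in range(1, len(gt_trajectory)):
        logits = policy(state)
        topk = torch.topk(logits, k).indices                      # k candidate tokens
        candidates = torch.stack([tokenizer.decode(tok, state) for tok in topk])
        dists = torch.linalg.norm(candidates - gt_trajectory[t], dim=-1)
        best = topk[torch.argmin(dists)]                          # closest among top-k
        chosen.append(best)
        state = tokenizer.decode(best, state)                     # closed-loop update
    return chosen                                                 # fine-tuning targets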
Advancing Manga Analysis: Comprehensive Segmentation Annotations for the Manga109 Dataset
Minshan Xie · Jian Lin · Hanyuan Liu · Chengze Li · Tien-Tsin Wong
Manga, a popular form of multimodal artwork, has traditionally been overlooked in deep learning advancements due to the absence of a robust dataset and comprehensive annotation. Manga segmentation is the key to the digital migration of manga. There exists a significant domain gap between manga and natural images, which causes most existing learning-based methods to fail. To address this gap, we introduce an augmented segmentation annotation for the Manga109 dataset, a collection of 109 manga volumes that offers intricate artworks in a rich variety of styles. We introduce detailed annotations that extend beyond the original simple bounding boxes to segmentation masks with pixel-level precision. They provide object category, location, and instance information that can be used for semantic segmentation and instance segmentation. We also provide a comprehensive analysis of our annotation dataset from various aspects. We further measure the improvement of a state-of-the-art segmentation model after training it with our augmented dataset. The benefits of this augmented dataset are profound, with the potential to significantly enhance manga analysis algorithms and catalyze novel developments in digital art processing and cultural analytics. The annotation will be publicly available upon acceptance.
Sketchy Bounding-box Supervision for 3D Instance Segmentation
qian deng · Le Hui · Jin Xie · Jian Yang
Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point-level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore inaccurate bounding boxes, named sketchy bounding boxes, which we imitate by perturbing ground-truth bounding boxes with scaling, translation, and rotation. In this paper, we propose Sketchy-3DIS, a novel weakly supervised 3D instance segmentation framework that jointly learns a pseudo labeler and a segmentator to improve performance under sketchy bounding-box annotations. Specifically, we first propose an adaptive box-to-point pseudo labeler that adaptively learns to assign points located in the overlapping parts of two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse-to-fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the regions of the coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high-quality instances through joint training. Extensive experiments show that our method achieves state-of-the-art performance on both the ScanNetV2 and S3DIS benchmarks, and even outperforms several fully supervised methods using sketchy bounding boxes.
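The perturbation used to imitate sketchy boxes can be sketched directly; the parameter ranges below are illustrative assumptions, not the magnitudes used in the paper.

import numpy as np

def make_sketchy_box(box, max_scale=0.3, max_shift=0.3, max_rot=np.pi / 12, rng=None):
    # box: dict with 'center' (3,), 'size' (3,), 'yaw' (float) for an oriented 3D box.
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(box["center"], dtype=float)
    size = np.asarray(box["size"], dtype=float)
    scale = 1.0 + rng.uniform(-max_scale, max_scale, size=3)    # per-axis scaling
    shift = rng.uniform(-max_shift, max_shift, size=3) * size   # shift relative to box size
    yaw = box["yaw"] + rng.uniform(-max_rot, max_rot)           # small rotation about the vertical axis
    return {"center": center + shift, "size": size * scale, "yaw": yaw}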
Relation3D : Enhancing Relation Modeling for Point Cloud Instance Segmentation
Edward LOO · Jiacheng Deng
3D instance segmentation aims to predict a set of object instances in a scene, representing them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines and superior predictions. However, these methods primarily focus on modeling the external relationships between scene features and query features through mask attention. They lack effective modeling of the internal relationships among scene features as well as between query features. In light of these disadvantages, we propose Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation. Specifically, we introduce an adaptive superpoint aggregation module and a contrastive learning-guided superpoint refinement module to better represent superpoint features (scene features) and leverage contrastive learning to guide the updates of these features. Furthermore, our relation-aware self-attention mechanism enhances the capabilities of modeling relationships between queries by incorporating positional and geometric relationships into the self-attention mechanism. Extensive experiments on the ScanNetV2, ScanNet++, ScanNet200 and S3DIS datasets demonstrate the superior performance of Relation3D.
FSHNet: Fully Sparse Hybrid Network for 3D Object Detection
Shuai Liu · Mingyue Cui · Boyang Li · Quanmin Liang · Tinghe Hong · Kai Huang · yunxiao shan · Kai Huang
Fully sparse 3D detectors have recently gained significant attention due to their efficiency in long-range detection. However, sparse 3D detectors extract features only from non-empty voxels, which impairs long-range interactions and causes center features to be missing. The former weakens the feature extraction capability, while the latter hinders network optimization. To address these challenges, we introduce the Fully Sparse Hybrid Network (FSHNet). FSHNet incorporates a proposed SlotFormer block to enhance the long-range feature extraction capability of existing sparse encoders. The SlotFormer divides sparse voxels using a slot partition approach, which, compared to traditional window partition, provides a larger receptive field. Additionally, we propose a dynamic sparse label assignment strategy to deeply optimize the network by providing more high-quality positive samples. To further enhance performance, we introduce a sparse upsampling module to refine downsampled voxels, preserving fine-grained details crucial for detecting small objects. Extensive experiments on the Waymo, nuScenes, and Argoverse2 benchmarks demonstrate the effectiveness of FSHNet. The code will be publicly available.
3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation
Weijie Wei · Osman Ülger · Fatemeh Karimi Nejadasl · Theo Gevers · Martin R. Oswald
Open-vocabulary segmentation methods offer promising capabilities in detecting unseen object categories, but the vocabulary must be known and needs to be provided by a human, either via a text prompt or pre-labeled datasets, thus limiting their scalability. We propose 3D-AVS, a method for Auto-Vocabulary Segmentation of 3D point clouds for which the vocabulary is unknown and auto-generated for each input at runtime, thus eliminating the human in the loop and typically providing a substantially larger vocabulary for richer annotations. 3D-AVS first recognizes semantic entities from image or point cloud data and then segments all points with the automatically generated vocabulary. Our method incorporates both image-based and point-based recognition, enhancing robustness under challenging lighting conditions where geometric information from LiDAR is especially valuable. Our point-based recognition features a Sparse Masked Attention Pooling (SMAP) module to enrich the diversity of recognized objects. To address the challenges of evaluating unknown vocabularies and avoid annotation biases from label synonyms, hierarchies, or semantic overlaps, we introduce the annotation-free Text-Point Semantic Similarity (TPSS) metric for assessing segmentation quality. Our evaluations on nuScenes and ScanNet demonstrate our method's ability to generate semantic classes with accurate point-wise segmentations.
NTClick: Achieving Precise Interactive Segmentation With Noise-tolerant Clicks
Chenyi Zhang · Ting Liu · Xiaochao Qu · Luoqi Liu · Yao Zhao · Yunchao Wei
Interactive segmentation is a pivotal task in computer vision, focused on predicting precise masks with minimal user input. Although the $\textit{click}$ has recently become the most prevalent form of interaction due to its flexibility and efficiency, its advantages diminish as the complexity and details of target objects increase, because it is time-consuming and user-unfriendly to precisely locate and click on narrow, fine regions. To tackle this problem, we propose NTClick, a powerful click-based interactive segmentation method capable of predicting accurate masks even with imprecise user clicks when dealing with intricate targets. We first introduce a novel interaction form called the $\textit{Noise-tolerant Click}$, a type of click that does not require the user's precise localization when selecting fine regions. Then, we design a two-stage workflow, consisting of an Explicit Coarse Perception network for initial estimation and a High Resolution Refinement network for final classification. Quantitative results across extensive datasets demonstrate that NTClick not only maintains an efficient and flexible interaction mode but also significantly outperforms existing methods in segmentation accuracy.
HyperSeg: Hybrid Segmentation Assistant with Fine-grained Visual Perceiver
Cong Wei · Haoxian Tan · Yujie Zhong · Yong Liu · Jie Hu · Dengjie Li · Zheng Zhao · Yujiu Yang
This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adapting to both image and video scenarios, as well as to complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine-grained visual-text correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks requiring challenging reasoning abilities and world knowledge. Besides, to fully leverage the recognition capacity of VLLMs and the fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for distinct segmentation tasks. Combined with a temporal adapter, HyperSeg achieves a comprehensive understanding of space-time information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks.
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools
Chinedu Innocent Nwoye · Kareem elgohary · Anvita A. Srinivas · Fauzan Zaid · Joël L. Lavanchy · Nicolas Padoy
Tool tracking in surgical videos is essential for advancing computer-assisted interventions, such as skill assessment, safety zone estimation, and human-machine collaboration. However, the lack of context-rich datasets limits AI applications in this field. Existing datasets rely on overly generic tracking formalizations that fail to capture surgical-specific dynamics, such as tools moving out of the camera’s view or exiting the body. This results in less clinically relevant trajectories and a lack of flexibility for real-world surgical applications. Methods trained on these datasets often struggle with visual challenges such as smoke, reflection, and bleeding, further exposing the limitations of current approaches. We introduce CholecTrack20, a specialized dataset for multi-class, multi-tool tracking in surgical procedures. It redefines tracking formalization with three perspectives: (1) intraoperative, (2) intracorporeal, and (3) visibility, enabling adaptable and clinically meaningful tool trajectories. The dataset comprises 20 full-length surgical videos, annotated at 1 fps, yielding over 35K frames and 65K labeled tool instances. Annotations include spatial location, category, identity, operator, phase, and visual challenges. Benchmarking state-of-the-art methods on CholecTrack20 reveals significant performance gaps, with current approaches ($<45\%$ HOTA) failing to meet the accuracy required for clinical translation. These findings motivate the need for advanced and intuitive tracking algorithms and establish CholecTrack20 as a foundation for developing robust AI-driven surgical assistance systems.
Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation
Jiawei Fu · ZHANG Tiantian · Kai Chen · Qi Dou
Scene graph generation is a pivotal task in computer vision, aiming to identify all visual relation tuples within an image. Methods involving triplets have sought to enhance task performance by integrating triplets as contextual features for more precise predicate identification at the component level. However, challenges remain due to interference from multi-role objects in overlapping tuples within complex environments, which impairs the model's ability to distinguish and align specific triplet features for reasoning about the diverse semantics of multi-role objects. To address these issues, we introduce a novel framework that incorporates a triplet alignment model into a hybrid reciprocal transformer architecture, starting from using triplet mask features to guide the learning of component-level relation graphs. To effectively distinguish multi-role objects characterized by overlapping visual relation tuples, we introduce a triplet alignment loss, which provides multi-role objects with aligned triplet features and helps customize their representations. Additionally, we explore the inherent connectivity between hybrid aligned triplet and component features through a bidirectional refinement module, which enhances feature interaction and reciprocal reinforcement. Experimental results demonstrate that our model achieves state-of-the-art performance on the Visual Genome and Action Genome datasets, underscoring its effectiveness and adaptability. The code will be available.
Textured Gaussians for Enhanced 3D Scene Appearance Modeling
Brian Chao · Hung-Yu Tseng · Lorenzo Porzi · Chen Gao · Tuotuo Li · Qinbo Li · Ayush Saraf · Jia-Bin Huang · Johannes Kopf · Gordon Wetzstein · Changil Kim
3D Gaussian Splatting (3DGS) has recently emerged as a state-of-the-art 3D reconstruction and rendering technique due to its high-quality results and fast training and rendering time. However, pixels covered by the same Gaussian are always shaded in the same color up to a Gaussian falloff scaling factor. Furthermore, the finest geometric detail any individual Gaussian can represent is a simple ellipsoid. These properties of 3DGS greatly limit the expressivity of individual Gaussian primitives. To address these issues, we draw inspiration from texture and alpha mapping in traditional graphics and integrate it with 3DGS. Specifically, we propose a new generalized Gaussian appearance representation that augments each Gaussian with alpha (A), RGB, or RGBA texture maps to model spatially varying color and opacity across the extent of each Gaussian. As such, each Gaussian can represent a richer set of texture patterns and geometric structures, instead of just a single color and ellipsoid as in naive Gaussian Splatting. Surprisingly, we found that the expressivity of Gaussians can be greatly improved by using alpha-only texture maps, and further augmenting Gaussians with RGB texture maps achieves the highest expressivity. We validate our method on a wide variety of standard benchmark datasets and our own custom captures at both the object and scene levels. We demonstrate image quality improvements over existing methods while using a similar or lower number of Gaussians.
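How a per-Gaussian texture replaces the single color/opacity can be sketched as a texture lookup in the Gaussian's local frame; the texture extent in standard deviations and the use of grid_sample are our assumptions for illustration, not the paper's renderer.

import torch
import torch.nn.functional as F

def sample_gaussian_texture(texture, local_xy, extent=3.0):
    # texture:  (1, 4, H, W) RGBA texture attached to one Gaussian
    # local_xy: (N, 2) pixel positions in the Gaussian's whitened 2D frame (units of std. dev.)
    grid = (local_xy / extent).clamp(-1.0, 1.0).view(1, -1, 1, 2)    # map to grid_sample range
    rgba = F.grid_sample(texture, grid, align_corners=True)          # (1, 4, N, 1)
    return rgba.view(4, -1).transpose(0, 1)                          # (N, 4) color/opacity per pixel

In such a scheme, a Gaussian's contribution to a pixel would modulate the usual Gaussian falloff by this spatially varying RGBA value instead of a single color and opacity.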
Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Wei Deng · Mengshi Qi · Huadong Ma
Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there is little work studying 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common-sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, \ie room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and the supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm searches the tree of the problem space. To leverage the VLM to produce object positions, we discretize the top-down view space into a dense grid and fill each cell with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing the position with the names of the emojis. Quantitative and qualitative experimental results illustrate that our approach generates more plausible 3D scenes than state-of-the-art approaches. We will release our code and model.
CrossOver: 3D Scene Cross-Modal Alignment
Sayan Deb Sarkar · Ondrej Miksik · Marc Pollefeys · Daniel Barath · Iro Armeni
Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities—RGB images, point clouds, CAD models, floorplans, and text descriptions—without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver's adaptability for real-world applications in 3D scene understanding.
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Duo Zheng · Shijia Huang · Liwei Wang
The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately. Additionally, we have implemented a maximum coverage sampling technique to optimize the balance between computational costs and performance efficiency. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
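The maximum coverage sampling mentioned above admits a standard greedy sketch; the voxel-visibility sets and the frame budget are assumptions used only to illustrate the selection criterion, not the paper's exact procedure.

def max_coverage_sampling(frame_to_voxels, budget):
    # frame_to_voxels: dict mapping frame_id -> set of 3D voxel ids visible in that frame
    # budget:          number of frames to keep
    covered, selected = set(), []
    remaining = dict(frame_to_voxels)
    for _ in range(min(budget, len(remaining))):
        best = max(remaining, key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:          # no frame adds new coverage
            break
        selected.append(best)
        covered |= remaining.pop(best)
    return selected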
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence
Xuewu Lin · Tianwei Lin · Alan Huang · HONGYU XIE · Zhizhong Su
In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69\% in the 3D detection task and 15.25\% in the 3D visual grounding task.
Ges3ViG : Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding
Atharv Mahesh Mane · Dulanga Weerakoon · Vigneshwaran Subbaraju · Sougata Sen · Sanjay Sarma · Archan Misra
3-Dimensional Embodied Reference Understanding (3D-ERU) is a complex vision-language task, crucial for supporting interactions between humans and situated AI agents. 3D-ERU combines a language description and an accompanying pointing gesture to identify the most relevant target object in a given 3D scene. While previous research has extensively explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework-- **Imputer**, and use it to curate a new challenging benchmark dataset-- **ImputeRefer** for 3D-ERU, by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose **Ges3ViG**, a novel model for 3D-ERU that achieves a $\sim$30\% improvement in accuracy, compared to other 3D-ERU models and $\sim$9\% compared to other purely language-based 3D grounding models. Our code and data will be released publicly upon acceptance of the paper.
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
YUEJIAO SU · Yi Wang · Qiongyang Hu · Chuang Yang · Lap-Pui Chau
Egocentric interaction perception is one of the essential branches in investigating human-environment interaction, which lays the basis for developing next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot yield coherent textual and pixel-level responses simultaneously according to user queries, which limits their flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image with a query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding, which results in fluent textual and fine-grained pixel-level responses. Another challenge is that existing datasets cannot meet the conditions for the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset based on extensive manual efforts, which includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model to generate text- and pixel-level outputs utilizing multimodal large language models, which enables a comprehensive interpretation of egocentric interactions. Experiments on Ego-IRGBench demonstrate the effectiveness of our ANNEXE model compared with other works.
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy
Zaijing Li · Yuquan Xie · Rui Shao · Gongwei Chen · Dongmei Jiang · Liqiang Nie
Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts in training Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang · Jingdi Lei · Junxian Li · Xunzhi Wang · Yujie Liu · Zonglin Yang · Jiatong LI · Weida Wang · Suorong Yang · Jianbo Wu · Peng Ye · Wanli Ouyang · Dongzhan Zhou
Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understandings or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process and critic process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critique to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic. This interaction process was theoretically driven by a reinforcement learning framework where the Critic offers natural language critiques instead of scalar rewards, enabling more nuanced feedback to boost the Reasoner's capability on complex reasoning tasks. The Critic model is trained using Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner and constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach provides a promising solution to enhance the reliability of VLMs, improving their performance in real-world reasoning-heavy multimodal applications such as autonomous driving and embodied intelligence.
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Yuhao Dong · Zuyan Liu · Hai-Long Sun · Jingkang Yang · Winston Hu · Yongming Rao · Ziwei Liu
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines still remain inadequately explored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi-granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data will not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Based on the popular LLaVA-NeXT model, our method shows an average improvement of 7.5% across seven challenging multi-modal benchmarks requiring visual reasoning. We also achieve a 4.2% improvement on a stronger base MLLM, highlighting the potential to further advance state-of-the-art models. Benefiting from our multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks. We will make our data and code publicly available to promote future research in this emerging field.
Synthetic Visual Genome
Jae Sung Park · Zixian Ma · Linjie Li · Chenhao Zheng · Cheng-Yu Hsieh · Ximing Lu · Khyathi Chandu · Quan Kong · Norimasa Kobori · Ali Farhadi · Yejin Choi · Ranjay Krishna
Understanding and reasoning over visual relationships (spatial, functional, interactional, social, etc.) are considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models, precise reasoning over relationships remains a challenge. We introduce Robin: an MLM instruction-tuned with densely annotated relationships, capable of constructing high-quality dense scene graphs at scale. To train Robin, we curate SVG, a scene graph-based instruction-tuning dataset containing $33K$ images and $855K$ relationships for $170K$ objects, built by completing the missing relations in existing scene graphs using GPT-4V and a carefully designed filtering process that combines rule-based and model-based filtering techniques to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o refines Robin's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. Results show that our Robin-3B model, despite being trained on less than $3$ million instances, outperforms similar-size models trained on over $300$ million instances on relationship understanding benchmarks, and even surpasses larger models of up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of $88.2$, surpassing the previous best of $87.4$. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
Yexin Liu · Zhengyang Liang · Yueze Wang · Xianfeng Wu · feilong tang · Muyang He · Jian Li · Zheng Liu · Harry Yang · Ser-Nam Lim · Bo Zhao
Multimodal Large Language Models (MLLMs) have displayed remarkable performance in multimodal tasks, particularly in visual comprehension. However, we reveal that MLLMs often generate incorrect answers even when they understand the visual content. To this end, we manually construct a benchmark with 12 categories and design evaluation metrics that assess the degree of error in MLLM responses even when the visual content is seemingly understood. Based on this benchmark, we test 15 leading MLLMs and analyze the distribution of attention maps and logits of some MLLMs. Our investigation identifies two primary issues: 1) most instruction tuning datasets predominantly feature questions that ``directly'' relate to the visual content, leading to a bias in MLLMs' responses to other indirect questions, and 2) MLLMs' attention to visual tokens is notably lower than to system and question tokens. We further observe that attention scores between questions and visual tokens as well as the model's confidence in the answers are lower in response to misleading questions than to straightforward ones. To address the first challenge, we introduce a paired positive and negative data construction pipeline to diversify the dataset. For the second challenge, we propose to enhance the model's focus on visual content during decoding by refining the text and visual prompt. For the text prompt, we propose a content-guided refinement strategy that performs preliminary visual content analysis to generate structured information before answering the question. Additionally, we employ a visual attention refinement strategy that highlights question-relevant visual tokens to increase the model's attention to visual content that aligns with the question. Extensive experiments demonstrate that these challenges can be significantly mitigated with our proposed dataset and techniques. The benchmark, training set, and code will be available.
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Geng Li · Jinglin Xu · Yunzhen Zhao · Yuxin Peng
Humans can effortlessly locate desired objects in cluttered environments, relying on a cognitive mechanism known as visual search to efficiently filter out irrelevant information and focus on task-related regions. Inspired by this process, we propose DyFo (Dynamic Focus), a training-free dynamic focusing visual search method that enhances fine-grained visual understanding in large multimodal models (LMMs). Unlike existing approaches which require additional modules or data modifications, DyFo leverages a bidirectional interaction between LMMs and visual experts, using a Monte Carlo Tree Search (MCTS) algorithm to simulate human-like focus adjustments. This enables LMMs to focus on key visual regions while filtering out irrelevant content without the need for vocabulary expansion or specialized localization modules. Experimental results demonstrate that DyFo significantly improves fine-grained visual understanding and reduces hallucination issues in LMMs, achieving superior performance across both fixed and variable resolution models.
Taxonomy-Aware Evaluation of Vision-Language Models
Vésteinn Snæbjarnarson · Kevin Du · Niklas Stoehr · Serge Belongie · Ryan Cotterell · Nico Lang · Stella Frank
When a vision-and-language model (VLM) is prompted to identify an entity in an image, it may err on the side of caution and answer with "tree" instead of a more specific description such as "pine tree". Traditional binary accuracy metrics cannot differentiate between wrong predictions and insufficiently specific ones. They also do not give partial credit for close answers: "pine tree" for a Norway spruce should be better than "cypress", taxonomically speaking, but string matching-based similarity measures will reject both equally. To address this shortcoming, we propose a framework for evaluating open-ended text predictions against a taxonomic hierarchy, using measures of hierarchical precision and recall to quantify the correctness and specificity of predictions. We first show that existing text similarity measures and accuracy-based evaluation metrics do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the free-form outputs and the ground-truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our taxonomic evaluation. We find that models respond differently to instructions prompting for more specific answers, with GPT-4V responding most specifically and others showing a trade-off between hierarchical precision and recall.
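The hierarchical measures can be illustrated with a toy ancestor-overlap computation: score a prediction by how much of its ancestor chain it shares with the ground truth. The tiny taxonomy and the convention of counting a node together with all of its ancestors are assumptions for demonstration, not the paper's exact setup.

```python
# Hierarchical precision/recall over a tiny hand-made taxonomy (child -> parent).
TAXONOMY = {
    "norway spruce": "spruce",
    "spruce": "pinaceae",
    "pine tree": "pine",
    "pine": "pinaceae",
    "pinaceae": "conifer",
    "cypress": "cupressaceae",
    "cupressaceae": "conifer",
    "conifer": "plant",
}

def ancestors(node):
    """Return the node plus all of its ancestors up to the root."""
    out = {node}
    while node in TAXONOMY:
        node = TAXONOMY[node]
        out.add(node)
    return out

def hierarchical_pr(prediction, ground_truth):
    pred, gold = ancestors(prediction), ancestors(ground_truth)
    overlap = len(pred & gold)
    return overlap / len(pred), overlap / len(gold)  # (hierarchical precision, recall)

# "pine tree" for a Norway spruce shares more of the hierarchy than "cypress".
print(hierarchical_pr("pine tree", "norway spruce"))  # (0.6, 0.6)
print(hierarchical_pr("cypress", "norway spruce"))    # (0.5, 0.4)
```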
Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models
Kartik Thakral · Tamar Glaser · Tal Hassner · Mayank Vatsa · Richa Singh
Existing unlearning algorithms in text-to-image generative models often fail to preserve knowledge of semantically related concepts when removing specific target concepts—a challenge known as \textit{adjacency}. To address this, we propose \textbf{FADE} (\textit{Fine-grained Attenuation for Diffusion Erasure}), introducing adjacency-aware unlearning in diffusion models. FADE comprises two components: (1) the \textbf{Concept Lattice}, which identifies an adjacency set of related concepts, and (2) \textbf{Mesh Modules}, employing a structured combination of Expungement, Adjacency, and Guidance loss components. These enable precise erasure of target concepts while preserving fidelity across related and unrelated concepts. Evaluated on datasets like Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet-1k, FADE effectively removes target concepts with minimal impact on correlated concepts, achieving at least a \textbf{12\% improvement in retention performance} over state-of-the-art methods.
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
Zhikai Li · Xuewen Liu · Dongrong Joe Fu · Jianquan Li · Qingyi Gu · Kurt Keutzer · Zhen Dong
The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. The Arena platform, which gathers user votes on model comparisons, can rank models according to human preferences. However, traditional Arena methods, while established, require an excessive number of comparisons for the ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena exhibits 16.3$\times$ faster convergence compared to the widely used Elo algorithm. To further validate the superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available online.
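To see why a K-wise vote is so much more informative than a pairwise one, the sketch below decomposes a single ranking of K = 4 models into its K(K-1)/2 implied pairwise outcomes and applies plain Elo-style updates. This is a simplified stand-in for intuition only; the paper itself relies on probabilistic modeling and Bayesian updating rather than Elo.

```python
# Decompose one K-wise ranking into pairwise outcomes and apply Elo-style updates.
from itertools import combinations

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0, "model_d": 1000.0}

def elo_update(winner, loser, k_factor=32.0):
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k_factor * (1.0 - expected_win)
    ratings[loser] -= k_factor * (1.0 - expected_win)

# One K-wise vote: a user ranks four generated samples best-to-worst in a single pass.
ranking = ["model_c", "model_a", "model_d", "model_b"]
for better, worse in combinations(ranking, 2):  # 6 implied pairwise results
    elo_update(better, worse)

print(ratings)
```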
AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning
Xuecheng Wu · Heli Sun · Yifan Wang · Jiayu Nie · Jie Zhang · Yabing Wang · Junxiao Xue · Liang He
Affective Video Facial Analysis (AVFA) is important for advancing emotion-aware AI, yet the persistent data scarcity in AVFA presents challenges. Recently, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained significant attention, particularly in its audio-visual adaptation. Insights from general domains suggest that scaling is vital for unlocking impressive improvements, though its effects on AVFA remain largely unexplored. Additionally, capturing both intra- and inter-modal correlations through robust representations is a crucial challenge in this field. To tackle these gaps, we introduce AVF-MAE++, a series of audio-visual MAEs designed to explore the impact of scaling on AVFA with a focus on advanced correlation modeling. Our method incorporates a novel audio-visual dual masking strategy and an improved modality encoder with a holistic view to better support scalable pre-training. Furthermore, we propose the Iteratively Audio-Visual Correlations Learning Module to improve correlation capture within the SSL framework, addressing the limitations of prior methods. To support smooth adaptation and mitigate overfitting, we also introduce a progressive semantics injection strategy, which structures training in three stages. Extensive experiments across 17 datasets, spanning three key AVFA tasks, demonstrate the superior performance of AVF-MAE++, establishing new state-of-the-art results. Ablation studies provide further insights into the critical design choices driving these gains. The code will be released soon.
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Xiaoqin Wang · Xusen Ma · Xianxu Hou · Meidan Ding · Yudong Li · Junliang Chen · Wenting Chen · Xiaoyang Peng · Linlin Shen
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. Initially, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on this structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. Moreover, we further develop a robust face perception MLLM baseline, Face-LLaVA, by multi-modal training with our proposed face instruction-tuning data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, and the results are also compared with human performance. The results reveal that existing MLLMs are far from satisfactory in understanding fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released upon acceptance of this work.
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
Qihui Zhang · Munan Ning · Zheyuan Liu · Yanbo Wang · Jiayi Ye · Yue Huang · Shuo Yang · Xiao Chen · Yibing Song · Li Yuan
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation mechanisms face limitations due to the significant human workload required to design Q\&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce human workload through mutual model evaluations, they often introduce biases. To address these problems, we propose an unsupervised evaluation method, the Unsupervised Peer review MLLM Evaluation (UPME) framework. This framework utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce a vision-language scoring system to mitigate bias issues, which focuses on three aspects: (i) response correctness; (ii) the model's capability of visual understanding and reasoning; and (iii) the relevance of text-image matching. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our UPME framework closely aligns with human-designed QA benchmarks and inherent human preferences.
WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation
Silin Cheng · Yang Liu · Xinwei He · Sebastien Ourselin · Lei Tan · Luo
Weakly supervised referring expression comprehension (WREC) and segmentation (WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement (DVFE) and Collaborative Consistency Module (CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES on 1% RefCOCO.
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
Jiansheng Li · Xingxuan Zhang · Hao Zou · Yige Guo · Renzhe Xu · Yilong Liu · Chuzhao Zhu · Yue He · Peng Cui
Current object detectors often suffer significant performance degradation in real-world applications when encountering distributional shifts, posing serious risks in high-stakes domains such as autonomous driving and medical diagnosis. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: OODOD and OODG. OODOD is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially enhance performance in in-distribution (ID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the advanced GPT-4o and Gemini-1.5 achieve only 56.7% and 28.0% accuracy, respectively. We hope COUNTS facilitates advancements in the development and assessment of robust object detectors and MLLMs capable of maintaining high performance under distributional shifts.
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Federico Cocchi · Nicholas Moratelli · Marcella Cornia · Lorenzo Baraldi · Rita Cucchiara
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and models will be publicly released.
Variance-Based Membership Inference Attacks Against Large-Scale Image Captioning Models
Daniel Samira · Edan Habler · Yuval Elovici · Asaf Shabtai
The proliferation of multi-modal generative models has introduced new privacy and security challenges, especially due to the risks of memorization and unintentional disclosure of sensitive information. This paper focuses on the vulnerability of multi-modal image captioning models to membership inference attacks (MIAs). These models, which synthesize textual descriptions from visual content, could inadvertently reveal personal or proprietary data embedded in their training datasets. We explore the feasibility of MIAs in the context of such models. Specifically, our approach leverages a variance-based strategy tailored for image captioning models, utilizing only image data without knowledge of the corresponding captions. We introduce the means-of-variance threshold attack (MVTA) and the confidence-based weakly supervised attack (C-WSA), both built on the means-of-variance (MV) metric, which assesses variability among vector embeddings. Our experiments demonstrate that these models are susceptible to MIAs, indicating substantial privacy risks. The effectiveness of our methods is validated through rigorous evaluations on real-world models, confirming the practical implications of our findings.
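A minimal sketch of a means-of-variance style threshold attack is given below: sample several captions for one image, embed them, and use the mean per-dimension variance of the embeddings as the membership score. The caption sampler, the text encoder, the threshold, and the decision direction (members yielding more consistent captions) are all assumptions for illustration.

```python
import numpy as np

def means_of_variance(caption_embeddings: np.ndarray) -> float:
    """caption_embeddings: (num_samples, embed_dim) array for a single image."""
    per_dim_variance = caption_embeddings.var(axis=0)  # variance across sampled captions
    return float(per_dim_variance.mean())              # MV score

def threshold_attack(caption_embeddings: np.ndarray, tau: float) -> bool:
    # Assumption: memorized (member) images tend to yield more consistent
    # captions, hence lower variance among their embeddings.
    return means_of_variance(caption_embeddings) < tau

rng = np.random.default_rng(0)
member_like = rng.normal(0.0, 0.05, size=(8, 512))      # tightly clustered caption embeddings
non_member_like = rng.normal(0.0, 0.30, size=(8, 512))  # more diverse caption embeddings
print(threshold_attack(member_like, tau=0.02), threshold_attack(non_member_like, tau=0.02))
```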
Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Jiayu Jiang · Changxing Ding · Wentao Tan · Junhong Wang · JIN Tao · Xiangmin Xu
Text-to-image person re-identification (ReID) aims to retrieve images of a person of interest based on textual descriptions. One main challenge for this task is the high cost of manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models. Code of this paper will be released.
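The clustering step can be pictured with the sketch below: compute simple style features for each human caption and group captions by style with k-means, so that each cluster could later be represented by its own learnable prompt. The hand-crafted style featurizer here (length, punctuation density, adjective-like ratio) is a stand-in; the paper's actual style features and prompt learning are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

def style_features(caption: str) -> list:
    words = caption.split()
    return [
        len(words),                                             # verbosity
        sum(c in ",;:" for c in caption) / max(len(words), 1),  # punctuation density
        sum(w.endswith(("ish", "y", "ous")) for w in words) / max(len(words), 1),
    ]

captions = [
    "a tall man in a dark blue coat, carrying a leather bag",
    "man walking; grey trousers; short hair",
    "she wears a flowery dress and shiny red shoes on a sunny day",
    "person with backpack",
]
features = np.array([style_features(c) for c in captions])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)  # cluster id per caption; one prompt would be learned per cluster
```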
Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment
Huakai Lai · Guoxin Xiong · Huayu Mai · Xiang Liu · Tianzhu Zhang
Video-Text Retrieval (VTR) is a core task in multi-modal understanding, drawing growing attention from both academia and industry in recent years. While numerous VTR methods have achieved success, most of them assume accurate visual-text correspondences during training, which is difficult to ensure in practice due to ubiquitous noise, known as noisy correspondences (NC). In this work, we rethink how to mitigate the NC from the perspective of representative reference features (termed agents), and propose a novel relation-aware purified consistency (RPC) network to amend direct pairwise correlation, including representative agents construction and relation-aware ranking distribution alignment. The proposed RPC enjoys several merits. First, to learn the agents well without any correspondence supervision, we customize the agents construction according to the three characteristics of reliability, representativeness, and resilience. Second, the ranking distribution-based alignment process leverages the structural information inherent in inter-pair relationships, making it more robust compared to individual comparisons. Extensive experiments on five datasets under different settings demonstrate the efficacy and robustness of our method.
Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation
Yiheng Li · Yang Yang · Zichang Tan · Huan Liu · Weihua Chen · Xu Zhou · Zhen Lei
To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM$^4$) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local contents, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained forgery perception ability for DGM$^4$. Two branches for image and text modalities are established, each of which contains two cascaded decoders, i.e., a Contextual Consistency Decoder (CCD) and a Semantic Consistency Decoder (SCD), to capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details. To be specific, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, forgery-aware reasoning or aggregation is adopted to deeply seek forgery cues based on the consistency features. Extensive experiments on DGM$^4$ datasets prove that CSCL achieves new state-of-the-art performance, especially for grounding manipulated content.
Seeing the Abstract: Translating the Abstract Language for Vision Language Models
Davide Talon · Federico Girella · Ziyue Liu · Marco Cristani · Yiming Wang
Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value through extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution. Our code will be publicly available.
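The core idea of shifting an abstract representation toward well-represented concrete ones can be sketched as a nearest-neighbor interpolation in embedding space, as below. The random embeddings, the concrete-term bank, the softmax weighting, and the interpolation coefficient are all illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
concrete_bank = rng.normal(size=(1000, 64))                     # embeddings of concrete terms
concrete_bank /= np.linalg.norm(concrete_bank, axis=1, keepdims=True)

def abstract_to_concrete(query_emb: np.ndarray, k: int = 5, alpha: float = 0.5) -> np.ndarray:
    query_emb = query_emb / np.linalg.norm(query_emb)
    sims = concrete_bank @ query_emb                            # cosine similarity to concrete terms
    topk = np.argsort(-sims)[:k]
    weights = np.exp(sims[topk]) / np.exp(sims[topk]).sum()     # softmax over the k neighbors
    concrete_proxy = (weights[:, None] * concrete_bank[topk]).sum(axis=0)
    shifted = (1 - alpha) * query_emb + alpha * concrete_proxy  # move toward the concrete region
    return shifted / np.linalg.norm(shifted)

abstract_query = rng.normal(size=64)               # stands in for an abstract text embedding
print(abstract_to_concrete(abstract_query).shape)  # (64,) shifted query embedding
```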
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Zengrong Lin · Zheng Wang · Tianwen Qian · Pan Mu · Sixian Chan · Cong Bai
Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominate as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques, relying on prior data distributions that may not be practical in real-world scenarios. In this paper, we directly mitigate hubness during training and introduce NeighborRetr, a novel method that effectively balances the learning of hubs and adaptively adjusts the relations of various kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr demonstrates robust generalization to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications.
Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models
Xin Zhang · Yanzhao Zhang · Wen Xie · Mingxin Li · Ziqi Dai · Dingkun Long · Pengjun Xie · Meishan Zhang · Wenjie Li · Min Zhang
Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text or images, or a combination of both. Previous work has attempted to adopt multimodal large language models (MLLMs) to realize UMR using only text data. However, our preliminary experiments demonstrate that more diverse multimodal training data can further unlock the potential of MLLMs. Despite its effectiveness, the existing multimodal training data is highly imbalanced in terms of modality, which motivates us to develop a training data synthesis pipeline and construct a large-scale, high-quality fused-modal training dataset. Based on the synthetic training data, we develop the General Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR. Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the effectiveness of our approach. Experimental results show that our method achieves state-of-the-art performance among existing UMR methods. Finally, we provide in-depth analyses of model scaling and training strategies, and perform ablation studies on both the model and the synthetic data.
Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Davide Caffagni · Sara Sarto · Marcella Cornia · Lorenzo Baraldi · Rita Cucchiara
Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, at both the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on the M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings.
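A rough sketch of a recurrent fusion cell with sigmoidal gates, iterated over backbone layers, is shown below. The dimensions, the single-gate structure, and the random features are assumptions for illustration; they are not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GatedFusionCell(nn.Module):
    """Fuse per-layer text and vision features into a running state with a sigmoid gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(3 * dim, dim)       # gate over [state, text, vision]
        self.candidate = nn.Linear(3 * dim, dim)  # candidate update

    def forward(self, state, text_feat, vision_feat):
        packed = torch.cat([state, text_feat, vision_feat], dim=-1)
        g = torch.sigmoid(self.gate(packed))      # how much of the state to overwrite
        c = torch.tanh(self.candidate(packed))    # fused candidate representation
        return (1 - g) * state + g * c            # LSTM-like convex update

cell = GatedFusionCell(dim=256)
state = torch.zeros(2, 256)                       # batch of 2 queries
for _layer in range(4):                           # iterate over backbone layers
    text_feat, vision_feat = torch.randn(2, 256), torch.randn(2, 256)
    state = cell(state, text_feat, vision_feat)
print(state.shape)
```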
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
Qirui Jiao · Daoyuan Chen · Yilun Huang · Bolin Ding · Yaliang Li · Ying Shen
High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel data synthesis method, leveraging insights from contrastive learning and image difference captioning to enhance fine-grained image recognition in MLLMs. By analyzing object differences in detailed regions between similar images, we challenge the model to identify both matching and distinct components. Specifically, our method initially creates pairs of similar images that highlight object variations. After that, we introduce a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for describing them. The outcome is a high-quality dataset of "object replacement" samples, named Img-Diff, which can be expanded as needed due to its automation. We use the generated dataset to finetune state-of-the-art (SOTA) MLLMs such as InternVL2, yielding comprehensive improvements across numerous image difference and Visual Question Answering tasks. For instance, the trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. Additionally, we conduct thorough evaluations to confirm the dataset's diversity, quality, and robustness, presenting several insights on the synthesis of such a contrastive dataset. We release our codes and dataset to encourage further research on multimodal data synthesis and MLLMs' fundamental capabilities for image understanding.
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
Reza Abbasi · Ali Nazari · Aminreza Sefid · Mohammadali Banayeeanzade · Mohammad Rohban · Mahdieh Baghshah
Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images.
Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition
Yifei Zhang · Chang Liu · Jin Wei · Xiaomeng Yang · Yu ZHOU · Can Ma · Xiangyang Ji
Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as a crucial supplement for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information. Our code will be made available.
SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
Dongliang Luo · Hanshen Zhu · Ziyang Zhang · Dingkang Liang · Xudong Xie · Yuliang Liu · Xiang Bai
Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between detection and recognition tasks, and 2) sub-optimal supervisions caused by inconsistency between teacher/student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS that leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important information in locations and transcriptions from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7\%, +5.6\%, and +2.6\% H-mean under 0.5\%, 1\%, and 2\% labeled data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0\%. Compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvement on different text spotters. Code will be available.
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
seil kang · Jinyeong Kim · Junhyeok Kim · Seong Jae Hwang
Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual grounding, though they inevitably require fine-tuning and additional model components to explicitly generate bounding boxes or segmentation masks. However, we discover that a few attention heads in frozen LVLMs demonstrate strong visual grounding capabilities. We refer to these heads, which consistently capture object locations related to text semantics, as localization heads. Using localization heads, we introduce a straightforward and effective training-free visual grounding framework that utilizes text-to-image attention maps from localization heads to identify the target objects. Surprisingly, only three out of thousands of attention heads are sufficient to achieve competitive localization performance compared to existing LVLM-based visual grounding methods that require fine-tuning. Our findings suggest that LVLMs can innately ground objects based on a deep comprehension of the text-image relationship, as they implicitly focus on relevant image regions to generate informative text outputs. All the source codes will be made available to the public.
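One way to picture the training-free use of localization heads: take the text-to-image attention of a selected head over the patch grid, keep only the highest-attention patches, and return the pixel extent of those patches as a box. Head selection, the grid size, and the quantile threshold below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def attention_to_box(attn_over_patches, grid=24, image_size=336, quantile=0.99):
    """Convert one head's text-to-image attention over patches into a rough box."""
    attn = np.asarray(attn_over_patches).reshape(grid, grid)
    cutoff = np.quantile(attn, quantile)                     # keep roughly the top 1% of patches
    ys, xs = np.where(attn >= cutoff)
    patch = image_size / grid
    x1, y1 = xs.min() * patch, ys.min() * patch              # top-left corner in pixels
    x2, y2 = (xs.max() + 1) * patch, (ys.max() + 1) * patch  # bottom-right corner in pixels
    return x1, y1, x2, y2

rng = np.random.default_rng(0)
fake_attention = rng.random(24 * 24)
fake_attention[24 * 10 + 8: 24 * 10 + 14] += 5.0  # pretend the head fires on one region
print(attention_to_box(fake_attention))           # box around row 10, columns 8-13
```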
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
Teng Hu · Jiangning Zhang · Ran Yi · Jieyu Weng · Yabiao Wang · Xianfang Zeng · Xuezhucun Xue · Lizhuang Ma
Employing LLMs for visual generation has recently become a research focus. However, existing methods primarily transfer the LLM architecture to visual generation but rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal utilization of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of the visual embedding space under the LLM framework and discover that the correlation between visual embeddings can help achieve more stable and robust generation results. We present IAR, an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Firstly, we propose a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located. This approach ensures that even if the model predicts the wrong token index, there is a high probability the predicted token is located in the correct cluster, which significantly enhances the generation quality and robustness. Extensive experiments demonstrate that our method consistently enhances model training efficiency and performance for models ranging from 100M to 1.4B parameters, reducing the training time by half while achieving the same FID. Additionally, our approach can be applied to various LLM-based visual generation models and adheres to the scaling law, providing a promising direction for future research in LLM-based visual generation.
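A cluster-level cross-entropy can be sketched as below: sum the predicted token probabilities within each cluster of the (rearranged) codebook and penalize the model on the target token's cluster. The codebook size, the cluster count, and the random token-to-cluster mapping are illustrative; in the paper the mapping would come from balanced k-means, and this term would accompany the ordinary token-level loss.

```python
import torch

def cluster_cross_entropy(logits, target_tokens, token_to_cluster, num_clusters):
    probs = torch.softmax(logits, dim=-1)                 # (batch, vocab)
    cluster_probs = torch.zeros(logits.size(0), num_clusters)
    cluster_probs.index_add_(1, token_to_cluster, probs)  # sum token probabilities per cluster
    target_clusters = token_to_cluster[target_tokens]
    picked = cluster_probs.gather(1, target_clusters.unsqueeze(1)).clamp_min(1e-9)
    return -picked.log().mean()

vocab, clusters = 1024, 32
token_to_cluster = torch.randint(0, clusters, (vocab,))  # stand-in for a balanced k-means assignment
logits = torch.randn(4, vocab)
targets = torch.randint(0, vocab, (4,))
print(float(cluster_cross_entropy(logits, targets, token_to_cluster, clusters)))
```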
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
Hongrui Jia · Chaoya Jiang · Haiyang Xu · Wei Ye · Mengfan Dong · Ming Yan · Ji Zhang · Fei Huang · Shikun Zhang
As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). Specifically, SymDPO aims to break the traditional paradigm of constructing multimodal demonstrations by using random symbols to replace text answers within instances. This forces the model to carefully understand the demonstration images and establish a relationship between the images and the symbols to answer questions correctly. We validate the effectiveness of this method on multiple benchmarks, demonstrating that with SymDPO, LMMs can more effectively understand the multimodal context within examples and utilize this knowledge to answer questions better.
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
Mohsen Gholami · Mohammad Akbari · Kevin Cannons · Yong Zhang
In this work, we propose an extreme compression technique for Large Multimodal Models (LMMs). While previous studies have explored quantization as an efficient post-training compression method for Large Language Models (LLMs), low-bit compression for multimodal models remains under-explored. The redundant nature of inputs in multimodal models results in a highly sparse attention matrix. We theoretically and experimentally demonstrate that the attention matrix's sparsity bounds the compression error of the Query and Key weight matrices. Based on this, we introduce CASP, a model compression technique for LMMs. Our approach performs a data-aware low-rank decomposition on the Query and Key weight matrices, followed by quantization across all layers based on an optimal bit allocation process. CASP is compatible with any quantization technique and enhances state-of-the-art 2-bit quantization methods (AQLM and QuIP#) by an average of 21% on image- and video-language benchmarks. The code is provided in the supplementary materials.
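The rank-truncation step alone can be sketched with a plain SVD, as below; the paper's decomposition is data-aware and is followed by quantization with optimal bit allocation, neither of which is shown here, and the matrix size and rank are arbitrary.

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Return (A, B) with A @ B approximating `weight` at the given rank."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # (out_dim, rank)
    B = Vh[:rank, :]            # (rank, in_dim)
    return A, B

w_query = torch.randn(1024, 1024)  # stand-in for a query projection weight
A, B = low_rank_factorize(w_query, rank=64)
rel_error = torch.linalg.norm(w_query - A @ B) / torch.linalg.norm(w_query)
print(A.shape, B.shape, float(rel_error))  # storage drops from 1024*1024 to 2*1024*64 values
```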
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Hao Yin · Guangzong Si · Zilei Wang
Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, how MLLMs process and utilize visual information remains unclear. In this paper, a shift in the dominant flow of visual information is uncovered: (1) in shallow layers, strong interactions are observed between image tokens and instruction tokens, where most visual information is injected into instruction tokens to form cross-modal semantic representations; (2) in deeper layers, image tokens primarily interact with each other, aggregating the remaining visual information to optimize semantic representations within the visual modality. Based on these insights, we propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance. Our findings offer a new understanding of visual information processing in MLLMs and provide a state-of-the-art solution for efficient inference.
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar · Gursimran Singh · Mohammad Akbari · Yong Zhang
Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for the LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. The existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP), where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is provided in the supplementary material.
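A greedy farthest-point sketch of max-min diverse selection over visual tokens is shown below: repeatedly pick the token farthest from the already-selected set. The distance metric, the seed token, and the budget are illustrative choices, and the paper's exact MMDP solver may differ.

```python
import torch

def maxmin_select(tokens: torch.Tensor, budget: int) -> list:
    """tokens: (num_tokens, dim). Return indices of a diverse subset of size `budget`."""
    dists = torch.cdist(tokens, tokens)    # pairwise Euclidean distances
    selected = [0]                         # seed with an arbitrary token
    min_dist = dists[0].clone()            # distance of each token to the selected set
    for _ in range(budget - 1):
        nxt = int(torch.argmax(min_dist))  # token farthest from the current selection
        selected.append(nxt)
        min_dist = torch.minimum(min_dist, dists[nxt])
    return selected

visual_tokens = torch.randn(576, 1024)          # e.g. a 24x24 patch grid of token features
keep = maxmin_select(visual_tokens, budget=64)  # keep ~11% of the visual tokens
print(len(keep), sorted(keep)[:8])
```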
Libra-Merging: Importance-redundancy and Pruning-merging Trade-off for Acceleration Plug-in in Large Vision-Language Model
Longrong Yang · Dong Shen · Chaoxiang Cai · Kaibing Chen · Fan Yang · Tingting Gao · Di ZHANG · Xi Li
Large Vision-Language Models (LVLMs) have achieved significant progress in recent years. However, the expensive inference cost limits the realistic deployment of LVLMs. Some works find that visual tokens are redundant and compress tokens to reduce the inference cost. These works identify important non-redundant tokens as target tokens, then prune the remaining tokens (non-target tokens) or merge them into target tokens. However, target token identification faces the token importance-redundancy dilemma. Besides, token merging and pruning face a dilemma between disrupting target token information and losing non-target token information. To solve these problems, we propose a novel visual token compression scheme, named Libra-Merging. In target token identification, Libra-Merging selects the most important tokens from spatially discrete intervals, achieving a more robust token importance-redundancy trade-off than relying on a hyper-parameter. In token compression, when non-target tokens are dissimilar to target tokens, Libra-Merging does not merge them into the target tokens, thus avoiding disrupting target token information. Meanwhile, Libra-Merging condenses these non-target tokens into an information compensation token to prevent losing important non-target token information. Our method can serve as a plug-in for diverse LVLMs, and extensive experimental results demonstrate its effectiveness. The code will be publicly available.
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
Yiyang Du · Xiaochen Wang · Chi Chen · Jiabo Ye · Yiru Wang · Peng Li · Ming Yan · Ji Zhang · Fei Huang · Zhifang Sui · Maosong Sun · Yang Liu
Recently, model merging methods have demonstrated powerful strengths in combining abilities on various tasks from multiple Large Language Models (LLMs). While previous model merging methods mainly focus on merging homogeneous models with identical architecture, they meet challenges when dealing with Multimodal Large Language Models (MLLMs) with inherently heterogeneous properties, including differences in model architecture and the asymmetry in the parameter space. In this work, we propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging and searching. Specifically, we first design a mapping function between models to apply model merging on MLLMs with different architectures. Then we apply linear interpolation on model weights to actively adapt to the asymmetry in the heterogeneous MLLMs. Finally, in the hyper-parameter searching step, we propose an unsupervised hyper-parameter selection method for model merging. As the first model merging method capable of merging heterogeneous MLLMs without labeled data, extensive experiments on various model combinations demonstrate that AdaMMS outperforms previous model merging methods on various vision-language benchmarks.
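The merging and searching steps can be sketched as interpolating two (already mapped) state dictionaries over a small grid of coefficients and keeping the coefficient with the best unlabeled proxy score. The proxy used below (a trivial weight-norm score) and the toy checkpoints are placeholders, not the paper's selection criterion.

```python
import copy
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    """Return alpha * sd_a + (1 - alpha) * sd_b for keys with matching shapes."""
    merged = copy.deepcopy(sd_a)
    for key in merged:
        if key in sd_b and merged[key].shape == sd_b[key].shape:
            merged[key] = alpha * sd_a[key] + (1 - alpha) * sd_b[key]
    return merged

def search_alpha(sd_a, sd_b, score_fn, grid=(0.3, 0.4, 0.5, 0.6, 0.7)):
    scored = [(score_fn(interpolate_state_dicts(sd_a, sd_b, a)), a) for a in grid]
    return max(scored)[1]  # coefficient with the best unsupervised score

# Toy stand-ins for two homologous checkpoints after the mapping step.
sd_a = {"proj.weight": torch.randn(8, 8), "proj.bias": torch.zeros(8)}
sd_b = {"proj.weight": torch.randn(8, 8), "proj.bias": torch.ones(8)}
best = search_alpha(sd_a, sd_b, score_fn=lambda sd: -sd["proj.weight"].norm().item())
print(best)
```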
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
zefeng zhang · Hengzhu Tang · Jiawei Sheng · Zhenyu Zhang · YiMing Ren · Zhenyang Li · Dawei Yin · Duohe Ma · Tingwen Liu
Multimodal Large Language Models (MLLMs) excel in various tasks, yet often struggle with modality bias, tending to rely heavily on a single modality or prior knowledge when generating responses. In this paper, we propose a debiased preference optimization dataset, RLAIF-V-Bias, and introduce a Noise-Aware Preference Optimization (NAPO) algorithm. Specifically, we first construct the dataset by introducing perturbations to reduce the informational content of certain modalities, prompting the model to overly rely on a specific modality when generating responses. To address the inevitable noise in automatically constructed data, we combine the noise-robust Mean Absolute Error (MAE) with the Binary Cross-Entropy (BCE) in Direct Preference Optimization (DPO) using a negative Box-Cox transformation and dynamically adjust the algorithm's noise robustness based on the evaluated noise levels in the data. Extensive experiments validate our approach, demonstrating not only its effectiveness in mitigating modality bias but also its significant role in minimizing hallucinations.
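One plausible reading of combining BCE and MAE via a negative Box-Cox transform is the interpolating loss (1 - p^q) / q applied to the DPO preference probability p: as q approaches 0 it recovers -log p (BCE-like), and at q = 1 it gives 1 - p (MAE-like, more noise-robust). The sketch below shows this interpolation only; the paper's exact formulation and its dynamic adjustment of q against estimated noise are not reproduced.

```python
import torch

def box_cox_preference_loss(policy_margin, reference_margin, beta=0.1, q=0.5):
    p = torch.sigmoid(beta * (policy_margin - reference_margin)).clamp_min(1e-9)
    if q == 0:
        return -p.log().mean()        # BCE/DPO limit as q -> 0
    return ((1 - p ** q) / q).mean()  # approaches MAE-like 1 - p at q = 1

policy_margin = torch.tensor([2.0, -0.5, 1.2])     # log-prob margins under the policy
reference_margin = torch.tensor([0.5, -0.2, 0.3])  # log-prob margins under the reference
for q in (0.0, 0.3, 1.0):
    print(q, float(box_cox_preference_loss(policy_margin, reference_margin, q=q)))
```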
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Mingyang Song · Xiaoye Qu · Jiawei Zhou · Yu Cheng
Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from $\textit{Long-Tail (LT)}$ problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and specific tasks such as recognition and classification. Nevertheless, the exploration of LVLMs (e.g. LLaVA) and more general tasks (e.g. Visual Question Answering and Visual Reasoning) remains under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on the above observation, we propose an $\textbf{A}$daptive $\textbf{D}$ata $\textbf{R}$efinement Framework ($\textbf{ADR}$), which consists of two stages: $\textbf{D}$ata $\textbf{R}$ebalancing ($\textbf{DR}$) and $\textbf{D}$ata $\textbf{S}$ynthesis ($\textbf{DS}$). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 by a relative $4.36\%$, without increasing the training data volume. Our code and data will be publicly released.
EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models
Yinan Liang · Ziwei Wang · Xiuwei Xu · Jie Zhou · Jiwen Lu
While multimodal large language models demonstrate strong performance in complex reasoning tasks, they pose significant challenges related to model complexity during deployment, especially for resource-limited devices. In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Conventional methods rely on the training data of the original model to select the proper pruning ratio for different network components. However, these methods are impractical for large vision-language models due to the unaffordable search costs caused by web-scale training corpora. In contrast, our approach only leverages a small number of samples to search for the desired pruning policy by maximizing its generalization ability on unknown training data while maintaining the model accuracy, which enables an optimal trade-off between accuracy and efficiency for large vision-language models. Specifically, we formulate the generalization gap of the pruning strategy using the structural risk minimization principle. Based on both task performance and generalization capability, we iteratively search for the optimal pruning policy within a given search space and optimize the vision projector to evolve the search space with a higher upper bound of performance. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering. Using only 64 samples for pruning policy search, EfficientLLaVA achieves an accuracy of 83.05\% on ScienceQA, along with a 1.8$\times$ speedup compared to the dense LLaVA-v1.5-7B model.
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
Qizhou Chen · Chengyu Wang · Dakan Wang · Taolin Zhang · Wangyue Li · Xiaofeng He
Model editing aims to correct inaccurate knowledge, update outdated information, and incorporate new data into Large Language Models (LLMs) without the need for retraining. This task poses challenges in lifelong scenarios where edits must be continuously applied for real-world applications. While some editors demonstrate strong robustness for lifelong editing in pure LLMs, Vision LLMs (VLLMs), which incorporate an additional vision modality, are not directly adaptable to existing LLM editors. In this paper, we propose LiveEdit, a lifelong vision language model editing framework that bridges the gap between lifelong LLM editing and VLLMs. We begin by training an editing expert generator to independently produce low-rank experts for each editing instance, with the goal of correcting the relevant responses of the VLLM. A hard filtering mechanism is developed to utilize visual semantic knowledge, thereby coarsely eliminating visually irrelevant experts for input queries during the inference stage of the post-edited model. Finally, to integrate visually relevant experts, we introduce a soft routing mechanism based on textual semantic relevance to achieve multi-expert fusion. For evaluation, we establish a benchmark for lifelong VLLM editing. Extensive experiments demonstrate that LiveEdit offers significant advantages in lifelong VLLM editing scenarios. Further experiments validate the rationality and effectiveness of each module design in LiveEdit.
Distraction is All You Need for Multimodal Large Language Model Jailbreaking
Zuopeng Yang · Jiluan Fan · Anli Yan · Erdun Gao · Xin Lin · Tao Li · Kanghua Mo · Changyu Dong
Multimodal Large Language Models (MLLMs) bridge the gap between visual and textual data, enabling a range of advanced applications. However, complex internal interactions among visual elements and their alignment with text can introduce vulnerabilities, which may be exploited to bypass safety mechanisms. To address this, we analyze the relationship between image content and task and find that the complexity of subimages, rather than their content, is key. Building on this insight, we propose the $\textbf{Distraction Hypothesis}$, followed by a novel framework called Contrasting Subimage Distraction Jailbreaking ($\textbf{CS-DJ}$), to achieve jailbreaking by disrupting MLLMs' alignment through multi-level distraction strategies. CS-DJ consists of two components: structured distraction, achieved through query decomposition that induces a distributional shift by fragmenting harmful prompts into sub-queries, and visual-enhanced distraction, realized by constructing contrasting subimages to disrupt the interactions among visual elements within the model. This dual strategy disperses the model’s attention, reducing its ability to detect and mitigate harmful content. Extensive experiments across five representative scenarios and four popular closed-source MLLMs, including $\texttt{GPT-4o-mini}$, $\texttt{GPT-4o}$, $\texttt{GPT-4V}$, and $\texttt{Gemini-1.5-Flash}$, demonstrate that CS-DJ achieves an average attack success rate of $\textbf{52.40\%}$ and an average ensemble attack success rate of $\textbf{74.10\%}$. These results reveal the potential of distraction-based approaches to exploit and bypass MLLMs' defenses, offering new insights for attack strategies.
Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift
Siyuan Liang · Jiawei Liang · Tianyu Pang · Chao Du · Aishan Liu · Mingli Zhu · Xiaochun Cao · Dacheng Tao
Instruction tuning enhances large vision-language models (LVLMs) but increases their vulnerability to backdoor attacks due to their open design. Unlike prior studies in static settings, this paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains. We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness under visual and text domain shifts. Our findings reveal two insights: (1) backdoor generalizability improves when distinctive trigger patterns are independent of specific data domains or model architectures, and (2) triggers placed preferentially over clean semantic regions significantly enhance attack generalization. Based on these insights, we propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas using attributional interpretation. Experiments with OpenFlamingo, Blip-2, and Otter show that MABA significantly boosts the generalization attack success rate by 36.4\%, achieving a 97\% success rate at a 0.2\% poisoning rate. This study reveals limitations in current evaluations and highlights how enhanced backdoor generalizability poses a security threat to LVLMs, even without test data access.
ICP: Immediate Compensation Pruning for Mid-to-high Sparsity
Xin Luo · Fu Xueming · Zihang Jiang · S Kevin Zhou
The increasing adoption of large-scale models under 7 billion parameters in both language and vision domains enables inference tasks on a single consumer-grade GPU but makes fine-tuning models of this scale, especially 7B models, challenging. This limits the applicability of pruning methods that require full fine-tuning. Meanwhile, pruning methods that do not require fine-tuning perform well at low sparsity levels (10%-50%) but struggle at mid-to-high sparsity levels (50%-70%), where the error behaves equivalently to that of semi-structured pruning. To address these issues, this paper introduces ICP, which finds a balance between full fine-tuning and zero fine-tuning. First, Sparsity Rearrange is used to reorganize the predefined sparsity levels, followed by Block-wise Compensate Pruning, which alternates pruning and compensation on the model’s backbone, fully utilizing inference results while avoiding full model fine-tuning. Experiments show that ICP improves performance at mid-to-high sparsity levels compared to baselines, with only a slight increase in pruning time and no additional peak memory overhead.
Vision-Language Model IP Protection via Prompt-based Learning
Lianyu Wang · Meng Wang · Huazhu Fu · Daoqiang Zhang
Vision-language models (VLMs) like CLIP (Contrastive Language-Image Pre-Training) have seen remarkable success in visual recognition, highlighting the increasing need to safeguard the intellectual property (IP) of well-trained models. Effective IP protection extends beyond ensuring authorized usage; it also necessitates restricting model deployment to authorized data domains, particularly when the model is fine-tuned for specific target domains. However, current IP protection methods often rely solely on the visual backbone, which may lack sufficient semantic richness. To bridge this gap, we introduce IP-CLIP, a lightweight IP protection strategy tailored to CLIP, employing a prompt-based learning approach. By leveraging the frozen visual backbone of CLIP, we extract both image style and content information, incorporating them into the learning of the IP prompt. This strategy acts as a robust barrier, effectively preventing the unauthorized transfer of features from authorized domains to unauthorized ones. Additionally, we propose a style-enhancement branch that constructs feature banks for both authorized and unauthorized domains. This branch integrates self-enhanced and cross-domain features, further strengthening IP-CLIP’s capability to block features from unauthorized domains. Finally, we present three new metrics designed to better balance the performance degradation of authorized and unauthorized domains. Comprehensive experiments in various scenarios demonstrate its promising potential for application in IP protection tasks for VLMs.
A3: Few-shot Prompt Learning of Unlearnable Examples with Cross-Modal Adversarial Feature Alignment
Xuan Wang · Xitong Gao · Dongping Liao · Tianrui Qin · Yu-liang Lu · Cheng-Zhong Xu
In the age of pervasive machine learning applications, protecting digital content from unauthorized use has become a pressing concern. Unlearnable examples (UEs), i.e., data modified with imperceptible perturbations to inhibit model training while preserving human usability, have emerged as a promising approach. However, existing UE methods assume unauthorized trainers have extensive exposure to UEs or that models are trained from scratch, which may not hold in practical scenarios. This paper investigates the effectiveness of UEs under the few-shot learning paradigm, pitching it against visual prompt learning (VPL) models that leverage pretrained vision-language models (VLMs), like CLIP, capable of generalizing to new classes with minimal data. To address this, we introduce an adaptive UE framework to generate unlearnable examples that specifically target the VPL process. In addition, we propose a novel UE countermeasure, A3, with cross-modal adversarial feature alignment, specifically designed to circumvent UEs under few-shot VPL. Experimental evaluations on 7 datasets show that A3 outperforms existing VPL methods, achieving up to 33% higher performance in learning from UEs. For example, in the scenario involving $\ell_\infty$-bounded EM perturbations, A3 has an average harmonic mean accuracy across 7 datasets of 82.43%, compared to CoCoOp's baseline of 65.47%. Our findings highlight the limitations of existing UEs against VPL and lay the foundation for future data protection mechanisms.
Explaining Domain Shifts in Language: Concept Erasing for Interpretable Image Classification
Zequn Zeng · Yudi Su · Jianqiao Sun · Tiansheng Wen · Hao Zhang · Zhengjue Wang · Bo Chen · Hongwei Liu · Jiawei Ma
Concept-based models are inherently interpretable, as they map black-box representations to human-understandable concepts, making the decision-making process more transparent. These models allow users to understand the reasoning behind predictions, which is crucial for high-stakes applications. However, they often introduce domain-specific concepts that contribute to the final predictions, which can undermine their generalization capabilities. In this paper, we propose a novel Language-guided Concept-Erasing (lanCE) framework. Specifically, we empirically demonstrate that pre-trained vision-language models (VLMs) can approximate distinct visual domain shifts via a large domain descriptor set. Based on these findings, we introduce a novel plug-in domain descriptor orthogonality (DDO) regularizer to mitigate the impact of these domain-specific concepts on the final predictions. To simulate a wide range of unseen visual domains, we generate a set of domain descriptors by prompting large language models (LLMs). Notably, our proposed DDO regularizer is agnostic to the design of concept-based models and thus can be widely integrated into various such models. By integrating the proposed DDO regularizer into several prevailing models and evaluating them on two standard domain generalization benchmarks and three new benchmarks introduced in this paper, we demonstrate that DDO loss can significantly improve the out-of-distribution (OOD) generalization capabilities over the previous state-of-the-art concept-based models. Codes are available in the supplementary material.
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
Xueqing Wu · Yuheng Ding · Bingxuan Li · Pan Lu · Da Yin · Kai-Wei Chang · Nanyun Peng
The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement. However, a systematic analysis of such capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Compared to existing work that uses a single scalar value to critique the entire reasoning [4], VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought and provide natural language explanations to support their judgments. Extensive evaluation of 24 LVLMs demonstrates that human-written critiques significantly enhance the performance after correction, showcasing the potential of the self-improvement strategy. However, the model-generated critiques are less helpful and sometimes detrimental to the performance, suggesting that critique is the crucial bottleneck. We identify three common patterns in critique failures: failure to critique visual perception, reluctance to "say no", and exaggerated assumption of error propagation. To address these issues, we propose an effective LookBack strategy that revisits the image to verify each piece of information in the initial reasoning. LookBack significantly improves critique and correction performance by up to 13.5%.
Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM
Qiyuan Dai · Sibei Yang
Vision-Language Models (VLMs) have become prominent in open-world image recognition for their strong generalization abilities. Yet, their effectiveness in practical applications is compromised by domain shifts and distributional changes, especially when test data distributions diverge from training data. Therefore, the paradigm of test-time adaptation (TTA) has emerged, enabling the use of online off-the-shelf data at test time, supporting independent sample predictions, and eliminating reliance on test annotations. Traditional TTA methods, however, often rely on costly training or optimization processes, or make unrealistic assumptions about accessing or storing historical training and test data. Instead, this study proposes FreeTTA, a training-free and universally available method that makes no such assumptions, to enhance the flexibility of TTA. More importantly, FreeTTA is the first to explicitly model the test data distribution, enabling the use of intrinsic relationships among test samples to enhance predictions of individual samples without simultaneous access—a direction not previously explored. FreeTTA achieves these advantages by introducing an online EM algorithm that utilizes zero-shot predictions from VLMs as priors to iteratively compute the posterior probabilities of each online test sample and update parameters. Experiments demonstrate that FreeTTA achieves stable and significant improvements compared to state-of-the-art methods across 15 datasets in both cross-domain and out-of-distribution settings.
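To make the online EM idea concrete, the following Python sketch shows one way a zero-shot VLM prediction could serve as a prior that is combined with a likelihood under running class means, with each test sample then updating those means. All names (OnlineEMAdapter, tau) and the specific update rule are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of an online-EM test-time adaptation step in the spirit of FreeTTA.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class OnlineEMAdapter:
    def __init__(self, text_embeds, tau=0.01):
        # text_embeds: (C, D) L2-normalized text embeddings, one per class
        self.text = text_embeds
        self.mu = text_embeds.copy()          # class means, initialized from text prototypes
        self.counts = np.ones(len(text_embeds))
        self.tau = tau

    def step(self, x):
        # x: (D,) L2-normalized image embedding of a single online test sample
        prior = softmax(x @ self.text.T / self.tau)   # zero-shot prediction as prior
        lik = softmax(x @ self.mu.T / self.tau)       # likelihood under current class means
        post = prior * lik
        post /= post.sum()                            # E-step: posterior responsibilities
        # M-step: responsibility-weighted running update of the class means
        self.counts += post
        self.mu += post[:, None] * (x[None, :] - self.mu) / self.counts[:, None]
        self.mu /= np.linalg.norm(self.mu, axis=1, keepdims=True)
        return post                                   # refined per-sample prediction

# toy usage with random embeddings
rng = np.random.default_rng(0)
T = rng.normal(size=(10, 64)); T /= np.linalg.norm(T, axis=1, keepdims=True)
adapter = OnlineEMAdapter(T)
x = rng.normal(size=64); x /= np.linalg.norm(x)
print(adapter.step(x).argmax())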
SAIST: Segment Any Infrared Small Target Model Guided by Contrastive Language-Image Pretraining
Mingjin Zhang · Xiaolong Li · Fei Gao · Jie Guo · Xinbo Gao · Jing Zhang
Infrared Small Target Detection (IRSTD) aims to identify low signal-to-noise ratio small targets in infrared images with complex backgrounds, which is crucial for various applications. However, existing IRSTD methods typically rely solely on image modalities for processing, which fail to fully capture contextual information, leading to limited detection accuracy and adaptability in complex environments. Inspired by vision-language models, this paper proposes a novel framework, SAIST, which integrates textual information with image modalities to enhance IRSTD performance. The framework consists of two main components: Scene Recognition Contrastive Language-Image Pretraining (SR-CLIP) and CLIP-guided Segment Anything Model (CG-SAM). SR-CLIP generates a set of visual descriptions through object-object similarity and object-scene relevance, embedding them into learnable prompts to refine the textual description set. This reduces the domain gap between vision and language, generating precise textual and visual prompts. CG-SAM utilizes the prompts generated by SR-CLIP to accurately guide the Mask Decoder in learning prior knowledge of background features, while incorporating infrared imaging equations to improve small target recognition in complex backgrounds and significantly reduce the false alarm rate. Additionally, this paper introduces the first multimodal IRSTD dataset, MIRSTD, which contains abundant image-text pairs. Experimental results demonstrate that the proposed SAIST method outperforms existing state-of-the-art approaches. The dataset and code will be made publicly available.
Domain Generalization in CLIP via Learning with Diverse Text Prompts
Changsong Wen · Zelin Peng · Yu Huang · Xiaokang Yang · Wei Shen
Domain generalization (DG) aims to train a model on source domains that can generalize well to unseen domains. Recent advances in Vision-Language Models (VLMs), such as CLIP, exhibit remarkable generalization capabilities across a wide range of data distributions, benefiting tasks like DG. However, CLIP is pre-trained by aligning images with their descriptions, which inevitably captures domain-specific details. Moreover, adapting CLIP to source domains with limited feature diversity introduces bias. These limitations hinder the model's ability to generalize across domains. In this paper, we propose a new DG approach by learning with diverse text prompts. These text prompts incorporate varied contexts to imitate different domains, enabling the DG model to learn domain-invariant features. The text prompts guide the DG model's learning in three aspects: feature suppression, which uses these prompts to identify domain-sensitive features and suppress them; feature consistency, which ensures the model's features are robust to domain variations imitated by the diverse prompts; and feature diversification, which diversifies features based on the prompts to mitigate bias. Experimental results show that our approach improves domain generalization performance on five datasets from the DomainBed benchmark, achieving state-of-the-art results.
Enhanced Visual-Semantic Interaction with Tailored Prompts for Pedestrian Attribute Recognition
Junyi Wu · Yan Huang · Min Gao · Yuzhen Niu · Yuzhong Chen · Qiang Wu
Pedestrian attribute recognition (PAR) seeks to predict multiple semantic attributes associated with a specific pedestrian. There are two types of approaches for PAR: unimodal frameworks and bimodal frameworks. The former seek robust visual features but fail to exploit the semantic features of the linguistic modality. The latter utilize prompt learning techniques to integrate linguistic data; however, static prompt templates and simple bimodal concatenation cannot capture the extensive intra-class attribute variability or support active collaboration between modalities. In this paper, we propose an Enhanced Visual-Semantic Interaction with Tailored Prompts (EVSITP) framework for PAR. We present an Image-Conditional Dual-Prompt Initialization Module (IDIM) to adaptively generate context-sensitive prompts from visual inputs. Subsequently, a Prompt Enhanced and Regularization Module (PERM) is proposed to strengthen linguistic information from IDIM. We further design a Bimodal Mutual Interaction Module (BMIM) to ensure bidirectional communication between modalities. In addition, existing PAR datasets are collected over a short period in limited scenarios, which do not align with real-world conditions. Therefore, we annotate a long-term person re-identification dataset to create a new PAR dataset, Celeb-PAR. Experiments on several challenging PAR datasets show that our method outperforms state-of-the-art approaches.
LOCORE: Image Re-ranking with Long-Context Sequence Modeling
Zilin Xiao · Pavel Suma · Ayush Sachdeva · Hao-Jen Wang · Giorgos Kordopatis-Zilos · Giorgos Tolias · Vicente Ordonez
We introduce LOCORE, Long-Context Re-ranker, a model that takes as input local descriptors corresponding to an image query and a list of gallery images and outputs similarity scores between the query and each gallery image. This model is used for image retrieval, where typically a first ranking is performed with an efficient similarity measure, and then a shortlist of top-ranked images is re-ranked based on a more fine-grained similarity measure. Compared to existing methods that perform pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors, LOCORE is the first method to perform list-wise re-ranking with local descriptors. To achieve this, we leverage efficient long-context sequence models to effectively capture the dependencies between query and gallery images at the local-descriptor level. During testing, we process long shortlists with a sliding window strategy that is tailored to overcome the context size limitations of sequence models. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks of landmarks ($\mathcal{R}$Oxf and $\mathcal{R}$Par), products (SOP), fashion items (In-Shop), and bird species (CUB-200) while having significantly lower forward latency compared with pair-wise local feature re-rankers.
Visual Consensus Prompting for Co-Salient Object Detection
Jie Wang · Nana Yu · Zihao Zhang · Yahong Han
Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) This architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage. 2) This paradigm involves globally updating all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task, addressing two key limitations. It introduces, for the first time, a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with minimized tunable parameters. Concretely, the primary insight of the purposeful Consensus Prompt Generator (CPG) is to enforce limited tunable parameters to focus on co-salient representations and generate consensus prompts. The formulated Consensus Prompt Disperser (CPD) leverages consensus prompts to form task-specific visual consensus prompts, thereby unlocking the powerful potential of pre-trained models in addressing CoSOD tasks. Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving the new state of the art (with 9.6% improvement in $F_m$ metrics on the most challenging CoCA dataset).
Explainable Saliency: Articulating Reasoning with Contextual Prioritization
Nuo Chen · Ming Jiang · Qi Zhao
Deep saliency models, which predict what parts of an image capture our attention, are often like black boxes. This limits their use, especially in areas where understanding why a model makes a decision is crucial. Our research tackles this challenge by building a saliency model that can not only identify what is important in an image, but also explain its choices in a way that makes sense to humans. We achieve this by using vision-language models to reason about images and by focusing the model's attention on the most crucial information using a contextual prioritization mechanism. Unlike prior approaches that rely on fixation descriptions or soft-attention based semantic aggregation, our method directly models the reasoning steps involved in saliency prediction, generating selectively prioritized explanations that clarify why specific regions are prioritized. Comprehensive evaluations demonstrate the effectiveness of our model in generating high-quality saliency maps and coherent, contextually relevant explanations. This research is a step towards more transparent and trustworthy AI systems that can help us understand and navigate the world around us.
Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
Ziheng Zhang · Jianyang Gu · Arpita Chowdhury · Zheda Mai · David Carlyn · Tanya Berger-Wolf · Yu Su · Wei-Lun Chao
Class activation map (CAM) has been broadly investigated to highlight image regions that contribute to class predictions. Despite its simplicity to implement and computational efficiency, CAM often struggles to identify discriminative regions in fine-grained classification tasks. Previous efforts address this by introducing more sophisticated explanation processes on the model prediction, at a cost of extra complexity. In this paper, we propose Finer-CAM, which retains CAM's efficiency while achieving precise localization of discriminative regions. Our insight is that the deficiency of CAM is not about how it explains but what it explains. Previous methods independently look at all the possible hints that explain the target class prediction, which inevitably also activates regions predictive of similar classes. By explicitly comparing and spotting the difference between the target class and similar classes, Finer-CAM suppresses features shared with other classes, and emphasizes the unique key details of the target class. The method allows adjustable comparison strength, enabling Finer-CAM to focus coarsely on general object contours or discriminatively on fine-grained details. The method is compatible with various CAM methods, and can be extended to multi-modal models to accurately activate specific concepts. Quantitatively, we demonstrate that masking out the top 5% activated pixels by Finer-CAM leads to a larger relative confidence drop compared with baselines. The source code is attached in the supplementary material.
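A minimal sketch of the compare-and-contrast idea, assuming a plain linear classification head; the similar-class selection heuristic, the alpha strength, and the finer_cam helper are illustrative assumptions rather than the paper's exact procedure.

# Sketch of a "spot the difference" CAM over a linear classifier head.
import numpy as np

def finer_cam(feature_maps, class_weights, target, alpha=1.0):
    """feature_maps: (K, H, W) last-layer activations for one image.
    class_weights: (C, K) weights of the linear classification head.
    target: index of the class to explain."""
    # pick the most similar other class in weight space as the contrast class
    sims = class_weights @ class_weights[target]
    sims[target] = -np.inf
    similar = int(sims.argmax())
    # contrastive weighting suppresses evidence shared with the similar class
    w = class_weights[target] - alpha * class_weights[similar]
    cam = np.einsum("k,khw->hw", w, feature_maps)
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-8)

# toy usage
rng = np.random.default_rng(0)
F = rng.random((512, 7, 7)); W = rng.normal(size=(200, 512))
heatmap = finer_cam(F, W, target=3, alpha=0.7)
print(heatmap.shape)  # (7, 7)

Setting alpha to zero recovers a standard CAM, while larger values emphasize only the evidence that separates the target class from its nearest competitor.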
Perceptual Inductive Bias Is What You Need Before Contrastive Learning
Junru Zhao · Tianqin Li · Dunhan Jiang · Shenghao Wu · Alan Ramirez · Tai Sing Lee
David Marr’s seminal theory of human perception proposes that visual representation learning should follow a multi-stage process, prioritizing the derivation of boundary and surface properties before forming semantic object representations. In contrast, contrastive representation learning frameworks typically bypass this explicit multi-stage approach, defining their objective as the direct learning of a semantic representation space for objects. While effective in general contexts, this approach sacrifices the inductive biases of vision, leading to slower convergence and shortcut learning such as texture bias. In this work, we demonstrate that leveraging Marr’s multi-stage theory—by first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics—leads to 2x faster convergence on ResNet18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out-of-distribution capability. Together, we propose a pretraining stage before the general contrastive representation pretraining to further enhance the final representation quality and reduce the overall convergence time via inductive bias from human vision systems.
Scaling Vision Pre-Training to 4K Resolution
Baifeng Shi · Boyi Li · Han Cai · Yao Lu · Sifei Liu · Marco Pavone · Jan Kautz · Song Han · Trevor Darrell · Pavlo Molchanov · Danny Yin
High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 384x384) due to the quadratic cost of processing larger images. We introduce PS3 (Pre-training with Scale-Selective Scaling), which scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of processing entire global images, PS3 is pre-trained to selectively process local regions and contrast them with local detailed captions, allowing it to learn detailed representation at high resolution with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global low-resolution image and select local high-resolution regions to process based on their saliency or relevance to a text prompt. When applied to multi-modal LLMs (MLLMs), PS3 demonstrates performance that effectively scales with the pre-training resolution and significantly improves over baselines without high-resolution pre-training. We also find current benchmarks do not require recognizing details at 4K resolution, which motivates us to propose 4KPro, a new benchmark that evaluates visual perception at 4K resolution, on which PS3 outperforms state-of-the-art MLLMs, including a 13% improvement over GPT-4o.
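The region-selection step can be pictured with a short Python sketch: score each cell of a low-resolution patch grid by its relevance to a prompt embedding and keep only a small budget of cells for high-resolution encoding. The grid layout, scoring, and function name are assumptions for illustration, not the PS3 implementation.

# Illustrative prompt-guided selection of grid cells for high-resolution processing.
import numpy as np

def select_regions(patch_embeds, text_embed, grid, budget=4):
    """patch_embeds: (H*W, D) embeddings of low-res patch grid cells;
    text_embed: (D,) prompt embedding; grid: (H, W)."""
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    relevance = p @ t                               # prompt relevance per grid cell
    top = np.argsort(-relevance)[:budget]           # only these cells get high-res crops
    return [(int(i // grid[1]), int(i % grid[1])) for i in top]

rng = np.random.default_rng(0)
cells = select_regions(rng.normal(size=(24 * 24, 512)), rng.normal(size=512), grid=(24, 24))
print(cells)  # grid coordinates to re-encode at high resolution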
Multimodal Autoregressive Pre-training of Large Vision Encoders
Enrico Fini · Mustafa Shukor · Xiujun Li · Philipp Dufter · Michal Klein · David Haldimann · Sai Aitharaju · Victor Guilherme Turrisi da Costa · Louis Béthune · Zhe Gan · Alexander Toshev · Marcin Eichner · Moin Nabi · Yinfei Yang · Joshua Susskind · Alaaeldin El-Nouby
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
Sensitivity-Aware Efficient Fine-Tuning via Compact Dynamic-Rank Adaptation
Tianran Chen · Jiarui Chen · Baoquan Zhang · Zhehao Yu · Shidong Chen · Rui Ye · Xutao Li · Yunming Ye
Parameter-Efficient Fine-Tuning (PEFT) is a fundamental research problem in computer vision, which aims to tune a small number of parameters for efficient storage and adaptation of pre-trained vision models. Recently, the sensitivity-aware parameter-efficient fine-tuning method (SPT) addressed this problem by identifying sensitive parameters and then leveraging their sparse characteristic to combine unstructured and structured tuning for PEFT. However, existing methods only focus on the sparse characteristic of sensitive parameters and overlook their distribution characteristic, which results in additional storage burden and limited performance improvement. In this paper, we find that the distribution of sensitive parameters is not chaotic, but concentrates in a small number of rows or columns of each parameter matrix. Inspired by this fact, we propose a Compact Dynamic-Rank Adaptation-based tuning method for Sensitivity-aware Parameter efficient fine-Tuning, called CDRA-SPT. Specifically, we first identify the sensitive parameters that require tuning for each downstream task. Then, we reorganize the sensitive parameters by their rows and columns into a compact sub-parameter matrix. Finally, a dynamic-rank adaptation is designed and applied at the sub-parameter matrix level for PEFT. Its advantage is that the dynamic-rank characteristic of the sub-parameter matrix can be fully exploited for PEFT. Extensive experiments show that our method achieves superior performance over previous state-of-the-art methods.
UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning
Long Zhou · Fereshteh Shakeri · Aymen Sadraoui · Mounir Kaaniche · Jean-Christophe Pesquet · Ismail Ben Ayed
Transductive few-shot learning has recently triggered wide attention in computer vision. Yet, current methods introduce key hyper-parameters, which control the prediction statistics of the test batches, such as the level of class balance, affecting performances significantly. Such hyper-parameters are empirically grid-searched over validation data, and their configurations may vary substantially with the target dataset and pre-training model, making such empirical searches both sub-optimal and computationally intractable. In this work, we advocate and introduce the unrolling paradigm, also referred to as “learning to optimize”, in the context of few-shot learning, thereby learning efficiently and effectively a set of optimized hyperparameters. Specifically, we unroll a generalization of the ubiquitous Expectation-Maximization (EM) optimizer into a neural network architecture, mapping each of its iterates to a layer and learning a set of key hyper-parameters over validation data. Our unrolling approach covers various statistical feature distributions and pre-training paradigms, including recent foundational vision-language models and standard vision-only classifiers. We report comprehensive experiments, which cover a breadth of fine-grained downstream image classification tasks, showing significant gains brought by the proposed unrolled EM algorithm over iterative variants. The achieved improvements reach up to 10% and 7.5% on vision-only and vision-language benchmarks, respectively. The source code and learned parameters are available at https://anonymous.4open.science/r/UNEM .
MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining
Yunze Liu · Li Yi
Hybrid Mamba-Transformer networks have recently garnered broad attention. These networks can leverage the scalability of Transformers while capitalizing on Mamba's strengths in long-context modeling and computational efficiency. However, the challenge of effectively pretraining such hybrid networks remains an open question. Existing methods, such as Masked Autoencoders (MAE) or autoregressive (AR) pretraining, primarily focus on single-type network architectures. In contrast, pretraining strategies for hybrid architectures must be effective for both Mamba and Transformer components. Based on this, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of both MAE and Autoregressive pretraining, improving the performance of Mamba and Transformer modules within a unified paradigm. Experimental results show that the hybrid Mamba-Transformer vision backbone network pretrained with MAP significantly outperforms other pretraining strategies, achieving state-of-the-art performance. We validate the method's effectiveness on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component. We will release the code and checkpoints.
APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers
Zhuguanyu Wu · Jiayi Zhang · Jiaxin Chen · Jinyang Guo · Di Huang · Yunhong Wang
Vision Transformers (ViTs) have become one of the most commonly used visual backbones. While ViTs generally achieve higher accuracy than equivalently sized CNNs, they often experience significant accuracy loss during quantized deployment, especially with ultra-low bit post-training quantization (PTQ) schemes. Current reconstruction-based PTQ methods commonly used for CNNs experience significant accuracy drops on ViTs, due to the inaccurate estimation of output importance and the large accuracy drop in quantizing post-GELU activation. To address these issues, we propose APHQ-ViT, a new PTQ paradigm based on average perturbation Hessian (APH) importance estimation. To assess output importance, we thoroughly analyze the approximations used in previous works with Hessian loss and the reasons for their low accuracy, and propose a more precise average perturbation Hessian loss. To address the issue of quantizing post-GELU activation, we propose an MLP-Reconstruction (MR) technique that reconstructs the MLP using the average perturbation Hessian importance estimation. As well as reducing the activation range, MLP-Reconstruction replaces the activation function of the MLP from GELU to ReLU with the small unlabeled calibration set, further enhancing the model's accuracy. With only linear quantizers, APHQ-ViT outperforms existing methods by a substantial margin on 3-bit and 4-bit PTQ of ViTs in different vision tasks.
Two is Better than One: Efficient Ensemble Defense for Robust and Compact Models
Yoojin Jung · Byung Cheol Song
Deep learning-based computer vision systems adopt complex and large architectures to improve performance, yet they face challenges in deployment on resource-constrained mobile and edge devices. To address this issue, model compression techniques such as pruning, quantization, and matrix factorization have been proposed; however, these compressed models are often highly vulnerable to adversarial attacks. We introduce the Efficient Ensemble Defense (EED) technique, which diversifies the compression of a single base model based on different pruning importance scores and enhances ensemble diversity to achieve high adversarial robustness and resource efficiency. EED dynamically determines the number of necessary sub-models during the inference stage, minimizing unnecessary computations while maintaining high robustness. On the CIFAR-10 and SVHN datasets, EED demonstrated state-of-the-art robustness performance compared to existing adversarial pruning techniques, along with an inference speed improvement of up to 1.86 times. This proves that EED is a powerful defense solution in resource-constrained environments.
Building Vision Models upon Heat Conduction
Zhaozhi Wang · Yue Liu · Yunjie Tian · Yunfan Liu · Yaowei Wang · Qixiang Ye
Visual representation models leveraging attention mechanisms are challenged by significant computational overhead, particularly when pursuing large receptive fields. In this study, we aim to mitigate this challenge by introducing the Heat Conduction Operator (HCO) built upon the physical heat conduction principle. HCO conceptualizes image patches as heat sources and models their correlations through adaptive thermal energy diffusion, enabling robust visual representations. HCO enjoys a computational complexity of $O(N^{1.5})$, as it can be implemented using discrete cosine transformation (DCT) operations. HCO is plug-and-play; combining it with deep learning backbones produces visual representation models (termed vHeat) with global receptive fields. Experiments across vision tasks demonstrate that, beyond the stronger performance, vHeat achieves up to a 3x throughput, 80% less GPU memory allocation, and 35% fewer computational FLOPs compared to the Swin-Transformer.
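As a toy illustration of why heat conduction can be solved cheaply with DCTs, the sketch below damps each spatial frequency of a feature map by an exponential factor, which is the closed-form solution of the 2D heat equation under DCT boundary conditions. The fixed scalar diffusivity k and the function name stand in for whatever learnable, adaptive parameters vHeat actually uses.

# Toy heat-conduction-style operator solved in the DCT domain.
import numpy as np
from scipy.fft import dctn, idctn

def heat_conduction_operator(x, k=1.0, t=0.1):
    """x: (C, H, W) feature map treated as C independent heat sources."""
    C, H, W = x.shape
    # DCT-II frequencies along each spatial axis
    wy = np.pi * np.arange(H) / H
    wx = np.pi * np.arange(W) / W
    decay = np.exp(-k * t * (wy[:, None] ** 2 + wx[None, :] ** 2))  # (H, W)
    out = np.empty_like(x)
    for c in range(C):
        spec = dctn(x[c], type=2, norm="ortho")
        out[c] = idctn(spec * decay, type=2, norm="ortho")
    return out

feat = np.random.rand(4, 16, 16).astype(np.float64)
smoothed = heat_conduction_operator(feat, k=2.0, t=0.05)
print(smoothed.shape)  # (4, 16, 16)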
LSNet: See Large, Focus Small
Ao Wang · Hui Chen · Zijia Lin · Jungong Han · Guiguang Ding
Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a ``See Large, Focus Small'' strategy for lightweight vision network design. We introduce LS (\textbf{L}arge-\textbf{S}mall) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Codes and models will be publicly available.
SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers
Nikaan Nikzad · YI LIAO · Yongsheng Gao · Jun Zhou
Over the past few years, vision transformers (ViTs) have consistently demonstrated remarkable performance across various visual recognition tasks. However, attempts to enhance their robustness have yielded limited success, mainly focusing on different training strategies, input patch augmentation, or network structural enhancements. These approaches often involve extensive training and fine-tuning, which are time-consuming and resource-intensive. To tackle these obstacles, we introduce a novel approach named Spatial Autocorrelation Token Analysis (SATA). By harnessing spatial relationships between token features, SATA enhances both the representational capacity and robustness of ViT models. This is achieved through the analysis and grouping of tokens according to their spatial autocorrelation scores prior to their input into the Feed-Forward Network (FFN) block of the self-attention mechanism. Importantly, SATA seamlessly integrates into existing pre-trained ViT baselines without requiring retraining or additional fine-tuning, while concurrently improving efficiency by reducing the computational load of the FFN units. Experimental results show that the baseline ViTs enhanced with SATA not only achieve a new state-of-the-art top-1 accuracy on ImageNet-1K image classification (94.9\%) but also establish new state-of-the-art performance across multiple robustness benchmarks, including ImageNet-A (top-1=63.6\%), ImageNet-R (top-1=79.2\%), and ImageNet-C (mCE=13.6\%), all without requiring additional training or fine-tuning of baseline models.
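One way to picture the token analysis described above is to score every patch token by how strongly it agrees with its spatial neighbours and then split the tokens into two groups before the FFN. The 8-neighbour cosine score, the median split, and the function name below are assumptions made for this sketch, not the SATA algorithm itself.

# Hedged sketch: score tokens by spatial autocorrelation with grid neighbours, then split.
import torch
import torch.nn.functional as F

def autocorrelation_split(tokens, grid_hw, keep_quantile=0.5):
    """tokens: (B, N, D) patch tokens arranged on an (H, W) grid."""
    B, N, D = tokens.shape
    H, W = grid_hw
    x = tokens.transpose(1, 2).reshape(B, D, H, W)
    # average of the 8-neighbourhood via a mean filter (excluding self)
    kernel = torch.ones(1, 1, 3, 3) / 8.0
    kernel[0, 0, 1, 1] = 0.0
    neigh = F.conv2d(x.reshape(B * D, 1, H, W), kernel, padding=1).reshape(B, D, H, W)
    score = F.cosine_similarity(x, neigh, dim=1).reshape(B, N)   # per-token autocorrelation
    thresh = score.quantile(keep_quantile, dim=1, keepdim=True)
    high = score >= thresh          # tokens routed through the FFN
    low = ~high                     # tokens that could bypass it
    return high, low

toks = torch.randn(2, 14 * 14, 192)
hi, lo = autocorrelation_split(toks, (14, 14))
print(hi.sum(dim=1))  # roughly half the tokens per image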
Token Cropr: Faster ViTs for Quite a Few Tasks
Benjamin Bergner · Christoph Lippert · Aravindh Mahendran
The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.
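The sketch below shows the general shape of relevance-based token pruning between transformer blocks: a small head scores patch tokens and only the top fraction is kept. The TokenPruner module, the linear scoring head, and the keep ratio are illustrative assumptions; the paper's auxiliary heads are trained against task supervision and discarded afterwards.

# Sketch of auxiliary-head token pruning between ViT blocks.
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)   # auxiliary head predicting task relevance
        self.keep_ratio = keep_ratio

    def forward(self, tokens):
        # tokens: (B, 1 + N, D) with a leading CLS token that is always kept
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        scores = self.score_head(patches).squeeze(-1)          # (B, N)
        k = max(1, int(patches.shape[1] * self.keep_ratio))
        keep_idx = scores.topk(k, dim=1).indices               # indices of most relevant tokens
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
        kept = torch.gather(patches, 1, keep_idx)              # (B, k, D)
        return torch.cat([cls_tok, kept], dim=1)

pruner = TokenPruner(dim=192, keep_ratio=0.5)
x = torch.randn(2, 1 + 196, 192)
print(pruner(x).shape)  # torch.Size([2, 99, 192])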
Hypergraph Vision Transformers: Images are More than Nodes, More than Edges
Joshua Fixelle
Recent advancements in computer vision have highlighted the scalability of Vision Transformers (ViTs) across various tasks, yet challenges remain in balancing adaptability, computational efficiency, and the ability to model higher-order relationships. Vision Graph Neural Networks (ViGs) offer an alternative by leveraging graph-based methodologies but are hindered by the computational bottlenecks of clustering algorithms used for edge generation. To address these issues, we propose the Hypergraph Vision Transformer (HgVT), which incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework to capture higher-order semantic relationships while maintaining computational efficiency. HgVT leverages population and diversity regularization for dynamic hypergraph construction without clustering, and expert edge pooling to enhance semantic extraction and facilitate graph-based image retrieval. Empirical results demonstrate that HgVT achieves state-of-the-art performance on image classification and competitive performance on image retrieval, positioning it as an efficient and adaptable framework for semantic-based vision tasks.
Interpretable Image Classification via Non-parametric Part Prototype Learning
Zhijie Zhu · Lei Fan · Maurice Pagnucco · Yang Song
Classifying images with an interpretable decision-making process is a long-standing problem in computer vision. In recent years, Prototypical Part Networks have gained traction as an approach for self-explainable neural networks, due to their ability to mimic human visual reasoning by providing explanations based on prototypical object parts. However, the quality of the explanations generated by these methods leaves room for improvement, as the prototypes usually focus on repetitive and redundant concepts. Leveraging recent advances in prototype learning, we present a framework for part-based interpretable image classification that learns a set of semantically distinctive object parts for each class, and provides diverse and comprehensive explanations. The core of our method is to learn the part-prototypes in a non-parametric fashion, through clustering deep features extracted from foundation vision models that encode robust semantic information. To quantitatively evaluate the quality of explanations provided by ProtoPNets, we introduce Distinctiveness Score and Comprehensiveness Score. Through evaluation on CUB-200-2011, Stanford Cars and Stanford Dogs datasets, we show that our framework compares favourably against existing ProtoPNets while achieving better interpretability.
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
Fanding Huang · Jingyan Jiang · Qinting Jiang · Li Hebei · Faisal Nadeem Khan · Zhi Wang
Vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. While cache-based methods show promise by leveraging historical information, they struggle with both caching unreliable feature-label pairs and indiscriminately using single-class information during querying, significantly compromising adaptation accuracy. To address these limitations, we propose \textbf{COSMIC} (\underline{C}lique-\underline{O}riented \underline{S}emantic \underline{M}ulti-space \underline{I}ntegration for \underline{C}LIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: \textit{Dual Semantics Graph} (DSG) and \textit{Clique Guided Hyper-class} (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building upon these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to enhance prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC's superior performance across multiple benchmarks, achieving significant improvements over state-of-the-art methods: 15.81\% gain on out-of-distribution tasks and 5.33\% on cross-domain generalization with CLIP RN-50.
Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
Jiho Choi · Seonho Lee · Minhyun Lee · Seungho Lee · Hyunjung Shim
Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained object parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
Vladan Stojnić · Yannis Kalantidis · Jiri Matas · Giorgos Tolias
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance across a diverse set of datasets.
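A compact sketch of label propagation over patch nodes is given below, assuming a kNN affinity built from vision-model patch features and the classic iterative update Y <- alpha * S Y + (1 - alpha) * Y0; the pixel-level refinement stage is omitted, and parameter values are arbitrary.

# Minimal label-propagation sketch over patch predictions.
import numpy as np

def label_propagation(feats, init_logits, alpha=0.9, n_iter=20, knn=16):
    """feats: (N, D) patch features; init_logits: (N, C) per-patch VLM predictions."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    # sparsify: keep only the top-k neighbours per node (skipping self at index 0)
    W = np.zeros_like(sim)
    nn_idx = np.argsort(-sim, axis=1)[:, 1:knn + 1]
    rows = np.repeat(np.arange(len(feats)), knn)
    W[rows, nn_idx.ravel()] = sim[rows, nn_idx.ravel()].clip(min=0)
    W = np.maximum(W, W.T)                      # symmetrize
    d = W.sum(1) + 1e-8
    S = W / np.sqrt(d[:, None] * d[None, :])    # symmetric normalization
    Y = init_logits.copy()
    for _ in range(n_iter):
        Y = alpha * S @ Y + (1 - alpha) * init_logits
    return Y.argmax(1)

rng = np.random.default_rng(0)
labels = label_propagation(rng.normal(size=(196, 64)), rng.random((196, 21)))
print(labels.shape)  # (196,)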
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
Reza Qorbani · Gianluca Villani · Theodoros Panagiotakopoulos · Marc Botet Colomer · Linus Härenstam-Nielsen · Mattia Segu · Pier Luigi Dovesi · Jussi Karlgren · Daniel Cremers · Federico Tombari · Matteo Poggi
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world application. We introduce Semantic Library Adaptation (SemLa), a novel framework for training-free, test-time domain adaptation. SemLa leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on an 18-domain benchmark built over 10 standard datasets demonstrate SemLa's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
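To illustrate the retrieve-and-merge step, the sketch below picks the adapters whose index embeddings are closest to the target-domain embedding and fuses their low-rank deltas with softmax weights. The library layout, temperature, and top-k weighting are assumptions for this example, not the SemLa implementation.

# Sketch: retrieve LoRA adapters by embedding proximity and fuse their weight deltas.
import numpy as np

def fuse_lora(library, target_embed, top_k=3, temp=0.05):
    """library: list of dicts {"embed": (D,), "A": (r, in), "B": (out, r)}.
    Returns a single merged weight delta for one linear layer."""
    embeds = np.stack([e["embed"] for e in library])
    embeds = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    q = target_embed / np.linalg.norm(target_embed)
    sims = embeds @ q
    idx = np.argsort(-sims)[:top_k]
    w = np.exp(sims[idx] / temp)
    w /= w.sum()                                  # proximity-based fusion weights
    delta = sum(wi * (library[i]["B"] @ library[i]["A"]) for wi, i in zip(w, idx))
    return delta                                  # added to the frozen layer weight at inference

rng = np.random.default_rng(0)
lib = [{"embed": rng.normal(size=512), "A": rng.normal(size=(8, 256)) * 0.01,
        "B": rng.normal(size=(256, 8)) * 0.01} for _ in range(5)]
print(fuse_lora(lib, rng.normal(size=512)).shape)  # (256, 256)

Tracking the fusion weights per adapter also gives the kind of explainability the abstract mentions, since each merged prediction can be attributed back to specific library entries.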
SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
Hritam Basak · Zhaozheng Yin
Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation due to two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confuse classes with similar visual appearance due to limited supervision; and (2) the skewed and imbalanced training data distribution favors source representation learning while impeding the exploration of the limited information available for tail classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state-of-the-art (SoTA) methodologies. Code will be released.
Weakly Supervised Semantic Segmentation via Progressive Confidence Region Expansion
Xiangfeng Xu · Pinyi Zhang · Wenxuan Huang · Yunhang Shen · Haosheng Chen · Jingzhong Lin · Wei Li · Gaoqi He · Jiao Xie · Shaohui Lin
Weakly supervised semantic segmentation (WSSS) has garnered considerable attention due to its effective reduction of annotation costs. Most approaches utilize Class Activation Maps (CAM) to produce pseudo-labels, thereby localizing target regions using only image-level annotations. However, the prevalent methods relying on vision transformers (ViT) encounter an "over-expansion" issue, i.e., CAM incorrectly expands high activation value from the target object to the background regions, as it is difficult to learn pixel-level local intrinsic inductive bias in ViT from weak supervisions. To solve this problem, we propose a Progressive Confidence Region Expansion (PCRE) framework for WSSS, which gradually learns a faithful mask over the target region and utilizes this mask to correct the confusion in CAM. PCRE has two key components: "Confidence Region Mask Expansion" (CRME) and "Class-Prototype Enhancement" (CPE). CRME progressively expands the mask from the small region with the highest confidence, eventually encompassing the entire target, thereby avoiding unintended over-expansion. CPE aims to enhance mask generation in CRME by leveraging the similarity between the learned, dataset-level class prototypes and patch features as supervision to optimize the mask output from CRME. Extensive experiments demonstrate that our method outperforms the existing single-stage and multi-stage approaches on the PASCAL VOC and MS COCO benchmark datasets.
Beyond Background Shift: Rethinking Instance Replay in Continual Semantic Segmentation
Hongmei Yin · Tingliang Feng · Fan Lyu · Fanhua Shang · Hongying Liu · Wei Feng · Liang Wan
In this work, we focus on continual semantic segmentation (CSS), where segmentation networks are required to continuously learn new classes without erasing knowledge of previously learned ones. Although storing images of old classes and directly incorporating them into the training of new models has proven effective in mitigating catastrophic forgetting in classification tasks, this strategy presents notable limitations in CSS. Specifically, the stored and new images with partial category annotations lead to confusion between unannotated categories and the background, complicating model fitting. To tackle this, this paper proposes EIR, which not only preserves knowledge of old classes and eliminates background confusion by storing instances of old classes, but also mitigates background shifts present in the new images by integrating stored instances with new images. By effectively resolving background shifts in both stored and new images, EIR alleviates catastrophic forgetting in the CSS task, thereby enhancing the model's capacity for CSS. Experimental results validate the efficacy of our approach, which significantly outperforms state-of-the-art CSS methods.
Dual-Agent Optimization framework for Cross-Domain Few-Shot Segmentation
Zhaoyang Li · Yuan Wang · Wangkai Li · Tianzhu Zhang · Xiang Liu
Cross-Domain Few-Shot Segmentation (CD-FSS) extends the generalization ability of Few-Shot Segmentation (FSS) beyond a single domain, enabling more practical applications. However, directly employing conventional FSS methods suffers from severe performance degradation in cross-domain settings, primarily due to feature sensitivity and support-to-query matching process sensitivity across domains. Existing methods for CD-FSS either focus on domain adaptation of features or delve into designing matching strategies for enhanced cross-domain robustness. Nonetheless, they overlook the fact that these two issues are interdependent and should be addressed jointly. In this work, we tackle these two issues within a unified framework by optimizing features in the frequency domain and enhancing the matching process in the spatial domain, working jointly to handle the deviations introduced by the domain gap. To this end, we propose a coherent Dual-Agent Optimization (DATO) framework, comprising a consistent mutual aggregation (CMA) module and a correlation rectification strategy (CRS). In the consistent mutual aggregation module, we employ a set of agents to learn domain-invariant features across domains, and then use these features to enhance the original representations for feature adaptation. In the correlation rectification strategy, the agent-aggregated domain-invariant features serve as a bridge, transforming the support-to-query matching process into a referable feature space and reducing its domain sensitivity. Extensive experiments demonstrate the efficacy of our approach.
Beyond Image Classification: A Video Benchmark and Dual-Branch Hybrid Discrimination Framework for Compositional Zero-Shot Learning
Dongyao Jiang · Haodong Jing · Yongqiang Ma · Nanning Zheng
Human reasoning naturally combines concepts to identify unseen compositions, a capability that Compositional Zero-Shot Learning (CZSL) aims to replicate in machine learning models. However, we observe that focusing solely on typical image classification tasks in CZSL may limit models' compositional generalization potential. To address this, we introduce C-EgoExo, a video-based benchmark, along with a compositional action recognition task to enable more comprehensive evaluations. Inspired by human reasoning processes, we propose a Dual-branch Hybrid Discrimination (DHD) framework, featuring two branches that decode visual inputs in distinct observation sequences. Through a cross-attention mechanism and a contextual dependency encoder, DHD effectively mitigates challenges posed by conditional variance. We further design a Copula-based orthogonal decoding loss to counteract contextual interference in primitive decoding. Our approach demonstrates outstanding performance across diverse CZSL tasks, excelling in both image-based and video-based modalities and in attribute-object and action-object compositions, setting a new benchmark for CZSL evaluation.
Targeted Forgetting of Image Subgroups in CLIP Models
Zeliang Zhang · Gaowen Liu · Charles Fleming · Ramana Kompella · Chenliang Xu
Foundation models (FMs) such as CLIP have demonstrated impressive zero-shot performance across various tasks by leveraging large-scale, unsupervised pre-training. However, they often inherit harmful or unwanted knowledge from noisy internet-sourced datasets, which compromises their reliability in real-world applications. Existing model unlearning methods either rely on access to pre-trained datasets or focus on coarse-grained unlearning (e.g., entire classes), leaving a critical gap for fine-grained unlearning. In this paper, we address the challenging scenario of selectively forgetting specific portions of knowledge within a class—without access to pre-trained data—while preserving the model’s overall performance. We propose a novel three-stage approach that progressively unlearns targeted knowledge while mitigating over-forgetting. Our method consists of (1) a forgetting stage to fine-tune on samples to be forgotten, (2) a reminding stage to restore the model’s performance on retaining samples, and (3) a restoring stage to recover zero-shot capabilities using model souping. We also introduce knowledge distillation to handle the distribution disparity between forgetting/retaining samples and the unseen pre-trained data. Extensive experiments demonstrate that our approach effectively unlearns specific subgroups while maintaining strong zero-shot performance on other tasks and datasets, outperforming baseline unlearning methods.
Enhancing Few-Shot Class-Incremental Learning via Training-Free Bi-Level Modality Calibration
Yiyang Chen · Tianyu Ding · Lei Wang · Jing Huo · Yang Gao · Wenbin Li
Few-shot Class-Incremental Learning (FSCIL) challenges models to adapt to new classes with limited samples, presenting greater difficulties than traditional class-incremental learning. While existing approaches rely heavily on visual models and require additional training during base or incremental phases, we propose a training-free framework that leverages pre-trained visual-language models like CLIP. At the core of our approach is a novel Bi-level Modality Calibration (BiMC) strategy. Our framework initially performs intra-modal calibration, combining LLM-generated fine-grained category descriptions with visual prototypes from the base session to achieve precise classifier estimation. This is further complemented by inter-modal calibration that fuses pre-trained linguistic knowledge with task-specific visual priors to mitigate modality-specific biases. To enhance prediction robustness, we introduce additional metrics and strategies that maximize the utilization of limited data. Extensive experimental results demonstrate that our approach significantly outperforms existing methods. The code will be made publicly available.
Generalized Category Discovery (GCD) is an intriguing open-world problem that has garnered increasing attention within the research community. Given a dataset that includes both labelled and unlabelled images, GCD aims to categorize all images in the unlabelled subset, regardless of whether they belong to known or unknown classes. In GCD, the common practice typically involves applying a sphere projection operator at the end of the self-supervised pretrained backbone. This approach constrains subsequent operations to be conducted within Euclidean or spherical geometry. However, both of these geometries have been shown to be suboptimal for encoding samples that possess hierarchical structures. In contrast, hyperbolic space exhibits a unique property where its volume grows exponentially relative to the radius, making it particularly suitable for embedding hierarchical data. In this paper, we are the first to investigate category discovery within hyperbolic space. We propose a simple yet effective framework for learning hierarchy-aware representations and classifiers for GCD through the lens of hyperbolic geometry. Specifically, we initiate this process by transforming a self-supervised backbone pretrained in Euclidean space into hyperbolic space via exponential mapping, and then perform all representation and classifier optimization in the hyperbolic domain. We evaluate our proposed framework on widely used parametric and non-parametric GCD baselines, as well as the previous state-of-the-art (SOTA) method, achieving significant improvements and establishing a new SOTA.
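Since the framework hinges on moving Euclidean features into hyperbolic space, the sketch below shows the standard exponential map at the origin of the Poincaré ball together with the corresponding geodesic distance, the usual building block of hyperbolic classifiers. The curvature value and the example feature scale are illustrative choices, not taken from the paper.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball with curvature -c.

    Lifts Euclidean feature vectors v into hyperbolic space:
        exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x: torch.Tensor, y: torch.Tensor,
                      c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance on the Poincare ball, usable for hyperbolic classifiers."""
    sqrt_c = c ** 0.5
    diff2 = (x - y).pow(2).sum(-1)
    denom = (1 - c * x.pow(2).sum(-1)).clamp_min(eps) * (1 - c * y.pow(2).sum(-1)).clamp_min(eps)
    arg = 1 + 2 * c * diff2 / denom
    return (1.0 / sqrt_c) * torch.acosh(arg.clamp_min(1 + eps))

# Example: lift (assumed) backbone features into hyperbolic space
feats = torch.randn(8, 768) * 0.1        # Euclidean features from a pretrained backbone
hyp_feats = expmap0(feats, c=1.0)        # points inside the unit ball
d = poincare_distance(hyp_feats[0], hyp_feats[1])
```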
Solving Instance Detection from an Open-World Perspective
Qianqian Shen · Yunhan Zhao · Nahyun Kwon · Jeeeun Kim · Yanan Li · Shu Kong
Instance detection (InsDet) aims to localize specific object instances within novel scene imagery based on given visual references. Technically, it requires proposal detection to identify all possible object instances, followed by instance-level matching to pinpoint the ones of interest. Its open-world nature supports its wide-ranging applications from robotics to AR/VR, but also presents significant challenges: methods must generalize to unknown testing data distributions because (1) the testing scene imagery is unseen during training, and (2) there are domain gaps between visual references and detected proposals. Existing methods attempt to tackle these challenges by synthesizing diverse training examples or utilizing off-the-shelf foundation models (FMs). However, they only partially capitalize on the available open-world information. In this paper, we approach InsDet from an Open-World perspective, introducing our method IDOW. We find that, while pretrained FMs yield high recall in instance detection, they are not specifically optimized for instance-level feature matching. To address this, we adapt pretrained FMs for improved instance-level matching using open-world data. Our approach incorporates metric learning along with novel data augmentations, which sample distractors as negative examples and synthesize novel-view instances to enrich the visual references. Extensive experiments demonstrate that our method significantly outperforms prior works, achieving $>$10 AP improvement over previous results on two recently released challenging benchmark datasets in both conventional and novel instance detection settings.
Learning Class Prototypes for Unified Sparse-Supervised 3D Object Detection
Yun Zhu · Le Hui · Hang Yang · Jianjun Qian · Jin Xie · Jian Yang
Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections in sparse supervised 3D object detection through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78\%, 90\%, and 96\% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method.
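As a rough illustration of prototype-to-feature matching via optimal transport, the sketch below runs an entropic-regularized Sinkhorn iteration between class prototypes and unlabeled features and keeps only the most confident assignments as pseudo-labels. The uniform marginals, cosine cost, and confidence cutoff are assumptions for illustration, not the paper's exact formulation.

```python
import math
import torch

def sinkhorn(cost: torch.Tensor, n_iters: int = 50, eps: float = 0.05) -> torch.Tensor:
    """Entropic-regularized optimal transport between uniform marginals.

    cost: (N, K) cost matrix between N unlabeled features and K class prototypes.
    Returns a transport plan P with rows summing to 1/N and columns to 1/K."""
    N, K = cost.shape
    log_P = -cost / eps
    log_r = torch.full((N,), -math.log(N))
    log_c = torch.full((K,), -math.log(K))
    u = torch.zeros(N)
    v = torch.zeros(K)
    for _ in range(n_iters):
        u = log_r - torch.logsumexp(log_P + v[None, :], dim=1)
        v = log_c - torch.logsumexp(log_P + u[:, None], dim=0)
    return torch.exp(log_P + u[:, None] + v[None, :])

# Example: assign prototype pseudo-labels to the most confident features
features = torch.nn.functional.normalize(torch.randn(256, 128), dim=-1)
prototypes = torch.nn.functional.normalize(torch.randn(20, 128), dim=-1)
cost = 1.0 - features @ prototypes.t()              # cosine cost
plan = sinkhorn(cost)
conf, labels = plan.max(dim=1)                      # best prototype per feature
keep = conf > torch.quantile(conf, 0.8)             # keep only high-confidence matches
pseudo_labels = labels[keep]
```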
Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection
Boyong He · Yuxiang Ji · Qianwen Ye · Zhuoyue Tan · Liaoni Wu
Domain generalization (DG) for object detection aims to enhance detectors' performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0\% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9\% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios.
Mr. DETR: Instructive Multi-Route Training for Detection Transformers
Chang-Bin Zhang · Yujie Zhong · Kai Han
Existing methods enhance the training of detection transformers by incorporating an auxiliary one-to-many assignment. In this work, we treat the model as a multi-task framework, simultaneously performing one-to-one and one-to-many predictions. We investigate the roles of each component in the transformer decoder across these two training targets, including self-attention, cross-attention, and feed-forward network. Our empirical results demonstrate that any independent component in the decoder can effectively learn both targets simultaneously, even when other components are shared. This finding leads us to propose a multi-route training mechanism, featuring a primary route for one-to-one prediction and two auxiliary training routes for one-to-many prediction. We enhance the training mechanism with a novel instructive self-attention that dynamically and flexibly guides object queries for one-to-many prediction. The auxiliary routes are removed during inference, ensuring no impact on model architecture or inference cost. We conduct extensive experiments on various baselines, achieving consistent improvements.
PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches
Dennis Jacob · Chong Xiang · Prateek Mittal
Deep learning techniques have enabled vast improvements in computer vision technologies. Nevertheless, these models are vulnerable to adversarial patch attacks which catastrophically impair performance. The physically realizable nature of these attacks calls for certifiable defenses, which feature provable guarantees on robustness. While certifiable defenses have been successfully applied to single-label classification, limited work has been done for multi-label classification. In this work, we present PatchDEMUX, a certifiably robust framework for multi-label classifiers against adversarial patches. Our approach is a generalizable method which can provably extend any existing certifiable defense for single-label classification; this is done by considering the multi-label classification task as a series of isolated binary classification problems. In addition, because one patch can only be placed at one location, we further develop a novel certification procedure that provides a tighter robustness certification bound. Using the current state-of-the-art (SOTA) single-label certifiable defense PatchCleanser as a backbone, we find that PatchDEMUX can achieve non-trivial robustness on the MSCOCO 2014 validation dataset while maintaining high clean performance.
Disentangling Safe and Unsafe Image Corruptions via Anisotropy and Locality
Ramchandran Muthukumar · Ambar Pal · Jeremias Sulam · Rene Vidal
State-of-the-art machine learning systems are vulnerable to small perturbations to their input, where _small_ is defined according to a threat model that assigns a positive threat to each perturbation. Most prior works define a task-agnostic, isotropic, and global threat, like the $\ell_p$ norm, where the magnitude of the perturbation fully determines the degree of the threat and neither the direction of the attack nor its position in space matter. However, common corruptions in computer vision, such as blur, compression, or occlusions, are not well captured by such threat models. This paper proposes a novel threat model called $\texttt{Projected Displacement}$ (PD) to study robustness beyond existing isotropic and global threat models. The proposed threat model measures the threat of a perturbation via its alignment with _unsafe directions_, defined as directions in the input space along which a perturbation of sufficient magnitude changes the ground truth class label. Unsafe directions are identified locally for each input based on observed training data. In this way, the PD threat model exhibits anisotropy and locality. The PD threat model is computationally efficient and can be easily integrated into existing robustness pipelines. Experiments on ImageNet-1k data indicate that, for any input, the set of perturbations with small PD threat includes _safe_ perturbations of large $\ell_p$ norm that preserve the true label, such as noise, blur and compression, while simultaneously excluding _unsafe_ perturbations that alter the true label. Unlike perceptual threat models based on embeddings of large-vision models, the PD threat model can be readily computed for arbitrary classification tasks without pre-training or finetuning. Furthermore, additional task information, such as sensitivity to image regions or concept hierarchies, can be easily integrated into the threat assessment; the PD threat model thus offers practitioners a flexible, task-driven threat specification that alleviates the limitations of $\ell_p$ threat models.
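To make the anisotropy and locality concrete, here is a deliberately crude, hypothetical sketch: unsafe directions are approximated by unit vectors pointing from an input toward nearby training samples of other classes, and the threat of a perturbation is its largest positive projection onto those directions. The direction construction and normalization are illustrative assumptions, not the paper's definition of the Projected Displacement threat.

```python
import torch

def unsafe_directions(x: torch.Tensor, train_x: torch.Tensor, train_y: torch.Tensor,
                      label: int, k: int = 10) -> torch.Tensor:
    """Crudely estimate unsafe directions for input x: unit vectors pointing toward
    the k nearest training samples that carry a *different* label."""
    other = train_x[train_y != label]                  # (M, D) samples of other classes
    dirs = other - x[None, :]                          # displacements toward other classes
    idx = dirs.norm(dim=-1).topk(k, largest=False).indices
    d = dirs[idx]
    return d / d.norm(dim=-1, keepdim=True).clamp_min(1e-8)

def pd_threat(delta: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Projected-Displacement-style threat: largest positive projection of the
    perturbation onto any unsafe direction (anisotropic and input-dependent)."""
    proj = dirs @ delta                                # (k,) signed projections
    return proj.clamp_min(0.0).max()

# Example: two perturbations of equal norm receive very different threat values,
# which an isotropic l_p metric cannot express.
train_x = torch.randn(100, 32)
train_y = torch.randint(0, 5, (100,))
x = train_x[0]
dirs = unsafe_directions(x, train_x, train_y, label=int(train_y[0]))
noise = 0.5 * torch.nn.functional.normalize(torch.randn(32), dim=0)   # random, "safe"-style
toward = 0.5 * dirs[0]                                                # aimed at another class
print(pd_threat(noise, dirs), pd_threat(toward, dirs))
```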
Beyond Single-Modal Boundary: Cross-Modal Anomaly Detection through Visual Prototype and Harmonization
Kai Mao · Ping Wei · Yiyang Lian · Yangyang Wang · Nanning Zheng
Zero-shot and few-shot anomaly detection have recently garnered widespread attention due to their potential applications. Most current methods require an auxiliary dataset in the same modality as the test images for training, making them ineffective in cross-modal situations. However, cross-modal anomaly detection is an essential task for both its application and research value. Existing methods usually underperform on this cross-modal anomaly detection task. Instead, we propose a framework that can be trained using data from a variety of pre-existing modalities and generalizes well to unseen modalities. The model consists of (1) the Transferable Visual Prototype, which directly learns normal/abnormal semantics in the visual space; (2) a Prototype Harmonization strategy that adaptively utilizes the Transferable Visual Prototypes from various modalities for inference on the unknown modality; (3) a Visual Discrepancy Inference for inference under the few-shot setting, further enhancing performance. In the zero-shot setting, the proposed method achieves AUROC improvements of 4.1\%, 6.1\%, 7.6\%, and 6.8\% over the best competing methods in the RGB, 3D, MRI/CT, and Thermal modalities, respectively. In the few-shot setting, our model also achieves the highest AUROC/AP on ten datasets in four modalities, substantially outperforming existing methods.
Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection
Wei Luo · Yunkang Cao · Haiming Yao · Xiaotian Zhang · Jianan Lou · Yuqi Cheng · Weiming Shen · Wenyong Yu
Anomaly detection (AD) is essential for industrial inspection, yet existing methods typically rely on "comparing" test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the test image. These INPs then guide the INP-Guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability. Code will be made available.
Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties
wenqiao Li · BoZhong Zheng · Xiaohao Xu · Jinye Gan · Fading Lu · Xiang Li · Na Ni · Zheng Tian · Xiaonan Huang · Shenghua Gao · Yingna Wu
Object anomaly detection is essential for industrial quality inspection, yet traditional single-sensor methods face critical limitations. They fail to capture the wide range of anomaly types, as single sensors are often constrained to either external appearance, geometric structure, or internal properties. To overcome these challenges, we introduce MulSen-AD, the first high-resolution, multi-sensor anomaly detection dataset tailored for industrial applications. MulSen-AD unifies data from RGB cameras, laser scanners, and lock-in infrared thermography, effectively capturing external appearance, geometric deformations, and internal defects. The dataset spans 15 industrial products with diverse, real-world anomalies. We also present MulSen-AD Bench, a benchmark designed to evaluate multi-sensor methods, and propose MulSen-TripleAD, a decision-level fusion algorithm that integrates these three modalities for robust, unsupervised object anomaly detection. Our experiments demonstrate that multi-sensor fusion substantially outperforms single-sensor approaches, achieving 96.1% AUROC in object-level detection accuracy. These results highlight the importance of integrating multi-sensor data for comprehensive industrial anomaly detection. The dataset and code are publicly available to support further research.
UniNet: A Contrastive Learning-guided Unified Framework with Feature Selection for Anomaly Detection
Shun Wei · Jielin Jiang · Xiaolong Xu
Anomaly detection (AD) is a crucial visual task aimed at recognizing abnormal patterns within samples. However, most existing AD methods suffer from limited generalizability, as they are primarily designed for domain-specific applications, such as industrial scenarios, and often perform poorly when applied to other domains. This challenge largely stems from the inherent discrepancies in features across domains. To bridge this domain gap, we introduce UniNet, a generic unified framework that incorporates effective feature selection and contrastive learning-guided anomaly discrimination. UniNet comprises student-teacher models and a bottleneck, featuring several vital innovations: First, we propose domain-related feature selection, where the student is guided to select and focus on representative features from the teacher with domain-relevant priors, while restoring them effectively. Second, a similarity contrastive loss function is developed to strengthen the correlations among homogeneous features. Meanwhile, a margin loss function is proposed to enforce the separation between the similarities of abnormality and normality, effectively improving the model's ability to discriminate anomalies. Third, we propose a weighted decision mechanism for dynamically evaluating the anomaly score to achieve robust AD. Large-scale experiments on 11 datasets from various domains show that UniNet surpasses state-of-the-art methods.
NN-Former: Rethinking Graph Structure in Neural Architecture Representation
Ruihan Xu · Haokui Zhang · Yaowei Wang · Wei Zeng · Shiliang Zhang
The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each method has its disadvantages. GNNs lack the capability to represent complicated features, while transformers generalize poorly as the architecture depth grows. To mitigate the above problems, we rethink neural architecture topology and show that sibling nodes are pivotal yet overlooked in previous research. Thus, we propose a novel predictor leveraging the strengths of GNNs and transformers to learn the enhanced topology. We introduce a novel token mixer that considers siblings, and a new channel mixer named bidirectional graph isomorphism feed-forward network. Our approach consistently achieves promising performance in both accuracy and latency prediction, providing valuable insights for learning Directed Acyclic Graph (DAG) topology. The code will be released.
Enhancing Dataset Distillation via Non-Critical Region Refinement
Minh-Tuan Tran · Trung Le · Xuan-May Le · Thanh-Toan Do · Dinh Phung
Dataset distillation has gained popularity as a technique for compressing large datasets into smaller, more efficient representations while retaining essential information for model training. Data features can be broadly divided into two types: instance-specific features, which capture unique, fine-grained details of individual examples, and class-general features, which represent shared, broad patterns across a class. However, previous approaches often struggle to balance these; some focus solely on class-general features, missing finer instance details, while others concentrate on instance-specific features, overlooking the shared characteristics essential for class-level understanding. In this paper, we propose the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves the instance-specific and fine-grained regions in synthetic data while enriching non-critical regions with more class-general information. This approach enables our models to leverage all pixel information to capture both types of features, thereby improving overall performance. Furthermore, we introduce Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training by relying solely on the distance between synthetic data predictions and one-hot encoded labels. Experimental results demonstrate that our NRR-DD achieves state-of-the-art performance on both small-scale and large-scale datasets. Additionally, by storing only two distances per instance, our method achieves comparable results across various settings. Code will be available at https://anonymous.4open.science/r/NRR-DD.
Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement
Shu Yang · Chengting Yu · Lei Liu · Hanzhi Ma · Aili Wang · Erping Li
Spiking Neural Networks (SNNs) have garnered considerable attention as a potential alternative to Artificial Neural Networks (ANNs). Recent studies have highlighted SNNs' potential on large-scale datasets. For SNN training, two main approaches exist: direct training and ANN-to-SNN (ANN2SNN) conversion. To fully leverage existing ANN models in guiding SNN learning, either direct ANN-to-SNN conversion or ANN-SNN distillation training can be employed. In this paper, we propose an ANN-SNN distillation framework from the ANN-to-SNN perspective, designed with a block-wise replacement strategy for ANN-guided learning. By generating intermediate hybrid models that progressively align SNN feature spaces to those of ANN through rate-based features, our framework naturally incorporates rate-based backpropagation as a training method. Our approach achieves results comparable to or better than state-of-the-art SNN distillation methods, showing both training and learning efficiency.
Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy
Zesen Cheng · Hang Zhang · Kehan Li · Sicong Leng · Zhiqiang Hu · Fei Wu · Deli Zhao · Xin Li · Lidong Bing
Contrastive loss is a powerful approach for representation learning, where larger batch sizes enhance performance by providing more negative samples to better distinguish between similar and dissimilar data. However, the full instantiation of the similarity matrix demands substantial GPU memory, making large batch training highly resource-intensive. To address this, we propose a tile-based computation strategy that partitions the contrastive loss calculation into small blocks, avoiding full materialization of the similarity matrix. Additionally, we introduce a multi-level tiling implementation to leverage the hierarchical structure of distributed systems, using ring-based communication at the GPU level to optimize synchronization and fused kernels at the CUDA core level to reduce I/O overhead. Experimental results show that the proposed method significantly reduces GPU memory usage in contrastive loss. For instance, it enables contrastive training of a CLIP-ViT-L/14 model with a batch size of 4M using only 8 A800 80GB GPUs, without sacrificing accuracy. Compared to state-of-the-art memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed. The code will be made publicly available.
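A minimal forward-pass sketch of the tiling idea is shown below: the image-to-text InfoNCE loss is accumulated tile by tile with an online log-sum-exp, so the full similarity matrix is never materialized. Tile size and temperature are illustrative, and a practical implementation would also recompute tiles during the backward pass and fuse kernels, which this sketch does not attempt.

```python
import torch

def tiled_info_nce(img: torch.Tensor, txt: torch.Tensor, tau: float = 0.07,
                   tile: int = 1024) -> torch.Tensor:
    """Image-to-text InfoNCE computed tile by tile, never materializing the full
    (N, N) similarity matrix. Uses an online log-sum-exp over column tiles."""
    n = img.shape[0]
    pos = (img * txt).sum(dim=-1) / tau                    # positive logit, one per row
    run_max = torch.full((n,), float('-inf'), device=img.device)
    run_sum = torch.zeros(n, device=img.device)
    for start in range(0, n, tile):
        sim = img @ txt[start:start + tile].t() / tau      # (N, tile) block of logits
        blk_max = sim.max(dim=1).values
        new_max = torch.maximum(run_max, blk_max)
        run_sum = run_sum * torch.exp(run_max - new_max) \
                  + torch.exp(sim - new_max[:, None]).sum(dim=1)
        run_max = new_max
    lse = run_max + torch.log(run_sum)                     # logsumexp over all N columns
    return (lse - pos).mean()

# Sanity check against the naive (fully materialized) loss on a small batch
img = torch.nn.functional.normalize(torch.randn(512, 64), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(512, 64), dim=-1)
naive = torch.nn.functional.cross_entropy(img @ txt.t() / 0.07, torch.arange(512))
print(tiled_info_nce(img, txt, tile=128), naive)           # values should match closely
```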
LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
Christoforos N. Spartalis · Theodoros Semertzidis · Efstratios Gavves · Petros Daras
This paper presents LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models. LoTUS smooths the prediction probabilities of the model, mitigating its overconfidence that stems from data memorization, up to an information-theoretic bound. We evaluate LoTUS on the Transformer and ResNet18 models, against seven baseline methods, on four public datasets. Beyond established MU benchmarks, we evaluate unlearning on a large-scale dataset (ImageNet1k) which deters retraining, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. Experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. We will share code.
DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
Yongqi Huang · Peng Ye · Chenyu Huang · Jianjian Cao · Lin Zhang · Baopu Li · Gang Yu · Tao Chen
Upcycled Mixture-of-Experts (MoE) models have shown great potential in various tasks by converting the original Feed-Forward Network (FFN) layers in pre-trained dense models into MoE layers. However, these models still suffer from significant parameter inefficiency due to the introduction of multiple experts. In this work, we propose a novel DeRS (Decompose, Replace, and Synthesis) paradigm to overcome this shortcoming, which is motivated by our observations about the unique redundancy mechanisms of upcycled MoE experts. Specifically, DeRS decomposes the experts into one expert-shared base weight and multiple expert-specific delta weights, and subsequently represents these delta weights in lightweight forms. Our proposed DeRS paradigm can be applied to enhance parameter efficiency in two different scenarios, including: 1) DeRS Compression for the inference stage, using sparsification or quantization to compress vanilla upcycled MoE models; and 2) DeRS Upcycling for the training stage, employing lightweight sparse or low-rank matrices to efficiently upcycle dense models into MoE models. Extensive experiments across three different tasks show that the proposed methods can achieve extreme parameter efficiency while maintaining the performance for both training and compression of upcycled MoE models.
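As a hedged illustration of the decompose-and-replace idea, the sketch below represents each upcycled expert as a shared base linear layer plus an expert-specific low-rank delta, so only the lightweight deltas differ across experts. The rank, initialization, and module layout are assumptions for illustration rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class LowRankDeltaExpert(nn.Module):
    """One MoE expert represented as a shared base weight plus a lightweight,
    expert-specific low-rank delta: W_i = W_base + B_i @ A_i."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                                   # shared (tied) across all experts
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))    # zero-init: expert starts as the base

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base(x) + x @ (B @ A)^T, computed without forming the full delta matrix
        return self.base(x) + x @ self.A.t() @ self.B.t()

# Example: upcycle one FFN projection into 4 experts that share a single base weight
base = nn.Linear(512, 2048)
experts = nn.ModuleList([LowRankDeltaExpert(base, rank=8) for _ in range(4)])
x = torch.randn(2, 512)
outputs = torch.stack([e(x) for e in experts])             # (4, 2, 2048)
```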
Towards Consistent Multi-Task Learning: Unlocking the Potential of Task-Specific Parameters
Xiaohan Qin · Xiaoxing Wang · Junchi Yan
Multi-task learning (MTL) has gained widespread application for its ability to transfer knowledge across tasks, improving resource efficiency and generalization. However, gradient conflicts from different tasks remain a major challenge in MTL. Previous gradient-based and loss-based methods primarily focus on gradient optimization in shared parameters, often overlooking the potential of task-specific parameters. This work points out that task-specific parameters not only capture task-specific information but also influence the gradients propagated to shared parameters, which in turn affects gradient conflicts. Motivated by this insight, we propose ConsMTL, which models MTL as a bi-level optimization problem: in the upper-level optimization, we perform gradient aggregation on shared parameters to find a joint update vector that minimizes gradient conflicts; in the lower-level optimization, we introduce an additional loss for task-specific parameters guiding the $k$ gradients of shared parameters to gradually converge towards the joint update vector. Our design enables the optimization of both shared and task-specific parameters to consistently alleviate gradient conflicts. Extensive experiments show that ConsMTL achieves state-of-the-art performance across various benchmarks with task numbers ranging from 2 to 40, demonstrating its superior performance.
Do Your Best and Get Enough Rest for Continual Learning
Hankyul Kang · Gregor Seifer · Donghyun Lee · Jongbin Ryu
According to Ebbinghaus' forgetting curve theory, we can enhance memory retention by learning extensive data and taking adequate rest. This means that in order to effectively retain new knowledge, it is essential to learn it thoroughly and ensure sufficient rest so that our brains can memorize without forgetting. The main takeaway from this theory is that learning extensive data at once necessitates sufficient rest before learning the same data again. This aspect of human long-term memory retention can be effectively utilized to address the continual learning of neural networks. Retaining new knowledge for a long period of time without catastrophic forgetting is the critical problem of continual learning. Therefore, based on Ebbinghaus' theory, we introduce the Respacing method that adjusts the learning schedules to optimize the recall interval between retraining the same samples. The proposed Respacing method allows the network to get enough rest to learn extensive knowledge from the same samples with a recall interval of sufficient length. To this end, we specifically present two approaches: 1) a view-batch replay method that guarantees the optimal recall interval, and 2) a view-batch self-supervised learning that acquires extensive knowledge from a single training sample at a time. We empirically show that these approaches of our method are aligned with the forgetting curve theory, which can enhance long-term memory. In our experiments, we also demonstrate that our method significantly improves many state-of-the-art continual learning methods in various protocols and scenarios. We will open-source our code upon acceptance.
Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning
Huiyi Wang · Haodong Lu · Lina Yao · Dong Gong
Continual learning (CL) aims to continually accumulate knowledge from a non-stationary data stream without catastrophic forgetting of learned knowledge, requiring a balance between stability and adaptability. Relying on the generalizable representation in pre-trained models (PTMs), PTM-based CL methods perform effective continual adaptation on downstream tasks by adding learnable adapters or prompts upon the frozen PTMs. However, many existing PTM-based CL methods use restricted adaptation on a fixed set of these modules to avoid forgetting, suffering from limited CL ability. Periodically adding task-specific modules results in a linear model growth rate and impaired knowledge reuse. We propose Self-Expansion of pre-trained models with Modularized Adaptation (SEMA), a novel approach to enhance the control of the stability-plasticity balance in PTM-based CL. SEMA automatically decides to reuse or add adapter modules on demand in CL, depending on whether a significant distribution shift that cannot be handled is detected at different representation levels. We design a modular adapter consisting of a functional adapter and a representation descriptor. The representation descriptors are trained as a distribution shift indicator and used to trigger self-expansion signals. To better compose the adapters, an expandable weighting router is learned jointly to mix the adapter outputs. SEMA enables better knowledge reuse and a sub-linear expansion rate. Extensive experiments demonstrate the effectiveness of the proposed self-expansion method, achieving state-of-the-art performance compared to PTM-based CL methods without memory rehearsal.
Task-Agnostic Guided Feature Expansion for Class-Incremental Learning
Bowen Zheng · Da-Wei Zhou · Han-Jia Ye · De-Chuan Zhan
The ability to learn new concepts while preserving learned knowledge is desirable for learning systems in Class-Incremental Learning (CIL). Recently, feature expansion of the model has become a prevalent solution for CIL, where the old features are fixed during the training of the new task while new features are expanded for the new tasks. However, such task-specific features learned from the new task may collide with the old features, leading to misclassification between tasks. Therefore, the expanded model is often encouraged to capture diverse features from the new task, aiming to avoid such collision. However, the existing solution is largely restricted to the samples from the current task, because of the poor accessibility to previous samples. To promote the learning and transfer of diverse features across tasks, we propose a framework called Task-Agnostic Guided Feature Expansion (TagFex). Firstly, it captures task-agnostic features continually with a separate model, providing extra task-agnostic features for subsequent tasks. Secondly, to obtain useful features from the task-agnostic model for the current task, it aggregates the task-agnostic features with the task-specific features using a merge attention. The aggregated feature is then transferred back into the task-specific features for inference, helping the task-specific model capture diverse features. Extensive experiments show the effectiveness and superiority of TagFex on various CIL settings.
OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP
Mohamad Hassan N C · Divyam Gupta · Mainak Singha · SAI BHARGAV RONGALI · Ankit Jha · Muhammad Haris Khan · Biplab Banerjee
We introduce Low-Shot Open-Set Domain Generalization (LSOSDG), a novel paradigm unifying low-shot learning with open-set domain generalization (ODG). While prompt-based methods using models like CLIP have advanced DG, they falter in low-data regimes (e.g., 1-shot) and lack precision in detecting open-set samples with fine-grained semantics related to training classes. To address these challenges, we propose \textsc{OSLoPrompt}, an advanced prompt-learning framework for CLIP with two core innovations. First, to manage limited supervision across source domains and improve DG, we introduce a domain-agnostic prompt-learning mechanism that integrates adaptable domain-specific cues and visually guided semantic attributes through a novel cross-attention module, besides being supported by learnable domain- and class-generic visual prompts to enhance cross-modal adaptability. Second, to improve outlier rejection during inference, we classify unfamiliar samples as ``unknown'' and train specialized prompts with systematically synthesized pseudo-open samples that maintain fine-grained relationships to known classes, generated through a targeted query strategy with off-the-shelf foundation models. This strategy enhances feature learning, enabling our model to detect open samples with varied granularity more effectively. Extensive evaluations across five benchmarks demonstrate that \textsc{OSLoPrompt} establishes a new state-of-the-art in LSOSDG, significantly outperforming existing methods.
Reducing Class-wise Confusion for Incremental Learning with Disentangled Manifolds
Huitong Chen · Yu Wang · Yan Fan · Guosong Jiang · Qinghua Hu
Class incremental learning (CIL) aims to enable models to continuously learn new classes without catastrophically forgetting old ones. A promising direction is to learn and use prototypes of classes during incremental updates. Despite their simplicity and intuitiveness, we find that such methods suffer from inadequate representation capability and undesirable feature overlap. These two factors cause class-wise confusion and limited performance. In this paper, we develop a Confusion-REduced AuTo-Encoder classifier (CREATE) for CIL. Specifically, our method employs a lightweight auto-encoder module to learn a compact manifold for each class in the latent subspace, constraining samples to be well reconstructed only on the semantically correct auto-encoder. Thus, the representation stability and capability of class distributions are enhanced, alleviating the potential class-wise confusion problem. To further distinguish the overlapped features, we propose a confusion-aware latent space separation loss that ensures samples are closely distributed in their corresponding low-dimensional manifold while keeping away from the distributions of drifted features from other classes. Our method demonstrates stronger representational capacity and discrimination ability by learning disentangled manifolds and reduces class confusion. Extensive experiments on multiple datasets and settings show that CREATE outperforms other state-of-the-art methods by up to 5.41%.
Unsupervised Continual Domain Shift Learning with Multi-Prototype Modeling
Haopeng Sun · Yingwei Zhang · Lumin Xu · Sheng Jin · Ping Luo · Chen Qian · Wentao Liu · Yiqiang Chen
In real-world applications, deep neural networks may encounter constantly changing environments, where the test data originates from continually shifting unlabeled target domains. This problem, known as Unsupervised Continual Domain Shift Learning (UCDSL), poses practical difficulties. Existing methods for UCDSL aim to learn domain-invariant representations for all target domains. However, due to the existence of adaptivity gap, the invariant representation may theoretically lead to large joint errors. To overcome the limitation, we propose a novel UCDSL method, called Multi-Prototype Modeling (MPM). Our model comprises two key components: (1) Multi-Prototype Learning (MPL) for acquiring domain-specific representations using multiple domain-specific prototypes. MPL achieves domain-specific error minimization instead of enforcing feature alignment across different domains. (2) Bi-Level Graph Enhancer (BiGE) for enhancing domain-level and category-level representations, resulting in more accurate predictions. We provide theoretical and empirical analysis to demonstrate the effectiveness of our proposed method. We evaluate our approach on multiple benchmark datasets and show that our model surpasses state-of-the-art methods across all datasets, highlighting its effectiveness and robustness in handling unsupervised continual domain shift learning. Codes will be publicly accessible.
A Theory of Learning Unified Model via Knowledge Integration from Label Space Varying Domains
Dexuan Zhang · Thomas Westfechtel · Tatsuya Harada
Existing domain adaptation systems can hardly be applied to real-world problems with new classes presenting at deployment time, especially in source-free scenarios where multiple source domains do not share the label space despite being given a few labeled target data. To address this, we define a novel problem setting, multi-source semi-supervised open-set domain adaptation, and propose a learning theory via joint error that effectively tackles strong domain shift. To generalize the algorithm to source-free cases, we introduce a computationally efficient and architecture-flexible attention-based feature generation module. Extensive experiments on various datasets demonstrate the significant improvement of our proposed algorithm over baselines.
Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach
Chen-Chen Zong · Sheng-Jun Huang
Active learning (AL), which iteratively queries the most informative examples from a large pool of unlabeled candidates for model training, faces significant challenges in the presence of open-set classes. Existing methods either prioritize query examples likely to belong to known classes, indicating low epistemic uncertainty (EU), or focus on querying those with highly uncertain predictions, reflecting high aleatoric uncertainty (AU). However, they both yield suboptimal performance, as low EU corresponds to limited useful information, and closed-set AU metrics for unknown class examples are less meaningful. In this paper, we propose an Energy-based Active Open-set Annotation (EAOA) framework, which effectively integrates EU and AU to achieve superior performance. EAOA features a $(C+1)$-class detector and a target classifier, incorporating an energy-based EU measure and a margin-based energy loss designed for the detector, alongside an energy-based AU measure for the target classifier. Another crucial component is the target-driven adaptive sampling strategy. It first forms a smaller candidate set with low EU scores to ensure closed-set properties, making AU metrics meaningful. Subsequently, examples with high AU scores are queried to form the final query set, with the candidate set size adjusted adaptively. Extensive experiments show that EAOA achieves state-of-the-art performance while maintaining high query precision and low training overhead.
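For concreteness, the sketch below computes the standard free-energy score from logits and uses it in a two-stage selection in the spirit of the described strategy: first keep a candidate set with low detector energy, then query the candidates with the highest classifier energy. The mapping of energies to EU/AU, the set sizes, and the temperature are illustrative assumptions, not the paper's exact measures or adaptive sizing rule.

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Free-energy score E(x) = -T * logsumexp(logits / T); lower energy typically
    indicates a more confidently known (closed-set) example."""
    return -T * torch.logsumexp(logits / T, dim=-1)

def select_queries(detector_logits: torch.Tensor, classifier_logits: torch.Tensor,
                   n_candidates: int = 1000, n_query: int = 100) -> torch.Tensor:
    """Two-stage selection:
    1) keep the examples with the lowest detector energy (closed-set-like, low EU proxy),
    2) from those, query the ones with the highest classifier energy (high AU proxy)."""
    eu = energy_score(detector_logits)                       # (N,)
    candidates = eu.topk(n_candidates, largest=False).indices
    au = energy_score(classifier_logits[candidates])
    return candidates[au.topk(n_query, largest=True).indices]

# Example with random logits for a (C+1)-class detector and a C-class classifier
det_logits = torch.randn(5000, 11)
cls_logits = torch.randn(5000, 10)
query_idx = select_queries(det_logits, cls_logits)
```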
Towards Cost-Effective Learning: A Synergy of Semi-Supervised and Active Learning
Tianxiang Yin · Ningzhong Liu · Han Sun
Active learning (AL) and semi-supervised learning (SSL) both aim to reduce annotation costs: AL selectively annotates high-value samples from the unlabeled data, while SSL leverages abundant unlabeled data to improve model performance. Although these two appear intuitively compatible, directly combining them remains challenging due to fundamental differences in their frameworks. Current semi-supervised active learning (SSAL) methods often lack theoretical foundations and design AL strategies tailored to a specific SSL algorithm rather than genuinely integrating the two fields. In this paper, we incorporate AL objectives into the overall risk formulation within the mainstream pseudo-label-based SSL framework, clarifying key differences between SSAL and traditional AL scenarios. To bridge these gaps, we propose a feature re-alignment module that aligns the features of unlabeled data under different augmentations by leveraging clustering and consistency constraints. Experimental results demonstrate that our module enables flexible combinations of SOTA methods from both AL and SSL, yielding more efficient algorithm performance.
Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch
Yijie Liu · Xinyi Shang · Yiqun Zhang · Yang Lu · Chen Gong · Jing-Hao Xue · Hanzi Wang
Federated Semi-Supervised Learning (FSSL) aims to leverage unlabeled data across clients with limited labeled data to train a global model with strong generalization ability. Most FSSL methods rely on consistency regularization with pseudo-labels, converting predictions from local or global models into hard pseudo-labels as supervisory signals. However, we discover that the quality of pseudo-labels is largely degraded by data heterogeneity, an intrinsic facet of federated learning. In this paper, we study the problem of FSSL in-depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models' predictive tendencies diverge as heterogeneity increases. Motivated by these findings, we propose a simple and effective method called \textbf{S}emi-supervised \textbf{A}ggregation for \textbf{G}lobally-Enhanced \textbf{E}nsemble (SAGE), which can flexibly correct pseudo-labels based on confidence discrepancies. This strategy effectively mitigates performance degradation caused by incorrect pseudo-labels and enhances consensus between local and global models. Experimental results demonstrate that SAGE outperforms existing FSSL methods in both performance and convergence. Our code is available at https://anonymous.4open.science/r/CVPR2025-1926-code.
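The following is a speculative, minimal sketch of confidence-discrepancy-based correction: local and global predictions are blended with a weight that grows with the confidence gap between the two models, and only confident pseudo-labels are kept. The particular weighting rule, sharpness constant, and threshold are illustrative assumptions, not SAGE's actual aggregation.

```python
import torch

def corrected_pseudo_labels(local_logits: torch.Tensor, global_logits: torch.Tensor,
                            threshold: float = 0.95):
    """Blend local and global predictions, leaning toward whichever model is more
    confident on each example, then keep only confident pseudo-labels."""
    p_local = local_logits.softmax(dim=-1)
    p_global = global_logits.softmax(dim=-1)
    conf_local = p_local.max(dim=-1).values
    conf_global = p_global.max(dim=-1).values
    # Weight grows with the (signed) confidence gap between global and local models.
    w_global = torch.sigmoid(5.0 * (conf_global - conf_local)).unsqueeze(-1)
    p_mix = w_global * p_global + (1.0 - w_global) * p_local
    conf, labels = p_mix.max(dim=-1)
    mask = conf >= threshold                               # FixMatch-style confidence cutoff
    return labels, mask

local_logits = torch.randn(64, 10)
global_logits = torch.randn(64, 10)
labels, mask = corrected_pseudo_labels(local_logits, global_logits)
```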
Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data
Yuchuan Li · Jae-Mo Kang · Il-Min Kim
In real-world AI applications, training datasets are often contaminated, containing a mix of in-distribution (ID) and out-of-distribution (OOD) samples without labels. This contamination poses a significant challenge for developing and training OOD detection models, as nearly all existing methods assume access to a clean training dataset of only ID samples—a condition rarely met in real-world scenarios. Customizing each existing OOD detection method to handle such contamination is impractical, given the vast number of diverse methods designed for clean data. To address this issue, we propose a universal, model-agnostic framework that integrates with nearly all existing OOD detection methods, enabling training on contaminated datasets while achieving high OOD detection accuracy on test datasets. Additionally, our framework provides an accurate estimation of the unknown proportion of OOD samples within the training dataset—an important and distinct challenge in its own right. Our approach introduces a novel dynamic weighting function and transition mechanism within an iterative training structure, enabling both reliable estimation of the OOD sample proportion of the training data and precise OOD detection on test data. Extensive evaluations across diverse datasets, including ImageNet-1k, demonstrate that our framework accurately estimates OOD sample proportions of training data and substantially enhances OOD detection accuracy on test data. Code is provided as supplementary material.
DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection
Li Li · Huixian Gong · Hao Dong · Tiankai Yang · Zhengzhong Tu · Yue Zhao
Out-of-distribution (OOD) detection is crucial for ensuring the robustness of machine learning models by identifying samples that deviate from the training distribution. While traditional OOD detection has predominantly focused on single-modality inputs, such as images, recent advancements in multimodal models have shown the potential of utilizing multiple modalities (e.g., video, optical flow, audio) to improve detection performance. However, existing approaches often neglect intra-class variability within in-distribution (ID) data, assuming that samples of the same class are perfectly cohesive and consistent. This assumption can lead to performance degradation, especially when prediction discrepancies are indiscriminately amplified across all samples. To address this issue, we propose Dynamic Prototype Updating (DPU), a novel plug-and-play framework for multimodal OOD detection that accounts for intra-class variations. Our method dynamically updates class center representations for each class by measuring the variance of similar samples within each batch, enabling tailored adjustments. This approach allows us to intensify prediction discrepancies based on the updated class centers, thereby enhancing the model's robustness and generalization across different modalities. Extensive experiments on two tasks, five datasets, and nine base OOD algorithms demonstrate that DPU significantly improves OOD detection performance, setting a new state-of-the-art in multimodal OOD detection, including improvements of up to 80% in Far-OOD detection. To improve accessibility and reproducibility, our code is released anonymously at https://anonymous.4open.science/r/CVPR-9177.
dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis
Luyuan Xie · Tianyu Luan · Wenyuan Cai · Guochen Yan · Zhaoyu Chen · Nan Xi · Yuejian Fang · Qingni Shen · Zhonghai Wu · Junsong Yuan
Federated learning has wide applications in the medical field. It enables knowledge sharing among different healthcare institutes while protecting patients' privacy. However, existing federated learning systems are typically centralized, requiring clients to upload client-specific knowledge to a central server for aggregation. This centralized approach integrates the knowledge from each client on a central server, where it can already be undermined during aggregation before it reaches back to each client. Besides, the centralized approach also creates a dependency on the central server, which may affect training stability if the server malfunctions or connections are unstable. To address these issues, we propose a decentralized federated learning framework named dFLMoE. In our framework, clients directly exchange lightweight head models with each other. After exchanging, each client treats both local and received head models as individual experts, and utilizes a client-specific Mixture of Experts (MoE) approach to make collective decisions. This design not only reduces the knowledge damage with client-specific aggregations but also removes the dependency on the central server to enhance the robustness of the framework. We validate our framework on multiple medical tasks, demonstrating that our method clearly outperforms state-of-the-art approaches under both model homogeneity and heterogeneity settings.
Population Normalization for Federated Learning
Zhuoyao Wang · Fan Yi · Peizhu Gong · Caitou He · Cheng Jin · Weizhong Zhang
Batch normalization (BN) is a standard technique for training deep neural networks. However, its effectiveness in Federated Learning (FL) applications, which typically involve heterogeneous training data and resource-limited clients, is compromised for two reasons. One is that the population statistics (i.e., mean and variance) of the heterogeneous datasets differ greatly, which leads to inconsistent BN layers among the client models and finally causes these models to drift further away from each other. The other is that estimating statistics from a mini-batch can be inaccurate since the batch size has to be small in resource-limited clients. In this paper, we propose a technique named Population Normalization for FL, in which the statistics are learned as trainable parameters during training instead of being calculated from mini-batches directly as BN does. Thus, our normalization layers are homogeneous among the clients, and the impact of small batch size is eliminated as the model can be well-trained even when the batch size equals one. To make our method more flexible in real applications, we investigate the role of the stochastic uncertainty existing in the statistic estimation of BN and show that when a large batch size is available we can inject simple artificial noise into our method to mimic this stochastic uncertainty and improve the generalization ability. The experimental results demonstrate the effectiveness of our method in various FL tasks.
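A minimal sketch of such a layer is shown below, assuming the population mean and (log-)variance are stored as trainable parameters and optionally perturbed with artificial noise during training to mimic mini-batch stochasticity. The exact parameterization and noise-injection scheme are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PopulationNorm2d(nn.Module):
    """BN-style normalization whose mean and variance are trainable parameters
    learned during training, rather than statistics computed from each mini-batch.
    This keeps the layer identical across clients and works even with batch size 1."""
    def __init__(self, num_features: int, eps: float = 1e-5, noise_std: float = 0.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_features))       # learned population mean
        self.log_var = nn.Parameter(torch.zeros(num_features))  # learned population log-variance
        self.gamma = nn.Parameter(torch.ones(num_features))     # affine scale, as in BN
        self.beta = nn.Parameter(torch.zeros(num_features))     # affine shift, as in BN
        self.eps = eps
        self.noise_std = noise_std                               # optional artificial noise

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (N, C, H, W)
        mu = self.mu.view(1, -1, 1, 1)
        std = (self.log_var.exp() + self.eps).sqrt().view(1, -1, 1, 1)
        if self.training and self.noise_std > 0:
            # mimic the stochastic uncertainty of mini-batch statistics
            mu = mu + self.noise_std * torch.randn_like(mu)
            std = std * (1 + self.noise_std * torch.randn_like(std)).clamp_min(0.1)
        x_hat = (x - mu) / std
        return self.gamma.view(1, -1, 1, 1) * x_hat + self.beta.view(1, -1, 1, 1)

# Example: drop-in replacement for nn.BatchNorm2d in a client model
layer = PopulationNorm2d(64, noise_std=0.05)
out = layer(torch.randn(1, 64, 8, 8))                            # works with batch size 1
```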
Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning
Hasin Us Sami · Swapneel Sen · Amit K. Roy-Chowdhury · Srikanth Krishnamurthy · Basak Guler
Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, recently parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at https://github.com/PEFTLeak/Inversion.
Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis
Jeonghwan Park · Niall McLaughlin · Ihsen Alouani
Adversarial attacks remain a significant threat that can jeopardize the integrity of Machine Learning (ML) models. In particular, query-based black-box attacks can generate malicious noise without having access to the victim model's architecture, making them practical in real-world contexts. The community has proposed several defenses against adversarial attacks, only to be broken by more advanced and adaptive attack strategies. In this paper, we propose a framework that detects whether an adversarial noise instance is being generated. Unlike existing stateful defenses that detect adversarial noise generation by monitoring the input space, our approach learns adversarial patterns in the input update similarity space. In fact, we propose to observe a new metric called Delta Similarity (DS), which we show captures the adversarial behavior more effectively. We evaluate our approach against 8 state-of-the-art attacks, including adaptive attacks, where the adversary is aware of the defense and tries to evade detection. We find that our approach is significantly more robust than existing defenses both in terms of specificity and sensitivity.
Many machine learning (ML) classifiers are claimed to outperform humans, but they still make mistakes that humans do not. The most notorious examples of such mistakes are adversarial visual metamers. This paper aims to define and investigate the phenomenon of adversarial Doppelgängers (AD), which includes adversarial visual metamers, and to compare the performance and robustness of ML classifiers to human performance. We find that AD are inputs that are close to each other with respect to a perceptual metric defined in this paper, and show that AD are qualitatively different from the usual adversarial examples. The vast majority of classifiers are vulnerable to AD, and robustness-accuracy trade-offs may not improve them. Some classification problems may not admit any AD-robust classifiers because the underlying classes are ambiguous. We provide criteria that can be used to determine whether a classification problem is well defined or not; describe the structure and attributes of AD-robust classifiers; and introduce and explore the notions of conceptual entropy and regions of conceptual ambiguity for classifiers that are vulnerable to AD attacks, along with methods to bound the AD fooling rate of an attack. We define the notion of classifiers that exhibit hyper-sensitive behavior, that is, classifiers whose only mistakes are adversarial Doppelgängers. Improving the AD robustness of hyper-sensitive classifiers therefore also improves their accuracy. We identify conditions guaranteeing that all classifiers with sufficiently high accuracy are hyper-sensitive.
PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection
Wei Li · Pin-Yu Chen · Sijia Liu · Ren Wang
Deep neural networks are susceptible to backdoor attacks, where adversaries manipulate model predictions by inserting malicious samples into the training data. Identifying suspicious training data to unveil potential backdoor samples remains a significant challenge. In this paper, we propose a novel method, Prediction Shift Backdoor Detection (PSBD), which leverages an uncertainty-based approach requiring only minimal unlabeled clean validation data. PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon: when dropout is applied during inference, poisoned models' predictions on clean data often shift away from the true labels towards certain other labels, while backdoor samples exhibit less PS. We hypothesize that PS results from a neuron bias effect that makes neurons favor features of certain classes. PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference. Extensive experiments verify the effectiveness and efficiency of PSBD, which achieves state-of-the-art results among mainstream detection methods.
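As described, PSU is the variance of predicted probabilities with dropout toggled on and off at inference time. A minimal sketch of such a score (names and details are illustrative, not the paper's exact implementation) could be:

```python
import torch

@torch.no_grad()
def prediction_shift_uncertainty(model, x, n_dropout_passes=10):
    """Illustrative PSU-style score: variance of class probabilities across
    inference passes with dropout off and dropout on."""
    model.eval()
    p_off = torch.softmax(model(x), dim=-1)                  # dropout off

    # Enable only dropout layers; keep batch norm etc. in eval mode.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    p_on = torch.stack([torch.softmax(model(x), dim=-1)
                        for _ in range(n_dropout_passes)])   # dropout on
    model.eval()

    probs = torch.cat([p_off.unsqueeze(0), p_on], dim=0)     # (T+1, B, C)
    # Per-sample uncertainty; clean samples tend to shift more (higher score),
    # backdoor samples less, following the Prediction Shift phenomenon.
    return probs.var(dim=0).mean(dim=-1)
```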
Rethinking the Adversarial Robustness of Multi-Exit Neural Networks in an Attack-Defense Game
Keyizhi Xu · Chi Zhang · Zhan Chen · Zhongyuan Wang · Chunxia Xiao · Chao Liang
Multi-exit neural networks represent a promising approach to improving model inference efficiency, yet, like common neural networks, they suffer from significantly reduced robustness against adversarial attacks. While several defense methods have been proposed to strengthen the adversarial robustness of multi-exit neural networks, we identify a long-neglected flaw in the evaluation of previous studies: simply using a fixed set of exits for the attack may lead to an overestimation of their defense capacity. Based on this finding, our work explores three key aspects of the adversarial robustness of multi-exit neural networks: (1) we discover that a mismatch between the network exits used by the attacker and the defender is responsible for the overestimated robustness of previous defense methods; (2) by finding the best strategy in a two-player zero-sum game, we propose AIMER, an improved evaluation scheme that measures the intrinsic robustness of multi-exit neural networks; (3) going further, we introduce the NEED defense method, which, under the evaluation of AIMER, optimizes the defender's strategy by finding a Nash equilibrium of the game. Experiments over 3 datasets, 7 architectures, 6 attacks and 4 baselines show that AIMER evaluates robustness 13.52% lower than previous methods under AutoAttack, while the robust performance of NEED surpasses single-exit networks with the same backbones by up to 5.58%.
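For context, once a payoff matrix of per-exit robust accuracies is available, the defender's equilibrium mixed strategy in a two-player zero-sum game can be obtained with a standard maximin linear program. The sketch below is generic game-solving code, not AIMER or NEED; the payoff matrix in the usage comment is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def defender_equilibrium(payoff):
    """Generic zero-sum maximin LP (not the paper's method).

    payoff[i, j] = defender's payoff (e.g. robust accuracy) when the defender
    plays exit strategy i and the attacker targets exit strategy j.
    Returns the defender's mixed strategy and the game value.
    """
    n_def, n_atk = payoff.shape
    # Variables: [x_1 .. x_n_def, v]; maximize v  <=>  minimize -v.
    c = np.zeros(n_def + 1)
    c[-1] = -1.0
    # For each attacker column j:  v - sum_i payoff[i, j] * x_i <= 0
    A_ub = np.hstack([-payoff.T, np.ones((n_atk, 1))])
    b_ub = np.zeros(n_atk)
    A_eq = np.hstack([np.ones((1, n_def)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_def + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_def], res.x[-1]

# Hypothetical 3-exit payoff matrix of robust accuracies:
# strategy, value = defender_equilibrium(np.array([[0.4, 0.1, 0.2],
#                                                  [0.1, 0.5, 0.2],
#                                                  [0.2, 0.2, 0.3]]))
```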
Learning-enabled Polynomial Lyapunov Function Synthesis via High-Accuracy Counterexample-Guided Framework
Hanrui Zhao · Niuniu Qi · Mengxin Ren · Banglong Liu · Shuming Shi · Zhengfeng Yang
A polynomial Lyapunov function $\mathcal{V}(\bf x)$ provides a mathematically rigorous certificate that converts stability analysis into an efficiently solvable optimization problem. Traditional numerical methods rely on user-defined templates, while emerging neural $\mathcal{V}(\bf x)$ approaches offer flexibility but exhibit poor generalization when built from naive {\it Square} polynomial networks. In this paper, we propose a novel learning-enabled polynomial $\mathcal{V}(\bf x)$ synthesis approach, in which a data-driven machine learning process guided by target-based sampling fits candidate $\mathcal{V}(\bf x)$ that are naturally compatible with sum-of-squares (SOS) soundness verification. The framework is structured as an iterative loop between a {\it Learner} and a {\it Verifier}: the {\it Learner} trains an expressive polynomial $\mathcal{V}(\bf x)$ network via polynomial expansions, while the {\it Verifier} encodes learned candidates with SOS constraints to identify a true $\mathcal{V}(\bf x)$ by solving LMI feasibility problems. The entire procedure is driven by a high-accuracy counterexample guidance technique to further enhance efficiency. Experimental results demonstrate that our approach outperforms both SMT-based polynomial neural Lyapunov function synthesis and the traditional SOS method.
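For reference, the conditions such a {\it Learner}/{\it Verifier} loop must certify for a system $\dot{\bf x} = f(\bf x)$ with an equilibrium at the origin are the standard Lyapunov conditions (the paper's specific encoding may differ):
$$\mathcal{V}(\mathbf{0}) = 0, \qquad \mathcal{V}(\mathbf{x}) > 0, \qquad \dot{\mathcal{V}}(\mathbf{x}) = \nabla\mathcal{V}(\mathbf{x})^{\top} f(\mathbf{x}) < 0 \quad \forall\, \mathbf{x} \in \mathcal{D}\setminus\{\mathbf{0}\}.$$
A typical SOS relaxation replaces the strict inequalities with membership in the SOS cone $\Sigma[\mathbf{x}]$, e.g. $\mathcal{V}(\mathbf{x}) - \epsilon\|\mathbf{x}\|^{2} \in \Sigma[\mathbf{x}]$ and $-\nabla\mathcal{V}(\mathbf{x})^{\top} f(\mathbf{x}) - \epsilon\|\mathbf{x}\|^{2} \in \Sigma[\mathbf{x}]$, which reduces to an LMI feasibility test solvable by semidefinite programming.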
AdaptCMVC: Robust Adaption to Incremental Views in Continual Multi-view Clustering
Jing Wang · Songhe Feng · Kristoffer Knutsen Wickstrøm · Michael C. Kampffmeyer
Most multi-view clustering approaches assume that all views are available for clustering. However, this assumption is often unrealistic, as views are incrementally accumulated over time, leading to a need for continual multi-view clustering (CMVC) methods. Current approaches to CMVC rely on late fusion, where a new model is typically learned individually for each view to obtain the corresponding partition matrix, which is then used to update a consensus matrix via a moving average. These approaches are prone to view-specific noise and struggle to adapt to large gaps between different views. To address these shortcomings, we reconsider CMVC from the perspective of domain adaptation and propose AdaptCMVC, which learns how to incrementally accumulate knowledge of new views as they become available while preventing catastrophic forgetting. Specifically, a self-training framework is introduced to extend the model to new views, designed particularly to be robust to view-specific noise. Further, to combat catastrophic forgetting, a structure alignment mechanism is proposed to enable the model to explore the global group structure across multiple views. Experiments on several multi-view benchmarks demonstrate the effectiveness of the proposed method on the CMVC task. Our code is released in the supplementary materials.
Medusa: A Multi-Scale High-order Contrastive Dual-Diffusion Approach for Multi-View Clustering
Liang Chen · Zhe Xue · Yawen Li · Meiyu Liang · Yan Wang · Anton van den Hengel · Yuankai Qi
Deep multi-view clustering methods utilize information from multiple views to achieve enhanced clustering results and have gained increasing popularity in recent years. Most existing methods typically focus on either inter-view or intra-view relationships, aiming to align information across views or analyze structural patterns within individual views. However, they often incorporate inter-view complementary information in a simplistic manner, while overlooking the complex, high-order relationships within multi-view data and the interactions among samples, resulting in an incomplete utilization of the rich information available. Instead, we propose a multi-scale approach that exploits all of the available information. We first introduce a dual graph diffusion module guided by a consensus graph. This module leverages inter-view information to enhance the representation of both nodes and edges within each view. Secondly, we propose a novel contrastive loss function based on hypergraphs to more effectively model and leverage complex intra-view data relationships. Finally, we propose to adaptively learn fusion weights at the sample level, which enables a more flexible and dynamic aggregation of multi-view information. Extensive experiments on eight datasets show favorable performance of the proposed method compared to state-of-the-art approaches, demonstrating its effectiveness across diverse scenarios.
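As a rough illustration of sample-level adaptive fusion (not the paper's exact module), per-sample weights over views can be predicted from the view embeddings themselves and normalized with a softmax before aggregation; the module below is a hypothetical sketch.

```python
import torch
import torch.nn as nn

class SampleLevelFusion(nn.Module):
    """Illustrative sketch: per-sample adaptive weighting of view embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)      # shared scorer applied to every view

    def forward(self, views):                # views: (batch, num_views, dim)
        scores = self.scorer(views).squeeze(-1)             # (batch, num_views)
        weights = torch.softmax(scores, dim=1)              # per-sample view weights
        fused = (weights.unsqueeze(-1) * views).sum(dim=1)  # (batch, dim)
        return fused, weights
```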
A Tale of Two Classes: Adapting Supervised Contrastive Learning to Binary Imbalanced Datasets
David Mildenberger · Paul Hager · Daniel Rueckert · Martin J. Menten
Supervised contrastive learning (SupCon) has proven to be a powerful alternative to the standard cross-entropy loss for classification on multi-class balanced datasets. However, it struggles to learn well-conditioned representations of datasets with long-tailed class distributions. This problem is potentially exacerbated for binary imbalanced distributions, which are commonly encountered in many real-world problems such as medical diagnosis. In experiments on seven binary datasets of natural and medical images, we show that the performance of SupCon decreases with increasing class imbalance. To substantiate these findings, we introduce two novel metrics that evaluate the quality of the learned representation space. By measuring the class distribution in local neighborhoods, we are able to uncover structural deficiencies of the representation space that classical metrics cannot detect. Informed by these insights, we propose two new supervised contrastive learning strategies tailored to binary imbalanced datasets that improve the structure of the representation space and increase downstream classification accuracy over standard SupCon by up to 35%.
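The paper's two metrics are its own; a generic way to probe local class structure of an embedding space is to measure, for each point of a chosen class, the fraction of same-class samples among its k nearest neighbors. The sketch below is a hypothetical neighborhood-purity metric in that spirit.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_class_purity(embeddings, labels, k=10, target_class=1):
    """Illustrative neighborhood metric (not the paper's): mean fraction of
    same-class samples among the k nearest neighbors of each target-class point.
    Assumes the query points are part of the indexed set, so the first neighbor
    (the point itself) is dropped."""
    nn_index = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn_index.kneighbors(embeddings[labels == target_class])
    neighbor_labels = labels[idx[:, 1:]]
    return float((neighbor_labels == target_class).mean())
```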
On the Out-Of-Distribution Generalization of Large Multimodal Models
Xingxuan Zhang · Jiansheng Li · Wenjing Chu · junjia hai · Renzhe Xu · Yuqing Yang · Shikai Guan · Jiazheng Xu · Liping Jing · Peng Cui
We investigate the generalization boundaries of current Large Multimodal Models (LMMs) via comprehensive evaluation under out-of-distribution scenarios and domain-specific tasks. We evaluate their zero-shot generalization across synthetic images, real-world distributional shifts, and specialized datasets like medical and molecular imagery. Empirical results indicate that LMMs struggle with generalization beyond common training domains, limiting their direct application without adaptation. To understand the cause of unreliable performance, we analyze three hypotheses: semantic misinterpretation, visual feature extraction insufficiency, and mapping deficiency. Results identify mapping deficiency as the primary hurdle. To address this problem, we show that in-context learning (ICL) can significantly enhance LMMs' generalization, opening new avenues for overcoming generalization barriers. We further explore the robustness of ICL under distribution shifts and show its vulnerability to domain shifts, label shifts, and spurious correlation shifts between in-context examples and test data.
DiffCAM: Data-Driven Saliency Maps by Capturing Feature Differences
Xingjian Li · Qiming Zhao · Neelesh Bisht · Mostofa Uddin Uddin · Jin Yu Kim · Bryan Zhang · Min Xu
In recent years, the interpretability of Deep Neural Networks (DNNs) has garnered significant attention, particularly due to their widespread deployment in critical domains like healthcare, finance, and autonomous systems. To address the challenge of understanding how DNNs make decisions, Explainable AI (XAI) methods, such as saliency maps, have been developed to provide insights into the inner workings of these models. This paper introduces DiffCAM, a novel XAI method designed to overcome limitations in existing Class Activation Map (CAM)-based techniques, which often rely on decision boundary gradients to estimate feature importance. DiffCAM differentiates itself by considering the actual data distribution of the reference class, identifying feature importance based on how a target example differs from reference examples. This approach captures the most discriminative features without relying on decision boundaries or prediction results, making DiffCAM applicable to a broader range of models, including foundation models. Through extensive experiments, we demonstrate the superior performance and flexibility of DiffCAM in providing meaningful explanations across diverse datasets and scenarios.
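A minimal sketch of the general idea of difference-based saliency (not DiffCAM's exact formulation): compare the target's feature maps against the mean feature maps of a set of reference examples and project the per-channel differences back onto spatial locations, without using gradients or decision boundaries. All names below are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def difference_saliency(feat_extractor, target, references, out_size):
    """Illustrative sketch: saliency from feature differences w.r.t. a reference set.

    `feat_extractor` maps an image batch to feature maps of shape (B, C, h, w).
    """
    f_t = feat_extractor(target)                                  # (1, C, h, w)
    f_r = feat_extractor(references).mean(dim=0, keepdim=True)    # reference prototype
    diff = f_t - f_r
    # Weight channels by how strongly they deviate from the reference class.
    weights = diff.abs().mean(dim=(2, 3), keepdim=True)           # (1, C, 1, 1)
    cam = (weights * diff).sum(dim=1, keepdim=True).relu()        # (1, 1, h, w)
    cam = F.interpolate(cam, size=out_size, mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]
    return cam
```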
Dual-view X-ray Detection: Can AI Detect Prohibited Items from Dual-view X-ray Images like Humans?
Renshuai Tao · Haoyu Wang · Yuzhe Guo · Hairong Chen · Li Zhang · Xianglong Liu · Yunchao Wei · Yao Zhao
To detect prohibited items in challenging categories, human inspectors typically rely on images from two distinct views (vertical and side). Can AI detect prohibited items from dual-view X-ray images in the same way humans do? Existing X-ray datasets often suffer from limitations, such as single-view imaging or insufficient sample diversity. To address these gaps, we introduce the Large-scale Dual-view X-ray (LDXray), which consists of 353,646 instances across 12 categories, providing a diverse and comprehensive resource for training and evaluating models. To emulate human intelligence in dual-view detection, we propose the Auxiliary-view Enhanced Network (AENet), a novel detection framework that leverages both the main and auxiliary views of the same object. The main-view pipeline focuses on detecting common categories, while the auxiliary-view pipeline handles more challenging categories using ``expert models" learned from the main view. Extensive experiments on the LDXray dataset demonstrate that the dual-view mechanism significantly enhances detection performance, e.g., achieving improvements of up to +24.7\% for the challenging category of umbrellas. Furthermore, our results show that AENet exhibits strong generalization across seven different detection models.
Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation
Kang Liu · Zhuoqi Ma · Xiaolu Kang · Yunan Li · Kun XIE · Zhicheng Jiao · Qiguang Miao
Automated radiology report generation offers an effective solution to alleviate radiologists' workload. However, most existing methods focus primarily on single or fixed-view images to model current disease conditions, which limits diagnostic accuracy and overlooks disease progression. Although some approaches utilize longitudinal data to track disease progression, they still rely on single images to analyze current visits. To address these issues, we propose enhanced contrastive learning with Multi-view Longitudinal data to facilitate chest X-ray Report Generation, named MLRG. Specifically, we introduce a multi-view longitudinal contrastive learning method that integrates spatial information from current multi-view images and temporal information from longitudinal data. This method also utilizes the inherent spatiotemporal information of radiology reports to supervise the pre-training of visual and textual representations. Subsequently, we present a tokenized absence encoding technique to flexibly handle missing patient-specific prior knowledge, allowing the model to produce more accurate radiology reports based on available prior knowledge. Extensive experiments on MIMIC-CXR, MIMIC-ABN, and Two-view CXR datasets demonstrate that our MLRG outperforms recent state-of-the-art methods, achieving a 2.3\% BLEU-4 improvement on MIMIC-CXR, a 5.5\% F1 score improvement on MIMIC-ABN, and a 2.7\% F1 RadGraph improvement on Two-view CXR.
CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
Yuxuan Sun · Yixuan Si · Chenglu Zhu · Xuan Gong · Kai Zhang · Pingyi Chen · Ye Zhang · Zhongyi Shui · Tao Lin · Lin Yang
The emergence of large multimodal models (LMMs) has brought significant advancements to pathology. Previous research has primarily focused on separately training patch-level and whole-slide image (WSI)-level models, limiting the integration of learned knowledge across patches and WSIs, and resulting in redundant models. In this work, we introduce CPath-Omni, the first 15-billion-parameter LMM designed to unify both patch and WSI level image analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. Extensive experiments demonstrate that CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets, outperforming or matching task-specific models trained for individual tasks. Additionally, we develop a specialized pathology CLIP-based visual processor for CPath-Omni, CPath-CLIP, which, for the first time, integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model, which achieves SOTA performance on nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni’s ability to unify diverse pathology tasks, demonstrating its potential to streamline and advance the field. The code and datasets will be made available.
BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology
Amaya Gallagher-Syed · Henry Senior · Omnia Alwazzan · Elena Pontarini · Michele Bombardieri · Costantino Pitzalis · Myles J. Lewis · Michael R Barnes · Luca Rossi · Greg Slabaugh
The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry analysis. We present BioX-CPath, an explainable graph neural network architecture for Whole Slide Image classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis (accuracy of 0.90 ±0.019) and Sjogren's Disease (accuracy of 0.84 ±0.018) multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores that align with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Code will be made public upon paper acceptance.
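The SAAP module is specific to the paper; a generic attention-based pooling over the patches of each stain, with the per-stain embeddings concatenated into a patient embedding, could look like the sketch below (ABMIL-style attention; all names are illustrative).

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention-based MIL pooling over a bag of patch features (sketch)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, patches):                       # (n_patches, dim)
        w = torch.softmax(self.attn(patches), dim=0)  # attention over patches
        return (w * patches).sum(dim=0)               # (dim,)

def stain_aware_embedding(stain_to_patches, pool_per_stain):
    """Illustrative sketch: pool each stain separately, then concatenate.

    `stain_to_patches` maps a stain name to its patch features;
    `pool_per_stain` maps the same names to AttentionPool modules.
    """
    parts = [pool_per_stain[s](feats) for s, feats in stain_to_patches.items()]
    return torch.cat(parts, dim=-1)                   # stain-aware patient embedding
```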
Robust Multimodal Survival Prediction with Conditional Latent Differentiation Variational AutoEncoder
Junjie Zhou · Jiao Tang · Yingli Zuo · Peng Wan · Daoqiang Zhang · WEI SHAO
The integrative analysis of histopathological images and genomic data has received increasing attention for survival prediction of human cancers. However, existing studies generally assume that full modalities are available. In practice, the cost of collecting genomic data is high, which sometimes makes genomic data unavailable for testing samples. A common way of tackling such incompleteness is to generate the genomic representations from the pathology images. Nevertheless, such a strategy still faces two challenges: (1) gigapixel whole slide images (WSIs) are huge and thus hard to represent; (2) it is difficult to generate genomic embeddings with diverse function categories in a unified generative framework. To address these challenges, we propose a Conditional Latent Differentiation Variational AutoEncoder (LD-CVAE) for robust multimodal survival prediction, even with missing genomic data. Specifically, a Variational Information Bottleneck Transformer (VIB-Trans) module is proposed to learn compressed pathological representations from the gigapixel WSIs. To generate different functional genomic features, we develop a novel Latent Differentiation Variational AutoEncoder (LD-VAE) to learn the common and specific posteriors for genomic embeddings with diverse functions. Finally, we use the product-of-experts technique to integrate the genomic common posterior and the image posterior for joint latent distribution estimation in LD-CVAE. We test the effectiveness of our method on five different cancer datasets, and the experimental results demonstrate its superiority in both complete and missing modality scenarios.
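The product-of-experts step has a standard closed form when the posteriors are diagonal Gaussians (the usual VAE choice, assumed here): the fused distribution has precision-weighted mean and summed precisions. A minimal sketch:

```python
import torch

def product_of_gaussian_experts(mu_list, logvar_list):
    """Standard product of diagonal-Gaussian experts (generic PoE sketch).

    Assumes each posterior is a diagonal Gaussian; a prior expert is often
    included as an additional (mu=0, logvar=0) entry in the lists.
    """
    precisions = [torch.exp(-lv) for lv in logvar_list]          # 1 / sigma^2
    precision = sum(precisions)
    mu = sum(m * p for m, p in zip(mu_list, precisions)) / precision
    var = 1.0 / precision
    return mu, torch.log(var)
```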
Boost the Inference with Co-training: A Depth-guided Mutual Learning Framework for Semi-supervised Medical Polyp Segmentation
Yuxin Li · Zihao Zhu · Yuxiang Zhang · Yifan Chen · Zhibin Yu
Semi-supervised polyp segmentation has made significant progress in recent years as a potential solution for computer-assisted treatment. Since depth images can provide information complementary to RGB images and help segment problematic regions, depth-assisted polyp segmentation has gained much attention. However, how best to utilize depth information is still an open question: existing RGB-D segmentation methods rely on depth data at the inference stage, limiting their clinical applicability. To tackle this problem, we propose a semi-supervised polyp segmentation framework based on the mean teacher architecture. We establish an auxiliary student network that takes depth images as input during training, and adopt the proposed depth-guided cross-modal mutual learning strategy to promote the learning of complementary information between the student networks. At the same time, the high-confidence pseudo-labels generated by the auxiliary student network guide the learning of the main student network from a different perspective, and no depth images are required in the inference phase. In addition, we introduce a depth-guided patch augmentation method to improve the model's learning performance in difficult regions of unlabeled polyp images. Experimental results show that our method achieves state-of-the-art performance under different label conditions on five polyp datasets.
Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation
Suruchi Kumari · Pravendra Singh
Despite the remarkable progress of deep learning-based methods in medical image segmentation, their use in clinical practice remains limited for two main reasons. First, obtaining a large medical dataset with precise annotations to train segmentation models is challenging. Secondly, most current segmentation techniques generate a single deterministic segmentation mask for each image. However, in real-world scenarios, there is often significant uncertainty regarding what defines the ``correct" segmentation, and various expert annotators might provide different segmentations for the same image. To tackle both of these problems, we propose Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation (AmbiSSL). AmbiSSL combines a small amount of multi-annotator labeled data and a large set of unlabeled data to generate diverse and plausible segmentation maps. Our method consists of three key components: (1) the Diverse Pseudo-Label Generation (DPG) module utilizes multiple decoders, created by performing randomized pruning on the original backbone decoder; these pruned decoders enable the generation of a diverse pseudo-label set; (2) a Semi-Supervised Latent Distribution Learning (SSLDL) module constructs a common latent space by utilizing both ground truth annotations and the pseudo-label set; and (3) a Cross-Decoder Supervision (CDS) module enables pruned decoders to guide each other’s learning. We evaluated the proposed method on two publicly available datasets. Extensive experiments demonstrate that AmbiSSL can generate diverse segmentation maps using only a small amount of labeled data and abundant unlabeled data, offering a more practical solution for medical image segmentation by reducing reliance on large labeled datasets.
Unified Medical Lesion Segmentation via Self-referring Indicator
Shijie Chang · Xiaoqi Zhao · Lihe Zhang · Tiancheng Wang
Recently emerged in-context-learning-based (ICL-based) models hold potential for unifying medical lesion segmentation. However, due to their cross-fusion designs, existing ICL-based unified segmentation models fail to accurately localize lesions given low-matched reference sets. Considering that the query itself can be regarded as a highly matched reference that better indicates the target, we design a self-referring mechanism that adaptively extracts self-referring indicator vectors from the query based on coarse predictions, thus effectively overcoming the negative impact of low-matched reference sets. To further facilitate the self-referring mechanism, we introduce reference indicator generation to efficiently extract reference information for coarse predictions instead of using cross-fusion modules, which heavily rely on reference sets. Our designs successfully address the challenges of applying ICL to unified medical lesion segmentation, forming a novel framework named SR-ICL. Our method achieves state-of-the-art results on 8 medical lesion segmentation tasks with only 4 image-mask pairs as reference. Notably, SR-ICL still achieves remarkable performance even when using weak reference annotations such as boxes and points, and maintains fixed and low memory consumption even as more tasks are combined. We hope that SR-ICL can provide new insights for the clinical application of medical lesion segmentation.
Minding Fuzzy Regions: A Data-driven Alternating Learning Paradigm for Stable Lesion Segmentation
Lexin Fang · Yunyang Xu · Xiang Ma · Xuemei Li · Caiming Zhang
Deep learning has achieved significant advances in medical image segmentation, but existing models still struggle to accurately segment lesion regions. The main reason is that some lesion regions in medical images have unclear boundaries, irregular shapes, and small tissue-density differences, leading to label ambiguity. Existing models, however, treat all data equally and do not account for quality differences during training, so noisy labels negatively impact training and yield unstable feature representations. In this paper, a data-driven alternating learning (DALE) paradigm is proposed to optimize the model's training process and achieve stable, high-precision segmentation. The paradigm focuses on two key points: (1) reducing the impact of noisy labels, and (2) calibrating unstable representations. To mitigate the negative impact of noisy labels, a loss-consistency-based collaborative optimization method is proposed, and its effectiveness is theoretically demonstrated. Specifically, label confidence parameters are introduced to dynamically adjust the influence of labels of different confidence levels during model training, thus reducing the influence of noisy labels. To calibrate the learning bias of unstable representations, a distribution alignment method is proposed. This method restores the underlying distribution of unstable representations, thereby enhancing the discriminative capability of fuzzy-region representations. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of the DALE paradigm, achieving an average performance improvement of up to 7.16%.
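The exact collaborative-optimization objective is the paper's; as a generic illustration, per-sample label-confidence weights can modulate the supervised loss so that low-confidence (likely noisy) labels contribute less. The sketch below is hypothetical and assumes the confidence values are estimated elsewhere during training.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, labels, confidence):
    """Illustrative sketch: down-weight likely-noisy labels by a confidence in [0, 1].

    `confidence` holds one value per sample; in practice it would be estimated
    dynamically (e.g. from loss consistency between collaborating models).
    """
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    # Average over spatial positions if this is a segmentation loss.
    per_sample = per_sample.reshape(per_sample.size(0), -1).mean(dim=1)
    return (confidence * per_sample).sum() / (confidence.sum() + 1e-8)
```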
EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation
Md Mostafijur Rahman · Radu Marculescu
Recent 3D deep networks such as SwinUNETR, SwinUNETRv2, and 3D UX-Net have shown promising performance by leveraging self-attention and large-kernel convolutions to capture the volumetric context. However, their substantial computational requirements limit their use in real-time and resource-constrained environments. The high #FLOPs and #Params in these networks stem largely from complex decoder designs with high-resolution layers and excessive channel counts. In this paper, we propose EffiDec3D, an optimized 3D decoder that employs a channel reduction strategy across all decoder stages, which sets the number of channels to the minimum needed for accurate feature representation. Additionally, EffiDec3D removes the high-resolution layers when their contribution to segmentation quality is minimal. Our optimized EffiDec3D decoder achieves a 96.4% reduction in #Params and a 93.0% reduction in #FLOPs compared to the decoder of original 3D UX-Net. Similarly, for SwinUNETR and SwinUNETRv2 (which share an identical decoder), we observe reductions of 94.9% in #Params and 86.2% in #FLOPs. Our extensive experiments on 12 different medical imaging tasks confirm that EffiDec3D not only significantly reduces the computational demands, but also maintains a performance level comparable to original models, thus establishing a new standard for efficient 3D medical image segmentation. We will make the source code public upon paper acceptance.
Learning Dynamic Collaborative Network for Semi-supervised 3D Vessel Segmentation
Jiao Xu · Xin Chen · Lihe Zhang
In this paper, we present a new dynamic collaborative network for semi-supervised 3D vessel segmentation, termed DiCo. Conventional mean teacher (MT) methods typically employ a static approach, where the roles of the teacher and student models are fixed. However, due to the complexity of 3D vessel data, the teacher model may not always outperform the student model, leading to cognitive biases that can limit performance. To address this issue, we propose a dynamic collaborative network that allows the two models to dynamically switch their teacher-student roles. Additionally, we introduce a multi-view integration module to capture various perspectives of the inputs, mirroring the way doctors conduct medical analysis. We also incorporate adversarial supervision to constrain the shape of the segmented vessels in unlabeled data. In this process, the 3D volume is projected into 2D views to mitigate the impact of label inconsistencies. Experiments demonstrate that our DiCo method sets new state-of-the-art performance on three 3D vessel segmentation benchmarks. Code and models will be made available.
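The role-switching criterion is the paper's own; a generic sketch of dynamic teacher-student switching might compare the two models on a labeled batch each iteration and let the currently better model act as teacher for the unlabeled data. All names below are illustrative assumptions.

```python
import torch

@torch.no_grad()
def pick_teacher(model_a, model_b, x_lab, y_lab, criterion):
    """Illustrative sketch: whichever model currently fits the labeled batch better
    acts as the teacher for this iteration; roles may switch at any step."""
    loss_a = criterion(model_a(x_lab), y_lab)
    loss_b = criterion(model_b(x_lab), y_lab)
    return (model_a, model_b) if loss_a <= loss_b else (model_b, model_a)

# teacher, student = pick_teacher(net1, net2, x_labeled, y_labeled, loss_fn)
# pseudo = teacher(x_unlabeled).argmax(dim=1)   # the teacher guides the student
# (the student is then updated against `pseudo`; roles can swap next iteration)
```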
Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization
Peirong Liu · Ana Lawry Aguila · Juan Iglesias
Data-driven machine learning has made significant strides in medical image analysis. However, most existing methods are tailored to specific modalities and assume a particular resolution (often isotropic). This limits their generalizability in clinical settings, where variations in scan appearance arise from differences in sequence parameters, resolution, and orientation. Furthermore, most general-purpose models are designed for healthy subjects and suffer from performance degradation when pathology is present. We introduce UNA (Unraveling Normal Anatomy), the first modality-agnostic learning approach for normal brain anatomy reconstruction that can handle both healthy scans and cases with pathology. We propose a fluid-driven anomaly randomization method that generates an unlimited number of realistic pathology profiles on-the-fly. UNA is trained on a combination of synthetic and real data, and can be applied directly to real images with potential pathology without the need for fine-tuning. We demonstrate UNA's effectiveness in reconstructing healthy brain anatomy and showcase its direct application to anomaly detection, using both simulated and real images from 3D healthy and stroke datasets, including CT and MRI scans. By bridging the gap between healthy and diseased images, UNA enables the use of general-purpose models on diseased images, opening up new opportunities for large-scale analysis of uncurated clinical images in the presence of pathology.
Blood Flow Speed Estimation with Optical Coherence Tomography Angiography Images
Wensheng Cheng · Zhenghong Li · Jiaxiang Ren · Hyomin Jeong · Congwu Du · Yingtian Pan · Haibin Ling
Estimating blood flow speed is essential in many medical and physiological applications, yet it is extremely challenging due to complex vascular structure and flow dynamics, particularly for cerebral cortex regions. Existing techniques, such as Optical Doppler Tomography (ODT), generally require complex hardware control and signal processing, and still suffer from inherent system-level artifacts. To address these challenges, we propose a new learning-based approach named OCTA-Flow, which directly estimates vascular blood flow speed from Optical Coherence Tomography Angiography (OCTA) images, which are commonly used for vascular structure analysis. OCTA-Flow employs several novel components to achieve this goal. First, using an encoder-decoder architecture, OCTA-Flow leverages ODT data as pseudo labels during training, thus bypassing the difficulty of collecting ground truth data. Second, to capture the relationship between vessels of varying scales and their flow speed, we design an Adaptive Window Fusion module that employs multiscale window attention. Third, to mitigate ODT artifacts, we incorporate a Conditional Random Field Decoder that promotes smoothness and consistency in the estimated blood flow. Together, these innovations enable OCTA-Flow to effectively produce accurate flow estimation, suppress the artifacts in ODT, and enhance practicality, benefiting from the established techniques of OCTA data acquisition. The code and data will be made publicly available.
3D Dental Model Segmentation with Geometrical Boundary Preserving
Shufan Xi · Zexian Liu · Junlin Chang · Hongyu Wu · Xiaogang Wang · Aimin Hao
3D intraoral scan meshes are widely used in digital dentistry diagnosis, and segmenting them is a critical preliminary task. Numerous approaches have been devised for precise tooth segmentation. Current deep learning-based methods achieve high segmentation accuracy on the crown. However, segmentation accuracy at the junction between the crown and the gum remains below average, because existing down-sampling methods are unable to effectively preserve the geometric details at the junction. To address these problems, we propose CrossTooth, a boundary-preserving segmentation method that combines selective 3D mesh downsampling, which retains more vertices in the tooth-gingiva area, with cross-modal discriminative boundary features extracted from multi-view rendered images, enhancing the geometric representation of the segmentation network. Using a point network as a backbone and incorporating complementary image features, CrossTooth significantly improves segmentation accuracy, as demonstrated by experiments on a public intraoral scan dataset.