Poster Session

Poster Session 3

Sat 14 Jun 8:30 a.m. PDT — 10:30 a.m. PDT


Poster #1
LLM-driven Multimodal and Multi-Identity Listening Head Generation

Peiwen Lai · Weizhi Zhong · Yipeng Qin · Xiaohang Ren · Baoyuan Wang · Guanbin Li

Generating natural listener responses in conversational scenarios is crucial for creating engaging digital humans and avatars. Recent work has shown that large language models (LLMs) can be effectively leveraged for this task, demonstrating remarkable capabilities in generating contextually appropriate listener behaviors. However, current LLM-based methods face two critical limitations: they rely solely on speech content, overlooking other crucial communication signals, and they entangle listener identity with response generation, compromising output fidelity and generalization. In this work, we present a novel framework that addresses these limitations while maintaining the advantages of LLMs. Our approach introduces a Multimodal-LM architecture that jointly processes speech content, prosody, and speaker emotion, capturing the full spectrum of communication cues. Additionally, we propose an identity disentanglement strategy using instance normalization and adaptive instance normalization in a VQ-VAE framework, enabling high-fidelity listening head synthesis with flexible identity control. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of response naturalness and fidelity, while enabling effective identity control without retraining.
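
For readers unfamiliar with the two operations named above, the sketch below shows plain instance normalization (which strips per-instance statistics, i.e. identity-specific style) and adaptive instance normalization (which re-injects a target identity's statistics). It is a minimal, generic PyTorch illustration of those operators; the tensor shapes and usage are assumptions, not the authors' VQ-VAE code.

```python
# Minimal sketch of IN / AdaIN as used for identity disentanglement in
# style-transfer-like pipelines. Shapes and names are illustrative only.
import torch

def instance_norm(x, eps=1e-5):
    # x: (batch, channels, length); removing per-instance statistics strips
    # identity-specific "style" from the features.
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True)
    return (x - mean) / (std + eps)

def adaptive_instance_norm(content, style, eps=1e-5):
    # Re-inject the statistics of a target identity ("style") into the
    # identity-free content features.
    normalized = instance_norm(content, eps)
    style_mean = style.mean(dim=-1, keepdim=True)
    style_std = style.std(dim=-1, keepdim=True)
    return normalized * (style_std + eps) + style_mean

# Toy usage: listener-motion features with the output identity swapped.
content = torch.randn(2, 64, 50)   # motion features from one identity
style = torch.randn(2, 64, 50)     # features of the target identity
out = adaptive_instance_norm(content, style)
print(out.shape)  # torch.Size([2, 64, 50])
```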


Poster #2
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Yongming Zhu · Longhao Zhang · Zhengkun Rong · Tianshu Hu · Shuang Liang · Zhipengge

Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows multiple rounds of conversation to flow smoothly and naturally. To actualize this, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that focus only on single-sided communication or require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, leading to audio-driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate the superior performance and effectiveness of our method.


Poster #3
AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Jiazhi Guan · Kaisiyuan Wang · Zhiliang Xu · Quanwei Yang · Yasheng SUN · Shengyi He · Borong Liang · Yukang Cao · Yingying Li · Haocheng Feng · Errui Ding · Jingdong Wang · Youjian Zhao · Hang Zhou · Ziwei Liu

Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to incoherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures w.r.t. the given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascaded Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio. 1) First, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance hand and face details that are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as the bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details.


Poster #4
InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

Jiahe Li · Jiawei Zhang · Xiao Bai · Jin Zheng · Jun Zhou · Lin Gu

Despite exhibiting impressive performance in synthesizing lifelike personalized 3D talking heads, prevailing methods based on radiance fields suffer from high demands on training data and time for each new identity. This paper introduces InsTaG, a 3D talking head synthesis framework that enables fast learning of a realistic personalized 3D talking head from little training data. Built upon a lightweight 3DGS person-specific synthesizer with universal motion priors, InsTaG achieves high-quality and fast adaptation while preserving a high level of personalization and efficiency. As preparation, we first propose an Identity-Free Pre-training strategy that enables pre-training of the person-specific model and encourages the collection of universal motion priors from a long-video data corpus. To fully exploit the universal motion priors for learning an unseen new identity, we then present a Motion-Aligned Adaptation strategy to adaptively align the target head to the pre-trained field and constrain a robust dynamic head structure under limited training data. Extensive experiments demonstrate our outstanding performance and efficiency under various data scenarios in rendering high-quality personalized talking head videos.


Poster #5
Dynamic Stereotype Theory Induced Micro-expression Recognition with Oriented Deformation

Bohao Zhang · Xuejiao Wang · Changbo Wang · Gaoqi He

Micro-expression recognition (MER) aims to uncover genuine emotions and underlying psychological states. However, existing MER methods struggle with three main challenges: 1) scarcity of micro-expression samples; 2) difficulty in modeling nearly imperceptible facial movements; 3) reliance on apex frame annotations. To address these issues, we propose a Self-supervised Oriented Deformation model for Apex-free Micro-expression Recognition (SODA4MER). Our approach enhances local deformation perception using muscle-group priors and amplifies subtle features through enhancement based on Dynamic Stereotype Theory (DST), while contrastive learning eliminates the need for manual apex annotations. Specifically, the oriented deformation estimator of SODA4MER is first pre-trained in a self-supervised manner. Second, a Gated Temporal Variance Gaussian model (GTVG) is introduced to adaptively integrate facial muscle-group priors, enhancing local deformation perception and mitigating noise from head movements. Then, contrastive learning is employed to achieve apex detection by identifying the frame with the most significant local deformation. Finally, guided by DST, we introduce a feature enhancement strategy that models the temporal dynamics of local deformation in the activation and decay phases, leading to richer deformation features. Our rigorous experiments confirm the competitive performance and practical applicability of SODA4MER.


Poster #6
Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Shengze Wang · Xueting Li · Chao Liu · Matthew Chan · Michael Stengel · Henry Fuchs · Shalini De Mello · Koki Nagano

Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but fail to faithfully preserve the user's per-frame appearance (e.g., instantaneous facial expression and lighting). As a result, neither of these two frameworks is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that takes the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearance. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction and temporal consistency on in-studio and in-the-wild datasets.


Poster #7
Highlight
TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting

Jianchuan Chen · Jingchuan Hu · Gaige Wang · Zhonghua Jiang · Tiansong Zhou · Zhiwen Chen · Chengfei Lv

Realistic 3D full-body talking avatars hold great potential in AR, with applications ranging from e-commerce live streaming to holographic communication. Despite advances in 3D Gaussian Splatting (3DGS) for lifelike avatar creation, existing methods struggle with fine-grained control of facial expressions and body movements in full-body talking tasks. Additionally, they often lack sufficient details and cannot run in real-time on mobile devices. We present TaoAvatar, a high-fidelity, lightweight, 3DGS-based full-body talking avatar driven by various signals. Our approach starts by creating a personalized clothed human parametric template that binds Gaussians to represent appearances. We then pre-train a StyleUnet-based network to handle complex pose-dependent non-rigid deformation, which can capture high-frequency appearance details but is too resource-intensive for mobile devices. To overcome this, we "bake" the non-rigid deformations into a lightweight MLP-based network using a distillation technique and develop blend shapes to compensate for details. Extensive experiments show that TaoAvatar achieves state-of-the-art rendering quality while running in real-time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
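
The "baking" step described above is, at its core, knowledge distillation from a heavy pose-conditioned network into a small MLP. The sketch below shows that generic recipe with stand-in teacher and student networks and an MSE objective; the module sizes, the pose parameterization, and the loss choice are assumptions for illustration, not TaoAvatar's implementation.

```python
# Hedged sketch of "baking" a heavy pose-dependent deformation network into a
# small MLP by distillation. Teacher/student definitions are placeholders.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(72, 512), nn.ReLU(), nn.Linear(512, 3))  # stands in for the heavy StyleUnet branch
student = nn.Sequential(nn.Linear(72, 64), nn.ReLU(), nn.Linear(64, 3))    # lightweight, mobile-friendly MLP

teacher.eval()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    pose = torch.randn(256, 72)          # e.g. SMPL-like pose parameters
    with torch.no_grad():
        target_offsets = teacher(pose)   # non-rigid offsets predicted by the heavy model
    loss = nn.functional.mse_loss(student(pose), target_offsets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```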


Poster #8
Synthetic Prior for Few-Shot Drivable Head Avatar Inversion

Wojciech Zielonka · Stephan J. Garbin · Alexandros Lattas · George Kopanas · Paulo Gotardo · Thabo Beeler · Justus Thies · Timo Bolkart

We present SynShot, a novel method for few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle two major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to SOTA monocular methods that require thousands of real training images, SynShot significantly improves novel view and expression synthesis.


Poster #9
Highlight
RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars

Linzhou Li · Yumeng Li · Yanlin Weng · Youyi Zheng · Kun Zhou

We present Reduced Gaussian Blendshapes Avatar (RGBAvatar), a method for reconstructing photorealistic, animatable head avatars at speeds sufficient for on-the-fly reconstruction. Unlike prior approaches that utilize linear bases from 3D morphable models (3DMM) to model Gaussian blendshapes, our method maps tracked 3DMM parameters into reduced blendshape weights with an MLP, leading to a compact set of blendshape bases. The learned compact base composition effectively captures essential facial details for specific individuals, and does not rely on the fixed base composition weights of 3DMM, leading to enhanced reconstruction quality and higher efficiency. To further expedite the reconstruction process, we develop a novel color initialization estimation method and a batch-parallel Gaussian rasterization process, achieving state-of-the-art quality with training throughput of about 630 images per second. Moreover, we propose a local-global sampling strategy that enables direct on-the-fly reconstruction, immediately reconstructing the model as video streams in real time while achieving quality comparable to offline settings.
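
The core composition described above, an MLP mapping tracked 3DMM parameters to a compact set of weights that blend learned Gaussian attribute bases, can be sketched as follows. The attribute dimensionality, base count, and module names are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch of a reduced-blendshape composition: an MLP maps tracked
# 3DMM parameters to a small weight vector that blends learned offset bases
# on top of neutral per-Gaussian attributes.
import torch
import torch.nn as nn

num_gaussians, num_bases, mm_dim = 10000, 20, 100
base_attrs = torch.randn(num_gaussians, 14)               # e.g. position, rotation, scale, opacity, color
blendshapes = torch.randn(num_bases, num_gaussians, 14)   # learned offset bases

weight_mlp = nn.Sequential(nn.Linear(mm_dim, 128), nn.ReLU(), nn.Linear(128, num_bases))

def compose(mm_params):
    # mm_params: (mm_dim,) tracked 3DMM expression/pose parameters
    w = weight_mlp(mm_params)                              # (num_bases,) reduced blendshape weights
    return base_attrs + torch.einsum("b,bgc->gc", w, blendshapes)

frame_attrs = compose(torch.randn(mm_dim))
print(frame_attrs.shape)  # torch.Size([10000, 14])
```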


Poster #10
AvatarArtist: Open-Domain 4D Avatarization

Hongyu Liu · Xuan Wang · Ziyu Wan · Yue Ma · Jingye Chen · Yanbo Fan · Yujun Shen · Yibing Song · Qifeng Chen

This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, termed AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies.


Poster #11
Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance

Dimitrios Gerogiannis · Foivos Paraperas Papantoniou · Rolandos Alexandros Potamias · Alexandros Lattas · Stefanos Zafeiriou

Inspired by the effectiveness of 3D Gaussian Splatting (3DGS) in reconstructing detailed 3D scenes within multi-view setups and the emergence of large 2D human foundation models, we introduce Arc2Avatar, the first SDS-based method utilizing a human face foundation model as guidance with just a single image as input. To achieve that, we extend such a model for diverse-view human head generation by fine-tuning on synthetic data and modifying its conditioning. Our avatars maintain a dense correspondence with a human face mesh template, allowing blendshape-based expression generation. This is achieved through a modified 3DGS approach, connectivity regularizers, and a strategic initialization tailored for our task. Additionally, we propose an optional efficient SDS-based correction step to refine the blendshape expressions, enhancing realism and diversity. Experiments demonstrate that Arc2Avatar achieves state-of-the-art realism and identity preservation, effectively addressing color issues by allowing the use of very low guidance, enabled by our strong identity prior and initialization strategy, without compromising detail.
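
Since the method is SDS-based, a generic Score Distillation Sampling step may help situate it: noise the differentiably rendered image, query a frozen diffusion prior, and push the residual between the predicted and injected noise back through the renderer. The dummy noise predictor below stands in for the face foundation model; everything in the sketch is a hedged illustration of SDS in general, not Arc2Avatar's implementation.

```python
# Generic Score Distillation Sampling (SDS) step. The dummy noise predictor is
# a placeholder for a real frozen diffusion model.
import torch

def dummy_eps_model(x_t, t):
    return torch.tanh(x_t) * (t / 1000.0)  # stand-in for a real denoiser

def sds_grad(rendered, t, alphas_cumprod, guidance_weight=1.0):
    # rendered: differentiably rendered image, shape (B, C, H, W)
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(rendered)
    x_t = a_t.sqrt() * rendered + (1 - a_t).sqrt() * noise   # forward diffusion
    with torch.no_grad():
        eps_pred = dummy_eps_model(x_t, t)
    # SDS uses w(t) * (eps_pred - noise) as the gradient w.r.t. the rendering.
    return guidance_weight * (1 - a_t) * (eps_pred - noise)

rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
grad = sds_grad(rendered, t=500, alphas_cumprod=alphas_cumprod)
# In practice `rendered` comes from a differentiable renderer; backpropagating
# this gradient updates the underlying 3D representation.
rendered.backward(gradient=grad)
```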


Poster #12
Highlight
Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Zhiyang Guo · Jinxu Xiang · Kai Ma · Wengang Zhou · Houqiang Li · Ran Zhang

3D characters are essential to modern creative industries, but making them animatable often demands extensive manual work in tasks like rigging and skinning. Existing automatic rigging tools face several limitations, including the necessity for manual annotations, rigid skeleton topologies, and limited generalization across diverse shapes and poses. An alternative approach generates animatable avatars pre-bound to a rigged template mesh. However, this method often lacks flexibility and is typically limited to realistic human shapes. To address these issues, we present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second, regardless of its shapes and poses. Our unified framework generates high-quality blend weights, bones, and pose transformations. By incorporating a particle-based shape autoencoder, our approach supports various 3D representations, including meshes and 3D Gaussian splats. Additionally, we employ a coarse-to-fine representation and a structure-aware modeling strategy to ensure both accuracy and robustness, even for characters with non-standard skeleton structures. We conducted extensive experiments to validate our framework's effectiveness. Compared to existing methods, our approach demonstrates significant improvements in both quality and speed. The source code will be made publicly available.
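
The blend weights, bones, and pose transformations the framework predicts are consumed by standard linear blend skinning (LBS). A plain NumPy sketch of LBS is given below for reference; it is generic background, not the paper's code.

```python
# Standard linear blend skinning (LBS): each vertex is deformed by a
# weight-blended combination of per-bone rigid transforms.
import numpy as np

def linear_blend_skinning(vertices, weights, bone_transforms):
    """vertices: (V, 3), weights: (V, B) rows summing to 1,
    bone_transforms: (B, 4, 4) rigid transforms per bone."""
    V = vertices.shape[0]
    homo = np.concatenate([vertices, np.ones((V, 1))], axis=1)        # (V, 4)
    per_bone = np.einsum("bij,vj->bvi", bone_transforms, homo)        # (B, V, 4)
    blended = np.einsum("vb,bvi->vi", weights, per_bone)              # (V, 4)
    return blended[:, :3]

# Toy example: two bones, identity and a translation along x.
verts = np.random.rand(5, 3)
w = np.tile([0.3, 0.7], (5, 1))
T = np.stack([np.eye(4), np.eye(4)])
T[1, 0, 3] = 1.0
print(linear_blend_skinning(verts, w, T))  # verts shifted by 0.7 along x
```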


Poster #13
PhysAnimator: Physics-Guided Generative Cartoon Animation

Tianyi Xie · Yiwei Zhao · Ying Jiang · Chenfanfu Jiang

Creating hand-drawn animation sequences is labor-intensive and demands professional expertise. We introduce PhysAnimator, a novel approach for generating physically plausible yet anime-stylized animation from static anime illustrations. Our method seamlessly integrates physics-based simulations with data-driven generative models to produce dynamic and visually compelling animations. To capture the fluidity and exaggeration characteristic of anime, we perform image-space deformable body simulations on extracted mesh geometries. We enhance artistic control by introducing customizable energy strokes and incorporating rigging point support, enabling the creation of tailored animation effects such as wind interactions. Finally, we extract and warp sketches from the simulation sequence, generating a texture-agnostic representation, and employ a sketch-guided video diffusion model to synthesize high-quality animation frames. The resulting animations exhibit temporal consistency and visual plausibility, demonstrating the effectiveness of our method in creating dynamic anime-style animations.


Poster #14
Zero-Shot Head Swapping in Real-World Scenarios

Sohyun Jeong · Taewoong Kang · Hyojin Jang · Jaegul Choo

With growing demand in media and social networks for personalized images, the need for advanced head-swapping techniques—integrating an entire head from the head image with the body from the body image—has increased. However, traditional head-swapping methods rely heavily on face-centered cropped data with primarily frontal-facing views, which limits their effectiveness in real-world applications. Additionally, their masking methods, designed to indicate regions requiring editing, are optimized for these types of datasets but struggle to achieve seamless blending in complex situations, such as when the original data includes features like long hair extending beyond the masked area. To overcome these limitations and enhance adaptability in diverse and complex scenarios, we propose a novel head swapping method, HID, that is robust to images including the full head and the upper body, handles views ranging from frontal to side, and automatically generates context-aware masks. For automatic mask generation, we introduce the IOMask, which enables seamless blending of the head and body, effectively addressing integration challenges. We further introduce a hair injection module to capture hair details with greater precision. Our experiments demonstrate that the proposed approach achieves state-of-the-art performance in head swapping, providing visually consistent and realistic results across a wide range of challenging conditions.


Poster #15
CaricatureBooth: Data-Free Interactive Caricature Generation in a Photo Booth

Zhiyu Qu · Yunqi Miao · Zhensong Zhang · Jifei Song · Jiankang Deng · Yi-Zhe Song

We present CaricatureBooth, a system that transforms caricature creation into a simple interactive experience -- as easy as using a photo booth! A key challenge in caricature generation is two-fold: the scarcity of high-quality caricature data and the difficulty in enabling precise creative control over the exaggeration process while maintaining identity. Prior approaches either require large-scale caricature and photo data or lack intuitive mechanisms for users to guide the deformation without losing identity. We address the data scarcity by synthesising training data through Thin Plate Spline (TPS) deformation of standard face images. For creative control, we design a Bézier curve interface where users can easily manipulate facial features, with these edits then driving TPS transformations at inference time. When combined with a pre-trained ID-preserving diffusion model, our system maintains both identity preservation and creative flexibility. Through extensive experiments, we demonstrate that CaricatureBooth achieves state-of-the-art quality while making the joy of caricature creation as accessible as taking a photo -- just walk in and walk out with your personalised caricature! Code will be made available at the first instance to facilitate follow-up efforts.
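
The TPS step can be sketched with SciPy's thin-plate-spline interpolator: given control points and their user-edited positions (here synthetic stand-ins for points sampled from Bézier handles), fit a backward mapping and evaluate it on the output pixel grid. This is a hedged illustration of TPS warping in general, not the paper's pipeline.

```python
# Thin Plate Spline warping sketch using SciPy. Control points are synthetic
# placeholders for landmarks edited via a curve interface.
import numpy as np
from scipy.interpolate import RBFInterpolator

src_pts = np.random.rand(20, 2) * 256            # original landmark positions (pixels)
dst_pts = src_pts + np.random.randn(20, 2) * 5   # exaggerated positions after editing

# Fit the mapping destination -> source so the output image can be backward-warped.
tps = RBFInterpolator(dst_pts, src_pts, kernel="thin_plate_spline")

h, w = 256, 256
yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
grid = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(float)   # (h*w, 2) output pixel coords
sample_coords = tps(grid).reshape(h, w, 2)  # where to sample the source image for each output pixel
print(sample_coords.shape)  # (256, 256, 2)
```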


Poster #16
FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields

Kwan Yun · Chaelin Kim · Hangyeul Shin · Junyong Noh

Recent 3D face editing methods using masks have produced high-quality edited images by leveraging Neural Radiance Fields (NeRF). Despite their impressive performance, existing methods often provide limited user control due to the use of pre-trained segmentation masks. To utilize masks with a desired layout, an extensive training dataset is required, which is challenging to gather. We present FFaceNeRF, a NeRF-based face editing technique that overcomes the challenge of limited user control caused by fixed mask layouts. Our method employs a geometry adapter with feature injection, allowing for effective manipulation of geometry attributes. Additionally, we adopt latent mixing for tri-plane augmentation, which enables training with fewer samples. This facilitates rapid model adaptation to desired mask layouts, crucial for applications in fields like personalized medical imaging or creative face editing. Our comparative evaluations indicate that FFaceNeRF surpasses existing mask-based face editing methods in terms of flexibility, control, and generated image quality, paving the way for future advancements in customized and high-fidelity 3D face editing.


Poster #17
D^3-Human: Dynamic Disentangled Digital Human from Monocular Video

Honghu Chen · Bo Peng · Yunfan Tao · Juyong Zhang

We introduce D^3-Human, a method for reconstructing Dynamic Disentangled Digital Human geometry from monocular videos. Past monocular video human reconstruction primarily focuses on reconstructing undecoupled clothed human bodies or only reconstructing clothing, making it difficult to apply directly in applications such as animation production. The challenge in reconstructing decoupled clothing and body lies in the occlusion caused by clothing over the body. To this end, the details of the visible area and the plausibility of the invisible area must be ensured during the reconstruction process. Our proposed method combines explicit and implicit representations to model the decoupled clothed human body, leveraging the robustness of explicit representations and the flexibility of implicit representations. Specifically, we reconstruct the visible region as an SDF and propose a novel human manifold signed distance field (hmSDF) to segment the visible clothing and visible body, and then merge the visible and invisible body. Extensive experimental results demonstrate that, compared with existing reconstruction schemes, D^3-Human can achieve high-quality decoupled reconstruction of the human body wearing different clothing, and can be directly applied to clothing transfer and animation production.


Poster #18
DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models

Radu Alexandru Rosu · Keyu Wu · Yao Feng · Youyi Zheng · Michael J. Black

We address the task of reconstructing 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-processing to decode, upsample, and add realism. These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. First, we address the lack of 3D hair data by automating the creation of the largest synthetic hair dataset to date, containing 40K hairstyles. Second, we leverage the synthetic hair dataset to learn an image-conditioned diffusion-transformer model that reconstructs accurate 3D strands from a single frontal image. By using a pretrained image backbone, our method generalizes to in-the-wild images despite being trained only on synthetic data. Our diffusion model predicts a scalp texture map in which any point in the map contains the latent code for an individual hair strand. These codes are directly decoded to 3D strands without post-processing techniques. Representing individual strands, instead of guide strands, enables the transformer to model the detailed spatial structure of complex hairstyles. With this, DiffLocks can reconstruct highly curled hair, like afro hairstyles, from a single image for the first time. Qualitative and quantitative results demonstrate that DiffLocks outperforms existing state-of-the-art approaches. Data and code will be available for research.


Poster #19
Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios

Hang Shao · Lei Luo · Jianjun Qian · Mengkai Yan · Shuo Chen · Jian Yang

Physiological activities manifest as subtle changes in facial imaging. While they are barely observable to our eyes, computer vision methods can detect them, and the derived remote photoplethysmography (rPPG) has shown considerable promise. However, existing studies mainly rely on spatial skin recognition and temporal rhythmic interactions, so they focus on identifying explicit features under ideal lighting conditions but perform poorly in the wild with intricate obstacles and extreme illumination exposure. In this paper, we propose an end-to-end video transformer model for rPPG. It strives to eliminate complex and unknown external time-varying interferences, whether they are sufficient to occupy subtle biosignal amplitudes or exist as periodic perturbations that hinder network training. In the specific implementation, we utilize global interference sharing, subject background reference, and self-supervised disentanglement to eliminate interference, and further guide learning based on spatiotemporal filtering, reconstruction guidance, and frequency-domain and biological prior constraints to achieve effective rPPG. To the best of our knowledge, this is the first robust rPPG model for real outdoor scenarios based on natural face videos, and it is lightweight to deploy. Extensive experiments show the competitiveness and performance of our model in rPPG prediction across datasets and scenes.


Poster #20
GCC: Generative Color Constancy via Diffusing a Color Checker

Chen-Wei Chang · Cheng-De Fan · Chia-Che Chang · Yi-Chen Lo · Yu-Chee Tseng · Jiun-Long Huang · Yu-Lun Liu

Color constancy methods often struggle to generalize across different camera sensors due to varying spectral sensitivities. We present GCC, which leverages diffusion models to inpaint color checkers into images for illumination estimation. Our key innovations include (1) a single-step deterministic inference approach that inpaints color checkers reflecting scene illumination, (2) a Laplacian composition technique that preserves checker structure while allowing illumination-dependent color adaptation, and (3) a mask-based data augmentation strategy for handling imprecise color checker annotations. GCC demonstrates superior robustness in cross-camera scenarios, achieving state-of-the-art worst-25% error rates of 5.22° and 4.32° in bi-directional evaluations. These results highlight our method's stability and generalization capability across different camera characteristics without requiring sensor-specific training, making it a versatile solution for real-world applications.
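
The worst-25% numbers quoted above are in terms of the standard recovery angular error between the estimated and ground-truth illuminant vectors; a small reference implementation of that metric (generic background, not code from the paper) is:

```python
# Recovery angular error (degrees) between estimated and ground-truth illuminants,
# the usual color-constancy benchmark metric.
import numpy as np

def angular_error_deg(est_rgb, gt_rgb):
    est = np.asarray(est_rgb, dtype=float)
    gt = np.asarray(gt_rgb, dtype=float)
    cos = np.dot(est, gt) / (np.linalg.norm(est) * np.linalg.norm(gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(angular_error_deg([0.6, 1.0, 0.8], [0.55, 1.0, 0.85]))  # roughly 3 degrees
```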


Poster #21
DarkIR: Robust Low-Light Image Restoration

Daniel Feijoo · Juan C. Benito · Alvaro Garcia · Marcos Conde

Photography during night or in dark conditions typically suffers from noise, low light and blurring issues due to the dim environment and the common use of long exposure. Although Deblurring and Low-light Image Enhancement (LLIE) are related under these conditions, most approaches in image restoration solve these tasks separately. In this paper, we present an efficient and robust neural network for multi-task low-light image restoration. Instead of following the current tendency of Transformer-based models, we propose new attention mechanisms to enhance the receptive field of efficient CNNs. Our method reduces the computational costs in terms of parameters and MAC operations compared to previous methods. Our model, DarkIR, achieves new state-of-the-art results on the popular LOLBlur, LOLv2 and Real-LOLBlur datasets, being able to generalize on real-world night images.


Poster #22
PolarFree: Polarization-based Reflection-Free Imaging

Mingde Yao · Menglu Wang · King Man Tam · Lingen Li · Tianfan Xue · Jinwei Gu

Reflection removal is challenging due to complex light interactions, where reflections obscure important details and hinder scene understanding. Polarization naturally provides a powerful cue to distinguish between reflected and transmitted light, enabling more accurate reflection removal. However, existing methods often rely on small-scale or synthetic datasets, which fail to capture the diversity and complexity of real-world scenarios. To this end, we construct a large-scale dataset, PolarRR, for polarization-based reflection removal, which enables us to train models that generalize effectively across a wide range of real-world scenarios. The PolarRR dataset contains 6,500 well-aligned mixed-transmission image pairs, 8x larger than existing polarization datasets, and is the first to include both RGB and polarization images captured across diverse indoor and outdoor environments with varying lighting conditions. Moreover, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages a diffusion process to generate reflection-free cues for accurate reflection removal. Extensive experiments show that PolarFree significantly enhances image clarity in difficult reflective scenarios, setting a new benchmark for polarized imaging and reflection removal. Code and dataset will be made public after acceptance.


Poster #23
Highlight
OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit

Benquan Wang · Ruyi An · Jin-Kyu So · Sergei Kurdiumov · Eng Aik Chan · Giorgio Adamo · Yuhan Peng · Yewen Li · Bo An

Observing objects of small size has long been a fascinating pursuit of human beings. However, due to the physical phenomenon of diffraction, optical resolution is restricted to approximately half the wavelength of light, which impedes the observation of subwavelength objects, typically smaller than 200 nm. This constrains applications in numerous scientific and industrial fields that aim to observe objects beyond the diffraction limit, such as native-state coronavirus inspection. Fortunately, deep learning methods have shown remarkable potential in uncovering underlying patterns within data, promising to overcome the diffraction limit by revealing the mapping between diffraction images and their corresponding ground-truth object localization images. However, the absence of suitable datasets has hindered progress in this field: collecting high-quality optical data of subwavelength objects is very challenging, as these objects are inherently invisible under conventional microscopy, making it impossible to perform standard visual calibration and drift correction. Therefore, in collaboration with top optical scientists, we provide the first general optical imaging dataset based on the "LEGO" concept for addressing the diffraction limit. Drawing an analogy to the modular construction of LEGO blocks, we construct a comprehensive optical imaging dataset comprising subwavelength fundamental elements, i.e., small square units that can be assembled into larger and more complex objects of any shape. We then frame the task as an image-to-image translation task and evaluate various vision backbone methods. Experimental results validate our "LEGO" concept, demonstrating that models trained on basic square units can effectively generalize to realistic, more complex unseen objects. Most importantly, by highlighting this underexplored AI-for-science area and its potential, we aspire to advance optical science by fostering collaboration with the vision and machine learning communities.


Poster #24
A Physics-Informed Blur Learning Framework for Imaging Systems

Liqun Chen · Yuxuan Li · Jun Dai · Jinwei Gu · Tianfan Xue

Accurate blur estimation is essential for high-performance imaging across various applications. Blur is typically represented by the point spread function (PSF). In this paper, we propose a physics-informed PSF learning framework for imaging systems, consisting of a simple calibration step followed by a learning process. Our framework achieves both high accuracy and universal applicability. Inspired by the Seidel PSF model for representing spatially varying PSFs, we identify its limitations in optimization and introduce a novel wavefront-based PSF model accompanied by an optimization strategy, which both reduces optimization complexity and improves estimation accuracy. Moreover, our wavefront-based PSF model is independent of lens parameters, eliminating the need for prior knowledge of the lens. To validate our approach, we compare it with recent PSF estimation methods (Degradation Transfer and Fast Two-step) through a deblurring task, where all the estimated PSFs are used to train state-of-the-art deblurring algorithms. Our approach demonstrates improvements in image quality in simulation and also shows noticeable visual quality improvements on real captured images. Code and models are public.
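
As background for the wavefront-based PSF model, the textbook Fourier-optics relationship is that the incoherent PSF is the squared magnitude of the Fourier transform of the pupil function carrying the wavefront aberration. The sketch below uses a circular pupil and an arbitrary defocus-like wavefront as assumptions; it illustrates that relationship, not the paper's calibrated model.

```python
# PSF from a wavefront: take the pupil function with phase aberration,
# Fourier transform it, and square the magnitude.
import numpy as np

N = 256
x = np.linspace(-1, 1, N)
X, Y = np.meshgrid(x, x)
R2 = X**2 + Y**2
aperture = (R2 <= 1.0).astype(float)            # circular pupil
wavefront = 0.5 * (2 * R2 - 1)                  # defocus-like aberration (in waves), assumed
pupil = aperture * np.exp(2j * np.pi * wavefront)

field = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil)))
psf = np.abs(field) ** 2
psf /= psf.sum()                                # normalize to unit energy
print(psf.shape, psf.max())
```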


Poster #25
MaDCoW: Marginal Distortion Correction for Wide-Angle Photography with Arbitrary Objects

Kevin Zhang · Jia-Bin Huang · Jose Echevarria · Stephen DiVerdi · Aaron Hertzmann

We introduce MaDCoW, a method for correcting marginal distortion of arbitrary objects in wide-angle photography. People often use wide-angle photography to convey natural scenes—smartphones typically default to wide-angle photography—but depicting very wide-field-of-view scenes produces distorted object appearance, particularly marginal distortion in linear projections. With MaDCoW, a user annotates regions-of-interest to correct, along with straight lines. For each region, MaDCoW solves for a local-linear perspective projection and then jointly solves for a projection for the whole photograph that minimizes distortion. We show that our method can produce good results in cases where previous methods yield visible distortions.


Poster #26
Highlight
Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation

Hadi Alzayer · Philipp Henzler · Jonathan T. Barron · Jia-Bin Huang · Pratul P. Srinivasan · Dor Verbin

Reconstructing the geometry and appearance of objects from photographs taken in different environments is difficult as the illumination and therefore the object appearance vary across captured images. This is particularly challenging for more specular objects whose appearance strongly depends on the viewing direction. Some prior approaches model appearance variation across images using a per-image embedding vector, while others use physically-based rendering to recover the materials and per-image illumination. Such approaches fail at faithfully recovering view-dependent appearance given significant variation in input illumination and tend to produce mostly diffuse results. We present an approach that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model and then reconstructing the object's geometry and appearance with a radiance field architecture that is robust to the small remaining inconsistencies among the relit images. We validate our proposed approach on both simulated and real datasets and demonstrate that it greatly outperforms existing techniques at reconstructing high-fidelity appearance from images taken under extreme illumination variation. Moreover, our approach is particularly effective at recovering view-dependent "shiny" appearance which cannot be reconstructed by prior methods.


Poster #27
IRGS: Inter-Reflective Gaussian Splatting with 2D Gaussian Ray Tracing

Chun Gu · Xiaofei Wei · Zixuan Zeng · Yuxuan Yao · Li Zhang

In inverse rendering, accurately modeling visibility and indirect radiance for incident light is essential for capturing secondary effects. Due to the absence of a powerful Gaussian ray tracer, previous 3DGS-based methods have either adopted a simplified rendering equation or used learnable parameters to approximate incident light, resulting in inaccurate material and lighting estimations. To this end, we introduce the inter-reflective Gaussian splatting (IRGS) framework for inverse rendering. To capture inter-reflection, we apply the full rendering equation without simplification and compute incident radiance on the fly using the proposed differentiable 2D Gaussian ray tracing. Additionally, we present an efficient optimization scheme to handle the computational demands of Monte Carlo sampling for rendering equation evaluation. Furthermore, we introduce a novel strategy for querying the indirect radiance of incident light when relighting the optimized scenes. Extensive experiments on multiple standard benchmarks validate the effectiveness of IRGS, demonstrating its capability to accurately model complex inter-reflection effects.
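
To make the Monte Carlo evaluation concrete, the sketch below estimates the rendering equation for a Lambertian surface with cosine-weighted hemisphere sampling; the incident-radiance callback is a placeholder for what would be queried via Gaussian ray tracing. It is an illustration of the generic estimator only, with all names and the constant environment assumed.

```python
# Monte Carlo evaluation of the rendering equation for a Lambertian BRDF using
# cosine-weighted hemisphere sampling (pdf = cos/pi, so the estimator reduces
# to albedo * mean(Li)).
import numpy as np

def sample_cosine_hemisphere(n, normal):
    # Cosine-weighted directions around `normal` (assumed unit length).
    u1, u2 = np.random.rand(n), np.random.rand(n)
    r, phi = np.sqrt(u1), 2 * np.pi * u2
    local = np.stack([r * np.cos(phi), r * np.sin(phi), np.sqrt(1 - u1)], axis=1)
    # Build an orthonormal frame around the normal.
    helper = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t = np.cross(normal, helper); t /= np.linalg.norm(t)
    b = np.cross(normal, t)
    return local[:, :1] * t + local[:, 1:2] * b + local[:, 2:3] * normal

def shade_lambertian(albedo, normal, incident_radiance, n_samples=256):
    dirs = sample_cosine_hemisphere(n_samples, normal)
    li = incident_radiance(dirs)                 # (n_samples, 3), placeholder query
    return albedo * li.mean(axis=0)

radiance = lambda d: np.ones((d.shape[0], 3)) * 0.8   # constant environment stand-in
print(shade_lambertian(np.array([0.5, 0.4, 0.3]), np.array([0.0, 0.0, 1.0]), radiance))
```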


Poster #28
Highlight
Volumetrically Consistent 3D Gaussian Rasterization

Chinmay Talegaonkar · Yash Belhe · Ravi Ramamoorthi · Nicholas Antipa

Recently, 3D Gaussian Splatting (3DGS) has enabled photorealistic view synthesis at high inference speeds. However, its splatting-based rendering model makes several approximations to the rendering equation, reducing physical accuracy. We show that splatting and its approximations are unnecessary, even within a rasterizer; we instead volumetrically integrate 3D Gaussians directly to compute the transmittance across them analytically. We use this analytic transmittance to derive more physically accurate alpha values than 3DGS, which can directly be used within their framework. The result is a method that more closely follows the volume rendering equation (similar to ray tracing) while enjoying the speed benefits of rasterization. Our method represents opaque surfaces with higher accuracy and fewer points than 3DGS. This enables it to outperform 3DGS for view synthesis (measured in SSIM and LPIPS). Being volumetrically consistent also enables our method to work out of the box for tomography. We match the state-of-the-art 3DGS-based tomography method with fewer points.
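
The analytic integration hinges on the fact that the line integral of a Gaussian density along a ray has a closed form in terms of the error function. The sketch below derives that closed form for a single Gaussian and converts the optical depth to an alpha value; it is a generic derivation under assumed parameters, not the paper's exact formulation.

```python
# Closed-form line integral of exp(-0.5 (x-mu)^T A (x-mu)) along x(t) = o + t d,
# via completing the square in t and using erf.
import numpy as np
from scipy.special import erf

def gaussian_ray_integral(o, d, mu, cov_inv, t0, t1):
    delta = o - mu
    a = d @ cov_inv @ d                    # quadratic coefficient (> 0)
    b = d @ cov_inv @ delta
    c = delta @ cov_inv @ delta
    peak = np.exp(-0.5 * (c - b * b / a))
    s = np.sqrt(a / 2.0)
    return peak * np.sqrt(np.pi / (2.0 * a)) * (erf(s * (t1 + b / a)) - erf(s * (t0 + b / a)))

# Transmittance through one Gaussian primitive with peak density sigma0.
o = np.array([0.0, 0.0, -5.0]); d = np.array([0.0, 0.0, 1.0])
mu = np.zeros(3); cov_inv = np.eye(3) / 0.04        # isotropic Gaussian, std = 0.2
sigma0 = 10.0
optical_depth = sigma0 * gaussian_ray_integral(o, d, mu, cov_inv, 0.0, 10.0)
alpha = 1.0 - np.exp(-optical_depth)   # analytic alpha for this ray/Gaussian pair
print(round(alpha, 4))
```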


Poster #29
MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities

Federico Lincetto · Gianluca Agresti · Mattia Rossi · Pietro Zanuttigh

Neural Radiance Fields (NeRF) have shown impressive performances in the rendering of 3D scenes from arbitrary viewpoints. While RGB images are widely preferred for training volume rendering models, the interest in other radiance modalities is also growing. However, the capability of the underlying implicit neural models to learn and transfer information across heterogeneous imaging modalities has seldom been explored, mostly due to the limited training data availability. For this purpose, we present MultimodalStudio (MMS): it encompasses MMS-DATA and MMS-FW. MMS-DATA is a multimodal multi-view dataset containing 32 scenes acquired with 5 different imaging modalities: RGB, monochrome, near-infrared, polarization and multispectral. MMS-FW is a novel modular multimodal NeRF framework designed to handle multimodal raw data and able to support an arbitrary number of multi-channel devices. Through extensive experiments, we demonstrate that MMS-FW trained on MMS-DATA can transfer information between different imaging modalities and produce higher quality renderings than using single modalities alone. We publicly release the dataset and the framework, to promote the research on multimodal volume rendering and beyond.


Poster #30
Neural Inverse Rendering from Propagating Light

Anagh Malik · Benjamin Attal · Andrew Xie · Matthew O’Toole · David B. Lindell

We present the first system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light. Our approach relies on a time-resolved extension of neural radiance caching --- a technique that accelerates inverse rendering by storing infinite-bounce radiance arriving at any point from any direction. The resulting model accurately accounts for direct and indirect light transport effects and, when applied to captured measurements from a flash lidar system, enables state-of-the-art 3D reconstruction in the presence of strong indirect light. Further, we demonstrate view synthesis of propagating light, automatic decomposition of captured measurements into direct and indirect components, as well as novel capabilities such as multi-view transient relighting of captured scenes.


Poster #31
PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields

Sean Wu · Shamik Basu · Tim Broedermann · Luc Van Gool · Christos Sakaridis

We tackle the ill-posed inverse rendering problem in 3D reconstruction with a Neural Radiance Field (NeRF) approach informed by Physics-Based Rendering (PBR) theory, named PBR-NeRF. Our method addresses a key limitation in most NeRF and 3D Gaussian Splatting approaches: they estimate view-dependent appearance without modeling scene materials and illumination. To address this limitation, we present an inverse rendering (IR) model capable of jointly estimating scene geometry, materials, and illumination. Our model builds upon recent NeRF-based IR approaches, but crucially introduces two novel physics-based priors that better constrain the IR estimation. Our priors are rigorously formulated as intuitive loss terms and achieve state-of-the-art material estimation without compromising novel view synthesis quality. Our method is easily adaptable to other inverse rendering and 3D reconstruction frameworks that require material estimation. We demonstrate the importance of extending current neural rendering approaches to fully model scene properties beyond geometry and view-dependent appearance. Code will be made publicly available.


Poster #32
MAGE : Single Image to Material-Aware 3D via the Multi-View G-Buffer Estimation Model

Haoyuan Wang · Zhenwei Wang · Xiaoxiao Long · Cheng Lin · Gerhard Hancke · Rynson W.H. Lau

With advances in deep learning models and the availability of large-scale 3D datasets, we have recently witnessed significant progress in single-view 3D reconstruction. However, existing methods often fail to reconstruct physically based material properties given a single image, limiting their applicability in complicated scenarios. This paper presents a novel approach (MAGE) for generating 3D geometry with realistic decomposed material properties given a single image as input. Our method leverages inspiration from traditional computer graphics deferred rendering pipelines to introduce a multi-view G-buffer estimation model. The proposed model estimates G-buffers for various views as multi-domain images, including XYZ coordinates, normals, albedo, roughness, and metallic properties from the single-view RGB. Furthermore, to address the inherent ambiguity and inconsistency in generating G-buffers simultaneously, we formulate a deterministic network from the pretrained diffusion models and propose a lighting response loss that enforces consistency across these domains using PBR principles. We also propose a large-scale synthetic dataset rich in material diversity for our model training. Experimental results demonstrate the effectiveness of our method in producing high-quality 3D meshes with rich material properties. We will release the dataset and code.


Poster #33
3D-HGS: 3D Half-Gaussian Splatting

Haolin Li · Jinyang Liu · Mario Sznaier · Octavia Camps

Photo-realistic 3D Reconstruction is a fundamental problem in 3D computer vision. This domain has seen considerable advancements owing to the advent of recent neural rendering techniques. These techniques predominantly aim to focus on learning volumetric representations of 3D scenes and refining these representations via loss functions derived from rendering. Among these, 3D Gaussian Splatting (3D-GS) has emerged as a significant method, surpassing Neural Radiance Fields (NeRFs). 3D-GS uses parameterized 3D Gaussians for modeling both spatial locations and color information, combined with a tile-based fast rendering technique. Despite its superior rendering performance and speed, the use of 3D Gaussian kernels has inherent limitations in accurately representing discontinuous functions, notably at edges and corners for shape discontinuities, and across varying textures for color discontinuities. To address this problem, we propose to employ 3D Half-Gaussian (3D-HGS) kernels, which can be used as a plug-and-play kernel. Our experiments demonstrate their capability to improve the performance of current 3D-GS-related methods and achieve state-of-the-art rendering performance on various datasets without compromising rendering speed. The code and trained models will be available on GitHub.


Poster #34
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling

Junha Hyung · Kinam Kim · Susung Hong · Min-Jung Kim · Jaegul Choo

Diffusion models have emerged as a powerful tool for generating high-quality images, videos, and 3D content. While sampling guidance techniques like CFG improve quality, they reduce diversity and motion. Autoguidance mitigates these issues but demands extra weak-model training, limiting its practicality for large-scale models. In this work, we introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models. STG employs an implicit weak model via self-perturbation, avoiding the need for external models or additional training. By selectively skipping spatiotemporal layers, STG produces an aligned, degraded version of the original model to boost sample quality without compromising diversity or dynamic degree. Our contributions include: (1) introducing STG as an efficient, high-performing guidance technique for video diffusion models, (2) eliminating the need for auxiliary models by simulating a weak model through layer skipping, and (3) ensuring quality-enhanced guidance without compromising sample diversity or dynamics, unlike CFG.
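
The guidance arithmetic STG shares with CFG-style methods is a simple extrapolation from a weaker prediction toward the full model's prediction; here the "weak" branch stands for the same network with selected spatiotemporal layers skipped. The tensors below are placeholders rather than outputs of a real video diffusion model, so this is an illustration of the combination rule only, not the paper's code.

```python
# CFG/STG-style guidance: extrapolate from a degraded (layer-skipped) prediction
# toward the full model's prediction at each denoising step.
import torch

def skip_guidance(eps_full, eps_weak, scale=1.5):
    # scale = 1.0 recovers the unguided prediction; larger values push the
    # sample away from the degraded branch and toward the full model.
    return eps_weak + scale * (eps_full - eps_weak)

eps_full = torch.randn(1, 4, 16, 32, 32)                  # (batch, channels, frames, h, w) noise prediction
eps_weak = eps_full + 0.1 * torch.randn_like(eps_full)    # stand-in for the layer-skipped prediction
guided = skip_guidance(eps_full, eps_weak, scale=1.5)
print(guided.shape)
```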


Poster #35
Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

Zhuoman Liu · Weicai Ye · Yan Luximon · Pengfei Wan · Di ZHANG

Realistic simulation of dynamic scenes requires accurately capturing diverse material properties and modeling complex object interactions grounded in physical principles. However, existing methods are constrained to basic material types with limited predictable parameters, making them insufficient to represent the complexity of real-world materials. We introduce a novel approach that leverages multi-modal foundation models and video diffusion to achieve enhanced 4D dynamic scene simulation. Our method utilizes multi-modal models to identify material types and initialize material parameters through image queries, while simultaneously inferring 3D Gaussian splats for detailed scene representation. We further refine these material parameters using video diffusion with a differentiable Material Point Method (MPM) and optical flow guidance rather than render loss or Score Distillation Sampling (SDS) loss. This integrated framework enables accurate prediction and realistic simulation of dynamic interactions in real-world scenarios, advancing both accuracy and flexibility in physics-based simulations.


Poster #36
ProbeSDF: Light Field Probes For Neural Surface Reconstruction

Briac Toussaint · Diego Thomas · Jean-Sébastien Franco

SDF-based differential rendering frameworks have achieved state-of-the-art multiview 3D shape reconstruction. In this work, we re-examine this family of approaches by minimally reformulating its core appearance model in a way that simultaneously yields faster computation and increased performance. To this end, we exhibit a physically inspired minimal radiance parametrization decoupling angular and spatial contributions, by encoding them with a small number of features stored in two respective volumetric grids of different resolutions. Requiring as little as four parameters per voxel, and a tiny MLP call inside a single fully fused kernel, our approach enhances performance on both surface and image (PSNR) metrics, while providing a significant training speedup and real-time rendering. We show this performance to be consistently achieved on real data over two widely different and popular application fields, generic object and human subject shape reconstruction, using four representative and challenging datasets.


Poster #37
Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

Zhiyuan Ma · Xinyue Liang · Rongyuan Wu · Xiangyu Zhu · Zhen Lei · Lei Zhang

It is desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming to overcome the data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), which eliminates the need for 3D ground truths by distilling multi-view diffusion models and adapts SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Our PRD scheme also accelerates inference by training the model to generate 3D contents in just four steps. We use PRD to train a Triplane generator, namely TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both quality and efficiency. Specifically, it can produce high-quality 3D meshes in 0.6 seconds.


Poster #38
FruitNinja

In the real world, objects reveal internal textures when sliced or cut, yet this behavior is not well-studied in 3D generation tasks today. For example, slicing a virtual 3D watermelon should reveal flesh and seeds. Given that no available dataset captures an object's full internal structure and collecting data from all slices is impractical, generative methods become the obvious approach. However, current 3D generation and inpainting methods often focus on visible appearance and overlook internal textures. To bridge this gap, we introduce FruitNinja, the first method to generate internal textures for 3D objects undergoing geometric and topological changes. Our approach produces objects via 3D Gaussian Splatting (3DGS) with both surface and interior textures synthesized, enabling real-time slicing and rendering without additional optimization. FruitNinja leverages a pre-trained diffusion model to progressively inpaint cross-sectional views and applies voxel-grid-based smoothing to achieve cohesive textures throughout the object. Our OpaqueAtom GS strategy overcomes 3DGS limitations by employing densely distributed opaque Gaussians, avoiding biases toward larger particles that destabilize training and sharp color transitions for fine-grained textures. Experimental results show that FruitNinja substantially outperforms existing approaches, showcasing unmatched visual quality in real-time rendered internal views across arbitrary geometry manipulations.


Poster #39
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

Wang Zhao · Yan-Pei Cao · Jiale Xu · Yue-Jiang Dong · Ying Shan

Procedural Content Generation (PCG) is powerful in creating high-quality 3D contents, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sample iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately, and generalizing well to in-the-wild images. Quantitative and qualitative experiment results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.


Poster #40
CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images

Chen Cheng · Jiacheng Wei · Tianrun Chen · Chi Zhang · Xiaofeng Yang · Shangzhan Zhang · Bingchen Yang · Chuan-Sheng Foo · Guosheng Lin · Qixing Huang · Fayao Liu

Creating CAD digital twins from the physical world is crucial for manufacturing, design, and simulation. However, current methods typically rely on costly 3D scanning with labor-intensive post-processing. To provide a streamlined and user-friendly design process, we explore the problem of reverse engineering from unconstrained real-world CAD images that can be easily captured by users of all experience levels. However, the scarcity of real-world CAD data poses challenges in directly training such models. To tackle these challenges, we propose CADCrafter, an image-to-parametric-CAD-model generation framework that trains a latent diffusion network solely on synthetic textureless CAD data while testing on real-world images. To bridge the significant representation disparity between images and parametric CAD models, we introduce a geometry encoder to improve the network's capability to accurately capture diverse geometric features. Moreover, the texture-invariant properties of the geometric features can also facilitate generalization to real-world scenarios. Since compiling CAD parameter sequences into explicit CAD models is a non-differentiable process, the network training inherently lacks explicit geometric supervision. To impose geometric validity constraints on our model, we employ direct preference optimization to fine-tune the diffusion model with automatic code-checker feedback on CAD sequence quality. Furthermore, we collected a real-world dataset, RealCAD, comprising multi-view images and corresponding CAD command sequence pairs, to evaluate our method. Experimental results demonstrate that our approach can robustly handle real unconstrained CAD images and even generalize to unseen general objects.


Poster #41
Highlight
MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

Jinnan Chen · Lingting Zhu · Zeyu HU · Shengju Qian · Yugang Chen · Xin Wang · Gim Hee Lee

Recent advances in auto-regressive transformers have revolutionized generative modeling across domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with sequential prediction paradigms, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher resolutions are lacking. To address these limitations, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent token denoising. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient up-scaling of the latent token resolution. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling properties over joint distribution modeling approaches like diffusion transformers in 3D generation.


Poster #42
Scaling Mesh Generation via Compressive Tokenization

Haohan Weng · Zibo Zhao · Biwen Lei · Xianghui Yang · Jian Liu · Zeqiang Lai · Zhuo Chen · Liu Yuhong · Jie Jiang · Chunchao Guo · Tong Zhang · Shenghua Gao · C.L.Philip Chen

We propose a compressive yet effective mesh tokenization, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75% compared to the vanilla sequences. This compression milestone unlocks the potential to utilize mesh data with significantly more faces, thereby enhancing detail richness and improving generation robustness. Empowered with BPT, we have built a foundation mesh generative model trained on scaled mesh data to support flexible control with point clouds and images. Our model demonstrates the capability to generate meshes with intricate details and accurate topology, achieving SoTA performance on mesh generation and reaching a level suitable for direct product use.
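To give a feel for block-wise indexing, the sketch below quantizes vertex coordinates and splits each into a coarse block index plus a fine offset; runs of repeated block indices are what a patch-aggregation step could then collapse. The bit split, ordering, and the repeat-rate statistic are illustrative assumptions, not the paper's exact tokenizer.

```python
import numpy as np

def blockwise_tokens(vertices, coord_bits=10, block_bits=4):
    """Quantize vertex coords to `coord_bits` bits and split each coordinate
    into a coarse block index (top `block_bits`) and a fine offset.

    Consecutive vertices sharing a block index could be aggregated into
    patches, shortening the token sequence; this sketch only performs the
    split and reports how often consecutive block indices repeat.
    """
    levels = 2 ** coord_bits
    offset_bits = coord_bits - block_bits
    q = np.clip((vertices * (levels - 1)).round().astype(np.int64), 0, levels - 1)
    block = q >> offset_bits                # coarse cell per axis
    offset = q & ((1 << offset_bits) - 1)   # position inside the cell
    # Fraction of consecutive vertices whose 3D block index repeats exactly:
    repeats = np.all(block[1:] == block[:-1], axis=1).mean()
    return block, offset, repeats

rng = np.random.default_rng(0)
verts = np.sort(rng.uniform(0, 1, size=(5000, 3)), axis=0)  # crudely ordered for illustration
block, offset, repeat_rate = blockwise_tokens(verts)
print(block.shape, offset.shape, f"repeat rate: {repeat_rate:.2f}")
```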


Poster #43
Hierarchical Gaussian Mixture Model Splatting for Efficient and Part Controllable 3D Generation

Qitong Yang · Mingtao Feng · Zijie Wu · Weisheng Dong · Fangfang Wu · Yaonan Wang · Ajmal Mian

3D content creation has achieved significant progress in terms of both quality and speed. Although current Gaussian Splatting-based methods can produce 3D objects within seconds, they are still limited by complex preprocessing or low controllability. In this paper, we introduce a novel framework designed to efficiently and controllably generate high-resolution 3D models from text prompts or images. Our key insights are three-fold: 1) Hierarchical Gaussian Mixture Model Splatting: We propose a hybrid hierarchical representation that extracts a fixed number of fine-grained Gaussians with multiscale details from a textured object and establishes a part-level representation of Gaussian primitives. 2) Mamba with adaptive tree topology: We present a diffusion Mamba with a tree topology that adaptively generates Gaussians with disordered spatial structures, avoiding the need for complex preprocessing while maintaining linear-complexity generation. 3) Controllable Generation: Building on the HGMM tree, we introduce a cascaded diffusion framework combining controllable implicit latent generation, which progressively generates condition-driven latents, and explicit splatting generation, which transforms latents into high-quality Gaussian primitives. Extensive experiments demonstrate the high fidelity and efficiency of our approach.


Poster #44
Identity-preserving Distillation Sampling by Fixed-Point Iterator

SeonHwa Kim · Jiwon Kim · Soobin Park · Donghoon Ahn · Jiwon Kang · Seungryong Kim · Kyong Hwan Jin · Eunju Cha

Score distillation sampling (SDS) demonstrates a powerful capability for text-conditioned 2D image and 3D object generation by distilling the knowledge from learned score functions. However, SDS often suffers from blurriness caused by noisy gradients. When SDS is applied to image editing, such degradations can be reduced by adjusting bias shifts using reference pairs, but the de-biasing techniques are still corrupted by erroneous gradients. To this end, we introduce Identity-preserving Distillation Sampling (IDS), which compensates for the gradient leading to undesired changes in the results. Based on the analysis that these errors come from the text-conditioned scores, a new regularization technique, called fixed-point iterative regularization (FPR), is proposed to modify the score itself, preserving the identity, even including poses and structures. Thanks to the self-correction by FPR, the proposed method provides clear and unambiguous representations corresponding to the given prompts in image-to-image editing and editable neural radiance field (NeRF). The structural consistency between the source and the edited data is clearly better maintained than in other state-of-the-art methods.


Poster #45
PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?

Martin Spitznagel · Jan Vaillant · Janis Keuper

The image-to-image translation abilities of generative learning models have recently made significant progress in the estimation of complex (steered) mappings between image distributions. While appearance-based tasks like image in-painting or style transfer have been studied at length, we propose to investigate the potential of generative models in the context of physical simulations. Providing a dataset of 300k image pairs and baseline evaluations for three different physical simulation tasks, we propose a benchmark to investigate the following research questions: i) are generative models able to learn complex physical relations from input-output image pairs? ii) what speedups can be achieved by replacing differential equation based simulations? While baseline evaluations of different current models show the potential for high speedups (ii), these results also reveal strong limitations in physical correctness (i). This underlines the need for new methods to enforce physical correctness.


Poster #46
EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting

Dong In Lee · Hyeongcheol Park · Jiyoung Seo · Eunbyung Park · Hyunje Park · Ha Dam Baek · Shin sangheon · sangmin kim · Sangpil Kim

Recent advancements in 3D editing have highlighted the potential of text-driven methods in real-time, user-friendly AR/VR applications. However, current methods rely on 2D diffusion models without adequately considering multi-view information, resulting in multi-view inconsistency. While 3D Gaussian Splatting (3DGS) significantly improves rendering quality and speed, its 3D editing process suffers from inefficient optimization, as pre-trained Gaussians retain excessive source information. To address these limitations, we propose EditSplat, a novel 3D editing framework that integrates Multi-view Fusion Guidance (MFG) and Attention-Guided Trimming (AGT). Our MFG ensures multi-view consistency by incorporating essential multi-view information into the diffusion process, leveraging classifier-free guidance from the text-to-image diffusion model and the geometric properties of 3DGS. Additionally, our AGT leverages the explicit representation of 3DGS to selectively prune and optimize 3D Gaussians, enhancing optimization efficiency and enabling precise, semantically rich local edits. Through extensive qualitative and quantitative evaluations, EditSplat achieves superior multi-view consistency and editing quality over existing methods, significantly enhancing overall efficiency.


Poster #47
Highlight
DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds

Youyu Chen · Junjun Jiang · Kui Jiang · Xiao Tang · Zhihao Li · Xianming Liu · Yinyu Nie

3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where the rendering resolution and the primitive number, which together constitute the optimization complexity, dominate the time cost of primitive optimization. In this paper, we propose DashGaussian, a scheduling scheme over the optimization complexity of 3DGS that strips redundant complexity to accelerate 3DGS optimization. Specifically, we formulate 3DGS optimization as progressively fitting 3DGS to higher levels of frequency components in the training views, and propose a dynamic rendering resolution scheme that largely reduces the optimization complexity based on this formulation. Besides, we argue that a specific rendering resolution should cooperate with a proper primitive number for a better balance between computing redundancy and fitting quality, where we schedule the growth of the primitives to synchronize with the rendering resolution. Extensive experiments show that our method accelerates the optimization of various 3DGS backbones by 45.7% on average while preserving the rendering quality.
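A minimal sketch of the coarse-to-fine idea is shown below: a rendering-resolution ramp with a primitive budget that grows in proportion to the pixel count. The geometric ramp, the 80% cutoff, and the budgets are assumptions for illustration, not the paper's schedule.

```python
def dash_schedule(step, total_steps, full_res=(1080, 1920),
                  min_scale=0.25, max_primitives=3_000_000):
    """Return (render_height, render_width, primitive_budget) for a training step.

    Resolution ramps geometrically from `min_scale` of full resolution up to
    full resolution, and the primitive budget grows roughly in proportion to
    the pixel count so the number of Gaussians keeps pace with what is fit.
    """
    progress = min(1.0, step / (0.8 * total_steps))     # reach full res at 80% of training
    scale = min_scale * (1.0 / min_scale) ** progress   # geometric ramp in [min_scale, 1]
    h = max(1, int(round(full_res[0] * scale)))
    w = max(1, int(round(full_res[1] * scale)))
    budget = int(max_primitives * scale ** 2)           # ~proportional to pixels
    return h, w, budget

for step in (0, 5_000, 15_000, 30_000):
    print(step, dash_schedule(step, total_steps=30_000))
```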


Poster #48
Efficient Decoupled Feature 3D Gaussian Splatting via Hierarchical Compression

Zhenqi Dai · Ting Liu · Yanning Zhang

Efficient 3D scene representation has become a key challenge with the rise of 3D Gaussian Splatting (3DGS), particularly when incorporating semantic information into the scene representation. Existing 3DGS-based methods embed both color and high-dimensional semantic features into a single field, leading to significant storage and computational overhead. To mitigate this, we propose Decoupled Feature 3D Gaussian Splatting (DF-3DGS), a novel method that decouples the color and semantic fields, thereby reducing the number of 3D Gaussians required for semantic representation. We then introduce a hierarchical compression strategy that first employs our novel quantization approach with dynamic codebook evolution to reduce data size, followed by a scene-specific autoencoder for further compression of the semantic feature dimensions. This multi-stage approach results in a compact representation that enhances both storage efficiency and reconstruction speed. Experimental results demonstrate that DF-3DGS outperforms previous 3DGS-based methods, achieving faster training and rendering times while requiring less storage, without sacrificing performance; in fact, it improves performance on the novel view semantic segmentation task. Specifically, DF-3DGS achieves remarkable improvements over Feature 3DGS, reducing training time by 10$\times$ and storage by 20$\times$, while improving the mIoU of novel view semantic segmentation by 4%. The code will be publicly available.
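The two compression stages can be sketched in spirit as nearest-codebook quantization of per-Gaussian semantic features followed by a small autoencoder that shrinks the feature dimension. The codebook initialization, sizes, and the lack of dynamic codebook evolution are simplifications, not the paper's method.

```python
import torch
import torch.nn as nn

def quantize(features, codebook):
    """Stage 1: assign each feature to its nearest codebook entry."""
    d = torch.cdist(features, codebook)   # (N, K) pairwise distances
    idx = d.argmin(dim=1)
    return codebook[idx], idx

class FeatureAE(nn.Module):
    """Stage 2: compress quantized semantic features to a low dimension."""
    def __init__(self, dim=512, latent=16):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

feats = torch.randn(10_000, 512)                              # per-Gaussian semantic features
codebook = feats[torch.randperm(len(feats))[:256]].clone()    # naive random init
quantized, idx = quantize(feats, codebook)

ae = FeatureAE()
recon, latent = ae(quantized)
loss = nn.functional.mse_loss(recon, quantized)
loss.backward()
print(quantized.shape, latent.shape, float(loss))
```

Storing only the codebook indices plus the low-dimensional latents (and the small decoder) is what makes the representation compact.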


Poster #49
SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting

Jiahui Zhang · Fangneng Zhan · Ling Shao · Shijian Lu

Anchor-based 3D Gaussian splatting (3D-GS) exploits anchor features in 3D Gaussian prediction, which has achieved impressive 3D rendering quality with reduced Gaussian redundancy. On the other hand, it often encounters a dilemma among anchor features, model size, and rendering quality – large anchor features lead to large 3D models and high-quality rendering, whereas reducing anchor features degrades Gaussian attribute prediction, leading to clear artifacts in the rendered textures and geometries. We design SOGS, an anchor-based 3D-GS technique that introduces second-order anchors to achieve superior rendering quality while simultaneously reducing anchor feature size and model size. Specifically, SOGS incorporates covariance-based second-order statistics and correlation across feature dimensions to augment features within each anchor, compensating for the reduced feature size and improving rendering quality effectively. In addition, it introduces a selective gradient loss to enhance the optimization of scene textures and scene geometries, leading to high-quality rendering with small anchor features. Extensive experiments over multiple widely adopted benchmarks show that SOGS achieves superior rendering quality in novel view synthesis with clearly reduced model size.
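One plausible reading of "second-order statistics across feature dimensions" is to append pairwise products of a small anchor feature to itself; the sketch below does exactly that. This is an illustrative interpretation, and the exact statistic used in the paper may differ.

```python
import torch

def second_order_augment(anchor_feats):
    """Append second-order (outer-product) statistics to anchor features.

    anchor_feats: (N, d) small per-anchor features. The upper triangle of the
    d x d outer product captures correlations across feature dimensions and is
    concatenated to the first-order feature, trading a small d for richer
    per-anchor statistics.
    """
    n, d = anchor_feats.shape
    outer = anchor_feats.unsqueeze(2) * anchor_feats.unsqueeze(1)  # (N, d, d)
    iu = torch.triu_indices(d, d)
    second = outer[:, iu[0], iu[1]]                                # (N, d*(d+1)/2)
    return torch.cat([anchor_feats, second], dim=1)

feats = torch.randn(4096, 8)            # small anchor features
aug = second_order_augment(feats)
print(aug.shape)                        # (4096, 44) = 8 first-order + 36 second-order terms
```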


Poster #50
RestorGS: Depth-aware Gaussian Splatting for Efficient 3D Scene Restoration

Yuanjian Qiao · Mingwen Shao · Lingzhuang Meng · Kai Xu

3D Gaussian Splatting (3DGS) has recently achieved remarkable progress in novel view synthesis. However, existing methods rely heavily on high-quality data for rendering and struggle to handle degraded scenes with multi-view inconsistency, leading to inferior rendering quality. To address this challenge, we propose a novel Depth-aware Gaussian Splatting method for efficient 3D scene Restoration, called RestorGS, which flexibly restores multiple degraded scenes using a unified framework. Specifically, RestorGS consists of two core designs: Appearance Decoupling and Depth-Guided Modeling. The former exploits appearance learning over spherical harmonics to decouple clear and degraded Gaussians, thus separating the clear views from the degraded ones. Collaboratively, the latter leverages the depth information to guide the degradation modeling, thereby facilitating the decoupling process. Benefiting from the above optimization strategy, our method achieves high-quality restoration while enabling real-time rendering speed. Extensive experiments show that our RestorGS outperforms existing methods significantly in underwater, nighttime, and hazy scenes.


Poster #51
Seeing A 3D World in A Grain of Sand

Yufan Zhang · Yu Ji · Yu Guo · Jinwei Ye

We present a snapshot imaging technique for recovering 3D surrounding views of miniature scenes. Due to their intricacy, miniature scenes with objects sized in millimeters are difficult to reconstruct, yet miniatures are common in life and their 3D digitization is desirable. We design a catadioptric imaging system with a single camera and eight pairs of planar mirrors for snapshot 3D reconstruction from a dollhouse perspective. We place paired mirrors on nested pyramid surfaces for capturing surrounding multi-view images in a single shot. Our mirror design is customizable based on the size of the scene for optimized view coverage. We use the 3D Gaussian Splatting (3DGS) representation for scene reconstruction and novel view synthesis. We overcome the challenge posed by our sparse-view input by integrating a visual-hull-derived depth constraint. Our method demonstrates state-of-the-art performance on a variety of synthetic and real miniature scenes.


Poster #52
CoA: Towards Real Image Dehazing via Compression-and-Adaptation

Long Ma · Yuxin Feng · Yan Zhang · Jinyuan Liu · Weimin Wang · Guang-Yong Chen · Chengpei Xu · Zhuo Su

Learning-based image dehazing algorithms have shown remarkable success in synthetic domains. However, real image dehazing remains unresolved due to computational resource constraints and the diversity of real-world scenes. Therefore, there is an urgent need for an algorithm that excels in both efficiency and adaptability to address real image dehazing effectively. This work proposes a Compression-and-Adaptation (CoA) computational flow to tackle these challenges from a divide-and-conquer perspective. First, model compression is performed in the synthetic domain to develop a compact dehazing parameter space, satisfying efficiency demands. Then, a bilevel adaptation in the real domain is introduced to handle unknown real environments by aggregating the synthetic dehazing capabilities during the learning process. Leveraging a succinct design free from additional constraints, our CoA exhibits domain-irrelevant stability and model-agnostic flexibility, effectively bridging the model chasm between synthetic and real domains to further improve its practical utility. Extensive evaluations and analyses underscore the approach's superiority and effectiveness. The code will be made publicly available upon acceptance of this work.


Poster #53
S2D-LFE: Sparse-to-Dense Light Field Event Generation

Yutong Liu · Wenming Weng · Yueyi Zhang · Zhiwei Xiong

For the first time to our knowledge, S2D-LFE enables arbitrary novel view synthesis only from sparse-view light field event (LFE) data, and addresses three critical challenges for the LFE generation task: simplicity, controllability, and consistency. The simplicity aspect eliminates the dependency on frame-based modality, which often suffers from motion blur and low frame-rate limitations. The controllability aspect enables precise view synthesis under sparse LFE conditions with view-related constraints. The consistency aspect ensures both cross-view and temporal coherence in the generated results. To realize S2D-LFE, we develop a novel diffusion-based generation network with two key components. First, we design an LFE-customized variational auto-encoder that effectively compresses and reconstructs LFE by integrating cross-view information. Second, we design an LFE-aware injection adaptor to extract comprehensive geometric and texture priors. Furthermore, we construct a large-scale synthetic LFE dataset containing 162 one-minute sequences using a simulator, and capture a real-world test set using our custom-built sparse LFE acquisition system, covering diverse indoor and outdoor scenes. Extensive experiments demonstrate that S2D-LFE successfully generates up to $9\times9$ dense LFE from $2\times2$ sparse inputs and outperforms existing methods on both synthetic and real-world data.


Poster #54
Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction

Li Fang · Hao Zhu · Longlong Chen · Fei Hu · Long Ye · Zhan Ma

Recent advancements in generalizable novel view synthesis have achieved impressive quality through interpolation between nearby views. However, rendering high-resolution images remains computationally intensive due to the need for dense sampling of all rays. Observing the piecewise smooth nature of natural scenes, we find that sampling all rays is redundant for novel view synthesis. Inspired by plenoptic sampling theory, we propose a bundle sampling strategy. By grouping adjacent rays into a bundle and sampling them collectively, a shared representation is generated for decoding all rays within the bundle. For regions with high-frequency content, such as edges and depth discontinuities, more samples along depth are used to capture finer details. To further optimize efficiency, we introduce a depth-guided adaptive sampling strategy, which dynamically allocates samples based on depth confidence—concentrating more samples in complex regions and reducing them in smoother areas. This dual approach significantly accelerates rendering. Applied to ENeRF, our method achieves up to a 1.27 dB PSNR improvement and a 47% increase in FPS on the DTU dataset. Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art rendering quality and up to $2 \times$ faster rendering compared to existing generalizable methods. Code and trained models will be released upon acceptance.
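The depth-guided allocation step can be illustrated with a small sketch: rays are grouped into pixel tiles (bundles), and tiles with low depth confidence receive more samples along depth. The tile size, sample bounds, and the min-confidence rule are assumptions, not the paper's exact strategy.

```python
import numpy as np

def allocate_samples(depth_confidence, tile=4, min_s=8, max_s=64):
    """Per-bundle depth-sample counts from a (H, W) depth-confidence map in [0, 1].

    Rays are grouped into tile x tile bundles; bundles whose minimum confidence
    is low (e.g. around edges or depth discontinuities) receive more samples
    along depth, while smooth, confident regions receive fewer.
    """
    h, w = depth_confidence.shape
    conf = depth_confidence[: h - h % tile, : w - w % tile]
    conf = conf.reshape(h // tile, tile, w // tile, tile).min(axis=(1, 3))
    counts = min_s + (1.0 - conf) * (max_s - min_s)
    return counts.round().astype(int)

rng = np.random.default_rng(0)
conf_map = rng.uniform(0.2, 1.0, size=(64, 64))
samples = allocate_samples(conf_map)
print(samples.shape, samples.min(), samples.max(), samples.mean())
```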


Poster #55
FrugalNeRF: Fast Convergence for Extreme Few-shot Novel View Synthesis without Learned Priors

Chin-Yang Lin · Chung-Ho Wu · Changhan Yeh · Shih Han Yen · Cheng Sun · Yu-Lun Liu

Neural Radiance Fields (NeRF) face significant challenges in extreme few-shot scenarios, primarily due to overfitting and long training times. Existing methods, such as FreeNeRF and SparseNeRF, use frequency regularization or pre-trained priors but struggle with complex scheduling and bias. We introduce FrugalNeRF, a novel few-shot NeRF framework that leverages weight-sharing voxels across multiple scales to efficiently represent scene details. Our key contribution is a cross-scale geometric adaptation scheme that selects pseudo ground truth depth based on reprojection errors across scales. This guides training without relying on externally learned priors, enabling full utilization of the training data. It can also integrate pre-trained priors, enhancing quality without slowing convergence. Experiments on LLFF, DTU, and RealEstate-10K show that FrugalNeRF outperforms other few-shot NeRF methods while significantly reducing training time, making it a practical solution for efficient and accurate 3D scene reconstruction.
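The cross-scale selection rule described above reduces to an argmin over precomputed reprojection errors; a minimal sketch is given below. The assumption that per-scale depths and errors are already available is mine, and the function name is illustrative.

```python
import numpy as np

def select_pseudo_depth(depth_candidates, reproj_errors):
    """Pick, per ray, the depth from the scale with the lowest reprojection error.

    depth_candidates: (S, R) depths rendered at S voxel scales for R rays.
    reproj_errors:    (S, R) photometric reprojection error of each candidate
                      when warped into a neighboring training view.
    Returns (R,) pseudo ground-truth depths that can supervise all scales.
    """
    best_scale = reproj_errors.argmin(axis=0)                  # (R,)
    return depth_candidates[best_scale, np.arange(depth_candidates.shape[1])]

rng = np.random.default_rng(0)
depths = rng.uniform(1.0, 5.0, size=(3, 1024))   # 3 scales, 1024 rays
errors = rng.uniform(0.0, 1.0, size=(3, 1024))
pseudo = select_pseudo_depth(depths, errors)
print(pseudo.shape)                               # (1024,)
```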


Poster #56
MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World

Ankit Dhiman · Manan Shah · R. Venkatesh Babu

Diffusion models have become central to various image editing tasks, yet they often fail to fully adhere to physical laws, particularly with effects like shadows, reflections, and occlusions. In this work, we address the challenge of generating photorealistic mirror reflections using diffusion-based generative models. Despite extensive training data, existing diffusion models frequently overlook the nuanced details crucial to authentic mirror reflections. Recent approaches have attempted to resolve this by creating synthetic datasets and framing reflection generation as an inpainting task; however, they struggle to generalize across different object orientations and positions relative to the mirror. Our method overcomes these limitations by introducing key augmentations into the synthetic data pipeline: (1) random object positioning, (2) randomized rotations, and (3) grounding of objects, significantly enhancing generalization across poses and placements. To further address spatial relationships and occlusions in scenes with multiple objects, we implement a strategy to pair objects during dataset generation, resulting in a dataset robust enough to handle these complex scenarios. Achieving generalization to real-world scenes remains a challenge, so we introduce a three-stage training curriculum to train a conditional generative model, aimed at improving real-world performance. We provide extensive qualitative and quantitative evaluations to support our approach, and the code and data will be released for research purposes.


Poster #57
Highlight
Matrix3D: Large Photogrammetry Model All-in-One

Yuanxun Lu · Jingyang Zhang · Tian Fang · Jean-Daniel Nahmias · Yanghai Tsin · Long Quan · Xun Cao · Yao Yao · Shiwei Li

We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, using just a single model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation.


Poster #58
SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs

Guibiao Liao · Qing Li · Zhenyu Bao · Guoping Qiu · KANGLIN LIU

3D Gaussian Splatting-based indoor open-world free-view synthesis approaches have shown significant performance with dense input images. However, they exhibit poor performance when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To address these challenges, we propose SPC-GS, leveraging Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) Regularization for open-world free-view synthesis with sparse inputs. Specifically, SGI provides a dense, scene-layout-based Gaussian distribution by utilizing view-changed images generated from the video generation model and view-constrained Gaussian point densification. Additionally, SPC mitigates limited view supervision by employing semantic-prompt-based consistency constraints developed by SAM2. This approach leverages available semantics from training views, serving as instructive prompts, to optimize visually overlapping regions in novel views with 2D and 3D consistency constraints. Extensive experiments demonstrate the superior performance of SPC-GS across Replica and ScanNet benchmarks. Notably, our SPC-GS achieves a 3.06 dB gain in PSNR for reconstruction quality and a 7.3% improvement in mIoU for open-world semantic segmentation.


Poster #59
Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency

Hyunho Ha · Lei Xiao · Christian Richardt · Thu Nguyen-Phuoc · Changil Kim · Min H. Kim · Douglas Lanman · Numair Khan

We introduce a novel geometry-guided online video view synthesis method with enhanced view and temporal consistency. Traditional approaches achieve high-quality synthesis from dense multi-view camera setups but require significant computational resources. In contrast, selective-input methods reduce this cost but often compromise quality, leading to multi-view and temporal inconsistencies such as flickering artifacts. Our method addresses this challenge to deliver efficient, high-quality novel-view synthesis with view and temporal consistency. The key innovation of our approach lies in using global geometry to guide an image-based rendering pipeline. To accomplish this, we progressively refine depth maps using color difference masks across time. These depth maps are then accumulated through truncated signed distance fields (TSDF) in the synthesized view's image space. This depth representation is view and temporally consistent, and is used to guide a pre-trained blending network that fuses multiple forward-rendered input-view images. Thus, the network is encouraged to output geometrically consistent synthesis results across multiple views and time. Our approach achieves consistent, high-quality video synthesis, while running efficiently in an online manner.
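The depth-accumulation step described above can be illustrated with a generic per-pixel TSDF running update in the synthesized view's image space. The truncation distance, binary weighting, and grid sizes are generic choices for illustration, not the paper's settings.

```python
import numpy as np

def tsdf_update(tsdf, weight, depth_obs, depth_ray, trunc=0.05):
    """Fuse one refined depth map into a per-pixel TSDF along each ray.

    tsdf, weight: (H, W, D) running TSDF values and fusion weights.
    depth_obs:    (H, W) refined depth for the current frame.
    depth_ray:    (D,) sampled depths along each pixel ray.
    """
    sdf = depth_obs[..., None] - depth_ray[None, None, :]    # signed distance to surface
    tsdf_obs = np.clip(sdf / trunc, -1.0, 1.0)
    valid = sdf > -trunc                                     # free space or near the surface
    w_obs = valid.astype(np.float32)
    new_w = weight + w_obs
    tsdf = np.where(new_w > 0,
                    (tsdf * weight + tsdf_obs * w_obs) / np.maximum(new_w, 1e-6),
                    tsdf)
    return tsdf, new_w

H, W, D = 120, 160, 64
tsdf = np.zeros((H, W, D), np.float32)
weight = np.zeros((H, W, D), np.float32)
depth_ray = np.linspace(0.5, 4.0, D).astype(np.float32)
for _ in range(5):                                           # fuse five frames
    depth_obs = np.full((H, W), 2.0, np.float32)
    tsdf, weight = tsdf_update(tsdf, weight, depth_obs, depth_ray)
print(tsdf.shape, float(tsdf[60, 80].min()), float(tsdf[60, 80].max()))
```

Because the running average smooths noisy per-frame depths over time, the resulting depth guidance stays consistent across views and frames.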


Poster #60
EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis

Sheng Miao · Jiaxin Huang · Dongfeng Bai · Xu Yan · Hongyu Zhou · Yue Wang · Bingbing Liu · Andreas Geiger · Yiyi Liao

Novel view synthesis of urban scenes is essential for autonomous driving-related applications. Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner. Unlike existing feed-forward, pixel-aligned 3DGS methods, which often suffer from issues like multi-view inconsistencies and duplicated content, our approach predicts 3D Gaussians across multiple frames within a unified volume using a 3D convolutional network. This is achieved by initializing 3D Gaussians with noisy depth predictions, and then refining their geometric properties in 3D space and predicting color based on 2D textures. Our model also handles distant views and the sky with a flexible hemisphere background model. This enables us to perform fast, feed-forward reconstruction while achieving real-time rendering. Experimental evaluations on the KITTI-360 and Waymo datasets show that our method achieves state-of-the-art quality compared to existing feed-forward 3DGS- and NeRF-based methods.


Poster #61
MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention

Yuhan Wang · Fangzhou Hong · Shuai Yang · Liming Jiang · Wayne Wu · Chen Change Loy

Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called **mesh attention** to enable training at $1024^2$ resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, **MEAT**. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods. Code and model will be publicly available.


Poster #62
Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views

Jiang Wu · Rui Li · Yu Zhu · Rong Guo · Jinqiu Sun · Yanning Zhang

We present a Gaussian Splatting method for surface reconstruction using sparse input views. Previous methods relying on dense views struggle with extremely sparse Structure-from-Motion points for initialization. While learning-based Multi-view Stereo (MVS) provides dense 3D points, directly combining it with Gaussian Splatting leads to suboptimal results due to the ill-posed nature of sparse-view geometric optimization. We propose Sparse2DGS, an MVS-initialized Gaussian Splatting pipeline for complete and accurate reconstruction. Our key insight is to incorporate geometric-prioritized enhancement schemes, allowing for direct and robust geometric learning under ill-posed conditions. As the first method of this kind, Sparse2DGS outperforms existing methods by notable margins, with a Chamfer Distance error of 1.13 compared to 2DGS (2.81) on the DTU dataset using 3 views. Meanwhile, our method is 2× faster than the NeRF-based fine-tuning approach.


Poster #63
Highlight
NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction

Wenyuan Zhang · Emily Yue-ting Jia · Junsheng Zhou · Baorui Ma · Kanle Shi · Yu-Shen Liu · Zhizhong Han

Recently, it has been shown that priors are vital for neural implicit functions to reconstruct high-quality surfaces from multi-view RGB images. However, current priors require large-scale pre-training, and merely provide geometric clues without considering the importance of color. In this paper, we present NeRFPrior, which adopts a neural radiance field as a prior to learn signed distance fields using volume rendering for surface reconstruction. Our NeRF prior can provide both geometric and color clues, and it can be trained quickly on the same scene without additional data. Based on the NeRF prior, we learn a signed distance function (SDF) by explicitly imposing a multi-view consistency constraint on each ray intersection for surface inference. Specifically, at each ray intersection, we use the density in the prior as a coarse geometry estimation, while using the color near the surface as a clue to check its visibility from another view angle. For the textureless areas where the multi-view consistency constraint does not work well, we further introduce a depth consistency loss with confidence weights to infer the SDF. Our method outperforms the state-of-the-art methods on widely used benchmarks. The source code will be publicly available.


Poster #64
Efficient Video Super-Resolution for Real-time Rendering with Decoupled G-buffer Guidance

Mingjun Zheng · Long Sun · Jiangxin Dong · Jinshan Pan

Latency is a key concern for real-time rendering applications, making super-resolution techniques increasingly popular for accelerating the rendering process. In contrast to existing methods that directly concatenate low-resolution frames and G-buffers as input without discrimination, we develop an asymmetric UNet-based super-resolution network with decoupled G-buffer guidance, dubbed RDG, to facilitate spatial and temporal feature exploration while minimizing performance overheads and latency. We first propose a dynamic feature modulator (DFM) to selectively encode spatial information for capturing precise structural information. We then incorporate auxiliary G-buffer information to guide the decoder to generate detail-rich, temporally stable results. Specifically, we adopt a high-frequency feature booster (HFB) to adaptively transfer the high-frequency information from the normal and bidirectional reflectance distribution function (BRDF) components of the G-buffer, enhancing the details of the generated results. To further enhance the temporal stability, we design a cross-frame temporal refiner (CTR) with depth and motion vector constraints to aggregate the previous and current frames. Extensive experimental results reveal that our proposed method is capable of generating high-quality and temporally stable results in real-time rendering. The proposed RDG-s produces 1080P rendering results on an RTX 3090 GPU at 126 FPS.


Poster #65
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting

Sangwoon Kwak · Joonsoo Kim · Jun Young Jeong · Won-Sik Cheong · Jihyong Oh · Munchurl Kim

3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts to adapt it for dynamic scenes. While achieving high rendering quality and speed, the existing methods struggle with storage demands and representing complex real-world motions. To tackle these issues, we propose MoDec-GS, a memory-efficient Gaussian splatting framework for reconstructing novel views in challenging scenarios with complex motions. We introduce Global-to-Local Motion Decomposition (GLMD) to effectively capture dynamic motions in a coarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), extending the static Scaffold representation to dynamic video reconstruction. For the Global CS, we propose Global Anchor Deformation (GAD) to efficiently represent global dynamics along complex motions by directly deforming the implicit Scaffold attributes, namely the anchor position, offset, and local context features. Next, we explicitly refine local motions via Local Gaussian Deformation (LGD) of the Local CS. Additionally, we introduce Temporal Interval Adjustment (TIA) to automatically control the temporal coverage of each Local CS during training, allowing MoDec-GS to find optimal interval assignments based on the specified number of temporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an average 70% reduction in model size over state-of-the-art methods for dynamic 3D Gaussians from real-world dynamic videos while maintaining or even improving rendering quality. Our code will be available online at the time of publication.


Poster #66
RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance

Yuheng Jiang · Zhehao Shen · Chengcheng Guo · Yu Hong · Zhuo Su · Yingliang Zhang · Marc Habermann · Lan Xu

Human-centric volumetric videos offer immersive free-viewpoint experiences, yet existing methods either focus on replaying general dynamic scenes or animating human avatars under new motions, limiting their ability to re-perform general dynamic scenes. In this paper, we present RePerformer, a novel Gaussian-based representation that unifies playback and re-performance for high-fidelity human-centric volumetric videos. Specifically, we hierarchically disentangle the dynamic scenes into motion Gaussians and appearance Gaussians which are associated in the canonical space. We further employ a Morton-based parameterization to efficiently encode the appearance Gaussians into 2D position and attribute maps. For enhanced generalization, we adopt 2D CNNs to map position maps to attribute maps, which can be assembled into appearance Gaussians for high-fidelity rendering of the dynamic scenes. For re-performance, we develop a semantic-aware alignment module and apply deformation transfer on motion Gaussians, enabling photo-real rendering under novel motions. Extensive experiments validate the robustness and effectiveness of RePerformer, setting a new benchmark for the playback-then-reperformance paradigm in human-centric volumetric videos.
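Morton (Z-order) parameterization is a standard technique for laying spatially ordered 3D points into a 2D map; the sketch below shows the bit-interleaving and the layout step. The grid resolution and map size are illustrative, and the actual attribute encoding in the paper is not reproduced here.

```python
import numpy as np

def part1by2(x):
    """Spread the bits of a 10-bit integer so they occupy every third bit."""
    x = x & 0x3FF
    x = (x ^ (x << 16)) & 0xFF0000FF
    x = (x ^ (x << 8)) & 0x0300F00F
    x = (x ^ (x << 4)) & 0x030C30C3
    x = (x ^ (x << 2)) & 0x09249249
    return x

def morton3d(q):
    """Interleave the bits of quantized (x, y, z) integer coordinates."""
    return part1by2(q[:, 0]) | (part1by2(q[:, 1]) << 1) | (part1by2(q[:, 2]) << 2)

# Quantize Gaussian centers to a 1024^3 grid, sort by Morton code, and lay the
# sorted order out row-major in a 2D map so nearby Gaussians land in nearby pixels.
rng = np.random.default_rng(0)
centers = rng.uniform(-1.0, 1.0, size=(4096, 3))
q = np.clip(((centers + 1.0) / 2.0 * 1023).astype(np.int64), 0, 1023)
order = np.argsort(morton3d(q))
map_hw = 64                                    # 64 x 64 = 4096 entries
position_map = centers[order].reshape(map_hw, map_hw, 3)
print(position_map.shape)
```

The spatial locality of the Z-order curve is what makes the resulting 2D maps amenable to 2D CNN processing.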


Poster #67
DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction

Miaowei Wang · Yibo Zhang · Rui Ma · Weiwei Xu · Changqing Zou · Daniel Morris

We present DecoupledGaussian, a novel system that decouples static objects from their contacted surfaces captured in in-the-wild videos, a key prerequisite for realistic Newtonian-based physical simulations. Unlike prior methods focused on synthetic data or elastic jittering along the contact surface, which prevent objects from fully detaching or moving independently, DecoupledGaussian allows for significant positional changes without being constrained by the initial contacted surface. Recognizing the limitations of current 2D inpainting tools for restoring 3D locations, our approach uses joint Poisson fields to repair and expand the Gaussians of both objects and contacted scenes after separation. This is complemented by a multi-carve strategy to refine the object's geometry. Our system enables realistic simulations of decoupling motions, collisions, and fractures driven by user-specified impulses, supporting complex interactions within and across multiple scenes. We validate DecoupledGaussian through a comprehensive user study and quantitative benchmarks. This system enhances digital interaction with objects and scenes in real-world environments, benefiting industries such as VR, robotics, and autonomous driving. The code is available in the supplementary material and will be released publicly upon acceptance.


Poster #68
Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields

Navami Kairanda · Marc Habermann · Shanthika Shankar Naik · Christian Theobalt · Vladislav Golyanik

3D reconstruction of highly deformable surfaces (e.g. cloths) from monocular RGB videos is a challenging problem, and no solution provides a consistent and accurate recovery of fine-grained surface details. To account for the ill-posed nature of the setting, existing methods use deformation models with statistical, neural, or physical priors. They also predominantly rely on nonadaptive discrete surface representations (e.g. polygonal meshes), perform frame-by-frame optimisation leading to error propagation, and suffer from poor gradients of the mesh-based differentiable renderers. Consequently, fine surface details such as cloth wrinkles are often not recovered with the desired accuracy. In response to these limitations, we propose Thin-Shell-SfT, a new method for non-rigid 3D tracking that represents a surface as an implicit and continuous spatiotemporal neural field. We incorporate continuous thin shell simulation based on the Kirchhoff-Love model for spatial regularisation, which starkly contrasts the discretised alternatives of earlier works. Lastly, we leverage 3D Gaussian splatting to differentiably render the surface into image space and optimise the deformations based on analysis-by-synthesis principles. Our Thin-Shell-SfT method outperforms prior work qualitatively and quantitatively thanks to our continuous surface formulation in conjunction with a specially tailored simulation prior and joint space-time optimisation.


Poster #69
Co-Speech Gesture Video Generation with Implicit Motion-Audio Entanglement

Xinjie Li · Ziyi Chen · Xinlu Yu · Iek-Heng Chu · Peng Chang · Jing Xiao

Co-speech gestures are essential to non-verbal communication, enhancing both the naturalness and effectiveness of human interaction. Although recent methods have made progress in generating co-speech gesture videos, many rely on strong visual controls, such as pose images or TPS keypoint movements, which often lead to artifacts like blurry hands and distorted fingers. In response to these challenges, we present the Implicit Motion-Audio Entanglement (IMAE) method for co-speech gesture video generation. IMAE strengthens audio control by entangling implicit motion parameters, including pose and expression, with audio inputs. Our method utilizes a two-branch framework that combines an audio-to-motion generation branch with a video diffusion branch, enabling realistic gesture generation without requiring additional inputs during inference. To improve training efficiency, we propose a two-stage slow-fast training strategy that balances memory constraints while facilitating the learning of meaningful gestures from long frame sequences. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple metrics.


Poster #70
Highlight
QuCOOP: A Versatile Framework for Solving Composite and Binary-Parametrised Problems on Quantum Annealers

Natacha Kuete Meli · Vladislav Golyanik · Marcel Seelbach Benkner · Michael Moeller

There is growing interest in solving computer vision problems such as mesh or point set alignment using Adiabatic Quantum Computing (AQC). Unfortunately, modern experimental AQC devices such as D-Wave only support Quadratic Unconstrained Binary Optimization (QUBO) problems, which severely limits their applicability. This paper proposes a new way to overcome this limitation and introduces QuCOOP, an optimization framework extending the scope of AQC to composite and binary-parameterized, possibly non-quadratic problems. The key idea of QuCOOP is to iteratively approximate the original objective function by a sequence of local (intermediate) QUBO forms, whose binary parameters can be sampled on AQC devices. We experiment with quadratic assignment problems, shape matching and point set registration without knowing the correspondences in advance. Our approach achieves state-of-the-art results across multiple instances of tested problems.


Poster #71
Highlight
Image Reconstruction from Readout-Multiplexed Single-Photon Detector Arrays

Shashwath Bharadwaj · Ruangrawee Kitichotkul · Akshay Agarwal · Vivek K Goyal

Readout multiplexing is a promising solution to overcome hardware limitations and data bottlenecks in imaging with single-photon detectors. Conventional multiplexed readout processing creates an upper bound on photon counts at a very fine time scale, where measurements with multiple detected photons must either be discarded or allowed to introduce significant bias. We formulate multiphoton coincidence resolution as an inverse imaging problem and introduce a solution framework to probabilistically resolve the spatial locations of photon incidences. Specifically, we develop a theoretical abstraction of row-column multiplexing and a model of photon events that make readouts ambiguous. Using this, we propose a novel estimator that spatially resolves up to four coincident photons. Our estimator achieves a 3 to 4 dB increase in the peak signal-to-noise ratio of image reconstruction compared to traditional methods at higher incident photon fluxes. Additionally, this method achieves an approximately 4$\times$ reduction in the required number of readout frames to achieve the same mean-squared error as other methods. Finally, our solution matches the Cramer-Rao bound for detection probability estimation for a wider range of incident flux values compared to conventional methods. While demonstrated for a specific detector type and readout architecture, this method can be extended to more general multiplexing with different detector models.


Poster #72
Spk2SRImgNet: Super-Resolve Dynamic Scene from Spike Stream via Motion Aligned Collaborative Filtering

Yuanlin Wang · Yiyang Zhang · Ruiqin Xiong · Jing Zhao · Jian Zhang · Xiaopeng Fan · Tiejun Huang

The spike camera is a kind of neuromorphic camera that records dynamic scenes by firing a stream of binary spikes with extremely high temporal resolution. It demonstrates great potential for vision tasks in high-speed scenarios. One limitation in its current implementation is the relatively low spatial resolution. This paper develops a network called Spk2SRImgNet to super-resolve high-resolution images from a low-resolution spike stream. However, fluctuations in the spike stream hinder the performance of spike camera super-resolution. To address this issue, we propose a motion aligned collaborative filtering (MACF) module, which is motivated by key ideas in classic image restoration schemes to mitigate fluctuations in spike data. MACF leverages the temporal similarity of the spike stream to acquire similar features from neighboring moments via motion alignment. To separate disturbances from features, MACF filters these similar features jointly in the transform domain to exploit representation sparsity, and generates refinement features that are used to update the initial fluctuated features. Specifically, MACF designs an inverse motion alignment operation to map these refinement features back to their original positions. The initial features are aggregated with the repositioned refinement features to enhance reliability. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared with existing methods. The code will be made publicly available.


Poster #73
Highlight
EventPSR: Surface Normal and Reflectance Estimation from Photometric Stereo Using an Event Camera

Bohan Yu · Jin Han · Boxin Shi · Imari Sato

Simultaneous acquisition of surface normals and reflectance parameters is a crucial but challenging task in the fields of computer vision and graphics. Existing methods based on frame-based cameras require capturing multiple high dynamic range (HDR) images. In this paper, we propose EventPSR, the first work to recover surface normal and reflectance parameters (e.g., metallic and roughness) simultaneously using an event camera. Compared with the existing methods based on photometric stereo or neural radiance fields, EventPSR is a robust and efficient approach that works consistently with different materials. Thanks to the extremely high temporal resolution and high dynamic range coverage of event cameras, EventPSR can recover accurate surface normal and reflectance of objects with various materials in 10 seconds. Extensive experiments on both synthetic data and real objects show that compared with existing methods using more than 100 HDR images, EventPSR recovers comparable surface normal and reflectance parameters with only about 30% of the data rate.


Poster #74
PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting

Cheng Zhang · Haofei Xu · Qianyi Wu · Camilo Cruz Gambardella · Dinh Phung · Jianfei Cai

With the advent of portable 360° cameras, panorama has gained significant attention in applications like virtual reality (VR), virtual tours, robotics, and autonomous driving. As a result, wide-baseline panorama view synthesis has emerged as a vital task, where high resolution, fast inference, and memory efficiency are essential. Nevertheless, existing methods typically focus on lower resolutions ($512 \times 1024$) due to demanding memory and computational requirements. In this paper, we present $\textbf{PanSplat}$, a generalizable, feed-forward approach that efficiently supports $\textbf{resolution up to 4K}$ ($2048 \times 4096$). Our approach features a tailored spherical 3D Gaussian pyramid with a Fibonacci lattice arrangement, enhancing image quality while reducing information redundancy. To accommodate the demands of high resolution, we propose a pipeline that integrates a hierarchical spherical cost volume and localized Gaussian heads, enabling two-step deferred backpropagation for memory-efficient training on a single A100 GPU. Experiments demonstrate that PanSplat achieves state-of-the-art results with superior efficiency and image quality across both synthetic and real-world datasets.
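The Fibonacci lattice mentioned above is a standard construction for near-uniform directions on a sphere; a minimal sketch is shown below. How PanSplat attaches its spherical Gaussian pyramid to these directions is not reproduced here, and the point count is arbitrary.

```python
import numpy as np

def fibonacci_sphere(n):
    """Return n near-uniformly distributed unit vectors via the Fibonacci lattice."""
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    z = 1 - (2 * i + 1) / n                  # evenly spaced heights in (-1, 1)
    theta = 2 * np.pi * i / golden           # golden-ratio increments in azimuth
    r = np.sqrt(np.maximum(0.0, 1 - z ** 2))
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

dirs = fibonacci_sphere(10_000)
print(dirs.shape, np.abs(np.linalg.norm(dirs, axis=1) - 1).max())  # ~unit norm
```

Compared with an equirectangular grid, this arrangement avoids oversampling near the poles, which is the redundancy-reduction argument made in the abstract.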


Poster #75
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge

Xuan Shen · Weize Ma · Jing Liu · Changdi Yang · Rui Ding · Quanyi Wang · Henghui Ding · Wei Niu · Yanzhi Wang · Pu Zhao · Jun Lin · Jiuxiang Gu

Monocular Depth Estimation (MDE) has emerged as a pivotal task in computer vision, supporting numerous real-world applications. However, deploying high-performing depth estimation models on resource-constrained edge devices, especially Application-Specific Integrated Circuits (ASICs), remains a formidable challenge due to the substantial computational and memory demands of state-of-the-art models. Recent advancements in foundational depth estimation deliver impressive results but further amplify the difficulty of deployment on ASICs. To address this, we propose QuartDepth, which adopts post-training quantization to optimize and accelerate MDE models specifically for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, significantly reducing the model size and computation cost. To mitigate the performance degradation typically associated with aggressive quantization, we introduce an activation polishing and compensation algorithm applied before and after activation quantization, as well as a weight reconstruction method for minimizing errors in weight quantization. Furthermore, we design a novel flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability, enhancing throughput and efficiency. Experimental results demonstrate that our proposed framework achieves competitive accuracy while enabling fast inference and higher energy efficiency on ASICs, bridging the gap between high-performance depth estimation and practical edge-device applicability.
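As a baseline reference for 4-bit weight quantization, the sketch below applies plain symmetric per-channel round-to-nearest; it deliberately omits the paper's activation polishing, compensation, and weight-reconstruction steps, and the function name is illustrative.

```python
import torch

def quantize_weight_4bit(w):
    """Symmetric per-output-channel 4-bit quantization of a weight matrix.

    w: (out_features, in_features). Returns integer codes in [-8, 7], the
    per-channel scales, and the dequantized weights for measuring error.
    """
    qmax = 7  # the 4-bit signed range is [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q.to(torch.int8), scale, q * scale

w = torch.randn(256, 512)
q, scale, w_hat = quantize_weight_4bit(w)
err = (w - w_hat).abs().mean()
print(q.dtype, scale.shape, float(err))
```

The mean absolute error printed here is the kind of degradation that the polishing, compensation, and reconstruction steps described in the abstract aim to reduce.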


Poster #76
WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments

Jianhao Zheng · Zihan Zhu · Valentin Bieri · Marc Pollefeys · Songyou Peng · Iro Armeni

We present WildGS-SLAM, a robust and efficient monocular RGB SLAM system designed to handle dynamic environments by leveraging uncertainty-aware geometric mapping. Unlike traditional SLAM systems, which assume static scenes, our approach integrates depth and uncertainty information to enhance tracking, mapping, and rendering performance in the presence of moving objects. We introduce an uncertainty map, predicted by a shallow multi-layer perceptron and DINOv2 features, to guide dynamic object removal during both tracking and mapping. This uncertainty map enhances dense bundle adjustment and Gaussian map optimization, improving reconstruction accuracy. Our system is evaluated on multiple datasets and demonstrates artifact-free view synthesis. Results showcase WildGS-SLAM's superior performance in dynamic environments compared to state-of-the-art methods.


Poster #77
Continuous 3D Perception Model with Persistent State

Qianqian Wang · Yifei Zhang · Aleksander Holynski · Alexei A. Efros · Angjoo Kanazawa

We propose a novel unified framework capable of solving a broad range of 3D tasks. At the core of our approach is an online stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, our method leverages the evolving state to generate metric-scale pointmaps for each input in an online manner. These pointmaps reside within a common coordinate system, accumulating into a coherent 3D scene reconstruction. Our model captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen structures beyond the coverage of the input images through a raymap probe. Our method is simple yet highly flexible, naturally accepting varying lengths of image sequences and working seamlessly with both video streams and unordered photo collections. We evaluate our method on various 3D/4D tasks including monocular/video depth estimation, camera estimation, multi-view reconstruction, and achieve competitive or state-of-the-art performance. Additionally, we showcase intriguing behaviors enabled by our state representation.


Poster #78
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos

Zhengqi Li · Richard Tucker · Forrester Cole · Qianqian Wang · Linyi Jin · Vickie Ye · Angjoo Kanazawa · Aleksander Holynski · Noah Snavely

We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of the deep visual SLAM framework, and with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times.


Poster #79
Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video

Hoang Chuong Nguyen · Wei Mao · Jose M. Alvarez · Miaomiao Liu

Neural Radiance Fields (NeRF) have demonstrated a superior capability to represent 3D geometry but require accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they map each camera to a world coordinate system. We propose a novel method that eliminates prior dependencies by modeling continuous camera motions as time-dependent angular velocity and velocity. Relative motions between cameras are learned first via velocity integration, while camera poses can be obtained by aggregating such relative motions up to a world coordinate system defined at a single time step within the video. Specifically, accurate continuous camera movements are learned through a time-dependent NeRF, which captures local scene geometry and motion by training from neighboring frames for each time step. The learned motions enable fine-tuning the NeRF to represent the full scene geometry. Experiments on Co3D and ScanNet show that our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods.
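The velocity-integration step can be illustrated with a simple sketch that assumes piecewise-constant angular and linear velocity per time step, converts each step to a relative transform via Rodrigues' formula, and chains the transforms into world poses anchored at the first frame. The constant-velocity-per-step assumption and the 4x4 pose convention are mine, not necessarily the paper's parameterization.

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: rotation vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def integrate_poses(omegas, velocities, dt):
    """Chain per-step relative motions into camera-to-world poses.

    omegas, velocities: (T, 3) angular and linear velocity per time step,
    assumed constant within each step. Returns T+1 poses as 4x4 matrices,
    with the world frame fixed at the first time step.
    """
    poses = [np.eye(4)]
    for w, v in zip(omegas, velocities):
        rel = np.eye(4)
        rel[:3, :3] = so3_exp(w * dt)
        rel[:3, 3] = v * dt
        poses.append(poses[-1] @ rel)
    return poses

rng = np.random.default_rng(0)
poses = integrate_poses(rng.normal(0, 0.1, (30, 3)), rng.normal(0, 0.5, (30, 3)), dt=1 / 30)
print(len(poses), poses[-1][:3, 3])
```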


Poster #80
Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision

Jinneyong Kim · Seung-Hwan Baek

Integrating RGB and NIR imaging provides complementary spectral information, enhancing robotic vision in challenging lighting conditions. However, existing datasets and imaging systems lack pixel-level alignment between RGB and NIR images, posing challenges for downstream tasks. In this paper, we develop a robotic vision system equipped with two pixel-aligned RGB-NIR stereo cameras and a LiDAR sensor mounted on a mobile robot. The system simultaneously captures RGB stereo images, NIR stereo images, and temporally synchronized LiDAR point clouds. Utilizing the mobility of the robot, we present a dataset containing continuous video frames with pixel-aligned RGB and NIR stereo pairs under diverse lighting conditions. We introduce two methods that utilize our pixel-aligned RGB-NIR images: an RGB-NIR image fusion method and a feature fusion method. The first approach enables existing RGB-pretrained vision models to directly utilize RGB-NIR information without fine-tuning. The second approach fine-tunes existing vision models to more effectively utilize RGB-NIR information. Experimental results demonstrate the effectiveness of using pixel-aligned RGB-NIR images across diverse lighting conditions.


Poster #81
MVSAnywhere: Zero-Shot Multi-View Stereo

Sergio Izquierdo · Mohamed Sayed · Michael Firman · Guillermo Garcia-Hernando · Daniyar Turmukhambetov · Javier Civera · Oisin Mac Aodha · Gabriel Brostow · Jamie Watson

Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths, which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.


Poster #82
Three-view Focal Length Recovery From Homographies

Yaqing Ding · Viktor Kocur · Zuzana Berger Haladova · Qianliang Wu · Shen Cai · Jian Yang · Zuzana Kukelova

In this paper, we propose a novel approach for recovering focal lengths from three-view homographies. By examining the consistency of normal vectors between two homographies, we derive new explicit constraints between the focal lengths and homographies using an elimination technique. We demonstrate that three-view homographies provide two additional constraints, enabling the recovery of one or two focal lengths. We discuss four possible cases, including three cameras having an unknown equal focal length, three cameras having two different unknown focal lengths, and three cameras where one focal length is known and the other two cameras have equal or different unknown focal lengths. All the problems can be converted into solving polynomials in one or two unknowns, which can be efficiently solved using the Sturm sequence or the hidden-variable technique. Evaluation using both synthetic and real data shows that the proposed solvers are both faster and more accurate than methods relying on existing two-view solvers.
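
For readers less familiar with the underlying geometry, constraints in solvers of this kind are typically derived from the standard plane-induced homography relation (up to scale) $H_{ij} \simeq K_j \left( R_{ij} - t_{ij} n^\top / d \right) K_i^{-1}$, where $n$ and $d$ are the plane normal and offset and $K_i$, $K_j$ contain the sought focal lengths; the notation is ours and only sketches the starting point of such derivations, not the paper's actual elimination procedure.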


Poster #83
Highlight
Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers

Ji Zhao · Banglei Guan · Zibin Liu · Laurent Kneip

For event cameras, current sparse geometric solvers for egomotion estimation assume that the rotational displacements are known, such as those provided by an IMU. Thus, they can only recover the translational motion parameters. Recovering full-DoF motion parameters using a sparse geometric solver is a more challenging task, and has not yet been investigated. In this paper, we propose several solvers to estimate both rotational and translational velocities within a unified framework. Our method leverages event manifolds induced by line segments. The problem formulations are based on either an incidence relation for lines or a novel coplanarity relation for normal vectors. We demonstrate the possibility of recovering full-DoF egomotion parameters for both angular and linear velocities without requiring extra sensor measurements or motion priors. To achieve efficient optimization, we exploit the Adam framework with a first-order approximation of rotations for quick initialization. Experiments on both synthetic and real-world data demonstrate the effectiveness of our method. The code will be made publicly available.

Self-supervised monocular depth estimation has long been treated as a point-wise prediction problem, where the depth of each pixel is usually estimated independently. However, artifacts are often observed in the estimated depth map, e.g., depth values for points located in the same region may jump dramatically. To address this issue, we propose a novel self-supervised monocular depth estimation framework called GeoDepth, where we explore the intrinsic geometric representation of the 3D scene to produce accurate and continuous depth maps. In particular, we model the complex 3D scene as a collection of planes of varying sizes, where each plane is characterized by a unique set of parameters, namely the planar normal (indicating plane orientation) and the planar offset (defining the perpendicular distance from the camera center to the plane). Under this modeling, points on the same plane are enforced to share a unique representation, and their depth variations depend only on pixel coordinates; this geometric relationship can thus be exploited to regularize the depth variations of these points. To this end, we design a structured plane generation module that introduces temporal-spatial geometric cues and the plane uniqueness principle to recover the correct scene plane representation. In addition, we develop a depth discontinuity module to dynamically identify depth discontinuity regions and subsequently optimize them. Our experiments on the KITTI and NYUv2 datasets demonstrate that GeoDepth achieves state-of-the-art performance, with additional tests on Make3D and ScanNet validating its generalization capabilities.
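
To make the planar depth relation concrete (our notation, not taken from the paper): a point on a plane with unit normal $n$ and offset $d$ in the camera frame, observed at pixel $\tilde{p} = (u, v, 1)^\top$ with intrinsics $K$, has depth $z(u, v) = d / (n^\top K^{-1} \tilde{p})$, so all points sharing the same plane parameters have depths that vary only with their pixel coordinates.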


Poster #85
R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

Xudong Jiang · Fangjinhua Wang · Silvano Galliani · Christoph Vogel · Marc Pollefeys

Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, SCR remains a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art results on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10$\times$ more accurate than previous SCR methods with a similar map size and require at least 5$\times$ smaller map sizes than any other SCR method while still delivering superior accuracy. Code will be available upon acceptance.
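
For context, the sketch below shows the standard reprojection term that scene coordinate regression methods minimize; the depth-adjusted weighting and covisibility-based encoding described in the abstract are specific to the paper and are not reproduced here, and all names are ours.

```python
import torch

def reprojection_loss(scene_coords, pixels, R, t, K, eps=1e-6):
    """Standard reprojection error for scene coordinate regression.

    scene_coords: (N, 3) predicted world-space coordinates
    pixels:       (N, 2) pixel locations the coordinates were predicted for
    R, t, K:      ground-truth camera rotation, translation, and intrinsics
    """
    cam = scene_coords @ R.T + t            # world frame -> camera frame
    z = cam[:, 2:3].clamp(min=eps)          # keep points in front of the camera
    proj = (cam @ K.T)[:, :2] / z           # pinhole projection to pixel coordinates
    return (proj - pixels).norm(dim=1).mean()
```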

In this work, we propose HyperPose, which utilizes hypernetworks in absolute camera pose regressors. The inherent appearance variations in natural scenes, attributable to environmental conditions, perspective, and lighting, induce a significant domain disparity between the training and test datasets. This disparity degrades the precision of contemporary localization networks. To mitigate this, we advocate for incorporating hypernetworks into single-scene and multi-scene camera pose regression models. During inference, the hypernetwork dynamically computes adaptive weights for the localization regression heads based on the particular input image, effectively narrowing the domain gap. Using indoor and outdoor datasets, we evaluate the HyperPose methodology across multiple established absolute pose regression architectures. In particular, we introduce and share the Extended Cambridge Landmarks (ECL), a novel localization dataset based on the Cambridge Landmarks dataset that depicts its scenes across multiple seasons with significantly varying appearance conditions. Our empirical experiments demonstrate that HyperPose yields notable performance enhancements for both single- and multi-scene architectures. We have made our source code, pre-trained models, and ECL dataset openly available.


Poster #87
Highlight
Learning to Filter Outlier Edges in Global SfM

Nicole Damblon · Marc Pollefeys · Daniel Barath

This paper introduces a novel approach to improve camera position estimation in global Structure-from-Motion (SfM) frameworks by filtering inaccurate pose graph edges, representing relative translation estimates, before applying translation averaging. In SfM, pose graph vertices represent cameras and edges represent relative poses (rotation and translation) between cameras. We formulate the edge filtering problem as vertex filtering in the dual graph: a line graph whose vertices stem from edges in the original graph and whose edges connect relative poses that share a camera. Exploiting such a representation, we frame the problem as binary classification over nodes in the dual graph. To learn such a classification and find outlier edges, we employ a Transformer-based technique. To address the challenge of memory overflow often caused by converting to a line graph, we introduce a clustering-based graph processing approach, enabling the application of our method to arbitrarily large pose graphs. The proposed method outperforms existing relative translation filtering techniques in terms of final camera position accuracy and can be seamlessly integrated with any other filter. The code will be made public.
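
As a toy illustration of the dual-graph construction (using networkx here, which is only our choice for the sketch): the line graph has one node per relative-pose edge, and two nodes are adjacent when the underlying edges share a camera, so outlier-edge filtering becomes node classification.

```python
import networkx as nx

# Toy pose graph: vertices are cameras, edges carry relative-pose estimates.
pose_graph = nx.Graph()
pose_graph.add_edges_from([(0, 1), (1, 2), (2, 0), (2, 3)])

# Dual (line) graph: one node per relative-pose edge; filtering unreliable
# relative translations is then binary node classification on this graph.
dual = nx.line_graph(pose_graph)
print(dual.number_of_nodes())  # 4 -- one node per edge of the pose graph
```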


Poster #88
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Linyi Jin · Richard Tucker · Zhengqi Li · David Fouhey · Noah Snavely · Aleksander Holynski

Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3r to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes.


Poster #89
Towards Optimizing Large-Scale Multi-Graph Matching in Bioimaging

Max Kahl · Sebastian Stricker · Lisa Hutschenreiter · Florian Bernard · Carsten Rother · Bogdan Savchynskyy

Multi-graph matching is an important problem in computer vision. Our task comes from bioimaging, where a set of 29 3D-microscopic images of worms has to be brought into correspondence. Surprisingly, virtually all existing methods are not applicable to this large-scale, real-world problem since they assume a complete or dense problem setting and have so far only been applied to small-scale, toy, or synthetic problems. Despite claims in the literature that methods addressing complete multi-graph matching are applicable in an incomplete setting, our first contribution is to prove that their runtime would be excessive and impractical. Our second contribution is a new method for incomplete multi-graph matching that applies to real-world, larger-scale problems. We experimentally show that for our bioimaging application we are able to attain results in less than two minutes, whereas the only competing approach requires at least half an hour while producing far worse results. Furthermore, even for small-scale, dense, or complete problem instances we achieve results that are at least on par with the leading methods, but an order of magnitude faster.


Poster #90
Bridging Viewpoint Gaps: Geometric Reasoning Boosts Semantic Correspondence

Qiyang Qian · Hansheng Chen · Masayoshi Tomizuka · Kurt Keutzer · Qianqian Wang · Chenfeng Xu

Finding semantic correspondences between images is a challenging problem in computer vision, particularly under significant viewpoint changes. Previous methods rely on semantic features from pre-trained 2D models like Stable Diffusion and DINOv2, which often struggle to extract viewpoint-invariant features. To overcome this, we propose a novel approach that integrates geometric and semantic reasoning. Unlike prior methods relying on heuristic geometric enhancements, our framework fine-tunes DUSt3R on synthetic cross-instance data to reconstruct distinct objects in an aligned 3D space. By learning to deform these objects into similar shapes using semantic supervision, we enable efficient KNN-based geometric matching, followed by sparse semantic matching within local KNN candidates. While trained on synthetic data, our method generalizes effectively to real-world images, achieving up to 7.4-point improvements in zero-shot settings on the rigid-body subset of Spair-71K and up to 19.6-point gains under extreme viewpoint variations. Additionally, it accelerates runtime by up to 40 times, demonstrating both its robustness to viewpoint changes and its efficiency for practical applications.


Poster #91
MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation

Aviral Chharia · Wenbo Gou · Haoye Dong

Though single-view 3D human pose estimation has gained much attention, 3D multi-view multi-person pose estimation faces several challenges, including the presence of occlusions and generalizability to new camera arrangements or scenarios. Existing transformer-based approaches often struggle to accurately model joint spatial sequences, especially in occluded scenarios. To address this, we present a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly reconstructing 3D human poses by explicitly modeling the joint spatial sequence at two distinct levels: the feature level from multi-view images and the joint level of the person. Specifically, we propose a Projective State Space (PSS) block to learn the joint spatial sequences using state space modeling. Furthermore, we modify Mamba's unidirectional scanning into an effective Grid token-guided Bidirectional scan (GTBS), which is integral to the PSS block. Experiments on multiple challenging benchmarks demonstrate that MV-SSM achieves highly accurate 3D pose estimation and generalizes across the number of cameras (+10.8 AP25 in the challenging three-camera setting on CMU Panoptic), varying camera arrangements (+7.0 AP25), and cross-dataset evaluation (+15.3 PCP on Campus A1), significantly outperforming SOTAs. The code has been submitted and will be open-sourced with model weights upon acceptance.


Poster #92
Multi-View Pose-Agnostic Change Localization with Zero Labels

Chamuditha Jayanga Galappaththige · Jason Lai · Lloyd Windrim · Donald G. Dansereau · Niko Suenderhauf · Dimity Miller

Autonomous agents often require accurate methods for detecting and localizing changes in their environment, particularly when observations are captured from unconstrained and inconsistent viewpoints. We propose a novel label-free, pose-agnostic change detection method that integrates information from multiple viewpoints to construct a change-aware 3D Gaussian Splatting (3DGS) representation of the scene. With as few as 5 images of the post-change scene, our approach can learn additional change channels in a 3DGS and produce change masks that outperform single-view techniques. Our change-aware 3D scene representation additionally enables the generation of accurate change masks for unseen viewpoints. Experimental results demonstrate state-of-the-art performance in complex multi-object scenes, achieving a 1.7$\times$ and 1.6$\times$ improvement in Mean Intersection Over Union and F1 score respectively over other baselines. We also contribute a new real-world dataset to benchmark change detection in diverse challenging scenes in the presence of lighting variations. Our code and dataset will be made publicly available upon acceptance.


Poster #93
Highlight
Structure-Aware Correspondence Learning for Relative Pose Estimation

Yihan Chen · Wenfei Yang · Huan Ren · Shifeng Zhang · Tianzhu Zhang · Feng Wu

Relative pose estimation provides a promising way of achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans' ability to assemble two object parts that have small or no overlapping regions by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of keypoints that can represent the structure of objects with different shapes and appearances, under the guidance of a keypoint-based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching for precise relative pose estimation. Experimental results on the CO3D, Objaverse and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, e.g., with a $5.7^\circ$ reduction in mean angular error on the CO3D dataset.


Poster #94
Co-op: Correspondence-based Novel Object Pose Estimation

Sungphill Moon · Hyeontae Son · Dongcheol Hur · Sangwook Kim

We propose Co-op, a novel method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image. Our method requires only the CAD model of the target object and can precisely estimate its pose without any additional fine-tuning. While existing model-based methods suffer from inefficiency due to using a large number of templates, our method enables fast and accurate estimation with a small number of templates. This improvement is achieved by finding semi-dense correspondences between the input image and the pre-rendered templates. Our method achieves strong generalization performance by leveraging a hybrid representation that combines patch-level classification and offset regression. Additionally, our pose refinement model estimates probabilistic flow between the input image and the rendered image, refining the initial estimate to an accurate pose using a differentiable PnP layer. We demonstrate that our method not only estimates object poses rapidly but also outperforms existing methods by a large margin on the seven core datasets of the BOP Challenge, achieving state-of-the-art accuracy.


Poster #95
Any6D: Model-free 6D Pose Estimation of Novel Object

Taeyeop Lee · Bowen Wen · Minjun Kang · Gyuree Kang · In So Kweon · Kuk-Jin Yoon

We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric size estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on four challenging datasets: REAL275, Toyota-Light, HO3D, and YCBINEOAT, demonstrating its effectiveness in significantly outperforming state-of-the-art methods for novel object pose estimation.


Poster #96
Highlight
CRISP: Object Pose and Shape Estimation with Test-Time Adaptation

Jingnan Shi · Rajat Talak · Harry Zhang · David Jin · Luca Carlone

We consider the problem of estimating object pose and shape from an RGB-D image. Our first contribution is to introduce CRISP, a category-agnostic object pose and shape estimation pipeline. The pipeline implements an encoder-decoder model for shape estimation. It uses FiLM-conditioning for implicit shape reconstruction and a DPT-based network for estimating pose-normalized points for pose estimation. As a second contribution, we propose an optimization-based pose and shape corrector that can correct estimation errors caused by a domain gap. Observing that the shape decoder is well behaved in the convex hull of known shapes, we approximate the shape decoder with an active shape model, and show that this reduces the shape correction problem to a constrained linear least squares problem, which can be solved efficiently by an interior point algorithm. Third, we introduce a self-training pipeline to perform self-supervised domain adaptation of CRISP. The self-training is based on a correct-and-certify approach, which leverages the corrector to generate pseudo-labels at test time, and uses them to self-train CRISP. We demonstrate CRISP (and the self-training) on YCBV, SPE3R, and NOCS datasets. CRISP shows high performance on all the datasets. Moreover, our self-training is capable of bridging a large domain gap. Finally, CRISP also shows an ability to generalize to unseen objects.
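
The reduction to constrained linear least squares can be illustrated with a toy active-shape-model fit; the box bounds and the SciPy solver below are our stand-ins for the paper's convex-hull constraint and interior point solver, so this is a sketch of the idea rather than the actual corrector.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Toy active-shape-model fit: an observed shape is explained as a mean shape
# plus a bounded linear combination of basis shapes, so correcting the shape
# reduces to a constrained linear least-squares problem.
rng = np.random.default_rng(0)
n_pts, n_basis = 200, 8
mean_shape = rng.normal(size=(n_pts, 3))
basis = rng.normal(size=(n_basis, n_pts, 3))           # shape basis
true_w = rng.uniform(-0.5, 0.5, size=n_basis)
observed = mean_shape + np.tensordot(true_w, basis, axes=1)

A = basis.reshape(n_basis, -1).T                       # (3 * n_pts, n_basis)
b = (observed - mean_shape).ravel()
res = lsq_linear(A, b, bounds=(-1.0, 1.0))             # box-constrained least squares
print(np.abs(res.x - true_w).max())                    # ~0: weights recovered
```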


Poster #97
Highlight
CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image

Jingshun Huang · Haitao Lin · Tianyu Wang · Yanwei Fu · Xiangyang Xue · Yi Zhu

This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geometric cues and complex multi-stage pipelines that first segment parts from the point cloud, followed by Normalized Part Coordinate Space (NPCS) estimation for 6D poses. These approaches overlook dense semantic cues from RGB images, leading to suboptimal accuracy, particularly for objects with small parts. To address these limitations, we propose a single-stage Network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. This method combines RGB-D features to generate instance segmentation and NPCS representations for each part in an end-to-end manner. CAP-Net uses a unified network to simultaneously predict point-wise class labels, centroid offsets, and NPCS maps. A clustering algorithm then groups points of the same predicted class based on their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim-to-real domain gap, we introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring photorealistic RGB images and depth noise simulated from real sensors. Experimental evaluations on the RGBD-Art dataset demonstrate that our method significantly outperforms the state-of-the-art approach. Real-world deployments of our model in robotic tasks underscore its robustness and exceptional sim-to-real transfer capabilities, confirming its substantial practical utility.
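
A minimal sketch of the centroid-offset grouping step as we read it; the abstract does not name the clustering algorithm, so DBSCAN and the hyperparameters below are our assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def group_parts(points, class_labels, centroid_offsets, eps=0.05):
    """Group points of the same predicted class into part instances by
    clustering their voted centroids (point + predicted offset)."""
    instance_ids = np.full(len(points), -1)
    next_id = 0
    for c in np.unique(class_labels):
        mask = class_labels == c
        votes = points[mask] + centroid_offsets[mask]    # per-point centroid votes
        labels = DBSCAN(eps=eps, min_samples=10).fit_predict(votes)
        labels[labels >= 0] += next_id                   # keep ids unique across classes
        instance_ids[mask] = labels
        next_id = max(next_id, instance_ids.max() + 1)
    return instance_ids
```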


Poster #98
EchoMatch: Partial-to-Partial Shape Matching via Correspondence Reflection

Yizheng Xie · Viktoria Ehm · Paul Roetzer · Nafie El Amrani · Maolin Gao · Florian Bernard · Daniel Cremers

Finding correspondences between 3D shapes is a crucial problem in computer vision and graphics. While most research has focused on finding correspondences in settings where at least one of the shapes is complete, the realm of partial-to-partial shape matching remains under-explored. Yet it is of importance since, in many applications, shapes are only observed partially due to occlusion or scanning. Finding correspondences between partial shapes comes with an additional challenge: we not only want to identify correspondences between points on either shape but also have to determine which points of each shape actually have a partner. To tackle this challenging problem, we present EchoMatch, a novel framework for partial-to-partial shape matching that incorporates the concept of correspondence reflection to enable an overlap prediction within a functional map framework. With this approach, we show that we can outperform current SOTA methods in challenging partial-to-partial shape matching problems.


Poster #99
Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation

Sayak Nag · Udita Ghosh · Calvin-Khang Ta · Sarosij Bose · Jiachen Li · Amit K. Roy-Chowdhury

Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction (CP) based framework, adaptive to any existing SGG method, for quantifying their predictive uncertainty by constructing well-calibrated prediction sets over their generated scene graphs. These scene graph prediction sets are designed to achieve statistically rigorous coverage guarantees. Additionally, to ensure these prediction sets contain the most practically interpretable scene graphs, we design an effective MLLM-based post-processing strategy for selecting the most visually and semantically plausible scene graphs within these prediction sets. We show that our proposed approach can produce diverse possible scene graphs from an image, assess the reliability of SGG methods, and improve overall SGG performance.
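
For background, the split-conformal recipe that such frameworks build on can be sketched in a few lines; the scores here are generic softmax-based ones, whereas the paper defines them over scene-graph predictions and adds MLLM-based selection, so this is only an illustration with our own names.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with 1 - p(true class) as the score.

    cal_probs:  (n, K) softmax scores on a held-out calibration set
    cal_labels: (n,)   true labels for the calibration set
    test_probs: (m, K) softmax scores for test inputs
    Returns a boolean (m, K) mask marking which classes enter each set;
    the sets cover the true class with probability >= 1 - alpha.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = np.ceil((n + 1) * (1 - alpha)) / n   # assumes n is large enough that level <= 1
    q = np.quantile(scores, level, method="higher")
    return 1.0 - test_probs <= q
```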


Poster #100
Focusing on Tracks for Online Multi-Object Tracking

Kyujin Shim · Kangwook Ko · YuJin Yang · Changick Kim

Multi-object tracking (MOT) is a critical task in computer vision, requiring the accurate identification and continuous tracking of multiple objects across video frames. However, current state-of-the-art methods mainly rely on global optimization techniques and multi-stage cascade association strategies, and these approaches often overlook the specific characteristics of the assignment task in MOT, as well as useful detection results that may represent occluded objects. To address these challenges, we propose a novel Track-Focused Online Multi-Object Tracker (TrackTrack) with two key strategies: Track-Perspective-Based Association (TPA) and Track-Aware Initialization (TAI). The TPA strategy associates each track with the most suitable detection result by choosing, from all available detection results, the one with the minimum distance in a track-perspective manner. On the other hand, the TAI method prevents the generation of spurious tracks by suppressing track initialization for detection results that heavily overlap with currently active tracks or with more confident detection results. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that TrackTrack outperforms current state-of-the-art trackers, offering improved robustness and accuracy across diverse and challenging tracking scenarios.
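
Our reading of the track-perspective idea, as a toy sketch: each track independently picks its closest detection instead of solving one global assignment. The gating threshold and the handling of conflicts between tracks that choose the same detection are the paper's details and are not modeled here; all names are ours.

```python
import numpy as np

def track_perspective_association(dist, max_dist=0.8):
    """Each track (row) picks the closest detection (column) on its own.

    dist: (n_tracks, n_dets) cost matrix, e.g. 1 - IoU.
    Returns a list of (track_idx, det_idx) pairs.
    """
    matches = []
    for t in range(dist.shape[0]):
        d = int(np.argmin(dist[t]))
        if dist[t, d] <= max_dist:      # simple gate; the actual criterion may differ
            matches.append((t, d))
    return matches
```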


Poster #101
GRAE-3DMOT: Geometry Relation-Aware Encoder for Online 3D Multi-Object Tracking

Hyunseop Kim · Hyo-Jun Lee · Yonguk Lee · Jinu Lee · Hanul Kim · Yeong Jun Koh

Recently, 3D multi-object tracking (MOT) has widely adopted the standard tracking-by-detection paradigm, which solves the association problem between detections and tracks. Many tracking-by-detection approaches establish constrained relationships between detections and tracks using a distance threshold to reduce confusion during association. However, this approach does not effectively and comprehensively utilize the information regarding objects due to the constraints of the distance threshold. In this paper, we propose GRAE-3DMOT, Geometry Relation-Aware Encoder 3D Multi-Object Tracking, which contains a geometric relation-aware encoder to produce informative features for association. The geometric relation-aware encoder consists of three components: a spatial relation-aware encoder, a spatiotemporal relation-aware encoder, and a distance-aware feature fusion layer. The spatial relation-aware encoder effectively aggregates detection features by comprehensively exploiting as many detections as possible. The spatiotemporal relation-aware encoder provides spatiotemporal relation-aware features by combining spatial and temporal relation features, where the spatiotemporal relation-aware features are transformed into association scores for MOT. The distance-aware feature fusion layer is integrated into both encoders to enhance the relation features of physically proximate objects. Experimental results demonstrate that the proposed GRAE-3DMOT outperforms the state-of-the-art on the nuScenes benchmark. Our approach achieves 73.7% and 70.2% AMOTA on the nuScenes validation and test sets, respectively, using CenterPoint detections.


Poster #102
PointSR: Self-Regularized Point Supervision for Drone-View Object Detection

Weizhuo Li · Yue Xi · Wenjing Jia · zehao zhang · Fei Li · Xiangzeng Liu · Qiguang Miao

Point-Supervised Object Detection (PSOD) in a discriminative style has recently gained significant attention for its impressive detection performance and cost-effectiveness. However, accurately predicting high-quality pseudo-box labels for drone-view images, which often feature densely packed small objects, remains a challenge. This difficulty arises primarily from the limitation of rigid sampling strategies, which hinder the optimization of pseudo-boxes. To address this, we propose PointSR, an effective and robust point-supervised object detection framework with self-regularized sampling that integrates temporal and informative constraints throughout the pseudo-box generation process. Specifically, the framework comprises three key components: Temporal-Ensembling Encoder (TE Encoder), Coarse Pseudo-box Prediction, and Pseudo-box Refinement. The TE Encoder builds an anchor prototype library by aggregating temporal information for dynamic anchor adjustment. In Coarse Pseudo-box Prediction, anchors are refined using the prototype library, and a set of informative samples is collected for subsequent refinement. During Pseudo-box Refinement, these informative negative samples are used to suppress low-confidence candidate positive samples, thereby improving the quality of the pseudo boxes. Experimental results on benchmark datasets demonstrate that PointSR significantly outperforms state-of-the-art methods, achieving up to $\mathbf{3.3}$%$\sim$$\mathbf{7.2}$% higher AP$_{50}$ using only point supervision. Additionally, it exhibits strong robustness to perturbation in human-labeled points.


Poster #103
Multi-Modal Aerial-Ground Cross-View Place Recognition with Neural ODEs

Sijie Wang · Rui She · Qiyu Kang · Siqi Li · Disheng Li · Tianyu Geng · Shangshu Yu · Wee Peng Tay

Place recognition (PR) aims at retrieving the query place from a database and plays a crucial role in various applications, including navigation, autonomous driving, and augmented reality. While previous multi-modal PR works have mainly focused on the same-view scenario in which ground-view descriptors are matched with a database of ground-view descriptors during inference, the multi-modal cross-view scenario, in which ground-view descriptors are matched with aerial-view descriptors in a database, remains under-explored. We propose AGPlace, a model that effectively integrates information from multi-modal ground sensors (cameras and LiDARs) to achieve accurate aerial-ground PR. AGPlace achieves effective aerial-ground cross-view PR by leveraging a manifold-based neural ordinary differential equation (ODE) framework with a multi-domain alignment loss. It outperforms existing state-of-the-art cross-view PR models on large-scale datasets. As most existing PR models are designed for ground-ground PR, we adapt these baselines into our cross-view pipeline. Experiments demonstrate that this direct adaptation performs worse than our overall model architecture AGPlace. AGPlace represents a significant advancement in multi-modal aerial-ground PR, with promising implications for real-world applications.

Neural surface reconstruction has been dominated by implicit representations with marching cubes for explicit surface extraction. However, those methods typically require high-quality normals for accurate reconstruction. We propose OffsetOPT, a method that reconstructs explicit surfaces directly from 3D point clouds and eliminates the need for point normals. The approach comprises two stages: first, we train a neural network to predict surface triangles based on local point geometry, given isometrically distributed input points. Next, we apply the frozen network to reconstruct surfaces from unseen point clouds by optimizing a per-point offset to maximize the accuracy of triangle predictions. Compared to state-of-the-art methods, OffsetOPT not only excels at reconstructing overall surfaces but also significantly preserves sharp surface features. We demonstrate its accuracy on popular benchmarks, including small-scale shapes and large-scale open surfaces.


Poster #105
Highlight
High-Fidelity Lightweight Mesh Reconstruction from Point Clouds

Chen Zhang · Wentao Wang · Ximeng Li · Xinyao Liao · Wanjuan Su · Wenbing Tao

Recently, learning signed distance functions (SDFs) from point clouds has become popular for reconstruction. To ensure accuracy, most methods require using high-resolution Marching Cubes for surface extraction. However, this results in redundant mesh elements, making the mesh inconvenient to use. To solve the problem, we propose an adaptive meshing method to extract resolution-adaptive meshes based on surface curvature, enabling the recovery of high-fidelity lightweight meshes. Specifically, we first use point-based representation to perceive implicit surfaces and calculate surface curvature. A vertex generator is designed to produce curvature-adaptive vertices with any specified number on the implicit surface, preserving the overall structure and high-curvature features. Then we develop a Delaunay meshing algorithm to generate meshes from vertices, ensuring geometric fidelity and correct topology. In addition, to obtain accurate SDFs for adaptive meshing and achieve better lightweight reconstruction, we design a hybrid representation combining feature grid and feature tri-plane for better detail capture. Experiments demonstrate that our method can generate high-quality lightweight meshes from point clouds. Compared with methods from various categories, our approach achieves superior results, especially in capturing more details with fewer elements.


Poster #106
Parametric Point Cloud Completion for Polygonal Surface Reconstruction

Zhaiyu Chen · Yuqing Wang · Liangliang Nan · Xiao Xiang Zhu

Existing polygonal surface reconstruction methods heavily depend on input completeness and struggle with incomplete point clouds. We argue that while current point cloud completion techniques may recover missing points, they are not optimized for polygonal surface reconstruction, where the parametric representation of underlying surfaces remains overlooked. To address this gap, we introduce parametric completion, a novel paradigm for point cloud completion, which recovers parametric primitives instead of individual points to convey high-level geometric structures. Our presented approach, PaCo, enables high-quality polygonal surface reconstruction by leveraging plane proxies that encapsulate both plane parameters and inlier points, proving particularly effective in challenging scenarios with highly incomplete data. Comprehensive evaluation of our approach on the ABC dataset establishes its effectiveness with superior performance and sets a new standard for polygonal surface reconstruction from incomplete data.


Poster #107
Self-Supervised Large Scale Point Cloud Completion for Archaeological Site Restoration

Aocheng Li · James R. Zimmer-Dauphinee · Rajesh Kalyanam · Ian Lindsay · Parker VanValkenburgh · Steven Wernke · Daniel Aliaga

Point cloud completion helps restore partial incomplete point clouds suffering occlusions. Current self-supervised methods fail to give high fidelity completion for large objects with missing surfaces and unbalanced distribution of available points. In this paper, we present a novel method for restoring large-scale point clouds with limited and imbalanced ground-truth. Using rough boundary annotations for a region of interest, we project the original point clouds into a multiple-center-of-projection (MCOP) image, where fragments are projected to images of 5 channels (RGB, depth, and rotation). Completion of the original point cloud is reduced to inpainting the missing pixels in the MCOP images. Due to lack of complete structures and an unbalanced distribution of existing parts, we develop a self-supervised scheme which learns to infill the MCOP image with points resembling existing "complete" patches. Special losses are applied to further enhance the regularity and consistency of completed MCOP images, which is mapped back to 3D to form final restoration. Extensive experiments demonstrate the superiority of our method in completing 600+ incomplete and unbalanced archaeological structures in Peru.


Poster #108
Dual Focus-Attention Transformer for Robust Point Cloud Registration

Kexue Fu · Ming'zhi Yuan · Changwei Wang · Weiguang Pang · Jing Chi · Manning Wang · Longxiang Gao

Recently, coarse-to-fine methods for point cloud registration have achieved great success, but few works deeply explore the impact of feature interaction at both coarse and fine scales. By visualizing attention scores and correspondences, we find that existing methods fail to achieve effective feature aggregation at the two scales during feature interaction. To tackle this issue, we propose a Dual Focus-Attention Transformer framework, which only focuses on points relevant to the current point for feature interaction, avoiding interactions with irrelevant points. For the coarse scale, we design a superpoint focus-attention transformer guided by sparse keypoints, which are selected from the neighborhood of superpoints. For the fine scale, we only perform feature interaction between the point sets that belong to the same superpoint. Experiments show that our method achieves state-of-the-art performance on three standard benchmarks. The code and pre-trained models will be available on GitHub.

Gaussian and Laplacian entropy models have proved effective in learned point cloud attribute compression, as they assist in arithmetic coding of latents. However, we demonstrate through experiments that there is still unutilized information in the entropy parameters estimated by neural networks in current methods, which can be used for more accurate probability estimation. Thus we introduce a generalized Gaussian entropy model, which controls the tail shape through a shape parameter to more accurately estimate the probability of latents. Meanwhile, to the best of our knowledge, existing methods use fixed likelihood intervals for each integer during arithmetic coding, which limits model performance. We propose a Mean Error Discriminator (MED) to determine whether the entropy parameter estimation is accurate and then dynamically adjust likelihood intervals. Experiments show that our method significantly improves rate-distortion (RD) performance on three VAE-based models for point cloud attribute compression, and our method can be applied to other compression tasks, such as image and video compression.
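
For reference, the generalized Gaussian density that such an entropy model builds on has the standard form $p(x \mid \mu, \alpha, \beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)} \exp\left(-\left(|x-\mu|/\alpha\right)^{\beta}\right)$, where the shape parameter $\beta$ controls the tails ($\beta = 2$ recovers the Gaussian, $\beta = 1$ the Laplacian); the notation is ours.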


Poster #110
TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion

Yiran Wang · Jiaqi Li · Chaoyi Hong · Ruibo Li · Liusheng Sun · Xiao Song · Zhe Wang · Zhiguo Cao · Guosheng Lin

Radar-Camera depth estimation aims to predict dense and accurate metric depth by fusing input images and Radar data. Model efficiency is crucial for this task in pursuit of real-time processing on autonomous vehicles and robotic platforms. However, due to the sparsity of Radar returns, the prevailing methods adopt multi-stage frameworks with intermediate quasi-dense depth, which are time-consuming and not robust. To address these challenges, we propose TacoDepth, an efficient and accurate Radar-Camera depth estimation model with one-stage fusion. Specifically, the graph-based Radar structure extractor and the pyramid-based Radar fusion module are designed to capture and integrate the graph structures of Radar point clouds, delivering superior model efficiency and robustness without relying on the intermediate depth results. Moreover, TacoDepth can be flexible for different inference modes, providing a better balance of speed and accuracy. Extensive experiments are conducted to demonstrate the efficacy of our method. Compared with the previous state-of-the-art approach, TacoDepth improves depth accuracy and processing speed by 12.8% and 91.8%. Our work provides a new perspective on efficient Radar-Camera depth estimation.


Poster #111
SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation

Dekai Zhu · Yan Di · Stefan Gavranovic · Slobodan Ilic

Denoising diffusion probabilistic models have achieved significant success in point cloud generation, enabling numerous downstream applications, such as generative data augmentation and 3D model editing. However, little attention has been given to generating point clouds with point-wise segmentation labels, as well as to developing evaluation metrics for this task. Therefore, in this paper, we present SeaLion, a novel diffusion model designed to generate high-quality and diverse point clouds with fine-grained segmentation labels. Specifically, we introduce the semantic part-aware latent point diffusion technique, which leverages the intermediate features of the generative models to jointly predict the noise for perturbed latent points and associated part segmentation labels during the denoising process, and subsequently decodes the latent points to point clouds conditioned on part segmentation labels. To effectively evaluate the quality of generated point clouds, we introduce a novel point cloud pairwise distance calculation method named part-aware Chamfer distance (p-CD). This method enables existing metrics, such as 1-NNA, to measure both the local structural quality and inter-part coherence of generated point clouds. Experiments on the large-scale synthetic dataset ShapeNet and the real-world medical dataset IntrA demonstrate that SeaLion achieves remarkable performance in generation quality and diversity, outperforming the existing state-of-the-art model, DiffFacto, by 13.33% and 6.52% on 1-NNA (p-CD) across the two datasets. Experimental analysis shows that SeaLion can be trained semi-supervised, thereby reducing the demand for labeling efforts. Lastly, we validate the applicability of SeaLion in generative data augmentation for training segmentation models and the capability of SeaLion to serve as a tool for part-aware 3D shape editing.
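
As background, the standard Chamfer distance that p-CD extends is $\mathrm{CD}(P, Q) = \frac{1}{|P|}\sum_{p \in P}\min_{q \in Q}\lVert p-q\rVert_2^2 + \frac{1}{|Q|}\sum_{q \in Q}\min_{p \in P}\lVert q-p\rVert_2^2$; our reading is that p-CD evaluates such a distance per part label before aggregating, but the exact definition is the paper's.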


Poster #112
Spectral Informed Mamba for Robust Point Cloud Processing

Ali Bahri · Moslem Yazdanpanah · Mehrdad Noori · Sahar Dastani · Milad Cheraghalikhani · David OSOWIECHI · Gustavo Vargas Hakim · Farzad Beizaee · Ismail Ben Ayed · Christian Desrosiers

State Space Models (SSMs) have shown significant promise in Natural Language Processing (NLP) and, more recently, computer vision. This paper introduces a new methodology leveraging Mamba and Masked Autoencoder (MAE) networks for point cloud data in both supervised and self-supervised learning. We propose three key contributions to enhance Mamba's capability in processing complex point cloud structures. First, we exploit the spectrum of a graph Laplacian to capture patch connectivity, defining an isometry-invariant traversal order that is robust to viewpoints and better captures shape manifolds than traditional 3D grid-based traversals. Second, we adapt segmentation via a recursive patch partitioning strategy informed by Laplacian spectral components, allowing finer integration and segment analysis. Third, we address token placement in MAE for Mamba by restoring tokens to their original positions, which preserves essential order and improves learning. Extensive experiments demonstrate our approach’s improvements in classification, segmentation, and few-shot tasks over state-of-the-art (SOTA) baselines.
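
A minimal sketch of the spectral ordering idea (our construction; the paper's recursive Laplacian-informed partitioning is more elaborate): order patches along the Fiedler vector of their connectivity Laplacian, which depends only on intrinsic connectivity rather than a fixed 3D grid.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def spectral_traversal_order(adjacency):
    """Order patches by the Fiedler vector of their connectivity graph.

    adjacency: (P, P) symmetric patch-connectivity (e.g. k-NN) matrix.
    Returns patch indices sorted along the second-smallest eigenvector of
    the normalized graph Laplacian, an isometry-invariant traversal order.
    """
    lap = laplacian(adjacency, normed=True)
    _, eigvecs = eigh(lap)
    fiedler = eigvecs[:, 1]            # second-smallest eigenvector
    return np.argsort(fiedler)
```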


Poster #113
Hyperbolic Uncertainty-Aware Few-Shot Incremental Point Cloud Segmentation

Tanuj Sur · Samrat Mukherjee · Kaizer Rahaman · Subhasis Chaudhuri · Muhammad Haris Khan · Biplab Banerjee

3D point cloud segmentation is essential across a range of applications; however, conventional methods often struggle in evolving environments, particularly when tasked with identifying novel categories under limited supervision. Few-Shot Learning (FSL) and Class Incremental Learning (CIL) have been adapted previously to address these challenges in isolation, yet the combined paradigm of Few-Shot Class Incremental Learning (FSCIL) remains largely unexplored for point cloud segmentation. To address this gap, we introduce Hyperbolic Ideal Prototypes Optimization (HiPo), a novel framework that harnesses hyperbolic embeddings for FSCIL in 3D point clouds. HiPo employs the Poincaré Hyperbolic Sphere as its embedding space, integrating Ideal Prototypes enriched by CLIP-derived class semantics, to capture the hierarchical structure of 3D data. By enforcing orthogonality among prototypes and maximizing representational margins, HiPo constructs a resilient embedding space that mitigates forgetting and enables the seamless integration of new classes, thereby effectively countering overfitting. Extensive evaluations on S3DIS, ScanNetv2, and cross-dataset scenarios demonstrate HiPo's strong performance, significantly surpassing existing approaches in both in-domain and cross-dataset FSCIL tasks for 3D point cloud segmentation. Code will be released upon acceptance.
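
For context, hyperbolic prototype methods of this kind typically rely on the Poincaré-ball geodesic distance $d_{\mathbb{B}}(x, y) = \operatorname{arcosh}\left(1 + \frac{2\lVert x-y\rVert^2}{(1-\lVert x\rVert^2)(1-\lVert y\rVert^2)}\right)$ (unit curvature; notation ours), which grows rapidly near the boundary and is what makes such embeddings well suited to hierarchical structure.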


Poster #114
CamPoint: Boosting Point Cloud Segmentation with Virtual Camera

Jianhui Zhang · Luo Yizhi · Zicheng Zhang · Xuecheng Nie · Bonan Li

Local feature aggregation and global information perception are fundamental to point cloud segmentation. However, existing works often fall short in effectively identifying semantically relevant neighbors and face challenges in endowing each point with high-level information. Here, we propose CamPoint, an innovative method that employs virtual cameras to solve the above problems. The core of CamPoint lies in introducing the novel camera visibility feature for points, where each dimension encodes the visibility of that point from a specific camera. Leveraging this feature, we propose the camera perspective slice distance for accurate relevant neighbor searching and design the camera parameter embedding to deliver rich feature representations for global interaction. Specifically, the camera perspective slice distance between two points is defined as a similarity metric derived from their camera visibility features, whereby an increased number of shared cameras observing both points corresponds to a reduced distance between them. To effectively facilitate global semantic perception, we assign each camera an optimizable embedding and then integrate these embeddings into the original spatial features based on visibility attributes, thereby obtaining high-level features enriched with camera priors. Additionally, the state space model characterized by linear computational complexity is employed as the operator to achieve global learning with efficiency. Comprehensive experiments show that CamPoint surpasses the current state-of-the-art on multiple datasets, achieving low training cost and fast inference speed. Code will be released upon acceptance.


Poster #115
ReRAW: RGB-to-RAW Image Reconstruction via Stratified Sampling for Efficient Object Detection on the Edge

Radu Berdan · Beril Besbinar · Christoph Reinders · Junji Otsuka · Daisuke Iso

Edge-based computer vision models running on compact, resource-limited devices benefit greatly from using unprocessed, detail-rich RAW sensor data instead of processed RGB images. Training these models, however, necessitates large labeled RAW datasets, which are costly and often impractical to obtain. Thus, converting existing labeled RGB datasets into sensor-specific RAW images becomes crucial for effective model training. In this paper, we introduce ReRAW, an RGB-to-RAW conversion model that achieves state-of-the-art reconstruction performance across five diverse RAW datasets. This is accomplished through ReRAW’s novel multi-head architecture predicting RAW image candidates in gamma space. The performance is further boosted by a stratified sampling-based training data selection heuristic, which helps the model better reconstruct brighter RAW pixels. We finally demonstrate that pretraining compact models on a combination of high-quality synthetic RAW datasets (such as those generated by ReRAW) and ground-truth RAW images for downstream tasks like object detection outperforms both standard RGB pipelines and RAW fine-tuning of RGB-pretrained models for the same task.

The sparsity of point clouds poses challenges to current LiDAR-only 3D object detection methods. Recently, methods that convert RGB images into virtual points via depth completion, to be fused with LiDAR points, have alleviated this issue. Although these methods can achieve outstanding results, they often introduce significant computation overhead due to the high density of virtual points, as well as noise due to inevitable errors in depth completion. At the same time, they do not fully leverage the semantic information from images. In this paper, we propose ViKIENet (Virtual Key Instance Enhanced Network), a highly efficient and effective multi-modal feature fusion framework which fuses the features of virtual key instances (VKIs) with those of LiDAR points in multiple stages. We observed that using only VKIs can enhance detection performance to a level similar to using all virtual points. ViKIENet has three main components: Semantic Key Instance Selection (SKIS), Virtual Instance Focused Fusion (VIFF) and Virtual-Instance-to-Real Attention (VIRA). ViKIENet-R and VIFF-R are extended versions of ViKIENet and VIFF that include rotationally equivariant features. ViKIENet and ViKIENet-R achieve considerable improvements in detection performance on the KITTI, JRDB and nuScenes datasets. On the KITTI dataset, ViKIENet and ViKIENet-R run at 22.7 and 15.0 FPS, respectively. We rank first on the KITTI car object detection and orientation estimation evaluation leaderboard and rank second on the car 3D object detection leaderboard among published papers.


Poster #117
ViiNeuS: Volumetric Initialization for Implicit Neural Surface Reconstruction of Urban Scenes with Limited Image Overlap

Hala Djeghim · Nathan Piasco · Moussab Bennehar · Luis Guillermo Roldao Jimenez · Dzmitry Tsishkou · Désiré Sidibé

Neural implicit surface representation methods have recently shown impressive 3D reconstruction results. However, existing solutions struggle to reconstruct driving scenes due to their large size, highly complex nature and limited visual observation overlap. Hence, to achieve accurate reconstructions, additional supervision data such as LiDAR, strong geometric priors, and long training times are required. To tackle such limitations, we present ViiNeuS, a new hybrid implicit surface learning method that efficiently initializes the signed distance field to reconstruct large driving scenes from 2D street view images. ViiNeuS's hybrid architecture models two separate implicit fields: one representing the volumetric density of the scene, and another one representing the signed distance to the surface. To accurately reconstruct urban outdoor driving scenarios, we introduce a novel volume-rendering strategy that relies on self-supervised probabilistic density estimation to sample points near the surface and transition progressively from volumetric to surface representation. Our solution permits a proper and fast initialization of the signed distance field without relying on any geometric prior on the scene, compared to concurrent methods. By conducting extensive experiments on four outdoor driving datasets, we show that ViiNeuS can learn an accurate and detailed 3D surface scene representation in various driving scenarios while being two times faster to train compared to previous state-of-the-art solutions.


Poster #118
D^3CTTA: Domain-Dependent Decorrelation for Continual Test-Time Adaption of 3D LiDAR Segmentation

Jichun Zhao · Haiyong Jiang · Haoxuan Song · Jun Xiao · Dong Gong

Adapting pre-trained LiDAR segmentation models to dynamic domain shifts during testing is of paramount importance for the safety of autonomous driving. Most existing methods neglect the influence of domain changes on continual test-time adaption (CTTA) and require backpropagation and large batch sizes for stable adaption. We approach this problem with three insights: 1) Distance of a point to the LiDAR sensor is highly relevant to its local density; 2) The feature distribution of different domains varies, and domain-aware parameters can alleviate domain gaps; 3) Features are highly correlated and make segmentation of different labels confusing. To this end, this work presents D^3CTTA, an online backpropagation-free framework for 3D continual test-time adaption for LiDAR segmentation. D^3CTTA consists of a distance-aware prototype learning module to integrate LiDAR-based geometry priors and a domain-dependent decorrelation module to reduce feature correlations among different domains and different categories. Extensive experiments on three benchmarks showcase that our method achieves state-of-the-art performance compared to both backpropagation-based methods and backpropagation-free methods.


Poster #119
Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving

Alexey Nekrasov · Malcolm Burdorf · Stewart Worrall · Bastian Leibe · Julie Stephany Berrio Perez

For safe operation, autonomous vehicles (AVs) must detect and handle unexpected objects or anomalies on the road. While anomaly detection and segmentation have been explored in 2D images, a gap remains for similar tasks in 3D LiDAR point clouds. Existing datasets lack high-quality multimodal data typically found in AVs. This paper presents a novel dataset for anomaly segmentation in driving scenarios. To the best of our knowledge, it is the first publicly available dataset focused on road anomaly segmentation with dense 3D semantic labeling, incorporating both LiDAR and camera data, as well as sequential information to enable anomaly detection across various ranges. This capability is critical for the safe navigation of autonomous vehicles. We adapted and benchmarked several baseline models for 3D segmentation, highlighting the challenges of 3D anomaly detection in driving environments. Our dataset and evaluation code will be openly accessible, facilitating testing and performance comparison across diverse approaches.

Deep learning models for 3D data have been shown to be vulnerable to adversarial attacks, which have received increasing attention in various safety-critical applications such as autonomous driving and robotic navigation. Existing 3D attackers mainly put effort into attacking simple 3D classification models by perturbing point cloud objects in the white/black-box setting. However, real-world 3D applications focus on tackling more complicated scene-based data while sharing no information about the model parameters and logits with users. Therefore, directly applying previous naive 3D attack methods to these applications does not work. To this end, this paper attempts to address the challenging hard-label 3D scene attack with access only to the input/output of the 3D models. To make the attack effective and stealthy, we propose to generate universal adversarial objects, which will mislead scene-aware 3D models to predict attacker-chosen labels whenever these objects are placed in any scene input. Specifically, we inject an imperceptible object trigger with further perturbations into all scenes and learn to mislead their reasoning by only querying the 3D model. We start by initializing the trigger pattern with a realistic object and searching for an appropriate location to place it naturally in the scene data. Then, we design a novel weighted gradient estimation strategy to perturb the object trigger with additive slight noise to make it adversarial in an iterative optimization procedure. Extensive experiments demonstrate that our attack can achieve superior performance on seven 3D models and three scene-based datasets, with satisfactory adversarial imperceptibility and strong resistance to defense methods.


Poster #121
Highlight
Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection

Houzhang Fang · Xiaolin Wang · Zengyang Li · Lu Wang · Qingshan Li · Yi Chang · Luxin Yan

Infrared unmanned aerial vehicle (UAV) images captured using thermal detectors are often affected by temperature-dependent low-frequency nonuniformity, which significantly reduces image contrast. Detecting UAV targets under such nonuniform conditions is crucial in UAV surveillance applications. Existing methods typically treat infrared nonuniformity correction (NUC) as a preprocessing step for detection, which leads to suboptimal performance. Balancing the two tasks while enhancing detection-beneficial information remains challenging. In this paper, we present a detection-friendly union framework, termed UniCD, that simultaneously addresses infrared NUC and UAV target detection in an end-to-end manner. We first model NUC as an estimation problem over a small number of parameters, jointly driven by priors and data, to generate detection-conducive images. Then, we incorporate a new auxiliary loss with target mask supervision into the backbone of the infrared UAV target detection network to strengthen target features while suppressing the background. To better balance correction and detection, we introduce a detection-guided self-supervised loss that reduces feature discrepancies between the two tasks, thereby enhancing detection robustness to varying nonuniformity levels. Additionally, we construct a new benchmark, called IRBFD, composed of 50,000 infrared images covering various nonuniformity types, multi-scale UAV targets, and rich backgrounds with target annotations. Extensive experiments on IRBFD demonstrate that UniCD is a robust union framework for NUC and UAV target detection while achieving real-time processing. The dataset is available at https://github.com/anonymous2025submit/UniCD.


Poster #122
RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions

Shihang Du · Sanqing Qu · Tianhang Wang · Xudong Zhang · Yunwei Zhu · Jian Mao · Fan Lu · Qiao Lin · Guang Chen

Collaborative perception enhances single-vehicle perception by integrating sensory data from multiple connected vehicles. However, existing studies often assume ideal conditions, overlooking resilience to real-world challenges such as adverse weather and sensor malfunctions, which is critical for safe deployment. To address this gap, we introduce RCP-Bench, the first comprehensive benchmark designed to evaluate the robustness of collaborative detection models under a wide range of real-world corruptions. RCP-Bench includes three new datasets (i.e., OPV2V-C, V2XSet-C, and DAIR-V2X-C) that simulate six collaborative cases and 14 types of camera corruption arising from external environmental factors, sensor failures, and temporal misalignments. Extensive experiments on 10 leading collaborative perception models reveal that, while these models perform well under ideal conditions, they are significantly affected by corruptions. To improve robustness, we propose two simple yet effective strategies, RCP-Drop and RCP-Mix, based on training regularization and feature augmentation. Additionally, we identify several critical factors influencing robustness, such as backbone architecture, the number of cameras, feature fusion methods, and the number of connected vehicles. We hope that RCP-Bench, along with these strategies and insights, will stimulate future research toward developing more robust collaborative perception models. Our benchmark toolkit will be made public.


Poster #123
Generative Map Priors for Collaborative BEV Semantic Segmentation

Jiahui Fu · Yue Gong · Luting Wang · Shifeng Zhang · Xu Zhou · Si Liu

Collaborative perception aims to address the constraints of single-agent perception by exchanging information among multiple agents. Previous works primarily focus on collaborative object detection, exploring compressed transmission and fusion prediction tailored to sparse object features. However, these strategies are not well-suited for the dense features in collaborative BEV semantic segmentation. Therefore, we propose CoGMP, a novel Collaborative framework that leverages Generative Map Priors to achieve efficient compression and robust fusion. CoGMP introduces two key innovations: Element Format Feature Compression (EFFC) and Structure Guided Feature Fusion (SGFF). Specifically, EFFC leverages map element priors from a codebook to encode BEV features as discrete element indices, compressing the transmitted information. Meanwhile, SGFF utilizes a diffusion model with structural priors to coherently integrate multi-agent features, thereby achieving consistent fusion predictions. Evaluations on the OPV2V dataset show that CoGMP achieves a 6.89/7.64 Road/Lane IoU improvement and a 32-fold reduction in communication volume. The code can be found in the supplementary materials.


Poster #124
SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion

Xiyue Guo · Jiarui Hu · Junjie Hu · Hujun Bao · Guofeng Zhang

Recently, camera-based solutions have been extensively explored for semantic scene completion (SSC). Despite their success in visible areas, existing methods struggle to capture complete scene semantics due to frequent visual occlusions. To address this limitation, this paper presents the first satellite-ground cooperative SSC framework, i.e., SGFormer, exploring the potential of satellite-ground image pairs in the SSC task. Specifically, we propose a dual-branch architecture that encodes orthogonal satellite and ground views in parallel and unifies them into a common domain. Additionally, we design a ground-view guidance strategy that pre-corrects satellite image biases during feature encoding, addressing misalignment between satellite and ground views. Moreover, we develop an adaptive weighting strategy that balances the contributions of satellite and ground views. Experiments demonstrate that SGFormer outperforms the state of the art on the SemanticKITTI and SSCBench-KITTI-360 datasets. We will make our source code publicly available soon.

Camera-based Semantic Scene Completion (SSC) is gaining attention in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and a Scan Loss, both designed to enhance distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, where each axis employs a near-to-far cascade masking that enables distant voxels to capture relationships with preceding voxels. In addition, the Scan Loss computes the cross-entropy along each axis between cumulative logits and the corresponding class distributions in a near-to-far direction, thereby propagating rich context-aware signals to distant voxels. Leveraging the synergy between these components, ScanSSC achieves state-of-the-art performance, with IoUs of 44.54 and 48.29 and mIoUs of 17.40 and 20.14 on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.
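
One way to read the Scan Loss is as a cross-entropy on cumulatively aggregated logits along a near-to-far axis; the sketch below follows that reading and is only an interpretation, with shapes and normalization chosen for illustration rather than taken from the paper.

    import torch
    import torch.nn.functional as F

    def scan_loss(logits, target):
        # logits: (B, C, D) class scores along one spatial axis ordered near-to-far;
        # target: (B, D) integer labels. Cumulative logits let distant positions
        # receive context-aware gradients from nearer ones.
        B, C, D = logits.shape
        cum_logits = torch.cumsum(logits, dim=-1)                   # near-to-far accumulation
        one_hot = F.one_hot(target, C).permute(0, 2, 1).float()     # (B, C, D)
        cum_target = torch.cumsum(one_hot, dim=-1)
        cum_target = cum_target / cum_target.sum(dim=1, keepdim=True).clamp_min(1e-6)
        return -(cum_target * F.log_softmax(cum_logits, dim=1)).sum(dim=1).mean()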


Poster #126
OccMamba: Semantic Occupancy Prediction with State Space Models

Heng Li · Yuenan Hou · Xiaohan Xing · Yuexin Ma · Xiao Sun · Yanyong Zhang

Training deep learning models for semantic occupancy prediction is challenging due to factors such as the large number of occupancy cells, severe occlusion, limited visual cues, and complicated driving scenarios. Recent methods often adopt transformer-based architectures given their strong capability in learning input-conditioned weights and long-range relationships. However, transformer-based networks are notorious for their quadratic computational complexity, seriously undermining their efficacy and deployment in semantic occupancy prediction. Inspired by the global modeling capability and linear computational complexity of the Mamba architecture, we present the first Mamba-based network for semantic occupancy prediction, termed OccMamba. Specifically, we first design a hierarchical Mamba module and a local context processor to better aggregate global and local contextual information, respectively. Besides, to bridge the inherent domain gap between the linguistic and 3D domains, we present a simple yet effective 3D-to-1D reordering scheme, i.e., height-prioritized 2D Hilbert expansion. It maximally retains the spatial structure of the 3D voxels while facilitating the processing of Mamba blocks. Endowed with these designs, OccMamba can directly and efficiently process large volumes of dense scene grids, achieving state-of-the-art performance across three prevalent occupancy prediction benchmarks: OpenOccupancy, SemanticKITTI, and SemanticPOSS. Notably, on OpenOccupancy, OccMamba outperforms the previous state-of-the-art Co-Occ by 5.1% IoU and 4.3% mIoU. Codes will be released upon publication.
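
For intuition, a height-prioritized Hilbert serialization can be sketched as follows: BEV coordinates are ordered along a standard 2D Hilbert curve and the vertical column at each location is emitted contiguously. The exact ordering used by OccMamba may differ; this is a generic sketch.

    def hilbert_index(order, x, y):
        # Map (x, y) on a 2**order x 2**order grid to its index along the 2D
        # Hilbert curve (standard bit-manipulation formulation).
        n = 1 << order
        d = 0
        s = n // 2
        while s > 0:
            rx = 1 if (x & s) else 0
            ry = 1 if (y & s) else 0
            d += s * s * ((3 * rx) ^ ry)
            if ry == 0:                      # rotate/flip the quadrant
                if rx == 1:
                    x, y = n - 1 - x, n - 1 - y
                x, y = y, x
            s //= 2
        return d

    def height_prioritized_order(order, height):
        # Traverse the BEV plane along the Hilbert curve so that spatially adjacent
        # voxels stay close in the 1D token sequence; the height column at each BEV
        # location is emitted contiguously ("height-prioritized").
        n = 1 << order
        plane = sorted(((x, y) for x in range(n) for y in range(n)),
                       key=lambda xy: hilbert_index(order, *xy))
        return [(x, y, z) for (x, y) in plane for z in range(height)]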


Poster #127
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

Haoyi Jiang · Liu Liu · Tianheng Cheng · Xinjie wang · Tianwei Lin · Zhizhong Su · Wenyu Liu · Xinggang Wang

3D Semantic Occupancy Prediction is pivotal for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that aligns with foundation models to enhance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians representing scenes in a feed-forward manner. Through the alignment of rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations, thereby enabling open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These results highlight the significant potential of GaussTR for advancing scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents. The code will be made publicly available in due course.


Poster #128
UniScene: Unified Occupancy-centric Driving Scene Generation

Bohan Li · Jiazhe Guo · Hongsi Liu · Yingshuang Zou · Yikang Ding · Xiwu Chen · Hu ZHU · Feiyang Tan · Chi Zhang · Tiancai Wang · Shuchang Zhou · Li Zhang · Xiaojuan Qi · Hao Zhao · Mu Yang · Wenjun Zeng · Xin Jin

Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to output the rich data forms required for diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms — semantic occupancy, video, and LiDAR — in driving scenes. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, with two novel transfer strategies, Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in occupancy, video, and LiDAR generation, which in turn benefits downstream driving tasks. The code is available in the supplementary.


Poster #129
SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving

Georg Hess · Carl Lindström · Maryam Fatemi · Christoffer Petersson · Lennart Svensson

Ensuring the safety of autonomous robots, such as self-driving vehicles, requires extensive testing across diverse driving scenarios. Simulation is a key ingredient for conducting such testing in a cost-effective and scalable way. Neural rendering methods have gained popularity, as they can build simulation environments from collected logs in a data-driven manner. However, existing neural radiance field (NeRF) methods for sensor-realistic rendering of camera and lidar data suffer from low rendering speeds, limiting their applicability for large-scale testing. While 3D Gaussian Splatting (3DGS) enables real-time rendering, current methods are limited to camera data and are unable to render lidar data essential for autonomous driving. To address these limitations, we propose SplatAD, the first 3DGS-based method for realistic, real-time rendering of dynamic scenes for both camera and lidar data. SplatAD accurately models key sensor-specific phenomena such as rolling shutter effects, lidar intensity, and lidar ray dropouts, using purpose-built algorithms to optimize rendering efficiency. Evaluation across three autonomous driving datasets demonstrates that SplatAD achieves state-of-the-art rendering quality with up to +2 PSNR for NVS and +3 PSNR for reconstruction while increasing rendering speed over NeRF-based methods by an order of magnitude. Code to be released upon publication.


Poster #130
Highlight
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Katrin Renz · Long Chen · Elahe Arani · Oleg Sinavski

Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, and achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to vision-language understanding is visual question answering. However, for autonomous driving, this is only useful if it is grounded in the action space; otherwise, the model’s answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision language model (VLM) and operates using camera input only, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance in the widely used CARLA simulator on the Leaderboard 2.0 and the Bench2Drive benchmarks. Additionally, we achieve strong results in a wide variety of language-related tasks while maintaining high driving performance. We will release code, data and models upon acceptance.


Poster #131
FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

Lue Fan · Hao ZHANG · Qitai Wang · Hongsheng Li · Zhaoxiang Zhang

We propose FreeSim, a camera simulation method for driving scenes. FreeSim emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. At such viewpoints, previous methods suffer from unacceptable degradation because training data for these viewpoints is unavailable. To address this data scarcity, we first propose a generative enhancement model with a matched data construction strategy. The resulting model can generate high-quality images at a viewpoint slightly deviated from the recorded trajectories, conditioned on the degraded rendering of that viewpoint. We then propose a progressive reconstruction strategy, which progressively adds generated images of unrecorded views into the reconstruction process, starting from slightly off-trajectory viewpoints and moving progressively farther away. With this progressive generation-reconstruction pipeline, FreeSim supports high-quality off-trajectory view synthesis under large deviations of more than 3 meters.


Poster #132
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

Guosheng Zhao · Chaojun Ni · Xiaofeng Wang · Zheng Zhu · Xueyang Zhang · Yida Wang · Guan Huang · xinze chen · Boyuan Wang · Youyi Zhang · Wenjun Mei · Xingang Wang

Closed-loop simulation is essential for advancing end-to-end autonomous driving systems. Contemporary sensor simulation methods, such as NeRF and 3DGS, rely predominantly on conditions closely aligned with training data distributions, which are largely confined to forward-driving scenarios. Consequently, these methods face limitations when rendering complex maneuvers (e.g., lane change, acceleration, deceleration). Recent advancements in autonomous-driving world models have demonstrated the potential to generate diverse driving videos. However, these approaches remain constrained to 2D video generation, inherently lacking the spatiotemporal coherence required to capture the intricacies of dynamic driving environments. In this paper, we introduce DriveDreamer4D, which enhances 4D driving scene representation by leveraging world model priors. Specifically, we utilize the world model as a data machine to synthesize novel trajectory videos, where structured conditions are explicitly leveraged to control the spatial-temporal consistency of traffic elements. Besides, a cousin data training strategy is proposed to facilitate merging real and synthetic data for optimizing 4DGS. To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios. Experimental results reveal that DriveDreamer4D significantly enhances generation quality under novel trajectory views, achieving relative improvements in FID of 32.1%, 46.4%, and 16.3% compared to PVG, S^3Gaussian, and Deformable-GS. Moreover, DriveDreamer4D markedly enhances the spatiotemporal coherence of driving agents, which is verified by a comprehensive user study and relative increases of 22.6%, 43.5%, and 15.6% in the NTA-IoU metric.


Poster #133
Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene

Tai-Yu Daniel Pan · Sooyoung Jeon · Mengdi Fan · Jinsu Yoo · Zhenyang Feng · Mark Campbell · Kilian Q Weinberger · Bharath Hariharan · Wei-Lun Chao

Self-driving cars relying solely on ego-centric perception face limitations in sensing, often failing to detect occluded, faraway objects. Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial: it requires placing multiple sensor-equipped agents in a real-world driving scene simultaneously! As such, existing datasets are limited in locations and agents. We introduce a novel surrogate to the rescue: generating realistic perception from different viewpoints in a driving scene, conditioned on a real-world sample -- the ego-car's sensory data. This surrogate has huge potential: it could turn any ego-car dataset into a collaborative driving one, scaling up the development of CAV. We present the very first solution, using a combination of synthetic collaborative data and real ego-car data. Our method, Transfer Your Perspective (TYP), learns a conditioned diffusion model whose output samples are not only realistic but also consistent in both semantics and layout with the given ego-car data. Empirical results demonstrate TYP's effectiveness in a CAV setting. In particular, TYP enables us to (pre-)train collaborative perception algorithms like early and late fusion with little or no real-world collaborative data, greatly facilitating downstream CAV applications.


Poster #134
Highlight
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

Bencheng Liao · Shaoyu Chen · haoran yin · Bo Jiang · Cheng Wang · Sixu Yan · xinbang zhang · Xiangyu Li · ying zhang · Qian Zhang · Xinggang Wang

Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging this capability for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in robotic diffusion policies and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from an anchored Gaussian distribution to the multi-mode driving action distribution. Additionally, we design an efficient cascade diffusion decoder for enhanced interaction with conditional scene context. The proposed model, DiffusionDrive, achieves a 10× reduction in denoising steps compared to the vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with the aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090. Qualitative results on challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions. Code and model will be available for future research.
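
The truncated schedule can be pictured with a minimal sampling sketch: denoising starts from Gaussian noise centered on prior anchors rather than from pure noise, so only a couple of steps remain. The denoiser signature, noise scale, and step count below are illustrative assumptions, not the released model.

    import torch

    def truncated_sample(denoiser, anchors, scene_ctx, sigma=0.5, steps=2):
        # anchors: (B, T, 2) prior multi-mode trajectory anchors; `denoiser` is a
        # hypothetical network that predicts a cleaner plan given a noisy input,
        # a noise level, and conditional scene context.
        x = anchors + sigma * torch.randn_like(anchors)
        for t in torch.linspace(sigma, 0.0, steps + 1)[:-1]:
            t_batch = torch.full((x.shape[0],), float(t), device=x.device)
            x = denoiser(x, t_batch, scene_ctx)
        return x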


Poster #135
TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception

Zhiying Song · Lei Yang · Fuxi Wen · Jun Li

Cooperative perception presents significant potential for enhancing the sensing capabilities of individual vehicles; however, inter-agent latency remains a critical challenge. Latencies cause misalignments in both spatial and semantic features, complicating the fusion of real-time observations from the ego vehicle with delayed data from others. To address these issues, we propose TraF-Align, a novel framework that learns the flow path of features by predicting the feature-level trajectory of objects from past observations up to the ego vehicle’s current time. By generating temporally ordered sampling points along these paths, TraF-Align directs attention from the current-time query to relevant historical features along each trajectory, supporting the reconstruction of current-time features and promoting semantic interaction across multiple frames. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion and achieving coherent feature fusion. Experiments on the real-world datasets V2V4Real and DAIR-V2X-Seq show that TraF-Align sets a new benchmark for asynchronous cooperative perception. Notably, our method shows minimal average precision (AP50) drops of only 4.87% and 5.68% at 400 ms latency on the two datasets, respectively.
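
A minimal sketch of sampling delayed features along predicted object trajectories, assuming a BEV feature map and bilinear grid lookup; the tensor names and normalization range are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def sample_along_trajectory(delayed_bev, traj_xy, bev_range=51.2):
        # delayed_bev: (B, C, H, W) features received from a latency-affected agent.
        # traj_xy: (B, T, N, 2) predicted object positions in meters, ordered from
        # the delayed timestamp up to the ego vehicle's current time; sampling at
        # these points yields temporally ordered tokens for the current-time query.
        B, T, N, _ = traj_xy.shape
        grid = (traj_xy / bev_range).view(B, T * N, 1, 2)               # normalize to [-1, 1]
        feats = F.grid_sample(delayed_bev, grid, align_corners=False)   # (B, C, T*N, 1)
        C = feats.shape[1]
        return feats.squeeze(-1).view(B, C, T, N)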


Poster #136
Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM

Yizhou Huang · Yihua Cheng · Kezhi Wang

Motion prediction is crucial for autonomous driving systems, as it enables accurate forecasting of future vehicle trajectories based on historical motion data. This paper introduces Trajectory Mamba (Tamba), a novel, efficient trajectory prediction framework based on the selective state-space model (SSM). Conventional attention-based models face computational costs that grow quadratically with the number of targets, hindering their application in highly dynamic environments. To address this, Tamba leverages the SSM module to redesign the self-attention mechanism in the encoder-decoder architecture, thereby achieving linear time complexity. To address the potential reduction in prediction accuracy resulting from modifications to the attention mechanism, we propose a joint polyline encoding strategy to better capture the associations between static and dynamic contexts, ultimately enhancing prediction accuracy. In addition, to achieve a better balance between prediction accuracy and inference speed, we adopt a decoder structure that differs entirely from the encoder: through cross-state-space attention, all target agents share the scene context, allowing the SSM to interact with the shared scene representation during decoding and thus infer different trajectories over the next prediction steps. Our model achieves state-of-the-art (SOTA) results in terms of inference speed and parameter efficiency on both the Argoverse 1 and Argoverse 2 datasets. It demonstrates a fourfold reduction in FLOPs compared to existing methods and reduces the parameter count by over 40% while surpassing the performance of the vast majority of previous SOTA results. These findings validate the effectiveness of Trajectory Mamba in trajectory prediction tasks.
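
The linear-time claim rests on the state-space recurrence that Mamba-style blocks use in place of attention; the sketch below shows the generic diagonal selective scan, not Tamba's actual encoder or decoder modules.

    import torch

    def selective_scan(x, A, B, C):
        # x, A, B, C: (batch, length, dim); A, B, C are input-dependent ("selective")
        # parameters of a diagonal state-space model. The recurrence is O(length),
        # unlike self-attention, which is quadratic in the number of tokens.
        batch, length, dim = x.shape
        h = torch.zeros(batch, dim, device=x.device, dtype=x.dtype)
        outputs = []
        for t in range(length):
            h = A[:, t] * h + B[:, t] * x[:, t]   # state update
            outputs.append(C[:, t] * h)           # readout
        return torch.stack(outputs, dim=1)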


Poster #137
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

Xuesong Chen · Linjiang Huang · Tao Ma · Rongyao Fang · Shaoshuai Shi · Hongsheng Li

The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient asynchronous cooperation, aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and interpretable autonomous driving systems.


Poster #138
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

Xinshuai Song · weixing chen · Yang Liu · Weikai Chen · Guanbin Li · Liang Lin

Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. Furthermore, to support LH-VLN, we develop an automated data generation platform, NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. Furthermore, we propose Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weighted by Ground Truth (CGT) metrics to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Our platform, benchmark, and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.


Poster #139
Highlight
MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Zhi-Yuan Zhang · Xiaofan Li · Zhihao Xu · Wenjie Peng · Zijian Zhou · Miaojing Shi · Shuangping Huang

Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, relying heavily on the model's spatial perception capabilities. Previous works typically express spatial comprehension through textual representations of spatial coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose the Marker-based Prompt Learning framework (MPDrive), which transforms spatial coordinates into concise visual markers, ensuring linguistic consistency and enhancing the accuracy of visual perception and spatial expression in AD-VQA. Specifically, MPDrive converts complex spatial coordinates into text-based visual marker predictions, simplifying the expression of spatial information for autonomous decision-making. Moreover, we introduce visual marker images as conditional inputs and integrate object-level fine-grained features to further enhance multi-level spatial perception abilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive performs at state-of-the-art levels, particularly in cases requiring sophisticated spatial understanding.
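
The marker idea itself is simple to picture: object coordinates are replaced by numbered overlays so that downstream text can refer to "marker 1", "marker 2", and so on instead of raw pixel coordinates. The sketch below is a generic illustration with assumed inputs, not MPDrive's actual pipeline.

    from PIL import Image, ImageDraw

    def draw_markers(image: Image.Image, centers, radius=12):
        # centers: list of (x, y) pixel coordinates of detected objects. Each object
        # receives a numbered circular marker drawn onto a copy of the image.
        img = image.copy()
        draw = ImageDraw.Draw(img)
        for i, (x, y) in enumerate(centers, start=1):
            draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                         outline="red", width=3)
            draw.text((x - 4, y - 8), str(i), fill="red")
        return img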


Poster #140
Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models

Hao Ren · Yiming Zeng · Zetong Bi · Zhaoliang Wan · Junlong Huang · Hui Cheng

Recent advancements in diffusion-based imitation learning, which show impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by denoising from Gaussian noise. However, the target action distribution often diverges significantly from Gaussian noise, leading to redundant denoising steps and increased learning complexity. Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose NaviBridger, a novel, unified visual navigation framework leveraging denoising diffusion bridge models. This approach enables action generation by initiating from any informative prior actions, enhancing guidance and efficiency in the denoising process. We explore how diffusion bridges can enhance imitation learning in visual navigation tasks and further examine three source policies for generating prior actions. Extensive experiments in both simulated and real-world indoor and outdoor scenarios demonstrate that NaviBridger accelerates policy inference and outperforms the baselines in generating target action sequences. Minimal implementation code is available in the supplementary materials.


Poster #141
Highlight
Reasoning in Visual Navigation of End-to-end Trained Agents: A Dynamical Systems Approach

Steeven JANNY · Hervé Poirier · Leonid Antsfeld · Guillaume Bono · Gianluca Monaci · Boris Chidlovskii · Francesco Giuliari · Alessio Del Bue · Christian Wolf

Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but evaluations and benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study of navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. In particular, we study the presence of realistic dynamics which the agent learned for open-loop forecasting and their interplay with sensing. We analyze the way the agent uses latent memory to hold elements of the scene structure and information gathered during exploration. We probe the planning capabilities of the agent and find in its memory evidence for somewhat precise plans over a limited horizon. Furthermore, we show in a post-hoc analysis that the value function learned by the agent relates to long-term planning. Put together, our experiments paint a new picture of how tools from computer vision and sequential decision making have led to new capabilities in robotics and control. An interactive tool is available at https://visual-navigation-reasoning.github.io


Poster #142
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

Shaofei Cai · Zihao Wang · Kewei Lian · Zhancun Mu · Xiaojian Ma · Anji Liu · Yitao Liang

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a 76% absolute improvement in open-world interaction performance. Codes and demos will be released.
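
At the input level, visual-temporal context prompting amounts to giving the low-level policy the tracked segmentation masks alongside the RGB frames; a minimal sketch with assumed shapes:

    import torch

    def build_policy_input(frames, masks):
        # frames: (T, 3, H, W) past RGB observations; masks: (T, 1, H, W) binary
        # segmentation of the object the high-level reasoner points at, tracked
        # over time. Channel-wise concatenation gives the policy an explicit
        # spatial "which object" signal that language alone cannot convey.
        return torch.cat([frames, masks], dim=1)   # (T, 4, H, W)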

This work presents IAAO, a novel framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. Unlike prior methods that rely on task-specific networks and assumptions about movable parts, our IAAO leverages large foundation models to estimate interactive affordances and part articulations in three stages. We first build hierarchical features and label fields for each state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances. Finally, scenes from different states are merged and refined based on the estimated transformations, enabling robust affordance-based interaction and manipulation of objects. Experimental results demonstrate the effectiveness of our method. We will make our code open-source upon paper acceptance.


Poster #144
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

Xin Wen · Bingchen Zhao · Yilun Chen · Jiangmiao Pang · Xiaojuan Qi

Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data—a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from non-object-centric robotics data is key to the success of PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck that reduces the number of prototypes to encourage the emergence of objectness, as well as cross-view consistency regularization to encourage multiview invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. We will make our code and model artifacts publicly available.
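
A rough sketch of the semantic-bottleneck idea: patch embeddings are softly assigned to a small set of prototypes, and two augmented views are encouraged to agree on those assignments. The function names, temperature, and consistency term are illustrative assumptions rather than SlotMIM's exact losses.

    import torch
    import torch.nn.functional as F

    def prototype_assign(features, prototypes, temperature=0.1):
        # features: (N, D) patch embeddings; prototypes: (K, D) with small K
        # (the semantic bottleneck). Returns soft assignments over prototypes.
        f = F.normalize(features, dim=-1)
        p = F.normalize(prototypes, dim=-1)
        return F.softmax(f @ p.t() / temperature, dim=-1)

    def cross_view_consistency(assign_view1, assign_view2):
        # Encourage the two augmented views of the same image to agree on their
        # prototype assignments (multiview invariance).
        return F.kl_div(assign_view1.log(), assign_view2, reduction="batchmean")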


Poster #145
Robotic Visual Instruction

Yanbang Li · ZiYang Gong · Haoyang Li · Xiaoqi Huang · Haolan Kang · Guangpingbai · Xianzheng Ma

Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision for robotic control introduces challenges such as ambiguity and verbosity. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, with disturbances, and trajectory-following requirements. Code and Datasets in this paper will be released soon.


Poster #146
DynScene: Scalable Generation of Dynamic Robotic Manipulation Scenes for Embodied AI

Sangmin Lee · Sungyong Park · Heewon Kim

Creating robotic manipulation datasets is traditionally labor-intensive and expensive, requiring extensive manual effort. To alleviate this problem, we introduce PhaseScene, which generates realistic and diverse dynamic scenes (or robotic manipulation data) from text instructions for Embodied AI. PhaseScene employs a phase-specific data representation by dividing dynamic scenes into static environments and robot movements. Each phase utilizes a diffusion-based method to generate phase-specific data, incorporating data refinement and augmentation techniques. Our experiments demonstrate that PhaseScene outperforms human creation, being about 20 times faster while achieving 1.84 times the accuracy and 28% higher action diversity on standard metrics. Additionally, the generated scenes enable accurate agent training, with an average success rate improvement of 7.96% for PerAct and 11.23% for PerAct-PSA.


Poster #147
FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation

Sen Wang · Le Wang · Sanping Zhou · Jingyi Tian · lijiayi · Haowen Sun · Wei Tang

Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which enables adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we incorporate state space models to integrate multimodal information while preserving linear computational complexity. In addition, we employ conditional flow matching to learn action poses by regressing deterministic vector fields, which simplifies the learning process while maintaining performance. We verify the effectiveness of FlowRAM on RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in fewer than 4 time steps, significantly increasing inference speed.
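
Conditional flow matching itself reduces to regressing a vector field along a simple interpolation path between noise and the expert action. Below is a minimal sketch under that standard formulation; the conditioning tensor and network are placeholders, not FlowRAM's actual modules.

    import torch

    def flow_matching_loss(model, noise, expert_action, cond):
        # noise, expert_action: (B, D) tensors; cond: region-aware scene features.
        # The network regresses the constant vector field pointing from the noise
        # sample to the expert action along a linear interpolation path.
        t = torch.rand(expert_action.shape[0], 1, device=expert_action.device)
        a_t = (1 - t) * noise + t * expert_action
        target_v = expert_action - noise
        pred_v = model(a_t, t, cond)
        return torch.mean((pred_v - target_v) ** 2)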


Poster #148
GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation

Ning Gao · Yilun Chen · Shuai Yang · Xinyi Chen · Yang Tian · Hao Li · Haifeng Huang · Hanqing Wang · Tai Wang · Jiangmiao Pang

Robotic manipulation in real-world settings presents significant challenges, particularly in achieving reliable performance across diverse real-world conditions. However, existing simulation platforms often lack the necessary support for studying policy generalization across varied tasks and conditions, falling behind the growing interest in leveraging foundation models. To address these limitations, we introduce GenManip, a realistic tabletop simulation platform designed to study policy generalization. The platform features a task-oriented scene graph-based scenario generation driven by GPT capabilities, enabling large-scale everyday task synthesis using 10K 3D assets. To investigate the generalization of robotic manipulation, we introduce GenManip-Bench, a benchmark comprising 250 task scenarios derived from generated tasks and refined through human-in-the-loop correction. We focus on two key areas: a modular manipulation system that employs foundation models for component-specific analysis, and end-to-end policy exploration using the scalable data collection pipeline. Experimental results show that while data scaling benefits learning-based policies, their generalization remains limited compared to modular approaches using foundation models. We expect this platform to offer critical insights for advancing policy generalizability in realistic settings. All code will be made publicly available.


Poster #149
UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

Wenbo Wang · Fangyun Wei · Lei Zhou · Xi Chen · Lin Luo · Xiaohan Yi · Yizhong Zhang · Yaobo Liang · Chang Xu · Yan Lu · Jiaolong Yang · Baining Guo

We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks for handling thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects in various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over state-of-the-art, UniDexGrasp++, across various object categories, achieving success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting.
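
The distillation stage can be pictured as plain behavior cloning on trajectories collected by the per-object RL teachers; the sketch below assumes a generic student network and a dataset of (state, teacher action) pairs, which is a simplification of the paper's setup.

    import torch

    def distillation_step(student, batch, optimizer):
        # batch: (states, teacher_actions) sampled from successful grasp
        # trajectories generated by dedicated per-object RL policies.
        states, teacher_actions = batch
        loss = torch.nn.functional.mse_loss(student(states), teacher_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()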


Poster #150
Highlight
ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping

Youxin Pang · Ruizhi Shao · Jiajun Zhang · Hanzhang Tu · Yun Liu · Boyao Zhou · Hongwen Zhang · Yebin Liu

In this paper, we introduce ManiVideo, a novel method for generating consistent and temporally coherent bimanual hand-object manipulation videos from given motion sequences of hands and objects. The core idea of ManiVideo is the construction of a multi-layer occlusion (MLO) representation that learns 3D occlusion relationships from occlusion-free normal maps and occlusion confidence maps. By embedding the MLO structure into the UNet in two forms, the model enhances the 3D consistency of dexterous hand-object manipulation. To further achieve the generalizable grasping of objects, we integrate Objaverse, a large-scale 3D object dataset, to address the scarcity of video data, thereby facilitating the learning of extensive object consistency. Additionally, we propose an innovative training strategy that effectively integrates multiple datasets, supporting downstream tasks such as human-centric hand-object manipulation video generation. Through extensive experiments, we demonstrate that our approach not only achieves video generation with plausible hand-object interaction and generalizable objects, but also outperforms existing SOTA methods.


Poster #151
Hand-held Object Reconstruction from RGB Video with Dynamic Interaction

Shijian Jiang · Qi Ye · Rengan Xie · Yuchi Huo · Jiming Chen

This work aims to reconstruct the 3D geometry of a rigid object manipulated by one or both hands using monocular RGB video. Previous methods rely on Structure-from-Motion or hand priors to estimate relative motion between the object and camera, which typically assume textured objects or single-hand interactions. To accurately recover object geometry in dynamic hand-object interactions, we incorporate priors from 3D generation models into object pose estimation and propose semantic consistency constraints to solve the challenge of shape and texture discrepancy between the generated priors and observations. The poses are initialized, followed by joint optimization of the object poses and implicit neural representation. During the optimization, a novel pose outlier voting strategy with inter-view consistency is proposed to correct large pose errors. Experiments on three datasets demonstrate that our method significantly outperforms the state-of-the-art in reconstruction quality for both single- and two-hand scenarios.


Poster #152
UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation

Yinqiao Wang · Hao Xu · Pheng-Ann Heng · Chi-Wing Fu

Estimating the 3D pose of a hand and a potential hand-held object from monocular images is a longstanding challenge. Yet, existing methods are specialized, focusing on either bare hands or hands interacting with objects. No method can flexibly handle both scenarios, and their performance degrades when applied to the other scenario. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation that flexibly adapts to both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features, with an object switcher to dynamically control the hand-object pose estimation according to grasping status. Further, to improve the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to learn object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly-used benchmarks demonstrate UniHOPE’s SOTA performance in addressing hand-only and hand-object scenarios. Code will be publicly released upon publication.


Poster #153
WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild

Rolandos Alexandros Potamias · Jinglei Zhang · Jiankang Deng · Stefanos Zafeiriou

In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, posing significant challenges to constructing effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization network and a high-fidelity transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with more than 2M in-the-wild hand images under diverse lighting, illumination, and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline in achieving smooth 3D hand tracking from monocular videos, without utilizing any temporal components. Code, models, and dataset will be made publicly available.


Poster #154
Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation

Zhuoran ZHAO · Linlin Yang · Pengzhan Sun · Pan Hui · Angela Yao

Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits on photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap of 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when integrating our identified components, paving the path to use synthetic data alone for hand pose estimation. Source code and data will be released upon acceptance.


Poster #155
Highlight
InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

Sirui Xu · Hung Yu Ling · Yu-Xiong Wang · Liangyan Gui

Achieving realistic simulations of humans engaging in a wide range of object interactions has long been a fundamental goal in animation. Extending physics-based motion imitation techniques to complex human-object interactions (HOIs) is particularly challenging due to the intricate coupling between human-object dynamics and the variability in object geometries and properties. Moreover, motion capture data often contain artifacts such as inaccurate contacts and insufficient hand details, which hinder the learning process. We introduce InterMimic, a framework that overcomes these challenges by enabling a single policy to robustly learn from imperfect motion capture sequences encompassing tens of hours of diverse full-body interaction skills with dynamic and varied objects. Our key insight is employing a curriculum strategy: perfecting first, then scaling up. We first train subject-specific teacher policies to mimic, retarget, and refine the motion capture data, effectively correcting imperfections. Then, we distill a student policy from these teachers; the teachers act as online experts providing direct supervision and supplying clean references. This ensures that the student policy learns from high-quality guidance despite imperfections in the original dataset. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across various HOI datasets. Notably, the learned policy exhibits zero-shot generalization, allowing seamless integration with kinematic generators and transforming the entire framework from mere imitation to generative modeling tasks.


Poster #156
PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation

Uyoung Jeong · Jonathan Freer · Seungryul Baek · Hyung Jin Chang · Kwang In Kim

Human pose estimation is in increasing demand across diverse applications, from avatar generation to human-robot interaction. However, the domains of these applications often diverge from standard human pose estimation datasets, leading to limited domain transfer. Particularly in multi-dataset training (MDT), there are often variations in skeleton types and limited comprehensive supervision across them. We propose a novel MDT framework, called PoseBH, that integrates poses beyond humans. Our method addresses keypoint heterogeneity and limited supervision through two primary techniques. First, we introduce nonparametric keypoint prototypes that are learned on a unified embedding space, enabling seamless integration across arbitrary skeleton types and facilitating robust domain transfer. Second, we introduce a cross-modal self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, thus enhancing supervision without reliance on teacher-student models or additional augmentations. PoseBH demonstrates significant generalization improvements on whole-body and animal pose datasets (COCO-WholeBody, AP-10K, APT-36K), while maintaining performance on the standard human pose benchmarks (COCO, MPII, AIC). Our learned keypoint embeddings also transfer well to the hand shape (InterHand2.6M) and human shape (3DPW) domains.


Poster #157
M3GYM: A Large-Scale Multimodal Multi-view Multi-person Pose Dataset for Fitness Activity Understanding in Real-world Settings

Qingzheng Xu · Ru Cao · Xin Shen · Heming Du · Sen Wang · Xin Yu

Human pose estimation is a critical task in computer vision, with applications in sports analysis, healthcare monitoring, and human-computer interaction. However, existing human pose datasets are either collected in custom-configured laboratories with complex devices or include only single individuals, and both types typically capture daily activities. In this paper, we introduce the M3GYM dataset, a large-scale multimodal, multi-view, and multi-person pose dataset collected from a real gym to address the limitations of existing datasets. Specifically, we collect videos for 82 sessions from the gym, each lasting between 40 and 60 minutes. These videos are gathered by 8 cameras, covering over 50 subjects and 47 million frames. The sessions include 51 Normal fitness exercise sessions as well as 17 Pilates and 14 Yoga sessions. The exercises cover a wide range of poses and typical fitness activities, particularly in Yoga and Pilates, featuring poses with stretches, bends, and twists, e.g., humble warrior, fire hydrants, and knee hover side twists. Each session involves multiple subjects, leading to significant self-occlusion and mutual occlusion in single views. Moreover, the gym has two symmetric floor mirrors, a feature not seen in previous datasets, and seven lighting conditions. We provide frame-level multimodal annotations, including 2D and 3D keypoints, subject IDs, and meshes. Additionally, M3GYM uniquely offers labels for over 500 actions along with corresponding assessments from sports experts. We benchmark a variety of state-of-the-art methods for several tasks, i.e., 2D human pose estimation, single-view and multi-view 3D human pose estimation, and human mesh recovery. To simulate real-world applications, we also conduct cross-domain experiments across Normal, Yoga, and Pilates sessions. The results show that M3GYM significantly improves model generalization in complex real-world settings.


Poster #158
Certified Human Trajectory Prediction

Mohammadhossein Bahari · Saeed Saadatnejad · Amirhossein Askari Farsangi · Seyed-Mohsen Moosavi-Dezfooli · Alex Alahi

Predicting human trajectories is essential for the safe operation of autonomous vehicles, yet current data-driven models often lack robustness to noisy inputs such as adversarial examples or imperfect observations. Although some trajectory prediction methods have been developed to provide empirical robustness, these methods are heuristic and do not offer guaranteed robustness. In this work, we propose a certification approach tailored for trajectory prediction that provides guaranteed robustness. To this end, we address the unique challenges associated with trajectory prediction, such as unbounded outputs and multi-modality. To mitigate the inherent performance drop caused by certification, we propose a diffusion-based trajectory denoiser and integrate it into our method. Moreover, we introduce new certified performance metrics to reliably measure trajectory prediction performance. Through comprehensive experiments, we demonstrate the accuracy and robustness of the certified predictors and highlight their advantages over non-certified ones. The code will be released upon publication.


Poster #159
Highlight
ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate

Ming Yan · Xincheng Lin · Yuhua Luo · Shuqi Fan · Yudi Dai · Qixin Zhong · Lincai Zhong · Yuexin Ma · Lan Xu · Chenglu Wen · Siqi Shen · Cheng Wang

Human Motion Recovery (HMR) research mainly focuses on ground-based motions such as running; the study of capturing climbing, an off-ground motion, is sparse. This is partly due to the limited availability of climbing motion datasets, especially large-scale, challenging datasets with 3D labels. To address this insufficiency, we collect AscendMotion, a large-scale, well-annotated, and challenging climbing motion dataset. It consists of 412k RGB and LiDAR frames together with IMU measurements, covering the challenging climbing motions of 22 professional climbing coaches across 12 different rocks. Capturing climbing motion is challenging because it requires precise recovery of not only the complex pose but also the global position of climbers. Although multiple global HMR methods have been proposed, they cannot faithfully capture climbing motions. To address the limitations of HMR methods for climbing, we propose ClimbingCap, a motion recovery method that reconstructs continuous 3D human climbing motion in a global coordinate system. One key insight is to use the RGB and LiDAR modalities to separately reconstruct motions in camera coordinates and global coordinates and to optimize them jointly. We demonstrate the quality of the AscendMotion dataset and present promising results from ClimbingCap. The AscendMotion dataset and the source code of ClimbingCap will be released publicly to the research community.


Poster #160
Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment

Hiromu Taketsugu · Takeru Oba · Takahiro Maeda · Shohei Nobuhara · Norimichi Ukita

Humans can predict future human trajectories even from momentary observations by using human pose-related cues. However, previous Human Trajectory Prediction (HTP) methods leverage such pose cues only implicitly, resulting in implausible predictions. To address this, we propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of a predicted trajectory by generating the corresponding locomotion under the laws of physics. While the plausibility of locomotion is learned with a non-differentiable physics simulator, it is replaced by our differentiable Locomotion Value function to train an HTP network in a data-driven manner. In particular, our proposed Embodied Locomotion loss enables efficient training of a stochastic HTP network with multiple heads. Furthermore, the Locomotion Value filter is proposed to filter out implausible trajectories at inference. Experiments demonstrate that our method further enhances even state-of-the-art HTP methods across diverse datasets and problem settings. Our code will be publicly available.
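A minimal sketch (assuming a generic value_net that scores trajectory plausibility; not the paper's trained Locomotion Value function) of how a differentiable plausibility score could serve both as a training loss over multiple prediction heads and as an inference-time filter:

import torch

def embodied_locomotion_loss(pred_trajs, value_net):
    # pred_trajs: (H, T, 2) future trajectories from H prediction heads
    values = value_net(pred_trajs).squeeze(-1)   # (H,) plausibility in [0, 1]
    return (1.0 - values).mean()                 # push every head toward plausible motion

def locomotion_value_filter(pred_trajs, value_net, threshold=0.5):
    values = value_net(pred_trajs).squeeze(-1)
    return pred_trajs[values >= threshold]       # drop implausible candidates at inference

# toy usage with a stand-in value network
value_net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(12 * 2, 1), torch.nn.Sigmoid())
trajs = torch.randn(6, 12, 2)
print(embodied_locomotion_loss(trajs, value_net), locomotion_value_filter(trajs, value_net).shape)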


Poster #161
Vision-Guided Action: Enhancing 3D Human Motion Prediction with Gaze-informed Affordance in 3D Scenes

Ting Yu · Yi Lin · Jun Yu · Zhenyu Lou · Qiongjie Cui

Recent advances in human motion prediction (HMP) have shifted focus from isolated motion data to integrating human-scene correlations. In particular, the latest methods leverage human gaze points, using their spatial coordinates to indicate intent—where a person might move within a 3D environment. Despite promising trajectory results, these methods often produce inaccurate poses by overlooking the semantic implications of gaze, specifically the affordances of observed objects, which indicate the possible interactions. To address this, we propose GAP3DS, an affordance-aware HMP model that utilizes gaze-informed object affordances to improve HMP in complex 3D environments. GAP3DS incorporates a gaze-guided affordance learner to identify relevant objects in the scene and infer their affordances based on human gaze, thus contextualizing future human-object interactions. This affordance information, enriched with visual features and gaze data, conditions the generation of multiple human-object interaction poses, which are subsequently decoded into final motion predictions. Extensive experiments on two real-world datasets demonstrate that GAP3DS outperforms state-of-the-art methods in both trajectory and pose accuracy, producing more physically consistent and contextually grounded predictions.


Poster #162
On Denoising Walking Videos for Gait Recognition

Dongyang Jin · Chao Fan · Jingzhe Ma · Jingkai Zhou · Weihua Chen · Shiqi Yu

Capturing individual gait patterns while excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. To address this, emerging end-to-end methods focus on directly denoising RGB videos using global optimization and human-defined priors. Building on this trend, we propose a novel gait denoising method, DenoisingGait. Inspired by the philosophy that “what I cannot create, I do not understand”, we turn to generative diffusion models, uncovering how these models can partially filter out irrelevant factors for improved gait understanding. On top of this generation-driven denoising, we introduce feature matching, a geometric constraint popular in optical flow and depth estimation, to condense multi-channel float-encoded RGB information into two-channel direction vectors that represent local structural features, where within-frame matching captures spatial details and cross-frame matching conveys temporal dynamics. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves new SoTA performance in most cases for both within-domain and cross-domain evaluations. All the code will be released.


Poster #163
ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation

Ling-An Zeng · Guohong Huang · Yi-Lin Wei · Shengbo Gu · Yu-Ming Tang · Jingke Meng · Wei-Shi Zheng

We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic and semantically consistent HOIs.


Poster #164
StickMotion: Generating 3D Human Motions by Drawing a Stickman

Tao Wang · Zhihua Wu · Qiaozhi He · Jiaming Chu · Ling Qian · Yu Cheng · Junliang Xing · Jian Zhao · Lei Jin

Text-to-motion generation, which translates textual descriptions into human motions, has been challenging in accurately capturing detailed user-imagined motions from simple text inputs. This paper introduces StickMotion, an efficient diffusion-based network designed for multi-condition scenarios, which generates desired motions based on traditional text and our proposed stickman conditions for global and local control of these motions, respectively. We address the challenges introduced by the user-friendly stickman from three perspectives: 1) Data generation. We develop an algorithm to generate hand-drawn stickmen automatically across different dataset formats. 2) Multi-condition fusion. We propose a multi-condition module that is integrated into the diffusion process and obtains outputs for all possible condition combinations, reducing computational complexity and enhancing StickMotion's performance compared to conventional approaches with the self-attention module. 3) Dynamic supervision. We empower StickMotion to make minor adjustments to the stickman's position within the output sequences, generating more natural movements through our proposed dynamic supervision strategy. Quantitative experiments and user studies show that sketching stickmen saves users about 51.5% of the time required to generate motions consistent with their imagination. Our code, demos, and relevant data will be released to facilitate further research and validation within the scientific community.


Poster #165
MixerMDM: Learnable Composition of Human Motion Diffusion Models

Pablo Ruiz-Ponce · German Barquero · Cristina Palmero · Sergio Escalera · Jose Garcia-Rodriguez

Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality by computing the alignment between the mixed generated motions and their conditions as well as the capabilities of MixerMDM to adapt the mixing throughout the denoising process depending on the motions to mix.
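As a loose sketch of the mixing interface (hypothetical shapes and module names; not the paper's implementation, which trains the mixer adversarially), a learnable gate can weight the noise predictions of two pre-trained motion diffusion models at each denoising step, conditioned on the guidance signals and the timestep:

import torch
import torch.nn as nn

class NoiseMixer(nn.Module):
    def __init__(self, cond_dim, hidden=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(cond_dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, eps_a, eps_b, cond, t):
        # eps_a, eps_b: (B, T, D) noise predictions from the two pre-trained models
        # cond: (B, C) condition embedding, t: (B,) diffusion timestep
        w = torch.sigmoid(self.gate(torch.cat([cond, t.float().unsqueeze(-1)], dim=-1)))
        w = w.unsqueeze(-1)                      # (B, 1, 1), broadcast over motion dims
        return w * eps_a + (1.0 - w) * eps_b     # mixed denoising direction

mixer = NoiseMixer(cond_dim=32)
mixed = mixer(torch.randn(2, 60, 263), torch.randn(2, 60, 263), torch.randn(2, 32), torch.tensor([10, 10]))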


Poster #166
HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

Boyuan Wang · Xiaofeng Wang · Chaojun Ni · Guosheng Zhao · Zhiqin Yang · Zheng Zhu · Muyang Zhang · YuKun Zhou · xinze chen · Guan Huang · lihong liu · Xingang Wang

Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on this dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, a novel LAMA loss is introduced; together, these contribute to a significant 62.4% improvement in FID, along with respective gains in R-precision for top1, top2, and top3 of 41.8%, 26.3%, and 18.3%, advancing both text-to-pose control accuracy and FID metrics. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-3D motion lifting.


Poster #167
Poly-Autoregressive Prediction for Modeling Interactions

Neerja Thakkar · Tara Sadjadpour · Jathushan Rajasegaran · Shiry Ginosar · Jitendra Malik

We introduce a simple framework for predicting the behavior of an ego agent in multi-agent settings. In contrast to autoregressive (AR) tasks, such as language processing, our focus is on scenarios with multiple agents whose interactions are shaped by physical constraints and internal motivations. To this end, we propose Poly-Autoregressive (PAR) modeling, which forecasts an ego agent’s future behavior by reasoning about the ego agent’s state history and the current state of other interacting agents. At its core, PAR represents the behavior of all agents as a sequence of tokens, each representing an agent’s state at a specific timestep. With minimal data pre-processing changes, we show that PAR can be applied to three different problems: human action prediction in social situations, trajectory prediction for autonomous vehicles, and object pose prediction during hand-object interaction. Using a small proof-of-concept transformer backbone, PAR outperforms AR across our three scenarios.
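A minimal sketch (illustrative shapes only, not the authors' code) of the token layout this implies: the ego agent's history and the other agents' current states are flattened into one sequence of per-agent, per-timestep state tokens that a standard transformer can consume.

import numpy as np

def build_par_tokens(ego_history, others_current):
    # ego_history:    (T, D) past states of the ego agent, one token per timestep
    # others_current: (N, D) current states of the N other agents, one token each
    return np.concatenate([ego_history, others_current], axis=0)   # (T + N, D)

tokens = build_par_tokens(np.random.rand(8, 4), np.random.rand(3, 4))
print(tokens.shape)   # (11, 4) -> fed to a small transformer that predicts the ego's next state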


Poster #168
Adapting Pre-trained 3D Models for Point Cloud Video Understanding via Cross-frame Spatio-temporal Perception

Baixuan Lv · Yaohua Zha · Tao Dai · Xue Yuerong · Ke Chen · Shu-Tao Xia

Point cloud video understanding is becoming increasingly important in fields such as robotics, autonomous driving, and augmented reality, as they can accurately represent object motion and environmental changes. Despite the progress made in self-supervised learning methods for point cloud video understanding, the limited availability of 4D data and the high computational cost of training 4D-specific models remain significant obstacles. In this paper, we investigate the potential of transferring pre-trained static 3D point cloud models to the 4D domain, identifying the limitations of static models that capture only spatial information while neglecting temporal dynamics. To address this, we propose a novel Cross-frame Spatio-temporal Adaptation (CSA) strategy by introducing the Point Tube Adapter as the embedding layer and the Geometric Constraint Temporal Adapter (GCTA) to enforce temporal consistency across frames. This strategy extracts both short-term and long-term temporal dynamics, effectively integrating them with spatial features and enriching the model’s understanding of temporal changes in point cloud videos. Extensive experiments on 3D action and gesture recognition tasks demonstrate that our method achieves state-of-the-art performance, establishing its effectiveness for point cloud video understanding.


Poster #169
Recovering Dynamic 3D Sketches from Videos

Jaeah Lee · Changwoon Choi · Young Min Kim · Jaesik Park

Understanding 3D motion from videos presents inherent challenges due to the diverse types of movement, ranging from rigid and deformable objects to articulated structures. To overcome this, we propose Liv3Stroke, a novel approach for abstracting objects in motion with deformable 3D strokes. The detailed movements of an object may be represented by unstructured motion vectors or a set of motion primitives using a pre-defined articulation from a template model. Just as a free-hand sketch can intuitively visualize scenes or intentions with a sparse set of lines, we utilize a set of parametric 3D curves to capture a set of spatially smooth motion elements for general objects with unknown structures. We first extract noisy, 3D point cloud motion guidance from video frames using semantic features, and our approach deforms a set of curves to abstract essential motion features as a set of explicit 3D representations. Such abstraction enables an understanding of prominent components of motions while maintaining robustness to environmental factors. Our approach allows direct analysis of 3D object movements from video, tackling the uncertainty that typically occurs when translating real-world motion into recorded footage.


Poster #170
FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity

Jinxi Li · Ziyang Song · Siyuan Zhou · Bo Yang

In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. By applying various governing PDEs as PINN losses or incorporating physics simulation into neural networks, existing works often fail to learn complex physical motions at boundaries or require object priors such as masks or types. We propose FreeGave to learn the physics of complex dynamic 3D scenes without needing any object priors. The key to our approach is to introduce a physics code followed by a carefully designed divergence-free module for estimating a per-Gaussian velocity field, without relying on inefficient PINN losses. Extensive experiments on two public datasets and a newly collected, challenging real-world dataset demonstrate the superior performance of our method for future frame extrapolation and motion segmentation. Most notably, our investigation into the learned physics codes reveals that they truly learn meaningful 3D physical motion patterns in the absence of any human labels during training.


Poster #171
Dynamic Camera Poses and Where to Find Them

Chris Rockwell · Joseph Tung · Tsung-Yi Lin · Ming-Yu Liu · David Fouhey · Chen-Hsuan Lin

Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering with a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.


Poster #172
Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation

Jingxi Chen · Brandon Y. Feng · Haoming Cai · Tianfu Wang · Levi Burner · Dehao Yuan · Cornelia Fermuller · Christopher Metzler · Yiannis Aloimonos

Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied upon a limited set of paired event-frame training data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited data challenge by adapting pre-trained video diffusion models trained on internet-scale datasets to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one we introduce. Our method outperforms existing methods and generalizes across cameras far better than existing approaches.


Poster #173
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models

Rick Akkerman · Haiwen Feng · Michael J. Black · Dimitrios Tzionas · Victoria Abrevaya

Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but focus on generating a single future state which neglects the continuous motion and subsequent dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video foundation models can act as both neural renderers and implicit physics “simulators” by learning interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions, while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines. Code and trained models will be released.


Poster #174
DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

Emanuele Aiello · Umberto Michieli · Diego Valsesia · Mete Ozay · Enrico Magli

Personalized image generation requires text-to-image generative models that capture the core features of a reference subject to allow for controlled generation across different contexts. Existing methods face challenges due to complex training requirements, high inference costs, limited flexibility, or a combination of these issues. In this paper, we introduce DreamCache, a scalable approach for efficient and high-quality personalized image generation. By caching a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, DreamCache enables dynamic modulation of the generated image features through lightweight, trained conditioning adapters. DreamCache achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally efficient and more versatile than existing models.


Poster #175
Highlight
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Hanlin Wang · Hao Ouyang · Qiuyu Wang · Wen Wang · Ka Leong Cheng · Qifeng Chen · Yujun Shen · Limin Wang

The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience from 2D dragging, but facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images.


Poster #176
Highlight
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Xi Chen · Zhifei Zhang · He Zhang · Yuqian Zhou · Soo Ye Kim · Qing Liu · Yijun Li · Jianming Zhang · Nanxuan Zhao · Yilin Wang · Hui Ding · Zhe Lin · Hengshuang Zhao

We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by tasks, yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, composition, etc. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.


Poster #177
Highlight
Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think

Zhenyi Lu · Xiaoye Qu · Zhenyi Lu · Wei Wei · Sichen Liu · Yu Cheng

Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. However, current I2V diffusion models (I2V-DMs) often produce videos with limited motion degrees or exhibit uncontrollable motion that conflicts with the textual condition. In this paper, we propose a novel Extrapolating and Decoupling framework to mitigate these issues. Specifically, our framework consists of three separate stages: (1) Starting with a base I2V-DM, we explicitly inject the textual condition into the temporal module using a lightweight, learnable adapter and fine-tune the integrated model to improve motion controllability. (2) We introduce a training-free extrapolation strategy to amplify the dynamic range of the motion, effectively reversing the fine-tuning process to enhance the motion degree significantly. (3) With the above two-stage models excelling in motion controllability and motion degree, we decouple the relevant parameters associated with each type of motion ability and inject them into the base I2V-DM. Since the I2V-DM handles different levels of motion controllability and dynamics at various denoising time steps, we adjust the motion-aware parameters accordingly over time. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of our framework over existing methods.


Poster #178
Highlight
Generative Omnimatte: Learning to Decompose Video into Layers

Yao-Chih Lee · Erika Lu · Sarah Rumbley · Michal Geyer · Jia-Bin Huang · Tali Dekel · Forrester Cole

Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of a generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.


Poster #179
RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression

Uri Gadot · Shie Mannor · Assaf Shocher · Gal Chechik · Assaf Hallak

Video encoders optimize compression for human perception by minimizing reconstruction error under bit-rate constraints. In many modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks like object recognition or segmentation, rather than being watched by humans. It is therefore useful to optimize the encoder for a downstream task instead of for perceptual image quality. However, a major challenge is how to combine such downstream optimization with existing standard video encoders, which are highly efficient and popular. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task. This granular control allows us to prioritize encoding for task-relevant regions within each frame. We formulate this optimization problem as a Reinforcement Learning (RL) task, where the agent learns to balance the long-term implications of QP choices on both task performance and bit-rate constraints. Notably, our policy does not require the downstream task as an input during inference, making it suitable for streaming applications and edge devices such as vehicles. We demonstrate significant improvements in two tasks: car detection and ROI (saliency) encoding. Our approach improves task performance for a given bit rate compared to traditional task-agnostic encoding methods, paving the way for more efficient task-aware video compression.
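A rough sketch of the reward shaping such an agent could use (placeholder quantities; the detector score and encoder bit counts would come from external components, and the exact reward in the paper may differ):

import numpy as np

def qp_reward(task_score, bits_used, bit_budget, penalty=1.0):
    # reward the downstream task metric, penalize exceeding the bit budget
    overshoot = max(0.0, bits_used - bit_budget)
    return task_score - penalty * overshoot / bit_budget

# the action is a map of QP values, one per 16x16 macro-block of a 1080p frame
qp_map = np.clip(26 + np.random.randint(-4, 5, size=(68, 120)), 0, 51)
print(qp_reward(task_score=0.71, bits_used=1.2e6, bit_budget=1.0e6))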


Poster #180
Towards Practical Real-Time Neural Video Compression

Zhaoyang Jia · Bin Li · Jiahao Li · Wenxuan Xie · Linfeng Qi · Houqiang Li · Yan Lu

We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs is influenced by 1) computational costs, and 2) non-computational operational costs, such as memory I/O and the number of function calls. While most efficient NVCs prioritize reducing computational cost, we identify operational cost as the primary bottleneck to achieving higher coding speed. Leveraging this insight, we introduce a set of efficiency-driven design improvements focused on minimizing operational costs. Specifically, we employ implicit temporal modeling to eliminate complex explicit motion modules, and use single low-resolution latent representations rather than progressive downsampling. These innovations significantly accelerate NVC without sacrificing compression quality. Additionally, we implement model integerization for consistent cross-device coding and a module-bank-based rate control scheme to improve practical adaptability. Experiments show our NVC achieves an impressive average encoding/decoding speed at 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM. The code will be released.


Poster #181
Neural Video Compression with Context Modulation

Chuanbo Tang · Zhuoyuan Li · Yifan Bian · Li Li · Dong Liu

Efficient video coding is highly dependent on exploiting the temporal redundancy, which is usually achieved by extracting and leveraging the temporal context in the emerging conditional coding-based neural video codec (NVC). Although the latest NVC has achieved remarkable progress in improving the compression performance, the inherent temporal context propagation mechanism lacks the ability to sufficiently leverage the reference information, limiting further improvement. In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. Specifically, we first propose the flow orientation to mine the inter-correlation between the reference frame and prediction frame for generating the additional oriented temporal context. Moreover, we introduce the context compensation to leverage the oriented context to modulate the propagated temporal context generated from the propagated reference feature. Through the synergy mechanism and decoupling loss supervision, the irrelevant propagated information can be effectively eliminated to ensure better context modeling. Experimental results demonstrate that our codec achieves on average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offers an average 10.1% bitrate saving over the previous state-of-the-art NVC DCVC-FM.


Poster #182
Event-based Video Super-Resolution via State Space Models

Zeyu Xiao · Xinchao Wang

Exploiting temporal correlations is crucial for video super-resolution (VSR). Recent approaches enhance this by incorporating event cameras. In this paper, we introduce MamEVSR, a Mamba-based network for event-based VSR that leverages the selective state space model, Mamba. MamEVSR stands out by offering global receptive field coverage with linear computational complexity, thus addressing the limitations of convolutional neural networks and Transformers. The key components of MamEVSR include: (1) The interleaved Mamba (iMamba) block, which interleaves tokens from adjacent frames and applies multi-directional selective state space modeling, enabling efficient feature fusion and propagation across bi-directional frames while maintaining linear complexity. (2) The cross-modality Mamba (cMamba) block, which facilitates further interaction and aggregation between event information and the output from the iMamba block. The cMamba block can leverage complementary spatio-temporal information from both modalities and allows MamEVSR to capture finer motion details. Experimental results show that the proposed MamEVSR achieves superior performance on various datasets both quantitatively and qualitatively.
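A minimal sketch (assumed token shapes, not the released implementation) of the interleaving step: tokens from two adjacent frames are woven into one sequence so a single linear-time scan mixes information across frames.

import torch

def interleave_frames(tok_a, tok_b):
    # tok_a, tok_b: (N, D) token sequences from adjacent frames -> (2N, D)
    out = torch.empty(tok_a.shape[0] * 2, tok_a.shape[1], dtype=tok_a.dtype)
    out[0::2] = tok_a
    out[1::2] = tok_b
    return out

seq = interleave_frames(torch.randn(1024, 96), torch.randn(1024, 96))
print(seq.shape)   # torch.Size([2048, 96])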


Poster #183
STDD: Spatio-Temporal Dual Diffusion for Video Generation

Shuaizhen Yao · Xiaoya Zhang · Xin Liu · Mengyi Liu · Zhen Cui

The diffusion probabilistic model is becoming a cornerstone of data generation, especially for generating high-quality images. As an extension, video diffusion generation urgently needs a principled way to diffuse along the temporal sequence, while spatial-domain diffusion dominates most existing video diffusion methods. In this work, we propose an explicit Spatio-Temporal Dual Diffusion (STDD) method that extends the standard diffusion model in a principled way to a spatio-temporal diffusion model for joint spatial and temporal noise propagation/reduction. Mathematically, an analysable dual diffusion process is derived to accumulate noise/information over the temporal sequence as well as the spatial domain. Correspondingly, we theoretically derive a spatio-temporal probabilistic reverse diffusion process and propose an accelerated sampling scheme to reduce the inference cost. In principle, the spatio-temporal dual diffusion enables information from previous frames to be transferred to the current frame, which is beneficial for video consistency. Extensive experiments demonstrate that our proposed STDD is more competitive than state-of-the-art methods in video generation/prediction as well as text-to-video generation.


Poster #184
IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior

Jingyi Xu · Siwei Tu · Weidong Yang · Ben Fei · Shuhao Li · Keyi Liu · Yeqi Luo · Lipeng Ma · Lei Bai

Variation of Arctic sea ice has significant impacts on polar ecosystems, transport routes, coastal communities, and global climate. Tracing the change of sea ice at a finer scale is paramount for both operational applications and scientific studies. Recent pan-Arctic sea ice forecasting methods that leverage advances in artificial intelligence have made promising progress over numerical models. However, forecasting sea ice at higher resolutions is still under-explored. To bridge the gap, we propose a two-module cooperative deep learning framework, IceDiff, to forecast sea ice concentration at finer scales. IceDiff first leverages a vision transformer to generate coarse yet superior forecasts over previous methods at a regular 25km grid. This high-quality sea ice forecast serves as reliable guidance for the next module. Subsequently, an unconditional diffusion model pre-trained on low-resolution sea ice concentration maps is utilized to sample down-scaled sea ice forecasts via a zero-shot guided sampling strategy and a patch-based method. For the first time, IceDiff demonstrates sea ice forecasting at a 6.25km resolution. IceDiff extends the boundary of existing sea ice forecasting models and, more importantly, its capability to generate high-resolution sea ice concentration data is vital for practical applications and research.


Poster #185
OSV: One Step is Enough for High-Quality Image to Video Generation

Xiaofeng Mao · Zhengkai Jiang · Fu-Yun Wang · Jiangning Zhang · Hao Chen · Mingmin Chi · Yabiao Wang · Wenhan Luo

Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. Although techniques such as consistency distillation and adversarial training have been employed to accelerate video diffusion by reducing inference steps, these methods often simply transfer the generation approaches of image diffusion models to video diffusion models. As a result, they frequently fall short in terms of both performance and training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with adversarial training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency distillation based method AnimateLCM (FVD 184.79), and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).


Poster #186
I2VGuard: Safeguarding Images against Misuse in Diffusion-based Image-to-Video Models

Dongnan Gui · Xun Guo · Wengang Zhou · Yan Lu

Recent advances in image-to-video generation have enabled animation of still images and offered pixel-level controllability. While these models hold great potential to transform single images into vivid and dynamic videos, they also carry risks of misuse that could impact privacy, security, and copyright protection. This paper proposes a novel approach that applies imperceptible perturbations to images to degrade the quality of the generated videos, thereby protecting images from misuse in white-box image-to-video diffusion models. Specifically, we formulate our approach as an adversarial attack, incorporating spatial, temporal, and diffusion attack modules. The spatial attack shifts image features from their original distribution to a lower-quality target distribution, reducing visual fidelity. The temporal attack disrupts coherent motion by interfering with the temporal attention maps that guide motion generation. To enhance the robustness of our approach across different models, we further propose a diffusion attack module leveraging a contrastive loss. Our approach can be easily integrated with mainstream diffusion-based I2V models. Extensive experiments on SVD, CogVideoX, and ControlNeXt demonstrate that our method significantly impairs generation quality in terms of visual clarity and motion consistency, while introducing only minimal artifacts to the images. To the best of our knowledge, we are the first to explore adversarial attacks on image-to-video generation for security purposes.
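A minimal sketch of the spatial-attack ingredient under a PGD-style budget (feature_extractor and target_feat are hypothetical stand-ins for a frozen I2V encoder and a low-quality target; the paper additionally attacks temporal attention and the diffusion process):

import torch
import torch.nn.functional as F

def protect_image(image, feature_extractor, target_feat, eps=8 / 255, step=1 / 255, iters=50):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        # pull the perturbed image's features toward the low-quality target distribution
        loss = F.mse_loss(feature_extractor(image + delta), target_feat)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()   # signed gradient step (descent on the distance)
            delta.clamp_(-eps, eps)             # keep the perturbation imperceptible
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

# toy usage with a stand-in encoder
enc = torch.nn.Conv2d(3, 8, 3, padding=1)
protected = protect_image(torch.rand(1, 3, 64, 64), enc, torch.zeros(1, 8, 64, 64))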


Poster #187
CASP: Consistency-aware Audio-induced Saliency Prediction Model for Omnidirectional Video

Zhaolin Wan · Han Qin · Zhiyang Li · Xiaopeng Fan · Wangmeng Zuo · Debin Zhao

Omnidirectional videos (ODVs) present distinct challenges for accurate audio-visual saliency prediction due to their immersive nature, which combines spatial audio with panoramic visuals to enhance the user experience. While auditory cues are crucial for guiding visual attention across the panoramic scene, the interaction between audio and visual stimuli in ODVs remains underexplored. Existing models primarily focus on spatiotemporal visual cues and treat audio signals separately from their spatial and temporal contexts, often leading to misalignments between audio and visual content and undermining temporal consistency across frames. To bridge these gaps, we propose a novel audio-induced saliency prediction model for ODVs that holistically integrates audio and visual inputs through a multi-modal encoder, an audio-visual interaction module, and an audio-visual transformer. Unlike conventional methods that isolate audio cue locations and attributes, our model employs a query-based framework, where learnable audio queries capture comprehensive audio-visual dependencies, thus enhancing saliency prediction by dynamically aligning with audio cues. Besides, we introduce a novel consistency loss to enforce temporal coherence in saliency regions across frames. Extensive experiments demonstrate that our model outperforms state-of-the-art methods in predicting audio-visual salient regions in ODVs, establishing its robustness and superior performance.


Poster #188
Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

Zhiyuan Yan · Yandan Zhao · Shen Chen · Mingyi Guo · Xinghe Fu · Taiping Yao · Shouhong Ding · Yunsheng Wu · Li Yuan

Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we tackle efficiency without compromising accuracy? This paper attempts to tackle these three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective for video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we propose novel Video-level Blending data (VB), implemented by blending an original frame and its warped version frame by frame, serving as a hard negative sample to mine more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) to equip a pre-trained image model with the ability to capture both spatial and temporal features jointly and efficiently. StA is designed with a two-stream 3D-Conv with varying kernel sizes, allowing it to process spatial and temporal features separately. This eliminates the need to design a new deepfake-specific video architecture from scratch. Extensive experiments validate the effectiveness of the proposed methods and show that our approach generalizes well to previously unseen forgery videos.
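A minimal sketch of the Video-level Blending idea as described above (a toy translation stands in for the warp; the paper's warping and blending schedule may differ):

import numpy as np

def warp(frame, shift=2):
    # toy warp: a small horizontal translation of the frame
    return np.roll(frame, shift, axis=1)

def video_level_blend(frames, alpha=0.5):
    # blend each frame with its warped copy to imitate facial feature drift over time
    return [alpha * f + (1.0 - alpha) * warp(f) for f in frames]

clip = [np.random.rand(224, 224, 3) for _ in range(8)]
hard_negative = video_level_blend(clip)   # used as a fake sample during training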


Poster #189
OSDFace: One-Step Diffusion Model for Face Restoration

Jingkai Wang · Jue Gong · Lin Zhang · Zheng Chen · Xing Liu · Hong Gu · Yutong Liu · Yulun Zhang · Xiaokang Yang

Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject’s identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency. The code and model will be released soon.


Poster #190
MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

Mengqiu XU · Kaixin Chen · Heng Guo · Yixiang Huang · Ming Wu · Zhenwei Shi · Chuang Zhang · Jun Guo

Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are available in the supplementary materials.


Poster #191
Feature Spectrum Learning for Remote Sensing Change Detection

Qi Zang · Dong Zhao · Shuang Wang · Dou Quan · Licheng Jiao · Zhun Zhong

Change detection (CD) holds significant implications for Earth observation, in which pseudo-changes between bitemporal images induced by imaging environmental factors are a key challenge. Existing methods mainly regard pseudo-changes as a kind of style shift and alleviate them by transforming bitemporal images into the same style using generative adversarial networks (GANs). Nevertheless, their efforts are limited by the complexity of optimizing GANs and the absence of guidance from physical properties. This paper finds that spectrum transformation (ST) has the potential to mitigate pseudo-changes by performing alignment in the frequency domain, which carries the style. However, the benefit of ST is largely constrained by two drawbacks: 1) a limited transformation space and 2) inefficient parameter search. To address these limitations, we propose Feature Spectrum learning (FeaSpect), which adaptively eliminates pseudo-changes in the latent space. For drawback 1), FeaSpect directs the transformation towards style-aligned discriminative features via a feature spectrum transformation (FST). For drawback 2), FeaSpect makes FST trainable, efficiently discovering optimal parameters via an extraction box with adaptive attention and an extraction box with learnable strides. Extensive experiments on challenging datasets demonstrate that our method remarkably outperforms existing methods and achieves a commendable trade-off between accuracy and efficiency. Importantly, our method can be easily injected into other frameworks, achieving consistent improvements.
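A minimal sketch of a trainable feature spectrum transformation (assumed tensor layout; the paper's extraction boxes and attention are omitted): the amplitudes of the feature spectrum are rescaled by learnable gains while the spatial size is preserved.

import torch
import torch.nn as nn

class FeatureSpectrumTransform(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        # one learnable gain per channel and frequency bin (rfft keeps width // 2 + 1 bins)
        self.gain = nn.Parameter(torch.ones(channels, height, width // 2 + 1))

    def forward(self, x):                               # x: (B, C, H, W) feature maps
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * self.gain                         # style-related amplitude rescaling
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

fst = FeatureSpectrumTransform(channels=64, height=32, width=32)
y = fst(torch.randn(2, 64, 32, 32))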


Poster #192
Dual-Granularity Semantic Guided Sparse Routing Diffusion Model for General Pansharpening

Yinghui Xing · Qu Li Tao · Shizhou Zhang · Di Xu · YingkunYang · Yanning Zhang

Pansharpening aims at integrating complementary information from panchromatic and multispectral images. Available deep-learning-based pansharpening methods typically perform exceptionally well on particular satellite datasets. At the same time, these models also exhibit scene dependence; for example, if the majority of the training samples come from urban scenes, the model's performance may decline in river scenes. To address the domain gap produced by varying satellite sensors and distinct scenes, we propose a dual-granularity semantic guided sparse routing diffusion model for general pansharpening. By utilizing a large Vision Language Model (VLM) from the field of geoscience, i.e., GeoChat, we introduce dual-granularity semantics to generate dynamic sparse routing scores for adaptation to different satellite sensors and scenes. These scene-level and region-level dual-granularity semantics serve as guidance for dynamically activating specialized experts within the diffusion model. Extensive experiments on the WorldView-3, QuickBird, and GaoFen-2 datasets show the effectiveness of our proposed method. Notably, the proposed method outperforms the comparison approaches in adapting to new satellite sensors and scenes. The code will be available.


Poster #193
Hyperspectral Pansharpening via Diffusion Models with Iteratively Zero-Shot Guidance

Jin-Liang Xiao · Ting-Zhu Huang · Liang-Jian Deng · Guang Lin · Zihan Cao · Chao Li · Qibin Zhao

Hyperspectral pansharpening refers to fusing a panchromatic image (PAN) and a low-resolution hyperspectral image (LR-HSI) to obtain a high-resolution hyperspectral image (HR-HSI). Recently, guiding pre-trained diffusion models (DMs) has demonstrated significant potential in this area, leveraging their powerful representational abilities while avoiding complex training processes. However, these DMs are often trained on RGB images, making them ill-suited for pansharpening tasks and limited in adapting to hyperspectral images. In this work, we propose a novel guided diffusion scheme with zero-shot guidance and neural spatial-spectral decomposition (NSSD) that iteratively generates an RGB detail image and maps it to the target HR-HSI. Specifically, the zero-shot guidance employs an auxiliary neural network, trained only with a PAN and an LR-HSI, to guide pre-trained DMs in generating the RGB detail image, informed by specific prior knowledge. Then, NSSD establishes a spectral mapping from the generated RGB detail image to the final HR-HSI. Extensive experiments conducted on the Pavia, Washington DC, and Chikusei datasets demonstrate that the proposed method significantly enhances the performance of DMs for hyperspectral pansharpening, outperforming existing methods across multiple metrics and achieving improvements in visualization results.


Poster #194
Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising

Yuchen Wang · Hongyuan Wang · Lizhi Wang · Xin Wang · Lin Zhu · Wanxuan Lu · Hua Huang

Existing single-image denoising algorithms often struggle to restore details when dealing with complex noisy images. The introduction of near-infrared (NIR) images offers new possibilities for RGB image denoising. However, due to the inconsistency between NIR and RGB images, existing works still struggle to balance the contributions of the two fields in the process of image fusion. In response, we develop a cross-field Frequency Correlation Exploiting Network (FCENet) for NIR-assisted image denoising. We first propose the frequency correlation prior based on an in-depth statistical frequency analysis of NIR-RGB image pairs. The prior reveals the complementary correlation of NIR and RGB images in the frequency domain. Leveraging the frequency correlation prior, we then establish a frequency learning framework composed of a Frequency Dynamic Selection Mechanism (FDSM) and a Frequency Exhaustive Fusion Mechanism (FEFM). FDSM dynamically selects complementary information from NIR and RGB images in the frequency domain, and FEFM strengthens the control of common and differential features during the fusion of NIR and RGB features. Extensive experiments on simulated and real data validate that our method outperforms various state-of-the-art (SOTA) methods in terms of image quality and computational efficiency. The code is available at https://github.com/11679-hub/11679.
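A minimal sketch of a frequency-domain selection between the two fields (assumed shapes; a single gating convolution stands in for the paper's FDSM/FEFM modules):

import torch
import torch.nn as nn

class FrequencySelect(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, nir_feat):              # (B, C, H, W) each
        rgb_spec = torch.fft.rfft2(rgb_feat, norm="ortho")
        nir_spec = torch.fft.rfft2(nir_feat, norm="ortho")
        gate = torch.sigmoid(self.score(torch.cat([rgb_spec.abs(), nir_spec.abs()], dim=1)))
        fused = gate * rgb_spec + (1.0 - gate) * nir_spec   # take each band from the better field
        return torch.fft.irfft2(fused, s=rgb_feat.shape[-2:], norm="ortho")

fuse = FrequencySelect(channels=32)
out = fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))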


Poster #195

Currently, the demand for higher video quality has grown significantly. However, satellite video has low resolution, complex motion, and weak textures. Haze interference further exacerbates the loss of motion information and texture details, hindering effective spatiotemporal feature fusion and fine-grained feature mining. This presents significant challenges for subsequent super-resolution (SR) reconstruction, especially at continuous scales. To address these problems, this paper models the double-degradation process of hazy low-quality satellite videos and proposes a novel network that learns the optimal joint degradation pattern (ODPNet) for continuous-scale SR of hazy satellite videos. First, we design a prior-based feature soft dehazing module to eliminate haze interference at the feature level. Second, we develop a spatiotemporal self-attention (SSA) to capture long-range feature dependencies, thereby achieving effective spatiotemporal feature fusion. Third, we devise a tri-branch cross-aggregation block (TCB) to enhance feature representations of weak textures in satellite videos by effectively aggregating contextual information. Finally, we propose a cross-scale feature Top-k selection Transformer (CFTST), which adaptively selects and aggregates cross-scale latent codes to learn feature representations of satellite videos at arbitrary resolutions, thus enabling continuous-scale SR. Experiments show that ODPNet outperforms existing methods and achieves a better balance between model parameters and performance.


Poster #196
Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing

Jiayi Fu · Siyu Liu · Zikun Liu · Chun-Le Guo · Hyunhee Park · Rui-Qi Wu · Guoqing Wang · Chongyi Li

We propose a novel real-world image dehazing method, abbreviated as IPC-Dehaze, that leverages the high-quality codebook prior encapsulated in a pre-trained VQGAN. Unlike previous codebook-based methods that rely on one-shot decoding, our method utilizes the high-quality codes obtained in the previous iteration to guide the prediction of the Code-Predictor in the subsequent iteration, improving code prediction accuracy and ensuring stable dehazing performance. Our idea stems from the observations that 1) the degradation of hazy images varies with haze density and scene depth, and 2) clear regions provide crucial cues for restoring dense-haze regions. However, it is nontrivial to progressively refine the obtained codes in subsequent iterations, owing to the difficulty in determining which codes should be retained or replaced at each iteration. Another key contribution of our study is the Code-Critic, proposed to capture interrelations among codes. The Code-Critic is used to evaluate code correlations and then resample a set of codes with the highest mask scores, i.e., a higher score indicates that the code is more likely to be rejected, which helps retain more accurate codes and predict difficult ones. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods in real-world dehazing. Our code will be made publicly available.
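A minimal sketch of the iterative predict-and-criticize loop (predictor and critic are placeholder callables; the real modules operate on VQGAN code grids, and the keep schedule may differ):

import torch

def iterative_decode(features, predictor, critic, num_iters=4, keep_ratio=0.5):
    n = features.shape[0]
    codes = None
    fixed = torch.zeros(n, dtype=torch.bool)             # codes already accepted
    for _ in range(num_iters):
        codes = predictor(features, codes, fixed)        # (n,) predicted code indices
        scores = critic(features, codes)                 # higher score = more likely wrong
        k = max(1, int(keep_ratio * n))
        keep = torch.topk(-scores, k).indices            # retain the most trustworthy codes
        fixed[keep] = True                               # the rest are re-predicted next round
    return codes

# toy usage with stand-in callables
feats = torch.randn(16, 64)
pred = lambda f, c, m: torch.randint(0, 512, (f.shape[0],))
crit = lambda f, c: torch.rand(f.shape[0])
codes = iterative_decode(feats, pred, crit)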


Poster #197
Efficient Visual State Space Model for Image Deblurring

Lingshun Kong · Jiangxin Dong · Jinhui Tang · Ming-Hsuan Yang · Jinshan Pan

Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. While ViTs generally outperform CNNs by effectively capturing long-range dependencies and input-specific characteristics, their computational complexity increases quadratically with image resolution. This limitation hampers their practical application in high-resolution image restoration. In this paper, we propose a simple yet effective visual state space model (EVSSM) for image deblurring, leveraging the benefits of state space models (SSMs) for visual data. In contrast to existing methods that employ several fixed-direction scans for feature extraction, which significantly increases the computational cost, we develop an efficient visual scan block that applies various geometric transformations before each SSM-based module, capturing useful non-local information while maintaining high efficiency. In addition, to more effectively capture and represent local information, we propose an efficient discriminative frequency domain-based feedforward network (EDFFN), which can effectively estimate useful frequency information for latent clear image restoration. Extensive experimental results show that the proposed EVSSM performs favorably against state-of-the-art methods on benchmark datasets and real-world images.
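As a rough illustration of scanning under different geometric transforms instead of several fixed-direction scans, here is a hedged sketch; the `single_direction_scan` stub (a cumulative sum) merely stands in for a real SSM block, and the particular flips/transpose are assumptions, not the paper's exact choices.

```python
# Hedged sketch: one cheap geometric transform per block changes the effective
# scan order without multiplying the number of scans.
import torch

def single_direction_scan(x):
    # x: (B, C, H, W) -> flatten row-major and run a causal 1-D operator.
    B, C, H, W = x.shape
    seq = x.flatten(2)                 # (B, C, H*W)
    out = torch.cumsum(seq, dim=-1)    # placeholder for a real SSM scan
    return out.view(B, C, H, W)

transforms = [
    lambda x: x,                               # identity
    lambda x: torch.flip(x, dims=[-1]),        # horizontal flip
    lambda x: torch.flip(x, dims=[-2]),        # vertical flip
    lambda x: x.transpose(-1, -2),             # transpose (swap H and W)
]
inverses = [
    lambda x: x,
    lambda x: torch.flip(x, dims=[-1]),
    lambda x: torch.flip(x, dims=[-2]),
    lambda x: x.transpose(-1, -2),
]

x = torch.randn(1, 8, 32, 32)
for t, t_inv in zip(transforms, inverses):
    x = t_inv(single_direction_scan(t(x)))     # scan a flipped/transposed view
print(x.shape)  # torch.Size([1, 8, 32, 32])
```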


Poster #198
Rotation-Equivariant Self-Supervised Method in Image Denoising

Hanze Liu · Jiahong Fu · Qi Xie · Deyu Meng

Self-supervised image denoising methods have garnered significant research attention in recent years, as they reduce the requirement for large training datasets. Compared to supervised methods, self-supervised methods rely more on the prior embedded in the deep networks themselves. As a result, most self-supervised methods are designed with Convolutional Neural Network (CNN) architectures, which capture one of the most important image priors, the translation equivariance prior. Inspired by the great success achieved by the introduction of translational equivariance, in this paper we explore ways to further incorporate another important image prior. Specifically, we first apply high-accuracy rotation-equivariant convolution to self-supervised image denoising. Through rigorous theoretical analysis, we prove that simply replacing all the convolution layers with rotation-equivariant convolution layers modifies the network into its rotation-equivariant version. To the best of our knowledge, this is the first time that the rotation-equivariant image prior is introduced to self-supervised image denoising at the network architecture level with a comprehensive theoretical analysis of equivariance errors, which offers a new perspective to the field of self-supervised image denoising. Moreover, to further improve performance, we design a new mask mechanism to fuse the outputs of the rotation-equivariant network and a vanilla CNN-based network, and construct an adaptive rotation-equivariant framework. Through extensive experiments on three typical methods, we demonstrate the effectiveness of the proposed method. Our code will be released later.
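For intuition only, here is a hedged sketch of a rotation-equivariant layer built by averaging a plain convolution over the four 90-degree rotations; this is a much simpler construction than the paper's high-accuracy rotation-equivariant convolution, and the class name is hypothetical.

```python
# Hedged sketch: group-averaging a plain conv over 90-degree rotations yields a
# layer where rotating the input rotates the output by the same amount.
import torch
import torch.nn as nn

class Rot90EquivariantConv(nn.Module):
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        outs = []
        for k in range(4):
            rotated = torch.rot90(x, k, dims=(-2, -1))
            outs.append(torch.rot90(self.conv(rotated), -k, dims=(-2, -1)))
        return torch.stack(outs).mean(dim=0)

layer = Rot90EquivariantConv()
x = torch.randn(1, 3, 32, 32)
a = layer(torch.rot90(x, 1, dims=(-2, -1)))
b = torch.rot90(layer(x), 1, dims=(-2, -1))
print(torch.allclose(a, b, atol=1e-5))  # True: equivariant to 90-degree rotation
```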

Structured artifacts are semi-regular, repetitive patterns that closely intertwine with genuine image content, making their removal highly challenging. In this paper, we introduce the Scale-Adaptive Deformable Transformer, a network architecture specifically designed to eliminate such artifacts from images. The proposed network features two key components: a scale-enhanced deformable convolution module for modeling local patterns with varying sizes, orientations, and distortions, and a scale-adaptive deformable attention mechanism for capturing long-range relationships among repetitive patterns with different sizes and non-uniform spatial distributions. Extensive experiments show that our network consistently outperforms state-of-the-art methods in several structured artifact removal tasks, including image deraining, image demoir\'eing, and image debanding.

Most full-reference image quality assessment (FR-IQA) models assume that the reference image is of perfect quality. However, this assumption is flawed because many reference images in existing IQA datasets are of subpar quality. Moreover, recent generative image enhancement methods are capable of producing images of higher quality than their original counterparts. These factors challenge the effectiveness and applicability of current FR-IQA models. To address this limitation, we build a large-scale IQA database, namely DiffIQA, which contains approximately 180,000 images generated by a diffusion-based image enhancer with adjustable hyper-parameters. Each image is annotated by human subjects as either worse, similar, or better quality compared to its reference. Building on this, we present a generalized FR-IQA model, namely $\textbf{A}$daptive $\textbf{FI}$delity-$\textbf{N}$aturalness $\textbf{E}$valuator (A-FINE), to accurately assess and adaptively combine the fidelity and naturalness of the test image. A-FINE aligns well with standard FR-IQA when the reference image is much more natural than the test image. We demonstrate by extensive experiments that A-FINE surpasses existing FR-IQA models on well-established IQA datasets and our newly created DiffIQA. To further validate A-FINE, we additionally construct a super-resolution IQA benchmark (SRIQA-Bench), encompassing test images derived from ten state-of-the-art SR methods with reliable human quality annotations. Tests on SRIQA-Bench re-affirm the advantages of A-FINE. Our data, code and models will be made publicly available.


Poster #201
Complexity Experts are Task-Discriminative Learners for Any Image Restoration

Eduard Zamfir · Zongwei Wu · Nancy Mehta · Yuedong Tan · Danda Paudel · Yulun Zhang · Radu Timofte

Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing across tasks while others struggle within their intended scope. This hinders leveraging MoEs' computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce ``complexity experts'' -- flexible expert blocks with varying computational complexity and receptive fields. A key challenge is assigning tasks to each expert, as degradation complexity is unknown in advance. Thus, we execute tasks with a simple bias toward lower complexity. To our surprise, this preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. Extensive experiments validate our approach, demonstrating the ability to bypass irrelevant experts during inference while maintaining superior performance. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability. The source code will be made available upon acceptance.
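A minimal sketch of how a router might be biased toward cheaper experts follows; the expert depths, the pooling-based router, and the static bias term are all illustrative assumptions rather than the MoCE-IR design.

```python
# Hedged sketch of "complexity experts": experts with different budgets plus a
# router whose logits favor the cheaper experts.
import torch
import torch.nn as nn

def make_expert(dim, depth):
    layers = []
    for _ in range(depth):
        layers += [nn.Conv2d(dim, dim, 3, padding=1), nn.GELU()]
    return nn.Sequential(*layers)

class ComplexityMoE(nn.Module):
    def __init__(self, dim=32, depths=(1, 2, 4)):
        super().__init__()
        self.experts = nn.ModuleList(make_expert(dim, d) for d in depths)
        self.router = nn.Linear(dim, len(depths))
        # Static bias: cheaper experts get a head start, so the router only
        # escalates to heavy experts when the degradation needs it.
        self.register_buffer("bias", -0.5 * torch.tensor(depths, dtype=torch.float))

    def forward(self, x):                       # x: (B, C, H, W)
        pooled = x.mean(dim=(2, 3))             # (B, C)
        weights = torch.softmax(self.router(pooled) + self.bias, dim=-1)
        # Soft mixture for the sketch; at inference one could keep top-1 only
        # and skip the remaining experts entirely.
        return sum(w.view(-1, 1, 1, 1) * e(x)
                   for w, e in zip(weights.unbind(dim=1), self.experts))

y = ComplexityMoE()(torch.randn(2, 32, 16, 16))
print(y.shape)  # torch.Size([2, 32, 16, 16])
```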


Poster #202
Visual-Instructed Degradation Diffusion for All-in-One Image Restoration

Haina Qin · Wenyang Luo · Zewen Chen · Yufan Liu · Bing Li · Weiming Hu · libin wang · DanDan Zheng · Yuming Li

Image restoration tasks, such as deblurring, denoising, and dehazing, typically require separate models for each degradation type, limiting their generalization in real-world scenarios where mixed or unknown degradations may occur. In this work, we propose \textbf{Defusion}, a novel all-in-one image restoration framework that utilizes visual instruction-guided degradation diffusion. Unlike existing methods that rely on task-specific models or ambiguous text-based priors, Defusion constructs explicit \textbf{visual instructions} that align with the visual degradation patterns. These instructions are grounded by applying degradations to standardized visual elements, capturing intrinsic degradation features while agnostic to image semantics. Defusion then uses these visual instructions to guide a diffusion-based model that operates directly in the degradation space, where it reconstructs high-quality images by denoising the degradation effects with enhanced stability and generalizability. Comprehensive experiments demonstrate that Defusion outperforms state-of-the-art methods across diverse image restoration tasks, including complex and real-world degradations.


Poster #203
PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution

Zhu Li Bo · Jianze Li · Haotong Qin · Wenbo Li · Yulun Zhang · Yong Guo · Xiaokang Yang

Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. However, even when the number of denoising steps is reduced to one, they still incur high computational and storage costs, making deployment on hardware devices difficult. To address these issues, we propose PassionSR, a novel post-training quantization approach with adaptive scale for one-step diffusion (OSD) image SR. First, we simplify the OSD model to two core components, a UNet and a Variational Autoencoder (VAE), by removing the CLIPEncoder. Second, we propose a Learnable Boundary Quantizer (LBQ) and a Learnable Equivalent Transformation (LET) to optimize the quantization process and manipulate activation distributions for better quantization. Finally, we design a Distributed Quantization Calibration (DQC) strategy that stabilizes the training of quantized parameters for rapid convergence. Comprehensive experiments demonstrate that PassionSR with 8-bit and 6-bit quantization obtains visual results comparable to the full-precision model. Moreover, PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR. Our code will be released.
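For illustration, here is a hedged sketch of a quantizer with a learnable clipping boundary and a straight-through estimator; the class name and exact formulation are assumptions and may differ from the paper's LBQ (the LET and DQC components are not shown).

```python
# Hedged sketch of a learnable-boundary uniform quantizer with a
# straight-through estimator.
import torch
import torch.nn as nn

class LearnableBoundaryQuantizer(nn.Module):
    """Uniform quantizer whose clipping boundary is a trainable parameter."""
    def __init__(self, n_bits=8, init_bound=4.0):
        super().__init__()
        self.levels = 2 ** n_bits - 1
        self.bound = nn.Parameter(torch.tensor(init_bound))

    def forward(self, x):
        b = self.bound.abs() + 1e-8
        scale = (2 * b) / self.levels
        x_clamped = torch.minimum(torch.maximum(x, -b), b)
        q = torch.round((x_clamped + b) / scale)           # integer levels
        x_deq = q * scale - b                               # dequantized activation
        # Straight-through estimator: rounding is treated as identity backward.
        return x_clamped + (x_deq - x_clamped).detach()

quant = LearnableBoundaryQuantizer(n_bits=6)
out = quant(3 * torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 64])
```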


Poster #204
Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning

Isma Hadji · Mehdi Noroozi · Victor Escorcia · Anestis Zaganidis · Brais Martinez · Georgios Tzimiropoulos

There has been immense progress recently in the visual quality of Stable Diffusion-based Super Resolution (SD-SR). However, deploying large diffusion models on computationally restricted devices such as mobile phones remains impractical due to the large model size and high latency. This is compounded for SR, which often operates at high resolution (e.g. 4K×3K). In this work, we introduce Edge-SD-SR, the first parameter-efficient and low-latency diffusion model for image super-resolution. Edge-SD-SR consists of ∼169M parameters, including the UNet, encoder and decoder, and has a complexity of only ∼142 GFLOPs. To maintain high visual quality on such a low compute budget, we introduce a number of training strategies: (i) A novel conditioning mechanism on the low-resolution input, coined bidirectional conditioning, which tailors the SD model for the SR task. (ii) Joint training of the UNet and encoder, while decoupling the encodings of the HR and LR images and using a dedicated schedule. (iii) Finetuning the decoder using the UNet’s output to directly tailor the decoder to the latents obtained at inference time. Edge-SD-SR runs efficiently on device, e.g. it can upscale a 128×128 patch to 512×512 in 38 msec on a Samsung S24 DSP, and a 512×512 image to 2,048×2,048 (requiring 25 model evaluations) in just ∼1.1 sec. Furthermore, we show that Edge-SD-SR matches or even outperforms state-of-the-art SR approaches on the most established SR benchmarks.


Poster #205
HUNet: Homotopy Unfolding Network for Image Compressive Sensing

Feiyang Shen · Hongping Gan

Deep Unfolding Networks (DUNs) have risen to prominence due to their interpretability and superior performance for image Compressive Sensing (CS). However, existing DUNs still face significant issues, such as the insufficient representation capability of single-scale image information during the iterative reconstruction phase and the loss of feature information, which fundamentally limit further enhancement of image CS performance. In this paper, we propose the Homotopy Unfolding Network (HUNet) for image CS, which enables phase-by-phase reconstruction of images along a homotopy path. Specifically, each iteration step of the traditional homotopy algorithm is mapped to a Multi-scale Homotopy Iterative Module (MHIM), which includes U-shaped stacked Window-based Transformer Blocks capable of efficient feature extraction. Within the MHIM, we design a Deep Homotopy Continuation Strategy to ensure the interpretability of the homotopy algorithm and facilitate feature learning. Additionally, we introduce a Dual-path Feature Fusion Module to mitigate the loss of high-dimensional feature information during transmission between iterative phases, thereby maximizing the preservation of details in the reconstructed image. Extensive experiments indicate that HUNet achieves superior image reconstruction results compared to existing state-of-the-art methods.


Poster #206
Dual Prompting Image Restoration with Diffusion Transformers

Dehong Kong · Fan Li · Zhixin Wang · Jiaqi Xu · Renjing Pei · Wenbo Li · Wenqi Ren

Recent state-of-the-art image restoration methods mostly adopt latent diffusion models with U-Net backbones, yet they still face challenges in achieving high-quality restoration due to their limited capabilities. Diffusion transformers (DiTs), like SD3, are emerging as a promising alternative because of their better quality and scalability. However, previous conditional control methods for U-Net-based diffusion models, such as ControlNet, are not well-suited for DiTs. In this paper, we introduce DPIR (Dual Prompting Image Restoration), a novel DiT-based image restoration method that effectively extracts conditional information from low-quality images from multiple perspectives. Specifically, DPIR consists of two branches: a low-quality image prior conditioning branch and a dual prompting control branch, which inject conditional information into the DiT with high training efficiency. More importantly, we believe that in image restoration, an image's textual description alone cannot fully capture its rich visual characteristics. Therefore, a dual prompting module is designed to provide the DiT with additional visual cues, capturing both global context and local appearance. The extracted global-local visual prompts, used as extra conditional control together with text prompts, greatly enhance the quality and fidelity of the restoration. Extensive experimental results demonstrate that DPIR delivers superior image restoration performance with broad applicability.


Poster #207
Frequency-Biased Synergistic Design for Image Compression and Compensation

Jiaming Liu · Qi Zheng · Zihao Liu · Yilian Zhong · Peiye Liu · Tao Liu · Shusong Xu · Yanheng Lu · Sicheng Li · Dimin Niu · Yibo Fan

Compression artifacts removal (CAR), an effective post-processing method for reducing compression distortion from edge-side codecs, has demonstrated remarkable results by utilizing convolutional neural networks (CNNs) on the high-computational-power cloud side. Traditional image compression reduces redundancy in the frequency domain, and we observe that CNNs also exhibit a bias in the frequency domain when handling compression distortions. However, no prior research leverages this frequency bias to design compression methods tailored to CAR CNNs, or vice versa. In this paper, we present a synergistic design that bridges the gap between image compression and learnable compensation for CAR. Our investigation reveals that different compensation networks have varying effects on low and high frequencies. Building upon these insights, we propose a pioneering redesign of the quantization process, a fundamental component in lossy image compression, to more effectively compress low-frequency information. Additionally, we devise a novel compensation framework that applies different neural networks to reconstruct different frequencies, incorporating a basis attention block to prioritize intentionally dropped low-frequency information, thereby enhancing the overall compensation. We instantiate two compensation networks based on this synergistic design and conduct extensive experiments on three image compression standards, demonstrating that our approach significantly reduces bitrate consumption while delivering high perceptual quality.


Poster #208
FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error

Beilin Chu · Xuan Xu · Xin Wang · Yufei Zhang · Weike You · Linna Zhou

The rapid advancement of diffusion models has significantly improved high-quality image generation, making generated content increasingly challenging to distinguish from real images and raising concerns about potential misuse. In this paper, we observe that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting that this limitation can serve as a cue for detecting diffusion-generated images. Motivated by this observation, we propose a novel method called $\textbf{F}$requency-gu$\textbf{I}$ded $\textbf{R}$econstruction $\textbf{E}$rror (FIRE), which, to the best of our knowledge, is the first to investigate the influence of frequency decomposition on reconstruction error. FIRE assesses the variation in reconstruction error before and after frequency decomposition, offering a robust method for identifying diffusion-generated images. Extensive experiments show that FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.
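A hedged sketch of the frequency-guided reconstruction-error idea follows: compare reconstruction error before and after removing a mid-frequency band. The `reconstruct` stub stands in for a diffusion-model reconstruction, and the band limits are illustrative assumptions.

```python
# Hedged sketch of a frequency-guided reconstruction-error detection cue.
import torch
import torch.fft as fft

def remove_mid_band(img, lo=0.1, hi=0.4):
    """Zero out frequencies whose normalized radius falls in [lo, hi)."""
    _, _, H, W = img.shape
    fy = fft.fftfreq(H).view(H, 1)
    fx = fft.fftfreq(W).view(1, W)
    radius = torch.sqrt(fy ** 2 + fx ** 2)
    keep = ~((radius >= lo) & (radius < hi))
    spec = fft.fft2(img) * keep.to(torch.complex64)
    return fft.ifft2(spec).real

def reconstruct(img):
    # Placeholder for a diffusion-model reconstruction of the input image.
    return img + 0.05 * torch.randn_like(img)

def fire_feature(img):
    err_full = (reconstruct(img) - img).abs().mean()
    filtered = remove_mid_band(img)
    err_filtered = (reconstruct(filtered) - filtered).abs().mean()
    # The change in reconstruction error is used as the detection cue.
    return (err_filtered - err_full).item()

print(fire_feature(torch.rand(1, 3, 64, 64)))
```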


Poster #209
Robust Message Embedding via Attention Flow-Based Steganography

Huayuan Ye · Shenzhuo Zhang · Shiqi Jiang · Jing Liao · Shuhang Gu · Dejun Zheng · Changbo Wang · Chenhui Li

Image steganography can hide information in a host image and obtain a stego image that is perceptually indistinguishable from the original one. This technique has tremendous potential in scenarios like copyright protection, information retrospection, etc. Some previous studies have proposed to enhance the robustness of such methods against image disturbances to increase their applicability. However, they generally cannot achieve a satisfying balance between steganography quality and robustness. Instead of image-in-image steganography, we focus on the issue of message-in-image embedding that is robust to various real-world image distortions. This task aims to embed information into a natural image, and the decoding result is required to be completely accurate, which increases the difficulty of data concealing and revealing. Inspired by recent developments in transformer-based vision models, we discover that the tokenized representation of an image is naturally suitable for the steganography task. In this paper, we propose a novel message embedding framework, called Robust Message Steganography (RMSteg), which is competent to hide a message, via a QR code, in a host image based on a normalizing flow model. The stego image derived by our method has imperceptible changes, and the encoded message can be accurately restored even if the image is printed out and photographed. To the best of our knowledge, this is the first work that integrates the advantages of transformer models into normalizing flows. Our experimental results show that RMSteg has great potential for robust and high-quality message embedding.


Poster #210
Learned Image Compression with Dictionary-based Entropy Model

Jingbo Lu · Leheng Zhang · Xingyu Zhou · Mu Li · Wen Li · Shuhang Gu

Learned image compression methods have attracted great research interest and exhibit superior rate-distortion performance compared to the best classical image compression standards available today. The entropy model plays a key role in learned image compression, estimating the probability distribution of the latent representation for subsequent entropy coding. Most existing methods employ hyper-prior and auto-regressive architectures to form their entropy models. However, they only aim to explore the internal dependencies of the latent representation while neglecting the importance of extracting priors from the training data. In this work, we propose a novel entropy model named the Dictionary-based Cross Attention Entropy model, which introduces a learnable dictionary that summarizes the typical structures occurring in the training dataset to enhance the entropy model. Extensive experimental results demonstrate that the proposed model strikes a better balance between performance and latency, achieving state-of-the-art results on various benchmark datasets.
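The cross-attention-to-a-dictionary idea can be sketched as below; the dimensions, the Gaussian parameterization, and the single attention layer are illustrative assumptions, not the paper's exact entropy model.

```python
# Hedged sketch of a dictionary-based entropy model: latent tokens cross-attend
# to a small learnable dictionary and predict Gaussian parameters for coding.
import torch
import torch.nn as nn

class DictionaryEntropyModel(nn.Module):
    def __init__(self, latent_dim=192, dict_size=64, heads=4):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(dict_size, latent_dim))
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.to_params = nn.Linear(latent_dim, 2 * latent_dim)  # mean and scale

    def forward(self, y):                     # y: (B, N, latent_dim) latent tokens
        d = self.dictionary.unsqueeze(0).expand(y.size(0), -1, -1)
        ctx, _ = self.attn(query=y, key=d, value=d)
        mean, log_scale = self.to_params(ctx).chunk(2, dim=-1)
        scale = torch.exp(log_scale).clamp(min=1e-6)
        return mean, scale                    # parameters for entropy coding

model = DictionaryEntropyModel()
mean, scale = model(torch.randn(2, 16 * 16, 192))
print(mean.shape, scale.shape)
```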


Poster #211
D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation

Weinan Jia · Mengqi Huang · Nan Chen · Lei Zhang · Zhendong Mao

Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing their varying importance, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) The Dynamic VAE (DVAE) at the first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) The Dynamic Diffusion Transformer (D$^2$iT) at the second stage generates images by predicting multi-grained noise, consisting of coarse-grained noise (fewer latent codes in smooth regions) and fine-grained noise (more latent codes in detailed regions), through an innovative combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough noise prediction with fine-grained region correction achieves a unification of global consistency and local realism. We conduct comprehensive experiments on the ImageNet 256$\times$256 benchmark, showing that D$^2$iT achieves a 23.8\% quality improvement over DiT (D$^2$iT's FID of 1.73 vs. DiT's 2.27, lower is better) while using only 57.1\% of the computational resources of DiT.


Poster #212
Classifier-Free Guidance Inside the Attraction Basin May Cause Memorization

Anubhav Jain · Yuya Kobayashi · Takashi Shibuya · Yuhta Takida · Nasir Memon · Julian Togelius · Yuki Mitsufuji

Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to understand the memorization phenomenon, and propose a simple yet effective approach to mitigate memorization. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin by not applying classifier-free guidance until an ideal transition point occurs. This leads to the generation of non-memorized images that are high in image quality and well aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, opposite guidance, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.
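A hedged sketch of deferring classifier-free guidance until a transition point (with an optional "opposite guidance" phase beforehand) is given below; `eps_model` and `denoise_step` are toy stand-ins for a real noise predictor and sampler, and the transition step is an assumed hyper-parameter.

```python
# Hedged sketch: no (or opposite) guidance early in denoising to leave the
# attraction basin, then standard CFG afterwards.
import torch

def eps_model(x, t, cond):                      # stub noise predictor
    return 0.1 * x if cond is None else 0.1 * x + 0.01 * t

def denoise_step(x, eps, t):                    # stub update rule
    return x - 0.01 * eps

def sample(x, steps=50, transition=30, cfg_scale=7.5, opposite=False):
    for i, t in enumerate(reversed(range(steps))):
        eps_c = eps_model(x, t, cond="prompt")
        eps_u = eps_model(x, t, cond=None)
        if i < transition:
            # Before the transition point, unconditional (or opposite) guidance
            # steers the trajectory away from memorized images.
            scale = -cfg_scale if opposite else 0.0
        else:
            scale = cfg_scale                   # standard CFG afterwards
        eps = eps_u + scale * (eps_c - eps_u)
        x = denoise_step(x, eps, t)
    return x

print(sample(torch.randn(1, 4, 32, 32)).shape)
```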


Poster #213
Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability

Lei Wang · Senmao Li · Fei Yang · Jianye Wang · Ziheng Zhang · Yuhan Liu · Yaxing Wang · Jian Yang

Diffusion models, in their early stages, focus on constructing basic image structures, while refined details, including local features and textures, are generated in later stages. Thus the same network layers are forced to learn both structural and textural information simultaneously, which differs significantly from traditional deep learning architectures (e.g., ResNet or GANs) that capture or generate image semantic information at different layers. This difference inspires us to explore time-wise diffusion models. We initially investigate the key contributions of the U-Net parameters to the denoising process and identify that properly zeroing out certain parameters (including large parameters) contributes to denoising, substantially improving generation quality on the fly. Capitalizing on this discovery, we propose a simple yet effective method—termed “MaskUNet”—that enhances generation quality with a negligible number of additional parameters. Our method fully leverages timestep- and sample-dependent effective U-Net parameters. To optimize MaskUNet, we offer two fine-tuning strategies: a training-based approach and a training-free approach, including tailored networks and optimization functions. In zero-shot inference on the COCO dataset, MaskUNet achieves the best FID score and further demonstrates its effectiveness in downstream task evaluations.
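A minimal sketch of timestep-dependent parameter masking on a single linear layer follows; the bucketed random masks stand in for the learned, timestep- and sample-dependent masks of MaskUNet, and the class name is hypothetical.

```python
# Hedged sketch: select a binary weight mask per timestep bucket before the
# forward pass (a single linear layer stands in for a U-Net here).
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, dim=64, num_buckets=4, keep_ratio=0.9):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        # One fixed binary mask per timestep bucket (learned in the real method).
        masks = (torch.rand(num_buckets, dim, dim) < keep_ratio).float()
        self.register_buffer("masks", masks)
        self.num_buckets = num_buckets

    def forward(self, x, t, t_max=1000):
        bucket = min(int(t / t_max * self.num_buckets), self.num_buckets - 1)
        w = self.linear.weight * self.masks[bucket]   # zero out selected parameters
        return nn.functional.linear(x, w, self.linear.bias)

layer = MaskedLinear()
print(layer(torch.randn(2, 64), t=750).shape)  # torch.Size([2, 64])
```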

Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have proven effective across various generation tasks and models, achieving accelerations between 25\% and 50\% while maintaining generation quality.
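To illustrate the cache-and-reuse pattern, here is a hedged toy sketch; the block type (plain linear layers), the "structure-focused" block index threshold, and the every-other-step reuse schedule are assumptions, not BlockDance's actual strategy.

```python
# Hedged sketch of caching block features and reusing them on nearby steps.
import torch
import torch.nn as nn

class ToyDiT(nn.Module):
    def __init__(self, dim=64, depth=8, reuse_from=4):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.reuse_from = reuse_from    # deeper blocks assumed structure-focused
        self.cache = {}

    def forward(self, x, reuse=False):
        for i, blk in enumerate(self.blocks):
            if reuse and i >= self.reuse_from and i in self.cache:
                x = self.cache[i]        # skip computation, reuse cached feature
            else:
                x = torch.relu(blk(x))
                self.cache[i] = x
        return x

model = ToyDiT()
x = torch.randn(1, 64)
for step in range(10, 0, -1):
    # Reuse cached features on every other step in the later denoising stage.
    x = model(x, reuse=(step < 6 and step % 2 == 0))
print(x.shape)
```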


Poster #215
Diffusion Model is Effectively Its Own Teacher

Xinyin Ma · Runpeng Yu · Songhua Liu · Gongfan Fang · Xinchao Wang

In this paper, we introduce a novel self-distillation paradigm for improving the performance of diffusion models. Previous studies have shown that introducing a teacher to distill a diffusion model can enhance its sampling efficiency. We raise an intriguing question: can the diffusion model itself serve as its own teacher to further improve its performance? To this end, we propose a new paradigm called Self Step-Distillation (SSD). The core idea of SSD is to integrate the predictions or the intermediate activations of the diffusion model at each timestep with those of its preceding timestep through a fusion mechanism. We propose two forms, explicit SSD and implicit SSD (iSSD), to perform N-step to N-step distillation from the diffusion model itself to achieve improved image quality. We further elucidate the underlying relationship between SSD and high-order solvers. The effectiveness of SSD is validated through extensive experiments on diffusion transformers of various sizes and across different sampling steps. Our results show that this novel self-distillation paradigm can significantly enhance performance. Additionally, our method is compatible with distillation methods designed for few-step inference. Notably, with iSSD trained for less than one epoch, we obtain a 32-step DiT-XL/2 achieving an FID of 1.99, outperforming the original 250-step DiT-XL/2 with an FID of 2.26. We further validate the effectiveness of our method on text-to-image diffusion models, such as Stable Diffusion, and also observe notable improvements in image quality.


Poster #216
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Zhiwei Jia · Yuesong Nan · Huixi Zhao · Gengdai Liu

Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, offering great flexibility in model alignment. However, it is challenging to apply existing RL methods to timestep-distilled DMs for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO towards improving $\le2$-step image generation. Based on these insights, we propose to fine-tune DMs with learned differentiable surrogate rewards. Our method, named \textbf{LaSRO}, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets image generation with $\le2$ steps for reward optimization, enhancing generalizability and efficiency. We show that LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including those based on PPO or DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights behind it.


Poster #217
RaSS: Improving Denoising Diffusion Samplers with Reinforced Active Sampling Scheduler

Xin Ding · Lei Yu · Xin Li · Zhijun Tu · Hanting Chen · Jie Hu · Zhibo Chen

Recent years have witnessed the great success of denoising diffusion samplers in improving the generative capability and sampling efficiency given a pre-trained diffusion model. However, most sampling schedulers in diffusion models lack the sampling dynamics and planning capability for future generation results, leading to suboptimal solutions. To overcome this, we propose the Reinforced Active Sampling Scheduler, termed RaSS, intending to find the optimal sampling trajectory by actively planning and adjusting the sampling steps for each sampling process in time. Concretely, RaSS divides the whole sampling process into five stages and introduces a reinforcement learning (RL) agent to continuously monitor the generated instance and perceive the potential generation results, thereby achieving optimal instance- and state-adaptive sampling steps decision. Meanwhile, a sampling reward is designed to assist the planning capability of the RL agent by balancing the sampling efficiency and generation quality. The RaSS is a plug-and-play module, which is applicable to multiple denoising diffusion samplers of diffusion models. Extensive experiments on different benchmarks have shown that our RaSS can consistently improve the generation quality and efficiency across various tasks, without introducing significant computational overhead.


Poster #218
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

Kai Wang · Mingjia Shi · YuKun Zhou · Zekai Li · Xiaojiang Peng · Zhihang Yuan · Yuzhang Shang · Hanwang Zhang · Yang You

Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called SpeeD, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.
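A small sketch of the asymmetric sampling plus re-weighting idea follows; the boundary of the convergence area, the suppression factor, and the loss weights are illustrative assumptions.

```python
# Hedged sketch: draw fewer timesteps from the convergence area and up-weight
# rapid-change steps in the training loss.
import torch

def build_timestep_sampler(T=1000, convergence_start=700, suppress=0.3):
    probs = torch.ones(T)
    probs[convergence_start:] *= suppress          # fewer draws from convergence area
    probs /= probs.sum()
    weights = torch.ones(T)
    weights[:convergence_start] *= 1.5             # emphasize rapid-change steps
    return probs, weights

def sample_timesteps(probs, batch_size):
    return torch.multinomial(probs, batch_size, replacement=True)

probs, weights = build_timestep_sampler()
t = sample_timesteps(probs, batch_size=8)
loss_per_sample = torch.rand(8)                    # stand-in for diffusion losses
loss = (weights[t] * loss_per_sample).mean()
print(t, loss.item())
```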


Poster #219
Scaling Properties of Diffusion Models For Perceptual Tasks

Rahul Ravishankar · Zeeshan Patel · Jathushan Rajasegaran · Jitendra Malik

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute.


Poster #220
Highlight
Parallelized Autoregressive Visual Generation

Yuqing Wang · Shuhuai Ren · Zhijie Lin · Yujin Han · Haoyuan Guo · Zhenheng Yang · Difan Zou · Jiashi Feng · Xihui Liu

Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. Our key insight is that the feasibility of parallel generation is closely tied to visual token dependencies - while tokens with weak dependencies can be generated in parallel, adjacent tokens with strong dependencies are hard to generate together, as independent sampling of strongly correlated tokens may lead to inconsistent decisions. Based on this observation, we develop a parallel generation strategy that generates distant tokens with weak dependencies in parallel while maintaining sequential generation for strongly dependent local tokens. Specifically, we first generate initial tokens in each region sequentially to establish the global structure, then enable parallel generation across distant regions while maintaining sequential generation within each region. Our approach can be seamlessly integrated into standard autoregressive models without modifying the architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that our method achieves a 3.6$\times$ speedup with comparable quality and up to 9.5$\times$ speedup with minimal quality degradation across both image and video generation tasks. We hope this work will inspire future research in efficient visual generation and unified autoregressive modeling.
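The generation order can be sketched as follows (pure Python, no model): sequential initial tokens per region, then one token from each region in parallel at every later step. The grid size and region size are illustrative assumptions.

```python
# Hedged sketch of the parallel generation order on a token grid.
def parallel_generation_order(h=8, w=8, region=4):
    regions = [(r, c) for r in range(0, h, region) for c in range(0, w, region)]
    order = []
    # Stage 1: sequential initial tokens, one region at a time, to fix structure.
    for (r, c) in regions:
        order.append([(r, c)])                        # one token per step
    # Stage 2: parallel steps, one token from each distant region per step.
    offsets = [(dr, dc) for dr in range(region) for dc in range(region)][1:]
    for (dr, dc) in offsets:
        order.append([(r + dr, c + dc) for (r, c) in regions])
    return order

order = parallel_generation_order()
print(len(order), "steps instead of", 8 * 8)          # 19 steps instead of 64
```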


Poster #221
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Chengyue Wu · Xiaokang Chen · Zhiyu Wu · Yiyang Ma · Xingchao Liu · Zizheng Pan · Wen Liu · Zhenda Xie · Xingkai Yu · Chong Ruan · Ping Luo

We introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.


Poster #222
Highlight
Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Shenghai Yuan · Jinfa Huang · Xianyi He · Yunyang Ge · Yujun Shi · Liuhan Chen · Jiebo Luo · Li Yuan

Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human-identity consistent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V.


Poster #223
BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Weixi Feng · Chao Liu · Sifei Liu · William Yang Wang · Arash Vahdat · Weili Nie

Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives -- blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models. Extensive experimental results show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with a Large Language Model for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.


Poster #224
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

Jiazi Bu · Pengyang Ling · Pan Zhang · Tong Wu · Xiaoyi Dong · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang

The text-to-video (T2V) generation models, offering convenient visual creation, have recently garnered increasing attention. Despite their substantial potential, the generated videos may present artifacts, including structural implausibility, temporal inconsistency, and a lack of motion, often resulting in near-static video. In this work, we have identified a correlation between the disparity of temporal attention maps across different blocks and the occurrence of temporal inconsistencies. Additionally, we have observed that the energy contained within the temporal attention maps is directly related to the magnitude of motion amplitude in the generated videos. Based on these observations, we present ByTheWay, a training-free method to improve the quality of text-to-video generation without introducing additional parameters, augmenting memory or sampling time. Specifically, ByTheWay is composed of two principal components: 1) Temporal Self-Guidance improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks. 2) Fourier-based Motion Enhancement enhances the magnitude and richness of motion by amplifying the energy of the temporal attention map. Extensive experiments demonstrate that ByTheWay significantly improves the quality of text-to-video generation with negligible additional cost.
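A hedged sketch of the Fourier-based motion enhancement component is given below: amplify the non-DC frequency components of a temporal attention map and re-normalize. The gain and the choice to boost all non-DC components are assumptions.

```python
# Hedged sketch: boost non-DC frequency energy of a temporal attention map.
import torch
import torch.fft as fft

def enhance_motion(attn, gain=1.5):
    # attn: (..., frames, frames) temporal attention map
    spec = fft.fft(attn, dim=-1)
    dc = spec[..., :1]
    boosted = torch.cat([dc, gain * spec[..., 1:]], dim=-1)
    out = fft.ifft(boosted, dim=-1).real
    # Re-normalize rows so they remain a valid attention distribution.
    out = out.clamp(min=0)
    return out / out.sum(dim=-1, keepdim=True).clamp(min=1e-8)

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)   # (B, heads, F, F)
print(enhance_motion(attn).shape)
```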


Poster #225
Keyframe-Guided Creative Video Inpainting

Yuwei Guo · Ceyuan Yang · Anyi Rao · Chenlin Meng · Omer Bar-Tal · Shuangrui Ding · Maneesh Agrawala · Dahua Lin · Bo Dai

Video inpainting, which aims to fill missing regions with visually coherent content, has emerged as a crucial technique for editing and virtual tour applications. While existing approaches achieve either visual consistency or text-guided generation, they often struggle to balance between coherence and creative diversity. In this work, we introduce VideoRepainter, a two-stage framework that first allows users to inpaint a keyframe using established image-level techniques, and then propagates the corresponding change to other frames. Our approach can leverage state-of-the-art image diffusion models for keyframe manipulation, thereby easing the burden of the video-inpainting process. To this end, we integrate an image-to-video model with a symmetric condition mechanism to address ambiguity caused by direct mask downsampling. We further explore efficient strategies for mask synthesis and parameter optimization to reduce costs in data processing and model training. Evaluations demonstrate our method achieves superior results in both visual fidelity and content diversity compared to existing approaches, providing a practical solution for high-quality video editing and creation.


Poster #226
SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models

Jaerin Lee · Daniel Jung · Kanggeon Lee · Kyoung Mu Lee

We introduce SemanticDraw, a new paradigm of interactive content creation where high-quality images are generated in near real-time from multiple hand-drawn regions, each encoding a prescribed semantic meaning. To maximize the productivity of content creators and fully realize their artistic imagination, such tools require both quick interactive interfaces and fine-grained regional controls. Despite the astonishing generation quality of recent diffusion models, we find that existing approaches for regional controllability are very slow (52 seconds for a 512 x 512 image) and are not compatible with acceleration methods such as LCM, blocking their huge potential in interactive content creation. From this observation, we build our solution for interactive content creation in two steps: (1) we establish compatibility between region-based controls and acceleration techniques for diffusion models, maintaining high fidelity of multi-prompt image generation with a 10x reduction in the number of inference steps; (2) we increase the generation throughput with our new multi-prompt stream batch pipeline, enabling low-latency generation from multiple, region-based text prompts on a single RTX 2080 Ti GPU. Our proposed framework is generalizable to any existing diffusion models and acceleration schedulers, allowing sub-second (0.64 seconds) image content creation on top of well-established image diffusion models. The demo application can be found in the Supplementary Material.


Poster #227
Highlight
TKG-DM: Training-free Chroma Key Content Generation Diffusion Model

Ryugo Morita · Stanislav Frolov · Brian Bernhard Moser · Takahiro Shirakawa · Ko Watanabe · Andreas Dengel · Jinjia Zhou

Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.
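As an illustration of manipulating the initial noise, here is a hedged sketch that mean-shifts the latent noise toward a target color outside a foreground region; the channel-wise shift values, the latent layout, and the rectangular foreground box are assumptions, not the TKG-DM procedure.

```python
# Hedged sketch: shift the per-channel mean of initial latent noise in the
# background region toward a target (chroma-key) color.
import torch

def chroma_key_init_noise(shape=(1, 4, 64, 64), shift=(0.4, -0.3, 0.2, 0.0),
                          fg_box=(16, 48, 16, 48)):
    noise = torch.randn(shape)
    mask = torch.ones(shape[-2:])                 # 1 = background, 0 = foreground
    top, bottom, left, right = fg_box
    mask[top:bottom, left:right] = 0.0
    shift = torch.tensor(shift).view(1, -1, 1, 1)
    # Mean-shift only the background region; the foreground keeps standard noise.
    return noise + shift * mask

print(chroma_key_init_noise().std().item())
```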


Poster #228
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

Ziheng Ouyang · Zhen Li · Qibin Hou

Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to effectively preserve both the original subject and style simultaneously or require additional training. In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. Building on this insight, we propose K-LoRA, a simple yet effective training-free LoRA fusion approach. In each attention layer, K-LoRA compares the Top-K elements in each LoRA to be fused, determining which LoRA to select for optimal fusion. This selection mechanism ensures that the most representative features of both subject and style are retained during the fusion process, effectively balancing their contributions. Experimental results demonstrate that the proposed method effectively integrates the subject and style information learned by the original LoRAs, outperforming state-of-the-art training-based approaches in both qualitative and quantitative results.
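A minimal sketch of the Top-K comparison used to pick one LoRA per layer follows; the value of K, the magnitude-sum score, and the hard either/or selection are illustrative simplifications of the described mechanism.

```python
# Hedged sketch: per layer, keep the LoRA update whose Top-K magnitudes sum higher.
import torch

def select_lora(delta_subject, delta_style, k=32):
    score_subject = delta_subject.abs().flatten().topk(k).values.sum()
    score_style = delta_style.abs().flatten().topk(k).values.sum()
    return delta_subject if score_subject >= score_style else delta_style

# Per-layer LoRA updates delta_W = B @ A for two independently trained LoRAs.
A1, B1 = torch.randn(4, 64), torch.randn(64, 4)
A2, B2 = torch.randn(4, 64), torch.randn(64, 4)
chosen = select_lora(B1 @ A1, B2 @ A2)
print(chosen.shape)  # torch.Size([64, 64])
```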


Poster #229
Highlight
SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer

Chunnan Shang · Zhizhong Wang · Hongwei Wang · Xiangming Meng

Attention-based arbitrary style transfer methods, including CNN-based, Transformer-based, and Diffusion-based, have flourished and produced high-quality stylized images. However, they perform poorly on the content and style images with the same semantics, i.e., the style of the corresponding semantic region of the generated stylized image is inconsistent with that of the style image. We argue that the root cause lies in their failure to consider the relationship between local regions and semantic regions. To address this issue, we propose a plug-and-play semantic continuous-sparse attention, dubbed SCSA, for arbitrary semantic style transfer—each query point considers certain key points in the corresponding semantic region. Specifically, semantic continuous attention ensures each query point fully attends to all the continuous key points in the same semantic region that reflect the overall style characteristics of that region; Semantic sparse attention allows each query point to focus on the most similar sparse key point in the same semantic region that exhibits the specific stylistic texture of that region. By combining the two modules, the resulting SCSA aligns the overall style of the corresponding semantic regions while transferring the vivid textures of these regions. Qualitative and quantitative results prove that SCSA enables attention-based arbitrary style transfer methods to produce high-quality semantic stylized images. The codes of this work will be made publicly available.
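A hedged sketch of the two attention variants, restricted to same-semantic-region key points, is shown below; the scoring, the equal-weight combination, and the toy shapes are assumptions rather than the SCSA formulation.

```python
# Hedged sketch of semantic continuous-sparse attention: each query attends to
# (a) all keys in its semantic region and (b) its single most similar key there.
import torch

def scsa(q, k, v, q_labels, k_labels):
    # q: (Nq, D), k/v: (Nk, D); labels give the semantic region of each token.
    scores = q @ k.t() / q.size(-1) ** 0.5             # (Nq, Nk)
    same_region = q_labels.unsqueeze(1) == k_labels.unsqueeze(0)
    masked = scores.masked_fill(~same_region, float("-inf"))
    continuous = torch.softmax(masked, dim=-1) @ v      # attend to the whole region
    sparse = v[masked.argmax(dim=-1)]                   # most similar key only
    return 0.5 * (continuous + sparse)

q, k, v = torch.randn(8, 32), torch.randn(16, 32), torch.randn(16, 32)
q_labels = torch.randint(0, 2, (8,))
k_labels = torch.tensor([0] * 8 + [1] * 8)              # both regions present
print(scsa(q, k, v, q_labels, k_labels).shape)          # torch.Size([8, 32])
```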


Poster #230
MARBLE: Material Recomposition and Blending in CLIP-Space

Ta-Ying Cheng · Prafull Sharma · Mark Boss · Varun Jampani

Editing the materials of objects in images based on exemplar images is an active area of research in computer vision and graphics. We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using them to control pre-trained text-to-image models. We improve exemplar-based material editing by finding a block in the denoising UNet responsible for material attribution. Given two material exemplar images, we find directions in CLIP-space for blending the materials. Further, we can achieve parametric control over fine-grained material attributes such as roughness, metallic, transparency, and glow using a shallow network to predict the direction for the desired material attribute change. We perform qualitative and quantitative analysis to demonstrate the efficacy of our proposed method. We also present the ability of our method to perform multiple edits in a single forward pass and its applicability to painting.
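A hedged sketch of blending two material exemplars in CLIP space follows; `clip_image_encoder` is a small random linear stand-in for a real CLIP image encoder, and the plain linear interpolation is an assumption about the blending operation.

```python
# Hedged sketch: interpolate CLIP-space embeddings of two material exemplars
# and use the result as a conditioning embedding.
import torch
import torch.nn as nn

clip_image_encoder = nn.Linear(3 * 32 * 32, 768)   # stand-in for a CLIP encoder

def blend_materials(img_a, img_b, alpha=0.5):
    emb_a = clip_image_encoder(img_a.flatten(1))
    emb_b = clip_image_encoder(img_b.flatten(1))
    blended = (1 - alpha) * emb_a + alpha * emb_b   # CLIP-space blend
    return blended / blended.norm(dim=-1, keepdim=True)

emb = blend_materials(torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32), 0.3)
print(emb.shape)  # torch.Size([1, 768])
```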


Poster #231
MagicQuill: An Intelligent Interactive Image Editing System

Zichen Liu · Yue Yu · Hao Ouyang · Qiuyu Wang · Ka Leong Cheng · Wen Wang · Zhiheng Liu · Qifeng Chen · Yujun Shen

As a highly practical application, image editing encounters a variety of user demands and thus prioritizes excellent ease of use. In this paper, we unveil MagicQuill, an integrated image editing system designed to support users in swiftly actualizing their creativity. Our system starts with a streamlined yet functionally robust interface, enabling users to articulate their ideas (e.g., inserting elements, erasing objects, altering color, etc.) with just a few strokes. These interactions are then monitored by a multimodal large language model (MLLM) to anticipate user intentions in real time, bypassing the need for prompt entry. Finally, we apply the powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process the editing request with precise control. We will release the entire system to facilitate the community.


Poster #232
FluxSpace: Disentangled Semantic Editing in Rectified Flow Models

Yusuf Dalva · Kavana Venkatesh · Pinar Yanardag

Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, understanding their inner workings remains a significant challenge due to their ``black box'' nature. Recent research has focused on identifying a representation space that facilitates semantic manipulation of generated images, but these models generally lack a GAN-like linear latent space that allows straightforward control over image generation. In this paper, we introduce FluxSpace, a domain-agnostic image editing method leveraging a representation space with the ability to control the semantics of images generated by rectified flow transformers, such as Flux. By leveraging the representations learned by the transformer blocks within rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work both offers a scalable and effective image editing approach and significantly enhances the interpretability of rectified flow transformers.


Poster #233
FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

Jun Zhou · Jiahao Li · Zunnan Xu · Hanhui Li · Yiji Cheng · Fa-Ting Hong · Qin Lin · qinglin lu · Xiaodan Liang

Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of visual language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative \textbf{F}ine-grained \textbf{I}nstruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. We employ a VLM to precisely localize the desired editing regions within complex scenes. To enhance the fine-grained visual perception capabilities of the VLM, we introduce additional region tokens that complement the holistic image features and are integrated into the user's instructions. Relying solely on the output of the language model (LLM) to guide the diffusion model may result in suboptimal editing outcomes. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens and the time-dependent diffusion model, FireEdit demonstrates significant advantages in comprehending editing instructions and maintaining high semantic consistency. Extensive experiments indicate that our approach surpasses state-of-the-art instruction-based image editing methods.


Poster #234
Recognition-Synergistic Scene Text Editing

Zhengyao Fang · Pengyuan Lyu · Jingjing Wu · Chengquan Zhang · Jun Yu · Guangming Lu · Wenjie Pei

Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code will be made publicly available.


Poster #235
Highlight
HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis

Mengtian Li · Jinshu Chen · Wanquan Feng · Bingchuan Li · Fei Dai · Songtao Zhao · Qian HE

Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning based methods, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, which consumes time and resources and poses stability risks. Adapter-based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which are not to be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.


Poster #236
Self-Evolving Visual Concept Library using Vision-Language Critics

Atharva Sehgal · Patrick Yuan · Ziniu Hu · Yisong Yue · Jennifer J. Sun · Swarat Chaudhuri

We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. ESCHER uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, ESCHER dynamically improves its concept generation strategy based on the VLM critic's feedback. Finally, ESCHER does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.


Poster #237
Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis

Zixuan Wang · DUO PENG · Feng Chen · Yuwei Yang · Yinjie Lei

Image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, challenges in achieving control over generated images have underscored the need for the task of conditional image synthesis. Current methods for conditional image synthesis, nevertheless, remain limited, as they are often task-oriented with a narrow scope, handling a restricted condition with constrained applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of fundamental condition units. This perspective allows us to develop a framework for modular conditional generation, significantly enhancing the model's adaptability to diverse conditional generation tasks and greatly expanding its application range. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to impose comprehensive geometric constraints that ensure adherence to spatial configuration of the layout condition. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply multi-level motion regularization, ensuring that each pixel follows its desired trajectory without visual artifacts. By flexibly inserting and combining these condition modules, our framework enables highly controllable image generation. Extensive experiments demonstrate the superior performance of our framework across a variety of conditions, including textual caption, layout mask (or box), drag manipulation, and their combinations. Our code will be released.


Poster #238
Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts

Feng Liang · Haoyu Ma · Zecheng He · Tingbo Hou · Ji Hou · Kunpeng Li · Xiaoliang Dai · Felix Juefei-Xu · Samaneh Azadi · Animesh Sinha · Peizhao Zhang · Peter Vajda · Diana Marculescu

Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts, including face, body, and animal images, into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.


Poster #239
AMO Sampler: Enhancing Text Rendering with Overshooting

Xixi Hu · Keyang Xu · Bo Liu · Hongliang Fei · Qiang Liu

Achieving precise alignment between textual instructions and generated images in text-to-image generation is a significant challenge, particularly in rendering written text within images. Open-source models like Stable Diffusion 3 (SD3), Flux, and AuraFlow often struggle with accurate text depiction, resulting in misspelled or inconsistent text. We introduce a training-free method with minimal computational overhead that significantly enhances text rendering quality. Specifically, we introduce an overshooting sampler for a pretrained rectified flow (RF) model, which alternates between over-simulating the learned ordinary differential equation (ODE) and reintroducing noise. Compared to the Euler sampler, the overshooting sampler effectively introduces an extra Langevin dynamics term that can help correct the compounding error from successive Euler steps and therefore improve the text rendering. However, when the overshooting strength is high, we observe over-smoothing artifacts on the generated images. To address this issue, we adaptively control the strength of the overshooting for each image patch according to its attention score with the text content. We name the proposed sampler the Attention Modulated Overshooting sampler (AMO). AMO demonstrates a 32.3% and 35.9% improvement in text rendering accuracy on SD3 and Flux, respectively, without compromising overall image quality or increasing inference cost.
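As a rough illustration of the overshoot-then-renoise idea described above, the sketch below shows one step of such a sampler; the names velocity_fn and text_attn, the per-patch modulation, and the re-noising scale are simplifying assumptions for illustration, not the authors' exact formulation.

import torch

def amo_step(x, t, t_next, velocity_fn, text_attn, base_overshoot=0.5):
    # Illustrative AMO-style update (not the exact rule from the paper):
    # 1) over-simulate the learned ODE past the nominal next timestep,
    # 2) reintroduce Gaussian noise to land back near t_next.
    v = velocity_fn(x, t)                      # velocity of a pretrained rectified-flow model
    c = base_overshoot * text_attn             # attention-modulated strength per image patch
    dt = t_next - t                            # nominal Euler step
    x_over = x + (1.0 + c) * dt * v            # overshoot the ODE
    # The amount of re-injected noise should follow the flow's noise schedule;
    # a simple proxy proportional to the extra distance travelled is used here.
    sigma = torch.sqrt(c * abs(dt))
    return x_over + sigma * torch.randn_like(x_over)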


Poster #240
ArtiFade: Learning to Generate High-quality Subject from Blemished Images

Shuya Yang · Shaozhe Hao · Yukang Cao · Kwan-Yee K. Wong

Subject-driven text-to-image generation has witnessed remarkable advancements in its ability to learn and capture characteristics of a subject using only a limited number of images. However, existing methods commonly rely on high-quality images for training and may struggle to generate reasonable images when the input images are blemished by artifacts. This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. Specifically, ArtiFade exploits fine-tuning of a pre-trained text-to-image model, aiming to remove artifacts. The elimination of artifacts is achieved by utilizing a specialized dataset that encompasses both unblemished images and their corresponding blemished counterparts during fine-tuning. ArtiFade also ensures the preservation of the original generative capabilities inherent within the diffusion model, thereby enhancing the overall performance of subject-driven methods in generating high-quality and artifact-free images. We further devise evaluation benchmarks tailored for this task. Through extensive qualitative and quantitative experiments, we demonstrate the generalizability of ArtiFade in effective artifact removal under both in-distribution and out-of-distribution scenarios.


Poster #241
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Shufan Li · Konstantinos Kallidromitis · Akash Gokul · Zichun Liao · Yusuke Kato · Kazuki Kozuka · Aditya Grover

We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities.
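For context, the single-modality rectified flow objective that OmniFlow builds on can be sketched as follows; the multi-modal coupling and guidance mechanism described in the abstract are not reflected here, and model and cond are placeholder names.

import torch

def rectified_flow_loss(model, x0, x1, cond):
    # Plain rectified-flow training step (illustration only): interpolate linearly
    # between noise x0 and data x1, and regress the straight-line velocity.
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0                         # straight-line velocity
    v_pred = model(x_t, t.flatten(), cond)     # cond could be text, audio, or image features
    return torch.mean((v_pred - v_target) ** 2)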


Poster #242
LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

Enis Simsar · Thomas Hofmann · Federico Tombari · Pinar Yanardag

Recent advances in text-to-image customization have enabled high-fidelity, context-rich generation of personalized images, allowing specific concepts to appear in a variety of scenarios. However, current methods struggle with combining multiple personalized models, often leading to attribute entanglement or requiring separate training to preserve concept distinctiveness. We present LoRACLR, a novel approach for multi-concept image generation that merges multiple LoRA models, each fine-tuned for a distinct concept, into a single, unified model without additional individual fine-tuning. LoRACLR uses a contrastive objective to align and merge the weight spaces of these models, ensuring compatibility while minimizing interference. By enforcing distinct yet cohesive representations for each concept, LoRACLR enables efficient, scalable model composition for high-quality, multi-concept image synthesis. Our results highlight the effectiveness of LoRACLR in accurately merging multiple concepts, advancing the capabilities of personalized image generation.


Poster #243
Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization

Zhanhao Liang · Yuhui Yuan · Shuyang Gu · Bohan CHEN · Tiankai Hang · Mingxi Cheng · Ji Li · Liang Zheng

Generating visually appealing images is fundamental to modern text-to-image generation models. A potential solution to better aesthetics is direct preference optimization (DPO), which has been applied to diffusion models to improve general image quality including prompt alignment and aesthetics. Popular DPO methods propagate preference labels from clean image pairs to all the intermediate steps along the two generation trajectories. However, preference labels provided in existing datasets are blended with layout and aesthetic opinions, which do not always agree with pure aesthetic preference. Even if aesthetic labels were provided (at substantial cost), it would be hard for the two-trajectory methods to capture nuanced visual differences at different steps. To improve aesthetics economically, this paper uses existing generic preference data and introduces step-by-step preference optimization (SPO) that discards the propagation strategy and allows fine-grained image details to be assessed. Specifically, at each denoising step, we 1) sample a pool of candidates by denoising from a shared noise latent, 2) use a step-aware preference model to find a suitable win-lose pair to supervise the diffusion model, and 3) randomly select one from the pool to initialize the next denoising step. This strategy ensures that the diffusion model focuses on the subtle, fine-grained visual differences instead of the layout aspect. We find that aesthetics can be significantly enhanced by accumulating these improved minor differences. When fine-tuning Stable Diffusion v1.5 and SDXL, SPO yields significant improvements in aesthetics compared with existing DPO methods while not sacrificing image-text alignment compared with vanilla models. Moreover, SPO converges much faster than DPO methods due to the step-by-step alignment of fine-grained visual details.
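The per-step loop described above can be summarized with the following schematic, where denoise_step, pref_score, and dpo_loss are placeholder names for the sampler, the step-aware preference model, and the pairwise objective; this is a sketch of the procedure, not the authors' code.

import random
import torch

def spo_step(x_t, t, denoise_step, pref_score, dpo_loss, pool_size=4):
    # 1) Sample a pool of candidates from the shared noisy latent x_t.
    candidates = [denoise_step(x_t, t) for _ in range(pool_size)]
    # 2) A step-aware preference model picks a win / lose pair at this timestep.
    scores = torch.stack([pref_score(c, t) for c in candidates])
    win = candidates[int(scores.argmax())]
    lose = candidates[int(scores.argmin())]
    loss = dpo_loss(win, lose, t)              # supervise the diffusion model on this pair
    # 3) Randomly pick one candidate (not necessarily the winner) to seed the next step.
    return loss, random.choice(candidates)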


Poster #244
Composing Parts for Expressive Object Generation

Harsh Rangwani · Aishwarya Agarwal · Kuldeep Kulkarni · R. Venkatesh Babu · Srikrishna Karanam

Image composition and generation are processes where the artists need control over various parts of the generated images. However, the current state-of-the-art generation models, like Stable Diffusion, cannot handle fine-grained part-level attributes in the text prompts. Specifically, when additional attribute details are added to the base text prompt, these text-to-image models either generate an image vastly different from the image generated from the base prompt or ignore the attribute details. To mitigate these issues, we introduce PartComposer, a training-free method that enables image generation based on fine-grained part-level attributes specified for objects in the base text prompt. This allows more control for artists and enables novel object compositions by combining distinctive object parts. PartComposer first localizes object parts by denoising the object region from a specific diffusion process. This enables each part token to be localized to the right region. After obtaining part masks, we run a localized diffusion process in each part region based on fine-grained part attributes and combine them to produce the final image. All stages of PartComposer are based on repurposing a pre-trained diffusion model, which enables it to generalize across domains. We demonstrate the effectiveness of part-level control provided by PartComposer through qualitative visual examples and quantitative comparisons with contemporary baselines.

Text-to-image diffusion model alignment is critical for improving the alignment between generated images and human preferences. While training-based methods are constrained by high computational costs and dataset requirements, training-free alignment methods remain underexplored and are often limited by inaccurate guidance. We propose DyMO, a plug-and-play, training-free method for aligning generated images with human preferences during inference. In addition to text-aware human preference scores, we introduce a semantic alignment objective to enhance semantic alignment in the early stages of diffusion, relying on the fact that attention maps are effective reflections of the semantics in noisy images. We propose dynamic scheduling of multiple objectives and intermediate recurrent steps to reflect the requirements at different steps. Experiments with diverse pre-trained diffusion models and metrics demonstrate the effectiveness and robustness of the proposed method.
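A training-free guidance step of this kind can be sketched as below; preference_score, semantic_score, and schedule stand in for the text-aware preference model, the attention-map-based semantic objective, and the dynamic scheduling the abstract refers to, and the plain gradient update is a simplification rather than the method itself.

import torch

def dymo_guidance_step(x_t, t, preference_score, semantic_score, schedule, step_size=1.0):
    # Combine two scalar objectives with step-dependent weights and nudge the noisy latent.
    x_t = x_t.detach().requires_grad_(True)
    w_pref, w_sem = schedule(t)                          # dynamic multi-objective weights
    objective = w_pref * preference_score(x_t, t) + w_sem * semantic_score(x_t, t)
    grad = torch.autograd.grad(objective, x_t)[0]
    return (x_t + step_size * grad).detach()             # move toward higher-scoring latents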


Poster #246
Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

Stefan Andreas Baumann · Felix Krause · Michael Neumayr · Nick Stracke · Melvin Sevi · Vincent Tao Hu · Björn Ommer

Recent advances in text-to-image (T2I) diffusion models have significantly improved the quality of generated images. However, providing efficient control over individual subjects, particularly the attributes characterizing them, remains a key challenge. While existing methods have introduced mechanisms to modulate attribute expression, they typically provide either detailed, object-specific localization of such a modification or fine-grained, nuanced control of attributes. No current approach offers both simultaneously, resulting in a gap when trying to achieve precise continuous and subject-specific attribute modulation in image generation. In this work, we demonstrate that token-level directions exist within commonly used CLIP text embeddings that enable fine-grained, subject-specific control of high-level attributes in T2I models. We introduce two methods to identify these directions: a simple, optimization-free technique and a learning-based approach that utilizes the T2I model to characterize semantic concepts more specifically. Our methods allow the augmentation of the prompt text input, enabling fine-grained control over multiple attributes of individual subjects simultaneously, without requiring any modifications to the diffusion model itself. This approach offers a unified solution that fills the gap between global and localized control, providing competitive flexibility and precision in text-guided image generation.


Poster #247
Make It Count: Text-to-Image Generation with an Accurate Number of Objects

Lital Binyamin · Yoad Tewel · Hilit Segev · Eran Hirsch · Royi Rassin · Gal Chechik

Despite the unprecedented success of text-to-image diffusion models, controlling the number of depicted objects using text is surprisingly hard. This is important for various applications, from technical documents to children's books to illustrated cooking recipes. Generating the correct object count is fundamentally challenging because the generative model needs to keep a sense of separate identity for every instance of the object, even if several objects look identical or overlap, and then carry out a global computation implicitly during generation. It is still unknown if such representations exist. To address count-correct generation, we first identify features within the diffusion model that can carry the object identity information. We then use them to separate and count instances of objects during the denoising process and detect over-generation and under-generation. We fix the latter by training a model that predicts both the shape and location of a missing object, based on the layout of existing ones, and show how it can be used to guide denoising with the correct object count. Our approach, CountGen, does not depend on an external source to determine object layout, but rather uses the prior from the diffusion model itself, creating prompt-dependent and seed-dependent layouts. Evaluated on two benchmark datasets, we find that CountGen strongly outperforms existing baselines in count accuracy.

Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. Despite their effectiveness on single-concept prompts, current methods still face challenges, as they struggle to precisely remove harmful concepts without disrupting the semantics of benign ones. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.


Poster #249
MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Mingcheng Li · Xiaolu Hou · Ziyang Liu · Dingkang Yang · Ziyun Qian · Jiawei Chen · Jinjie Wei · Yue Jiang · Qingyao Xu · Lihua Zhang

Diffusion models have shown excellent performance in text-to-image generation. However, existing methods often suffer from performance bottlenecks when dealing with complex prompts involving multiple objects, characteristics, and relations. Therefore, we propose Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation in complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system containing multiple agents with different tasks, using MLLMs to adequately extract multiple scene elements. In addition, a hierarchical compositional diffusion module utilizes Gaussian masks and filtering to refine bounding box regions and highlights objects through region enhancement, enabling accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of baseline models in a training-free manner, giving it a large advantage in complex scene generation. The code will be open-sourced on GitHub.


Poster #250
StoryGPT-V: Large Language Models as Consistent Story Visualizers

Xiaoqian Shen · Mohamed Elhoseiny

Recent generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts. Nevertheless, a significant challenge remains in applying these models to the more intricate task of story visualization, since it requires resolving pronouns (he, she, they) in the frame descriptions, i.e., anaphora resolution, and ensuring consistent characters and background synthesis across frames. Emerging Large Language Models (LLMs) showcase robust reasoning abilities to navigate through ambiguous references and process extensive sequences. Therefore, we introduce StoryGPT-V, which leverages the merits of latent diffusion models (LDMs) and LLMs to produce images with consistent and high-quality characters grounded on given story descriptions. First, we train a character-aware LDM, which takes character-augmented semantic embedding as input and includes the supervision of the cross-attention map using character segmentation masks, aiming to enhance character generation accuracy and faithfulness. In the second stage, we align the output of the LLM with the character-augmented embedding residing in the input space of the first-stage model. This harnesses the reasoning ability of the LLM to address ambiguous references and its comprehension capability to memorize the context. We conduct comprehensive experiments on two story visualization benchmarks. Our model reports superior quantitative results and consistently generates accurate characters of remarkable quality with low memory consumption. Our code will be made publicly available; please refer to the anonymous webpage (https://storygpt-v.s3.amazonaws.com/index.html) for qualitative results.


Poster #251
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Chengyou Jia · Changliang Xia · Zhuohang Dang · Weijia Wu · Hangwei Qian · Minnan Luo

Despite the significant advancements in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor-intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step-wise accuracy and image quality, ChatGen-Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code and models will be publicly available.


Poster #252
OmniGen: Unified Image Generation

Shitao Xiao · Yueze Wang · Junjie Zhou · Huaying Yuan · Xingrun Xing · Ruiran Yan · Chaofan Li · Shuting Wang · Tiejun Huang · Zheng Liu

The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefiting from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will open-source the related resources to foster advancements in this field.


Poster #253
ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts

Dmitrii M Petrov · Pradyumn Goyal · Divyansh Shivashok · Yuanming Tao · Melinos Averkiou · Evangelos Kalogerakis

We introduce ShapeWords, an approach for synthesizing images based on 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information within specialized tokens embedded together with the input text, effectively blending 3D shape awareness with textual context to guide the image synthesis process. Unlike conventional shape guidance methods that rely on depth maps restricted to fixed viewpoints and often overlook full 3D structure or textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape's geometry and the textual description. Experimental results show that ShapeWords produces images that are more text-compliant and aesthetically plausible, while also maintaining 3D shape awareness.


Poster #254
Highlight
From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing

Jingxuan Wei · Cheng Tan · Qi Chen · Gaowei Wu · Siyuan Li · Zhangyang Gao · Linzhuang Sun · Bihui Yu · Ruifeng Guo

We introduce the task of text-to-diagram generation, which focuses on creating structured visual representations directly from textual descriptions. Existing approaches in text-to-image and text-to-code generation lack the logical organization and flexibility needed to produce accurate, editable diagrams, often resulting in outputs that are either unstructured or difficult to modify. To address this gap, we introduce DiagramGenBenchmark, a comprehensive evaluation framework encompassing eight distinct diagram categories, including flowcharts, model architecture diagrams, and mind maps. Additionally, we present DiagramAgent, an innovative framework with four core modules—Plan Agent, Code Agent, Check Agent, and Diagram-to-Code Agent—designed to facilitate both the generation and refinement of complex diagrams. Our extensive experiments, which combine objective metrics with human evaluations, demonstrate that DiagramAgent significantly outperforms existing baseline models in terms of accuracy, structural coherence, and modifiability. This work not only establishes a foundational benchmark for the text-to-diagram generation task but also introduces a powerful toolset to advance research and applications in this emerging area.


Poster #255
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

Shivam Duggal · Yushi Hu · Oscar Michel · Aniruddha Kembhavi · William Freeman · Noah A. Smith · Ranjay Krishna · Antonio Torralba · Ali Farhadi · Wei-Chiu Ma

Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often overlook the geometric quality of generated assets or merely rely on black-box multimodal large language models for coarse assessment. In this paper, we introduce Eval3D, a fine-grained, interpretable evaluation tool that can faithfully evaluate the quality of generated 3D assets based on various distinct yet complementary criteria. Our key observation is that many desired properties of 3D generation, such as semantic and geometric consistency, can be effectively captured by measuring the consistency among various foundation models and tools. We thus leverage a diverse set of models and tools as probes to evaluate the inconsistency of generated 3D assets across different aspects. Compared to prior work, Eval3D provides pixel-wise measurement, enables accurate 3D spatial feedback, and aligns more closely with human judgments. We comprehensively evaluate existing 3D generation models using Eval3D and highlight the limitations and challenges of current models.


Poster #256
EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark

Ming Li · Jike Zhong · Tianle Chen · Yuxiang Lai · Konstantinos Psounis

Recent studies on large language models (LLMs) and large multimodal models (LMMs) have demonstrated promising skills in various domains including science and mathematics. However, their capability in more challenging and real-world related scenarios like engineering has not been systematically studied. To bridge this gap, we propose EEE-Bench, a multimodal benchmark aimed at assessing LMMs' capabilities in solving practical engineering tasks, using electrical and electronics engineering (EEE) as the testbed. Our benchmark consists of 2860 hand-picked and carefully curated problems spanning 10 essential subdomains such as analog circuits, control systems, etc. Compared to other domains, engineering problems are intrinsically 1) more visually complex and versatile and 2) less deterministic in solutions. Successful solutions to these problems often demand unusually rigorous integration of visual and textual information, as models need to understand intricate images like abstract circuits and system diagrams while following professional instructions. Alongside EEE-Bench, we provide extensive quantitative evaluations, fine-grained analysis, and improvement methods using 17 widely-used open- and closed-source LLMs and LMMs and 7 popular prompting techniques. Our results reveal notable deficiencies in current foundation models for EEE, including an average performance ranging from 19.48% to 46.78% and a tendency toward "laziness" in overlooking essential visual context. In summary, we believe EEE-Bench not only reveals some noteworthy limitations of LMMs but also provides a valuable resource for advancing research on their application in practical engineering tasks, driving future improvements in their capability to handle complex, real-world scenarios.


Poster #257
Towards Precise Embodied Dialogue Localization via Causality Guided Diffusion

Haoyu Wang · Le Wang · Sanping Zhou · Jingyi Tian · Zheng Qin · Yabing Wang · Gang Hua · Wei Tang

Embodied localization based on vision and natural language dialogues presents a persistent challenge in embodied intelligence. Existing methods often approach this task as an image translation problem, leveraging encoder-decoder architectures to predict heatmaps. However, these methods frequently experience a deficiency in accuracy, largely due to their heavy reliance on resolution. To address this issue, we introduce CGD, a novel framework that utilizes causality guided diffusion model to directly model coordinate distributions. Specifically, CGD employs a denoising network to regress coordinates, while integrating causal learning modules, namely back-door adjustment (BDA) and front-door adjustment (FDA) to mitigate confounders during the diffusion process. This approach reduces the dependency on high resolution for improving accuracy, while effectively minimizing spurious correlations, thereby promoting unbiased learning. By guiding the denoising process with causal adjustments, CGD offers flexible control over intensity, ensuring seamless integration with diffusion models. Experimental results demonstrate that CGD outperforms state-of-the-art methods across all metrics. Additionally, we also evaluate CGD in a multi-shot setting, achieving consistently high accuracy.


Poster #258
Rethinking Training for De-biasing Text-to-Image Generation: Unlocking the Potential of Stable Diffusion

Eunji Kim · Siwon Kim · Minjun Park · Rahim Entezari · Sungroh Yoon

Recent advancements in text-to-image models, such as Stable Diffusion, show significant demographic biases. Existing de-biasing techniques rely heavily on additional training, which imposes high computational costs and risks compromising core image generation functionality. This hinders them from being widely adopted in real-world applications. In this paper, we explore Stable Diffusion's overlooked potential to reduce bias without requiring additional training. Through our analysis, we uncover that initial noises associated with minority attributes form 'minority regions' rather than being scattered. We view these 'minority regions' as opportunities in SD to reduce bias. To unlock this potential, we propose a novel de-biasing method called 'weak guidance,' carefully designed to guide a random noise to the minority regions without compromising semantic integrity. Through analysis and experiments on various versions of SD, we demonstrate that our proposed approach effectively reduces bias without additional training, achieving both efficiency and preservation of core image generation functionality.


Poster #259
Rectified Diffusion Guidance for Conditional Generation

Mengfei Xia · Nan Xue · Yujun Shen · Ran Yi · Tieliang Gong · Yong-Jin Liu

Classifier-Free Guidance (CFG), which combines the conditional and unconditional score functions with two coefficients summing to one, serves as a practical technique for diffusion model sampling. Theoretically, however, denoising with CFG cannot be expressed as a reciprocal diffusion process, which may consequently leave some hidden risks during use. In this work, we revisit the theory behind CFG and rigorously confirm that the improper configuration of the combination coefficients (i.e., the widely used summing-to-one version) brings about an expectation shift of the generative distribution. To rectify this issue, we propose ReCFG with a relaxation on the guidance coefficients such that denoising with ReCFG strictly aligns with the diffusion theory. We further show that our approach enjoys a closed-form solution given the guidance strength. That way, the rectified coefficients can be readily pre-computed via traversing the observed data, leaving the sampling speed barely affected. Empirical evidence on real-world data demonstrates the compatibility of our post-hoc design with existing state-of-the-art diffusion models, including both class-conditioned ones (e.g., EDM2 on ImageNet) and text-conditioned ones (e.g., SD3 on CC12M), without any retraining. We will open-source the code to facilitate further research.
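For context, the sketch below contrasts the standard summing-to-one CFG combination with a relaxed two-coefficient combination in the spirit of ReCFG; the closed-form coefficient values are derived in the paper and are left as free parameters here.

def cfg_combine(eps_cond, eps_uncond, w):
    # Standard classifier-free guidance: coefficients (1 + w) and (-w) sum to one.
    return (1.0 + w) * eps_cond - w * eps_uncond

def recfg_combine(eps_cond, eps_uncond, a_cond, a_uncond):
    # Relaxed combination: the two coefficients are decoupled and need not sum to one.
    # ReCFG derives their values in closed form from observed data given the guidance
    # strength; in this illustrative version they are simply passed in.
    return a_cond * eps_cond + a_uncond * eps_uncond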


Poster #260
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

Lijun Li · Zhelun Shi · Xuhao Hu · Bowen Dong · Yiran Qin · Xihui Liu · Lu Sheng · Jing Shao

Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and privacy. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing.


Poster #261
The Illusion of Unlearning: The Unstable Nature of Machine Unlearning in Text-to-Image Diffusion Models

Naveen George · Karthik Nandan Dasaraju · Rutheesh Reddy Chittepu · Konda Reddy Mopuri

Text-to-image models such as Stable Diffusion, DALL·E, and Midjourney have gained immense popularity lately. However, they are trained on vast amounts of data that may include private, explicit, or copyrighted material used without permission, raising serious legal and ethical concerns. In light of the recent regulations aimed at protecting individual data privacy, there has been a surge in Machine Unlearning methods designed to remove specific concepts from these models. However, we identify a critical flaw in these unlearning techniques: unlearned concepts will revive when the models are fine-tuned, even with general or unrelated prompts. In this paper, for the first time, through an extensive study, we demonstrate the unstable nature of existing unlearning methods in text-to-image diffusion models. We introduce a framework that includes a couple of measures for analyzing the stability of existing unlearning methods. Further, the paper offers preliminary insights into the plausible explanation for the instability of the mapping-based unlearning methods that can guide future research toward more robust unlearning techniques. Anonymized codes for implementing the proposed framework are provided.


Poster #262
Towards Universal Dataset Distillation via Task-Driven Diffusion

Ding Qi · Jian Li · Junyao Gao · Shuguang Dou · Ying Tai · Jianlong Hu · Bo Zhao · Yabiao Wang · Chengjie Wang · Cai Rong Zhao

Dataset distillation (DD) condenses key information from large-scale datasets into smaller synthetic datasets, reducing storage and computational costs for training networks. However, recent research has primarily focused on image classification tasks, with limited expansion to detection and segmentation. Two key challenges remain: (i) Task Optimization Heterogeneity, where existing methods focus on class-level information and fail to address the diverse needs of detection and segmentation, and (ii) Inflexible Image Generation, where current generation methods rely on global updates for single-class targets and lack localized optimization for specific object regions. To address these challenges, we propose a universal dataset distillation framework, named UniDD, a task-driven diffusion model for diverse DD tasks. Our approach operates in two stages: Universal Task Knowledge Mining, which captures task-relevant information through task-specific proxy model training, and Universal Task-Driven Diffusion, where these proxies guide the diffusion process to generate task-specific synthetic images. Extensive experiments across ImageNet-1K, Pascal VOC, and MS COCO demonstrate that UniDD consistently outperforms state-of-the-art methods. In particular, on ImageNet-1K with IPC-10, UniDD surpasses previous diffusion-based methods by 6.1%, while also reducing deployment costs.


Poster #263
RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations

Peter Sushko · Ayana Bharadwaj · Zhi Yang Lim · Vasily Ilin · Ben Caffee · Dongping Chen · Reza Salehi · Cheng-Yu Hsieh · Ranjay Krishna

Existing image editing models struggle to meet real-world demands; despite excelling on academic benchmarks, they have yet to be adopted to solve real user needs. The datasets that power these models use artificial edits, lacking the scale and ecological validity necessary to address the true diversity of user requests. In response, we introduce RealEdit, a large-scale image editing dataset with authentic user requests and human-made edits sourced from Reddit. RealEdit contains a test set of 9.3K examples the community can use to evaluate models on real user requests. Our results show that existing models fall short on these tasks, implying a need for realistic training data. We therefore introduce 48K training examples, with which we train our RealEdit model. Our model achieves substantial gains, outperforming competitors by up to 165 Elo points in human judgment and achieving a 92% relative improvement on the automated VIEScore metric on our test set. We deploy our model back on Reddit, testing it on new requests, and receive positive feedback. Beyond image editing, we explore RealEdit's potential in detecting edited images by partnering with a deepfake detection non-profit. Finetuning their model on RealEdit data improves its F1-score by 14 percentage points, underscoring the dataset's value for broad, impactful applications.


Poster #264
Harnessing Global-Local Collaborative Adversarial Perturbation for Anti-Customization

Long Xu · Jiakai Wang · Haojie Hao · Haotong Qin · Jiejie Zhao · Xianglong Liu

Though achieving significant success in personalized image synthesis, Latent Diffusion Models (LDMs) pose substantial social risks caused by unauthorized misuse (e.g., face theft). To counter these threats, Anti-Customization (AC) methods that exploit adversarial perturbations have been proposed. Unfortunately, existing AC methods show insufficient defense ability due to their neglect of hierarchical characteristics, i.e., global feature correlations and local facial attributes, leading to weak resistance to concept transfer and semantic theft by customization methods. To address this problem, we propose a Global-local collaborated Anti-Customization (GoodAC) framework to generate powerful adversarial perturbations by disturbing both feature correlations and facial attributes. To enhance the ability to resist concept transfer, we disrupt the spatial correlation of the perceptual features that form the basis of model generation at a global level, thereby creating highly concept-transfer-resistant adversarial camouflage. To improve the ability to resist semantic theft, leveraging the fact that facial attributes are personalized, we design a personalized and precise facial attribute distortion strategy at the local level, focusing the attack on the individual's image structure to generate strong camouflage. Extensive experiments on various LDM customization methods, including DreamBooth, LoRA, and textual inversion, strongly demonstrate that our GoodAC outperforms other state-of-the-art approaches by large margins, e.g., over 50% improvement on ISM.


Poster #265
Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal

Haonan An · Guang Hua · Zhengru Fang · Guowen Xu · Susanto Rahardja · Yuguang Fang

The intellectual property of deep image-to-image models can be protected by so-called box-free watermarking. It uses an encoder and a decoder, respectively, to embed into and extract from the model's output images invisible copyright marks. Prior works have improved watermark robustness, focusing on the design of better watermark encoders. In this paper, we reveal an overlooked vulnerability of the unprotected watermark decoder, which is jointly trained with the encoder and can be exploited to train a watermark removal network. To defend against such an attack, we propose the decoder gradient shield (DGS) as a protection layer in the decoder API to prevent gradient-based watermark removal with a closed-form solution. The fundamental idea is inspired by the classical adversarial attack, but is utilized for the first time as a defensive mechanism in box-free model watermarking. We then demonstrate that DGS can reorient and rescale the gradient directions of watermarked queries and stop the watermark remover's training loss from converging to the level reached without DGS, while retaining decoder output image quality. Experimental results verify the effectiveness of the proposed method. The code will be made available upon acceptance.

Advances in talking-head animation based on Latent Diffusion Models (LDM) enable the creation of highly realistic, synchronized videos. These fabricated videos are indistinguishable from real ones, increasing the risk of potential misuse for scams, political manipulation, and misinformation. Hence, addressing these ethical concerns has become a pressing issue in AI security. Recent proactive defense studies focused on countering LDM-based models by adding perturbations to portraits. However, these methods are ineffective at protecting reference portraits from advanced image-to-video animation. The limitations are twofold: 1) they fail to prevent images from being manipulated by audio signals, and 2) diffusion-based purification techniques can effectively eliminate protective perturbations. To address these challenges, we propose Silencer, a two-stage method designed to proactively protect the privacy of portraits. First, a nullifying loss is proposed to ignore audio control in talking-head generation. Second, we apply an anti-purification loss in the LDM to optimize the inverted latent feature to generate robust perturbations. Extensive experiments demonstrate the effectiveness of Silencer in proactively protecting portrait privacy. We hope this work will raise awareness among the AI security community regarding critical ethical issues related to talking-head generation techniques.


Poster #267
Secret Lies in Color: Enhancing AI-Generated Images Detection with Color Distribution Analysis

Zexi Jia · Chuanwei Huang · Yeshuang Zhu · Hongyan Fei · Xiaoyue Duan · Yuan Zhiqiang · Ying Deng · Jiapei Zhang · Jinchao Zhang · Jie Zhou

The rapid advancement of Generative Adversarial Networks (GANs) and diffusion models significantly enhances the realism of synthetic images, driving progress in image processing and creative design. However, this progress also necessitates the development of effective detection methods, as synthetic images become increasingly difficult to distinguish from real ones. This difficulty leads to various societal issues, such as the spread of misinformation, identity theft, and online fraud. While previous detection methods perform well on public benchmarks, they struggle with our proposed benchmark, FakeART, particularly when dealing with the latest models and cross-domain tasks (e.g., photo-to-painting). To address this challenge, we develop a new synthetic image detection technique based on color distribution. Unlike real images, synthetic images often exhibit uneven color distribution. By employing color quantization and restoration techniques, we analyze the color differences before and after image restoration. We discover and prove that these differences closely relate to the uniformity of color distribution. Based on this finding, we extract effective color features and combine them with image features to create a detection model with only 1.4 million parameters. This model achieves state-of-the-art results across various evaluation benchmarks, including the challenging FakeART dataset.
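One simple way to realize a quantize-and-restore color analysis of the kind described above is sketched below using k-means palette quantization; the actual restoration technique and feature design in the paper may differ, and the residual statistics here are only a stand-in.

import numpy as np
from sklearn.cluster import KMeans

def color_difference_features(image, n_colors=16):
    # Quantize the palette, "restore" each pixel to its nearest palette color,
    # and summarize the before/after differences as a color-distribution feature.
    pixels = image.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(pixels)
    restored = km.cluster_centers_[km.labels_]
    diff = np.abs(pixels - restored)
    # Per-channel residual statistics as a crude proxy for the learned color features.
    return np.concatenate([diff.mean(axis=0), diff.std(axis=0)])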


Poster #268
CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI

Siyuan Cheng · Lingjuan Lyu · Zhenting Wang · Xiangyu Zhang · Vikash Sehwag

With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, CO-SPY, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create CO-SPYBench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%.


Poster #269
Highlight
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

Ian Huang · Yanan Bao · Karen Truong · Howard Zhou · Cordelia Schmid · Leonidas Guibas · Alireza Fathi

Scene generation with 3D assets presents a complex challenge, requiring both high-level semantic understanding and low-level geometric reasoning. While Multimodal Large Language Models (MLLMs) excel at semantic tasks, their application to 3D scene generation is hindered by their limited grounding on 3D geometry. In this paper, we investigate how to best work with MLLMs in an object placement task. Towards this goal, we introduce a novel framework, FirePlace, that applies existing MLLMs in (1) 3D geometric reasoning and the extraction of relevant geometric details from the 3D scene, (2) constructing and solving geometric constraints on the extracted low-level geometry, and (3) pruning for final placements that conform to common sense. By combining geometric reasoning with real-world understanding of MLLMs, our method can propose object placements that satisfy both geometric constraints as well as high-level semantic common-sense considerations. Our experiments show that these capabilities allow our method to place objects more effectively in complex scenes with intricate geometry, surpassing the quality of prior work.


Poster #270
VI^3NR: Variance Informed Initialization for Implicit Neural Representations

Chamin Hewa Koneputugodage · Yizhak Ben-Shabat · Sameera Ramasinghe · Stephen Gould

Implicit Neural Representations (INRs) are a versatile and powerful tool for encoding various forms of data, including images, videos, sound, and 3D shapes. A critical factor in the success of INRs is the initialization of the network, which can significantly impact the convergence and accuracy of the learned model. Unfortunately, commonly used neural network initializations are not widely applicable for many activation functions, especially those used by INRs. In this paper, we improve upon previous initialization methods by deriving an initialization that has stable variance across layers, and applies to any activation function. We show that this generalizes many previous initialization methods, and has even better stability for well studied activations. We also show that our initialization leads to improved results with INR activation functions in multiple signal modalities. Our approach is particularly effective for Gaussian INRs, where we demonstrate that the theory of our initialization matches with task performance in multiple experiments, allowing us to achieve improvements in image, audio, and 3D surface reconstruction.
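The general second-moment argument behind such variance-stable initializations can be illustrated as follows: choose the weight standard deviation so that pre-activation variance is preserved for an arbitrary activation. This is a generic sketch with a hypothetical Gaussian activation and bandwidth, not the specific derivation in the paper.

import torch

def variance_stable_std(activation, fan_in, n_samples=100_000):
    # Estimate E[activation(z)^2] for unit-variance pre-activations z, then choose the
    # weight std so that the next layer's pre-activations also have roughly unit variance.
    # Xavier and Kaiming initializations are special cases of this recipe.
    z = torch.randn(n_samples)
    second_moment = activation(z).pow(2).mean()
    return float((1.0 / (fan_in * second_moment)).sqrt())

# Example with a Gaussian activation often used in INRs (the bandwidth 0.1 is illustrative).
gauss = lambda x: torch.exp(-(x ** 2) / (2 * 0.1 ** 2))
std = variance_stable_std(gauss, fan_in=256)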


Poster #271
EigenGS Representation: From Eigenspace to Gaussian Image Space

LO-WEI TAI · Ching-En Ching En, Li · Cheng-Lin Chen · Chih-Jung Tsai · Hwann-Tzong Chen · Tyng-Luh Liu

Principal Component Analysis (PCA), a classical dimensionality reduction technique, and Gaussian Splatting, a recent high-quality image synthesis method, represent fundamentally different approaches to image representation. Despite these significant differences, we present EigenGS, a novel method that bridges these two paradigms. By establishing an efficient transformation pipeline between eigenspace and image-space Gaussian representations, our approach enables instant initialization of Gaussian parameters for new images without requiring per-image training from scratch. Our method also introduces a frequency-aware learning mechanism that encourages Gaussians to adapt to different scales in order to better model spatial frequencies, effectively preventing artifacts in high-resolution reconstruction. Extensive experiments demonstrate that EigenGS not only achieves superior reconstruction quality but also dramatically accelerates convergence. The results highlight EigenGS's effectiveness and its ability to generalize across images with varying resolutions and diverse categories. This makes high-quality Gaussian Splatting practically viable for real-time applications.
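The eigenspace side of such a pipeline is ordinary PCA, sketched minimally below; the learned mapping from the projection coefficients to image-space Gaussian parameters, which is the core of EigenGS, is not shown, and eigenbasis / mean_image are assumed to be precomputed.

import numpy as np

def project_to_eigenspace(image, mean_image, eigenbasis, k=50):
    # Express a new image as the mean image plus k eigen-image coefficients.
    # EigenGS maps such coefficients to Gaussian parameters for instant initialization.
    x = image.reshape(-1) - mean_image.reshape(-1)
    coeffs = eigenbasis[:k] @ x                     # (k,) projection coefficients
    recon = mean_image.reshape(-1) + eigenbasis[:k].T @ coeffs
    return coeffs, recon.reshape(image.shape)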


Poster #272
Few-shot Personalized Scanpath Prediction

Ruoyu Xue · Jingyi Xu · Sounak Mondal · Hieu Le · Gregory Zelinsky · Minh Hoai · Dimitris Samaras

A personalized model for scanpath prediction provides insights into the visual preferences and attention patterns of individual subjects. However, existing methods for training scanpath prediction models are data-intensive and cannot be effectively personalized to new individuals with only a few available examples. In this paper, we propose the few-shot personalized scanpath prediction task (FS-PSP) and a novel method to address it, which aims to predict scanpaths for an unseen subject using minimal support data of that subject's scanpath behavior. The key to our method's adaptability is the Subject-Embedding Network (SE-Net), specifically designed to capture unique, individualized representations for each user's scanpaths. SE-Net generates subject embeddings that effectively distinguish between subjects while minimizing variability among scanpaths from the same individual. The personalized scanpath prediction model is then conditioned on these subject embeddings to produce accurate, personalized results. Experiments on multiple eye-tracking datasets demonstrate that our method excels in FS-PSP settings and does not require any fine-tuning steps at test time.

Accurate 3D gaze estimation in unconstrained real-world environments remains a significant challenge due to variations in appearance, head pose, occlusion, and the limited availability of in-the-wild 3D gaze datasets. To address these challenges, we introduce a novel Self-Training Weakly-Supervised Gaze Estimation framework (ST-SWGE). This two-stage learning framework leverages diverse 2D gaze datasets, such as gaze-following data, which offer rich variations in appearances, natural scenes, and gaze distributions, and proposes an approach to generate 3D pseudo-labels and enhance model generalization. Furthermore, traditional modality-specific models, designed separately for images or videos, limit the effective use of available training data. To overcome this, we propose the Gaze Transformer (GaT), a modality-agnostic architecture capable of simultaneously learning static and dynamic gaze information from both image and video datasets. By combining 3D video datasets with 2D gaze target labels from gaze following tasks, our approach achieves the following key contributions: (i) Significant state-of-the-art improvements in within-domain and cross-domain generalization on unconstrained benchmarks like Gaze360 and GFIE, with notable cross-modal gains in video gaze estimation; (ii) Superior cross-domain performance on datasets such as MPIIFaceGaze and Gaze360 compared to frontal face methods. Code and pre-trained models will be released to the community.


Poster #274
FilmComposer: LLM-Driven Music Production for Silent Film Clips

Zhifeng Xie · Qile He · Youjia Zhu · Qiwei He · Mengtian Li

In this work, we implement music production for silent film clips using an LLM-driven method. Given the strong professional demands of film music production, we propose FilmComposer, which simulates the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film—audio quality, musicality, and musical development—and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of a visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement, and mixing. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, given the lack of a professional, high-quality film music dataset, we propose MusicPro-7k, which includes 7,000 film clips with music, descriptions, rhythm spots, and main melodies. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development.


Poster #275
VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation

Saksham Singh Kushwaha · Yapeng Tian

Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, neither T2A nor V2A methods can generate holistic sound scenes (both onscreen and offscreen): T2A cannot generate sounds aligned with onscreen objects, while V2A cannot generate semantically complete audio (offscreen sounds are missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with text and video. Previous approaches for joint text- and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual-Text Encoder and a Joint VT-SiT model. To reduce modality bias and improve generation quality, we employ pretrained uni-modal text-to-audio and video-to-audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe-Bench, a dataset of 636 video-text-audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe-Bench demonstrate that joint text and visual interaction is necessary for holistic audio generation. Furthermore, VinTAGe achieves state-of-the-art results on the VGGSound benchmark. We will release our pretrained models and the VinTAGe-Bench dataset to facilitate future research in this exciting field.


Poster #276
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes

Hyeonggon Ryu · Seongyu Kim · Joon Chung · Arda Senocak

We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a "mix-and-separate" framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves state-of-the-art performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach.


Poster #277
Audio-Visual Instance Segmentation

Ruohao Guo · Xianghua Ying · Yaru Chen · Dantong Niu · Guangyao Li · Liao Qu · Yanyu Qi · Jinxing Zhou · Bowei Xing · Wenzhen Yue · Ji Shi · Qixun Wang · Peiliang Zhang · Buwen Liang

In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. Additionally, we propose a strong baseline model for this task. Our model first localizes sound sources within each frame, and condenses object-specific contexts into concise tokens. Then it builds long-range audio-visual dependencies between these tokens using window-based attention, and tracks sounding objects across entire video sequences. Extensive experiments reveal that our method performs best on AVISeg, surpassing existing methods from related tasks. We further conduct evaluations on several multi-modal large models; however, they exhibit subpar performance on instance-level sound source localization and temporal perception. We expect that AVIS will inspire the community towards a more comprehensive multi-modal understanding.


Poster #278
UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing

Yung-Hsuan Lai · Janek Ebbers · Yu-Chiang Frank Wang · François Germain · Michael J. Jones · Moitreya Chatterjee

Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both unimodal events, i.e., those occurring either exclusively in the visual or acoustic modalities of a video, and multimodal events, i.e., those occurring in both modalities concurrently. Moreover, the prohibitive cost of annotating the training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, e.g. where only modality-agnostic, video-level labels might be assumed to be available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide the training of these methods. However, the lack of inter-segment consistency of these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV), a novel approach to overcoming these weaknesses. UWAV factors in the uncertainty associated with the estimated pseudo-labels and incorporates a feature-mixup-based training regularization for improved training. Empirical evaluations show that UWAV outperforms the current state-of-the-art for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
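As a rough, hypothetical illustration of the two training ingredients named above, the sketch below combines a confidence-weighted segment loss with standard feature mixup; the actual uncertainty estimate and loss form used by UWAV are not specified in the abstract, so `conf` and the weighting rule are assumptions.

```python
import numpy as np

def uncertainty_weighted_bce(pred, pseudo, conf, eps=1e-7):
    """Binary cross-entropy against segment-level pseudo-labels, with each
    segment's term down-weighted by the pseudo-label confidence `conf`
    (one plausible reading of 'uncertainty-weighted' training)."""
    pred = np.clip(pred, eps, 1 - eps)
    bce = -(pseudo * np.log(pred) + (1 - pseudo) * np.log(1 - pred))
    return np.mean(conf * bce)

def feature_mixup(feat_a, feat_b, label_a, label_b, alpha=0.2, rng=None):
    """Mix two segment features and their pseudo-labels (standard mixup,
    used here as a training regularizer)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * feat_a + (1 - lam) * feat_b, lam * label_a + (1 - lam) * label_b

# Toy usage: three segments, one low-confidence pseudo-label.
loss = uncertainty_weighted_bce(np.array([0.8, 0.3, 0.6]),
                                np.array([1.0, 0.0, 1.0]),
                                np.array([0.9, 0.9, 0.2]))
print(loss)
```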


Poster #279
Highlight
DistinctAD: Distinctive Audio Description Generation in Contexts

Bo Fang · Wenhao Wu · Qiangqiang Wu · YuXin Song · Antoni B. Chan

Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the domain gap between movie-AD data and existing data used to train vision-language models, and ii) the issue of contextual redundancy arising from highly similar neighboring visual clips in a long movie. In this work, we propose DistinctAD, a novel two-stage framework for generating ADs that emphasize distinctiveness to produce better narratives. In Stage-I, to address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora, enabling more effective alignment between movie and AD modalities at both global and fine-grained levels. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context, ensuring the prediction of unique terms specific to the current AD. Comprehensive evaluations on MAD-Eval, CMD-AD, and TV-AD benchmarks demonstrate the superiority of DistinctAD, with the model consistently outperforming baselines, particularly in Recall@k/N, highlighting its effectiveness in producing high-quality, distinctive ADs.


Poster #280
ExpertAF: Expert Actionable Feedback from Video

Kumar Ashutosh · Tushar Nagarajan · Georgios Pavlakos · Kris Kitani · Kristen Grauman

Feedback is essential for learning a new skill or improving one's current skill level. However, current methods for skill assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching---expert commentary, expert video retrieval, and expert pose generation---outperforming strong vision-language models on both established metrics and human preference studies. Code and data will be publicly released.


Poster #281
FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding

Rong Gao · Xin Liu · Zhuozhao Hu · Bohao Xing · Baiqiang XIA · Zitong YU · Heikki Kälviäinen

Figure skating, known as the “Art on Ice,” is among the most artistic sports, challenging to understand due to its blend of technical elements (like jumps and spins) and overall artistic expression. Existing figure skating datasets mainly focus on single tasks, such as action recognition or scoring, lacking comprehensive annotations for both technical and artistic evaluation. Current sports research is largely centered on ball games, with limited relevance to artistic sports like figure skating. To address this, we introduce FSAnno, a large-scale dataset advancing artistic sports understanding through figure skating. FSAnno includes an open-access training and test dataset, alongside a benchmark dataset, FSBench, for fair model evaluation. FSBench consists of FSBench-Text, with multiple-choice questions and explanations, and FSBench-Motion, containing multimodal data and Question and Answer (QA) pairs, supporting tasks from technical analysis to performance commentary. Initial tests on FSBench reveal significant limitations in existing models’ understanding of artistic sports. We hope FSBench will become a key tool for evaluating and enhancing model comprehension of figure skating.


Poster #282
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Yuying Ge · Yizhuo Li · Yixiao Ge · Ying Shan

In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a $\textbf{Di}$ffusion-Powered $\textbf{V}$ide$\textbf{o}$ $\textbf{T}$okenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna, which performs video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction-tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.


Poster #283
LLaVA-Critic: Learning to Evaluate Multimodal Models

Tianyi Xiong · Xiyao Wang · Dong Guo · Qinghao Ye · Haoqi Fan · Quanquan Gu · Heng Huang · Chunyuan Li

We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: $(i)$ LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and $(ii)$ Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.


Poster #284
Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Yiping Wang · Xuehai He · Kuan Wang · Luyao Ma · Jianwei Yang · Shuohang Wang · Simon Shaolei Du · yelong shen

The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in specific short stories, which is foreseeably an essential capability for future long video generation scenarios. While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2–4 consecutive events. We employ Vision-Language Models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models reveals how challenging the benchmark is, with none exceeding an average story-completion rate of 50\%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation.
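The unanimous-voting protocol described above can be summarized in a few lines; the judge names and the per-event verdict format below are assumptions for illustration.

```python
def story_completion_rate(votes):
    """votes: {event_id: {judge_name: bool}} -- per-event verdicts from
    several VLM judges. An event counts as completed only if all judges
    agree (unanimous voting); the story's score is the fraction of its
    events that are completed."""
    completed = [all(v.values()) for v in votes.values()]
    return sum(completed) / len(completed)

# A two-event story where the judges disagree on the second event -> 0.5
print(story_completion_rate({
    "event_1": {"gpt4v": True, "llava_ov": True},
    "event_2": {"gpt4v": True, "llava_ov": False},
}))
```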


Poster #285
Progress-Aware Video Frame Captioning

Zihui Xue · Joungbin An · Xitong Yang · Kristen Grauman

While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.


Poster #286
Learning from Streaming Video with Orthogonal Gradients

Tengda Han · Dilara Gokay · Joseph Heyward · Chuhan Zhang · Daniel Zoran · Viorica Patraucean · Joao Carreira · Dima Damen · Andrew Zisserman

We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three systems: the one-video representation learning method DoRA, standard VideoMAE, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, decorrelating batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer -- we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. On three scenarios (DoRA, VideoMAE, future prediction), we show that our orthogonal optimizer outperforms the strong AdamW baseline in all three cases.
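A minimal sketch of the orthogonal-gradient idea follows, assuming the simplest form: projecting out the component of the current gradient along the previous one before the optimizer step. The paper's exact geometric modification inside SGD/AdamW may differ.

```python
import numpy as np

def orthogonalize(grad, prev_grad, eps=1e-12):
    """Remove from the current gradient its component along the previous
    one, so consecutive (highly correlated) streaming batches push the
    weights in decorrelated directions."""
    denom = np.dot(prev_grad, prev_grad) + eps
    return grad - (np.dot(grad, prev_grad) / denom) * prev_grad

g_prev = np.array([1.0, 0.0])
g_now = np.array([0.9, 0.4])          # nearly the same direction as g_prev
print(orthogonalize(g_now, g_prev))    # -> [0.  0.4]
```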


Poster #287
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang · Shusheng Yang · Anjali W. Gupta · Rilyn Han · Li Fei-Fei · Saining Xie

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive—though subhuman—visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance awareness.

View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple perspectives. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel fine-grained view-invariant video representation learning approach from unpaired ego-exo videos, called Bootstrap Your Own Videos (BYOV). We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. To this end, we introduce a masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment. Specifically, self-causal masking and cross-view masking predictions are learned concurrently to facilitate view-invariant and powerful representations across viewpoints. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at \url{https://anonymous.4open.science/r/byov-D967}.


Poster #289
Highlight
VEU-Bench: Towards Comprehensive Understanding of Video Editing

Bozheng Li · Yongliang Wu · YI LU · Jiashuo Yu · Licheng Tang · Jiawang Cao · Wenqing Zhu · Yuyang Sun · Jay Wu · Wenbo Zhu

Widely shared videos on the internet are often edited. Recently, although Video Large Language Models (Vid-LLMs) have made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, in this paper, we introduce VEU-Bench (\textbf{V}ideo \textbf{E}diting \textbf{U}nderstanding \textbf{Bench}mark), a comprehensive benchmark that categorizes video editing components across various dimensions, from intra-frame features like shot size to inter-shot attributes such as cut types and transitions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. To enhance the annotation of VEU automatically, we built an annotation pipeline integrated with an ontology-based knowledge base. Through extensive experiments with 11 state-of-the-art Vid-LLMs, our findings reveal that current Vid-LLMs face significant challenges in VEU tasks, with some performing worse than random choice. To alleviate this issue, we develop Oscars\footnote{Named after the Academy Awards.}, a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3\% in accuracy and achieves performance comparable to commercial models like GPT-4o. We also demonstrate that incorporating VEU data significantly enhances the performance of Vid-LLMs on general video understanding benchmarks, with an average improvement of 8.3\% across nine reasoning tasks. The code and data will be made available.


Poster #290
Highlight
Question-Aware Gaussian Experts for Audio-Visual Question Answering

Hongyeob Kim · Inyoung Jung · Dayoon Suh · Youjia Zhang · Sangmin Lee · Sungeun Hong

Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance.
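To make the Gaussian-expert idea concrete, the sketch below builds per-frame weights from a small mixture of Gaussians over normalized time; in QA-TIGER the centers, widths, and expert gates would be predicted from the question, so the values here are placeholders.

```python
import numpy as np

def gaussian_temporal_weights(num_frames, centers, sigmas, expert_gates):
    """Per-frame attention weights as a mixture of Gaussians over time:
    each expert contributes a bump centred at centers[k] (in [0, 1]),
    gated by expert_gates[k]. Parameter names are hypothetical."""
    t = np.linspace(0.0, 1.0, num_frames)
    w = np.zeros(num_frames)
    for c, s, g in zip(centers, sigmas, expert_gates):
        w += g * np.exp(-0.5 * ((t - c) / s) ** 2)
    return w / (w.sum() + 1e-8)

# Two temporal experts: one focused early, one late in the clip.
print(gaussian_temporal_weights(10, centers=[0.2, 0.8],
                                sigmas=[0.10, 0.15],
                                expert_gates=[0.7, 0.3]))
```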


Poster #291
MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou · Yan Shu · Bo Zhao · Boya Wu · Zhengyang Liang · Shitao Xiao · Minghao Qin · Xi Yang · yongping xiong · Bo Zhang · Tiejun Huang · Zheng Liu

The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and their inappropriateness for evaluating LVU performance. To address the above problems, we propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 23 latest MLLMs reveals significant room for improvement in today's techniques, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding ability, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.


Poster #292
M-LLM Based Video Frame Selection for Efficient Video Understanding

Kai Hu · Feng Gao · Xiaohan Nie · Peng Zhou · Son Dinh Tran · Tal Neiman · Lingyun Wang · Mubarak Shah · Raffay Hamid · Bing Yin · Trishul Chilimbi

Recent advances in multimodal large language models (MLLMs) show promising results in video reasoning. Popular MLLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the MLLM, particularly for long-context videos. However, uniform sampling can lose crucial context in certain periods of a video, so the downstream MLLM may not have sufficient visual information to answer a question. To attack this pain point, we propose a lightweight MLLM-based frame selection method that adaptively selects frames more relevant to users' queries. The selected frames are then digested by a frozen downstream video LLM for visual reasoning and question answering. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which single-frame importance is scored by prompting an MLLM; and (ii) a temporal signal, in which multi-frame selections are obtained by prompting an LLM with the captions of all frame candidates. Empirical results show that the proposed MLLM-based video frame selector improves the performance of various downstream video LLMs across medium (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) context video question answering benchmarks.
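The selection step itself reduces to a top-k over per-frame relevance scores; the sketch below assumes the scores have already been produced by the spatial or temporal supervision signals described above.

```python
import numpy as np

def select_frames(frame_scores, k):
    """Keep the k frames whose query-relevance scores are highest,
    returned in temporal order, instead of uniform sampling."""
    top = np.argsort(frame_scores)[-k:]
    return np.sort(top)

scores = np.array([0.10, 0.90, 0.20, 0.80, 0.05, 0.70])
print(select_frames(scores, k=3))   # -> [1 3 5]
```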


Poster #293
On the Consistency of Video Large Language Models in Temporal Comprehension

Minjoon Jung · Junbin Xiao · Byoung-Tak Zhang · Angela Yao

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. So we conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check if the model's responses align with this initial grounding as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video contents, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal verification tuning that explicitly accounts for consistency, and demonstrate significant improvements for both grounding and consistency. Our data and code will be publicly released.

Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, the problem of hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that the visual encoder of MLLMs often struggles to differentiate between video pairs that are visually distinct but semantically similar, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding tasks. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. VidHalluc consists of 5,002 videos, paired based on semantic similarity and visual differences, focusing on cases where hallucinations are most likely to occur. Through comprehensive testing, our experiments show that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency information from DINOv2 to reweight visual features during inference. Our results demonstrate that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and DINO-HEAL code will be publicly released.
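A training-free reweighting of patch features by a saliency map might look like the sketch below; the exact weighting used by DINO-HEAL is not given in the abstract, so the `1 + saliency` scaling is an assumption.

```python
import numpy as np

def saliency_reweight(patch_features, saliency):
    """Rescale per-patch visual features by a normalized spatial saliency
    map (e.g., derived from DINOv2 attention) at inference time, so that
    salient regions dominate the tokens passed to the MLLM."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return patch_features * (1.0 + s)[:, None]

feats = np.random.default_rng(0).standard_normal((196, 768))  # 14x14 patches
sal = np.random.default_rng(1).random(196)
print(saliency_reweight(feats, sal).shape)   # (196, 768)
```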


Poster #295
ReWind: Understanding Long Videos with Instructed Learnable Memory

Anxhelo Diko · Tinghuai Wang · Wassim Swaileh · Shiyan Sun · Ioannis Patras

Vision-Language Models (VLMs) are crucial for real-world applications that require understanding textual and visual information. However, existing VLMs face multiple challenges in processing long videos, including computational inefficiency, memory limitations, and difficulties maintaining coherent understanding across extended sequences. These issues stem partly from the quadratic scaling of self-attention w.r.t. number of tokens but also encompass broader challenges in temporal reasoning and information integration over long sequences. To address these challenges, we introduce ReWind, a novel two-stage framework for long video understanding. In the first stage, ReWind maintains a dynamic memory that stores and updates instruction-relevant visual information as the video unfolds. Memory updates leverage novel read and write mechanisms utilizing learnable queries and cross-attentions between memory contents and the input stream. This approach maintains low memory requirements as the cross-attention layers scale linearly w.r.t. number of tokens. In the second stage, the memory content guides the selection of a few relevant frames, represented at high spatial resolution, which are combined with the memory contents and fed into an LLM to generate the final answer. We empirically demonstrate ReWind's superiority in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13\% score gain and a +12\% accuracy improvement on the MovieChat-1K VQA dataset and an +8\% mIoU increase on Charades-STA for temporal grounding.


Poster #296
Highlight
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Kyungho Bae · Jinhyung Kim · Sihaeng Lee · Soonyoung Lee · Gunhee Lee · Jinwoo Choi

In this work, we tackle action-scene hallucination in Video Large Language Models (Video-LLMs), where models incorrectly predict actions based on the scene context or scenes based on observed actions. We observe that existing Video-LLMs often suffer from action-scene hallucination due to two main factors. First, existing Video-LLMs intermingle spatial and temporal features by applying an attention operation across all tokens. Second, they use the standard Rotary Position Embedding (RoPE), which causes the text tokens to overemphasize certain types of tokens depending on their sequential orders. To address these issues, we introduce MASH-VLM, Mitigating Action-Scene Hallucination in Video-LLMs through disentangled spatial-temporal representations. Our approach includes two key innovations: (1) DST-attention, a novel attention mechanism that disentangles the spatial and temporal tokens within the LLM by using masked attention to restrict direct interactions between the spatial and temporal tokens; (2) Harmonic-RoPE, which extends the dimensionality of the positional IDs, allowing the spatial and temporal tokens to maintain balanced positions relative to the text tokens. To evaluate the action-scene hallucination in Video-LLMs, we introduce the UNSCENE benchmark with 1,320 videos and 4,078 QA pairs. Extensive experiments demonstrate that MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark, as well as on existing video understanding benchmarks.


Poster #297
Number it: Temporal Grounding Videos like Flipping Manga

Yongliang Wu · Xinting Hu · Yuyang Sun · Yizhou Zhou · Wenbo Zhu · Fengyun Rao · Bernt Schiele · Xu Yang

Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9\% in mIoU for moment retrieval and 8.5\% in mAP for highlight detection. The code will be made publicly available.
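Stamping frame indices onto frames, the core of NumPro, can be done with Pillow as sketched below; the position, color, and font are illustrative choices rather than the paper's settings.

```python
from PIL import Image, ImageDraw

def number_frames(frames):
    """Overlay a unique numerical identifier on each frame so a Vid-LLM
    can 'read' timestamps like manga page numbers. `frames` is a list of
    PIL images; the overlay style here is purely illustrative."""
    out = []
    for i, frame in enumerate(frames):
        frame = frame.copy()
        draw = ImageDraw.Draw(frame)
        draw.text((10, 10), str(i), fill=(255, 0, 0))  # default bitmap font
        out.append(frame)
    return out

# Toy usage on blank frames.
frames = [Image.new("RGB", (128, 128), "black") for _ in range(4)]
numbered = number_frames(frames)
```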


Poster #298
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Andong Deng · Zhongpai Gao · Anwesa Choudhuri · Benjamin Planche · Meng Zheng · Bin Wang · Terrence Chen · Chen Chen · Ziyan Wu

Temporal awareness is essential for video large language models (LLMs) to understand and reason about events within long videos, enabling applications like dense video captioning and temporal video grounding in a unified system. However, the scarcity of long videos with detailed captions and precise temporal annotations limits their temporal awareness. In this paper, we propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos, enabling self-supervised training with abundant time-sensitive data. To enable sequence-to-time knowledge transfer, we introduce a novel time representation that unifies positional information across image sequences, clip sequences, and long videos. Experiments demonstrate the effectiveness of our method, achieving a 27.6\% improvement in F1 score and 44.8\% in CIDEr on the YouCook2 benchmark and a 14.7\% increase in recall on the Charades-STA benchmark compared to the baseline.
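The basic position-to-time conversion described above is simple to write down; the sketch below maps an item's index within a concatenated sequence to normalized pseudo start/end times, leaving out the paper's unified time representation.

```python
def position_to_timestamp(index, seq_len, duration=1.0):
    """Map the index of an image/clip inside a concatenated sequence to a
    pseudo (start, end) time in a virtual long video of length `duration`
    (normalized here). Rounding is only for readability."""
    start = index / seq_len * duration
    end = (index + 1) / seq_len * duration
    return round(start, 3), round(end, 3)

print(position_to_timestamp(2, 8))   # third item of eight -> (0.25, 0.375)
```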


Poster #299
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding

Zichen Liu · Kunlun Xu · Bing Su · Xu Zou · Yuxin Peng · Jiahuan Zhou

Pre-trained on tremendous image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model’s ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks demonstrate that STOP consistently achieves superior performance against state-of-the-art methods. Our code will be released soon.
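One way to decide where inter-frame temporal prompts go, following the frame-similarity criterion described above, is sketched below; the similarity threshold and feature source are assumptions.

```python
import numpy as np

def prompt_insertion_points(frame_feats, sim_threshold=0.9):
    """Return indices of the gaps between consecutive frames whose cosine
    similarity falls below a threshold, i.e. where temporal variance is
    high and an inter-frame temporal prompt would be inserted."""
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    sims = np.sum(f[:-1] * f[1:], axis=1)      # cosine similarity of adjacent frames
    return np.where(sims < sim_threshold)[0]   # gap i sits between frames i and i+1

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 512))          # stand-in for CLIP frame features
print(prompt_insertion_points(feats))
```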


Poster #300
SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction

Enrico Pallotta · Sina Mokhtarzadeh Azar · Shuai Li · Olga Zatsarynna · Jürgen Gall

Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP against other video prediction methods on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality, and demonstrate modality-agnostic generalization on SYNTHIA with semantic segmentation. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where depth conditioning is absent, demonstrating its robustness and potential for a wide range of applications.


Poster #301
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation

Hao Du · Bo Wu · Yan Lu · Zhendong Mao

Vision-language temporal alignment is a crucial capability for human recognition and cognition in real-world scenarios. Although existing works have designed methods to capture vision-language correlations, they are limited by benchmark issues, including biased temporal distributions, imprecise annotations, and inadequate compositionality. To achieve fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically focusing on their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary, we first present a statistical analysis of existing benchmarks and reveal the existing challenges from a decomposed perspective. To this end, we introduce $\textbf{SVLTA}$, a synthetic, large-scale, and compositional benchmark for vision-language temporal alignment derived via a well-designed and feasible control generation method within a simulation environment. The approach considers commonsense knowledge, process permutation, and constrained filtering, which generates reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through evaluations in temporal question answering, distributional shift sensitivity, and temporal alignment adaptation.


Poster #302
DTOS: Dynamic Time Object Sensing with Large Multimodal Model

Jirui Tian · Jinrong Zhang · Shenglan Liu · Luhao Xu · Zhixiong Huang · Gao Huang

Existing multimodal large language models (MLLMs) face significant challenges in Referring Video Object Segmentation (RVOS). We identify three critical challenges: (C1) insufficient quantitative representation of textual numerical data, (C2) repetitive and degraded response templates for spatiotemporal referencing, and (C3) loss of visual information in video sampling queries lacking textual guidance. To address these, we propose a novel framework, \textbf{Dynamic Time Object Sensing (DTOS)}, specifically designed for RVOS. To tackle (C1) and (C2), we introduce specialized tokens to construct multi-answer response templates, enabling regression of event boundaries and target localization. This approach improves the accuracy of numerical regression while mitigating the issue of repetitive degradation. To address (C3), we propose a Text-guided Clip Sampler (TCS) that selects video clips aligned with user instructions, preventing visual information loss and ensuring consistent temporal resolution. TCS is also applicable to Moment Retrieval tasks, with enhanced multimodal input sequences preserving spatial details and maximizing temporal resolution. DTOS demonstrates exceptional capability in flexibly localizing multiple spatiotemporal targets based on user-provided textual instructions. Extensive experiments validate the effectiveness of our approach, with DTOS achieving state-of-the-art performance in J&F scores: an improvement of +4.36 on MeViS, +4.48 on Ref-DAVIS17, and +3.02 on Ref-YT-VOS. Additionally, our TCS demonstrates exceptional performance in Moment Retrieval. All code, models, and datasets will be made publicly available.


Poster #303
Decoupled Motion Expression Video Segmentation

Hao Fang · Runmin Cong · Xiankai Lu · Xiaofei Zhou · Sam Kwong · Wei Zhang

Motion expression video segmentation aims to segment objects based on input motion descriptions. Compared with traditional referring video object segmentation, it focuses on motion and multi-object expressions and is more challenging. Previous works achieved it by simply injecting text information into the video instance segmentation (VIS) model. However, this requires retraining the entire model, and optimization is difficult. In this work, we propose DMVS, a simple structure built on top of an off-the-shelf query-based VIS model, emphasizing decoupling the task into video instance segmentation and motion expression understanding. First, we use a video instance segmenter as a means of distilling object-specific contexts into frame-level and video-level queries. Second, the two levels of queries interact with static and motion cues, respectively, to further encode visually enhanced motion expressions. Furthermore, we propose a novel query initialization strategy that uses video queries guided by classification priors to initialize motion queries, greatly reducing the difficulty of optimization. Without bells and whistles, DMVS achieves the state-of-the-art on the challenging MeViS dataset at a lower training cost. Extensive experiments verify the effectiveness and efficiency of our framework. The code will be publicly released.


Poster #304
EdgeTAM: On-Device Track Anything Model

Chong Zhou · Chenchen Zhu · Yunyang Xiong · Saksham Suri · Fanyi Xiao · Lemeng Wu · Raghuraman Krishnamoorthi · Bo Dai · Chen Change Loy · Vikas Chandra · Bilge Soran

On top of the Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains remarkable performance compared with previous methods, making it a foundation model for the video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential, so the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 86.2, 64.3, 69.8, and 67.1 J&F on DAVIS 2017, MOSE, SA-V val and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.


Poster #305
Highlight
Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity

Huaxin Zhang · Xiaohao Xu · Xiang Wang · Jialong Zuo · Xiaonan Huang · Changxin Gao · Shanjun Zhang · Li Yu · Nong Sang

How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model will be publicly available.
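A density-aware sampler driven by anomaly scores could be implemented as inverse-CDF sampling over the score mass, as sketched below; the uniform floor and frame budget are assumed constants, not the paper's values.

```python
import numpy as np

def anomaly_focused_sample(scores, budget, floor=0.05):
    """Pick `budget` frame indices with sampling density proportional to
    the per-frame anomaly score (plus a small uniform floor so normal
    regions are not ignored), via inverse-CDF sampling on the score mass."""
    density = scores + floor
    cdf = np.cumsum(density) / density.sum()
    targets = (np.arange(budget) + 0.5) / budget   # evenly spaced quantiles
    return np.searchsorted(cdf, targets)

scores = np.array([0.0, 0.0, 0.1, 0.9, 0.95, 0.2, 0.0, 0.0])
print(anomaly_focused_sample(scores, budget=4))    # concentrates around frames 3-5
```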


Poster #306
Highlight
MammAlps: A Multi-view Video Behavior Monitoring Dataset of Wild Mammals in the Swiss Alps

Valentin Gabeff · Haozhe Qi · Brendan Flaherty · Gencer Sumbul · Alexander Mathis · Devis Tuia

Monitoring wildlife is essential for ecology and especially in light of the increasing human impact on ecosystems. Camera traps have emerged as habitat-centric sensors enabling the study of wildlife-environment interactions at scale with minimal disturbance. While computer vision models are becoming more powerful for general video understanding tasks, they struggle comparatively with camera trap videos. This gap in terms of performance and applicability can be partly attributed to the lack of annotated video datasets. To advance research in wild animal behavior monitoring we present MammAlps, a multimodal and multi-view dataset of wildlife behavior monitoring from 9 camera-traps in the Swiss National Park. MammAlps contains over 14 hours of video with audio, 2D segmentation maps and 8.5 hours of individual tracks densely labeled for species and behavior. Behaviors were annotated at two levels of complexity: actions representing simple behaviors and high-level activities. Based on 6,135 single animal clips, we propose the first hierarchical and multimodal animal behavior recognition benchmark using audio, video and reference scene segmentation maps as inputs. To enable future ecology research, we also propose a second benchmark aiming at identifying activities, species, number of individuals and meteorological conditions from 397 multi-view and long-term ecological events, including false positive triggers. We advocate that both tasks are complementary and contribute to bridging the gap between machine learning and ecology. Code and data will be made accessible.


Poster #307
Boosting Point-Supervised Temporal Action Localization through Integrating Query Reformation and Optimal Transport

Mengnan Liu · Le Wang · Sanping Zhou · Kun Xia · Xiaolong Sun · Gang Hua

Point-supervised Temporal Action Localization poses significant challenges due to the difficulty of identifying complete actions with a single-point annotation per action. Existing methods typically employ Multiple Instance Learning, which struggles to capture global temporal context and requires heuristic post-processing. In research on fully-supervised tasks, DETR-based structures have effectively addressed these limitations. However, it is nontrivial to merely adapt DETR to this task, encountering two major bottlenecks: (1) how to integrate point label information into the model, and (2) how to select optimal decoder proposals for training in the absence of complete action segment annotations. To address these issues, we introduce an end-to-end framework integrating Query Reformation and Optimal Transport (QROT). Specifically, we encode point labels through a set of semantic consensus queries, enabling effective focus on action-relevant snippets. Furthermore, we integrate an optimal transport mechanism to generate high-quality pseudo labels. These pseudo-labels facilitate precise proposal selection based on the Hungarian algorithm, significantly enhancing localization accuracy in point-supervised settings. Extensive experiments on the THUMOS14 and ActivityNet-v1.3 datasets demonstrate that our method outperforms existing MIL-based approaches, offering more stable and accurate temporal action localization with point-level supervision. The code will be publicly available.
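The proposal-selection step based on the Hungarian algorithm can be illustrated with SciPy's linear_sum_assignment; the cost matrix, which QROT would derive from its optimal-transport pseudo-labels, is assumed given here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_proposals(cost):
    """Assign decoder proposals (rows) to pseudo-label segments (columns)
    by minimizing total matching cost with the Hungarian algorithm; how
    the cost is built (classification, temporal overlap, etc.) is assumed."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

cost = np.array([[0.2, 0.9, 0.8],
                 [0.7, 0.1, 0.9],
                 [0.8, 0.8, 0.3],
                 [0.9, 0.7, 0.6]])   # 4 proposals, 3 pseudo segments
print(match_proposals(cost))          # -> [(0, 0), (1, 1), (2, 2)]
```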


Poster #308
Semantic-guided Cross-Modal Prompt Learning for Skeleton-based Zero-shot Action Recognition

Anqi Zhu · Jingmin Zhu · James Bailey · Mingming Gong · Qiuhong Ke

Skeleton-based human action recognition has emerged as a promising approach due to its privacy preservation, robustness to visual challenges, and computational efficiency. While current methods predominantly rely on fully supervised learning, the practical necessity to recognize unseen actions has led to increased interest in zero-shot skeleton-based action recognition (ZSSAR). Existing ZSSAR approaches often rely on manually crafted action descriptions and movement assumptions, limiting their flexibility across diverse action classes. To overcome this, we introduce Semantic-guided Cross-Modal Prompt Learning (SCoPLe), a novel framework that replaces manual guidance with data-driven prompt learning for skeletal and textual knowledge refinement and alignment. Specifically, we introduce a dual-stream language prompting module that selectively preserves the original semantic context, effectively enhancing the prompting features. We also introduce a semantic-guided adaptive skeleton prompting module that learns joint-level prompts for skeleton features and incorporates an adaptive visual representation sampler that leverages text semantics to strengthen the cross-modal prompting interactions during skeleton-to-text embedding projection. Experimental results on the NTU-RGB+D 60, NTU-RGB+D 120, and PKU-MMD datasets demonstrate the state-of-the-art performance of our method in both ZSSAR and Generalized ZSSAR scenarios.


Poster #309
Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking

Hongkai Wei · YANG YANG · Shijie Sun · Mingtao Feng · Xiangyu Song · Qi Lei · Hongli Hu · Rong Wang · Huansheng Song · Naveed Akhtar · Ajmal Mian

Visual-Language Tracking (VLT) is emerging as a promising paradigm to bridge the human-machine performance gap. For single objects, VLT broadens the problem scope to text-driven video comprehension. Yet, this direction is still confined to 2D spatial extents, currently lacking the ability to deal with 3D tracking in the confines of monocular video. Unfortunately, advances in 3D tracking mainly rely on expensive sensor inputs, e.g., point clouds, depth measurements, radar. The absence in the literature of language counterparts for the outputs of these less widely accessible sensors also hinders the expansion of VLT to 3D tracking. Addressing that, we make the first attempt towards extending VLT to 3D tracking based on monocular video. We present a comprehensive framework, introducing (i) the Monocular-Video-based 3D Visual Language Tracking (Mono3DVLT) task, (ii) a large-scale dataset for the task, called Mono3DVLT-V2X, and (iii) a customized neural model for the task. Our dataset is carefully curated, leveraging a Large Language Model (LLM) followed by human verification, composing natural language descriptions for 79,158 video sequences aimed at single object tracking, providing 2D and 3D bounding box annotations. Our neural model, termed Mono3DVLT-MT, is the first targeted approach for the Mono3DVLT task. Comprising a pipeline of multi-modal feature extraction, visual-language encoding, a tracking decoder, and a tracking head, our model sets a strong baseline for the task on Mono3DVLT-V2X. Experimental results show that our method significantly outperforms existing techniques on the Mono3DVLT-V2X dataset. Our dataset and code are available in the Supplementary Material to ease reproducibility.


Poster #310
FSboard: Over 3 Million Characters of ASL Fingerspelling Collected via Smartphones

Manfred Georg · Garrett Tanzer · Esha Uboweja · Saad Hassan · Maximus Shengelia · Sam Sepah · Sean Forbes · Thad Starner

Progress in machine understanding of sign languages has been slow and hampered by limited data. In this paper, we present FSboard, an American Sign Language fingerspelling dataset situated in a mobile text entry use case, collected from 147 paid and consenting Deaf signers using Pixel 4A selfie cameras in a variety of environments. Fingerspelling recognition is an incomplete solution that is only one small part of sign language translation, but it could provide some immediate benefit to Deaf/Hard of Hearing signers while more broadly capable technology develops. At >3 million characters in length and >250 hours in duration, FSboard is the largest fingerspelling recognition dataset to date by a factor of >10x. As a simple baseline, we finetune ByT5-Small on 30 Hz MediaPipe Holistic landmark inputs and achieve 11.1% Character Error Rate (CER) on a test set with unique phrases and signers. This quality degrades gracefully when decreasing frame rate and excluding face/body landmarks---plausible optimizations to help with on-device performance.
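
For readers unfamiliar with the reported metric, Character Error Rate is simply the edit (Levenshtein) distance between the predicted and reference strings divided by the reference length. The snippet below is a generic reference implementation of that definition, not the authors' evaluation code.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit_distance(reference, hypothesis) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dynamic-programming row for edit distance
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(character_error_rate("HELLO", "HELO"))  # 0.2
```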


Poster #311
Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior

Chanhui Lee · Yeonghwan Song · Jeany Son

Data-free Universal Adversarial Perturbation (UAP) is an image-agnostic adversarial attack that deceives deep neural networks using a single perturbation generated solely from random noise, without any data priors. However, traditional data-free UAP methods often suffer from limited transferability due to the absence of semantic information in random noise. To address this, we propose a novel data-free universal attack approach that generates a pseudo-semantic prior recursively from the UAPs, enriching semantic content within the data-free UAP framework. Our method is based on the observation that UAPs inherently contain latent semantic information, enabling the generated UAP to act as an alternative data prior by capturing a diverse range of semantics through region sampling. We further introduce a sample reweighting technique to emphasize hard examples by focusing on samples that are less affected by the UAP. By leveraging the semantic information from the pseudo-semantic prior, we also incorporate input transformations, typically ineffective in data-free UAPs due to the lack of semantic content in random priors, to boost black-box transferability. Comprehensive experiments on ImageNet show that our PSP-UAP achieves a state-of-the-art average fooling rate by a substantial margin, significantly improves attack transferability across various CNN architectures compared to existing data-free UAP methods, and even surpasses data-dependent UAP methods.
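
As background on the general setting rather than the data-free PSP-UAP method itself, a universal adversarial perturbation is a single tensor, constrained to a small L-infinity ball, that is added to every input. The PyTorch sketch below shows one sign-gradient ascent step of the usual (data-dependent) formulation; the surrogate model, shapes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def update_uap(model, images, labels, delta, epsilon=10 / 255, step=1 / 255):
    """One ascent step on a shared perturbation `delta`, then projection
    back onto the L-infinity ball of radius `epsilon`."""
    delta = delta.clone().requires_grad_(True)
    logits = model((images + delta).clamp(0, 1))  # the same delta for every image
    F.cross_entropy(logits, labels).backward()    # increase the surrogate's loss
    with torch.no_grad():
        delta = (delta + step * delta.grad.sign()).clamp(-epsilon, epsilon)
    return delta.detach()

# Toy usage with a hypothetical surrogate classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images, labels = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
delta = update_uap(model, images, labels, torch.zeros(1, 3, 32, 32))
print(delta.abs().max())  # bounded by epsilon
```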


Poster #312
Detecting Adversarial Data Using Perturbation Forgery

Qian Wang · Chen Li · Yuchen Luo · Hefei Ling · Shijuan Huang · Ruoxi Jia · Ning Yu

As a defense strategy against adversarial attacks, adversarial detection aims to identify and filter out adversarial data from the data flow based on discrepancies in distribution and noise patterns between natural and adversarial data. Although previous detection methods achieve high performance in detecting gradient-based adversarial attacks, new attacks based on generative models with imbalanced and anisotropic noise patterns evade detection. Even worse, the significant inference time overhead and limited performance against unseen attacks make existing techniques impractical for real-world use. In this paper, we explore the proximity relationship among adversarial noise distributions and demonstrate the existence of an open covering for these distributions. By training on the open covering of adversarial noise distributions, a detector with strong generalization performance against various types of unseen attacks can be developed. Based on this insight, we heuristically propose Perturbation Forgery, which includes noise distribution perturbation, sparse mask generation, and pseudo-adversarial data production, to train an adversarial detector capable of detecting any unseen gradient-based, generative-based, and physical adversarial attacks. Comprehensive experiments conducted on multiple general and facial datasets, with a wide spectrum of attacks, validate the strong generalization of our method.


Poster #313
Stacking Brick by Brick: Aligned Feature Isolation for Incremental Face Forgery Detection

Jikang Cheng · Zhiyuan Yan · Ying Zhang · Li Hao · Jiaxin Ai · Qin Zou · Chen Li · Zhongyuan Wang

The rapid advancement of face forgery techniques has introduced a growing variety of forgeries. Incremental Face Forgery Detection (IFFD), which gradually adds new forgery data to fine-tune the previously trained model, has been introduced as a promising strategy to deal with evolving forgery methods. However, a naively trained IFFD model is prone to catastrophic forgetting when new forgeries are integrated, as treating all forgeries as a single ``Fake'' class in the Real/Fake classification can cause different forgery types to override one another, resulting in the forgetting of unique characteristics from earlier tasks and limiting the model's effectiveness in learning forgery specificity and generality. In this paper, we propose to stack the latent feature distributions of previous and new tasks brick by brick, $\textit{i.e.}$, achieving $\textbf{aligned feature isolation}$. In this manner, we aim to preserve learned forgery information and accumulate new knowledge by minimizing distribution overriding, thereby mitigating catastrophic forgetting. To achieve this, we first introduce Sparse Uniform Replay (SUR) to obtain representative subsets that can be treated as uniformly sparse versions of the previous global distributions. We then propose a Latent-space Incremental Detector (LID) that leverages SUR data to isolate and align distributions. For evaluation, we construct a more advanced and comprehensive benchmark tailored for IFFD. The leading experimental results validate the superiority of our method.


Poster #314
SapiensID: Foundation for Human Recognition

Minchul Kim · Dingqiang Ye · Yiyang Su · Feng Liu · Xiaoming Liu

Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely. This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest; (ii) Semantic Attention Head (SAH), an attention mechanism that learns pose-invariant representations by pooling features around key body parts; and (iii) a masked recognition model (MRM) that learns from variable token lengths. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions. The dataset, code and models will be released.


Poster #315
Spiking Transformer with Spatial-Temporal Attention

Donghyun Lee · Yuhang Li · Youngeun Kim · Shiting Xiao · Priyadarshini Panda

Spike-based Transformer presents a compelling and energy-efficient alternative to traditional Artificial Neural Network (ANN)-based Transformers, achieving impressive results through sparse binary computations. However, existing spike-based transformers predominantly focus on spatial attention while neglecting crucial temporal dependencies inherent in spike-based processing, leading to suboptimal feature representation and limited performance. To address this limitation, we propose Spiking Transformer with $\textbf{S}$patial-$\textbf{T}$emporal $\textbf{Atten}$tion ($\textbf{STAtten}$), a simple and straightforward architecture that efficiently integrates both spatial and temporal information in the self-attention mechanism. STAtten introduces a block-wise computation strategy that processes information in spatial-temporal chunks, enabling comprehensive feature capture while maintaining the same computational complexity as previous spatial-only approaches. Our method can be seamlessly integrated into existing spike-based transformers without architectural overhaul. Extensive experiments demonstrate that STAtten significantly improves the performance of existing spike-based transformers across both static and neuromorphic datasets, including CIFAR10/100, ImageNet, CIFAR10-DVS, and N-Caltech101.


Poster #316
STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks

Tianqing Zhang · Kairong Yu · Xian Zhong · Hongwei Wang · Qi Xu · Qiang Zhang

Spiking Neural Networks (SNNs) have gained significant attention due to their biological plausibility and energy efficiency, making them promising alternatives to Artificial Neural Networks (ANNs). However, the performance gap between SNNs and ANNs remains a substantial challenge hindering the widespread adoption of SNNs. In this paper, we propose a Spatial-Temporal Attention Aggregator SNN (STAA-SNN) framework, which dynamically focuses on and captures both spatial and temporal dependencies. First, we introduce a spike-driven self-attention mechanism specifically designed for SNNs. Additionally, we pioneer the incorporation of position encoding to integrate latent temporal relationships into the incoming features. For spatial-temporal information aggregation, we employ step attention to selectively amplify relevant features at different steps. Finally, we implement a time-step random dropout strategy to avoid local optima. As a result, STAA-SNN effectively captures both spatial and temporal dependencies, enabling the model to analyze complex patterns and make accurate predictions. The framework demonstrates exceptional performance across diverse datasets and exhibits strong generalization capabilities. Notably, STAA-SNN achieves state-of-the-art results on the neuromorphic dataset CIFAR10-DVS, together with accuracies of 97.14%, 82.05%, and 70.40% on the static datasets CIFAR-10, CIFAR-100, and ImageNet, respectively. Furthermore, our model achieves improvements ranging from 0.33% to 2.80% while using fewer time steps. The code for the model is available on GitHub.


Poster #317
Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention

Soikat Hasan Ahmed · Jan Finkbeiner · Emre Neftci

Event cameras offer high temporal resolution and dynamic range with minimal motion blur, making them promising for robust object detection. While Spiking Neural Networks (SNNs) on neuromorphic hardware are often considered for energy-efficient and low-latency event-based data processing, they often fall short of Artificial Neural Networks (ANNs) in accuracy and flexibility. Here, we introduce attention-based hybrid SNN-ANN backbones for event-based object detection to leverage the strengths of both SNN and ANN architectures. A novel attention-based SNN-ANN bridge module is proposed to capture sparse spatial and temporal relations from the SNN layers and convert them into dense feature maps for the ANN part of the backbone. Additionally, we present a variant that integrates DWConvLSTMs into the ANN blocks to capture slower dynamics. This multi-timescale network combines fast SNN processing for short timesteps with long-term dense RNN processing, effectively capturing both fast and slow dynamics. Experimental results demonstrate that our proposed method surpasses SNN-based approaches by significant margins, with results comparable to existing ANN- and RNN-based methods. Unlike ANN-only networks, the hybrid setup allows us to implement the SNN blocks on digital neuromorphic hardware to investigate the feasibility of our approach. Extensive ablation studies and implementation on neuromorphic hardware confirm the effectiveness of our proposed modules and architectural choices. Our hybrid SNN-ANN architectures pave the way for ANN-like performance at a drastically reduced parameter, latency, and power budget.

Poster #318

In this work, we focus on clothes-changing person re-identification (CC-ReID), which aims to recognize individuals under different clothing scenarios. Current CC-ReID approaches either concentrate on modeling body shape using additional modalities, including silhouette, pose, and body mesh, potentially causing the model to overlook other critical biometric traits such as gender, age, and style, or they incorporate supervision through additional labels that the model tries to disregard or emphasize, such as clothing or personal attributes. However, these annotations are discrete in nature and do not capture comprehensive descriptions. In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. Recognizing that image features inherently mix inseparable information, DIFFER introduces NBDetach, a mechanism that utilizes the separable nature of text descriptions as disentanglement supervision to partition the feature space into distinct subspaces, enabling the effective separation of identity-related features from non-biometric features through gradient reversal. We evaluate DIFFER on four different benchmark datasets (LTCC, PRCC, CelebReID-Light, and CCVID) to demonstrate its effectiveness and provide state-of-the-art performance across all the benchmarks. DIFFER consistently outperforms the baseline method, with improvements in top-1 accuracy of 3.6\% on LTCC, 3.4\% on PRCC, 2.5\% on CelebReID-Light, and 1\% on CCVID. The code will be made publicly available.
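
The abstract attributes the separation of identity from non-biometric features to gradient reversal. A gradient-reversal layer is a standard adversarial-learning building block (shown here as a generic PyTorch sketch, not DIFFER's released code): it is the identity in the forward pass and flips and scales gradients in the backward pass.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Features feeding an auxiliary (e.g. non-biometric) head receive reversed
# gradients, pushing the backbone to discard that information.
features = torch.randn(4, 128, requires_grad=True)
grad_reverse(features, lam=0.5).sum().backward()
print(features.grad[0, :3])  # all entries equal -0.5
```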


Poster #319
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images

Kaiyu Li · Ruixun Liu · Xiangyong Cao · Xueru Bai · Feng Zhou · Deyu Meng · Wang Zhi

Current remote sensing semantic segmentation methods are mostly built on the closed-set assumption, meaning that the model can only recognize pre-defined categories that exist in the training set. However, in practical Earth observation, there are countless unseen categories, and manual annotation is impractical. To address this challenge, we make a first attempt to introduce training-free open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, because remote sensing images are sensitive to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle these issues, we propose a simple and universal upsampler, i.e., SimFeatUp, to restore lost spatial information in deep features. Specifically, SimFeatUp only needs to learn from a few unlabeled images and can upsample arbitrary remote sensing image features. Furthermore, based on the observed abnormal response of patch tokens to the [CLS] token in CLIP, we propose a simple subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets across 4 tasks, including semantic segmentation, building extraction, road detection, and flood detection. Our method achieves average improvements of 5.8\%, 8.2\%, 4.0\%, and 15.3\% over state-of-the-art methods on the 4 tasks.
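
The 'simple subtraction operation' is described only at a high level; one plausible reading, given purely as an assumption-laden sketch rather than the authors' implementation, is to subtract a scaled copy of the global [CLS] embedding from every patch token.

```python
import torch

def remove_global_bias(patch_tokens: torch.Tensor,
                       cls_token: torch.Tensor,
                       lam: float = 0.3) -> torch.Tensor:
    """Subtract a scaled copy of the global [CLS] token from each patch token.

    patch_tokens: (N, D) CLIP patch embeddings for one image
    cls_token:    (D,)   the corresponding [CLS] embedding
    lam:          subtraction strength (a free hyperparameter in this sketch)
    """
    return patch_tokens - lam * cls_token.unsqueeze(0)

tokens, cls = torch.randn(196, 512), torch.randn(512)
print(remove_global_bias(tokens, cls).shape)  # torch.Size([196, 512])
```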


Poster #320
Mixture of Submodules for Domain Adaptive Person Search

Minsu Kim · Seungryong Kim · Kwanghoon Sohn

Existing techniques for domain adaptive person search commonly utilize a unified framework for jointly localizing and identifying persons across domains. This framework, however, inevitably suffers from the gradient conflict problem, particularly in cross-domain scenarios with contradictory objectives, as the unified framework employs shared parameters to simultaneously address person detection and re-identification tasks across the domains. To overcome this, we present a novel mixture of submodules framework, dubbed MoS, that dynamically modulates the combination of submodules depending on the specific task, performing person detection and re-identification separately. We further design mixtures of submodules that vary depending on the domain, enabling domain-specific knowledge transfer. Specifically, we decompose the main model into several submodules and employ diverse mixtures of submodules that vary with the tasks and domains through a conditional routing policy. In addition, we present counterpart domain sample generation, which synthesizes augmented samples and uses them to learn domain-invariant representations for person re-identification through contrastive domain alignment. We conduct experiments to demonstrate the effectiveness of our MoS over existing domain adaptive person search methods and provide ablation studies.


Poster #321
An Image-like Diffusion Method for Human-Object Interaction Detection

Xiaofei Hui · Haoxuan Qu · Hossein Rahmani · Jun Liu

Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast ``HOI images''. Extensive experiments demonstrate the efficacy of our framework.


Poster #322
Free Lunch Enhancements for Multi-modal Crowd Counting

Haoliang Meng · Xiaopeng Hong · Zhengqin Lai · Miao Shang

This paper addresses multi-modal crowd counting with a novel 'free lunch' training enhancement strategy that requires no additional data, parameters, or increased inference complexity. First, we introduce a cross-modal alignment technique as a plug-in post-processing step for the pre-trained backbone network, enhancing the model’s ability to capture shared information across modalities. Second, we incorporate a regional density supervision mechanism during the fine-tuning stage, which differentiates features in regions with varying crowd densities. Extensive experiments on three multi-modal crowd counting datasets validate our approach, making it the first to achieve an MAE below 10 on RGBT-CC.


Poster #323
RORem: Training a Robust Object Remover with Human-in-the-Loop

Ruibin Li · Tao Yang · Song Guo · Lei Zhang

Despite the significant advancements, existing object removal methods struggle with incomplete removal, incorrect content synthesis and blurry synthesized regions, resulting in low success rates. Such issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to in-paint the masked regions, leading to ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets to train an initial object removal model for generating removal samples, and then utilize human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate the following training data generation process. By iterating this process for several rounds, we finally obtain a substantial object removal dataset with over 200K pairs. Fine-tuning the pre-trained stable diffusion model with this dataset, we obtain our RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. Particularly, RORem improves the object removal success rate over previous methods by more than 18\%. The dataset, source code and trained model will be released.


Poster #324
Highlight
Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation

Hao Zhu · Yan Zhu · Jiayu Xiao · Tianxiang Xiao · Yike Ma · Yucheng Zhang · Feng Dai

Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel-level masks is exceptionally complex and time-consuming in SITS. This paper embraces the weakly supervised paradigm (i.e., only image-level categories available) to liberate the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to several challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address the above difficulties, we propose a novel method, termed exploring space-time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class-relative regions. Besides, we leverage the temporal-to-class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space-time perceptive clues, we derive clue-based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact-generated masks achieves 95% of its fully supervised performance, showing the bright promise of the weakly supervised paradigm in the crop mapping scenario. All code will be publicly available in the future.


Poster #325
MaSS13K: A Matting-level Semantic Segmentation Benchmark

Chenxi Xie · Minghan LI · Hui Zeng · Jun Luo · Lei Zhang

High-resolution semantic segmentation is essential for applications such as image editing, bokeh imaging, and AR/VR. Unfortunately, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. MaSS13K provides high-quality mask annotations for a large number of objects, which are grouped into seven categories: human, vegetation, ground, sky, water, building, and others. MaSS13K features precise masks, with an average mask complexity 20-50 times higher than existing semantic segmentation datasets. We consequently present a method specifically designed for high-resolution semantic segmentation, namely MaSSFormer, which employs an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages, aiming to produce high-resolution masks with minimal computational cost. Finally, we propose a new learning paradigm, which integrates the high-quality masks of the seven given categories with pseudo labels from new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. Our proposed MaSSFormer is comprehensively evaluated on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model can facilitate research on high-resolution and high-quality semantic segmentation. Datasets and codes will be released.


Poster #326
Insightful Instance Features for 3D Instance Segmentation

Wonseok Roh · Hwanhee Jung · Giljoo Nam · Dong In Lee · Hyeongcheol Park · Sang Ho Yoon · Jungseock Joo · Sangpil Kim

Recent 3D Instance Segmentation methods typically encode hundreds of instance-wise candidates with instance-specific information in various ways and refine them into final masks. However, they have yet to fully explore the benefit of these candidates. They overlook the valuable cues encoded in multiple candidates that represent different parts of the same instance, resulting in fragments. Also, they often fail to capture the precise spatial range of 3D instances, primarily due to inherent noise from sparse and unordered point clouds. In this work, to address these challenges, we propose a novel instance-wise knowledge enhancement approach. We first introduce an Instance-wise Knowledge Aggregation to associate scattered single-instance details by optimizing correlations among candidates representing the same instance. Moreover, we present an Instance-wise Structural Guidance to enhance the spatial understanding of candidates using structural cues from ambiguity-reduced features. Here, we utilize a simple yet effective truncated singular value decomposition algorithm to minimize inherent noise in 3D features. In our extensive experiments on large-scale benchmarks, ScanNetV2, ScanNet200, S3DIS, and STPLS3D, our method outperforms existing works. We also demonstrate the effectiveness of our modules based on both kernel and transformer architectures.
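
The 'simple yet effective truncated singular value decomposition' for suppressing feature noise can be illustrated generically: keep only the top-k singular components of a feature matrix and discard the rest. The sketch below is the textbook low-rank projection, with the rank chosen arbitrarily, and is not tied to the paper's specific module.

```python
import torch

def truncated_svd_denoise(features: torch.Tensor, rank: int = 16) -> torch.Tensor:
    """Project an (N, D) feature matrix onto its top-`rank` singular directions,
    discarding low-energy components that tend to carry noise."""
    U, S, Vh = torch.linalg.svd(features, full_matrices=False)
    k = min(rank, S.numel())
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

feats = torch.randn(1024, 256)             # e.g. per-candidate or per-point features
print(truncated_svd_denoise(feats).shape)  # torch.Size([1024, 256])
```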


Poster #327
Convex Combination Star Shape Prior for Data-driven Image Semantic Segmentation

Xinyu Zhao · Jun Xie · Shengzhe Chen · Jun Liu

Multi-center star shape is a prevalent object shape feature, which has proven effective in model-based image segmentation methods. However, the shape field function induced by the multi-center star shape is non-smooth, and directly applying it to the design of data-driven image segmentation network architectures may lead to instability in backpropagation. This paper proposes a convex combination star (CCS) shape, which possesses multi-center star shape properties and has the advantage of effectively controlling the shape of the region through a smooth field function. The sufficient condition of the proposed CCS shape can be incorporated into the design of image segmentation neural network structures through a bridge between the variational segmentation model and the activation function of the data-driven method. Taking the Segment Anything Model (SAM) and its improved version as backbone networks, we show that a segmentation network architecture with CCS shape properties can greatly improve the accuracy of segmentation results.


Poster #328
InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

Haijie Li · Yanmin Wu · Jiarui Meng · Qiankun Gao · Zhiyao Zhang · Ronggang Wang · Jian Zhang

3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies.
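
Farthest point sampling, named as part of the bottom-up instance aggregation, is a standard routine; a minimal NumPy version (not the authors' implementation, which would typically run on GPU) is shown below.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedily pick `k` indices so each new point is farthest from those chosen.

    points: (N, 3) array of 3D positions (e.g. Gaussian centers)
    returns: (k,) array of selected indices
    """
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(points.shape[0]))]
    dist = np.full(points.shape[0], np.inf)
    for _ in range(k - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

pts = np.random.rand(5000, 3)
print(farthest_point_sampling(pts, 8))
```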


Poster #329
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Andrew Szot · Bogdan Mazoure · Omar Attia · Aleksei Timofeev · Harsh Agrawal · R Devon Hjelm · Zhe Gan · Zsolt Kira · Alexander Toshev

We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.


Poster #330
Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Junha Lee · Chunghyun Park · Jaesung Choe · Yu-Chiang Frank Wang · Jan Kautz · Minsu Cho · Chris Choy

We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models (VLM), we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs—significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.


Poster #331
UnCommon Objects in 3D

Xingchen Liu · Piyush Tayal · Jianyuan Wang · Jesus Zarzar · Tom Monnier · Konstantinos Tertikas · Jiali Duan · Antoine Toisoul · Jason Y. Zhang · Natalia Neverova · Andrea Vedaldi · Roman Shapovalov · David Novotny

We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full-360$^{\circ}$ coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.


Poster #332
PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding

Hongjia Zhai · Hai Li · Zhenzhe Li · Xiaokun Pan · Yijia He · Guofeng Zhang

Recently, 3D Gaussian Splatting (3DGS) has shown encouraging performance on open vocabulary scene understanding tasks. However, previous methods cannot distinguish 3D instance-level information and usually predict a heatmap between the scene feature and the text query. In this paper, we propose PanoGS, a novel and efficient 3D panoptic open vocabulary scene understanding approach. Technically, to learn accurate 3D language features that can scale to large indoor scenarios, we adopt pyramid tri-planes to model the latent continuous parametric feature space and use a 3D feature decoder to regress the multi-view fused 2D feature cloud. Besides, we propose language-guided graph cuts that synergistically leverage reconstructed geometry and learned language cues to group 3D Gaussian primitives into a set of super-primitives. To obtain 3D-consistent instances, we perform graph-clustering-based segmentation with SAM-guided edge affinity computation between different super-primitives. Extensive experiments on widely used datasets show better or competitive performance on 3D panoptic open vocabulary scene understanding.


Poster #333
Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding

Yan Wang · Baoxiong Jia · Ziyu Zhu · Siyuan Huang

Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks.
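
Contrastive alignment between point-level features and entity (or language) embeddings is typically implemented with an InfoNCE-style objective; the PyTorch sketch below shows that generic symmetric form only, not MPEC's specific masking or cross-view construction.

```python
import torch
import torch.nn.functional as F

def info_nce(point_feats, entity_feats, temperature=0.07):
    """Symmetric InfoNCE between matched rows of two (B, D) embedding sets."""
    p = F.normalize(point_feats, dim=-1)
    e = F.normalize(entity_feats, dim=-1)
    logits = p @ e.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, targets)          # points -> entities
                  + F.cross_entropy(logits.t(), targets))   # entities -> points

loss = info_nce(torch.randn(32, 256), torch.randn(32, 256))
print(float(loss))
```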


Poster #334
Highlight
Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration

JUNSEONG KIM · GeonU Kim · Kim Yu-Ji · Yu-Chiang Frank Wang · Jaesung Choe · Tae-Hyun Oh

We introduce Dr. Splat, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting. Unlike existing language-embedded 3DGS methods, which rely on a rendering process, our method directly associates language-aligned CLIP embeddings with 3D Gaussians for holistic 3D scene understanding. The key to our method is a language feature registration technique in which CLIP embeddings are assigned to the dominant Gaussians intersected by each pixel-ray. Moreover, we integrate Product Quantization (PQ), trained on general large-scale image data, to compactly represent embeddings without per-scene optimization. Experiments demonstrate that our approach significantly outperforms existing approaches on 3D perception benchmarks, such as open-vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection tasks. Code will be publicly available if accepted.
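
Product Quantization compresses an embedding by splitting it into subvectors and storing only the index of the nearest codeword per subvector. The NumPy sketch below shows generic PQ encoding and decoding with pre-trained codebooks (codebook training is omitted, and the sizes are illustrative, not the paper's configuration).

```python
import numpy as np

def pq_encode(x, codebooks):
    """x: (D,) vector; codebooks: (M, K, D // M). Returns M codeword indices."""
    M, K, d = codebooks.shape
    sub = x.reshape(M, d)
    return np.array([np.linalg.norm(codebooks[m] - sub[m], axis=1).argmin()
                     for m in range(M)])

def pq_decode(codes, codebooks):
    """Approximate the original vector by concatenating the chosen codewords."""
    return np.concatenate([codebooks[m, c] for m, c in enumerate(codes)])

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(8, 256, 64))  # 8 subspaces, 256 codewords, 512-D total
emb = rng.normal(size=512)                 # e.g. a language-aligned embedding
codes = pq_encode(emb, codebooks)          # 8 small integers instead of 512 floats
print(codes.shape, pq_decode(codes, codebooks).shape)  # (8,) (512,)
```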


Poster #335
Highlight
Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning

Hanxun Yu · Wentong Li · Song Wang · Junbo Chen · Jianke Zhu

Despite encouraging progress in 3D scene understanding, it remains challenging to develop an effective Large Multi-modal Model (LMM) that is capable of understanding and reasoning in complex 3D environments. Most previous methods typically encode 3D point and 2D image features separately, neglecting interactions between 2D semantics and 3D object properties, as well as the spatial relationships within the 3D environment. This limitation not only hinders comprehensive representations of the 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified $\textbf{Inst}$ance-aware $\textbf{3D}$ $\textbf{L}$arge $\textbf{M}$ulti-modal $\textbf{M}$odel (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. To obtain fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject multi-view 2D semantics into their corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects. Additionally, we perform end-to-end multi-task instruction tuning without subsequent task-specific fine-tuning. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods across 3D scene understanding, reasoning and grounding tasks. Our full implementation will be publicly available.


Poster #336
Highlight
Universal Scene Graph Generation

Shengqiong Wu · Hao Fei · Tat-seng Chua

Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce the Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks: cross-modal object alignment and out-of-domain challenges. We design USG-Par with a modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrastive learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and that our USG-Par achieves higher efficacy and performance.


Poster #337
DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering

Jingzhou Luo · Yang Liu · weixing chen · Zhen Li · Yaowei Wang · Guanbin Li · Liang Lin

3D Question Answering (3D QA) requires the model to comprehensively understand its situated 3D scene described by the text, then reason about its surrounding environment and answer a question under that situation. However, existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images. Moreover, due to the inherent noise in camera poses and complex occlusions, significant feature degradation and reduced feature robustness arise when aligning 3D point clouds with multi-view images. In this paper, we propose a Dual-vision Scene Perception Network (DSPNet) that comprehensively integrates multi-view and point cloud features to improve robustness in 3D QA. Our Text-guided Multi-view Fusion (TGMF) module prioritizes image views that closely match the semantic content of the text. To adaptively fuse back-projected multi-view images with point cloud features, we design the Adaptive Dual-vision Perception (ADVP) module, enhancing 3D scene comprehension. Additionally, our Multimodal Context-guided Reasoning (MCGR) module facilitates robust reasoning by integrating contextual information across visual and linguistic modalities. Experimental results on the SQA3D and ScanQA datasets demonstrate the superiority of our DSPNet.


Poster #338
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Shijie Zhou · Hui Ren · Yijia Weng · Shuwang Zhang · Zhen Wang · Dejia Xu · Zhiwen Fan · Suya You · Zhangyang Wang · Leonidas Guibas · Achuta Kadambi

Recent advancements in 2D and multi-modal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality of 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The ``X'' in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single, task-dependent representation. Additionally, to the best of our knowledge, we are the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.

Poster #339

We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model pre-trained on a large-scale 3D-language dataset for embodied tasks. Our g3D-LF processes posed RGB-D images from agents to encode feature fields for: 1) novel view representation predictions from any position in the 3D scene; 2) generation of BEV maps centered on the agent; 3) querying targets using multi-granularity language within the above-mentioned representations. Our representation can be generalized to unseen environments, enabling real-time construction and dynamic updates. By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multi-scale encoders, our g3D-LF produces representations at different scales and perspectives, aligned with multi-granularity language, via multi-level contrastive learning. Furthermore, we prepare a large-scale 3D-language dataset to align the representations of the feature fields with language. Extensive experiments on Vision-and-Language Navigation under both Panorama and Monocular settings, Zero-shot Object Navigation, and Situated Question Answering tasks highlight the significant advantages and effectiveness of our g3D-LF for embodied tasks. Our source code and dataset will be made open-source upon paper acceptance.
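
Volume rendering latent features along a ray follows the same alpha-compositing rule used for color in NeRF-style models; the minimal sketch below shows that generic computation (densities, step sizes, and feature dimensions are made up for illustration and do not reflect g3D-LF's actual architecture).

```python
import torch

def render_features(sigmas, feats, deltas):
    """Composite per-sample latent features along one ray with volume-rendering weights.

    sigmas: (S,) densities, feats: (S, D) latent features, deltas: (S,) step sizes
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)            # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alphas[:-1]]), dim=0)
    weights = alphas * trans                              # standard compositing weights
    return (weights.unsqueeze(-1) * feats).sum(dim=0)     # rendered (D,) feature

f = render_features(torch.rand(64), torch.randn(64, 32), torch.full((64,), 0.02))
print(f.shape)  # torch.Size([32])
```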


Poster #340
Magma: A Foundation Model for Multimodal AI Agents

Jianwei Yang · Reuben Tan · Qianhui Wu · Ruijie Zheng · Baolin Peng · Yongyuan Liang · Yu Gu · Mu Cai · Seonghyeon Ye · Joel Jang · Yuquan Deng · Jianfeng Gao

This paper presents a new foundation model, called Magma, for multimodal AI agents in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that the former not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial intelligence) to complete agentic tasks ranging from UI navigation to robot manipulation. Magma is pre-trained on large amounts of heterogeneous VL datasets, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set of Marks (SoM) and the object movements (e.g., the trace of a robotic arm) in videos are labeled by Trace of Mark (ToM). Evaluation shows that SoM and ToM facilitate acquisition of spatial intelligence from training data. Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are tailored specifically to these tasks. On VL tasks, Magma also compares favorably to popular VL models that are trained on much larger datasets.


Poster #341
Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning

Jing Zhu · Yuhang Zhou · Shengyi Qian · Zhongmou He · Tong Zhao · Neil Shah · Danai Koutra

Graph machine learning has made significant strides in recent years, yet the integration of visual information with graph structures remains an underexplored area. To address this critical gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), a pioneering benchmark that incorporates both visual and textual information into graph learning tasks. MM-GRAPH extends beyond existing text-attributed graph benchmarks, offering a more comprehensive evaluation framework for multimodal graph neural networks (GNNs). Our benchmark comprises seven diverse datasets of varying scales, designed to assess graph learning algorithms across different tasks in real-world scenarios. These datasets feature rich multimodal node attributes, including visual data, which enables a more holistic evaluation of GNN performance in complex, multimodal environments. To support advancements in this emerging field, we provide an extensive empirical study on the performance of various graph learning frameworks when presented with features from multiple modalities, particularly emphasizing the impact of visual information. This study offers valuable insights into the challenges and opportunities of integrating visual data into graph learning algorithms.


Poster #342
Highlight
Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection

Zihao Zhang · Aming Wu · Yahong Han

Recently, the task of Single-Domain Generalized Object Detection (Single-DGOD) has been proposed, aiming to generalize a detector to multiple unknown domains never seen during training. Due to the unavailability of target-domain data, some methods leverage the multimodal capabilities of vision-language models, using textual prompts to estimate cross-domain information and enhance the model's generalization capability. These methods typically use a single textual prompt, often referred to as the one-step prompt method. However, when dealing with complex styles such as the combination of rain and night, we observe that the performance of the one-step prompt method tends to be relatively weak. The reason may be that many scenes incorporate not just a single style but a combination of multiple styles, and the one-step prompt method may not effectively synthesize combined information involving various styles. To address this limitation, we propose a new method, i.e., Style Evolving along Chain-of-Thought, which aims to progressively integrate and expand style information along the chain of thought, enabling the continual evolution of styles. Specifically, by progressively refining style descriptions and guiding the diverse evolution of styles, this approach enables more accurate simulation of various style characteristics and helps the model gradually learn and adapt to subtle differences between styles. Additionally, it exposes the model to a broader range of style features with different data distributions, thereby enhancing its generalization capability in unseen domains. The significant performance gains on five adverse-weather scenarios and the Real-to-Art benchmark demonstrate the superiority of our method.


Poster #343
Highlight
Olympus: A Universal Task Router for Computer Vision Tasks

Yuanze Lin · Yunsheng Li · Dongdong Chen · Weijian Xu · Ronald Clark · Philip H.S. Torr

We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks.


Poster #344
Highlight
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning

Bardia Safaei · Faizan Siddiqui · Jiacong Xu · Vishal M. Patel · Shao-Yuan Lo

Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, a major challenge often overlooked is that generating instructions from unlabeled images for VIT is highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which prevents users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task, selecting the most representative images within the budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. Code will be made available.
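
The per-task selection step (cluster image features, then pick representatives within a budget) can be approximated with off-the-shelf k-means; the sketch below returns the image nearest to each centroid, which is one plausible notion of 'representative' and not necessarily the paper's exact criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(features: np.ndarray, budget: int, seed: int = 0):
    """Cluster (N, D) image features into `budget` groups and return, for each
    cluster, the index of the image closest to its centroid."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(features)
    selected = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[dists.argmin()]))
    return selected

feats = np.random.rand(1000, 768)  # e.g. frozen vision-encoder features for one task
print(select_representatives(feats, budget=20))
```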


Poster #345
Is `Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning

JiHyeok Jung · EunTae Kim · SeoYeon Kim · Joo Ho Lee · Bumsoo Kim · Buru Chang

Multimodal large language models (MLLMs) act as essential interfaces, connecting humans with AI technologies in multimodal applications. However, current MLLMs face challenges in accurately interpreting object orientation in images due to inconsistent orientation annotations in training data, hindering the development of a coherent orientation understanding. To overcome this, we propose egocentric instruction tuning, which aligns MLLMs' orientation understanding with the user’s perspective, based on a consistent annotation standard derived from the user’s egocentric viewpoint. We first generate egocentric instruction data that leverages MLLMs' ability to recognize object details and applies prior knowledge for orientation understanding. Using this data, we perform instruction tuning to enhance the model’s capability for accurate orientation interpretation. In addition, we introduce EgoOrientBench, a benchmark that evaluates MLLMs' orientation understanding across three tasks using images collected from diverse domains. Experimental results on this benchmark show that egocentric instruction tuning significantly improves orientation understanding without compromising overall MLLM performance. The instruction data and benchmark dataset are available on our project page at \url{https://anonymous.4open.science/r/EgocentricInstructionTuning-E189}.


Poster #346
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man · De-An Huang · Guilin Liu · Shiwei Sheng · Shilong Liu · Liangyan Gui · Jan Kautz · Yu-Xiong Wang · Zhiding Yu

Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective.


Poster #347
Model Diagnosis and Correction via Linguistic and Implicit Attribute Editing

Xuanbai Chen · Xiang Xu · Zhihua Li · Tianchen Zhao · Pietro Perona · Qin ZHANG · Yifan Xing

How can we troubleshoot a deep visual model, i.e., understand why it makes certain mistakes and take action to correct its behavior? We design a Model Diagnosis and Correction system (MDC), an automated framework that analyzes the pattern of errors, proposes candidate causal attributes, conducts hypothesis testing via attribute editing, and ultimately generates counterfactual training samples to improve the performance of the model. Unlike previous methods, in addition to linguistic attributes, our method also incorporates analysis of implicit causal attributes, i.e., those that cannot be accurately described by language. To achieve this, we propose an image editing module capable of leveraging both implicit and linguistic attributes to generate counterfactual images depicting error patterns and to experimentally validate causal relationships. Lastly, we enrich the training set with synthetic samples depicting verified causal attributes and retrain the model, further boosting accuracy and robustness. Extensive experiments on fine-grained classification and face security applications demonstrate the superiority of our approach in model diagnosis and correction. Specifically, we achieve an average relative improvement of 62.01\% in HTER for the face security application over state-of-the-art methods.


Poster #348
Foundations of the Theory of Performance-Based Ranking

Sébastien Piérard · Anaïs Halin · Anthony Cioppa · Adrien Deliege · Marc Van Droogenbroeck

Ranking entities such as algorithms, devices, methods, or models based on their performances, while accounting for application-specific preferences, is a challenge. To address this challenge, we establish the foundations of a universal theory for performance-based ranking. First, we introduce a rigorous framework built on top of both the probability and order theories. Our new framework encompasses the elements necessary to (1) define and manipulate performances, (2) express which performances are worse than or equivalent to others, (3) model tasks through a variable called satisfaction, (4) consider properties of the evaluation, (5) define scores, and (6) specify application-specific preferences through a variable called importance. On top of this framework, we propose the first axiomatic definition of performance orderings and performance-based rankings. Then, we introduce a universal parametric family of scores, called ranking scores, that can be used to establish rankings satisfying our axioms, while considering application-specific preferences. Finally, we show, in the case of two-class classification, that the family of ranking scores encompasses well-known performance scores, including the accuracy, the true positive rate (recall), the positive predictive value (precision), Jaccard’s coefficient (intersection over union), and Fβ scores. However, we also show that some other scores commonly used to compare classifiers are unsuitable to derive performance orderings satisfying the axioms. Therefore, this paper provides the computer vision and machine learning communities with a rigorous framework for evaluating and ranking entities.
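
For concreteness, the two-class scores the theory recovers are all simple functions of the confusion-matrix counts; the snippet below merely restates those textbook definitions and is independent of the paper's ranking framework.

```python
def two_class_scores(tp: int, fp: int, fn: int, tn: int, beta: float = 1.0) -> dict:
    """Classic performance scores computed from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    recall    = tp / (tp + fn)        # true positive rate
    precision = tp / (tp + fp)        # positive predictive value
    jaccard   = tp / (tp + fp + fn)   # intersection over union
    f_beta    = ((1 + beta ** 2) * precision * recall
                 / (beta ** 2 * precision + recall))
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "jaccard": jaccard, f"F{beta:g}": f_beta}

print(two_class_scores(tp=40, fp=10, fn=5, tn=45))
```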


Poster #349
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues

Sagar Soni · Akshay Dudhane · Hiyam Debary · Mustansar Fiaz · Muhammad Akhtar Munir · Muhammad Sohail Danish · Paolo Fraccaro · Campbell D Watson · Levente Klein · Fahad Shahbaz Khan · Salman Khan

Automated analysis of vast Earth observation data via interactive Vision-Language Models (VLMs) can unlock new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs do not perform well on Remote Sensing data, while the recent Geo-spatial VLMs remain restricted to a fixed resolution and few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction tuning dataset comprising over 11.11M instruction pairs covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared. Furthermore, EarthDial handles bi-temporal and multi-temporal sequence analysis for applications like change detection. Our extensive experimental results on 37 downstream applications demonstrate that EarthDial outperforms existing generic and domain-specific models, achieving better generalization across various EO tasks. Our codes and data will be publicly released.


Poster #350
EMOE: Modality-Specific Enhanced Dynamic Emotion Experts

Yiyang Fang · Wenke Huang · Guancheng Wan · Kehua Su · Mang Ye

Multimodal Emotion Recognition (MER) aims to predict human emotions by leveraging multiple modalities, such as vision, acoustics, and language. However, due to the heterogeneity of these modalities, MER faces two key challenges: modality balance dilemma and modality specialization disappearance. Existing methods often overlook the varying importance of modalities across samples in tackling the modality balance dilemma. Moreover, mainstream decoupling methods, while preserving modality-specific information, often neglect the predictive capability of unimodal data. To address these, we propose a novel model, Modality-Specific Enhanced Dynamic Emotion Experts (EMOE), consisting of: (1) Mixture of Modality Experts for dynamically adjusting modality importance based on sample features, and (2) Unimodal Distillation to retain single-modality predictive ability within fused features. EMOE enables adaptive fusion by learning a unique modality weight distribution for each sample, enhancing multimodal predictions with single-modality predictions to balance invariant and specific features in emotion recognition. Experimental results on benchmark datasets show that EMOE achieves superior or comparable performance to state-of-the-art methods. Additionally, we extend EMOE to Multimodal Intent Recognition (MIR), further demonstrating its effectiveness and versatility.
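As a rough illustration of the two ingredients described above (not the authors' EMOE implementation; dimensions, projection sizes, and the fusion rule are assumptions), the sketch below uses a gating network to produce per-sample modality weights and keeps unimodal heads whose predictions complement the fused one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityGatedFusion(nn.Module):
    """Minimal sketch (not EMOE itself): a gating network assigns per-sample
    modality weights, and unimodal heads retain single-modality predictive ability."""
    def __init__(self, dims, n_classes):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, 256) for m, d in dims.items()})
        self.gate = nn.Linear(256 * len(dims), len(dims))        # per-sample expert weights
        self.uni_heads = nn.ModuleDict({m: nn.Linear(256, n_classes) for m in dims})
        self.fused_head = nn.Linear(256, n_classes)

    def forward(self, feats):                                    # feats: {modality: [B, dim]}
        h = {m: F.relu(self.proj[m](x)) for m, x in feats.items()}
        w = torch.softmax(self.gate(torch.cat(list(h.values()), dim=-1)), dim=-1)  # [B, M]
        fused = sum(w[:, i:i + 1] * h[m] for i, m in enumerate(h))
        uni_logits = {m: self.uni_heads[m](h[m]) for m in h}
        # Combine fused and unimodal predictions so modality-specific cues survive fusion.
        logits = self.fused_head(fused) + sum(uni_logits.values()) / len(uni_logits)
        return logits, uni_logits, w

model = ModalityGatedFusion({"vision": 512, "acoustic": 256, "text": 768}, n_classes=7)
logits, uni, weights = model({"vision": torch.randn(4, 512),
                              "acoustic": torch.randn(4, 256),
                              "text": torch.randn(4, 768)})
```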


Poster #351
Highlight
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang · hongzhen wang · Zonghao Guo · Di Wang · Yulin Wang · Mingshuo Chen · Qiang Ma · Long Lan · Wenjing Yang · Jing Zhang · Zhiyuan Liu · Maosong Sun

The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution that incorporates extremely complex semantic relationships. Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. XLRS-Bench boasts the largest average image size (8500$\times$8500) observed thus far, with all evaluation samples meticulously annotated manually, assisted by a novel semi-automatic captioner on ultra-high-resolution RS images. On top of the XLRS-Bench, 16 sub-tasks are defined to evaluate MLLMs' 6 kinds of perceptual abilities and 4 kinds of reasoning capabilities, with a primary emphasis on advanced cognitive processes that facilitate real-world decision-making and the capture of spatiotemporal changes. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed to enhance their performance in real RS scenarios. We will open source XLRS-Bench to support further research on developing more powerful MLLMs for RS.


Poster #352
DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels

Erjian Guo · Zhen Zhao · Zicheng Wang · Tong Chen · YUNYI LIU · Luping Zhou

Medical Visual Question Answering (Med-VQA) systems benefit the interpretation of medical images containing critical clinical information. However, the challenge of noisy labels and limited high-quality datasets remains underexplored. To address this, we establish the first benchmark for noisy labels in Med-VQA by simulating human mislabeling with semantically designed noise types. More importantly, we introduce the DiN framework, which leverages a diffusion model to handle noisy labels in Med-VQA. Unlike the dominant classification-based VQA approaches that directly predict answers, our Answer Diffuser (AD) module employs a coarse-to-fine process, refining answer candidates with a diffusion model for improved accuracy. The Answer Condition Generator (ACG) further enhances this process by generating task-specific conditional information via integrating answer embeddings with fused image-question features. To address label noise, our Noisy Label Refinement (NLR) module introduces a robust loss function and dynamic answer adjustment to further boost the performance of the AD module. Our DiN framework consistently outperforms existing methods across multiple benchmarks with varying noise levels.


Poster #353
DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension

Xiaofu Chen · Yaxin Luo · Luo · Jiayi Ji · Henghui Ding · Yiyi Zhou

In this paper, we focus on weakly supervised referring expression comprehension (REC), and identify that the lack of fine-grained visual capability greatly limits the upper performance bound of existing methods. To address this issue, we propose a novel framework for weakly supervised REC, namely Dynamic Visual routing Network (DViN), which overcomes the visual shortcomings from the perspective of feature combination and alignment. In particular, DViN is equipped with a novel sparse routing mechanism to efficiently combine features of multiple visual encoders in a dynamic manner, thus improving the visual descriptive power. Besides, we further propose an innovative weakly supervised objective, namely Routing-based Feature Alignment (RFA), which facilitates the visual understanding of routed features through the intra-modal and inter-modal alignment. To validate DViN, we conduct extensive experiments on four REC benchmark datasets. Experiments demonstrate that DViN achieves state-of-the-art results on four benchmarks while maintaining competitive inference efficiency. Besides, the strong generalization ability of DViN is also validated on weakly supervised referring expression segmentation. Source codes are anonymously released at: https://anonymous.4open.science/r/DViN-7736.


Poster #354
ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models

Heng Yin · Yuqiang Ren · Ke Yan · Shouhong Ding · Yongtao Hao

Multimodal large language models (MLLMs) have demonstrated strong language understanding and generation capabilities, excelling in visual tasks like referring and grounding. However, due to task type limitations and dataset scarcity, existing MLLMs only ground objects present in images and cannot reject non-existent objects effectively, resulting in unreliable predictions. In this paper, we introduce ROD-MLLM, a novel MLLM for Reliable Object Detection using free-form language. We propose a query-based localization mechanism to extract low-level object features. By aligning global and object-level visual information with text space, we leverage the large language model (LLM) for high-level comprehension and final localization decisions, overcoming the language understanding limitations of normal detectors. To enhance language-based object detection, we design an automated data annotation pipeline and construct the dataset ROD. This pipeline uses the referring capabilities of existing MLLMs and chain-of-thought techniques to generate diverse expressions corresponding to zero or multiple objects, addressing the shortage of training data. Experiments across various tasks, including referring, grounding, and language-based object detection, show that ROD-MLLM achieves state-of-the-art performance among MLLMs. Notably, in language-based object detection, our model achieves a +13.7 mAP improvement over existing MLLMs and surpasses most specialized detection models, especially in scenarios requiring advanced complex language understanding.


Poster #355
PerLA: Perceptive 3D Language Assistant

Guofeng Mei · Wei Lin · Luigi Riz · Yujiao Wu · Fabio Poiesi · Yiming Wang

Enabling Large Language Models (LLMs) to understand the 3D physical world is an emerging yet challenging research direction. Current strategies for processing point clouds typically downsample the scene or divide it into smaller parts for separate analysis. However, both approaches risk losing key local details or global contextual information. In this paper, we introduce PerLA, a 3D language assistant designed to be more perceptive to both details and context, making visual representations more informative for the LLM. PerLA captures high-resolution (local) details in parallel from different point cloud areas and integrates them with (global) context obtained from a lower-resolution whole point cloud. We present a novel algorithm that preserves point cloud locality through the Hilbert curve and effectively aggregates local-to-global information via cross-attention and a graph neural network. Lastly, we introduce a novel loss for local representation consensus to promote training stability. PerLA outperforms state-of-the-art 3D language assistants, with gains of up to +1.34 CIDEr on ScanQA for question answering, and +4.22 on ScanRefer and +3.88 on Nr3D for dense captioning.


Poster #356
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs

Zhantao Yang · Ruili Feng · Keyu Yan · Huangji Wang · Zhicai Wang · Shangwen Zhu · Han Zhang · Jie Xiao · Pingyu Wu · Kai Zhu · Jixuan Chen · Chen-Wei Xie · Yue Yang · Hongyang Zhang · Yu Liu · Fan Cheng

Advancements in large Vision-Language Models have brought precise, accurate image captioning, vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a great barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained an LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V resources. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other state-of-the-art VLM models in generating high-quality captions. Additionally, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51 times higher recall scores on open-vocabulary object detection tasks compared to leading methods.


Poster #357
Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

Yang Qin · Chao Chen · Zhihang Fu · Dezhong Peng · Xi Peng · Peng Hu

Despite remarkable advancements in text-to-image person re-identification (TIReID) facilitated by the breakthrough of cross-modal embedding models, existing methods often struggle to distinguish challenging candidate images due to intrinsic limitations, such as network architecture and data quality. To address these issues, we propose an Interactive Cross-modal Learning framework (ICL), which leverages human-centered interaction to enhance the discriminability of text queries through external multimodal knowledge. To achieve this, we propose a plug-and-play Test-time Human-centered Interaction (TUI) module, which performs visual question answering focused on human characteristics, facilitating multi-round interactions with a multimodal large language model (MLLM) to align query intent with latent target images. Specifically, TUI refines user queries based on the MLLM responses to reduce the gap to the best-matching images, thereby boosting ranking accuracy. Additionally, to address the limitation of low-quality training texts, we introduce a novel Reorganization Data Augmentation (RDA) strategy based on information enrichment and diversity enhancement to enhance query discriminability by enriching, decomposing, and reorganizing person descriptions. Extensive experiments on four TIReID benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and UFine6926, demonstrate that our method achieves remarkable performance with substantial improvement. The code will be released publicly.


Poster #358
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content

Zicheng Zhang · Tengchuan Kou · Chunyi Li · Shushi Wang · Wei Sun · Wei Wang · Xiaoyu Li · ZongYu Wang · Xuezhi Cao · Xiongkuo Min · Xiaohong Liu · Guangtao Zhai

Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models heavily relies on the scale and quality of human annotations. According to Scaling Law, increasing the number of human-labeled instances follows a predictable pattern that enhances the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for the mentioned two aspects. The Q-EVAL-100K dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with context prompt, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset. The data and code will be released to help promote the development of generation models.


Poster #359
Highlight
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Yuanmin Tang · Jue Zhang · Xiaoting Qin · Jing Yu · Gaopeng Gou · Gang Xiong · Qingwei Lin · Saravan Rajmohan · Dongmei Zhang · Qi Wu

Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. This dual-modality approach is especially valuable in internet search and e-commerce, facilitating tasks like scene image search with object manipulation and product recommendations with attribute changes. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code is available at https://anonymous.4open.science/r/osrcir24/.


Poster #360
Highlight
Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding

Zhaoran Zhao · Peng Lu · Anran Zhang · Pei Pei Li · Xia Li · Xuannan Liu · Yang Hu · Shiyi Chen · liweiwang · Wenhao Guo

With the rapid growth of social media and digital photography, visually appealing images have become essential for effective communication and emotional engagement. Among the factors influencing aesthetic appeal, composition—the arrangement of visual elements within a frame—plays a crucial role. In recent years, specialized models for photographic composition have achieved impressive results across various aesthetic tasks. Meanwhile, rapidly advancing multimodal large language models (MLLMs) have excelled in several visual perception tasks. However, their ability to embed and understand compositional information remains underexplored, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce the Photographic Image Composition Dataset (PICD), a large-scale dataset consisting of 36,857 images categorized into 24 composition categories across 355 diverse scenes. We demonstrate the advantages of PICD over existing datasets in terms of data scale, composition category, label quality, and scene diversity. Building on PICD, we establish benchmarks to evaluate the composition embedding capabilities of specialized models and the compositional understanding ability of MLLMs. To enable efficient and effective evaluation, we propose a novel Composition Discrimination Accuracy (CDA) metric. Our evaluation highlights the limitations of current models and provides insights into directions for improving their ability to embed and understand composition.


Poster #361
Active Data Curation Effectively Distills Large-Scale Multimodal Models

Vishaal Udandarao · Nikhil Parthasarathy · Muhammad Ferjad Naeem · Talfan Evans · Samuel Albanie · Federico Tombari · Yongqin Xian · Alessio Tonioni · Olivier J Henaff

Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find such an active data curation strategy to in fact be complementary to standard KD, and can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with up to 11% fewer inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.
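A minimal sketch of the online batch selection idea, under the assumption that examples are scored by a learnability-style criterion (student loss minus reference-model loss); the actual ACID objective and the per-example loss functions named here are placeholders, not the paper's implementation.

```python
import torch

def select_learnable_batch(super_batch, student_loss_fn, reference_loss_fn, keep_ratio=0.5):
    """Sketch of online batch selection: score each example in a large candidate
    batch by a learnability-style criterion (student loss minus reference-model
    loss) and keep only the top fraction for the actual training step.
    `student_loss_fn` and `reference_loss_fn` are hypothetical callables that
    return per-example losses of shape [N]."""
    with torch.no_grad():
        student_losses = student_loss_fn(super_batch)       # [N]
        reference_losses = reference_loss_fn(super_batch)   # [N]
        learnability = student_losses - reference_losses    # high = hard for student, easy for reference
    k = max(1, int(keep_ratio * learnability.numel()))
    keep_idx = torch.topk(learnability, k).indices
    return keep_idx                                          # indices of examples to train on
```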


Poster #362
Yo’Chameleon: Personalized Vision and Language Generation

Thao Nguyen · Krishna Kumar Singh · Jing Shi · Trung Bui · Yong Jae Lee · Yuheng Li

Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.


Poster #363
Relation-Rich Visual Document Generator for Visual Information Extraction

Zi-Han Jiang · Chien-Wei Lin · WeiHua Li · Hsuan-Tung Liu · Yi-Ren Yeh · Chu-Song Chen

Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to the layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Besides, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between the contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format which captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation efforts. Experimental results have demonstrated that our method significantly enhances the performance of document understanding models on various VIE benchmarks.


Poster #364
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Zining Wang · Tongkun Guan · Pei Fu · Chen Duan · Qianyi Jiang · Zhentao Guo · Shan Guo · Junfeng Luo · Wei Shen · Xiaokang Yang

Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M samples (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets will be available soon.


Poster #365
A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Zhaoqing Zhu · Chuwei Luo · Zirui Shao · Feiyu Gao · Hangdi Xing · Qi Zheng · Ji Zhang

Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for tokens that are used to represent layout information. Due to the constraint on max position IDs, assigning them to layout information reduces those available for text content, reducing the capacity for the model to learn from the text during training, while also introducing a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model's capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to enhance cross-modality learning between text and layout tokens. Extensive experiments show that LayTokenLLM outperforms existing layout-integrated LLMs and MLLMs of similar scales on multi-page document understanding tasks, while also achieving superior performance on most single-page tasks. Code and data will be publicly available.
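A minimal sketch of the position-ID sharing idea, assuming the layout token reuses the position ID of its segment's first text token (the released scheme may differ); only the sequence construction is shown.

```python
def build_interleaved_sequence(segments, layout_token_id):
    """Sketch (not the released implementation): interleave one layout token per
    text segment and share position IDs so layout tokens consume no extra IDs.
    `segments` is a list of (text_token_ids, layout_embedding_index) pairs."""
    input_ids, position_ids, next_pos = [], [], 0
    for text_ids, _layout_idx in segments:
        # Assumption: the layout token reuses the position ID of the segment's first text token.
        input_ids.append(layout_token_id)
        position_ids.append(next_pos)
        for tid in text_ids:
            input_ids.append(tid)
            position_ids.append(next_pos)
            next_pos += 1
    return input_ids, position_ids

# Example: two segments of 3 and 2 text tokens -> only 5 distinct position IDs are consumed.
ids, pos = build_interleaved_sequence([([11, 12, 13], 0), ([21, 22], 1)], layout_token_id=9)
print(pos)  # [0, 0, 1, 2, 3, 3, 4]
```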


Poster #366
Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution

Zhiyuan You · Xin Cai · Jinjin Gu · Tianfan Xue · Chao Dong

With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising performance in linguistic quality description. However, current methods still fall short in accurately scoring image quality. In this work, we aim to leverage MLLMs to regress accurate quality scores. A key challenge is that the quality score is inherently continuous, typically modeled as a Gaussian distribution, whereas MLLMs generate discrete token outputs. This mismatch necessitates score discretization. Previous approaches discretize the mean score into a one-hot label, resulting in information loss and failing to capture inter-image relationships. We propose a distribution-based approach that discretizes the score distribution into a soft label. This method preserves the characteristics of the score distribution, achieving high accuracy and maintaining inter-image relationships. Moreover, to address dataset variation, where different IQA datasets exhibit various distributions, we introduce a fidelity loss based on Thurstone’s model. This loss captures intra-dataset relationships, facilitating co-training across multiple IQA datasets. With these designs, we develop the Distribution-based multi-modal image Quality Assessment model (DimiQA). Experiments across multiple benchmarks show that DimiQA stably outperforms baselines in score regression. Also, DimiQA can predict the score distribution that closely aligns with human annotations. Codes and model weights will be released.
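A minimal sketch of the distribution-based discretization, assuming a 1-5 score scale with unit-width bins (the actual level set and binning are design choices of the paper not reproduced here): the Gaussian (mean, std) annotation is converted into a soft label over discrete score levels and trained with a soft cross-entropy.

```python
import torch
import torch.nn.functional as F

def soft_label_from_gaussian(mean, std, levels=torch.linspace(1.0, 5.0, 5), width=0.5):
    """Sketch: turn a (mean, std) quality-score annotation into a soft label over
    discrete score levels by integrating the Gaussian over each level's bin."""
    normal = torch.distributions.Normal(mean, std)
    lo, hi = levels - width, levels + width
    probs = normal.cdf(hi) - normal.cdf(lo)
    return probs / probs.sum()                       # normalize to a distribution

def soft_ce(level_logits, soft_label):
    """Soft cross-entropy between the model's logits over score-level tokens and the soft label."""
    return -(soft_label * F.log_softmax(level_logits, dim=-1)).sum(dim=-1).mean()

label = soft_label_from_gaussian(torch.tensor(3.4), torch.tensor(0.6))
print(label)   # most mass on levels 3 and 4, preserving inter-level relationships
```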


Poster #367
FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs

Mothilal Asokan · Kebin wu · Fatima Albreiki

As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggregated results are then used to enforce fine-grained token-to-token cross-modal alignment. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation. Quantitative and qualitative experimental results demonstrate the effectiveness of FineLIP, outperforming existing state-of-the-art approaches. Furthermore, comprehensive ablation studies validate the benefits of key design elements within FineLIP.
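One common way to extend a CLIP-style text encoder beyond its 77-token limit is to interpolate the learned positional-embedding table, sketched below; FineLIP's exact extension scheme and the subsequent dynamic token aggregation are not shown.

```python
import torch
import torch.nn.functional as F

def extend_positional_embeddings(pos_embed, new_len):
    """Sketch: stretch a learned positional-embedding table [old_len, dim] to
    [new_len, dim] by 1D linear interpolation, one common way to let a CLIP-style
    text encoder accept inputs longer than its original 77 tokens."""
    old_len, dim = pos_embed.shape
    x = pos_embed.T.unsqueeze(0)                     # [1, dim, old_len]
    x = F.interpolate(x, size=new_len, mode="linear", align_corners=True)
    return x.squeeze(0).T                            # [new_len, dim]

clip_pos = torch.randn(77, 512)                      # stand-in for CLIP's learned table
long_pos = extend_positional_embeddings(clip_pos, 248)
print(long_pos.shape)                                # torch.Size([248, 512])
```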


Poster #368
MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures

Lucas Morin · Valery Weber · Ahmed Nassar · Gerhard Ingmar Meijer · Luc Van Gool · Yawei Li · Peter W. J. Staar

The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available upon acceptance.


Poster #369
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

Andrea Maracani · Savas Ozkan · Sijun Cho · Hyo-Won Kim · Eunchung Noh · Jeongwon Min · Cho Jung Min · Dookun Park · Mete Ozay

Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.


Poster #370

Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in token length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.19 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code will be released upon acceptance.


Poster #371
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

Haoran Hao · Jiaming Han · Changsheng Li · Yu-Feng Li · Xiangyu Yue

The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in users' daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP will retrieve relevant information from the database using a multimodal retriever. (c) Generate: The input query and retrieved concepts' information are fed into MLLMs to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing via updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on the dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models will be publicly available.
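A minimal sketch of the Remember/Retrieve/Generate loop, with a hypothetical in-memory store and a cosine-similarity retriever stand-in; the actual framework uses a trained multimodal retriever and feeds the retrieved concept information to the MLLM, which is omitted here.

```python
import torch
import torch.nn.functional as F

class PersonalConceptStore:
    """Sketch of the Remember/Retrieve steps: a key-value store of user concepts,
    keyed by an embedding from any multimodal encoder, with free-form info as value."""
    def __init__(self):
        self.keys, self.values = [], []

    def remember(self, embedding, info):
        self.keys.append(F.normalize(embedding, dim=-1))
        self.values.append(info)                      # e.g. {"name": "...", "notes": "..."}

    def retrieve(self, query_embedding, top_k=2):
        q = F.normalize(query_embedding, dim=-1)
        sims = torch.stack([k @ q for k in self.keys])
        return [self.values[i] for i in sims.topk(min(top_k, len(self.values))).indices]

def personalized_prompt(store, query_text, query_embedding):
    """Generate step (model call omitted): prepend retrieved concept info to the prompt."""
    facts = "\n".join(str(v) for v in store.retrieve(query_embedding))
    return f"Known user concepts:\n{facts}\n\nQuestion: {query_text}"
```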


Poster #372
What’s in the Image? A Deep-Dive into the Vision of Vision Language Models

Omri Kaduri · Shai Bagon · Tali Dekel

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on the attention modules across layers, by which we reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of "describe the image") is utilized by the model to store global image information; we demonstrate that the model generates surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally. (iii) Fine-grained visual attributes and object details are directly extracted from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose a novel quantitative evaluation to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.


Poster #373
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key

Zhihe Yang · Xufang Luo · Dongqi Han · Yunjian Xu · Dongsheng Li

Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t. the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
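For reference, the standard DPO objective that this line of work builds on, in minimal form; OPA-DPO's specific contribution (expert-revised responses and on-policy data construction) sits on top of this loss and is not captured by the snippet.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO objective: push the policy to prefer the less-hallucinated
    (winning) response relative to a frozen reference policy. Inputs are the
    summed log-probabilities of each response under each model, shape [B]."""
    logits = beta * ((policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose))
    return -F.logsigmoid(logits).mean()
```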


Poster #374
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Chenxin Tao · Shiqian Su · Xizhou Zhu · Chenyu Zhang · Zhe Chen · Jiawen Liu · Wenhai Wang · Lewei Lu · Gao Huang · Yu Qiao · Jifeng Dai

The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to the compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown capable of interpreting images, when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts. Furthermore, a multi-stage training strategy is carefully designed to empower the holistic embedding module. It is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, enabling large-scale training with unpaired random images and text tokens. The whole model further undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is incorporated. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks, outperforming previous monolithic models by a large margin.


Poster #375
FlashSloth : Lightning Multimodal Large Language Models via Embedded Visual Compression

Bo Tong · Bokai Lai · Yiyi Zhou · Luo · Yunhang Shen · Ke Li · Xiaoshuai Sun · Rongrong Ji

Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens still used limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens in the process of compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a number of tiny but strong MLLMs are also comprehensively compared, e.g., InternVL-2, MiniCPM-V2 and Qwen2-VL. The experimental results show that compared with these advanced tiny MLLMs, our FlashSloth can greatly reduce the number of visual tokens, training memory and computation complexity while retaining high performance on various VL tasks. Our code is anonymously released at: https://anonymous.4open.science/r/FlashSloth/.


Poster #376
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models

Xinyu Tian · Shu Zou · Zhaoyuan Yang · Jing Zhang

The evolution of Large Vision-Language Models (LVLMs) has progressed from single-image understanding to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models like GPT-4o show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by these insights, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA effectively reduces position bias and significantly enhances the reasoning performance of existing LVLMs.
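A sketch of one possible reading of the mask interpolation: blend the causal attention pattern with a bidirectional one among image-token positions, so later images can also be attended from earlier image positions; SoFA's exact formulation may differ.

```python
import torch

def soft_attention_mask(seq_len, image_token_idx, alpha=0.5):
    """Sketch (an assumed instantiation, not SoFA's exact rule): interpolate
    between a causal mask and a bidirectional mask among image-token positions.
    Returned values in [0, 1] are soft attention multipliers, not a hard mask."""
    causal = torch.tril(torch.ones(seq_len, seq_len))   # 1 = attention allowed
    soft = causal.clone()
    rows = image_token_idx.unsqueeze(1)                 # [K, 1]
    cols = image_token_idx.unsqueeze(0)                 # [1, K]
    # Blend the causal pattern with full (bidirectional) attention among image tokens.
    soft[rows, cols] = (1 - alpha) * causal[rows, cols] + alpha
    return soft

mask = soft_attention_mask(seq_len=10, image_token_idx=torch.tensor([1, 2, 6, 7]), alpha=0.5)
```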


Poster #377
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

Dhouib Mohamed · Davide Buscaldi · Vanier Sonia · Aymen Shabou

Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.
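A simplified sketch of attention-free visual-token pruning; the hidden-state norm used here is only a stand-in for PACT's importance metric, and the distance-bounded density peak clustering step that merges redundant tokens is omitted.

```python
import torch

def prune_visual_tokens(visual_hidden, keep_ratio=0.5):
    """Sketch of attention-free pruning: rank visual tokens with a stand-in
    importance metric (hidden-state norm here, NOT PACT's actual metric) and
    keep the top fraction, preserving the original token order."""
    importance = visual_hidden.norm(dim=-1)                            # [B, N]
    k = max(1, int(keep_ratio * visual_hidden.shape[1]))
    keep_idx = importance.topk(k, dim=1).indices.sort(dim=1).values    # [B, k]
    return torch.gather(visual_hidden, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, visual_hidden.shape[-1]))

pruned = prune_visual_tokens(torch.randn(2, 576, 1024), keep_ratio=0.5)   # [2, 288, 1024]
```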


Poster #378
Conical Visual Concentration for Efficient Large Vision-Language Models

Long Xing · Qidong Huang · Xiaoyi Dong · Jiajie Lu · Pan Zhang · Yuhang Zang · Yuhang Cao · Conghui He · Jiaqi Wang · Feng Wu · Dahua Lin

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably result in the loss of crucial image information. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and token redundancy progressively increases in the deeper layers. To this end, we propose ViCo, a conical-style visual concentration strategy for LVLMs to boost their efficiency in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio. The dropping is based on a lightweight similarity calculation with a negligible time overhead. Extensive experiments demonstrate that ViCo can achieve over 40% training time reduction and 55% inference FLOPs acceleration on leading LVLMs like LLaVA-NeXT, maintaining comparable multi-modal performance. Besides, ViCo can also serve as a plug-and-play strategy to accelerate inference in a training-free manner, with better performance and lower inference cost than counterparts.
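A simplified sketch of stage-wise token dropping with a lightweight similarity rule; the neighbouring-token cosine similarity used below is an assumption, not necessarily ViCo's exact calculation, and the stage drop ratios are illustrative.

```python
import torch
import torch.nn.functional as F

def stagewise_drop(image_tokens, drop_ratio):
    """Sketch: at the end of a stage, drop the image tokens most similar to their
    neighbouring token (a stand-in for ViCo's lightweight similarity rule),
    keeping a (1 - drop_ratio) fraction in their original order."""
    sim = F.cosine_similarity(image_tokens[:, 1:], image_tokens[:, :-1], dim=-1)  # [B, N-1]
    sim = F.pad(sim, (1, 0), value=-1.0)              # first token is never considered redundant
    keep = max(1, int((1 - drop_ratio) * image_tokens.shape[1]))
    keep_idx = (-sim).topk(keep, dim=1).indices.sort(dim=1).values
    return torch.gather(image_tokens, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, image_tokens.shape[-1]))

# Conical schedule: keep all tokens in shallow stages, drop progressively more in deeper ones.
stage_drop_ratios = [0.0, 0.3, 0.5, 0.7]
```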


Poster #379
Highlight
Assessing and Learning Alignment of Unimodal Vision and Language Models

Le Zhang · Qian Yang · Aishwarya Agrawal

How well are unimodal vision and language models aligned? Although prior works have approached answering this question, their assessment methods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of the SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly fewer (~6%) paired image-text data for the multimodal alignment compared to models like CLIP which are trained from scratch. SAIL training only requires a single A100 GPU, ~5 hours of training and can accommodate a batch size up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders that in turn enhance the performance of multimodal large language models.


Poster #380
Continual SFT Matches Multimodal RLHF with Negative Supervision

Ke Zhu · Yu Wang · Yanpeng Sun · Qiang Chen · Jiang-Jiang Liu · gang zhang · Jingdong Wang

Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds its superiority over continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logit of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits the information residing in this negative supervision. Our nSFT disentangles this negative supervision in the RLHF paradigm, and continually aligns VLMs with a simple SFT loss. This is more memory efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously proved by comparing it with various multimodal RLHF approaches, across different dataset sources, base VLMs and evaluation metrics. Besides, a wealth of ablations is provided to support our hypothesis. We hope this paper will stimulate further research to properly align large vision language models.


Poster #381

Contrastive decoding strategies are widely used to mitigate object hallucinations in multimodal large language models (MLLMs). By reducing over-reliance on language priors, these strategies ensure that generated content remains closely grounded in visual inputs, producing contextually accurate outputs. Since contrastive decoding requires no additional training or external tools, it offers both computational efficiency and versatility, making it highly attractive. However, these methods present two main limitations: (1) bluntly suppressing language priors can compromise coherence and accuracy of generated content, and (2) processing contrastive inputs adds computational load, significantly slowing inference speed. To address these challenges, we propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model’s middle layers, where modality fusion predominantly occurs. This approach enables more effective capture of visual features, reducing the model’s bias toward language modality. Experimental results demonstrate that VAF significantly reduces hallucinations across various MLLMs without affecting inference speed, while maintaining coherence and accuracy in generated outputs.
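A sketch of the visual-amplification idea, assuming attention weights of shape [batch, heads, query, key] and a boolean mask over image-token key positions; the choice of middle layers and the gain factor are illustrative assumptions, not the paper's exact settings.

```python
import torch

def amplify_visual_attention(attn_weights, image_token_mask, layer_idx, num_layers, gain=1.5):
    """Sketch: boost the attention mass assigned to image-token keys in the middle
    layers only (where modality fusion is assumed to occur), then renormalize rows.
    attn_weights: [B, H, Q, K]; image_token_mask: bool over the K key positions."""
    if not (num_layers // 3 <= layer_idx < 2 * num_layers // 3):
        return attn_weights                                    # leave other layers unchanged
    scale = torch.where(image_token_mask,
                        torch.full_like(attn_weights, gain),
                        torch.ones_like(attn_weights))
    boosted = attn_weights * scale
    return boosted / boosted.sum(dim=-1, keepdim=True)         # rows sum to 1 again
```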


Poster #382
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection

Le Yang · Ziwei Zheng · Boxu Chen · Zhengyu Zhao · Chenhao Lin · Chao Shen

Recent studies have shown that large vision-language models (LVLMs) often suffer from the issue of object hallucinations (OH). To mitigate this issue, we introduce an efficient method that edits the model weights based on an unsafe subspace, which we call HalluSpace in this paper. With truthful and hallucinated text prompts accompanying the visual content as inputs, the HalluSpace can be identified by extracting the hallucinated embedding features and removing the truthful representations in LVLMs. By orthogonalizing the model weights, input features will be projected into the Null space of the HalluSpace to reduce OH, based on which we name our method Nullu. We reveal that HalluSpaces generally contain statistical bias and unimodal priors of the large language models (LLMs) applied to build LVLMs, which have been shown as essential causes of OH in previous studies. Therefore, null space projection suppresses the LLMs' priors to filter out the hallucinated features, resulting in contextually accurate outputs. Experiments show that our method can effectively mitigate OH across different LVLM families without extra inference costs and also show strong performance in general LVLM benchmarks. Codes will be released at \url{url}.
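The core linear-algebra step can be sketched as follows, assuming an orthonormal basis H (columns) of the identified HalluSpace for a layer whose weight W acts on input features x as W @ x; identifying H itself from truthful and hallucinated prompt pairs is not shown.

```python
import torch

def project_out_subspace(W, H):
    """Sketch of the null-space editing step: given a weight matrix W and an
    orthonormal basis H (columns) of the unwanted subspace, replace W by
    W(I - H H^T) so components of the input lying in that subspace no longer
    influence the layer's output."""
    d = W.shape[1]
    P = H @ H.T                                      # projector onto the unwanted subspace
    return W @ (torch.eye(d, dtype=W.dtype) - P)

# Hypothetical example: edit a 4096x4096 layer against a rank-8 subspace basis.
W = torch.randn(4096, 4096)
H, _ = torch.linalg.qr(torch.randn(4096, 8))         # columns form an orthonormal basis
W_edited = project_out_subspace(W, H)
```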


Poster #383
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception

Yuanchen Wu · Lu Zhang · Hang Yao · Junlong Du · Ke Yan · Shouhong Ding · Yunsheng Wu · Xiaoqiang Li

Large Vision-Language Models (LVLMs) have achieved impressive results across various multi-modal tasks. However, hallucinations, i.e., the models generating counterfactual responses, remain a challenge. Though recent studies have attempted to alleviate object perception hallucinations, they focus on the models' response generation, overlooking the task question itself. This paper discusses the vulnerability of LVLMs in solving counterfactual presupposition questions (CPQs), where the models are prone to accept the presuppositions of counterfactual objects and produce severe hallucinatory responses. To this end, we introduce “Antidote,” a unified, synthetic data-driven post-training framework for mitigating both types of hallucination above. It leverages synthetic data to incorporate factual priors into questions to achieve self-correction and decouple the mitigation process into a preference optimization problem. Furthermore, we construct “CP-Bench,” a novel benchmark to evaluate LVLMs' ability to correctly handle CPQs and produce factual responses. Applied to the LLaVA series, Antidote can simultaneously enhance performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR & SHR by 30-50%, all without relying on external supervision from stronger LVLMs or human feedback, and without introducing noticeable catastrophic forgetting issues.


Poster #384
Highlight
MLLM-as-a-Judge for Image Safety without Human Labeling

Zhenting Wang · Shuming Hu · Shiyu Zhao · Xiaowen Lin · Felix Juefei-Xu · Zhuowei Li · Ligong Han · Harihar Subramanyam · Li Chen · Jianfa Chen · nan jiang · Lingjuan Lyu · Shiqing Ma · Dimitris N. Metaxas · Ankit Jain

Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which however brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experiment results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.


Poster #385
Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?

Yuan-Hong Liao · Rafid Mahmood · Sanja Fidler · David Acuna

Improving semantic grounding in Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining the network architectures, or modifying the training recipes. In this work, we venture into an orthogonal direction and explore self-correction in VLMs focusing on semantic grounding. We find that VLMs can correct their own semantic grounding mistakes when properly prompted and framed for the task, without any fine-tuning or even access to oracle feedback. We also introduce an iterative self-correction framework that consistently improves VLM performance in semantic grounding by up to 8.4 accuracy points across all models investigated, without requiring fine-tuning, additional architectural changes, or external data. Our exploration of self-correction also reveals that, even after several rounds of feedback, strong models like GPT-4V and GPT-4o retain limited capability in leveraging oracle feedback, suggesting promising directions for further research.


Poster #386
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Peng Xie · Yequan Bie · Jianda Mao · Yangqiu Song · Yang Wang · Hao Chen · Kani Chen

Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenarios demonstrate that our attacking strategy is able to effectively mislead models to generate targeted responses using only black-box attacks without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments. The code will be made publicly available.


Poster #387
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Sanghwan Kim · Rui Xiao · Iuliana Georgescu · Stephan Alaniz · Zeynep Akata

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training, which integrates a novel text-cropping strategy and a cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.


Poster #388
Reproducible Vision-Language Models Meet Concepts Out of Pre-Training

Ziliang Chen · Xin Huang · Xiaoxuan Fan · Keze Wang · Yuyu Zhou · Quanlong Guan · Liang Lin

Contrastive Language-Image Pre-training (CLIP) models are a milestone of modern multimodal intelligence, and their generalization mechanism has attracted massive research interest in the community. However, existing studies are limited to the scope of pre-training knowledge and hardly explain generalization to the countless open-world concepts absent from the pre-training regime. This paper dives into this Out-of-Pre-training (OOP) generalization problem from a holistic perspective. We propose the LAION-Beyond benchmark to isolate the evaluation of OOP concepts from pre-training knowledge, with regard to OpenCLIP and its reproducible variants derived from LAION datasets. Empirical analysis shows that although the image features of OOP concepts exhibit significant category margins, their zero-shot transfer largely fails due to poor image-text alignment. To this end, we elaborate the "name-tuning" methodology and its theoretical merits in terms of OOP generalization, then propose few-shot name learning (FSNL) and zero-shot name learning (ZSNL) algorithms to achieve OOP generalization in a data-efficient manner. Their superiority has been further verified in our comprehensive experiments.

Vision-language models (VLMs) are among the most important models for multi-modal tasks. Real industrial applications often meet the challenge of adapting VLMs to different scenarios, such as varying hardware platforms or performance requirements. Traditional methods involve training or fine-tuning to adapt multiple unique VLMs or using model compression techniques to create multiple compact models. These approaches are complex and resource-intensive. This paper introduces a novel paradigm called Once-Tuning-Multiple-Variants (OTMV). OTMV requires only a single tuning process to inject dynamic weight expansion capacity into the VLM. This tuned VLM can then be expanded into multiple variants tailored for different scenarios at inference. The tuning mechanism of OTMV is inspired by the mathematical series expansion theorem, which helps to reduce the parameter size and memory requirements while maintaining accuracy for the VLM. Experimental results show that OTMV-tuned models achieve comparable accuracy to baseline VLMs across various visual-language tasks. The experiments also demonstrate the dynamic expansion capability of OTMV-tuned VLMs, outperforming traditional model compression and adaptation techniques in terms of accuracy and efficiency.


Poster #390
Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves

Shihan Wu · Ji Zhang · Pengpeng Zeng · Lianli Gao · Jingkuan Song · Heng Tao Shen

Prompt tuning (PT) has long been recognized as an effective and efficient paradigm for transferring large pre-trained vision-language models (VLMs) to downstream tasks by learning a tiny set of context vectors. Nevertheless, in this work, we reveal that freezing the parameters of VLMs while learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. Upon further investigation, we find that reducing both the length and width of the feature-gradient propagation flows of the full fine-tuning (FT) baseline is key to achieving effective and efficient knowledge transfer. Motivated by this, we propose Skip Tuning, a novel paradigm for adapting VLMs to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules. Extensive experiments across a wide spectrum of benchmarks demonstrate the superior effectiveness and efficiency of our Skip Tuning over both PT and adapter-based methods. Code: https://github.com/anonymity-007/SkipT.


Poster #391
SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling

Qi Zhu · Jiangwei Lao · Deyi Ji · Junwei Luo · Kang Wu · Yingying Zhang · Lixiang Ru · Jian Wang · Jingdong Chen · Ming Yang · Dong Liu · Feng Zhao

Open-world interpretation aims to accurately localize and recognize all objects within images by vision-language models (VLMs). While substantial progress has been made in this task for natural images, the advancements for remote sensing (RS) images still remain limited, primarily due to these two challenges. 1) Existing RS semantic categories are limited, particularly for pixel-level interpretation datasets. 2) Distinguishing among diverse RS spatial regions solely by language space is challenging due to the dense and intricate spatial distribution in open-world RS imagery. To address the first issue, we develop a fine-grained RS interpretation dataset, Sky-SA, which contains 183,375 high-quality local image-text pairs with full-pixel manual annotations, covering 1,763 category labels, exhibiting richer semantics and higher density than previous datasets. Afterwards, to solve the second issue, we introduce the vision-centric principle for vision-language modeling. Specifically, in the pre-training stage, the visual self-supervised paradigm is incorporated into image-text alignment, reducing the degradation of general visual representation capabilities of existing paradigms. Then, we construct a visual-relevance knowledge graph across open-category texts and further develop a novel vision-centric image-text contrastive loss for fine-tuning with text prompts. This new model, denoted as SkySense-O, demonstrates impressive zero-shot capabilities on a thorough evaluation encompassing 14 datasets over 4 tasks, from recognizing to reasoning and classification to localization. Specifically, it outperforms the latest models such as SegEarth-OV, GeoRSCLIP, and VHM by a large margin, i.e., 11.95%, 8.04% and 3.55% on average respectively. We will release the dataset and model to facilitate future research.


Poster #392
Task-Aware Clustering for Prompting Vision-Language Models

Fusheng Hao · Fengxiang He · Fuxiang Wu · Tichao Wang · Chengqun Song · Jun Cheng

Prompt learning has attracted widespread attention in adapting vision-language models to downstream tasks. Existing methods largely rely on optimization strategies to ensure the task-awareness of learnable prompts. Due to the scarcity of task-specific data, overfitting is prone to occur. The resulting prompts often do not generalize well or exhibit limited task-awareness. To address this issue, we propose a novel Task-Aware Clustering (TAC) framework for prompting vision-language models, which increases the task-awareness of learnable prompts by introducing task-aware pre-context. The key ingredients are as follows: (a) generating task-aware pre-context based on task-aware clustering that can preserve the backbone structure of a downstream task with only a few clustering centers, (b) enhancing the task-awareness of learnable prompts by enabling them to interact with task-aware pre-context via the well-pretrained encoders, and (c) preventing the visual task-aware pre-context from interfering with the interaction between patch embeddings via a masked attention mechanism. Extensive experiments are conducted on benchmark datasets, covering the base-to-novel, domain generalization, and cross-dataset transfer settings. Ablation studies validate the effectiveness of key ingredients. Comparative results show the superiority of our TAC over competitive counterparts. The code will be made publicly available.


Poster #393
Learning Textual Prompts for Open-World Semi-Supervised Learning

Yuxin Fan · Junbiao Cui · Jiye Liang

Traditional semi-supervised learning achieves significant success in closed-world scenarios. To better align with the openness of the real world, researchers propose open-world semi-supervised learning (OWSSL), which enables models to effectively recognize known and unknown classes even without labels for unknown classes. Recently, researchers have attempted to enhance the model performance in recognizing visually similar classes by integrating textual information. However, these attempts do not effectively align images with text, resulting in limited improvements in model performance. In response to this challenge, we propose a novel OWSSL method. By adopting a global-and-local textual prompt learning strategy to enhance image-text alignment effectiveness, and implementing a forward-and-backward strategy to reduce noise in image-text matching for unlabeled samples, we ultimately enhance the model’s ability to extract and recognize discriminative features across different classes. Experimental results on multiple fine-grained datasets demonstrate that our method achieves significant performance improvements compared to state-of-the-art methods.


Poster #394
BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

Taha Koleilat · Hojat Asgariandehkordi · Hassan Rivaz · Yiming Xiao

Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp) intend to tackle these issues, but still fall short in generalizability. Meanwhile, explorations in prompt learning for biomedical image analysis are still highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code will be publicly available upon acceptance of the submission.


Poster #395
ILIAS: Instance-Level Image retrieval At Scale

Giorgos Kordopatis-Zilos · Vladan Stojnić · Anna Manko · Pavel Suma · Nikolaos-Antonios Ypsilantis · Nikos Efthymiadis · Zakaria Laskar · Jiri Matas · Ondrej Chum · Giorgos Tolias

This work introduces ILIAS, a new test dataset for Instance-Level Image retrieval At Scale. It is designed to evaluate the ability of current and future foundation models and retrieval techniques to recognize particular objects. The key benefits over existing datasets include large scale, domain diversity, accurate ground truth, and a performance that is far from saturated. ILIAS includes query and positive images for 1,000 object instances, manually collected to capture challenging conditions and diverse domains. Large-scale retrieval is conducted against 100 million distractor images from YFCC100M. To avoid false negatives without extra annotation effort, we include only query objects confirmed to have emerged after 2014, i.e. the compilation date of YFCC100M. An extensive benchmarking is performed with the following observations: i) models fine-tuned on specific domains, such as landmarks or products, excel in that domain but fail on ILIAS, ii) learning a linear adaptation layer using multi-domain class supervision results in performance improvements, especially for vision-and-language models, iii) local descriptors in retrieval re-ranking are still a key ingredient, especially in the presence of severe background clutter, iv) the text-to-image performance of the vision-language foundation models is surprisingly close to the corresponding image-to-image case.


Poster #396
Highlight
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Vishwesh Nath · Wenqi Li · Dong Yang · Andriy Myronenko · Yao Lu · Zhijian Liu · Danny Yin · Yucheng Tang · Pengfei Guo · Ziyue Xu · Can Zhao · Yufan He · Greg Heinrich · Mingxin Zheng · Benjamin D. Simon · Stephanie Anne Harmon · Michael Zephyr · Marc Edgar · Stephen R. Aylward · Pavlo Molchanov · Yan Mee LAW · Baris Turkbey · Holger R. Roth · Daguang Xu

Generalist vision language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields like healthcare, where expert knowledge is essential. Current large multimodal models like Gemini and GPT-4o are insufficient for medical tasks due to their reliance on memorized internet knowledge rather than the nuanced expertise required in healthcare. Meanwhile, existing medical VLMs (e.g. Med-Gemini) often lack expert consultation as part of their design, and many rely on outdated, static datasets that were not created with modern, large deep learning models in mind. VLMs are usually trained in three stages: vision pre-training, vision-language pre-training, and instruction fine-tuning (IFT). IFT has been typically applied using a mixture of generic and healthcare data. In contrast, we propose that for medical VLMs, a fourth stage of specialized IFT is necessary, which focuses on medical data and includes information from domain expert models. Domain expert models developed for medical use are crucial because they are specifically trained for certain clinical tasks, e.g. to detect tumors and classify abnormalities through segmentation and classification, which learn fine-grained features of medical data, features that are often too intricate for a VLM to capture effectively. This paper introduces a new framework, VILA-M3, for medical VLMs that utilizes domain knowledge via expert models. We argue that generic VLM architectures alone are not viable for real-world clinical applications and on-demand usage of domain-specialized expert model knowledge is critical for advancing AI in healthcare. Through our experiments, we show an improved state-of-the-art (SOTA) performance with an average improvement of approximately 9% over the prior SOTA model Med-Gemini and approximately 6% over models trained on the specific tasks. Our approach emphasizes the importance of domain expertise in creating precise, reliable VLMs for medical applications.


Poster #397
Explaining in Diffusion: Explaining a Classifier with Diffusion Semantics

Tahira Kazimi · Ritika Allada · Pinar Yanardag

Classifiers are crucial to computer vision, yet their "black box" nature obscures the decision-making process, limiting the ability to trace the influence of individual features. Traditional interpretability methods, including GAN-based attribute editing, are constrained by domain and resource demands, often requiring extensive labeling and model-specific training. Text-to-image diffusion models, while promising for broader applications, lack precise semantics for classifier interpretation without extensive user input. We introduce DiffEx, a training-free framework that combines large language models (LLMs) and pre-trained diffusion models to improve classifier explainability. DiffEx leverages Vision-Language Models (VLMs) to build a comprehensive, hierarchical semantic corpus and applies a novel algorithm to rank impactful features, offering broad and fine-grained attributes that influence classifier scores. Our experiments show that DiffEx provides nuanced, interpretable insights across diverse domains, including medical diagnostics, making it versatile, scalable, and well-suited for understanding complex classifiers in critical applications.


Poster #398
Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging

Bo Wang · Dingwei Tan · Yen-Ling Kuo · Zhaowei Sun · Jeremy M Wolfe · Tat-Jen Cham · Mengmi Zhang

Imagine searching a collection of coins for quarters ($0.25), dimes ($0.10), nickels ($0.05), and pennies ($0.01)—a hybrid foraging task where observers search for multiple instances of multiple target types. In such tasks, how do target values and their prevalence influence foraging and eye movement behaviors (e.g., should you prioritize rare quarters or common nickels)? To explore this, we conducted human psychophysics experiments, revealing that humans are proficient reward foragers. Their eye fixations are drawn to regions with higher average rewards, fixation durations are longer on more valuable targets, and their cumulative rewards exceed chance, approaching the upper bound of optimal foragers. To probe these decision-making processes, we developed a transformer-based Visual Forager (VF) model trained via reinforcement learning. Our VF model takes a series of targets, their corresponding values, and the search image as inputs, processes the images using foveated vision, and produces a sequence of eye movements along with decisions on whether to click on each fixated item. Our model outperforms all baselines, achieves cumulative rewards comparable to those of humans, and closely mirrors human foraging behavior in eye movements and click biases. Furthermore, stress tests on out-of-distribution tasks with novel targets, unseen values, and varying set sizes demonstrate the VF model’s effective generalization. Our work offers valuable insights into the relationship between eye movements and decision-making, with our model serving as a powerful tool for further exploration of this connection. All data, code, and models will be made public.


Poster #399
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

Junjie Wang · BIN CHEN · Yulin Li · Bin Kang · Yichi Chen · Zhuotao Tian

Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code and models will be made publicly available.

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper aims to address this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments not only demonstrate that our method achieves competitive performance to supervised models, but also surpasses existing self-supervised approaches by a large margin.


Poster #401
Highlight
Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition

Zheda Mai · Ping Zhang · Cheng-Hao Tu · Hong-You Chen · Quang-Huy Nguyen · Li Zhang · Wei-Lun Chao

Parameter-efficient fine-tuning (PEFT) has attracted significant attention due to the growth of pre-trained model sizes and the need to fine-tune (FT) them for superior downstream performance. Despite a surge in new PEFT methods, a systematic study to understand their performance and suitable application scenarios is lacking, leaving questions like "when to apply PEFT" and "which method to use" largely unanswered, especially in visual recognition. In this paper, we conduct a unifying empirical study of representative PEFT methods with Vision Transformers. We systematically tune their hyper-parameters to fairly compare their accuracy on downstream tasks. Our study offers a practical user guide and unveils several new insights. First, if tuned carefully, different PEFT methods achieve similar accuracy in the low-shot benchmark VTAB-1K. This includes simple approaches like FT the bias terms that were reported inferior. Second, despite similar accuracy, we find that PEFT methods make different mistakes and high-confidence predictions, likely due to their different inductive biases. Such an inconsistency (or complementariness) opens up the opportunity for ensemble methods, and we make preliminary attempts at this. Third, going beyond the commonly used low-shot tasks, we find that PEFT is also useful in many-shot regimes, achieving comparable or better accuracy than full FT while using significantly fewer parameters. Lastly, we investigate PEFT's ability to preserve a pre-trained model's robustness to distribution shifts (e.g., CLIP). Perhaps not surprisingly, PEFT approaches outperform full FT alone. However, with weight-space ensembles, full FT can better balance target distribution and distribution shift performance, suggesting a future research direction for PEFT.
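
The closing remark about weight-space ensembles refers to interpolating the fully fine-tuned weights with the original pre-trained weights; a minimal sketch of that interpolation is below, with the mixing coefficient alpha as an assumed hyperparameter rather than a value from the paper.

```python
import copy

def weight_space_ensemble(pretrained_model, finetuned_model, alpha=0.5):
    """Return a model whose parameters are (1 - alpha) * pretrained + alpha * fine-tuned.

    Both models must share the same architecture; larger alpha favors target-task
    accuracy, smaller alpha favors robustness to distribution shift.
    """
    ensembled = copy.deepcopy(pretrained_model)
    pre_sd = pretrained_model.state_dict()
    ft_sd = finetuned_model.state_dict()
    mixed = {k: ((1 - alpha) * pre_sd[k] + alpha * ft_sd[k])
             if pre_sd[k].is_floating_point() else ft_sd[k]   # leave integer buffers untouched
             for k in pre_sd}
    ensembled.load_state_dict(mixed)
    return ensembled
```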


Poster #402
TADFormer: Task-Adaptive Dynamic TransFormer for Efficient Multi-Task Learning

Seungmin Baek · Soyul Lee · Hayeon Jo · Hyesong Choi · Dongbo Min

The transfer learning paradigm has driven substantial advancements in various vision tasks. However, as state-of-the-art models continue to grow, classical full fine-tuning often becomes computationally impractical, particularly in the multi-task learning (MTL) setup where training complexity increases proportionally to the number of tasks. Consequently, recent studies have explored Parameter-Efficient Fine-Tuning (PEFT) for MTL architectures. Despite some progress, these approaches still exhibit limitations in capturing fine-grained, task-specific features that are crucial to MTL. In this paper, we introduce Task-Adaptive Dynamic transFormer, termed TADFormer, a novel PEFT framework that performs task-aware feature adaptation in a fine-grained manner by dynamically considering task-specific input contexts. TADFormer proposes parameter-efficient prompting for task adaptation and a Dynamic Task Filter (DTF) to capture task information conditioned on input contexts. Experiments on the PASCAL-Context benchmark demonstrate that the proposed method achieves higher accuracy in dense scene understanding tasks, while reducing the number of trainable parameters by up to 8.4 times when compared to full fine-tuning of MTL models. TADFormer also demonstrates superior parameter efficiency and accuracy compared to recent PEFT methods. Our code is available in the supplementary material.


Poster #403
LoKi: Low-dimensional KAN for Efficient Fine-tuning Image Models

Xuan Cai · Renjie Pan · Hua Yang

"Pre-training + fine-tuning" has been widely used in various downstream tasks. Parameter-efficient fine-tuning (PEFT) has demonstrated higher efficiency and promising performance compared to traditional full-tuning. The widely used adapter-based and prompt-based methods in PEFT can be uniformly represented as adding an MLP structure to the pre-trained model. These methods are prone to over-fitting in downstream tasks, due to the difference in data scale and distribution. To address this issue, we propose a new adapter-based PEFT module, i.e., LoKi, which consists of an encoder, a learnable activation layer, and a decoder. To maintain the simplicity of LoKi, we use single-layer linear networks for the encoder and decoder, and for the learnable activation layer, we use a Kolmogorov-Arnold Network (KAN) with the minimal number of layers (only 2 KAN linear layers). With a bottleneck rate much lower than that of Adapter, LoKi is equipped with fewer parameters (only half of Adapter) and eliminates the slow training speed and high memory usage of KAN. We conduct extensive experiments on LoKi under image classification and video action recognition across 9 datasets. LoKi demonstrates highly competitive generalization performance compared to other PEFT methods with fewer tunable parameters, ensuring both effectiveness and efficiency. Code will be available.

Deep learning has revolutionized computer vision, but it achieved its tremendous success using deep network architectures which are mostly hand-crafted and therefore likely suboptimal. Neural Architecture Search (NAS) aims to bridge this gap by following a well-defined optimization paradigm which systematically looks for the best architecture, given an objective criterion such as maximal classification accuracy. The main limitation of NAS is however its astronomical computational cost, as it typically requires training each candidate network architecture from scratch. In this paper, we aim to alleviate this limitation by proposing a novel training-free proxy for image classification accuracy based on Fisher Information. The proposed proxy has a strong theoretical background in statistics and it allows estimating the expected image classification accuracy of a given deep network without training the network, thus significantly reducing the computational cost of standard NAS algorithms. Our training-free proxy achieves state-of-the-art results on three public datasets and in two search spaces, both when evaluated using previously proposed metrics, as well as using a new metric that we propose which we demonstrate is more informative for practical NAS applications. The source code is publicly available.
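
As a rough illustration of a Fisher-Information-based, training-free score (not necessarily the paper's exact proxy), one can sum the squared gradients of the classification loss over a single mini-batch for an untrained network, i.e., the trace of the empirical Fisher:

```python
import torch.nn.functional as F

def empirical_fisher_score(model, images, labels):
    """Training-free score: trace of the empirical Fisher, i.e. the sum of
    squared gradients of the classification loss on one mini-batch."""
    model.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return sum((p.grad ** 2).sum().item()
               for p in model.parameters() if p.grad is not None)

# usage sketch: rank untrained candidate architectures by their score
# candidates = [build_arch(cfg) for cfg in search_space]              # hypothetical helpers
# best = max(candidates, key=lambda m: empirical_fisher_score(m, images, labels))
```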


Poster #405
Highlight
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation

Zhuguanyu Wu · Shihe Wang · Jiayi Zhang · Jiaxin Chen · Yunhong Wang

Network quantization, a prevalent technique for network compression, significantly reduces computational demands and memory usage, thereby facilitating the deployment of large-parameter models onto hardware with constrained resources. Post-training quantization (PTQ) stands out as a cost-effective and promising approach due to its avoidance of the need for retraining. Unfortunately, many current PTQ methods in Vision Transformers (ViTs) exhibit a notable decrease in accuracy, especially in low-bit cases. To tackle these challenges, we analyze the extensively utilized Hessian-guided quantization loss, and uncover certain limitations within the approximated pre-activation Hessian. By deducing the relationship between the KL divergence and the Fisher information matrix (FIM), we develop a more refined approximation for the FIM. Building on this, we introduce the Diagonal Plus Low-Rank FIM (DPLR) to achieve a more nuanced quantization loss. Our extensive experiments, conducted across various ViT-based architectures on public benchmark datasets, demonstrate that our quantization loss calculation surpasses the performance of the prevalent mean squared error (MSE) and approximated pre-activation Hessian, and outperforms previous works in low-bit cases. Code will be released upon acceptance.
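
A generic construction of a diagonal-plus-low-rank Fisher approximation from per-sample gradients is sketched below for intuition; the rank and the SVD-based factorization are assumptions of this sketch, not the paper's exact DPLR estimator.

```python
import torch

def dplr_fisher(per_sample_grads, rank=4):
    """Diagonal-plus-low-rank approximation of the empirical Fisher matrix.

    per_sample_grads: (N, P) flattened per-sample gradients.
    Returns (d, L) such that F is approximated by torch.diag(d) + L @ L.T.
    """
    a = per_sample_grads / per_sample_grads.shape[0] ** 0.5   # so that F = a.T @ a
    _, s, vh = torch.linalg.svd(a, full_matrices=False)
    low = vh[:rank].T * s[:rank]                              # top-rank modes: low @ low.T
    diag_full = (a ** 2).sum(dim=0)                           # exact diagonal of F
    diag_resid = (diag_full - (low ** 2).sum(dim=1)).clamp_min(0.0)
    return diag_resid, low

d, L = dplr_fisher(torch.randn(32, 256))                      # toy per-sample gradients
```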


Poster #406
Transformers without Normalization

Jiachen Zhu · Xinlei Chen · Kaiming He · Yann LeCun · Zhuang Liu

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. In this work, we demonstrate that strong performance can be achieved on Transformers without normalization layers, by using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(\boldsymbol{x}) = \tanh(\alpha \boldsymbol{x})$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization layers often produce tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without any normalization layers can match or exceed the performance of their normalized counterparts, mostly without tuning training hyperparameters. We validate the efficacy of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep neural networks.
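
A minimal implementation of the stated operation, with a learnable scalar alpha, might look as follows; the initialization value and the toy block around it are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Element-wise DyT(x) = tanh(alpha * x) with a learnable scalar alpha,
    used in place of a normalization layer."""
    def __init__(self, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x):
        return torch.tanh(self.alpha * x)

# drop-in usage inside a toy block where nn.LayerNorm would normally sit
block = nn.Sequential(DynamicTanh(), nn.Linear(64, 64), nn.GELU())
y = block(torch.randn(2, 16, 64))
```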


Poster #407
GroupMamba: Efficient Group-Based Visual State Space Model

Abdelrahman Shaker · Syed Talal Wasim · Salman Khan · Jürgen Gall · Fahad Shahbaz Khan

State-space models (SSMs) have recently shown promise in capturing long-range dependencies with subquadratic computational complexity, making them attractive for various applications. However, purely SSM-based models face critical challenges related to stability and achieving state-of-the-art performance in computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes. We introduce a parameter-efficient modulated group mamba layer that divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. The Modulated Group Mamba layer also wraps the four VSSS blocks into a channel modulation operator to improve cross-channel communication. Furthermore, we introduce a distillation-based training objective to stabilize the training of large models, leading to consistent performance gains. Our comprehensive experiments demonstrate the merits of the proposed contributions, leading to superior performance over existing methods for image classification on ImageNet-1K, object detection, instance segmentation on MS-COCO, and semantic segmentation on ADE20K. Our tiny variant with 23M parameters achieves state-of-the-art performance with a classification top-1 accuracy of 83.3% on ImageNet-1K, while being 26% more parameter-efficient than the best existing Mamba design of the same model size. Our code and models will be publicly released.


Poster #408
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Sanghyeok Lee · Joonmyung Choi · Hyunwoo J. Kim

For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model has emerged as an effective global token interaction with its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSM have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide the design alleviating the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training.

Existing state-of-the-art feature matchers capture long-range dependencies with Transformers but are hindered by high spatial complexity, leading to demanding training and high-latency inference. Striking a better balance between performance and efficiency remains a critical challenge in feature matching. Inspired by the linear complexity $\mathcal{O}(N)$ of Mamba, we propose an ultra-lightweight Mamba-based matcher, named JamMa, which converges on a single GPU and achieves an impressive performance-efficiency balance in inference. To unlock the potential of Mamba for feature matching, we propose Joint Mamba with a scan-merge strategy named JEGO, which enables: (1) Joint scan of two images to achieve high-frequency mutual interaction, (2) Efficient scan with skip steps to reduce sequence length, (3) Global receptive field, and (4) Omnidirectional feature representation. With the above properties, the JEGO strategy significantly outperforms the scan-merge strategies proposed in VMamba and EVMamba in the feature matching task. Compared to attention-based sparse and semi-dense matchers, JamMa demonstrates a notably superior balance between performance and efficiency, delivering better performance with less than 50% of the parameters and FLOPs.


Poster #410
Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation

Jiaxin Cai · Jingze Su · Qi Li · Wenjie Yang · Shu Wang · Tiesong Zhao · Shengfeng He · Wenxi Liu

Multimodal semantic segmentation is a critical challenge in computer vision, with early methods suffering from high computational costs and limited transferability due to full fine-tuning of RGB-based pre-trained parameters. Recent studies, while leveraging additional modalities as supplementary prompts to RGB, still predominantly rely on RGB, which restricts the full potential of other modalities. To address these issues, we propose a novel symmetric parameter-efficient fine-tuning framework for multimodal segmentation, featuring a modality-aware prompting and adaptation scheme, to simultaneously adapt the capabilities of a powerful pre-trained model to both RGB and X modalities. Furthermore, prevalent approaches use the global cross-modality correlations of the attention mechanism for modality fusion, which inadvertently introduces noise across modalities. To mitigate this noise, we propose a dynamic sparse cross-modality fusion module to facilitate effective and efficient cross-modality fusion. To further strengthen the above two modules, we propose a training strategy that leverages accurately predicted dual-modality results to self-teach the single-modality outcomes. In comprehensive experiments, we demonstrate that our method outperforms previous state-of-the-art approaches across six multimodal segmentation scenarios with minimal computation cost.


Poster #411
Mamba-Reg: Vision Mamba Also Needs Registers

Feng Wang · Jiahao Wang · Sucheng Ren · Guoyizhe Wei · Jieru Mei · Wei Shao · Yuyin Zhou · Alan L. Yuille · Cihang Xie

Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba: they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture MambaReg. Qualitative observations suggest that, compared to vanilla Vision Mamba, MambaReg's feature maps are cleaner and more focused on semantically meaningful regions. Quantitatively, MambaReg attains stronger performance and scales better. For example, on the ImageNet benchmark, our MambaReg-B attains 83.0% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size (i.e., with 340M parameters), attaining a competitive accuracy of 83.6% (84.5% if finetuned with 384x384 inputs). Additional validation on the downstream semantic segmentation task also supports MambaReg's efficacy.
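
The two modifications (evenly interleaving registers and recycling them for the final prediction) can be sketched as follows; the backbone and head calls are hypothetical placeholders, and pooling the register outputs is an assumption of this sketch.

```python
import torch
import torch.nn as nn

def insert_registers(tokens, registers):
    """Evenly interleave R register tokens into a (B, N, D) patch sequence.
    Returns the extended sequence and the indices of the register positions."""
    b, n, d = tokens.shape
    r = registers.shape[0]
    step = n // r
    chunks, reg_idx, pos = [], [], 0
    for i in range(r):
        chunk = tokens[:, i * step:(i + 1) * step] if i < r - 1 else tokens[:, i * step:]
        chunks.append(chunk)
        pos += chunk.shape[1]
        chunks.append(registers[i].expand(b, 1, d))   # one register after each chunk
        reg_idx.append(pos)
        pos += 1
    return torch.cat(chunks, dim=1), torch.tensor(reg_idx)

tokens = torch.randn(2, 196, 192)                      # patch tokens of a 14x14 grid
registers = nn.Parameter(torch.randn(4, 192))          # learnable registers
seq, reg_idx = insert_registers(tokens, registers)
# feats = backbone(seq)                                # hypothetical Vision Mamba backbone
# logits = head(feats[:, reg_idx].mean(dim=1))         # "recycle" registers for the prediction
```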


Poster #412
Rethinking Token Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks

Cheng Lei · Ao Li · Hu Yao · Ce Zhu · Le Zhang

Parameter-efficient fine-tuning (PEFT) adapts pre-trained models to new tasks by updating only a small subset of parameters, achieving efficiency but still facing significant inference costs driven by input token length. This challenge is even more pronounced in pixel-level tasks, which require longer input sequences compared to image-level tasks. Although token reduction (TR) techniques can help reduce computational demands, they often lead to homogeneous attention patterns that compromise performance in pixel-level scenarios. This study underscores the importance of maintaining attention diversity for these tasks and proposes to enhance attention diversity while ensuring the completeness of token sequences. Our approach effectively reduces the number of tokens processed within transformer blocks, improving computational efficiency without sacrificing performance on several pixel-level tasks. We also demonstrate the superior generalization capability of our proposed method compared to challenging baseline models.


Poster #413
Highlight
No Pains, More Gains: Recycling Sub-Salient Patches for Efficient High-Resolution Image Recognition

Rong Qin · Xin Liu · Xingyu Liu · Jiaxuan Liu · Jinglei Shi · Liang Lin · Jufeng Yang

Over the last decade, many notable methods have emerged to tackle the computational resource challenge of high-resolution image recognition (HRIR). They typically focus on identifying and aggregating a few salient regions for classification, discarding sub-salient areas for low training consumption. Nevertheless, many HRIR tasks necessitate the exploration of wider regions to model objects and contexts, which limits their performance in such scenarios. To address this issue, we present a DBPS strategy to enable training with more patches at low consumption. Specifically, in addition to a fundamental buffer that stores the embeddings of the most salient patches, DBPS further employs an auxiliary buffer to recycle those sub-salient ones. To reduce the computational cost associated with gradients of sub-salient patches, these patches are primarily used in the forward pass to provide sufficient information for classification. Meanwhile, only the gradients of the salient patches are back-propagated to update the entire network. Moreover, we design a Multiple Instance Learning (MIL) architecture that leverages aggregated information from salient patches to filter out uninformative background within sub-salient patches for better accuracy. Besides, we introduce random patch drop to accelerate the training process and uncover informative regions. Experimental results demonstrate the superiority of our method in terms of both accuracy and training consumption against other advanced methods. The code is available in the supplementary materials and will be publicly available.
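
The core idea of back-propagating only through salient patches while sub-salient patches contribute forward-only features can be sketched as below; the mean aggregation stands in for the paper's MIL architecture and is an assumption of this sketch.

```python
import torch
import torch.nn as nn

def dual_buffer_forward(encoder, head, salient, sub_salient):
    """Sub-salient patches contribute features without gradients, so the
    backward pass only touches the salient patches and the head."""
    sal_emb = encoder(salient)                        # gradients kept
    with torch.no_grad():
        sub_emb = encoder(sub_salient)                # forward-only features
    all_emb = torch.cat([sal_emb, sub_emb], dim=1)    # (B, P_sal + P_sub, D)
    bag = all_emb.mean(dim=1)                         # simple stand-in for MIL aggregation
    return head(bag)

encoder = nn.Linear(128, 64)                          # toy patch encoder
head = nn.Linear(64, 21)
logits = dual_buffer_forward(encoder, head,
                             torch.randn(2, 8, 128),   # salient patches
                             torch.randn(2, 32, 128))  # sub-salient patches
```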


Poster #414
Language Guided Concept Bottleneck Models for Interpretable Continual Learning

Lu Yu · HaoYu Han · Zhe Tao · Hantao Yao · Changsheng Xu

Continual learning (CL) aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. CL faces the challenge of mitigating catastrophic forgetting while maintaining interpretability across tasks. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. However, as new information is introduced, the interpretability of the learning process becomes crucial for understanding the evolving decision-making process, yet it is rarely explored. In this paper, we introduce a novel framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both challenges. Our approach leverages the Concept Bottleneck Layer, aligning semantic consistency with CLIP models to learn human-understandable concepts that can generalize across tasks. By focusing on interpretable concepts, our method not only enhances the model’s ability to retain knowledge over time but also provides transparent decision-making insights. We demonstrate the effectiveness of our approach by achieving superior performance on several datasets, outperforming state-of-the-art methods with an improvement of up to 3.06% in final average accuracy on ImageNet-subset. Additionally, we offer concept visualizations for model predictions, further advancing the understanding of interpretable continual learning. Code will be released upon acceptance.
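
A concept bottleneck head of the kind described (image features scored against frozen concept text embeddings, then a linear classifier over the concept activations) can be sketched as follows; the CLIP-style embeddings and the dimensions are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckHead(nn.Module):
    """Image features -> concept activations (similarity to frozen concept text
    embeddings) -> linear class prediction, so decisions factor through concepts."""
    def __init__(self, concept_text_feats, num_classes):
        super().__init__()
        self.register_buffer("concepts", F.normalize(concept_text_feats, dim=-1))
        self.classifier = nn.Linear(concept_text_feats.shape[0], num_classes)

    def forward(self, image_feats):
        scores = F.normalize(image_feats, dim=-1) @ self.concepts.t()  # concept activations
        return self.classifier(scores), scores

head = ConceptBottleneckHead(torch.randn(50, 512), num_classes=100)   # 50 toy concepts
logits, concept_scores = head(torch.randn(4, 512))                    # CLIP-like image features
```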


Poster #415
Highlight
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Shenghao Fu · Qize Yang · Qijie Mo · Junkai Yan · Xihan Wei · Jingke Meng · Xiaohua Xie · Wei-Shi Zheng

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-training with a large language model by generating image-level detailed captions for each image can further improve performance. To achieve the goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset will be available.


Poster #416
Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

Yongkang Li · Tianheng Cheng · Bin Feng · Wenyu Liu · Xinggang Wang

Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, e.g., CLIP, to classify these masks via mask pooling. Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce Mask-Adapter, a simple yet effective method to address these challenges in open-vocabulary segmentation. Compared to directly using proposal masks, our proposed Mask-Adapter extracts semantic activation maps from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP. Additionally, we propose a mask consistency loss that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models' robustness to varying predicted masks. Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. Extensive experiments across several zero-shot benchmarks demonstrate significant performance gains for the proposed Mask-Adapter on several well-established methods. Notably, Mask-Adapter also extends effectively to SAM and achieves impressive results on several open-vocabulary segmentation datasets. Code and models will be made publicly available.
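
For reference, the mask-pooling baseline that the paper critiques can be sketched as below: CLIP image features are averaged inside each proposal mask and matched against class text embeddings (the temperature tau and the tensor shapes are assumptions of this sketch).

```python
import torch
import torch.nn.functional as F

def mask_pool_classify(feat_map, masks, text_feats, tau=100.0):
    """Average CLIP image features inside each proposal mask and match the
    pooled vectors against class text embeddings.

    feat_map: (B, D, H, W); masks: (B, M, H, W) in [0, 1]; text_feats: (K, D).
    Returns (B, M, K) classification logits.
    """
    weights = masks / masks.sum(dim=(-2, -1), keepdim=True).clamp_min(1e-6)
    pooled = torch.einsum("bdhw,bmhw->bmd", feat_map, weights)   # per-mask average feature
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    return tau * pooled @ text.t()

logits = mask_pool_classify(torch.randn(1, 512, 32, 32),
                            torch.rand(1, 5, 32, 32),
                            torch.randn(20, 512))
```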


Poster #417
Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation

Zelin Peng · Zhengqin Xu · Zhilin Zeng · Yu Huang · Yaoming Wang · Wei Shen

Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1) high computational cost, 2) misalignment between the two inherent modalities of CLIP, and 3) degraded generalization ability on unseen categories. To address these issues, we propose H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both CLIP modalities. Specifically, the PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Since the PEFT strategy is conducted symmetrically to the two CLIP modalities, the misalignment between them is mitigated. Furthermore, we apply an additional constraint to PEFT on the CLIP text encoder according to the hyperspherical energy principle, i.e., minimizing hyperspherical energy during fine-tuning preserves the intrinsic structure of the original parameter space, to prevent the destruction of the generalization ability offered by the CLIP text encoder. Extensive evaluations across various benchmarks show that H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while only requiring updating approximately 4% of the total parameters of CLIP.
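
The hyperspherical energy constraint mentioned above is commonly written as a sum of inverse pairwise distances between unit-normalized weight vectors; a minimal sketch of such a regularizer (with an inverse-distance kernel and the regularization weight as assumptions) is below.

```python
import torch
import torch.nn.functional as F

def hyperspherical_energy(weight, eps=1e-6):
    """Sum of inverse distances between unit-normalized rows of a weight matrix;
    minimizing it keeps weight directions well spread on the hypersphere."""
    w = F.normalize(weight, dim=-1)                    # (N, D) unit vectors
    dist = torch.cdist(w, w)                           # pairwise Euclidean distances
    off_diag = ~torch.eye(w.shape[0], dtype=torch.bool, device=w.device)
    return (1.0 / (dist[off_diag] + eps)).sum()

# usage as a regularizer added to the fine-tuning loss, scaled by an assumed weight
reg = 1e-4 * hyperspherical_energy(torch.randn(64, 512))
```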

Document image segmentation is crucial in document analysis and recognition but remains challenging due to the heterogeneity of document formats and diverse segmentation tasks. Existing methods often treat these tasks separately, leading to limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework for various document image segmentation tasks, including document layout analysis, multi-granularity text segmentation, and table structure recognition by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM uses a Sentence BERT to map category names from each dataset into semantic queries of the same dimension as instance queries. These queries interact through attention mechanisms and are cross-attended with image features to predict instance and semantic segmentation masks. To predict instance categories, instance queries are dot-producted with semantic queries, and scores are normalized using softmax. As a result, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computing and storage resources. Comprehensive evaluations show that DocSAM outperforms existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation in various applications.
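
The instance-category prediction step described above (dot product between instance queries and semantic queries, followed by softmax) reduces to a few lines; the tensor shapes are assumptions of this sketch.

```python
import torch

def classify_instances(instance_queries, semantic_queries):
    """Dot product between instance queries (B, Q, D) and per-category semantic
    queries (B, C, D), normalized with softmax into (B, Q, C) probabilities."""
    scores = torch.einsum("bqd,bcd->bqc", instance_queries, semantic_queries)
    return scores.softmax(dim=-1)

probs = classify_instances(torch.randn(2, 100, 256), torch.randn(2, 21, 256))
```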


Poster #419
Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning

Kunyu Wang · Xueyang Fu · Xin Lu · Chengjie Ge · Chengzhi Cao · Wei Zhai · Zheng-Jun Zha

Continual test-time adaptive object detection (CTTA-OD) aims to online adapt a source pre-trained detector to ever-changing environments during inference under continuous domain shifts. Most existing CTTA-OD methods prioritize effectiveness while overlooking computational efficiency, which is crucial for resource-constrained scenarios. In this paper, we propose an efficient CTTA-OD method via pruning. Our motivation stems from the observation that not all learned source features are beneficial; certain domain-sensitive feature channels can adversely affect target domain performance. Inspired by this, we introduce a sensitivity-guided channel pruning strategy that quantifies each channel based on its sensitivity to domain discrepancies at both image and instance levels. We apply weighted sparsity regularization to selectively suppress and prune these sensitive channels, focusing adaptation efforts on invariant ones. Additionally, we introduce a stochastic channel reactivation mechanism to restore pruned channels, enabling recovery of potentially useful features and mitigating the risks of early pruning. Extensive experiments on three benchmarks show that our method achieves superior adaptation performance while reducing computational overhead by 12% in FLOPs compared to the recent SOTA method.
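
A rough sketch of sensitivity-weighted sparsity followed by magnitude-based channel pruning is shown below; using batch-norm scale factors as the pruning handle and the keep ratio are assumptions of this sketch, not the paper's exact procedure.

```python
import torch

def weighted_sparsity_penalty(bn_gamma, sensitivity, lam=1e-4):
    """Weighted L1 sparsity on per-channel scale factors: channels with higher
    domain sensitivity are pushed harder toward zero."""
    return lam * (sensitivity * bn_gamma.abs()).sum()

def prune_mask(bn_gamma, keep_ratio=0.88):
    """Keep the channels with the largest remaining scale magnitude."""
    k = max(1, int(keep_ratio * bn_gamma.numel()))
    threshold = bn_gamma.abs().topk(k).values.min()
    return bn_gamma.abs() >= threshold

gamma = torch.randn(256).abs()                       # toy per-channel scale factors
penalty = weighted_sparsity_penalty(gamma, sensitivity=torch.rand(256))
mask = prune_mask(gamma)                             # boolean mask of channels to keep
```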


Poster #420
Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Chanyoung Kim · Dayun Ju · Woojung Han · Ming-Hsuan Yang · Seong Jae Hwang

Open-Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision-language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training-free methods offer scalable, easily deployable solutions for handling unseen data, a key goal of OVSS. Yet, a critical issue persists: lack of object-level context consideration when segmenting complex objects in the challenging environment of OVSS based on arbitrary query prompts. This oversight limits models' ability to group semantically consistent elements within an object and map them precisely to user-defined arbitrary classes. In this work, we introduce a novel approach that overcomes this limitation by incorporating object-level contextual knowledge within images. Specifically, our model enhances intra-object consistency by distilling spectral-driven features from vision foundation models into the attention mechanism of the visual encoder, enabling semantically coherent components to form a single object mask. Additionally, we refine the text embeddings with zero-shot object presence likelihood to ensure accurate alignment with the specific objects represented in the images. By leveraging object-level contextual knowledge, our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets. All the attached source code will be made available to the public.


Poster #421
FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation

Dong Zhao · Jinlong Li · Shuang Wang · Mengyao Wu · Qi Zang · Nicu Sebe · Zhun Zhong

Vision Foundation Models (VFMs) excel in generalization due to large-scale pretraining, but fine-tuning them for Domain Generalized Semantic Segmentation (DGSS) while maintaining this ability remains challenging. Existing approaches either selectively fine-tune parameters or freeze the VFMs and update only the adapters, both of which may underutilize the VFMs' full potential in DGSS tasks. We observe that domain-sensitive parameters in VFMs, arising from task and distribution differences, can hinder generalization. To address this, we propose FisherTune, a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and enhance DGSS adaptability. FisherTune incorporates variational inference to stabilize DR-FIM estimation, treating parameters as Gaussian-distributed variables and leveraging pre-trained priors. Extensive experiments show that FisherTune achieves superior cross-domain segmentation while maintaining generalization, outperforming selective-parameter and adapter-based methods.


Poster #422
POT: Prototypical Optimal Transport for Weakly Supervised Semantic Segmentation

Jian Wang · Tianhong Dai · Bingfeng Zhang · Siyue Yu · ENG GEE LIM · Jimin Xiao

Weakly Supervised Semantic Segmentation (WSSS) leverages Class Activation Maps (CAMs) to extract spatial information from image-level labels. However, CAMs primarily highlight the most discriminative foreground regions, leading to incomplete results. Prototype-based methods attempt to address this limitation by employing prototype CAMs instead of classifier CAMs. Nevertheless, existing prototype-based methods typically use a single prototype for each class, which is insufficient to capture all attributes of the foreground features due to the significant intra-class variations across different images. Consequently, these methods still struggle with incomplete CAM predictions. In this paper, we propose a novel framework called Prototypical Optimal Transport (POT) for WSSS. POT enhances CAM predictions by dividing features into multiple clusters and activating them separately using multiple cluster prototypes. In this process, a similarity-aware optimal transport is employed to assign features to the most probable clusters. This similarity-aware strategy ensures the prioritization of significant cluster prototypes, thereby improving the accuracy of feature assignment. Additionally, we introduce an adaptive OT-based consistency loss to refine feature representations. This framework effectively overcomes the limitations of single-prototype methods, providing more complete and accurate CAM predictions. Extensive experimental results on standard WSSS benchmarks (PASCAL VOC and MS COCO) demonstrate that our method significantly improves the quality of CAMs and achieves state-of-the-art performance. The source code will be released.
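
For readers unfamiliar with optimal-transport assignment, the sketch below shows a plain entropic (Sinkhorn) assignment of features to several cluster prototypes with uniform marginals. The paper's similarity-aware weighting and its consistency loss are not reproduced here, and all shapes, the temperature, and the random data are placeholders.

import numpy as np

def sinkhorn(cost, n_iter=50, eps=0.05):
    # Entropic OT with uniform marginals; returns a soft assignment (transport plan).
    K = np.exp(-cost / eps)                        # (n_features, n_prototypes)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))              # hypothetical foreground pixel features
prototypes = rng.normal(size=(4, 64))              # several cluster prototypes for one class

# Cost from negative cosine similarity, so similar feature-prototype pairs are cheap.
f = features / np.linalg.norm(features, axis=1, keepdims=True)
p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
cost = 1.0 - f @ p.T

plan = sinkhorn(cost)
assignment = plan.argmax(axis=1)                   # most probable cluster for each feature
print("features per cluster:", np.bincount(assignment, minlength=4))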


Poster #423
FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding

Thanh-Dat Truong · Utsav Prabhu · Bhiksha Raj · Jackson Cothren · Khoa Luu

Continual Learning in semantic scene segmentation aims to continually learn new unseen classes in dynamic environments while maintaining previously learned knowledge. Prior studies focused on modeling the catastrophic forgetting and background shift challenges in continual learning. However, fairness, another major challenge that causes unfair predictions leading to low performance among major and minor classes, still needs to be well addressed. In addition, prior methods have yet to model the unknown classes well, thus resulting in producing non-discriminative features among unknown classes. This work presents a novel Fairness Learning via Contrastive Attention Approach to continual learning in semantic scene understanding. In particular, we first introduce a new Fairness Contrastive Clustering loss to address the problems of catastrophic forgetting and fairness. Then, we propose an attention-based visual grammar approach to effectively model the background shift problem and unknown classes, producing better feature representations for different unknown classes. Through our experiments, our proposed approach achieves State-of-the-Art (SoTA) performance on different continual learning benchmarks, i.e., ADE20K, Cityscapes, and Pascal VOC. It promotes the fairness of the continual semantic segmentation model.


Poster #424
WISNet: Pseudo Label Generation on Unbalanced and Patch Annotated Waste Images

Shifan Zhang · Hongzi Zhu · Yinan He · Minyi Guo · Ziyang Lou · Shan Chang

Computer-vision-based assessment of waste sorting is desired to replace manual supervision in Shanghai. Because labeling a multitude of waste images is prohibitively laborious, it is infeasible to train a semantic segmentation model for this purpose directly. In this work, we construct a new dataset consisting of 12,208 waste images, upon which seed regions (i.e., patches) are annotated and classified into 21 categories in a crowdsourcing fashion. To obtain pixel-level labels to train an effective segmentation model, we propose a weakly-supervised waste image pseudo label generation scheme, called WISNet. Specifically, we train a cohesive feature extractor with contrastive prototype learning, incorporating an unsupervised classification pretext task to help the extractor focus on more discriminative regions even within the same category. Furthermore, we propose an effective iterative patch expansion method to generate accurate pixel-level pseudo labels. Given these generated pseudo labels, a few-shot segmentation model can be trained to segment waste images. We implement and deploy WISNet in two real-world scenarios and conduct intensive experiments. Results show that WISNet can achieve a state-of-the-art 40.2% final segmentation mIoU on our waste benchmark, outperforming all other baselines and demonstrating the efficacy of WISNet.


Poster #425
Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

Tian Liu · Huixin Zhang · Shubham Parashar · Shu Kong

Few-shot recognition (FSR) aims to train a classification model with only a few labeled examples of each concept concerned by a downstream task, where data annotation cost can be prohibitively high. We develop methods to solve FSR by leveraging a pretrained Vision-Language Model (VLM). We particularly explore retrieval-augmented learning (RAL), which retrieves data from the VLM's pretraining set to learn better models for serving downstream tasks. RAL has been widely studied in zero-shot recognition but remains under-explored in FSR. Although applying RAL to FSR may seem straightforward, we observe interesting and novel challenges and opportunities. First, somewhat surprisingly, finetuning a VLM on a large amount of retrieved data underperforms state-of-the-art zero-shot methods. This is due to the imbalanced distribution of retrieved data and its domain gaps with the few-shot examples in the downstream task. Second, more surprisingly, we find that simply finetuning a VLM solely on few-shot examples significantly outperforms previous FSR methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issues, we propose Stage-Wise retrieval-Augmented fineTuning (SWAT), which involves end-to-end finetuning on mixed data in the first stage and retraining the classifier on the few-shot data in the second stage. Extensive experiments on nine popular benchmarks demonstrate that SWAT significantly outperforms previous methods by $>$6\% accuracy.


Poster #426
Highlight
Compositional Caching for Training-free Open-vocabulary Attribute Detection

Marco Garosi · Alessandro Conti · Gaowen Liu · Elisa Ricci · Massimiliano Mancini

Attribute detection is crucial for many computer vision tasks, as it enables systems to describe properties such as color, texture, and material. Current approaches often rely on labor-intensive annotation processes which are inherently limited: objects can be described at an arbitrary level of detail (e.g., color vs. color shades), leading to ambiguities when the annotators are not instructed carefully. Furthermore, they operate within a predefined set of attributes, reducing scalability and adaptability to unforeseen downstream applications. We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes these constraints. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images by leveraging web-scale databases and Large Language Models to determine attribute-object compatibility. To account for the compositional nature of attributes, cache images receive soft attribute labels. Those are aggregated at inference time based on the similarity between the input and cache images, refining the predictions of underlying Vision-Language Models (VLMs). Importantly, our approach is model-agnostic, compatible with various VLMs. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines, competing with recent training-based methods, proving that a carefully designed training-free approach can successfully address open-vocabulary attribute detection.
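
The following toy sketch illustrates the cache-aggregation step described above: soft attribute labels of cached images are combined according to their similarity to the input image and then fused with the VLM's zero-shot scores. The embeddings, temperature, and fusion weights are synthetic assumptions for illustration, not ComCa's actual values.

import numpy as np

rng = np.random.default_rng(0)
n_cache, n_attr, dim = 50, 10, 512

cache_embs = rng.normal(size=(n_cache, dim))                # embeddings of cached web images
cache_soft = rng.dirichlet(np.ones(n_attr), size=n_cache)   # soft attribute labels of the cache
query_emb = rng.normal(size=(dim,))                         # embedding of the test image

def unit(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

# Similarity of the input image to every cache image drives the aggregation.
sims = unit(cache_embs) @ unit(query_emb)
weights = np.exp(sims / 0.07)                               # 0.07: illustrative temperature
weights /= weights.sum()

cache_scores = weights @ cache_soft                         # similarity-weighted attribute scores
vlm_scores = rng.random(n_attr)                             # stand-in for VLM zero-shot scores
final_scores = 0.5 * vlm_scores + 0.5 * cache_scores        # refine the VLM predictions
print("top-3 attributes:", final_scores.argsort()[::-1][:3])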


Poster #427
Open Ad-hoc Categorization with Contextualized Feature Learning

Zilin Wang · Sangwoo Mo · Stella X. Yu · Sima Behpour · Liu Ren

Unlike common categories for plants and animals, ad-hoc categories such as things to sell at a garage sale are created to help people achieve a certain task. Likewise, AI agents need to adaptively categorize visual scenes in response to changing tasks. We thus study open ad-hoc categorization, where we learn to infer novel concepts and name images according to a varying categorization purpose, a few labeled exemplars, and many unlabeled images. We develop a simple method that combines top-down text guidance (CLIP) with bottom-up image clustering (GCD) to learn contextualized visual features and align visual clusters with CLIP semantics, enabling predictions for both known and novel classes. Benchmarked on the multi-label datasets Stanford and Clevr-4, our method, called OAK, significantly outperforms baselines in providing accurate predictions across contexts and identifying novel concepts, e.g., it achieves 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. OAK offers interpretable saliency maps, focusing on hands, faces, and backgrounds for the Action, Mood, and Location contexts, respectively.


Poster #428
MOS: Modeling Object-Scene Associations in Generalized Category Discovery

Zhengyuan Peng · Jinpeng Ma · Zhimin Sun · Ran Yi · Haichuan Song · Xin Tan · Lizhuang Ma

Generalized Category Discovery (GCD) is a classification task that aims to classify both base and novel classes in unlabeled images, using knowledge from a labeled dataset. In GCD, previous research typically treats scene information as noise and minimizes its influence during model training. However, in this paper, we argue that scene information should not be treated as noise, but rather recognized as a strong prior for inferring novel classes. We attribute the misinterpretation of scene information to a key factor: the Ambiguity Challenge inherent in GCD. Specifically, novel objects in base scenes might be wrongly classified into base categories, while base objects in novel scenes might be mistakenly recognized as novel categories. Once the ambiguity challenge is addressed, scene information can reach its full potential, significantly enhancing the performance of GCD models. To more effectively leverage scene information, we propose the Modeling Object-Scene Associations (MOS) framework, which utilizes a simple MLP-based scene-awareness module to enhance GCD performance. It achieves an exceptional average accuracy improvement of 4\% on challenging fine-grained datasets compared to state-of-the-art methods, emphasizing its superior performance in GCD tasks.


Poster #429
Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval

Mankeerat Sidhu · Hetarth Chopra · Ansel Blume · Jeonghwan Kim · Revanth Gangi Reddy · Heng Ji

In this paper, we introduce SearchDet, a training-free long-tail object detection framework that significantly enhances open-vocabulary object detection performance. SearchDet retrieves a set of positive and negative images of an object to ground, embeds these images, and computes an input-image-weighted query which is used to detect the desired concept in the image. Our proposed method is simple and training-free, yet achieves mAP improvements of over 16.81\% on ODinW and 59.85\% on LVIS compared to state-of-the-art models such as GroundingDINO. We further show that our approach of basing object detection on a set of Web-retrieved exemplars is stable with respect to variations in the exemplars, suggesting a path towards eliminating costly data annotation and training procedures.
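
A minimal sketch of the exemplar-weighted query idea follows. It assumes, purely for illustration, that each retrieved exemplar is weighted by a softmax over its similarity to the input image and that the query is the weighted positives minus the weighted negatives; the actual SearchDet pipeline and its grounding step are not reproduced.

import numpy as np

rng = np.random.default_rng(0)
dim = 256

def unit(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

img_emb = unit(rng.normal(size=(dim,)))      # embedding of the input image
pos = unit(rng.normal(size=(8, dim)))        # web-retrieved positive exemplars
neg = unit(rng.normal(size=(8, dim)))        # web-retrieved negative exemplars

def exemplar_weights(exemplars):
    # Weight each exemplar by its similarity to the input image (softmax over exemplars).
    s = exemplars @ img_emb
    w = np.exp(s / 0.1)
    return w / w.sum()

# Input-image-weighted query: weighted positives minus weighted negatives.
query = unit(exemplar_weights(pos) @ pos - exemplar_weights(neg) @ neg)

# Score hypothetical region embeddings against the query to locate the concept.
regions = unit(rng.normal(size=(20, dim)))
scores = regions @ query
print("best-matching region:", scores.argmax())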


Poster #430
Fractal Calibration for Long-tailed Object Detection

Konstantinos Alexandridis · Ismail Elezi · Jiankang Deng · Anh Nguyen · Shan Luo

Real-world datasets follow an imbalanced distribution, which poses significant challenges in rare-category object detection. Recent studies tackle this problem by developing re-weighting and re-sampling methods that utilise the class frequencies of the dataset. However, these techniques focus solely on the frequency statistics and ignore the distribution of the classes in image space, missing important information. In contrast, we propose Fractal CALibration (FRACAL): a novel post-calibration method for long-tailed object detection. FRACAL devises a logit adjustment method that utilises the fractal dimension to estimate how uniformly classes are distributed in image space. During inference, it uses the fractal dimension to inversely downweight the probabilities of uniformly spaced class predictions, achieving balance in two axes: between frequent and rare categories, and between uniformly spaced and sparsely spaced classes. FRACAL is a post-processing method that does not require any training, and it can be combined with many off-the-shelf models such as one-stage sigmoid detectors and two-stage instance segmentation models. FRACAL boosts the rare class performance by up to 8.6% and surpasses all previous methods on the LVIS dataset, while showing good generalisation to other datasets such as COCO, V3Det and OpenImages. We provide the code in the Appendix.
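
As an illustration of the box-counting ingredient only, the snippet below estimates a per-class fractal dimension from 2D object-center locations and then down-weights the logit of the more uniformly spread class. The grid sizes, the synthetic point sets, and the adjustment strength are assumptions, since the abstract does not give the exact calibration rule.

import numpy as np

rng = np.random.default_rng(0)

def box_counting_dim(points, sizes=(2, 4, 8, 16, 32)):
    # points: (n, 2) normalized object-center locations in [0, 1)^2.
    counts = [len(np.unique(np.floor(points * s).astype(int), axis=0)) for s in sizes]
    # Slope of log(occupied boxes) vs log(grid size) estimates the fractal dimension.
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return slope

uniform_pts = rng.random((500, 2))                  # a class spread over the whole image plane
clustered_pts = 0.05 * rng.random((500, 2)) + 0.5   # a class concentrated in one small area

d_uniform = box_counting_dim(uniform_pts)
d_clustered = box_counting_dim(clustered_pts)
print("fractal dims:", d_uniform, d_clustered)      # larger for the uniformly spread class

# Inverse down-weighting: the more uniformly spread a class, the more its logit is reduced.
logits = np.array([2.0, 2.0])                       # [uniform class, clustered class]
dims = np.array([d_uniform, d_clustered])
calibrated = logits - 0.5 * dims                    # 0.5: illustrative adjustment strength
probs = np.exp(calibrated) / np.exp(calibrated).sum()
print("calibrated probabilities:", probs)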


Poster #431
Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers

Quentin Guimard · Moreno D'Incà · Massimiliano Mancini · Elisa Ricci

A person downloading a pre-trained model from the web should be aware of its biases. Existing approaches for bias identification rely on datasets containing labels for the task of interest, something that a non-expert may not have access to, or may not have the necessary resources to collect, which greatly limits the number of tasks where model biases can be identified. In this work, we develop Classifier-to-Bias (C2B), the first bias discovery framework that works without access to any labeled data: it only relies on a textual description of the classification task to identify biases in the target classification model. This description is fed to a large language model to generate bias proposals and corresponding captions depicting those together with task-specific target labels. A text-to-image retrieval model collects images for those captions, which are then used to assess the accuracy of the model w.r.t. the given biases. C2B is training-free, does not require any annotations, has no constraints on the list of biases, and can be applied to detect biases for any pre-trained model on any classification task. Experiments on two publicly available datasets show that C2B discovers biases beyond those of the original datasets and outperforms a recent state-of-the-art bias detection baseline that relies on task-specific annotations, being a promising first step toward addressing task-agnostic unsupervised bias detection.


Poster #432
DEIM: DETR with Improved Matching for Fast Convergence

Shihua Huang · Zhichao Lu · Xiaodong Cun · Yongjun YU · Xiao Zhou · Xi Shen

We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50\%. Notably, paired with RT-DETRv2, DEIM achieves 53.2\% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7\% and 56.4\% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code will be made available upon publication.


Poster #433

Despite its prevalent use in image-text matching tasks in a zero-shot manner, CLIP has been shown to be highly vulnerable to adversarial perturbations added onto images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to 'falsely stable' images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time, which is orthogonal to existing methods aiming to boost zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets, and demonstrate stable and consistent gains compared to test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be employed on CLIP models that have been adversarially finetuned to further enhance their robustness at test time. Our code will be released.


Poster #434
FlexUOD: The Answer to Real-world Unsupervised Image Outlier Detection

Zhonghang Liu · Kun Zhou · Changshuo Wang · Daniel Lin · Jiangbo Lu

How many outliers are within an unlabeled and contaminated dataset? Although a series of unsupervised outlier detection (UOD) approaches have been proposed, they cannot correctly answer this critical question, resulting in performance instability across real-world scenarios with varying contamination factors. To address this problem, we propose FlexUOD, with a novel contamination factor estimation perspective. FlexUOD is not only remarkably robust but also a general, plug-and-play framework, which can significantly improve the performance of existing UOD methods. Extensive experiments demonstrate that FlexUOD achieves state-of-the-art results as well as high efficacy on diverse evaluation benchmarks.


Poster #435
UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection

Zhaopeng Gu · Bingke Zhu · Guibo Zhu · Yingying Chen · Ming Tang · Jinqiao Wang

Visual Anomaly Detection (VAD) aims to identify abnormal samples in images that deviate from normal patterns, covering multiple domains, including industrial, logical, and medical fields. Due to the domain gaps between these fields, existing VAD methods are typically tailored to each domain, with specialized detection techniques and model architectures that are difficult to generalize across different domains. Moreover, even within the same domain, current VAD approaches often follow a "one-category-one-model" paradigm, requiring large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, such as industrial, logical, and medical anomalies, with a training-free unified model. UniVAD only needs a few normal samples as references during testing to detect anomalies in previously unseen objects without training on the specific domain. Specifically, UniVAD employs a Contextual Component Clustering ($\text{C}^3$) module based on clustering and vision foundation models to segment components within the image accurately, and leverages Component-Aware Patch Matching (CAPM) and Graph-Enhanced Component Modeling (GECM) modules to detect anomalies at different semantic levels, which are aggregated to produce the final detection result. We conduct experiments on nine datasets spanning industrial, logical, and medical fields, and the results demonstrate that UniVAD achieves state-of-the-art performance in few-shot anomaly detection tasks across multiple domains, outperforming domain-specific anomaly detection models.


Poster #436
Towards Training-free Anomaly Detection with Vision and Language Foundation Models

Jinjin Zhang · Guodong Wang · yizhou jin · Di Huang

Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e., GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code will be made publicly available soon.


Poster #437
Real-IAD D³: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection

wenbing zhu · Lidong Wang · Ziqing Zhou · Chengjie Wang · Yurui Pan · Ruoyi.Zhang · Zhuhao Chen · Linjie Cheng · Bin-Bin Gao · Jiangning Zhang · Zhenye Gan · Yuxie Wang · Yulong Chen · Bruce Qian · Mingmin Chi · Bo Peng · Lizhuang Ma

The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D³, a high-precision multimodal dataset that uniquely incorporates an additional pseudo-3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. Real-IAD D³ comprises industrial components with smaller dimensions and finer defects than existing datasets, offering diverse anomalies across modalities and presenting a more challenging benchmark for multimodal IAD research. With 20 product categories, the dataset offers significantly greater scale and diversity compared to current alternatives. Additionally, we introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality, enhancing detection performance. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance. The Real-IAD D³ dataset will be publicly available to advance research and innovation in multimodal IAD.


Poster #438
DFM: Differentiable Feature Matching for Anomaly Detection

Wu Sheng · Yimi Wang · Xudong Liu · Yuguang Yang · Runqi Wang · Guodong Guo · David Doermann · Baochang Zhang

Feature matching methods for unsupervised anomaly detection have demonstrated impressive performance. Existing methods primarily rely on self-supervised training and handcrafted matching schemes for task adaptation. However, they can only achieve an inferior feature representation for anomaly detection because the feature extraction and matching modules are separately trained. To address these issues, we propose a Differentiable Feature Matching (DFM) framework for joint optimization of the feature extractor and the matching head. DFM transforms nearest-neighbor matching into a pooling-based module and embeds it within a Feature Matching Network (FMN). This design enables end-to-end feature extraction and feature matching module training, thus providing better feature representation for anomaly detection tasks. DFM is generic and can be incorporated into existing feature-matching methods. We implement DFM with various backbones and conduct extensive experiments across various tasks and datasets, demonstrating its effectiveness. Notably, we achieve state-of-the-art results in the continual anomaly detection task with instance-AUROC improvement of up to 3.9% and pixel-AP improvement of up to 5.5%.


Poster #439
Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression

Xiaoyi Qu · David Aponte · Colby Banbury · Daniel Robinson · Tianyu Ding · Kazuhito Koishida · Ilya Zharkov · Tianyi Chen

Structured pruning and quantization are fundamental techniques used to reduce the size of neural networks, and typically are applied independently. Applying these techniques jointly via co-optimization has the potential to produce smaller, high-quality models. However, existing joint schemas are not widely used because of (1) engineering difficulties (complicated multi-stage processes and hardware inefficiencies), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any deep neural network. GETA introduces three key innovations: (i) a quantization-aware dependency graph analysis that constructs a pruning search space, (ii) a partially projected stochastic gradient method that guarantees a layerwise bit constraint is satisfied, and (iii) a new joint learning strategy that incorporates interpretable relationships between pruning and quantization. We present numerical experiments on both convolutional neural networks and transformer architectures that show that our approach achieves competitive (often superior) performance compared to state-of-the-art joint pruning and quantization methods.


Poster #440
Highlight
OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation

Xiao Cui · Yulei Qin · Wengang Zhou · Hongsheng Li · Houqiang Li

The demands for increasingly large-scale datasets pose substantial storage and computation challenges to building deep learning models. Dataset distillation methods, especially those via sample generation techniques, rise in response to condensing large original datasets into small synthetic ones while preserving critical information. Existing subset synthesis methods simply minimize the homogeneous distance where uniform contributions from all real instances are allocated to shaping each synthetic sample. We demonstrate that such equal allocation fails to consider the instance-level relationship between each real-synthetic pair and gives rise to insufficient modeling of geometric structural nuances between the distilled and original sets. In this paper, we propose a novel framework named OPTICAL to reformulate the homogeneous distance minimization into a bi-level optimization problem via matching-and-approximating. In the matching step, we leverage the optimal transport matrix to dynamically allocate contributions from real instances. Subsequently, we polish the generated samples in accordance with the established allocation scheme for approximating the real ones. Such a strategy better measures intricate geometric characteristics and handles intra-class variations for high fidelity of data distillation. Extensive experiments across seven datasets and three model architectures demonstrate our method's versatility and effectiveness. Its plug-and-play characteristic makes it compatible with a wide range of distillation frameworks. Codes are available at https://anonymous.4open.science/r/CVPR2025_696.


Poster #441
Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval

Yushuai Sun · Zikun Zhou · Dongmei Jiang · Yaowei Wang · Jun Yu · Guangming Lu · Wenjie Pei

Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited flexibility for multi-platform deployment. For example, when introducing a new platform into the retrieval systems, developers have to train an additional model at an appropriate capacity that is compatible with existing models via backward-compatible learning. In this paper, we propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity through post-training pruning. Thus it allows the creation of a sparse subnetwork matching the resources of the new platform without additional training. Specifically, we optimize both the architecture and weight of subnetworks at different capacities within a dense network in compatible learning. We also design a conflict-aware gradient integration scheme to handle the gradient conflicts between the dense network and subnetworks during compatible learning. Extensive experiments on diverse benchmarks and visual backbones demonstrate the effectiveness of our method. The code will be made publicly available.


Poster #442
Highlight
Less is More: Efficient Model Merging with Binary Task Switch

Biqing Qi · Fangyuan Li · Zhen Wang · Junqi Gao · Dong Li · Peng Ye · Bowen Zhou

As an effective approach to equip models with multi-task capabilities without additional training, model merging has garnered significant attention. However, existing merging methods face challenges of redundant parameter conflicts and the excessive storage burden of fine-tuned parameters. In this work, through controlled experiments, we reveal that for fine-tuned task vectors, only those parameters with magnitudes above a certain threshold contribute positively to the task, exhibiting a pulse-like characteristic. We then attempt to leverage this pulse-like characteristic to binarize the task vectors and reduce storage overhead. Further controlled experiments show that the binarized task vectors incur almost no decrease in fine-tuning and merging performance, and even exhibit stronger performance improvements as the proportion of redundant parameters increases. Based on these insights, we propose Task Switch (T-Switch), which decomposes task vectors into three components: 1) an activation switch instantiated by a binarized mask vector, 2) a polarity switch instantiated by a binarized sign vector, and 3) a scaling knob instantiated by a scalar coefficient. By storing task vectors in a binarized form, T-Switch alleviates parameter conflicts while ensuring efficient task parameter storage. Furthermore, to enable automated switch combination in T-Switch, we further introduce Auto-Switch, which enables training-free switch combination via retrieval from a small query set. Experiments indicate that our methods achieve significant performance improvements over existing baselines, requiring only 1-3% of the storage space of full-precision parameters.
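
The decomposition itself is easy to picture; the sketch below builds the three components of a task switch from a synthetic task vector and reconstructs the merged weights. The keep-ratio threshold and the use of the mean magnitude as the scaling knob are illustrative choices, not necessarily those of T-Switch.

import numpy as np

rng = np.random.default_rng(0)

pretrained = rng.normal(size=10_000)
finetuned = pretrained + rng.normal(scale=0.01, size=10_000)
task_vector = finetuned - pretrained

# Activation switch: keep only the "pulse" parameters with large update magnitude.
threshold = np.quantile(np.abs(task_vector), 0.8)    # keep the top 20% (illustrative)
mask = np.abs(task_vector) >= threshold              # binarized mask vector
sign = np.sign(task_vector)                          # binarized polarity switch
scale = np.abs(task_vector[mask]).mean()             # single scaling knob

# Reconstruct an approximate task vector from the three stored components
# and apply it to the base model when the task is "switched on".
approx = scale * sign * mask
merged = pretrained + approx

err = np.linalg.norm(finetuned - merged) / np.linalg.norm(task_vector)
print("relative reconstruction error:", err)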


Poster #443
On the Generalization of Handwritten Text Recognition Models

Carlos Garrido-Munoz · Jorge Calvo-Zaragoza

Recent advances in Handwritten Text Recognition (HTR) have led to significant reductions in transcription errors on standard benchmarks under the i.i.d. assumption, thus focusing on minimizing in-distribution (ID) errors. However, this assumption does not hold in real-world applications, which has motivated HTR research to explore Transfer Learning and Domain Adaptation techniques. In this work, we investigate the unaddressed limitations of HTR models in generalizing to out-of-distribution (OOD) data. We adopt the challenging setting of Domain Generalization, where models are expected to generalize to OOD data without any prior access. To this end, we analyze 336 OOD cases from eight state-of-the-art HTR models across seven widely used datasets, spanning five languages. Additionally, we study how HTR models leverage synthetic data to generalize. We reveal that the most significant factor for generalization lies in the textual divergence between domains, followed by visual divergence. We demonstrate that the error of HTR models in OOD scenarios can be reliably estimated, with discrepancies falling below 10 points in 70\% of cases. We identify the underlying limitations of HTR models, laying the foundation for future research to address this challenge.


Poster #444
Investigating the Role of Weight Decay in Enhancing Nonconvex SGD

Tao Sun · Yuhao Huang · Li Shen · Kele Xu · Bao Wang

Weight decay is a widely used technique in training machine learning models, known to empirically enhance the generalization of Stochastic Gradient Descent (SGD). While intuitively weight decay allows SGD to train a regularized model rather than the original one, there is limited theoretical understanding of why SGD with weight decay (SGDW) yields results consistent with the unregularized model, or how weight decay improves generalization. This paper establishes a convergence theory for SGDW in the context of the unregularized model, under weaker assumptions than previous analyses of weight decay. Our theory demonstrates that weight decay does not accelerate the convergence of SGD. For generalization, we provide the first theoretical proof of weight decay's benefit in nonconvex optimization. Additionally, we extend our results to sign-based stochastic gradient algorithms, such as SignSGD. Numerical experiments on classical benchmarks validate our theoretical findings.


Poster #445
Highlight
KAC: Kolmogorov-Arnold Classifier for Continual Learning

Yusong Hu · Zichen Liang · Fei Yang · Qibin Hou · Xialei Liu · Ming-Ming Cheng

Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN's spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning.


Poster #446

In continual learning (CL), catastrophic forgetting often arises due to feature drift. This challenge is particularly prominent in the exemplar-free continual learning (EFCL) setting, where samples from previous tasks cannot be retained. Therefore, the model struggles to maintain prior knowledge, leading to a more significant performance drop on older tasks. To ensure consistent representations across tasks, it is vital to mitigate feature drift. Some EFCL methods aim to identify feature spaces that minimize the impact on previous tasks while accommodating new ones. However, they rely on static features or outdated statistics from old tasks, which prevents them from capturing the dynamic evolution of the feature space in CL, leading to performance degradation. In this paper, we introduce the Drift-Resistant Space (DRS), which effectively handles feature drift without requiring explicit feature modeling or the storage of previous tasks. A novel parameter-efficient fine-tuning method called Low-Rank Adaptation Subtraction (LoRA$^-$) is proposed to develop the DRS. This method subtracts the LoRA weights of old tasks from the initial pre-trained weight before processing new task data to establish the DRS for model training. Therefore, LoRA$^-$ enhances stability, improves efficiency, and simplifies implementation. Furthermore, stabilizing feature drift allows for better plasticity by learning with a triplet loss. Extensive experiments across multiple datasets show that our method consistently achieves state-of-the-art results, particularly for long sequences of learning tasks.
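
A minimal sketch of the subtraction step follows, assuming hypothetical low-rank adapters from two earlier tasks: the accumulated LoRA updates are removed from the pretrained weight, and a fresh adapter for the new task is trained on top of the resulting drift-resistant weight. Shapes, ranks, and scales are placeholders.

import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4

W0 = rng.normal(size=(d_out, d_in))      # pretrained weight of one layer

# Placeholder low-rank adapters (B, A) learned on two previous tasks.
old_adapters = [(0.05 * rng.normal(size=(d_out, rank)),
                 0.05 * rng.normal(size=(rank, d_in))) for _ in range(2)]

# Drift-Resistant Space: subtract the accumulated old-task LoRA updates
# from the pretrained weight before training on the new task.
W_drs = W0.copy()
for B, A in old_adapters:
    W_drs -= B @ A

# A new adapter for the current task is then trained on top of W_drs.
B_new = np.zeros((d_out, rank))          # standard LoRA init: B starts at zero
A_new = 0.01 * rng.normal(size=(rank, d_in))
W_current = W_drs + B_new @ A_new

print("norm of subtracted drift:", np.linalg.norm(W0 - W_drs))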


Poster #447
Maintaining Consistent Inter-Class Topology in Continual Test-Time Adaptation

Chenggong Ni · Fan Lyu · Jiayao Tan · Fuyuan Hu · Rui Yao · Tao Zhou

This paper introduces Topological Consistency Adaptation (TCA), a novel approach to Continual Test-time Adaptation (CTTA) that addresses the challenges of domain shifts and error accumulation in testing scenarios. TCA ensures the stability of inter-class relationships by enforcing a class topological consistency constraint, which minimizes the distortion of class centroids and preserves the topological structure during continuous adaptation. Additionally, we propose an intra-class compactness loss to maintain compactness within classes, indirectly supporting inter-class stability. To further enhance model adaptation, we introduce a batch imbalance topology weighting mechanism that accounts for class distribution imbalances within each batch, optimizing centroid distances and stabilizing the inter-class topology. Experiments show that our method demonstrates improvements in handling continuous domain shifts, ensuring stable feature distributions and boosting predictive performance.


Poster #448
Tripartite Weight-Space Ensemble for Few-Shot Class-Incremental Learning

Juntae Lee · Munawar Hayat · Sungrack Yun

Few-shot class incremental learning (FSCIL) enables the continual learning of new concepts with only a few training examples. In FSCIL, the model undergoes substantial updates, making it prone to forgetting previous concepts and overfitting to the limited new examples. The most recent trend is to disentangle the learning of the representation from the classification head of the model. A well-generalized feature extractor on the base classes (many examples and many classes) is learned and then fixed during incremental learning. Arguing that the fixed feature extractor restricts the model's adaptability to new classes, we introduce a novel FSCIL method to effectively address catastrophic forgetting and overfitting issues. Our method enables seamless updates of the entire model with only a few examples. We mainly propose a tripartite weight-space ensemble (Tri-WE). Tri-WE interpolates the base, immediately previous, and current models in weight space, especially for the classification heads of the models. Then, it collaboratively maintains knowledge from the base and previous models. In addition, we recognize the challenges of distilling generalized representations from the previous model from scarce data. Hence, we suggest a regularization loss term using amplified data knowledge distillation. By simply intermixing the few-shot data, we can produce richer data, enabling the distillation of critical knowledge from the previous model. Consequently, we attain state-of-the-art results on the miniImageNet, CUB200, and CIFAR100 datasets.
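
Weight-space ensembling of the three classification heads can be pictured with the toy snippet below; the interpolation coefficients, head shapes, and random weights are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 60, 512

head_base = rng.normal(size=(n_classes, dim))                          # base-session head
head_prev = head_base + 0.05 * rng.normal(size=(n_classes, dim))       # previous session
head_curr = head_prev + 0.05 * rng.normal(size=(n_classes, dim))       # current session

# Tripartite weight-space ensemble: interpolate the three heads (coefficients sum to 1).
alpha, beta, gamma = 0.4, 0.3, 0.3
head_ensemble = alpha * head_base + beta * head_prev + gamma * head_curr

# The ensembled head is then used for inference over all classes seen so far.
features = rng.normal(size=(8, dim))
logits = features @ head_ensemble.T
print(logits.argmax(axis=1))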


Poster #449
T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning

Seong-Hyeon Hwang · Minsu Kim · Steven Euijong Whang

We study model confidence calibration in class-incremental learning, where models learn from sequential tasks with different class sets. While existing works primarily focus on accuracy, maintaining calibrated confidence has been largely overlooked. Unfortunately, most post-hoc calibration techniques are not designed to work with the limited memories of old-task data typical in class-incremental learning, as retaining a sufficient validation set would be impractical. Thus, we propose T-CIL, a novel temperature scaling approach for class-incremental learning without a validation set for old tasks, that leverages adversarially perturbed exemplars from memory. Directly using exemplars is inadequate for temperature optimization, since they are already used for training. The key idea of T-CIL is to perturb exemplars more strongly for old tasks than for the new task by adjusting the perturbation direction based on feature distance, with the single magnitude determined using the new-task validation set. This strategy makes the perturbation magnitude computed from the new task also applicable to old tasks, leveraging the tendency that the accuracy of old tasks is lower than that of the new task. We empirically show that T-CIL significantly outperforms various baselines in terms of calibration on real datasets and can be integrated with existing class-incremental learning techniques with minimal impact on accuracy.
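
The temperature step itself is standard and easy to sketch: given logits of (adversarially perturbed) exemplars and their labels, a single temperature is chosen to minimize the negative log-likelihood. The perturbation strategy that is central to T-CIL is omitted here, and the logits are synthetic stand-ins.

import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 10

# Stand-ins for the logits of perturbed exemplars and their labels.
labels = rng.integers(0, k, size=n)
logits = rng.normal(size=(n, k))
logits[np.arange(n), labels] += 3.0            # overconfident, as is typical after training

def nll(logits, labels, T):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Simple grid search for the single temperature that minimizes the NLL.
temps = np.linspace(0.5, 5.0, 46)
best_T = temps[np.argmin([nll(logits, labels, t) for t in temps])]
print("selected temperature:", best_T)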


Poster #450
Seeking Consistent Flat Minima for Better Domain Generalization via Refining Loss Landscapes

Aodi Li · Liansheng Zhuang · Xiao Long · MingHong Yao · Shafei Wang

Domain generalization aims to learn a model from multiple training domains and generalize it to unseen test domains. Recent theory has shown that seeking deep models whose parameters lie in the flat minima of the loss landscape can significantly reduce the out-of-domain generalization error. However, existing methods often neglect the consistency of loss landscapes in different domains, resulting in models that are not simultaneously in the optimal flat minima in all domains, which limits their generalization ability. To address this issue, this paper proposes an iterative Self-Feedback Training (SFT) framework to seek consistent flat minima that are shared across different domains by progressively refining loss landscapes during training. It alternately generates a feedback signal by measuring the inconsistency of loss landscapes in different domains and refines these loss landscapes for greater consistency using this feedback signal. Benefiting from the consistency of the flat minima within these refined loss landscapes, our SFT helps achieve better out-of-domain generalization. Extensive experiments on DomainBed demonstrate the superior performance of SFT when compared to state-of-the-art sharpness-aware methods and other prevalent DG baselines. On average across five DG benchmarks, SFT surpasses sharpness-aware minimization by 2.6\% with ResNet-50 and 1.5\% with ViT-B/16, respectively. The code will be available soon.


Poster #451
PEER Pressure: Model-to-Model Regularization for Single Source Domain Generalization

Dongkyu Cho · Inwoo Hwang · Sanghack Lee

Data augmentation is a popular tool for single source domain generalization, which expands the source domain by generating simulated ones, improving generalization on unseen target domains. In this work, we show that the performance of such augmentation-based methods in the target domains universally fluctuates during training, posing challenges in model selection under realistic scenarios. We argue that the fluctuation stems from the inability of the model to accumulate the knowledge learned from diverse augmentations, exacerbating feature distortion during training. Based on this observation, we propose a novel generalization method, coined Parameter-Space Ensemble with Entropy Regularization (PEER), that uses a proxy model to learn the augmented data on behalf of the main model. The main model is updated by averaging its parameters with the proxy model, progressively accumulating knowledge over the training steps. Maximizing the mutual information between the output representations of the two models guides the learning process of the proxy model, mitigating feature distortion during training. Experimental results demonstrate the effectiveness of PEER in reducing the OOD performance fluctuation and enhancing generalization across various datasets, including PACS, Digits, Office-Home, and VLCS. Notably, our method with simple random augmentation achieves state-of-the-art performance, surpassing prior approaches on sDG that utilize complex data augmentation strategies.
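
The parameter-space ensemble can be pictured with the toy loop below, where a proxy model is "trained" (here, faked with noise) on augmented data and the main model accumulates it as a running average across steps; the mutual-information regularization is omitted and all numbers are placeholders.

import numpy as np

rng = np.random.default_rng(0)

main = rng.normal(size=1_000)      # parameters of the main model
proxy = main.copy()                # proxy model starts from the main model

for step in range(1, 11):
    # Stand-in for one round of training the proxy on augmented data.
    proxy = proxy - 0.01 * rng.normal(size=proxy.shape)
    # Parameter-space ensemble: the main model is the running average of the
    # proxy's parameters, accumulating knowledge across augmentation rounds.
    main = (step * main + proxy) / (step + 1)

print("distance between main and proxy:", np.linalg.norm(main - proxy))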


Poster #452
A Unified Framework for Heterogeneous Semi-supervised Learning

Marzi Heidari · Abdullah Alchihabi · Hao Yan · Yuhong Guo

In this work, we introduce a novel problem setup termed as Heterogeneous Semi-Supervised Learning (HSSL), which presents unique challenges by bridging the semi-supervised learning (SSL) task and the unsupervised domain adaptation (UDA) task, and expanding standard semi-supervised learning to cope with heterogeneous training data. At its core, HSSL aims to learn a prediction model using a combination of labeled and unlabeled training data drawn separately from heterogeneous domains that share a common set of semantic categories; this model is intended to differentiate the semantic categories of test instances sampled from both the labeled and unlabeled domains. In particular, the labeled and unlabeled domains have dissimilar label distributions and class feature distributions. This heterogeneity, coupled with the assorted sources of the test data, introduces significant challenges to standard SSL and UDA methods. Therefore, we propose a novel method, Unified Framework for Heterogeneous Semi-supervised Learning (Uni-HSSL), to address HSSL by directly learning a fine-grained classifier from the heterogeneous data, which adaptively handles the inter-domain heterogeneity while leveraging both the unlabeled data and the inter-domain semantic class relationships for cross-domain knowledge transfer and adaptation. We conduct comprehensive experiments and the experimental results validate the efficacy and superior performance of the proposed Uni-HSSL over state-of-the-art semi-supervised learning and unsupervised domain adaptation methods.


Poster #453
CGMatch: A Different Perspective of Semi-supervised Learning

Bo Cheng · Jueqing Lu · Yuan Tian · Haifeng Zhao · Yi Chang · Lan Du

Semi-supervised learning (SSL) has garnered significant attention due to its ability to leverage limited labeled data and a large amount of unlabeled data to improve model generalization performance. Recent approaches achieve impressive successes by combining ideas from both consistency regularization and pseudo-labeling. However, these methods tend to underperform in the more realistic situations with relatively scarce labeled data. We argue that this issue arises because existing methods rely solely on the model's confidence, making it challenging for them to accurately assess the model's state and to identify unlabeled examples that contribute to training when supervision information is limited, especially during the early stages of model training. In this paper, we propose a novel SSL model called CGMatch, which, for the first time, incorporates a new metric known as Count-Gap (CG). We demonstrate that CG is effective in discovering unlabeled examples beneficial for model training. Along with confidence, a commonly used metric in SSL, we propose a fine-grained dynamic selection (FDS) strategy. This strategy dynamically divides the unlabeled dataset into three subsets with different characteristics: an easy-to-learn set, an ambiguous set, and a hard-to-learn set. By selectively filtering these subsets and applying the corresponding regularization to each, we mitigate the negative impact of incorrect pseudo-labels on model optimization and generalization. Extensive experimental results on several common SSL benchmarks indicate the effectiveness of CGMatch, especially when the labeled data are particularly limited.


Poster #454
Highlight
Label Shift Meets Online Learning: Ensuring Consistent Adaptation with Universal Dynamic Regret

Yucong Dai · Shilin Gu · Ruidong Fan · Chao Xu · Chenping Hou

Label shift, which investigates the adaptation of label distributions between the fixed source and target domains, has attracted significant research interest and broad applications in offline settings. In real-world scenarios, however, data often arrives as a continuous stream. Addressing label shift in online learning settings is therefore paramount. Existing strategies, which tailor traditional offline label shift techniques to online settings, suffer degraded performance due to the inconsistent estimation of label distributions and the violation of the convexity assumption needed for theoretical guarantees. In this paper, we propose a novel method to ensure consistent adaptation to online label shift. We construct a new convex risk estimator that is pivotal for both online optimization and theoretical analysis. Furthermore, we enhance an optimistic online algorithm as the base learner and refine the classifier using an ensemble method. Theoretically, we derive a universal dynamic regret bound that is minimax optimal. Extensive experiments on both real-world datasets and a human motion task demonstrate the superiority of our method compared to existing methods.


Poster #455
Highlight
Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection

Zhuo Xu · Xiang Xiang · Yifan Liang

Vision-language models (VLMs), such as CLIP, have shown remarkable capabilities in downstream tasks. However, the coupling of semantic information between the foreground and the background in images leads to significant shortcut issues that adversely affect out-of-distribution (OOD) detection abilities. When confronted with a background OOD sample, VLMs are prone to misidentifying it as in-distribution (ID) data. In this paper, we analyze the OOD problem from the perspective of shortcuts in VLMs and propose OSPCoOp, which includes background decoupling and mask-guided region regularization. We first decouple images into ID-relevant and ID-irrelevant regions and utilize the latter to generate a large number of augmented OOD background samples as pseudo-OOD supervision. We then use the masks from background decoupling to adjust the model's attention, minimizing its focus on ID-irrelevant regions. To assess the model's robustness against background interference, we introduce a new OOD evaluation dataset, ImageNet-Bg, which solely consists of background images with all ID-relevant regions removed. Our method demonstrates exceptional performance in few-shot scenarios, achieving strong results even in the one-shot setting, and outperforms existing methods.


Poster #456
H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection

Yuhang Liu · Wenjie Zhao · Yunhui Guo

Task incremental learning (TIL) is a specific form of continual learning (CL), wherein the model is trained on a set of distinguishable tasks. However, current TIL methodologies are predicated on the closed-world assumption, which posits that test data remains in-distribution (ID). When deployed in an open-world scenario, test samples can be from out-of-distribution (OOD) sources. Current OOD detection methods primarily rely on model outputs, leading to an over-dependence on model performance. Additionally, a threshold is required to distinguish between ID and OOD, limiting their practical application. Moreover, these methods can only achieve coarse-grained binary classification and cannot obtain task identity. To address this, we propose Hierarchical Two-sample Tests (H2ST), which is compatible with any existing replay-based TIL framework. H2ST eliminates the necessity for thresholds by employing hypothesis testing while leveraging feature maps to harness the model's capabilities without excessive dependence. The proposed hierarchical architecture incorporates a task-level detection mechanism, simplifying classification for individual classifiers. Extensive experiments and analysis demonstrate the effectiveness of H2ST in open-world TIL scenarios and its superiority over existing methods.
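
A stripped-down sketch of replacing thresholded confidences with hypothesis tests follows: a test batch is compared against each task's replay features with a two-sample test, and the batch is flagged as OOD only when no task's distribution matches. The hierarchical, feature-map-level design of H2ST is not reproduced; the 1-D feature summaries, the KS test, and the significance level are assumptions.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stored replay features for three previously learned tasks (1-D summaries for brevity).
replay = {t: rng.normal(loc=mu, scale=1.0, size=500) for t, mu in enumerate([0.0, 2.0, 4.0])}

def detect(test_feats, alpha=0.01):
    # Compare the test batch against every task's replay features with a two-sample test.
    pvals = {t: ks_2samp(test_feats, feats).pvalue for t, feats in replay.items()}
    best_task, best_p = max(pvals.items(), key=lambda kv: kv[1])
    if best_p < alpha:               # no stored task matches: flag the batch as OOD
        return "OOD"
    return best_task                 # otherwise return the identified task

id_batch = rng.normal(loc=2.0, scale=1.0, size=64)    # resembles task 1
ood_batch = rng.normal(loc=10.0, scale=1.0, size=64)  # matches no stored task

print(detect(id_batch))
print(detect(ood_batch))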


Poster #457

Out-of-Distribution (OOD) detection is critical for safe deployment; however, existing detectors often struggle to generalize across datasets of varying scales and model architectures, and some can incur high computational costs in real-world applications. Inspired by the phenomenon of Neural Collapse, we propose a versatile and efficient OOD detection method. Specifically, we re-characterize prior observations that in-distribution (ID) samples form clusters, demonstrating that, with appropriate centering, these clusters align closely with model weight vectors. Additionally, we reveal that ID features tend to expand into a simplex Equiangular Tight Frame, explaining the common observation that ID features are situated farther from the origin than OOD features. Incorporating both insights from Neural Collapse, our OOD detector leverages feature proximity to weight vectors and complements this approach by using feature norms to effectively filter out OOD samples. Extensive experiments on off-the-shelf models demonstrate the robustness of our OOD detector across diverse scenarios, mitigating generalization discrepancies and enhancing overall performance, with inference latency comparable to that of the basic softmax-confidence detector.
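
Both Neural-Collapse cues can be combined in a few lines, as in the sketch below: features are centered, scored by their proximity to the nearest classifier weight vector, and weighted by their norm. The centering vector, the multiplicative combination, and the synthetic features are assumptions rather than the paper's exact detector.

import numpy as np

rng = np.random.default_rng(0)
k, dim = 10, 128

W = rng.normal(size=(k, dim))            # classifier weight vectors
mu = 0.1 * rng.normal(size=(dim,))       # centering vector (e.g., a training feature mean)

def unit(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def id_score(feat):
    centered = feat - mu
    proximity = (unit(centered) @ unit(W).T).max(axis=-1)  # closeness to nearest weight vector
    norm = np.linalg.norm(centered, axis=-1)               # ID features sit farther from the origin
    return proximity * norm                                # higher score = more ID-like

id_feats = 10.0 * unit(rng.normal(size=(5, dim)) + W[rng.integers(0, k, 5)])
ood_feats = 0.3 * rng.normal(size=(5, dim))

print("ID scores :", id_score(id_feats))
print("OOD scores:", id_score(ood_feats))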


Poster #458
FedCS: Coreset Selection for Federated Learning

Chenhe Hao · Weiying Xie · Daixun Li · Haonan Qin · Hangyu Ye · Leyuan Fang · Yunsong Li

Federated Learning (FL) is an emerging direction in distributed machine learning that enables jointly training a model without sharing the data. However, as the size of datasets grows exponentially, the computational costs of FL increase. In this paper, we propose the first Coreset Selection criterion for Federated Learning (FedCS) by exploring the Distance Contrast (DC) in feature space. Our FedCS is inspired by the discovery that DC can indicate the intrinsic properties inherent to samples regardless of the network. Based on this observation, we develop a method that is mathematically formulated to prune samples with high DC. The principle behind our pruning is that high-DC samples either contain less information or represent rare extreme cases, so removing them can enhance the aggregation performance. Besides, we experimentally show that samples with low DC usually contain substantial information and reflect the common features of samples within their classes, making them suitable for constructing the coreset. With only two operations of linear-logarithmic complexity, FedCS leads to significant improvements in computational cost over methods that use the whole dataset, with similar accuracy. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha=0.1$, FedCS achieves 58.88% accuracy using only 44% of the entire dataset, whereas other methods require twice the data volume of FedCS for the same performance.


Poster #459
FedCALM: Conflict-aware Layer-wise Mitigation for Selective Aggregation in Deeper Personalized Federated Learning

Hao Zheng · Zhigang Hu · Boyu Wang · Liu Yang · Meiguang Zheng · Aikun Xu

Server aggregation conflict is a key challenge in personalized federated learning (PFL). While existing PFL methods have achieved significant progress with shallow base models (e.g., four-layer CNNs), they often overlook the negative impacts of deeper base models on personalization mechanisms. In this paper, we identify the phenomenon of deep model degradation in PFL, where, as base model depth increases, the model becomes more sensitive to local client data distributions, thereby exacerbating server aggregation conflicts and ultimately reducing overall model performance. Moreover, we show that these conflicts manifest in insufficient global average updates and mutual constraints between clients. Motivated by our analysis, we propose FedCALM, a two-stage conflict-aware layer-wise mitigation algorithm, which first constructs a conflict-free global update to alleviate negative conflicts and then alleviates the conflicts between clients through a conflict-aware strategy. Notably, our method naturally leads to a selective mechanism that balances the tradeoff between the clients involved in aggregation and the tolerance for conflicts. Consequently, it can boost the positive contribution of even the clients with the greatest conflicts with the global update. Extensive experiments across multiple datasets and deeper base models demonstrate that FedCALM outperforms four state-of-the-art (SOTA) methods by up to 9.88\% and seamlessly integrates into existing PFL methods with performance improvements of up to 9.01\%. Moreover, FedCALM achieves comparable or even better communication and computational efficiency than other SOTA methods.


Poster #460
Model Poisoning Attacks to Federated Learning via Multi-Round Consistency

Yueqi Xie · Minghong Fang · Neil Zhenqiang Gong

Model poisoning attacks are critical security threats to Federated Learning (FL). Existing model poisoning attacks suffer from two key limitations: 1) they achieve suboptimal effectiveness when defenses are deployed, and/or 2) they require knowledge of the model updates or local training data on genuine clients. In this work, we make a key observation that their suboptimal effectiveness arises from only leveraging model-update consistency among malicious clients within individual training rounds, making the attack effect self-cancel across training rounds. In light of this observation, we propose PoisonedFL, which enforces multi-round consistency among the malicious clients' model updates while not requiring any knowledge about the genuine clients. Our empirical evaluation on five benchmark datasets shows that PoisonedFL breaks eight state-of-the-art defenses and outperforms seven existing model poisoning attacks. Our study shows that FL systems are considerably less robust than previously thought, underlining the urgency for the development of new defense mechanisms.


Poster #461
FedSPA: Generalizable Federated Graph Learning under Homophily Heterogeneity

Zihan Tan · Guancheng Wan · Wenke Huang · Guibin Zhang · He Li · Carl Yang · Mang Ye

Federated Graph Learning (FGL) has emerged as a solution to address real-world privacy concerns and data silos in graph learning, and it relies on Graph Neural Networks (GNNs). Nevertheless, discrepancies in homophily levels within the local graph data of clients, termed homophily heterogeneity, significantly degrade the generalizability of a global GNN. Existing research ignores this issue and suffers from unpromising collaboration. In this paper, we propose $\textbf{FedSPA}$, an effective hyperparameter-free framework that addresses homophily heterogeneity from the perspectives of homophily conflict and homophily bias, concepts that have yet to be defined or explored. First, homophily conflict arises when training on inconsistent homophily levels across clients. Correspondingly, we propose $\textbf{S}$ubgraph Feature $\textbf{P}$ropagation Decoupling (SFPD), thereby achieving collaboration on unified homophily levels across clients. To further address homophily bias, we design Homophily Bias-Driven $\textbf{A}$ggregation (HBDA), which emphasizes clients with lower biases. It enables the adaptive adjustment of each client's contribution to the global GNN based on its homophily bias. The superiority of $\textbf{FedSPA}$ is validated through extensive experiments.


Poster #462
TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions

Wang Yu-Hang · Junkang Guo · Aolei Liu · Kaihao Wang · Zaitong Wu · Zhenyu Liu · Wenfei Yin · Jian Liu

Adversarial robustness remains a significant challenge in deploying deep neural networks for real-world applications. While adversarial training is widely acknowledged as a promising defense strategy, most existing studies primarily focus on balanced datasets, neglecting the fact that real-world data often exhibit a long-tailed distribution, which introduces substantial challenges to robustness. In this paper, we provide an in-depth analysis of adversarial training in the context of long-tailed distributions and identify the limitations of the current state-of-the-art method, AT-BSL, in achieving robust performance under such conditions. To address these challenges, we propose a novel training framework, TAET, which incorporates an initial stabilization phase followed by a stratified equalization adversarial training phase. Furthermore, prior work on long-tailed robustness has largely overlooked a crucial evaluation metric: balanced accuracy. To fill this gap, we introduce the concept of balanced robustness, a comprehensive metric that measures robustness specifically under long-tailed distributions. Extensive experiments demonstrate that our method outperforms existing advanced defenses, yielding significant improvements in both memory and computational efficiency. We believe this work represents a substantial step forward in tackling robustness challenges in real-world applications. The supplementary material contains our code.


Poster #463
Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples

WEIWEI LI · Junzhuo Liu · Yuanyuan Ren · Yuchen Zheng · Yahao Liu · Wen Li

Deep learning models are known to often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or by filtering spurious features based on empirical assumptions (e.g., the simplicity of bias). However, these methods may yield unsatisfying performance due to the intricate and elusive nature of spurious correlations in real-world data. In this paper, we propose a data-oriented approach to mitigate spurious correlation in deep learning models. We observe that samples influenced by spurious features tend to exhibit a dispersed distribution in the learned feature space. This allows us to identify the presence of spurious features. Subsequently, we obtain a bias-invariant representation by neutralizing the spurious features based on a simple grouping strategy. Then, we learn a feature transformation to eliminate the spurious features by aligning with this bias-invariant representation. Finally, we update the classifier by incorporating the learned feature transformation and obtain an unbiased model. By integrating the aforementioned identifying, neutralizing, eliminating, and updating procedures, we build an effective pipeline for mitigating spurious correlation. Experiments on four image and NLP debiasing benchmarks and one medical dataset demonstrate the effectiveness of our proposed approach, showing an improvement in worst-group accuracy of over 20\% compared to standard empirical risk minimization (ERM). Codes and checkpoints are available at https://anonymous.4open.science/r/ssc_debiasing-1CC8.


Poster #464
Uncertainty Weighted Gradients for Model Calibration

Jinxu Lin · Linwei Tao · Minjing Dong · Chang Xu

Model calibration is essential for ensuring that the predictions of deep neural networks accurately reflect true probabilities in real-world classification tasks. However, deep networks often produce over-confident or under-confident predictions, leading to miscalibration. Various methods have been proposed to address this issue by designing effective loss functions for calibration, such as focal loss. In this paper, we analyze its effectiveness and provide a unified loss framework of focal loss and its variants, where we mainly attribute their superiority in model calibration to the loss weighting factor that estimates sample-wise uncertainty. Based on our analysis, existing loss functions fail to achieve optimal calibration performance due to two main issues: misalignment in optimization and insufficient precision in uncertainty estimation. Specifically, focal loss cannot align sample uncertainty with gradient scaling, and a single logit cannot indicate the uncertainty. To address these issues, we reformulate the optimization from the perspective of gradients, which focuses on uncertain samples. Meanwhile, we propose to use the Brier Score as the loss weighting factor, which provides a more accurate uncertainty estimate via all the logits. Extensive experiments on various models and datasets demonstrate that our method achieves state-of-the-art (SOTA) performance.
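The abstract names the Brier Score as the loss-weighting factor; the sketch below shows one way such a weighting could be wired into cross-entropy training. The detached weight and the plain product are assumptions about the exact form, which the paper may define differently:

```python
import torch
import torch.nn.functional as F

def brier_weighted_ce(logits, targets):
    """Sketch of uncertainty-weighted training: the per-sample Brier score
    (computed from all the logits) scales the cross-entropy loss so that
    uncertain or wrongly ranked samples receive larger gradients."""
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes=logits.size(1)).float()
    brier = ((probs - onehot) ** 2).sum(dim=1)           # in [0, 2]; larger = more uncertain
    ce = F.cross_entropy(logits, targets, reduction="none")
    return (brier.detach() * ce).mean()                   # detach so the weight only rescales gradients

# usage sketch
logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 10, (4,))
loss = brier_weighted_ce(logits, targets)
loss.backward()
```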

Trusted multi-view classification (TMVC) addresses variations in data quality by evaluating the reliability of each view based on prediction uncertainty at the evidence level, reducing the impact of low-quality views commonly encountered in real-world scenarios. However, existing TMVC methods often struggle to maintain robustness during testing, particularly when integrating noisy or corrupted views. This limitation arises because the evidence collected by TMVC may be unreliable, frequently providing incorrect information due to complex view distributions and optimization challenges, ultimately leading to classification performance degradation. To enhance the robustness of TMVC methods in real-world conditions, we propose a generalized evidence filtering mechanism that is compatible with various fusion strategies commonly used in TMVC, including Belief Constraint Fusion, Aleatory Cumulative Belief Fusion, and Averaging Belief Fusion. Specifically, we frame the identification of unreliable evidence as a multiple testing problem and introduce p-values to control the risk of false identification. By selectively down-weighting unreliable evidence during testing, our mechanism ensures robust fusion and mitigates performance degradation. Both theoretical guarantees and empirical results demonstrate significant improvements in the classification performance of TMVC methods, supporting their reliable application in challenging, real-world environments.
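The abstract frames unreliable-evidence identification as a multiple-testing problem with p-values; the sketch below uses the standard Benjamini-Hochberg procedure as an illustrative stand-in for whatever test the paper actually applies, and simply down-weights flagged views before any fusion rule is applied. The function names, the significance level, and the down-weighting factor are assumptions:

```python
import numpy as np

def bh_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure (a standard multiple-testing control,
    used here only as a stand-in for the paper's reliability test)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                 # True = evidence flagged as unreliable
    return reject

def filter_evidence(view_evidence, p_values, alpha=0.05, down_weight=0.1):
    """Down-weight the evidence of views whose reliability test is rejected,
    before feeding them into a TMVC fusion strategy."""
    reject = bh_reject(p_values, alpha)
    weights = np.where(reject, down_weight, 1.0)
    return [w * e for w, e in zip(weights, view_evidence)]
```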


Poster #466
Enhanced then Progressive Fusion with View Graph for Multi-View Clustering

Zhibin Dong · Meng Liu · Siwei Wang · KE LIANG · Yi Zhang · Suyuan Liu · Jiaqi Jin · Xinwang Liu · En Zhu

Multi-view clustering aims to improve clustering accuracy by effectively integrating complementary information from multiple perspectives. However, existing methods often encounter challenges such as feature conflicts between views and insufficient enhancement of individual view features, which hinder clustering performance. To address these challenges, we propose a novel framework, EPFMVC, which integrates feature enhancement with progressive fusion to more effectively align multi-view data. Specifically, we introduce two key innovations: (1) a Feature Channel Attention Encoder (FCAencoder), which adaptively enhances the most discriminative features in each view, and (2) a View Graph-based Progressive Fusion Mechanism, which constructs a view graph using optimal transport (OT) distance to progressively fuse similar views while minimizing inter-view conflicts. By leveraging multi-head attention, the fusion process gradually integrates complementary information, ensuring more consistent and robust shared representations. These innovations enable superior representation learning and effective fusion across views. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art techniques, achieving notable improvements in multi-view clustering tasks across various datasets and evaluation metrics.


Poster #467
A Hubness Perspective on Representation Learning for Graph-Based Multi-View Clustering

Zheming Xu · He Liu · Congyan Lang · Tao Wang · Yidong Li · Michael C. Kampffmeyer

Recent graph-based multi-view clustering (GMVC) methods typically encode view features into high-dimensional spaces and construct graphs based on distance similarity. However, the high dimensionality of the embeddings often leads to the hubness problem, where a few points repeatedly appear in the nearest neighbor lists of other points. We show that this negatively impacts the extracted graph structures and message passing, thus degrading clustering performance. To the best of our knowledge, we are the first to highlight the detrimental effect of hubness in GMVC methods and introduce the hubREP (hub-aware Representation Embedding and Pairing) framework. Specifically, we propose a simple yet effective encoder that reduces hubness while preserving neighborhood topology within each view. Additionally, we propose a hub-aware pairing module to maintain structure consistency across views, efficiently enhancing the view-specific representations. The proposed hubREP is lightweight compared to the conventional autoencoders used in state-of-the-art GMVC methods and can be integrated into existing GMVC methods that mostly focus on novel fusion mechanisms, further boosting their performance. Comprehensive experiments performed on eight benchmarks confirm the superiority of our method. Code is included in the supplementary material.
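As background for the hubness problem described above, the snippet below computes the usual diagnostic: the k-occurrence counts (how often each point appears in other points' k-nearest-neighbor lists) and their skewness. This is a standard measure of hubness, not necessarily the one used inside hubREP:

```python
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

def hubness_skewness(embeddings, k=10):
    """Quantify hubness as the skewness of the k-occurrence distribution.
    Large positive skew indicates a few 'hub' points dominate the kNN graph."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)                 # first neighbor is the point itself
    counts = np.bincount(idx[:, 1:].ravel(), minlength=len(embeddings))
    return skew(counts), counts

# usage sketch: compare hubness of raw high-dimensional features vs. encoder outputs
x = np.random.randn(500, 256)
s, k_occurrence = hubness_skewness(x, k=10)
```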


Poster #468
CLOC: Contrastive Learning for Ordinal Classification with Multi-Margin N-pair Loss

Dileepa Pitawela · Gustavo Carneiro · Hsiang-Ting Chen

In ordinal classification, misclassifying neighboring ranks is common, yet the consequences of these errors are not the same. For example, misclassifying benign tumor categories is less consequential than an error at the pre-cancerous to cancerous threshold, which could profoundly influence treatment choices. Despite this, existing ordinal classification methods do not account for the varying importance of these margins, treating all neighboring classes as equally significant. To address this limitation, we propose CLOC, a new margin-based contrastive learning method for ordinal classification that learns an ordered representation based on the optimization of multiple margins with a novel multi-margin n-pair loss (MMNP). CLOC enables flexible decision boundaries across key adjacent categories, facilitating smooth transitions between classes and reducing the risk of overfitting to biases present in the training data. We provide an empirical discussion of the properties of MMNP and show experimental results on five real-world image datasets (Adience, Historical Colour Image Dating, Knee Osteoarthritis, Indian Diabetic Retinopathy Image, and Breast Carcinoma Subtyping) and one synthetic dataset simulating clinical decision bias. Our results demonstrate that CLOC outperforms existing ordinal classification methods and show the interpretability and controllability of CLOC in learning meaningful, ordered representations that align with clinical and practical needs.
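The abstract does not spell out the MMNP loss, so the sketch below is only one plausible multi-margin n-pair style formulation for ordinal data: each adjacent-class boundary carries its own margin, and a negative must be separated from the anchor by the accumulated margins of the boundaries between their ranks. Treat it as a hypothetical illustration rather than CLOC's definition:

```python
import torch
import torch.nn.functional as F

def multi_margin_npair_loss(anchor, positive, negatives, neg_ranks, anchor_rank, boundary_margins):
    """Illustrative multi-margin n-pair style loss (not the paper's exact MMNP).

    anchor, positive: (D,) embeddings sharing the anchor's rank
    negatives:        (K, D) embeddings from other ranks
    neg_ranks:        (K,) integer tensor of the negatives' ranks
    anchor_rank:      int rank of the anchor
    boundary_margins: (num_classes - 1,) per-boundary margins (fixed or learnable)
    """
    lo = torch.minimum(neg_ranks, torch.tensor(anchor_rank))
    hi = torch.maximum(neg_ranks, torch.tensor(anchor_rank))
    # cumulative margin: cum[r] = sum of per-boundary margins below rank r
    cum = torch.cumsum(F.pad(boundary_margins, (1, 0)), dim=0)
    margins = cum[hi] - cum[lo]                 # required separation per negative
    pos_sim = anchor @ positive
    neg_sim = negatives @ anchor
    # n-pair style log-sum over all negatives, each with its own margin
    return torch.log1p(torch.exp(neg_sim - pos_sim + margins).sum())
```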

Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. While earlier work has applied self-supervised learning (SSL) to unlabeled data, its task-agnostic nature often results in learning suboptimal features for downstream tasks. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. STiL features a new disentangled contrastive consistency module to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. We also propose a novel consensus-guided pseudo-labeling strategy to generate reliable pseudo-labels based on classifier consensus, along with a new prototype-guided label smoothing technique to refine pseudo-label quality with prototype embeddings, thereby enhancing task-relevant information learning in unlabeled data. Experiments on natural and medical image datasets show that STiL outperforms state-of-the-art supervised/SSL/SemiSL image/multimodal approaches. Our code will be available on GitHub.


Poster #470
Q-PART: Quasi-Periodic Adaptive Regression with Test-time Training for Pediatric Left Ventricular Ejection Fraction Regression

Jie Liu · Tiexin Qin · Hui Liu · Yilei Shi · Lichao Mou · Xiao Xiang Zhu · Shiqi Wang · Haoliang Li

In this work, we address the challenge of adaptive pediatric Left Ventricular Ejection Fraction (LVEF) assessment. While Test-time Training (TTT) approaches show promise for this task, they suffer from two significant limitations. Existing TTT works are primarily designed for classification tasks rather than continuous value regression, and they lack mechanisms to handle the quasi-periodic nature of cardiac signals. To tackle these issues, we propose a novel \textbf{Q}uasi-\textbf{P}eriodic \textbf{A}daptive \textbf{R}egression with \textbf{T}est-time Training (Q-PART) framework. In the training stage, the proposed Quasi-Period Network decomposes the echocardiogram into periodic and aperiodic components within latent space by combining parameterized helix trajectories with Neural Controlled Differential Equations. During inference, our framework further employs a variance minimization strategy across image augmentations that simulate common quality issues in echocardiogram acquisition, along with differential adaptation rates for periodic and aperiodic components. Theoretical analysis is provided to demonstrate that our variance minimization objective effectively bounds the regression error under mild conditions. Furthermore, extensive experiments across three pediatric age groups demonstrate that Q-PART not only significantly outperforms existing approaches in pediatric LVEF prediction, but also exhibits strong clinical screening capability with high mAUROC scores (up to 0.9747) and maintains gender-fair performance across all metrics, validating its robustness and practical utility in pediatric echocardiography analysis. The relevant dataset and code will be released upon acceptance of this paper.
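The test-time objective described above (variance minimization across quality-simulating augmentations) can be sketched as a single adaptation step; the differential adaptation rates for periodic and aperiodic components are omitted, and the model, augmentation, and optimizer names are illustrative assumptions:

```python
import torch

def tta_variance_step(model, clip, augmentations, optimizer):
    """One test-time adaptation step: minimize the variance of the regression
    output (e.g. predicted LVEF) across augmented views of the same clip.
    Augmentations are assumed to simulate common acquisition-quality issues."""
    preds = torch.stack([model(aug(clip)) for aug in augmentations])  # (A, ...) predictions
    loss = preds.var(dim=0).mean()             # disagreement across augmented views
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```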


Poster #471
OralXrays-9: Towards Hospital-Scale Panoramic X-ray Anomaly Detection via Personalized Multi-Object Query-Aware Mining

Bingzhi Chen · Sisi Fu · Xiaocheng Fang · Jieyi Cai · Boya Zhang · Minhua Lu · Yishu Liu

In clinical practice, panoramic dental radiography is a widely employed imaging technique that can provide a detailed and comprehensive view of dental structures and surrounding tissues for identifying various oral anomalies. However, due to the complexity of oral anomalies and the scarcity of available data, existing research still suffers from substantial challenges in automated oral anomaly detection. To this end, this paper presents a new hospital-scale panoramic X-ray benchmark, namely “OralXrays-9”, which consists of 12,688 panoramic X-ray images with 84,113 meticulously annotated instances across nine common oral anomalies. Correspondingly, we propose a personalized Multi-Object Query-Aware Mining (MOQAM) paradigm, which jointly incorporates the Distribution-IoU Region Proposal Network (DI-RPN) and Class-Balanced Spherical Contrastive Regularization (CB-SCR) mechanisms to address the challenges posed by multi-scale variations and class-imbalanced distributions. To the best of our knowledge, this is the first attempt to develop AI-driven diagnostic systems specifically designed for multi-object oral anomaly detection using publicly available data resources. Extensive experiments on the newly published OralXrays-9 dataset and real-world scenarios consistently demonstrate the superiority of our MOQAM in revolutionizing oral healthcare practices.


Poster #472
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation

Sang-Jun Park · Keun-Soo Heo · Dong-Hee Shin · Young-Han Son · Ji-Hye Oh · Tae-Eui Kam

The automatic generation of radiology reports has emerged as a promising solution to reduce a time-consuming task and accurately capture critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining generated reports. In this study, we propose a Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports with similar disease-relevant findings that closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on the MIMIC-CXR and IU X-ray benchmarks, surpassing previous approaches in both report generation and disease classification, thereby enhancing the trustworthiness of radiology reports.


Poster #473
FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification

Zhengrui Guo · Conghao Xiong · Jiabo MA · Qichen Sun · Lishuang Feng · Jinzhuo Wang · Hao Chen

Few-shot learning presents a critical solution for cancer diagnosis in computational pathology (CPath), addressing fundamental limitations in data availability, particularly the scarcity of expert annotations and patient privacy constraints. A key challenge in this paradigm stems from the inherent disparity between the limited training set of whole slide images (WSIs) and the enormous number of contained patches, where a significant portion of these patches lacks diagnostically relevant information, potentially diluting the model's ability to learn and focus on critical diagnostic features. While recent works attempt to address this by incorporating additional knowledge, several crucial gaps hinder further progress: (1) despite the emergence of powerful pathology foundation models (FMs), their potential remains largely untapped, with most approaches limiting their use to basic feature extraction; (2) current language guidance mechanisms attempt to align text prompts with vast numbers of WSI patches all at once, struggling to leverage rich pathological semantic information. To this end, we introduce the knowledge-enhanced adaptive visual compression framework, dubbed FOCUS, which uniquely combines pathology FMs with language prior knowledge to enable a focused analysis of diagnostically relevant regions by prioritizing discriminative WSI patches. Our approach implements a progressive three-stage compression strategy: we first leverage FMs for global visual redundancy elimination, and integrate compressed features with language prompts for semantic relevance assessment, then perform neighbor-aware visual token filtering while preserving spatial coherence. Extensive experiments on pathological datasets spanning breast, lung, and ovarian cancers demonstrate its superior performance in few-shot pathology diagnosis.


Poster #474
M3amba: Memory Mamba is All You Need for Whole Slide Image Classification

Tingting Zheng · Kui Jiang · Yi Xiao · Sicheng Zhao · Hongxun Yao

Multi-instance learning (MIL) has demonstrated impressive performance in whole slide image (WSI) analysis. However, existing approaches struggle with undesirable results and unbearable computational overhead due to the quadratic complexity of Transformers. Recently, Mamba has offered a feasible solution for modeling long-range dependencies with linear complexity. However, vanilla Mamba inherently suffers from contextual forgetting issues, making it ill-suited for capturing global dependencies across instances in large-scale WSIs. To address this, we propose a memory-driven Mamba network, dubbed M3amba, to fully explore the global latent relations among instances. Specifically, M3amba retains and iteratively updates historical information with a dynamic memory bank (DMB), thus overcoming the catastrophic forgetting defects of Mamba for long-term context representation. For better feature representation, M3amba involves an intra-group bidirectional Mamba (BiMamba) block to refine local interactions within groups. Meanwhile, we additionally perform cross-attention fusion to incorporate relevant historical information across groups, facilitating richer inter-group connections. The joint learning of inter- and intra-group representations with memory merits equips M3amba with a more powerful capability for accurate and comprehensive WSI representation. Extensive experiments on four datasets demonstrate that M3amba outperforms the state-of-the-art by 6.2\% and 7.0\% in accuracy on the TCGA BRCA and TCGA Lung datasets while maintaining low computational costs.


Poster #475
MERGE: Multi-faceted Hierarchical Graph-based GNN for Gene Expression Prediction from Whole Slide Histopathology Images

Aniruddha Ganguly · Debolina Chatterjee · Wentao Huang · Jie Zhang · Alisa Yurovsky · Travis Steele Johnson · Chao Chen

Recent advances in Spatial Transcriptomics (ST) pair histology images with spatially resolved gene expression profiles, enabling predictions of gene expression across different tissue locations based on image patches. This opens up new possibilities for enhancing whole slide image (WSI) prediction tasks with localized gene expression. However, existing methods fail to fully leverage the interactions between different tissue locations, which are crucial for accurate joint prediction. To address this, we introduce MERGE (Multi-faceted hiErarchical gRaph for Gene Expressions), which combines a multi-faceted hierarchical graph construction strategy with graph neural networks (GNN) to improve gene expression predictions from WSIs. By clustering tissue image patches based on both spatial and morphological features, and incorporating intra- and inter-cluster edges, our approach fosters interactions between distant tissue locations during GNN learning. As an additional contribution, we evaluate different data smoothing techniques that are necessary to mitigate artifacts in ST data, often caused by technical imperfections. We advocate for adopting gene-aware smoothing methods that are more biologically justified. Experimental results on gene expression prediction show that our GNN method outperforms state-of-the-art techniques across multiple metrics.
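The multi-faceted graph construction can be illustrated with off-the-shelf clustering: build intra-cluster kNN edges separately for each facet (spatial coordinates, morphological features), add a few inter-cluster edges, and take the union. The cluster counts, neighborhood size, and inter-cluster rule below are assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def cluster_edges(features, n_clusters=8, k=4):
    """Intra-cluster kNN edges for one facet (spatial coords or morphology)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    edges = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) <= 1:
            continue
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(idx))).fit(features[idx])
        _, nbrs = nn.kneighbors(features[idx])
        for i, row in zip(idx, nbrs):
            edges.extend((i, idx[j]) for j in row[1:])   # skip the self-neighbor
    return labels, edges

def inter_cluster_edges(features, labels):
    """Connect each cluster's most central member to those of every other cluster
    (a simple stand-in for inter-cluster edges)."""
    centroids = {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}
    reps = {c: np.where(labels == c)[0][
                np.argmin(np.linalg.norm(features[labels == c] - centroids[c], axis=1))]
            for c in centroids}
    cs = sorted(reps)
    return [(reps[a], reps[b]) for ai, a in enumerate(cs) for b in cs[ai + 1:]]

# usage sketch: one call per facet, then take the union of the edge lists
# _, e_spatial = cluster_edges(patch_coords)      # spatial facet
# _, e_morph   = cluster_edges(patch_features)    # morphological facet
```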


Poster #476
Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation

Xingguo Lv · Xingbo Dong · Liwen Wang · Jiewen Yang · Lei Zhao · Bin Pu · Zhe Jin · Xuejun Li

Although domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. We will make all codes publicly available.

Accurate automatic breast ultrasound (BUS) image segmentation is essential for early screening and diagnosis of breast cancer. It is, however, a quite challenging task owing to (1) the large variation in the scale and shape of breast lesions, (2) the ambiguous boundaries caused by extensive speckle noise and artifacts in BUS images, and (3) the scarcity of high-quality pixel-level annotations. Most existing semi-supervised methods employ the mean-teacher architecture, which merely learns semantic information within a single image and heavily relies on the performance of the teacher model. Given the vulnerability of this framework, we present a novel cross-image semantic correlation semi-supervised framework, named CSC-PA, to improve the performance of BUS image segmentation. CSC-PA is trained on a single network, which integrates a foreground prototype attention (FPA) and an edge prototype attention (EPA). Specifically, channel prototypes and an attention mechanism are used in the FPA to transfer complementary foreground information between labeled and unlabeled images, achieving more stable and complete lesion segmentation. On the other hand, EPA is proposed to enhance edge features of lesions by using an edge prototype. To achieve this, we design a novel adaptive edge container to store global edge features and generate the edge prototype. Additionally, we propose a pixel affinity loss (PAL) to exploit previously ignored contextual correlation in supervision, which further improves performance on edges. We conduct extensive experiments on two benchmark BUS datasets, demonstrating that our model outperforms other state-of-the-art methods under different partition protocols. Codes will be available upon publication.
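A common way to form the foreground (or edge) prototypes mentioned above is masked average pooling over the feature map; the snippet below shows only that generic operation and is not the FPA/EPA module itself:

```python
import torch

def foreground_prototype(features, mask, eps=1e-6):
    """Masked average pooling: obtain a per-image foreground (lesion) prototype
    from a feature map and a soft or binary foreground mask.

    features: (B, C, H, W) feature maps
    mask:     (B, 1, H, W) foreground mask (predicted or ground truth)
    returns:  (B, C) foreground prototypes
    """
    weighted = (features * mask).sum(dim=(2, 3))       # sum of features inside the mask
    area = mask.sum(dim=(2, 3)).clamp_min(eps)          # mask area, guarded against zero
    return weighted / area
```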


Poster #478
Take the Bull by the Horns: Learning to Segment Hard Samples

Yuan Guo · Jingyu Kong · Yu Wang · Yuping Duan

Medical image segmentation is vital for clinical applications, with hard samples playing a key role in segmentation accuracy. We propose an effective image segmentation framework that includes mechanisms for identifying and segmenting hard samples. It derives a novel image segmentation paradigm: 1) Learning to identify hard samples: automatically selecting inherent hard samples from different datasets, and 2) Learning to segment hard samples: achieving the segmentation of hard samples through effective feature augmentation on dedicated networks. We name our method ``Learning to Segment hard samples" (L2S). The hard sample identification module comprises a backbone model and a classifier, which dynamically uncovers inherent dataset patterns. The hard sample segmentation module utilizes the diffusion process for feature augmentation and incorporates a more sophisticated segmentation network to achieve precise segmentation. We justify our motivation through solid theoretical analysis and extensive experiments. Evaluations across various modalities show that our L2S outperforms other SOTA methods, particularly by substantially improving the segmentation accuracy of hard samples. On the ISIC dataset, our L2S improves the Dice score on hard samples and overall segmentation by 8.97\% and 1.01\%, respectively, compared to SOTA methods.


Poster #479
Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images

Jie Mei · Chenyu Lin · Yu Qiu · Yaonan Wang · Hui Zhang · Ziyang Wang · Dong Dai

Lung cancer is a leading cause of cancer-related deaths globally. PET-CT is crucial for imaging lung tumors, providing essential metabolic and anatomical information, while it faces challenges such as poor image quality, motion artifacts, and complex tumor morphology. Deep learning-based segmentation models are expected to address these problems, however, most existing datasets are small-scale and private, which is insufficient to support significant performance improvements for these methods. Hence, we introduce a large-scale PET-CT lung tumor segmentation dataset, termed PCLT20K, which comprises 21,930 pairs of PET-CT images from 605 patients. All images are manually labeled with pixel-level tumor masks by experienced doctors. Furthermore, we propose a cross-modal interactive perception network with Mamba (CIPA) for lung tumor segmentation in PET-CT images. Specifically, we design a channel-wise rectification module (CRM) that implements a channel state space block across multi-modal features to learn correlated representations and helps filter out modality-specific noise. A dynamic cross-modality interaction module (DCIM) is designed to effectively integrate position and context information, which employs PET images to learn regional position information and serves as a bridge to assist in modeling the relationships between local features of CT images. Extensive experiments on a comprehensive benchmark demonstrate the effectiveness of our CIPA compared to the current state-of-the-art segmentation methods. We hope our research can provide more exploration opportunities for medical image segmentation.

Magnetic resonance imaging (MRI), with modalities including T1, T2, T1ce, and Flair that provide complementary information critical for sub-region analysis, is widely used for brain tumor diagnosis. However, clinical practice often suffers from varying degrees of incompleteness of the necessary modalities, for reasons such as susceptibility to artifacts, which significantly impairs segmentation model performance. Given the limited available modalities, existing approaches attempt to project them into a shared latent space. However, they neglect to decompose modality-shared and modality-specific information and fail to construct the relationships among different modalities. This deficiency limits segmentation performance, particularly when the amount of available data differs across modalities. In this paper, we propose the plug-and-play Koopman Multi-modality Decomposition (KMD) module, which leverages the Koopman Invariant Subspace to disentangle modality-common and modality-specific information. It is capable of constructing modality relationships that minimize bias toward particular modalities across various modality-incomplete scenarios. More importantly, it can be feasibly integrated into several existing backbones. Through theoretical deductions and extensive empirical experiments on the BraTS2018 and BraTS2020 datasets, we demonstrate the effectiveness of the proposed KMD in promoting generalization performance.


Poster #481
Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation

Kunpeng Qiu · Zhiqiang Gao · Zhiying Zhou · MINGJIE SUN · Yongxin Guo

Deep learning has revolutionized medical image segmentation, but its full potential is limited by the scarcity of annotated datasets. Diffusion models are used to generate synthetic image-mask pairs to expand these datasets, yet they also face the same data scarcity issues they aim to address. Traditional mask-only models often produce low-fidelity images due to insufficient generation of morphological characteristics, which can catastrophically undermine the reliability of segmentation models. To enhance morphological fidelity, we propose the Siamese-Diffusion model, which incorporates both image and mask prior controls during training and switches to mask-only guidance during sampling to preserve diversity and scalability. This model, comprising both Mask-Diffusion and Image-Diffusion, ensures high morphological fidelity by introducing a Noise Consistency Loss between the two diffusion processes, guiding the convergence trajectory of Mask-Diffusion toward higher-fidelity local minima in the parameter space. Extensive experiments validate the superiority of our method: with Siamese-Diffusion, SANet achieves mDice and mIoU improvements of 3.6% and 4.4% on the Polyps dataset, while UNet shows mDice and mIoU improvements of 1.52% and 1.64% on the ISIC2018 dataset. Code will be released.
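The Noise Consistency Loss described above can be sketched as a simple consistency term between the noise predictions of the two branches at the same timestep; the MSE form, the stop-gradient on the image-conditioned branch, and all variable names are assumptions about the exact formulation:

```python
import torch
import torch.nn.functional as F

def noise_consistency_loss(eps_mask_branch, eps_image_branch):
    """Pull the noise predicted by the mask-conditioned branch toward the noise
    predicted by the (image + mask)-conditioned branch at the same timestep.
    The exact form used by Siamese-Diffusion may differ."""
    return F.mse_loss(eps_mask_branch, eps_image_branch.detach())

# usage sketch inside a training step (names are illustrative):
# eps_m = mask_diffusion(x_t, t, cond=mask)
# eps_i = image_diffusion(x_t, t, cond=(image, mask))
# loss = denoise_loss + lambda_nc * noise_consistency_loss(eps_m, eps_i)
```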


Poster #482
DeNVeR: Deformable Neural Vessel Representations for Unsupervised Video Vessel Segmentation

Chun-Hung Wu · Shih-Hong Chen · Chih Yao Hu · Hsin-Yu Wu · Kai-Hsin Chen · Yu-You Chen · Chih-Hai Su · Chih-Kuo Lee · Yu-Lun Liu

This paper presents Deformable Neural Vessel Representations (DeNVeR), an unsupervised approach for vessel segmentation in X-ray angiography videos without annotated ground truth. DeNVeR utilizes optical flow and layer separation techniques, enhancing segmentation accuracy and adaptability through test-time training. Key contributions include a novel layer separation bootstrapping technique, a parallel vessel motion loss, and the integration of Eulerian motion fields for modeling complex vessel dynamics. A significant component of this research is the introduction of the XACV dataset, the first X-ray angiography coronary video dataset with high-quality, manually labeled segmentation ground truth. Extensive evaluations on both XACV and CADICA datasets demonstrate that DeNVeR outperforms current state-of-the-art methods in vessel segmentation accuracy and generalization capability while maintaining temporal coherency.


Poster #483
VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis

Zhifeng Wang · Renjiao Yi · Xin Wen · Chenyang Zhu · Kai Xu

Angiography is a medical imaging technique that enhances the visibility of blood vessels within the body by using contrast agents, and angiographic images can effectively assist in the diagnosis of vascular diseases. However, contrast-enhanced acquisition may bring extra radiation exposure, posing health risks to patients. To mitigate these concerns, we aim to automatically generate angiography from non-angiographic inputs by leveraging and enhancing the inherent physical properties of vascular structures. Previous methods relying on 2D slice-based angiography synthesis struggle to maintain continuity in 3D vascular structures and exhibit limited effectiveness across different imaging modalities. We propose VasTSD, a 3D vascular tree-state space diffusion model that synthesizes angiography from 3D non-angiographic volumes, with a novel state space serialization approach that dynamically constructs vascular tree topologies and integrates them with a diffusion-based generative model to ensure the generation of anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is employed to construct vascular state space representations, enabling consistent modeling of vascular structures across multiple modalities. Extensive experiments on various angiographic datasets demonstrate the superiority of VasTSD over prior works, achieving enhanced continuity of blood vessels in the synthesized angiography across multiple modalities and anatomical regions.