CVPR 2026 Highlighted Papers

Poster

AVGGT: Rethinking Global Attention for Accelerating VGGT

Xianbing Sun ⋅ Zhikai Zhu ⋅ Zhengyu Lou ⋅ Bo Yang ⋅ Jinyang Tang ⋅ Liqing Zhang ⋅ He Wang ⋅ Jianfu Zhang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 25

Since DUSt3R, models such as VGGT and $\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\pi^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $\pi^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
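
The abstract does not spell out how the K/V subsampling works, so the following is only a minimal sketch of one possible reading of step (2), not the authors' implementation. It assumes a single attention head, a uniform `stride` over patch tokens, "diagonal preservation" meaning each query always keeps its own key/value, and "mean-fill" meaning one extra key/value that averages the dropped tokens. The function names and all of these choices are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(vectors):
    # Component-wise mean of a non-empty list of equal-length vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def subsampled_attention(q, k, v, stride):
    """Single-head attention over subsampled keys/values (illustrative only).

    Each query i attends to: every `stride`-th token, its own token
    (assumed meaning of diagonal preservation), and one averaged token
    standing in for everything that was dropped (assumed mean-fill).
    """
    n, d = len(q), len(q[0])
    kept = set(range(0, n, stride))               # uniformly subsampled K/V indices
    out = []
    for i in range(n):
        idx = sorted(kept | {i})                  # always keep the query's own token
        dropped = [j for j in range(n) if j not in kept and j != i]
        keys = [k[j] for j in idx]
        vals = [v[j] for j in idx]
        if dropped:                               # summarize dropped tokens by their mean
            keys.append(mean_vector([k[j] for j in dropped]))
            vals.append(mean_vector([v[j] for j in dropped]))
        logits = [sum(a * b for a, b in zip(q[i], key)) / math.sqrt(d) for key in keys]
        w = softmax(logits)
        out.append([sum(wi * val[c] for wi, val in zip(w, vals)) for c in range(len(v[0]))])
    return out
```

With `stride=1` nothing is dropped and the function reduces to ordinary full attention; larger strides shrink the number of keys per query from `n` to roughly `n / stride + 2`.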

Poster

MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction

JongMin Lee ⋅ Seungyeop Kang ⋅ Sungjoo Yoo

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 26

Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture that avoids the high computational cost of full cross-attention for multi-view feature interaction: (i) a multi-view encoder that leverages pairwise matching results as a geometric prior, and (ii) a multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods.

Poster

MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer

Juntong Fang ⋅ Zequn Chen ⋅ Weiqi Zhang ⋅ Donglin Di ⋅ Xuancheng Zhang ⋅ Chengmin Yang ⋅ Yu-Shen Liu

Jun 7, 11:45 AM - 1:45 PM ExHall F 27

Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.

Poster

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

Tao Xie ⋅ Peishan Yang ⋅ Yudong Jin ⋅ Yingfeng Cai ⋅ Wei Yin ⋅ Weiqiang Ren ⋅ Qian Zhang ⋅ Wei Hua ⋅ Sida Peng ⋅ Xiaoyang Guo ⋅ Xiaowei Zhou

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 30

This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry~\cite{Schops_2019_CVPR} and Oxford Spires~\cite{tao2025spires} datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving state-of-the-art performance in both pose estimation and 3D reconstruction accuracy while maintaining efficiency.

Poster

OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer

Hao Li ⋅ Hao Li ⋅ Yalun Dai ⋅ Yushi Lan ⋅ Yihang Luo ⋅ Tianyu Qi ⋅ Zhengshen Zhang ⋅ Yufeng Zhan ⋅ Junfei Zhang ⋅ Wenchao Xu ⋅ Ziwei Liu

Jun 7, 3:30 PM - 5:30 PM ExHall A 30

General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrate OmniVGGT into vision-language-action (VLA) models. The OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
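
To illustrate why zero-initialized convolutions leave the backbone's representation undisturbed at the start of training, here is a minimal sketch. It assumes a 1x1 convolution added residually to the backbone features; the function names, the residual form, and the toy shapes are assumptions made for illustration, not details taken from the paper.

```python
def conv1x1(pixels, weight, bias):
    """Apply a 1x1 convolution to a list of per-pixel channel vectors."""
    return [
        [sum(w * x for w, x in zip(row, px)) + b for row, b in zip(weight, bias)]
        for px in pixels
    ]

def inject_geometry(backbone_feats, geo_feats, weight, bias):
    """Add projected geometric features to backbone features (assumed residual form)."""
    delta = conv1x1(geo_feats, weight, bias)
    return [[f + d for f, d in zip(fp, dp)] for fp, dp in zip(backbone_feats, delta)]

# Two pixels, 3 backbone channels, 2 geometric channels (toy sizes).
backbone = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
geometry = [[10.0, 20.0], [30.0, 40.0]]

# Zero initialization: 3 output channels x 2 input channels, all zeros.
zero_weight = [[0.0, 0.0] for _ in range(3)]
zero_bias = [0.0, 0.0, 0.0]

at_init = inject_geometry(backbone, geometry, zero_weight, zero_bias)
```

At initialization `at_init` equals `backbone` exactly, so the pretrained model's behavior is preserved; the geometric signal only enters as the weights move away from zero during training.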

Poster

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Qitao Zhao ⋅ Hao Tan ⋅ Qianqian Wang ⋅ Sai Bi ⋅ Kai Zhang ⋅ Kalyan Sunkavalli ⋅ Shubham Tulsiani ⋅ Hanwen Jiang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 33

Self-supervised pre-training has revolutionized foundation models for language, 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv2, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

Poster

Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting

Xinhang Liu ⋅ Pedro Miraldo ⋅ Suhas Lohit ⋅ Huaizu Jiang ⋅ Naoko Sawada ⋅ Yu-Wing Tai ⋅ Chi-Keung Tang ⋅ Moitreya Chatterjee

Jun 6, 11:45 AM - 1:45 PM ExHall F 34

Understanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feed-forward neural networks, forecasting future dynamics in 3D, given the 2D frames of a video, remains unexplored. We present Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene, in 3D. At the core of our approach lies a persistently evolving latent spacetime representation that models the environment’s evolution across time. Upon receiving a new 2D frame, an update operation integrates the incoming evidence to refine the latent spacetime representation. When queried for any time instant, whether before, at, or beyond the timestamp of the last update, a readout procedure predicts temporally conditioned point maps and camera parameters describing the scene geometry at the queried time. Unlike prior approaches for online dynamic scene reconstruction that estimate each frame’s point map solely at the timestamp of the last observed frame, Point4Cast achieves coherent reconstruction across any queried time. Empirical evaluations show that Point4Cast achieves state-of-the-art performance on streaming dynamic scene reconstruction and forecasting benchmarks, across multiple challenging datasets, while providing scene flow estimation and forecasting for free. The code will be released publicly.

Poster

AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend

Hengyi Wang ⋅ Lourdes Agapito

Jun 6, 11:45 AM - 1:45 PM ExHall F 35

We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction at metric scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation as well as 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.

Poster

Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization

Tze Ho Elden Tse ⋅ Jizong Peng ⋅ Angela Yao

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 35

Learning-based structure-from-motion methods such as ACE-Zero have demonstrated strong performance in estimating camera poses and scene coordinates from unordered image collections without requiring ground truth supervision. However, the lack of global and multi-view consistency constraints in ACE-Zero can lead to pose drift and misalignment, particularly in complex or ambiguous scenes. In this work, we propose a hybrid framework that integrates pose graph optimization (PGO) into ACE-Zero to refine camera poses and suppress incorrect refinements. We construct pose graphs directly from ACE-Zero outputs by extracting relative pose constraints from predicted scene coordinates. Furthermore, we introduce an uncertainty-aware optimization strategy by estimating confidence scores using geometric priors, including epipolar and optical flow consistencies across views. Our approach improves the robustness and accuracy of pose estimation, demonstrating that global geometric reasoning can effectively complement learning-based inference in structure-from-motion.

Poster

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

Runze Wang ⋅ Yuxuan Song ⋅ Youcheng Cai ⋅ Ligang Liu

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 37

Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. While causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the linear growth of the cache introduces a significant memory bottleneck. When memory constraints trigger early eviction, reconstruction quality and temporal consistency deteriorate markedly. In this work, we observe that attention patterns in causal transformers for 3D reconstruction exhibit intrinsic spatio-temporal sparsity. Leveraging this insight, we propose **STAC**, a **S**patio-**T**emporally **A**ware **C**ache compression framework specifically designed for streaming 3D reconstruction using large causal transformers. STAC incorporates three key components: a **Working Temporal Token Caching** mechanism that preserves long-term informative tokens based on decayed cumulative attention scores; a **Long-term Spatial Token Caching** scheme that consolidates spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and a **Chunk-based Multi-frame Optimization** strategy that jointly optimizes consecutive frames to enhance temporal coherence and leverage GPU parallelism. Extensive experiments demonstrate that **STAC** achieves state-of-the-art reconstruction quality while reducing memory consumption by 8.5$\times$ and accelerating inference by 3.5$\times$, enabling scalable and real-time 3D reconstruction in streaming settings. The code will be made publicly available upon acceptance.
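
The idea of ranking cached tokens by decayed cumulative attention scores can be sketched as follows. This is an illustrative assumption about how such a score might be maintained (an exponential decay followed by adding the attention mass each token received in the current step), not STAC's actual rule; the function names, the decay value, and the eviction budget are hypothetical.

```python
def update_scores(scores, attention_received, decay):
    """Decay old scores, then add the attention each cached token just received."""
    return [decay * s + a for s, a in zip(scores, attention_received)]

def select_tokens_to_keep(scores, budget):
    """Return indices of the `budget` highest-scoring tokens, in cache order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Three cached tokens, two streaming steps, decay of 0.5 (all toy values).
scores = [0.0, 0.0, 0.0]
scores = update_scores(scores, [0.6, 0.1, 0.3], decay=0.5)
scores = update_scores(scores, [0.0, 0.2, 0.8], decay=0.5)
kept = select_tokens_to_keep(scores, budget=2)
```

Token 0 was attended to heavily at first but not afterwards, so its score decays; token 2 keeps receiving attention and ranks highest. With a budget of two, tokens 0 and 2 are retained and token 1 is evicted.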

Poster

No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency

Cho-Ying Wu ⋅ Zixun Huang ⋅ Xinyu Huang ⋅ Liu Ren

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 37

We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: obtaining aligned RGB-X data. Most prior RGB-X work assumes such pairs exist and focuses on modality fusion, but in practice obtaining them requires substantial engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate the results in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for the X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning with a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection. Code will be released.

Poster

SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

ZhiCheng Qiu ⋅ Jiarui Meng ⋅ Tong-an Luo ⋅ Yican Huang ⋅ Xuan Feng ⋅ Xuanfu Li ⋅ Zhan Xu

Jun 7, 11:45 AM - 1:45 PM ExHall F 37

We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
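
A window-based causal attention pattern can be sketched as a boolean mask over frames. The sketch below assumes frame-level granularity and a fixed window size, where each frame attends to itself and to the `window - 1` frames before it; the function name and these details are assumptions for illustration, not the paper's exact design.

```python
def window_causal_mask(num_frames, window):
    """mask[i][j] is True when frame i may attend to frame j.

    Causal: j <= i. Windowed: j is within the last `window` frames up to i.
    """
    return [
        [(i - window < j <= i) for j in range(num_frames)]
        for i in range(num_frames)
    ]

mask = window_causal_mask(num_frames=4, window=2)
```

Every row contains at most `window` True entries, so the number of cached frames per step stays bounded regardless of sequence length, which is what keeps memory from accumulating during streaming.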

Poster

RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations

Mochu Xiang ⋅ Zhelun Shen ⋅ Xuesong li ⋅ Jiahui Ren ⋅ Jing Zhang ⋅ Chen Zhao ⋅ Shanshan Liu ⋅ Haocheng Feng ⋅ Jingdong Wang ⋅ Yuchao Dai

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 39

Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry unmodeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications.

Poster

Parallel Rigidity Matters for Bundle Adjustment

Lalit Manam ⋅ Venu Madhav Govindu

Jun 7, 11:45 AM - 1:45 PM ExHall F 38

Bundle adjustment is a long-standing problem in computer vision that solves for camera parameters and 3D point coordinates from 2D image observations. While there has been much work on various aspects, like adaptation to different camera models and sensors, and considerations for solving the optimization problem, in this paper, we deal with a fundamental and distinct aspect of the uniqueness of its solution. In particular, we examine the unique solvability of the 3D reconstruction problem using parallel rigidity theory. We design an algorithm to ensure that the topology of the bipartite graph formed by the camera-3D point relations in bundle adjustment does not result in independent scaling of the edges in its subgraphs. To tackle the generally large-sized bipartite graph, we leverage camera-camera relationships in 3D reconstruction problems for efficiency. We demonstrate the benefits of our analysis on a global structure-from-motion pipeline. Applying our proposed algorithm results in significantly cleaner reconstructions by removing misplaced cameras and 3D points.

Poster

Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization

Torsten Sattler ⋅ Zuzana Kukelova

Jun 7, 11:45 AM - 1:45 PM ExHall F 39

Visual localization, i.e., the problem of estimating the camera pose from which an image was taken, is an important part of applications such as augmented reality and autonomous robots. Many of these applications require a compact memory footprint. Thus, a considerable amount of work has been spent on designing memory-efficient scene representations for visual localization. In this paper, we focus on compressing the 3D structure of the scene by selecting a subset of points from a Structure-from-Motion (SfM) point cloud. In contrast to prior work, which aims to solve (complex) optimization problems, we propose a simple strategy that is almost trivial to implement. Our compression strategy is based on the idea of selecting triplets of points such that the camera pose of each database image (used to build the SfM point cloud) can be accurately estimated from these triplets. Despite its simplicity, our strategy performs similarly to or better than current state-of-the-art structure compression approaches. Combined with standard product quantization approaches to compress feature descriptors, our approach compares favorably with recent learning-based approaches for compact visual localization.

Poster

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Chen Wang ⋅ Hao Tan ⋅ Wang Yifan ⋅ Zhiqin Chen ⋅ Yuheng Liu ⋅ Kalyan Sunkavalli ⋅ Sai Bi ⋅ Lingjie Liu ⋅ Yiwei Hu

Jun 7, 3:30 PM - 5:30 PM ExHall A 39

We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model’s capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.

Poster

Global Structure-from-Motion Meets Feedforward Reconstruction

Linfei Pan ⋅ Johannes Schönberger ⋅ Marc Pollefeys

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 41

Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited image overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, and robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new state-of-the-art Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments over a wide range of reconstruction scenarios demonstrate the benefits of our approach by achieving state-of-the-art results across the board. The implementation of our pipeline will be shared as open source software.

Poster

HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Yihao Meng ⋅ Hao Ouyang ⋅ Yue Yu ⋅ Qiuyu Wang ⋅ Wen Wang ⋅ Ka Leong Cheng ⋅ Hanlin Wang ⋅ Shuailei Ma ⋅ Yixuan LI ⋅ Chen Cheng ⋅ Yanhong Zeng ⋅ Xing Zhu ⋅ Yujun Shen ⋅ Huamin Qu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 44

State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future.
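
The "dense within shots but sparse between them" pattern can be illustrated with a token-level mask. The abstract does not say which cross-shot connections are kept, so this sketch assumes that tokens in other shots are visible only through the first `summary_tokens` tokens of each shot; that choice, the function name, and the parameters are hypothetical.

```python
def inter_shot_mask(shot_lengths, summary_tokens):
    """mask[i][j] is True when token i may attend to token j.

    Dense inside a shot; across shots, only the first `summary_tokens`
    tokens of each shot are visible (an assumed form of sparsity).
    """
    shot_of, pos_in_shot = [], []
    for shot_id, length in enumerate(shot_lengths):
        shot_of += [shot_id] * length
        pos_in_shot += list(range(length))
    total = len(shot_of)
    return [
        [shot_of[i] == shot_of[j] or pos_in_shot[j] < summary_tokens
         for j in range(total)]
        for i in range(total)
    ]

# Two shots with 3 and 2 tokens; one summary token per shot.
mask = inter_shot_mask([3, 2], summary_tokens=1)
```

Under this assumption the attention cost per token grows with its own shot length plus a small constant per additional shot, rather than with the full sequence length.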

Poster

Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

Chuancheng Shi ⋅ Shangze Li ⋅ Shiming Guo ⋅ Simiao Xie ⋅ Wenhua Wu ⋅ Jingtong Dou ⋅ Chao Wu ⋅ Canran Xiao ⋅ Cong Wang ⋅ Zifeng Cheng ⋅ Fei Shen ⋅ Tat-seng Chua

Jun 6, 11:45 AM - 1:45 PM ExHall F 43

Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilised. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.

Poster

DuoGen: Towards Autonomous Interleaved Multimodal Generation

Min Shi ⋅ Xiaohui Zeng ⋅ Jiannan Huang ⋅ Yin Cui ⋅ Francesco Ferroni ⋅ Jialuo Li ⋅ Max Li ⋅ Yogesh Balaji ⋅ Haoxiang Wang ⋅ Tsung-Yi Lin ⋅ Xiao Fu ⋅ Yue Zhao ⋅ Chieh-Yun Chen ⋅ Ming-Yu Liu ⋅ Humphrey Shi

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 43

Unified multimodal generation aims to jointly model image-to-text and text-to-image tasks within a single architecture. However, current approaches struggle to produce coherent, interleaved sequences of text and images. This limitation hinders applications that rely on tightly integrated multimodal outputs (such as step-by-step instructional guides, visual planning tools, and interactive content editing), where textual explanations and visual elements must be generated in a coordinated manner. We introduce DuoGen, a general-purpose interleaved multimodal generation model, and investigate data curation, architecture design, and evaluation. In terms of data, we construct a large-scale high-quality instruction-tuning corpus combining curated web content, rewritten multimodal conversations, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen builds upon a pretrained MLLM and a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining while remaining scalable. A two-stage decoupled training strategy first instruct-tunes the MLLM and then aligns it with the DiT using large-scale curated interleaved image–text sequences. Experiments on public and newly constructed benchmarks show that DuoGen substantially outperforms prior open-source systems across text quality, image fidelity, and image–context alignment, achieving substantial gains on text-to-image and image-editing benchmarks. Code and data will be released.

Poster

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Jing Tan ⋅ Zhaoyang Zhang ⋅ Yantao Shen ⋅ Jiarui Cai ⋅ Shuo Yang ⋅ Jiajun Wu ⋅ Wei Xia ⋅ Zhuowen Tu ⋅ Stefano Soatto

Jun 6, 11:45 AM - 1:45 PM ExHall F 46

We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
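
The core of GRPO is that each rollout's advantage is measured relative to the other rollouts in its group rather than by a learned value function. A minimal sketch of that normalization step is shown below, assuming scalar rewards per rollout and the population standard deviation; the function name, the `eps` constant, and the toy rewards are illustrative, and the spatial rewards used by Talk2Move are not modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; a design choice in this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Three rollouts for the same prompt with hypothetical rewards.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Rollouts that score above the group mean get positive advantages and those below get negative ones; if all rollouts receive the same reward, every advantage is zero and the group contributes no learning signal.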

Poster

When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

Krzysztof Adamkiewicz ⋅ Brian B. Moser ⋅ Stanislav Frolov ⋅ Tobias Christian Nauen ⋅ Federico Raue ⋅ Andreas Dengel

Jun 7, 3:30 PM - 5:30 PM ExHall A 46

Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.

Poster

GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

Xincheng Shuai ⋅ Ziye Li ⋅ Henghui Ding ⋅ Dacheng Tao

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 47

Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose ***GlyphPrinter***, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the ***GlyphCorrector*** dataset with region-level glyph preference annotations and propose ***Region-Grouped DPO*** (***R-GDPO***), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce ***Regional Reward Guidance***, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.

Poster

Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

Sidan Zhu ⋅ Hongteng Xu ⋅ Dixin Luo

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 48

As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates the corresponding trailer shot sequences. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods.
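
The fill-then-re-mask generation loop described above can be sketched as follows. This is a toy illustration, not the SSMP implementation: `toy_predict` is a hypothetical stand-in for the trained encoder, and committing an even share of the remaining positions per step is an assumed schedule.

```python
MASK = None  # sentinel for a trailer position that is still masked


def iterative_decode(predict, length, steps):
    """Commit the most confident masked positions each step; re-mask the rest."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, s in enumerate(seq) if s is MASK]
        if not masked:
            break
        # predict returns {position: (shot_id, confidence)} for masked positions.
        proposals = predict(seq)
        # Assumed schedule: spread the remaining positions evenly over the steps left.
        k = -(-len(masked) // (steps - step))  # ceiling division
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = proposals[i][0]
        # Positions outside the top-k stay masked and are predicted again next step.
    return seq


def toy_predict(seq):
    # Hypothetical predictor: proposes shot id 10*i, more confident for early slots.
    filled = sum(s is not MASK for s in seq)
    return {i: (10 * i, 1.0 / (1 + i) + 0.1 * filled)
            for i, s in enumerate(seq) if s is MASK}


trailer = iterative_decode(toy_predict, length=6, steps=3)
```

Because low-confidence positions are predicted again with more context filled in, an early uncertain guess is never locked in, which is the self-corrective aspect the abstract refers to.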

Poster

Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation

Zihao Wang ⋅ Yuxiang Wei ⋅ Xinpeng Zhou ⋅ Tianyu Zhang ⋅ Tao Liang ⋅ Yalong Bai ⋅ Hongzhi Zhang ⋅ Wangmeng Zuo

Jun 7, 11:45 AM - 1:45 PM ExHall F 48

Text-to-image generation has advanced rapidly, yet it still struggles to capture nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preferences and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.

Poster

Transition Models: Rethinking the Generative Learning Objective

ZiDong Wang ⋅ Yiyuan Zhang ⋅ Xiaoyu Yue ⋅ Xiangyu Yue ⋅ Yangguang Li ⋅ Wanli Ouyang ⋅ Lei Bai

Jun 7, 11:45 AM - 1:45 PM ExHall F 51

A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval \(\Delta t\). This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps. Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to \(4096\times4096\). All code and model checkpoints will be released.

Poster

Composing Concepts from Images and Videos via Concept-prompt Binding

Xianghao Kong ⋅ Zeyu Zhang ⋅ Yuwei Guo ⋅ Zhuoran ZHAO ⋅ Songchun Zhang ⋅ Anyi Rao

Jun 6, 11:45 AM - 1:45 PM ExHall F 52

Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.

Poster

DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

Junjia Huang ⋅ Binbin Yang ⋅ Pengxiang Yan ⋅ Jiyang Liu ⋅ Bin Xia ⋅ Zhao Wang ⋅ Yitong Wang ⋅ Liang Lin ⋅ Guanbin Li

Jun 7, 3:30 PM - 5:30 PM ExHall A 53

Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a storyboard framework built on video generative models that fully exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatial-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, explicitly constraining attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable, video-model-driven visual storytelling.

Poster

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Yuran Wang ⋅ Bohan Zeng ⋅ Chengzhuo Tong ⋅ Wenxuan Liu ⋅ Yang Shi ⋅ Xiaochen Ma ⋅ Hao Liang ⋅ Yuanxing Zhang ⋅ Wentao Zhang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 56

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in realistic and complex visual settings. We propose Scone, a unified understanding-generation framework that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge that conveys semantic information and guides the generation expert to preserve subject identity while reducing interference. A two-stage training scheme first learns composition and then strengthens distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark designed to evaluate composition, distinction, and their combination across diverse scenarios. Experiments show that Scone outperforms existing open-source models in both composition and distinction tasks. Our model, benchmark, and training data will be open-sourced.

Poster

FoleyDirector: Directing Temporal Controllable Video-to-Audio Generation via Fine-Grained Temporal Scripts

You Li ⋅ Dewei Zhou ⋅ Fan Ma ⋅ Fu Li ⋅ Dongliang He ⋅ Yi Yang

Jun 7, 11:45 AM - 1:45 PM ExHall F 58

Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded/partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model’s audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSound-Director and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.

Poster

UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Tian Ye ⋅ Song Fei ⋅ Lei Zhu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 59

Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval@4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.

Poster

FEAT: Fashion Editing and Try-On from Any Design

Soye Kwon ⋅ Keonyoung Lee ⋅ Dahuin Jung ⋅ Jaekoo Lee

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 60

Fashion design aims to express a designer’s creative intent and to depict how garments interact with the human body. Recent generative approaches condition on multimodal inputs to support garment editing and enable virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT (Fashion Editing and Try-On from Any Design), a method that enables editing and try-on across both garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI), which takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.

Poster

DreamOmni2: Multimodal Instruction-based Generation and Editing

Bin Xia ⋅ Bohao Peng ⋅ Yuechen Zhang ⋅ Junjia Huang ⋅ Jiyang Liu ⋅ Jingyao Li ⋅ Haoru Tan ⋅ WU Sitong ⋅ Chengyao Wang ⋅ Yitong Wang ⋅ Bei Yu ⋅ Jiaya Jia

Jun 7, 11:45 AM - 1:45 PM ExHall F 60

Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. We also propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 achieves impressive results. Models and code will be released.

Poster

AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models

Hongyi Cai ⋅ HONGYI CAI ⋅ MingKang Dong ⋅ Muxin Pu ⋅ Moayad Aloqaily ⋅ jie li ⋅ Xinfeng Li ⋅ Jialie Shen ⋅ Meikang Qiu ⋅ Qingsong Wen

Jun 7, 11:45 AM - 1:45 PM ExHall F 61

Text-to-Image (T2I) models generate high-quality images but are vulnerable to malicious backdoor attacks that inject harmful biases (e.g., trigger-activated gender or racial stereotypes). Existing debiasing methods, often designed for natural statistical biases, struggle with these deliberate and subtle injected attacks. We propose AutoDebias, a framework that automatically identifies and mitigates these malicious biases in T2I models without prior knowledge of the specific attack vectors. Specifically, AutoDebias leverages vision-language models to detect trigger-activated visual patterns and constructs neutralization guides by generating counter-prompts. These guides drive a CLIP-guided training process that breaks the harmful associations while preserving the original model's image quality and diversity. Unlike methods designed for natural bias, AutoDebias effectively addresses subtle, injected stereotypes and multiple interacting attacks. We evaluate the framework on a new benchmark covering 17 distinct backdoor attack scenarios, including challenging cases where multiple backdoors co-exist. AutoDebias detects malicious patterns with 91.6% accuracy and reduces the backdoor success rate from 90% to negligible levels, while preserving the visual fidelity of the original model.

Poster

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Rishabh Kabra ⋅ Maks Ovsjanikov ⋅ Drew A Hudson ⋅ Ye Xia ⋅ Skanda Koppula ⋅ André Araujo ⋅ Joao Carreira ⋅ Niloy J. Mitra

Jun 7, 3:30 PM - 5:30 PM ExHall A 63

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
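
The two-term objective can be made concrete with a small sketch. It assumes both terms are written as one minus cosine similarity, that distillation is applied to the RGB embedding only, and that a hypothetical `weight` balances the terms; the abstract does not specify the actual loss forms.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_objective(student_rgb, student_depth, teacher_rgb, weight=1.0):
    """Alignment term plus distillation term, each as (1 - cosine similarity)."""
    # Pull the student's embeddings of two modalities of one scene together.
    align = 1.0 - cosine(student_rgb, student_depth)
    # Anchor the student's embedding to the frozen teacher's output.
    distill = 1.0 - cosine(student_rgb, teacher_rgb)
    return align + weight * distill


# Toy 2-D embeddings: modalities disagree, but RGB matches the teacher.
loss = dual_objective([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy case the alignment term contributes the entire loss, since the RGB embedding already matches the teacher while the depth embedding is orthogonal to it.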

Poster

GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting

Yasmine Omri ⋅ Connor Ding ⋅ Tsachy Weissman ⋅ Thierry Tambe

Jun 6, 11:45 AM - 1:45 PM ExHall F 64

Modern vision–language pipelines are driven by RGB vision encoders trained on massive image–text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy-intensive and costly, and (ii) patch-based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over $90\times$ faster fitting and $\sim97\%$ GPU utilization compared to prior implementations. We further adapt contrastive language-image pre-training (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only $\sim9.7$–$13.8\%$ of the total parameters. On a 12.8M dataset from DataComp, GS encoders yield competitive zero-shot performance on 38 datasets from the CLIP benchmark while compressing inputs $3$–$23.5\times$ relative to pixels. Our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission-efficient for edge–cloud learning.

Poster

Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

Hayeon Kim ⋅ Ji Ha Jang ⋅ Junghun James Kim ⋅ Se Young Chun

Jun 7, 3:30 PM - 5:30 PM ExHall A 64

While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., a whole scene and its part images) through entailment. However, existing approaches do not model the fact that each part has a different level of semantic representativeness with respect to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, assigning lower uncertainty to parts that are more representative of the whole scene and higher uncertainty to less representative ones. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized with an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks.

Poster

The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models

Shivang Chopra ⋅ Shaunak Halbe ⋅ Chengyue Huang ⋅ Brisa Maneechotesuwan ⋅ Zsolt Kira

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 65

Fine-tuning approaches for Vision-Language Models (VLMs) face a critical three-way trade-off between In-Distribution (ID) accuracy, Out-of-Distribution (OOD) generalization, and adversarial robustness. Existing robust fine-tuning strategies resolve at most two axes of this trade-off. Generalization-preserving methods retain ID/OOD performance but leave models vulnerable to adversarial attacks, while adversarial training improves robustness to targeted attacks but degrades ID/OOD accuracy. Our key insight is that the robustness trade-off stems from two geometric failures: sharp, anisotropic minima in parameter space and unstable feature representations that deform under perturbation. To address this, we propose GRACE (Gram-aligned Robustness via Adaptive Curvature Estimation), a unified fine-tuning framework that jointly regularizes the parameter-space curvature and feature-space invariance for VLMs. Grounded in Robust PAC-Bayes theory, GRACE employs adaptive weight perturbations scaled by local curvature to promote flatter minima, combined with a feature alignment loss that maintains representation consistency across clean, adversarial, and OOD inputs. On ImageNet fine-tuning of CLIP models, GRACE simultaneously improves ID accuracy by 10.8% and adversarial accuracy by 8.9% while maintaining 57.0% OOD accuracy (vs. 57.4% for the zero-shot baseline). Geometric analysis confirms that GRACE converges to flatter minima without feature distortion across distribution shifts, providing a principled step toward generalized robustness in foundation VLMs.

Poster

BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Risa Shinoda ⋅ Kaede Shiohara ⋅ Nakamasa Inoue ⋅ Kuniaki Saito ⋅ Hiroaki Santo ⋅ Fumio Okura

Jun 7, 11:45 AM - 1:45 PM ExHall F 66

Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible retrieval directions across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), at three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding.
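
The size of the retrieval benchmark's task grid follows from simple counting, sketched below. The modality and level names come from the abstract; pairing every direction with every taxonomic level is an assumption of this sketch, since the abstract does not say whether all combinations are evaluated.

```python
from itertools import permutations, product

MODALITIES = ("image", "audio", "text")
LEVELS = ("Family", "Genus", "Species")

# Ordered pairs of distinct modalities give the directed retrieval tasks.
directions = [f"{src}-to-{dst}" for src, dst in permutations(MODALITIES, 2)]

# Assumed grid: every direction evaluated at every taxonomic level.
settings = list(product(directions, LEVELS))
```

Three modalities yield six directed pairs, so under the stated assumption the grid has eighteen direction-level settings.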

Poster

Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

Kaichen He ⋅ Zihao Wang ⋅ Muyao Li ⋅ Anji Liu ⋅ Yitao Liang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 68

Autonomous end-to-end agents are increasingly required to operate in environments where actions are not derived directly from the environment's raw actions but instead selected from higher-level action spaces. These actions are then mapped to the corresponding low-level interactions with the environment through controllers. In existing research, the action space is typically predefined. However, in practice, the optimal action space is context-dependent and difficult to determine in advance. For example, in complex domains such as Minecraft, relying solely on low-level raw actions or high-level planning actions is insufficient to handle the wide range of open-ended tasks, which vary in complexity and time horizons. The effective granularity of control inevitably varies depending on the situation. To address this challenge, we propose CrossAgent, which introduces a novel adaptive action-space selection framework. CrossAgent is built through two stages of reinforcement learning fine-tuning: cold-start single-step reinforcement learning and multi-step reinforcement learning. Within Minecraft, we define three complementary action spaces (motion, grounding, and raw action), each with distinct advantages and limitations. Our framework enables agents to dynamically switch among these spaces and balance task rewards against reasoning costs. Experiments on over 30 diverse tasks in Minecraft demonstrate that CrossAgent exhibits strong long-horizon planning, precise execution, generalization, and efficiency, significantly outperforming fixed-action baselines. These results highlight the critical role of dynamic action-space adaptation in the development of generalist agents capable of tackling open-ended environments.

Poster

Boosting Visual Reprogramming for CLIP with Dual Granularity Alignment

Jiayang Wu ⋅ Xinyang Chen ⋅ Ke Lv ⋅ Weili Guan

Jun 7, 11:45 AM - 1:45 PM ExHall F 67

Model reprogramming adapts pretrained models to downstream tasks by modifying their input and output spaces. Visual reprogramming (VR), a prominent instance, introduces learnable input transformations (e.g., visual prompts) to repurpose vision-language models like CLIP for downstream visual tasks. Existing VR methods primarily focus on single-level alignment between prompted images and text descriptions, overlooking inherent structural information in data that facilitates alignment: semantic granularity (label hierarchies) and visual granularity (multi-scale representations). To address this gap, we propose Dual Granularity Alignment (DGA). First, for visual granularity, we generate multi-scale images and propose Uncertainty-calibrated Prediction Fusion (UPF) to capture hierarchical spatial information within images. Second, for semantic granularity, we propose Prototype-guided Label Hierarchization (PLF) to construct category hierarchies from visual semantic similarities and Hierarchical Knowledge Propagation (HKP) to achieve top-down superclass-to-subclass guidance for coherent multi-level visual prompt alignment. DGA integrates both granularities to enhance alignment effectiveness. Experiments across 12 downstream datasets demonstrate DGA's superiority over baselines on both ViT-based and ResNet-based CLIP architectures. Specifically, DGA achieves a 4.5% improvement over the previous state-of-the-art method on ViT-16-based CLIP. By explicitly modeling structural granularities, DGA establishes a new paradigm for visual reprogramming.

Poster

EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

Yiyang Fang ⋅ Wenke Huang ⋅ Pei Fu ⋅ Yihao Yang ⋅ Kehua Su ⋅ Zhenbo Luo ⋅ Jian Luan ⋅ Mang Ye

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 70

Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition. To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.

Poster

Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

Xinyu Qiu ⋅ Heng Jia ⋅ Zhengwen Zeng ⋅ Shuheng Shen ⋅ Changhua Meng ⋅ Yi Yang ⋅ Linchao Zhu

Jun 7, 3:30 PM - 5:30 PM ExHall A 69

Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
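
As a rough illustration of the thresholding idea in this abstract, the sketch below shows one possible reading of a reward that uses group-mean verification scores as decision thresholds. The function name, the 0/1 reward values, and the exact comparison rule are assumptions made for illustration; the abstract does not specify them, and the paper's actual reward may differ.

```python
from statistics import mean


def preference_verification_rewards(scores, is_correct):
    """Return one 0/1 verification reward per sampled answer.

    scores:     verification scores the policy gave to its own answers.
    is_correct: ground-truth correctness flag for each answer.

    ASSUMPTION (not from the abstract): a correct answer is rewarded when its
    score exceeds the mean score of the incorrect answers, and an incorrect
    answer is rewarded when its score falls below the mean score of the
    correct answers.
    """
    pos = [s for s, c in zip(scores, is_correct) if c]
    neg = [s for s, c in zip(scores, is_correct) if not c]
    if not pos or not neg:
        # ASSUMPTION: without both groups there is no threshold, so no signal.
        return [0.0] * len(scores)

    pos_mean, neg_mean = mean(pos), mean(neg)
    rewards = []
    for s, c in zip(scores, is_correct):
        if c:
            rewards.append(1.0 if s > neg_mean else 0.0)
        else:
            rewards.append(1.0 if s < pos_mean else 0.0)
    return rewards
```

In this reading, an incorrect answer that the policy scores above the typical correct answer receives no verification reward, which penalizes overconfident self-verification.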

Poster

PersonaVLM: Long-Term Personalized Multimodal LLMs

Chang Nie ⋅ Chaoyou Fu ⋅ Yi-Fan Zhang ⋅ Haihua Yang ⋅ Caifeng Shan

Jun 6, 11:45 AM - 1:45 PM ExHall F 71

Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users’ evolving preferences and personality over time (see Fig. 1). In this paper, we introduce Pal-R3, an innovative personalized multimodal agent framework designed for long-term personalization. Pal-R3 transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish MME-P, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (MME-P) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Our code is available in the supplementary materials.

Poster

Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks

Quanyu Zhang ⋅ Zhongyi Han ⋅ Hao Sun ⋅ Yongshun Gong ⋅ Xiaoyan Wang ⋅ Yilong Yin ⋅ Shuo Li

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 74

Pretraining on large-scale data followed by fine-tuning has become a standard paradigm for visual models. However, noise in the pretraining data can be absorbed by the model and carried into downstream tasks, causing catastrophic inheritance, where inherited pretraining noise reduces downstream generalization. Prior studies mainly link this issue to changes in the feature spectrum, arguing that noise reduces the strength of key feature components. Following this view, they aim to improve transferability by amplifying these components. However, these approaches focus only on spectral energy and implicitly assume that the feature directions remain fixed, which does not hold in practice. In this work, we revisit this view and reveal an overlooked effect: even mild pretraining noise can cause a clear rotation of the dominant feature subspace, despite negligible spectral energy degradation. To quantitatively characterize this phenomenon, we propose using the Principal Directional Angle (PDA) to measure the directional shift between the clean and noisy models. Building on this observation, we introduce the Feature Geometry Stabilization (FGS) framework, which aims to counteract the subspace rotation revealed by PDA by enhancing the geometric stability of the feature space through the synergistic interaction of perturbation consistency, variance-activation regularization, and feature consistency distillation. Experiments across multiple visual benchmarks demonstrate the effectiveness of FGS and verify the importance of stabilizing feature geometry to mitigate catastrophic inheritance.

Poster

Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning

Denis Huseljic ⋅ Marek Herde ⋅ Lukas Rauch ⋅ Paul Hahn ⋅ Bernhard Sick

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 75

Existing active learning (AL) strategies capture fundamentally different notions of data value, e.g., uncertainty or representativeness. Consequently, the effectiveness of strategies can vary substantially across datasets, models, and even AL cycles. Committing to a single strategy risks suboptimal performance, as no single strategy dominates throughout the entire AL process. We introduce REFINE, an ensemble AL method that combines multiple strategies without knowing in advance which will perform best. In each AL cycle, REFINE operates in two stages: (1) Progressive filtering iteratively refines the unlabeled pool by considering an ensemble of AL strategies, retaining promising candidates capturing different notions of value. (2) Coverage-based selection then chooses a final batch from this refined pool, ensuring all previously identified notions of value are accounted for. Extensive experiments across 6 classification datasets and 3 foundation models show that REFINE consistently outperforms individual strategies and existing ensemble methods. Notably, progressive filtering serves as a powerful preprocessing step that improves the performance of any individual AL strategy applied to the refined pool, which we demonstrate on an audio spectrogram classification use case. Finally, the ensemble of REFINE can be easily extended with upcoming state-of-the-art AL strategies.

Poster

CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection

Youngjun Song ⋅ Hyeongyu Kim ⋅ Dosik Hwang

Jun 6, 11:45 AM - 1:45 PM ExHall F 76

Test-time adaptation (TTA) enables real-time adaptation to domain shifts without offline retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Very recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This motivates a fundamental question: can we adaptively balance both strategies based on measured feature-level domain shift? We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions driven by a unified discrepancy metric. Our key innovation lies in the discrepancy-driven coupling: removal and refinement are coupled through this metric, which automatically balances both strategies based on feature-level shift severity. This establishes automatic channel-wise balancing that adapts differentiated treatment to heterogeneous shift magnitudes without manual tuning. Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.

Poster

DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

Haochen Li ⋅ Rui Zhang ⋅ Hantao Yao ⋅ Xin Zhang ⋅ Yifan Hao ⋅ Shaohui Peng ⋅ Yongwei Zhao ⋅ Ling Li

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 77

Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Model (SSM) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of SSMs to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.

Poster

Scaling Dense Event-Stream Pretraining from Visual Foundation Models

Zhiwen Chen ⋅ Junhui Hou ⋅ Zhiyu Zhu ⋅ Jinjian Wu ⋅ Guangming Shi

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 78

Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation burden that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we introduce a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between the image and event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, which offers a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach significantly surpasses traditional methods and existing pretraining techniques on downstream benchmarks. These gains manifest in enhanced generalization, superior data efficiency, and elevated transferability. The source code will be available.

Poster

Towards Multimodal Domain Generalization with Few Labels

Hongzhao Li ⋅ Hao Dong ⋅ Hualei Wan ⋅ Shupan Li ⋅ Mingliang Xu ⋅ Muhammad Haris Khan

Jun 6, 11:45 AM - 1:45 PM ExHall F 78

Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code will be released to support future research.

Poster

Event-based Motion Deblurring with Unpaired Data

Hoonhee Cho ⋅ Yuhwan Jeong ⋅ Kuk-Jin Yoon

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 82

Event cameras provide high-temporal-resolution, motion-centric measurements that remain reliable under fast motion and challenging illumination, making them a promising sensing modality for motion deblurring. However, existing deblurring methods typically require large-scale paired blur–sharp datasets, which are extremely difficult to obtain in real-world settings, especially when an additional modality such as events is involved. In this work, we introduce EMP, an event-based motion deblurring framework that operates entirely in an unpaired setting, removing the need for aligned blur–sharp supervision. EMP bridges the disjoint blur and sharp domains through event information and leverages two complementary training mechanisms tailored to the unpaired regime: (1) an event-based physical prior with confidence masking that provides reliable self-supervisory signals for blurry inputs, and (2) a generative blur modeling process that extracts blur-related frequency-domain cues from blur–event pairs and transfers them to sharp images to synthesize realistic blur. As a result, these mechanisms enable stable and effective deblurring without requiring paired labels. Extensive experiments on various real-event datasets, including REBlur, EventAid, and HighREV, show that EMP outperforms existing unpaired baselines and achieves performance competitive with paired methods. We will make our code publicly available to the research community.

Poster

Geometric-Photometric Event-based 3D Gaussian Ray Tracing

Kai Kohyama ⋅ Yoshimitsu Aoki ⋅ Guillermo Gallego ⋅ Shintaro Shiba

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 81

Event cameras offer higher temporal resolution than traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage the fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray tracing and the image of warped events. Extensive evaluation shows that our method achieves state-of-the-art performance on real-world datasets and competitive performance on synthetic datasets. In addition, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event accumulation size, and achieves sharp reconstruction on scene edges. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. We will release the code upon acceptance.

Poster

Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation

Daikun Liu ⋅ Teng Wang ⋅ Changyin Sun

Jun 7, 11:45 AM - 1:45 PM ExHall F 82

Event cameras have excellent dynamic properties, showing great potential for monocular depth estimation (MDE). However, existing methods mainly improve performance by optimizing contextual features and still struggle with the ill-posed and nonlinear nature of direct full-depth regression. In this paper, we propose HypoDepth, the first event–image monocular depth iterative refinement framework. By introducing a discrete Depth Hypothesis Volume (DHV), we transform the depth regression problem into a constrained depth search task. Specifically, we construct a 3D cost volume between the DHV features and contextual features and perform a multi-scale correlation search to guide stable residual optimization. This lightweight cost volume enables efficient global-to-local refinement across multiple resolutions. Our method outperforms existing approaches on DSEC and MVSEC with state-of-the-art results and strong zero-shot generalization. Meanwhile, our tiny model achieves an excellent balance between accuracy and efficiency, enabling real-time performance on resource-limited devices.

Poster

Event-based Visual Deformation Measurement

Yuliang Wu ⋅ Wei Zhai ⋅ Yuxin Cui ⋅ Tiesong Zhao ⋅ Yang Cao ⋅ Zheng-Jun Zha

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 84

Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation. By revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework that partitions the deformation field into multiple sub-regions and linearizes the deformation within each sub-region using a low-parametric representation, effectively mitigating motion ambiguities arising from the sparse and noisy nature of event observations. To speed up parameter searching and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors and effectively suppressing local error accumulation in long-term dense tracking. To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and high-frame-rate videos is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that the proposed method outperforms the state-of-the-art baseline by 1.6× in terms of continuous measurement success rate (survival rate). Remarkably, our approach achieves superior performance while requiring only 18.9% of the data storage and processing resources compared to traditional high-speed video-based methods, without compromising accuracy.

Poster

From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation

rui hu ⋅ Song Wu ⋅ Wen Yang ⋅ Jinjian Wu

Jun 6, 11:45 AM - 1:45 PM ExHall F 83

Estimating continuous optical flow is a fundamental yet challenging problem in dynamic visual perception. Event-based cameras, with microsecond latency and high dynamic range, capture brightness changes asynchronously, offering a unique opportunity to model motion with fine temporal precision. However, the scarcity of dense annotations restricts the effectiveness of supervised learning, while contrast maximization (CM) frameworks, focused on sharpening the Image of Warped Events (IWE), often neglect temporal continuity and structural coherence, leading to distorted trajectories under complex motion. To overcome these challenges, we propose a hybrid-supervised framework for continuous-time optical flow estimation, grounded in the principle of Spatio-temporal Structural Consistency (STSC). This paradigm jointly enforces local structural stability and trajectory continuity, ensuring physically coherent motion across time. To further enhance representation and robustness, we design a bidirectionally complementary multi-scale architecture and employ a curriculum-guided hybrid training strategy, enabling a smooth transition from supervised point constraints to self-supervised manifold regularization. Comprehensive experiments across multiple benchmarks show that our method achieves state-of-the-art performance in both continuous-time and standard optical flow estimation, demonstrating the effectiveness of the proposed learning paradigm.
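
For readers unfamiliar with the contrast maximization baseline mentioned in this abstract, the following is a minimal sketch of the general idea: warp events to a reference time under a candidate flow, accumulate them into an IWE, and score the candidate by the image variance. It assumes a single constant flow for all events, nearest-pixel accumulation, and ignores polarity; the function names are hypothetical, and this is not the method proposed in the paper.

```python
def image_of_warped_events(events, flow, t_ref, width, height):
    """Accumulate events warped to t_ref under a constant flow (vx, vy) in px/s.

    events: iterable of (x, y, t) tuples; polarity is ignored in this sketch.
    Returns a height x width grid of event counts (the IWE).
    """
    vx, vy = flow
    iwe = [[0.0] * width for _ in range(height)]
    for x, y, t in events:
        # Move each event back along the candidate motion to the reference time.
        xw = round(x - vx * (t - t_ref))
        yw = round(y - vy * (t - t_ref))
        if 0 <= xw < width and 0 <= yw < height:
            iwe[yw][xw] += 1.0
    return iwe


def contrast(iwe):
    """Variance of the IWE; better-aligned events give a sharper, higher-variance image."""
    pixels = [v for row in iwe for v in row]
    mu = sum(pixels) / len(pixels)
    return sum((v - mu) ** 2 for v in pixels) / len(pixels)


# A point moving at 10 px/s along x produces these events.
events = [(2, 1, 0.0), (3, 1, 0.1), (4, 1, 0.2), (5, 1, 0.3)]
sharp = contrast(image_of_warped_events(events, (10.0, 0.0), 0.0, 8, 3))
blurry = contrast(image_of_warped_events(events, (0.0, 0.0), 0.0, 8, 3))
```

With the correct flow all four events land on the same pixel, so `sharp` exceeds `blurry`. Because the objective only rewards sharpness, it places no direct constraint on trajectory smoothness over time, which is the limitation the abstract points to.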

Poster

High-Quality and Efficient Turbulence Mitigation with Events

Xiaoran Zhang ⋅ Jian Ding ⋅ Yuxing Duan ⋅ Haoyue Liu ⋅ Gang Chen ⋅ Yi Chang ⋅ Luxin Yan

Jun 7, 11:45 AM - 1:45 PM ExHall F 83

Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent "event tubes" in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Extensive experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively.

Poster

x^2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space

Ruishan Guo ⋅ Ciyu Ruan ⋅ Haoyang Wang ⋅ Zihang GONG ⋅ Jingao Xu ⋅ Xinlei Chen

Jun 6, 11:45 AM - 1:45 PM ExHall F 85

Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.

Poster

Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning

Sixian Zhang ⋅ Yiyao Wang ⋅ Xinhang Song ⋅ Keming Zhang ⋅ Zijian Xu ⋅ Shuqiang Jiang

Jun 7, 3:30 PM - 5:30 PM ExHall A 85

Understanding the geometric and semantic structure of environments is essential for embodied agents. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics, and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region level concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target localization and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner.
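
To make "analytically derives Gaussian parameters from dense point clouds" concrete, here is a minimal sketch of one simple closed-form choice: the sample mean and sample covariance of a cluster of 3D points. This is an assumption for illustration only; the abstract does not say which parameters the Gaussian Estimator computes or how, and `fit_gaussian` is a hypothetical name.

```python
def fit_gaussian(points):
    """Closed-form mean and 3x3 sample covariance of a list of (x, y, z) points.

    ASSUMPTION: a plain moment estimate, used here only to show that Gaussian
    parameters can be obtained without any gradient-based optimization.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate a covariance")

    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                # Unbiased estimate: divide by n - 1.
                cov[i][j] += d[i] * d[j] / (n - 1)
    return mean, cov


# Four coplanar points: the fitted Gaussian is flat along z.
mean, cov = fit_gaussian([(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0)])
```

A single pass over the points suffices, which is what makes this kind of estimate attractive for incremental map construction.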

Poster

From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras

Taehun Ryu ⋅ Changwoo Kang ⋅ Kyungdon Joo

Jun 7, 11:45 AM - 1:45 PM ExHall F 87

The conventional checkerboard-based calibration for standard cameras faces fundamental limitations when applied to bio-inspired event cameras. Specifically, this stems from two challenges: (i) Events are triggered asynchronously at different timestamps along motion trajectories. Directly accumulating them on the image plane causes temporal misalignment and produces blurred edges. (ii) Checkerboard corners on event cameras show near-zero event occurrence at the corner itself. This hinders reliable corner localization and makes calibration difficult. To address these issues, we present a novel calibration framework that directly detects checkerboard corners from a raw event stream. We first mathematically analyze the absence of events at corner points. Based on this fact, we then leverage edge-driven event cues to initialize corner positions. Using the near-zero event occurrence at checkerboard corners, we gradually refine the estimated corner toward low event-density regions, achieving sub-pixel accuracy. Furthermore, we extend the corner detection to fiducial markers such as AprilTags, resulting in reliable detection even under partial visibility or occlusion. Evaluations on self-collected and public data demonstrate reliable checkerboard corner detection and stable camera calibration.

Poster

Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control

Zhe Li ⋅ Cheng Chi ⋅ Yangyang Wei ⋅ Boan Zhu ⋅ Tao Huang ⋅ Zhenguo Sun ⋅ Yibo Peng ⋅ Pengwei Wang ⋅ Zhongyuan Wang ⋅ Fangzhou Liu ⋅ Chang Xu ⋅ Shanghang Zhang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 89

Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and disjointed acoustic-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as implicit style signals and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive freestyle performers capable of reacting to audio.

Poster

CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

Andrew Jeong ⋅ Jaemin Kim ⋅ Sebin Lee ⋅ Sung-Eui Yoon

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 90

We propose CLaD (Cross-modal Latent Dynamics), a framework for learning temporally consistent cross-modal representations in robotic manipulation. Our approach models transition dynamics rather than static state correspondences: asymmetric cross-attention enables proprioceptive transitions to query semantic ones, extracting shared dynamics structure that respects the causal ordering imposed by actions. We formalize grounded latent foresight as predictions anchored through EMA-based targets from observed trajectories and auxiliary reconstruction to observable space, preventing collapse to abstract representations. A diffusion policy conditions on these learned foresights via feature modulation, decoupling dynamics learning from control optimization. Evaluated on LIBERO-LONG, our method achieves 94.9% success with 0.66B parameters, demonstrating that explicit cross-modal transition modeling enables parameter-efficient planning outperforming larger VLAs.

Poster

GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching

Yuqi Chen ⋅ Junjie Gao ⋅ Yongzhou Pan ⋅ Siyuan Song ⋅ ZIXUAN ZHANG ⋅ Jiaping Xiao ⋅ Mir Feroskhan

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 93

Image-goal navigation driven by generative models has recently shown strong potential owing to their ability to perform multi-modal reasoning and stable learning in continuous control spaces. Despite their promise, current methods still face several fundamental limitations. Many rely on pre-built priors and lack explicit mechanisms for trajectory evaluation, restricting generalization and goal alignment in map-free navigation. Moreover, current generative policies often face inefficiency or temporal inconsistency, resulting in temporally unstable motion. The absence of interactive, closed-loop benchmarks further limits fair and reproducible comparison. To address these issues, we propose GeniNav, a generative image-goal navigation framework that couples a VLM-driven latent subgoal imagination module for high-level semantic guidance with Multi-Segment Consistency Flow Matching (MS-CFM) for temporally smooth and dynamically coherent motion generation. A hybrid trajectory evaluation module further integrates semantic alignment and geometric feasibility to assess goal consistency. We also introduce a closed-loop simulation benchmark with a large-scale dataset spanning 176 scenes and 491.6 km for standardized training and evaluation. Extensive experiments in simulation and on real robots demonstrate the effectiveness of our method.

View full details

Poster

SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robotics

Mengzhen Liu ⋅ Enshen Zhou ⋅ Cheng Chi ⋅ Yi Han ⋅ Shanyu Rong ⋅ Liming Chen ⋅ Pengwei Wang ⋅ Zhongyuan Wang ⋅ Shanghang Zhang

Jun 7, 3:30 PM - 5:30 PM ExHall A 92

Active perception and manipulation are crucial for embodied robots to interact with complex scenes. Existing methods struggle to unify active, semantics-driven perception with robust, viewpoint-invariant execution. To this end, we propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Central to our approach is decoupling camera and manipulation actions, rather than placing them in a shared action space, and learning with a bottom-up strategy: we first train semantic camera control on our proposed large-scale dataset, then jointly optimize both action types via hybrid data. To support this learning, we introduce ActiveViewPose-200K, comprising 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We further present ActiveManip-Bench, the first benchmark to fill the gap in evaluating active manipulation. Extensive experiments in both simulation and real-world settings show that SaPaVe outperforms recent VLA models such as GR00T and $\pi_0$, achieving up to 31.25% higher success rates in real-world tasks. Our results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation.

View full details

Poster

RealAppliance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals

Yuzheng Gao ⋅ Yuxing Long ⋅ Lei Kang ⋅ Yuchong Guo ⋅ Ziyan Yu ⋅ Shangqing Mao ⋅ Jiyao Zhang ⋅ Ruihai Wu ⋅ Dongjiang Li ⋅ Hui Shen ⋅ Hao Dong

Jun 7, 3:30 PM - 5:30 PM ExHall A 94

Existing appliance assets suffer from poor rendering, incomplete mechanisms, and misalignment with manuals, leading to simulation-reality gaps that hinder appliance manipulation development. In this work, we introduce the RealAppliance dataset, comprising 100 high-fidelity appliances with complete physical and electronic mechanisms and program logic aligned with their manuals. Based on these assets, we propose the RealAppliance-Bench benchmark, which evaluates multimodal large language models and embodied manipulation planning models across key tasks in appliance manipulation planning: manual page retrieval, appliance part grounding, open-loop manipulation planning, and closed-loop planning adjustment. Our analysis of model performance on RealAppliance-Bench provides insights for advancing appliance manipulation research.

View full details

Poster

ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

Zhuoyang Zhang ⋅ Shang Yang ⋅ Qinghao Hu ⋅ Luke J. Huang ⋅ James Hou ⋅ Yufei Sun ⋅ Yao Lu ⋅ Song Han

Jun 7, 3:30 PM - 5:30 PM ExHall A 95

Vision-Language-Action (VLA) models convert abstract language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present \textit{Visually Grounded Planning}, a general and efficient high-level planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuomotor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image-generation module that predicts a high-quality 640×480 future observation from the current visual input and language instruction within only 0.33 s on an H100 GPU, together with a vision–language component that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on approximately 10 million multi-task, cross-embodiment samples, enabling it to learn robust embodied dynamics and achieve strong real-world generalization. We evaluate our framework on a benchmark consisting of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4%, demonstrating a +40.9% absolute improvement over the $\pi_0$ baseline (46.5%) and a +30.3% absolute improvement over $\pi_0$ augmented with textual subtask guidance (57.1%).
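The reported gains are absolute differences in success rate, that is, percentage points rather than relative improvements. The short check below recomputes them from the three success rates quoted in the abstract; the variable names are illustrative only.

```python
# Average success rates (in percent) as quoted in the abstract.
ours = 87.4
pi0_baseline = 46.5
pi0_text_guided = 57.1

# Absolute improvement = difference in percentage points.
gain_over_baseline = round(ours - pi0_baseline, 1)
gain_over_text_guided = round(ours - pi0_text_guided, 1)

# Relative improvement, shown only for contrast with the absolute figures.
relative_gain_over_baseline = (ours - pi0_baseline) / pi0_baseline

print(gain_over_baseline, gain_over_text_guided)
print(f"{relative_gain_over_baseline:.1%}")
```

The two absolute gains match the +40.9 and +30.3 figures in the abstract, while the relative gain over the baseline is considerably larger because the baseline succeeds on fewer than half of the trials.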

View full details

Poster

Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

Xinhao Liu ⋅ Jiaqi Li ⋅ Youming Deng ⋅ Ruxin Chen ⋅ Yingjia Zhang ⋅ Yifei Ma ⋅ Li Guo ⋅ Yiming Li ⋅ Jing Zhang ⋅ Chen Feng

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 97

Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI tasks such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI.

View full details

Poster

When Robots Should Say “I Don’t Know”: Benchmarking Abstention in Embodied Question Answering

Tao Wu ⋅ Chuhao Zhou ⋅ Guangyu Zhao ⋅ Haozhi Cao ⋅ Yewen Pu ⋅ Jianfei Yang

Jun 6, 11:45 AM - 1:45 PM ExHall F 96

Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
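For readers unfamiliar with the metric, the sketch below computes abstention recall under the standard definition of recall: of the questions that should trigger abstention, the fraction on which the model actually abstains. The abstract does not spell out its exact scoring protocol, so the function `abstention_recall` and the toy labels are assumptions made for illustration.

```python
def abstention_recall(should_abstain, did_abstain):
    """Fraction of abstention-required cases where the model abstained.

    Both arguments are equal-length sequences of booleans, one entry
    per question. Returns 0.0 when no case requires abstention.
    """
    required = sum(1 for s in should_abstain if s)
    hits = sum(1 for s, d in zip(should_abstain, did_abstain) if s and d)
    return hits / required if required else 0.0


# Toy example: three questions require abstention, the model abstains on two of them
# and also abstains (unnecessarily) on the one well-posed question.
should = [True, True, False, True]
did = [True, False, True, True]
recall = abstention_recall(should, did)
print(recall)
```

Recall alone does not penalize unnecessary abstention on well-posed questions, which is one reason a balanced set of abstention and original instances is useful for evaluation.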

View full details

Poster

UniLight: A Unified Representation for Lighting

Zitian Zhang ⋅ Iliyan Georgiev ⋅ Michael Fischer ⋅ Yannick Hold-Geoffroy ⋅ Jean-François Lalonde ⋅ Valentin Deschaintre

Jun 7, 11:45 AM - 1:45 PM ExHall F 99

Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space for lighting representation that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.

View full details

Poster

Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity

Zhengyao Fang ⋅ Zexi Jia ⋅ Yijia Zhong ⋅ Pengcheng Luo ⋅ Jinchao Zhang ⋅ Guangming Lu ⋅ Jun Yu ⋅ Wenjie Pei

Jun 7, 3:30 PM - 5:30 PM ExHall A 101

Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which tends to make generations $\textit{too vivid to be real}$ even when prompted for realistic-style images. To address this issue, we present $\textbf{Color Fidelity Dataset (CFD)}$ and $\textbf{Color Fidelity Metric (CFM)}$ for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free $\textbf{Color Fidelity Refinement (CFR)}$ that adaptively modulates spatial–temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. All datasets and code will be publicly released.

View full details

Poster

Radiance Meshes for Volumetric Reconstruction

Alexander Mai ⋅ Trevor Hedstrom ⋅ George Kopanas ⋅ Janne Kontkanen ⋅ Falko Kuester ⋅ Jonathan T. Barron

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 102

We introduce Radiance Meshes for representing radiance fields with constant-density tetrahedral cells produced with a Delaunay tetrahedralization. Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieves faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms. Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes. Our rendering method exactly evaluates the volume rendering equation and enables high quality, real-time view synthesis on standard consumer hardware. Our tetrahedral meshes also lend themselves to a variety of exciting applications including fisheye lens distortion, physics-based simulation, editing, and mesh extraction.
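To see why constant-density cells permit exact volume rendering, note that the transmittance through a segment of length d with constant density σ is exp(-σd), so the rendering integral reduces to a finite sum over the cells a ray crosses. The sketch below composites such segments front to back. It is a simplified illustration, not the paper's renderer: the function `render_ray` is hypothetical, and it assumes a single scalar color that is constant within each cell.

```python
import math


def render_ray(segments):
    """Composite (density, length, color) segments ordered front to back.

    Assumes density and color are constant inside each segment, in which
    case this sum evaluates the volume rendering integral exactly.
    Returns the accumulated color and the remaining transmittance.
    """
    transmittance = 1.0
    color = 0.0
    for sigma, length, c in segments:
        alpha = 1.0 - math.exp(-sigma * length)  # opacity of this segment
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return color, transmittance


# Two cells along a ray: a thin haze followed by a denser cell.
segments = [(0.5, 2.0, 0.2), (1.0, 1.0, 0.9)]
color, remaining = render_ray(segments)
print(color, remaining)
```

The remaining transmittance equals the exponential of the negative total optical depth (here 0.5 × 2.0 + 1.0 × 1.0), which is a quick sanity check that no approximation was introduced by splitting the ray into cells.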

View full details

Poster

Harmonic Canvas: Inversion-Free Editing for Visually-Guided Music Style Transfer

Yue Lei ⋅ Siqi Yang ⋅ Ting Zhong ⋅ Fan Zhou

Jun 7, 11:45 AM - 1:45 PM ExHall F 103

Music style transfer (MST) aims to reinterpret existing musical pieces in new stylistic forms while maintaining their melodic coherence. Conventional approaches conditioned on text or audio overlook the profoundly multimodal character of musical style. Visual ambience -- reflected in color, lighting, and composition -- encodes affective attributes that parallel timbre, rhythm, and harmony, yet it remains underexplored in the MST context. We introduce a flow-based, inversion-free framework for multimodal music style transfer that unifies textual and visual guidance. Our approach tackles two challenges: (1) capturing cross-modal semantics beyond language through a dual-encoder fusion module that merges CLIP- and ViT-derived embeddings, and (2) preserving melodic identity using a differentiable normalized chroma constraint that regulates pitch-class consistency along the generative flow. We reorganize and extend the MeLBench and MusicCaps collections into a genre-structured multimodal dataset to support style-aware analysis. Quantitative and perceptual evaluations demonstrate that our approach achieves superior control, structural fidelity, and cross-modal expressiveness, underscoring the role of visual perception in music generation.

View full details

Poster

CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis

Xin Ma ⋅ Peng Lu ⋅ Yisong Chen ⋅ Chengwei Pan ⋅ Sheng Li

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 104

Novel view synthesis (NVS) under large view deviations remains an underexplored challenge for 3D Gaussian Splatting (3DGS). In urban scenes with limited training coverage, models often fail to maintain geometric consistency when extrapolating to unseen viewpoints, resulting in severe distortions and degraded rendering quality. We introduce Context-Aware Gaussian Splatting (CoRoGS), a $\textbf{Co}$ntext-aware framework for $\textbf{Ro}$bust large-deviation novel view synthesis (LD-NVS) that embeds contextual reasoning into 3DGS. Instead of treating Gaussians as independent primitives, CoRoGS adopts a contextual formulation that explicitly models inter-Gaussian dependencies. This representation is implemented by constructing a 3D Gaussian graph, which propagates relational geometry and semantics via message passing, resulting in context-aware Gaussian updates. To further maintain structural consistency under substantial view deviation, we incorporate a progressive graph expansion strategy that adaptively grows and prunes Gaussians, leading to more coherent and complete scene reconstructions. Extensive experiments demonstrate that CoRoGS outperforms state-of-the-art 3DGS-based methods, producing higher-quality results. We highlight that CoRoGS robustly handles a wide range of view shifts, including lateral deviations (e.g., lane-level offsets) and cross-level transitions such as from ground-level driving views to elevated perspectives.

View full details

Poster

How to Take a Memorable Picture? Empowering Users with Actionable Feedback

Francesco Laiti ⋅ Davide Talon ⋅ Jacopo Staiano ⋅ Elisa Ricci

Jun 7, 11:45 AM - 1:45 PM ExHall F 104

Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image's likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo's memorability. We introduce the task of **Mem**orability **Feed**back (**MemFeed**), where an automated model should provide actionable, human-interpretable guidance to users with the goal of enhancing an image's future recall. We also present **MemCoach**, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., “emphasize facial expression,” “bring the subject forward”). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model's internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce **MemBench**, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators. Dataset and code will be publicly released upon publication.

View full details

Poster

ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes

Zhongtao Wang ⋅ Jiaqi Dai ⋅ Qingtian Zhu ⋅ Yilong Li ⋅ Mai Su ⋅ Fei Zhu ⋅ Meng GAI ⋅ Shaorong Wang ⋅ Chengwei Pan ⋅ Yisong Chen ⋅ Guoping Wang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 105

Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It is also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release the ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset will be made publicly available.

View full details

Poster

TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

Jiawei Ren ⋅ Michal Tyszkiewicz ⋅ Jiahui Huang ⋅ Žan Gojčič

Jun 6, 11:45 AM - 1:45 PM ExHall F 105

In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby _unbinding_ the number of predicted primitives from input image resolution and number of views. Our resulting method, __TokenGS__, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.

View full details

Poster

Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering

Hugo Blanc ⋅ Jean-Emmanuel Deschaud ⋅ Alexis Paljic

Jun 6, 11:45 AM - 1:45 PM ExHall F 106

Recent advances in novel view synthesis have enabled differentiable rendering methods to reconstruct 3D scenes directly from images. Algorithms such as 3D Gaussian Splatting and RayGauss use local basis functions to represent radiance fields, enabling fast, high-quality rendering of real-world scenes. However, these methods lack an exact geometric representation of the scene. In this work, inspired by Hermite Radial Basis Function (HRBF) implicits, we introduce a global implicit function constructed from local RBFs and their derivatives to represent surfaces. The proposed formulation enables learning scene geometry through differentiable rendering of an implicit function. By leveraging local basis functions, it achieves both an efficient geometric representation and fast rendering, using a bounding volume hierarchy (BVH) to accelerate intersections with the local basis functions. The implementation of our approach will be made publicly available upon the paper’s publication.

View full details

Poster

LumiMotion: Improving Gaussian Relighting with Scene Dynamics

Joanna Kaleta ⋅ Piotr Wójcik ⋅ Kacper Marzol ⋅ Tomasz Trzciński ⋅ Kacper Kania ⋅ Marek Kowalski

Jun 7, 3:30 PM - 5:30 PM ExHall A 106

In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements (regions of the scene that undergo motion) as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results, which show that we improve LPIPS by 23% for albedo estimation and by 15% for scene relighting relative to the next-best baseline. To this end, we introduce LumiMotion, the first Gaussian-based approach that leverages dynamics for inverse rendering and operates in arbitrary dynamic scenes. Our method learns a dynamic 2D Gaussian Splatting representation that employs a set of novel constraints which encourage the dynamic regions of the scene to deform, while keeping static regions stable. As we demonstrate, this separation is crucial for correct optimization of the albedo. Finally, we release a new synthetic benchmark comprising five scenes under four lighting conditions, each in both static and dynamic variants, for the first time enabling systematic evaluation of inverse rendering methods in dynamic environments and challenging lighting.

View full details

Poster

RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes

Jiarui Zhang ⋅ Zhihao Li ⋅ Chong Wang ⋅ Bihan Wen

Jun 6, 11:45 AM - 1:45 PM ExHall F 107

Neural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated \textbf{scene flow module} further predicts temporal offsets between adjacent frames, enforcing temporal occupancy coherence during dynamic scene reconstruction. Moreover, we propose a \textbf{radar-specific power rendering formulation} grounded in radar sensing physics, improving both synthesis accuracy and interpretability. Extensive experiments on public radar datasets demonstrate that RF4D substantially outperforms existing methods in radar measurement synthesis and occupancy estimation accuracy, with particularly strong gains in dynamic outdoor environments.

View full details

Poster

GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers

Yuxuan Xue ⋅ Ruofan Liang ⋅ Egor Zakharov ⋅ Timur Bagautdinov ⋅ Chen Cao ⋅ Giljoo Nam ⋅ Shunsuke Saito ⋅ Gerard Pons-Moll ⋅ Javier Romero

Jun 7, 11:45 AM - 1:45 PM ExHall F 107

Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: **GeoRelight**. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.

View full details

Poster

ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation

Jinsheng Quan ⋅ Qiaowei Miao ⋅ Yichao Xu ⋅ Zizhuo Lin ⋅ Ying Li ⋅ Wei Yang ⋅ Zhihui Li ⋅ Yawei Luo

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 108

The ability to extrapolate dynamic 3D scenes beyond the observed timeframe is fundamental to advancing physical world understanding and predictive modeling. Existing dynamic 3D reconstruction methods have achieved high-fidelity rendering of temporal interpolation, but typically lack physical consistency in predicting the future. To overcome this issue, we propose ParticleGS, a physics-based framework that reformulates dynamic 3D scenes as physically grounded systems. ParticleGS comprises three key components: 1) an encoder that decomposes the scene into static properties and initial dynamic physical fields; 2) an evolver based on Neural Ordinary Differential Equations (Neural ODEs) that learns continuous-time dynamics for motion extrapolation; and 3) a decoder that reconstructs 3D Gaussians from evolved particle states for rendering. Through this design, ParticleGS integrates physical reasoning into dynamic 3D representations, enabling accurate and consistent prediction of the future. Experiments show that ParticleGS achieves state-of-the-art performance in extrapolation while maintaining rendering quality comparable to leading dynamic 3D reconstruction methods.

View full details

Poster

IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors

Qingan Zhang ⋅ Wensheng Li ⋅ Chengying Gao

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 112

Applying 3D Gaussian Splatting to inverse rendering, especially for relightable assets under high-illuminance conditions, remains challenging. Strong specular highlights and complex reflections complicate material-light disentanglement, often baking in shadows and losing specular detail. To address this, we introduce IR-HGP, a framework that achieves robust disentanglement using three synergistic modules: First, a Hybrid Visibility Decomposition module ensures physical visibility consistency. Second, a Generative Illumination Field Prior module infers detailed and high-dynamic range environmental lighting. Finally, a Physics-Aware Radiance Correction module stabilizes optimization and mitigates illumination artifacts. Our framework achieves SOTA material recovery and relighting performance, outperforming existing methods under challenging illumination conditions. It reconstructs the view-dependent “shiny” appearance of reflective surfaces in real time, surpassing the limits of prior 3DGS-based inverse rendering methods.

View full details

Poster

Semantic Foam: Unifying Spatial and Semantic Scene Decomposition

Amr Sharafeldin ⋅ Aryan Mikaeili ⋅ Thomas Walker ⋅ Shrisudhan Govindarajan ⋅ Daniel Rebain ⋅ Kwang Moo Yi ⋅ Andrea Tagliasacchi

Jun 7, 11:45 AM - 1:45 PM ExHall F 111

Current-generation scene reconstruction methods like 3D Gaussian Splatting are capable of producing photo-realistic novel view synthesis at real-time speeds, yet see only limited adoption in many practical graphics applications. One significant contributing factor to this gap is the difficulty of interacting with and editing these representations in comparison to classic human-authored 3D assets. While work has been done to impose semantic decomposition onto these representations, there are still significant limitations in the quality and consistency of these segmentations. We address this by proposing a semantically decomposed variant of the recently introduced Radiant Foam method. Our approach, Semantic Foam, combines the natural spatial volumetric decomposition provided by Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized on the cells. The explicit mesh structure enables direct spatial regularization that prevents artifacts caused by inconsistent supervision across views or occlusion, which affect similar approaches for other point-based representations. We show that our method achieves superior performance on object-level segmentation compared to Gaussian Grouping and SAGA.

View full details

Poster

MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

Sangwoon Kwak ⋅ Weeyoung Kwon ⋅ Jun Young Jeong ⋅ Geonho Kim ⋅ Won-Sik Cheong ⋅ Jihyong Oh

Jun 7, 3:30 PM - 5:30 PM ExHall A 111

Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes. However, one of the major remaining challenges lies in modeling dynamic videos that contain long-range motion, where a naïve extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes. Our method progressively constructs locally canonical anchor spaces at key-frame time indices and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between key-frame anchors (KfAs) and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts. We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfAs while maintaining rendering quality, based on an assigned level of feature variance. To effectively evaluate our model's capability to handle real-world long-range 4D motion, we compose a new dataset containing long-range 4D motion, called SelfCap$_{\text{LR}}$. It has a larger average dynamic motion magnitude and is captured in spatially wider spaces than previous dynamic video datasets. Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations. The code and project page will be publicly released.

Poster

Residual Diffusion Bridge Model for Image Restoration

Hebaixu Wang ⋅ Jing Zhang ⋅ Haoyang Chen ⋅ Haonan Guo ⋅ Di Wang ⋅ Jiayi Ma ⋅ Bo Du

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 112

Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of the generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact ones. Additionally, we unravel the fundamental mathematical essence of existing bridge models, showing that all of them are special cases of RDBM, and empirically demonstrate the optimality of our proposed models. Extensive experiments demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks.

Poster

LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

Stanislaw Szymanowicz ⋅ Minghao Chen ⋅ Jianyuan Wang ⋅ Christian Rupprecht ⋅ Andrea Vedaldi

Jun 6, 11:45 AM - 1:45 PM ExHall F 112

Novel View Synthesis has often relied on explicit 3D representations, which inject a strong 3D bias in the process; however, recent work has shown that network-based rendering can work better despite lacking 3D inductive biases. In this paper, we show that much better quality can be obtained by leveraging a strong 3D bias without a 3D representation. To do so, we introduce LagerNVS, an encoder-decoder network that uses 3D-aware features as a latent scene encoding. The encoder is initialized from a 3D reconstruction network, paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis results (including 31.1 PSNR on Re10k), with and without known cameras, renders in real-time, generalizes to in-the-wild data without known cameras, and can be paired with a diffusion decoder for generative completions.

Poster

Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields

Ankit Dhiman ⋅ Tao Lu ⋅ Srinath Ravi ⋅ Emre Arslan ⋅ Angela Xing ⋅ Yuanbo Xiangli ⋅ R. Venkatesh Babu ⋅ Srinath Sridhar

Jun 6, 11:45 AM - 1:45 PM ExHall F 113

Novel-view synthesis plays a crucial role in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent approaches, such as 3D Gaussian Splatting (3DGS), have emerged as state-of-the-art solutions, offering high-quality novel view synthesis in real time. However, training 3DGS models remains slow, particularly for high-resolution images, often requiring hours to fit a scene with 200 views. In this work, we aim to accelerate the fitting process by reducing computational overhead and improving learning efficiency. Specifically, we introduce a dilated rendering technique that renders only a subset of pixels instead of the full image, significantly reducing computational costs. To enhance learning efficiency, we develop a convergence-aware budget control mechanism that balances the addition of new Gaussians with the optimization of existing ones. Additionally, to improve densification efficiency and prevent gradient vanishing, we incorporate both positional and appearance error to enhance densification effectiveness. With these improvements, we achieve fast 4K-resolution fitting while maintaining, or even improving, novel view rendering quality. Extensive experiments demonstrate that our method achieves significantly faster optimization than existing approaches while preserving high rendering fidelity.
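
The dilated rendering idea (rendering only a subset of pixels) can be illustrated with a minimal Python sketch. It assumes a regular pixel stride with a random phase per iteration; the abstract does not specify the actual sampling pattern, and `dilated_pixel_grid` and `subset_l1_loss` are hypothetical names introduced only for illustration.

```python
import numpy as np

def dilated_pixel_grid(height, width, stride, rng):
    """Return row/column indices of a strided pixel subset with a random phase."""
    # Assumption: regular stride plus a random offset, so that different
    # iterations cover different pixels. Not taken from the paper.
    row_offset = int(rng.integers(0, stride))
    col_offset = int(rng.integers(0, stride))
    rows = np.arange(row_offset, height, stride)
    cols = np.arange(col_offset, width, stride)
    return rows, cols

def subset_l1_loss(rendered_subset, target, rows, cols):
    """L1 loss between the rendered subset and the matching target pixels."""
    target_subset = target[np.ix_(rows, cols)]
    return float(np.abs(rendered_subset - target_subset).mean())

rng = np.random.default_rng(0)
target = rng.random((216, 384, 3))           # small stand-in for a training image
rows, cols = dilated_pixel_grid(216, 384, stride=4, rng=rng)
# Stand-in for a renderer that only rasterizes the selected pixels.
rendered_subset = target[np.ix_(rows, cols)] + 0.01
loss = subset_l1_loss(rendered_subset, target, rows, cols)
```

With a stride of 4 in both directions, each iteration touches 1/16 of the pixels, which is where the cost reduction in this sketch comes from.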

Poster

NeAR: Coupled Neural Asset–Renderer Stack

Hong Li ⋅ Chongjie Ye ⋅ Houyuan Chen ⋅ Weiqing Xiao ⋅ Ziyang Yan ⋅ Lixing Xiao ⋅ Zhaoxi Chen ⋅ Jianfeng XIANG ⋅ Shaocong Xu ⋅ Xuhui Liu ⋅ Yikai Wang ⋅ Baochang Zhang ⋅ Xiaoguang Han ⋅ Jiaolong Yang ⋅ Hao Zhao

Jun 7, 11:45 AM - 1:45 PM ExHall F 113

Neural asset authoring and neural rendering have emerged as largely disjoint threads: one generates digital assets using neural networks for traditional graphics pipelines, while the other develops neural renderers that map conventional assets to images. However, the joint design of the asset representation and renderer remains largely unexplored. We argue that coupling them can unlock an end-to-end learnable graphics stack with benefits in fidelity, consistency, and efficiency. In this paper, we explore this possibility with **NeAR**: a Coupled Neural Asset–Renderer Stack. On the **asset** side, we build on Trellis-style Structured 3D Latents and introduce a lighting-homogenized neural asset: from a casually lit input, a rectified-flow backbone predicts a Lighting-Homogenized SLAT that encodes geometry and intrinsic material cues in a compact, view-agnostic latent. On the **renderer** side, we design a lighting-aware neural renderer that uses this neural asset, along with explicit view embeddings and HDR environment maps, to produce lighting-aware renderings in realtime. We validate NeAR on four tasks: (1) G-buffer–based forward rendering, (2) random-lit single-image reconstruction, (3) unknown-lit single-image relighting, and (4) novel-view relighting, where our coupled stack surpasses state-of-the-art baselines in quantitative metrics and perceptual quality. We hope this coupled asset-renderer perspective inspires new graphics stacks that view neural assets and renderers as co-designed components instead of independent ones.

Poster

Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis

M. Kerem Aydin ⋅ Vishwanath Saragadam ⋅ Emma Alexander

Jun 7, 11:45 AM - 1:45 PM ExHall F 114

Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the-art performance across thermal-only NVS benchmarks, without requiring any dataset-specific tuning.

Poster

WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

Xuweiyi Chen ⋅ Wentao Zhou ⋅ Zezhou Cheng

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 116

We present **WildRayZer**, a self-supervised framework for novel view synthesis (NVS) in dynamic environments, where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, causing ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.

Poster

Towards Generalized Multimodal Homography Estimation

Jinkun You ⋅ Jiaxin Cheng ⋅ Jie Zhang ⋅ Yicong Zhou

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 115

Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization performance. The results also confirm the effectiveness of the proposed network.
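
Generating ground-truth offsets from a single image can be sketched with the common four-corner perturbation scheme. This is a minimal Python illustration and an assumption, not the authors' synthesis method: it covers only the geometric labels (the texture and color rendering described in the abstract is omitted), and all function names are hypothetical.

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve the 3x3 homography mapping four src corners to four dst corners (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of the 8x9 system.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def apply_homography(h, pts):
    """Map (N, 2) points through a homography."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return homog[:, :2] / homog[:, 2:3]

def sample_pair_label(patch_size, max_shift, rng):
    """Draw random corner offsets (the ground truth) and the matching homography."""
    corners = np.array(
        [[0, 0], [patch_size, 0], [patch_size, patch_size], [0, patch_size]],
        dtype=float,
    )
    offsets = rng.uniform(-max_shift, max_shift, size=(4, 2))
    h = homography_from_corners(corners, corners + offsets)
    return corners, offsets, h
```

Warping the input image with `h` (using any image library) would give the second, unaligned image of the pair, with `offsets` as its label.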

Poster

PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis

chunji lv ⋅ Zequn Chen ⋅ Donglin Di ⋅ Weinan Zhang ⋅ Hao Li ⋅ Wei Chen ⋅ Yinjie Lei ⋅ Changsheng Li

Jun 7, 11:45 AM - 1:45 PM ExHall F 115

Despite advances in physics-based 3D motion synthesis, current methods face key limitations: reliance on pre-reconstructed 3D Gaussian Splatting (3DGS) built from dense multi-view images with time-consuming per-scene optimization; physics integration via either inflexible, hand-specified attributes or unstable, optimization-heavy guidance from video models using Score Distillation Sampling (SDS); and naïve concatenation of prebuilt 3DGS with physics modules, which ignores physical information embedded in appearance and yields suboptimal performance. To address these issues, we propose PhysGM, a feed-forward framework that jointly predicts 3D Gaussian representation and physical properties from a single image, enabling immediate simulation and high-fidelity 4D rendering. Unlike slow appearance-agnostic optimization methods, we first pre-train a physics-aware reconstruction model that directly infers both Gaussian and physical parameters. We further refine the model with Direct Preference Optimization (DPO), aligning simulations with the physically plausible reference videos and avoiding the high-cost SDS optimization. To address the absence of a supporting dataset for this task, we propose PhysAssets, a dataset of 50K+ 3D assets annotated with physical properties and corresponding reference videos. Experiments show that PhysGM produces high-fidelity 4D simulations from a single image in one minute, achieving a significant speedup over prior work while delivering realistic renderings.

Poster

Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling

Tang Long ⋅ Huiyu Duan ⋅ Guoquan Zheng ⋅ Jianbo Zhang ⋅ Jie Hao ⋅ Liang Yuan

Jun 7, 11:45 AM - 1:45 PM ExHall F 116

Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks while overlooking their unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced \underline{l}ayer \underline{i}nteraction and MoE-based \underline{f}eature d\underline{e}coupling, termed \textbf{Life-IQA}. Specifically, the GCN-enhanced layer interaction module uses the GCN-enhanced deepest-layer features as the query and the penultimate-layer features as the key and value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple the fused representations through different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA achieves a more favorable balance between accuracy and cost than a vanilla Transformer decoder and attains state-of-the-art performance on multiple BIQA benchmarks. The code will be released upon publication.
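
The layer-interaction step (deepest-layer features as query, penultimate-layer features as key and value) corresponds to standard cross-attention. The following Python sketch shows a single-head version under stated assumptions: the GCN enhancement is omitted, and the token counts, feature widths, and random projection matrices are placeholders rather than values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention."""
    q = query_tokens @ w_q            # queries from the deepest layer
    k = context_tokens @ w_k          # keys from the penultimate layer
    v = context_tokens @ w_v          # values from the penultimate layer
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
deepest = rng.normal(size=(49, 32))        # hypothetical 7x7 token grid
penultimate = rng.normal(size=(196, 32))   # hypothetical 14x14 token grid
w_q, w_k, w_v = (rng.normal(size=(32, 16)) for _ in range(3))
fused, weights = cross_attention(deepest, penultimate, w_q, w_k, w_v)
```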

Poster

SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control

Xiaofeng Cong ⋅ Yu-Xin Zhang ⋅ Hao Shen ⋅ Yeying Jin ⋅ Junming Hou ⋅ Jie Gui

Jun 7, 3:30 PM - 5:30 PM ExHall A 116

Underwater images often exhibit dominant blue-green hues due to wavelength-dependent light attenuation. While existing enhancement methods have achieved promising performance, they typically overlook the subjective nature of visual preferences. To address this gap, we propose SDUIE, a level-aware Semi-supervised Diffusion framework for Underwater Image Enhancement that enables dual control through both quantitative and textual inputs. SDUIE-Quant allows continuous, numerical adjustment of enhancement levels via low-rank adaptation weight merging within a dual-branch diffusion model. This model comprises a supervised branch trained on synthetic underwater-terrestrial pairs and a self-supervised branch designed to preserve the natural hues of real-world underwater scenes. Building on this, SDUIE-Text introduces intuitive, language-guided control by aligning semantic prompts with visual enhancement effects, leveraging the learned fusion weights. This dual-modality design offers both precise control and flexible, user-preferred enhancement. Experimental results demonstrate that SDUIE achieves state-of-the-art results while better preserving the aesthetic qualities often missed by conventional methods. The source code will be made publicly available.
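
Continuous control through low-rank adaptation (LoRA) weight merging can be sketched as a weighted sum of low-rank updates folded into a base weight matrix. The Python example below is a rough illustration only: the linear trade-off between a supervised-branch adapter and a self-supervised-branch adapter is an assumption, and `merge_lora` and `merge_for_level` are hypothetical helpers, not the paper's formulation.

```python
import numpy as np

def merge_lora(w_base, adapters, weights):
    """Fold weighted low-rank updates B @ A into a copy of the base weight."""
    merged = w_base.copy()
    for (b, a), weight in zip(adapters, weights):
        merged += weight * (b @ a)
    return merged

def merge_for_level(w_base, adapter_supervised, adapter_self, level):
    """Blend two adapters with a scalar enhancement level."""
    # Assumption: `level` in [0, 1] linearly trades one branch for the other.
    return merge_lora(
        w_base, [adapter_supervised, adapter_self], [level, 1.0 - level]
    )

rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 64))
adapter_supervised = (rng.normal(size=(64, 4)), rng.normal(size=(4, 64)))
adapter_self = (rng.normal(size=(64, 4)), rng.normal(size=(4, 64)))
w_mild = merge_for_level(w_base, adapter_supervised, adapter_self, level=0.25)
```

Because the merge is a plain weighted sum, the level can be varied continuously at inference time without retraining in this sketch.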

Poster

HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation

Daichao Zhao ⋅ Qiupu Chen ⋅ Feng He ⋅ Xin Ning ⋅ Qiankun Li

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 118

Lane detection is a crucial task in autonomous driving, which is conducive to ensuring the safe operation of vehicles. However, current datasets like CULane and TuSimple have relatively limited data under extreme weather conditions, such as rain, snow and fog, which makes detection models unreliable in extreme conditions, potentially leading to serious safety-critical failures on the road. In this direction, we propose \textbf{\textit{HG-Lane}}, a \textbf{H}igh-fidelity \textbf{G}eneration framework for \textbf{Lane} Scenes under adverse weather and lighting conditions, without the need for re-annotation and training. Based on our framework, we further propose a benchmark that includes adverse weather and lighting conditions, with 30,000 images. Experimental results demonstrate that our method consistently and significantly improves the detection performance of all the related lane detection networks. Taking the state-of-the-art CLRNet as an example, the overall mF1 on our benchmark increases by 20.87%. The F1@50 for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75%, 8.63%, 38.8%, 14.96%, 26.84%, 21.5%, and 12.04%, respectively. Code and dataset are included in the supplementary materials.

Poster

MDS-VQA: Model-Informed Data Selection for Video Quality Assessment

Jian Zou ⋅ Xiaoyu Xu ⋅ Zhihua Wang ⋅ Yilin Wang ⋅ Balu Adsumilli ⋅ Kede Ma

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 119

Recent advances in learning-based video quality assessment (VQA) have achieved remarkable progress, yet the two fundamental components, model and data, are often studied in isolation. Model-centric approaches tend to design superior architectures over fixed and repeatedly used datasets, risking overfitting to benchmark-specific characteristics. In contrast, data-centric efforts emphasize constructing large-scale datasets through costly and time-consuming subjective experiments, typically overlooking the strengths and failure modes of existing VQA models. This separation limits progress, leading to brittle generalization and inefficient use of annotation resources. To bridge the gap, we introduce MDS-VQA, a model-informed data selection method that integrates model-centric and data-centric VQA. In its specific instantiation, a learned failure prediction module trained via a learning-to-rank formulation is combined with a content diversity measure based on deep semantic video features. Experiments across multiple VQA datasets demonstrate that MDS-VQA effectively spots diverse and challenging samples that expose model weaknesses. The selected videos are proven to be particularly informative for fine-tuning, offering a principled path toward constructing more challenging datasets and developing more generalizable and robust VQA models.

Poster

Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events

Yunshan Qi ⋅ Lin Zhu ⋅ Nan Bao ⋅ Yifan Zhao ⋅ Jia Li

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 120

Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We utilize NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above NeRF-rendered HDR pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results from single-exposure blurry LDR images and corresponding events.

Poster

Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters

Dongxin Xie ⋅ Yan Huang ⋅ Yong Xu ⋅ Hui Ji

Jun 7, 11:45 AM - 1:45 PM ExHall F 121

Atmospheric turbulence severely degrades long-range images with distortions and blur, hindering downstream applications. While supervised methods rely on synthetic data with limited real-world generalization, existing unsupervised approaches often ignore the underlying physics, leading to suboptimal restoration. We propose TMFS, an optimization-based and physically-grounded approach for unsupervised turbulence mitigation. The method operates by optimizing an imaging model with frame-shared degradation parameters under physically-motivated regularization. Inspired by sampling procedures in physical simulators, the degradation parameters are further decomposed into a frame-shared correlation function and per-frame noise maps. TMFS gains a strong inductive bias that improves generalization and mitigates overfitting. In extensive experiments, TMFS achieves state-of-the-art results among unsupervised methods. In contrast, supervised methods show a significant domain gap on real data, thereby validating the advantage of our physics-aware, unsupervised approach.

Poster

MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Peiqing Yang ⋅ Shangchen Zhou ⋅ Kai Hao ⋅ Qingyi Tao

Jun 7, 3:30 PM - 5:30 PM ExHall A 121

Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Quality Evaluator (QE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The QE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.

Poster

Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model

Mostofa Uddin Uddin ⋅ HM Shadman Tabib ⋅ Thanh-Huy Nguyen ⋅ Kashish Gandhi ⋅ Min Xu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 122

We introduce an unsupervised approach for segmenting multiscale subcellular objects in 3D volumetric cryo-electron tomography (cryo-ET) images. To this end, we address key challenges such as lack of annotated data, large data volumes, high heterogeneity of subcellular shapes and sizes, and high inter-domain variability of cellular cryo-ET images across different experiments and contexts. Our method requires users to only select a small number of slabs from a few representative tomograms in the dataset. The core of our method is extracting features for the corresponding slabs, leveraging a Stable Diffusion foundation model pretrained on mostly natural images. The feature extraction is followed by a novel heuristic-based feature aggregation strategy, and adaptive thresholding to segment the aggregated features. The resulting masks are refined with pretrained CellPose to split composite regions, and then utilized as pseudo-ground truth for training supervised deep learning models. We validated our unsupervised foundation-model based pipeline on publicly available cryo-ET benchmark datasets, demonstrating performance that closely approximates expert human annotations. This fully automated, data-driven framework enables the mining of multi-scale subcellular patterns, paving the way for accelerated biological discoveries from large-scale cellular cryo-ET datasets.

Poster

LF-BVN: Blind-View Network for Self-Supervised Light Field Denoising

Longzhao Guo ⋅ shuo zhang ⋅ Chen Gao ⋅ Qian Tian ⋅ Youfang Lin

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 125

Recent advances in learning-based Light Field (LF) image denoising have achieved impressive results. However, these methods rely heavily on large-scale noisy-clean image pairs and often fail to generalize to unseen or complex noise. In this work, we observe that the inherent multi-view consistency of LF images makes it highly unlikely for noise to be coherent across views, offering a more reliable supervisory signal for self-supervised denoising. Building on this insight, we extend the blind-spot principle to the LF domain and propose a novel LF Blind-View denoising Network (LF-BVN). We first introduce a geometric invariance mask that leverages angular redundancy for efficient full-view supervision. To enforce cross-view photometric consistency, we further introduce latent representation volumes and enforce consistency between them. Additionally, we exploit focus stacks to extract latent depth cues from noisy observations, providing further guidance. Extensive experiments show that LF-BVN achieves competitive denoising performance while maintaining strong cross-view consistency without requiring clean data or external supervision.

Poster

ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization

Bingchen Li ⋅ Zhixin Wang ⋅ Fan Li ⋅ Jiaqi Xu ⋅ Jiaming Guo ⋅ Renjing Pei ⋅ Xin Li ⋅ Zhibo Chen

Jun 6, 11:45 AM - 1:45 PM ExHall F 124

Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.

Poster

Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning

Yikai Huang ⋅ Renmin Han ⋅ Yuxuan Wang ⋅ Youcheng Cai ⋅ Ligang Liu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 124

Segment Anything Model (SAM)-based approaches have demonstrated remarkable potential for biomedical image segmentation. However, these methods often struggle to maintain spatial consistency in 3D electron microscopy (3D-EM) data and require extensive manual annotations. To this end, we propose Spatial-SAM, a spatially consistent and annotation-efficient framework that achieves high precision on 3D-EM data. Our method introduces two key innovations. First, we incorporate a 3D Signed Distance Field (SDF) memory mechanism that replaces the original memory in SAM2 with SDF representations precomputed by a 3D U-Net, providing richer geometric information and improving spatial consistency. Second, by combining the few-shot capability of SAM2 with a dual-track pseudo-label iterative optimization strategy, Spatial-SAM efficiently learns to segment large-scale 3D-EM datasets from minimal annotations. Experiments show that Spatial-SAM significantly outperforms existing semi-supervised methods and achieves performance comparable to state-of-the-art fully supervised approaches on multiple 3D-EM benchmarks, reducing annotation costs while preserving spatial consistency. The code will be publicly released upon acceptance.

Poster

CROWn: A Unified Framework for Anti‑Aliased Downsampling and Phase‑Calibrated Fusion in 3D Medical Segmentation

Xingru Huang ⋅ Shuanghua Ye ⋅ Zhao Huang ⋅ Wenwen Tang ⋅ Huiyu Zhou ⋅ Zhiwen Zheng ⋅ Jin Liu ⋅ Xiaoshuai Zhang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 125

Precise 3D medical image segmentation is a clinical cornerstone for diagnosis, therapy planning, and longitudinal monitoring. However, routine acquisition with anisotropic voxel spacing and heterogeneous reconstruction induces downsampling aliasing and cross-scale misalignment that blur boundaries, fragment topology, and undermine reliability. Existing U-shaped CNN or Transformer designs neither control alias injection at decimation nor explicitly align high-resolution evidence before decoder fusion, leading to unstable interfaces under device and protocol variability. We introduce the Coset-fibRated micrO-local co-attention Network (CROWn), a general segmentation framework that couples sampling theory with representation learning to jointly suppress aliasing and calibrate cross-scale fusion. CROWn comprises two complementary components. The Microlocal Polyphase Co-Attentive Decimator ($\mu$PCAD) performs axis-aware polyphase analysis with pooled–subband co-attention and explicit anti-alias low-pass, routing boundary-relevant high-frequency evidence while attenuating spurious phase components during downsampling. The Octaphase Coset Fibration (OCF) anti-aliases high-resolution skips, restructures them via 3D space-to-depth into cosets, and applies phase attention with edge-gated modulation to deliver compact, phase-aligned, boundary-aware features to the decoder. Extensive evaluations across 15 publicly available datasets spanning CT, MRI, and OCT demonstrate that CROWn achieves state-of-the-art performance against 17 recent leading methods, improving overlap and topological consistency and consistently reducing boundary errors while maintaining controlled training and inference cost. The code is publicly available.
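
The 3D space-to-depth restructuring into cosets can be shown with a short Python sketch. It assumes a stride of 2 along each axis (giving eight phase cosets, consistent with the "octaphase" naming) and a single-channel volume; the anti-aliasing filter and phase attention are not modeled, and `space_to_depth_3d` is a hypothetical name.

```python
import numpy as np

def space_to_depth_3d(volume):
    """Split a (D, H, W) volume with even sizes into its 8 stride-2 cosets."""
    d, h, w = volume.shape
    if d % 2 or h % 2 or w % 2:
        raise ValueError("all spatial sizes must be even")
    # Expose the phase (0 or 1) of every axis as its own dimension.
    blocks = volume.reshape(d // 2, 2, h // 2, 2, w // 2, 2)
    # Move the three phase axes to the front and flatten them into 8 cosets.
    return blocks.transpose(1, 3, 5, 0, 2, 4).reshape(8, d // 2, h // 2, w // 2)

volume = np.arange(4 * 6 * 8, dtype=float).reshape(4, 6, 8)
cosets = space_to_depth_3d(volume)
```

Coset index `4*a + 2*b + c` holds the samples `volume[a::2, b::2, c::2]`, so no voxel is discarded: resolution is traded for eight phase channels.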

Poster

Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models

Karim Kadry ⋅ Abdalla Abdelwahed ⋅ Ajay Manicka ⋅ Naravich Chutisilp ⋅ Farhad R. Nezami ⋅ Elazer R Edelman

Jun 6, 11:45 AM - 1:45 PM ExHall F 126

We present an inference-time guidance framework for generating 3D multi-class anatomical voxel maps with localized geometric and topological control. During generation, we use cuboidal control domains of varying dimensionality, location, and shape to slice out relevant substructures. These local substructures are used to compute differentiable penalty functions that steer the sample towards target constraints. We penalize geometric features such as size, shape, position, and orientation through voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Lastly, we implement this guidance framework for latent diffusion models, where a neural field decoder can partially extract substructures, enabling efficient measurement and control of anatomical properties. This formulation unlocks a rich design space, where several constraints can be composed to control complex structures defined over arbitrary dimensions and coordinate systems. We show that Anatomica flexibly applies to a variety of anatomical systems, enabling the rational design of synthetic datasets for virtual simulation trials or machine learning workflows.
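
Measuring geometric features through voxel-wise moments can be sketched as follows. This Python example computes the zeroth, first, and second moments (size, position, and shape/orientation) of a soft occupancy map and a simple position penalty; it is an illustration under assumptions, with hypothetical function names, and uses NumPy rather than an autodiff framework, although every operation shown is a sum or product that such a framework could differentiate.

```python
import numpy as np

def voxel_moments(occupancy):
    """Return mass, centroid, and covariance of a soft (D, H, W) occupancy map."""
    axes = [np.arange(n) for n in occupancy.shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).astype(float)
    mass = occupancy.sum()                                    # size
    centroid = (occupancy[..., None] * grid).sum(axis=(0, 1, 2)) / mass
    centered = grid - centroid
    # Second central moment: its eigenvectors give orientation, eigenvalues extent.
    cov = np.einsum("dhw,dhwi,dhwj->ij", occupancy, centered, centered) / mass
    return mass, centroid, cov

def position_penalty(occupancy, target_centroid):
    """Squared distance between the measured and target centroid."""
    _, centroid, _ = voxel_moments(occupancy)
    return float(((centroid - np.asarray(target_centroid, dtype=float)) ** 2).sum())

occupancy = np.zeros((8, 8, 8))
occupancy[2:6, 2:6, 2:6] = 1.0          # a 4x4x4 cube as a toy substructure
mass, centroid, cov = voxel_moments(occupancy)
```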

Poster

TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning

Yunbi Liu ⋅ Enqi Tang ⋅ Shiyu Li ⋅ hui shuai ⋅ Lei Ma ⋅ Juncheng Li ⋅ Kuai Yu ⋅ Shu Lou ⋅ Yongchu Pan ⋅ Qingshan Liu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 126

Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients' quality of life. Current deep learning approaches predominantly predict transformation matrices for the misaligned tooth point cloud via point-to-point geometric constraints to achieve tooth alignment. Nevertheless, these matrices are likely to exhibit clinical-specific distributions, which deterministic constraints fail to capture. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is assisted by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of transformation matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. We validate our method on a challenge dataset from clinical practice and an extra orthodontic dataset. Ablation studies and comparative analyses confirm its efficacy, highlighting its potential for application in orthodontic treatment.

Poster

AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation

Xiya Shen ⋅ Qinglin Zhao ⋅ Li Feng

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 131

Prototype or region-attention modules have recently improved medical image segmentation but still suffer from two fundamental limitations: 1) they represent each semantic concept as a point or isotropic region, failing to capture the inherently anisotropic geometry of real feature distributions; and 2) many rely on non-differentiable clustering or one-way kernel weighting, which restricts their ability to form coherent region-level representations. We address these issues with the Anisotropic Differentiable Granular-Ball (AD-GBC) module, which generalizes prototypes into learnable geometric regions parameterized by a center and an anisotropic vector scale. AD-GBC aggregates local features into region-level semantics and redistributes the refined representation back to pixels in a fully differentiable manner, enabling geometry-aware refinement within modern UNet-style architectures. Two geometric regularizers, a Wasserstein-based diversity loss and a radius–dispersion consistency loss, prevent center collapse and encourage stable, well-formed region geometry. AD-GBC yields consistent improvements across four widely used medical segmentation benchmarks (BUSI, GlaS, CVC-ClinicDB, ISIC17) when integrated into two strong backbones (Rolling-UNet and U-KAN), demonstrating that the proposed geometric region formulation generalizes well across different imaging conditions.

Poster

OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement

Rui Wang ⋅ Huisi Wu ⋅ Jing Qin

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 132

Accurate segmentation of cardiac chambers in echocardiography videos is essential for quantitative cardiac assessment. However, ultrasound noise, artifacts, and cardiac motion pose significant challenges to robust spatiotemporal modeling. Recent approaches such as Transformers, linear attention, and state-space models improve accuracy, yet Transformers often remain computationally expensive, whereas linear attention and state-space models typically lack geometric regularization, leading to unstable spatiotemporal interactions under complex cardiac motion. We introduce OSA, a lightweight linear sequence architecture designed for stable and efficient cardiac video segmentation. OSA incorporates an Anatomical Prior-aware Feature Enhancement (APFE) module that decouples and fuses complementary anatomical components to strengthen boundary–region discrimination. Orthogonalized State Update (OSU) enforces spectral-norm and orthogonality constraints during recurrent transitions, preserving spatiotemporal coherence. Evaluated on the CAMUS and EchoNet-Dynamic datasets, OSA consistently outperforms state-of-the-art methods in segmentation accuracy and temporal consistency, while maintaining real-time inference efficiency. This framework offers a principled and efficient solution for dynamic cardiac analysis in echocardiography. The code will be released upon publication.
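
Why an orthogonality constraint stabilizes recurrent transitions can be seen in a small sketch. It assumes a generic linear recurrence and uses Gram-Schmidt orthonormalization as a stand-in; the abstract does not say how OSU enforces its constraints, so the helper names and the toy matrix below are purely illustrative.

```python
import math


def gram_schmidt(columns):
    """Orthonormalize a list of linearly independent column vectors."""
    basis = []
    for column in columns:
        residual = list(column)
        for unit in basis:
            # Remove the component of the residual along each earlier unit vector.
            projection = sum(r * u for r, u in zip(residual, unit))
            residual = [r - projection * u for r, u in zip(residual, unit)]
        length = math.sqrt(sum(r * r for r in residual))
        if length < 1e-12:
            raise ValueError("columns are linearly dependent")
        basis.append([r / length for r in residual])
    return basis


def apply(columns, state):
    """Multiply the matrix with the given columns by a state vector."""
    size = len(state)
    return [sum(columns[j][i] * state[j] for j in range(size)) for i in range(size)]


def norm(vector):
    return math.sqrt(sum(v * v for v in vector))


# Columns of an unconstrained toy transition matrix (hypothetical values).
raw = [[2.0, 0.5, 0.0], [0.3, 1.5, 0.2], [0.0, 0.4, 3.0]]
orthogonal = gram_schmidt(raw)

state_raw = state_orth = [1.0, 1.0, 1.0]
for _ in range(20):
    state_raw = apply(raw, state_raw)
    state_orth = apply(orthogonal, state_orth)
```

After 20 steps the unconstrained state has grown by many orders of magnitude, while the orthogonal transition keeps the state norm at its initial value, which is the property that keeps long recurrences from exploding or vanishing.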

Poster

Diffusion-Based Native Adversarial Synthesis for Enhanced Medical Segmentation Generalization

Hongyu Zhang ⋅ Haipeng Chen ⋅ Zhimin Xu ⋅ Chengxin Yang ⋅ Yingda Lyu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 135

Diffusion models (DMs) demonstrate strong capabilities in generating anatomically realistic medical images, enabling promising avenues for improving model generalization via synthetic augmentation. However, bridging the gap between generative prowess (realism) and measurable improvements in downstream generalization (utility) remains a key challenge. This work unifies theory and practice to tackle two central questions: (1) What to synthesize? We identify synthetic adversariality—the expected empirical loss induced by synthetic data—as a key driver of generalization. Crucially, only native adversariality (i.e., hard examples drawn from the DM's distribution) yields consistent improvements, while artificial adversariality from attack-style perturbations degrades performance. (2) How to synthesize? We introduce the Adversariality Miner, a lightweight, plug-and-play module that efficiently selects initial noise to elicit native adversarial samples, without modifying or retraining the DM. Extensive experiments across diverse diffusion backbones and medical benchmarks confirm the effectiveness of our approach, establishing a principled path toward diffusion-driven generalization.

Poster

When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

Hui Lu ⋅ Yi Yu ⋅ Yiming Yang ⋅ Chenyu Yi ⋅ Qixin Zhang ⋅ Bingquan Shen ⋅ Alex C. Kot ⋅ Xudong Jiang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 134

Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of **universal, transferable adversarial patches** against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce **UPA-RFAS** (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.

Poster

Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong ⋅ Yi Ge ⋅ Ming Li ⋅ Zuolong Zhang ⋅ Pranav Kulkarni ⋅ Kaishen Wang ⋅ Qi He ⋅ Zeying Zhu ⋅ Chenxi Liu ⋅ Ruibo Chen ⋅ Tong Zheng ⋅ Yanshuo Chen ⋅ Xiyao Wang ⋅ Ray Zhang ⋅ Wenhu Chen ⋅ Heng Huang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 137

Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.

Poster

MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction

Shuo Tang ⋅ Jian Xu ⋅ Jiadong Zhang ⋅ yi chen ⋅ Qizhao Jin ⋅ Lingdong Shen ⋅ Chenglin Liu ⋅ Shiming Xiang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 138

Timely and accurate forecasts of severe weather events are essential for early warning and for constraining downstream analysis and decision-making. Since severe weather event prediction still depends on subjective, time-consuming expert interpretation, end-to-end “AI weather station” systems are emerging but face three major challenges: (1) scarcity of severe weather event samples; (2) imperfect alignment between high-dimensional meteorological data and textual warnings; and (3) the inability of current multimodal language models to effectively process high-dimensional meteorological inputs or capture their complex spatiotemporal dependencies. To address these challenges, we introduce MP-Bench, the first large-scale multimodal dataset for severe weather event prediction, comprising 421,363 pairs of raw multi-year meteorological data and corresponding text captions, covering a wide range of severe weather scenarios. On top of this dataset, we develop a Meteorology Multimodal Large Model (MMLM) that directly ingests 4D meteorological inputs. In addition, it is designed to accommodate the unique characteristics of 4D meteorological data flow, incorporating three plug-and-play adaptive fusion modules that enable dynamic feature extraction and integration across temporal sequences, vertical pressure layers, and spatial dimensions. Extensive experiments on MP-Bench show that MMLM achieves strong performance across multiple tasks, demonstrating effective severe weather understanding and representing a key step toward automated, AI-driven severe weather event forecasting systems. Our source code and dataset will be made publicly available.

Poster

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Zhiheng Liu ⋅ Weiming Ren ⋅ Haozhe Liu ⋅ Zijian Zhou ⋅ Shoufa Chen ⋅ Haonan Qiu ⋅ Xiaoke Huang ⋅ Zhaochong An ⋅ Fanny Yang ⋅ Aditya Patel ⋅ Viktar Atliha ⋅ Tony Ng ⋅ Xiao Han ⋅ Chuyan Zhu ⋅ Chenyang Zhang ⋅ Ding Liu ⋅ Juan-Manuel Pérez-Rúa ⋅ Sen He ⋅ Jürgen Schmidhuber ⋅ Wenhu Chen ⋅ Ping Luo ⋅ Wei Liu ⋅ Tao Xiang ⋅ Jonas Schult ⋅ Yuren Cong

Jun 6, 11:45 AM - 1:45 PM ExHall F 140

Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.

Poster

How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?

Arda Senocak ⋅ Sooyoung Park ⋅ Tae-Hyun Oh ⋅ Joon Chung

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 140

We present the first scalable framework for training sound source localization (SSL) models using synthetic data from text-to-X models. Although SSL has made notable progress, existing models remain constrained by limited-scale, uncurated real-world datasets that often suffer from semantic misalignment. Furthermore, the introduction of new SSL tasks and benchmarks has increased the need for more generalizable models. To address these challenges, we leverage synthetic data to create synthetic clones of the VGGSound dataset, enabling both fully synthetic and hybrid real–synthetic training. We demonstrate that synthetic data can effectively replace, refine, and scale real training datasets. Extensive experiments across multiple benchmarks show that synthetic data not only matches real data in performance but also enables significant improvements when combined with real samples. Our findings provide the first systematic evidence that synthetic data can serve as a scalable and effective approach for advancing SSL models.

Poster

Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast

Fan Yang ⋅ Yuanzhi Zhao ⋅ Haimei Zhao ⋅ Yudong Zhao ⋅ Haikun Xu

Jun 7, 11:45 AM - 1:45 PM ExHall F 143

In unsupervised cross-modal hashing, real-world data often exhibit partial alignment and semantic mismatch: dominant modalities tend to overrule fusion, fine-grained complementary cues are overlooked, and mini-batch “negative samples” are contaminated by semantically related items, yielding frequent false negatives. Treating all pairs equally in contrastive learning thus makes training noise-prone and ill-suited to partially aligned data. To address these problems, we present Unsupervised Weighted Masked Contrastive Hashing (UWMCH), which has two core components: (i) random masked fusion deliberately suppresses part of the modality evidence during feature interaction, forcing the model to learn complementary semantics under diverse “partial interactions,” avoiding reliance on a single modality and explicitly exposing hard cases; (ii) pairwise weighting no longer treats masked and unmasked pairs as equivalent but adaptively assigns a weight to each cross-modal pair by combining instance-level semantic consistency with a K-means induced cluster-consensus prior, injecting the weight into the contrastive objective to suppress suspected false negatives and amplify more informative masked positives. To stabilize the global structure, we further introduce two constraints: Cluster-Centroid Agreement (CCA) forms global semantic anchors at the prototype level in synergy with UWMCH; Semantic Structure Regularization (SSR) builds higher-order semantic structure and aligns it with cross-modal similarity, maintaining intra-modal compactness and inter-modal separability under masking. Extensive benchmark experiments show that UWMCH achieves better retrieval accuracy and convergence stability across multiple datasets. The code will be released.
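
The effect of injecting per-pair weights into a contrastive objective can be sketched for a single anchor. This is a generic weighted InfoNCE, assuming the weights are already given; it does not reproduce how UWMCH derives them from semantic consistency and the K-means prior, and the function name and toy similarities are hypothetical.

```python
import math


def weighted_info_nce(similarities, positive_index, weights, temperature=0.1):
    """Contrastive loss for one anchor with a per-pair weight on each term.

    A weight near 0 removes a suspected false negative from the denominator;
    a weight of 1 for every pair recovers the standard InfoNCE loss.
    """
    scaled = [s / temperature for s in similarities]
    shift = max(scaled)  # subtract the maximum for numerical stability
    terms = [w * math.exp(s - shift) for w, s in zip(weights, scaled)]
    positive = terms[positive_index]
    if positive <= 0.0:
        raise ValueError("the positive pair needs a positive weight")
    return -math.log(positive / sum(terms))


# Index 0 is the true pair; index 1 is a semantically related "negative".
sims = [0.9, 0.8, 0.1]
uniform = weighted_info_nce(sims, 0, [1.0, 1.0, 1.0])
down_weighted = weighted_info_nce(sims, 0, [1.0, 0.2, 1.0])
```

Down-weighting the related item lowers the loss for this anchor, so the model is no longer pushed as hard to separate two samples that probably share semantics.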

Poster

Reliable Clustering Number Estimation for Contrastive Multi-View Clustering

Zhengzhong Zhu ⋅ Pei Zhou ⋅ Lanxi Bai ⋅ Li Cheng ⋅ Jia Nie ⋅ Shiquan min ⋅ Jiangping Zhu

Jun 7, 11:45 AM - 1:45 PM ExHall F 144

In recent years, contrastive multi-view clustering has achieved remarkable performance improvements. However, existing methods still face two key challenges: (1) reliance on a predefined number of clusters k, which is often unknown in real-world scenarios; and (2) contrastive learning might cause representation degeneration when the collected multiple views inherently have inconsistent semantic information. To address these issues, we propose a novel framework, Reliable Clustering Number Estimation for Contrastive Multi-View Clustering (RCNMC). RCNMC consists of a Semantics-Aware Contrastive Learning module and a Reinforcement Learning-based Cluster Number Learning module. Specifically, the Semantics-Aware Contrastive Learning module first measures the discrepancy between pairwise representations and adaptively strengthens useful pairwise views while weakening unreliable ones, thereby alleviating representation degeneration. The Reinforcement Learning-based Cluster Number Learning module infers the optimal number of clusters in an unsupervised manner by using intra-cluster and inter-cluster distances as a reward-driven strategy. The two modules complement each other, making RCNMC more suitable for complex multi-view clustering tasks in real-world scenarios. Extensive experiments on multiple benchmark datasets demonstrate that RCNMC significantly outperforms existing state-of-the-art methods.

Poster

OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

tengjin Weng ⋅ Wenhao Jiang ⋅ Jingyi Wang ⋅ Ming Li ⋅ Lin Ma ⋅ Zhong Ming

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 146

Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision–language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model’s fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. All resources will be publicly released upon acceptance.
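
One way a distance-aware reward could look is sketched below. The abstract does not give the reward's actual form, so the linear decay, the Chebyshev distance, and the function name are assumptions chosen only to show the idea of giving partial credit to near misses on a grid.

```python
def distance_aware_reward(predicted, target, rows, cols):
    """Reward 1.0 for the exact cell, decaying linearly with Chebyshev distance.

    Cells are (row, col) pairs; predictions outside the grid get no reward.
    """
    pred_row, pred_col = predicted
    if not (0 <= pred_row < rows and 0 <= pred_col < cols):
        return 0.0
    distance = max(abs(pred_row - target[0]), abs(pred_col - target[1]))
    max_distance = max(rows, cols) - 1
    if max_distance == 0:
        return 1.0  # a 1 x 1 grid has only one possible answer
    return 1.0 - distance / max_distance


# The odd element sits at the centre of a 5 x 5 grid.
exact = distance_aware_reward((2, 2), (2, 2), rows=5, cols=5)
neighbour = distance_aware_reward((2, 3), (2, 2), rows=5, cols=5)
corner = distance_aware_reward((0, 0), (2, 2), rows=5, cols=5)
```

Compared with a 0/1 reward, a graded signal like this still distinguishes a prediction one cell away from one on the opposite side of the grid, which gives the policy something to learn from before it is exactly right.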

Poster

EgoAVU: Egocentric Audio-Visual Understanding

Ashish Seth ⋅ Xinhao Mei ⋅ Changsheng Zhao ⋅ Varun Nagaraja ⋅ Ernie Chang ⋅ Gregory P. Meyer ⋅ Gael Le Lan ⋅ Yunyang Xiong ⋅ Vikas Chandra ⋅ Yangyang Shi ⋅ Dinesh Manocha ⋅ zhipeng cai

Jun 6, 11:45 AM - 1:45 PM ExHall F 146

Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split of 3K samples. EgoAVU-Bench clearly reveals the limitation of existing MLLMs: they are heavily biased towards visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively solves this issue, enabling up to 113% performance improvement on EgoAVU-Bench. This benefit also transfers to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.

Poster

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Shan Ning ⋅ Longtian Qiu ⋅ Jiaxuan Sun ⋅ Xuming He

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 148

Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16\% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER.

Poster

Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis

Xianbing Zhao ⋅ Lan Luo ⋅ Hengyang Lu ⋅ Buzhou Tang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 147

Multimodal Sentiment Analysis (MSA) aims to integrate textual, acoustic, and visual information to predict sentiment polarity. With the emergence of Large Language Models (LLMs), existing studies commonly employ learnable queries to compress audio–visual representations and feed them as soft prompts into LLMs for MSA. However, because these queries are learned implicitly, they lack explicit guidance regarding how each query encodes sentiment semantics. To address this issue, we propose a prototype-as-prompt (PaP) framework that maps audio–visual representations into a fixed set of multimodal sentiment prototypes. These prototypes are then used as soft prompts to guide the LLM in performing MSA. Concretely, we first compress both textual and non-textual features into multimodal prototypes using a resampling-based strategy. We further introduce sentiment-aware prototype learning that explicitly binds multimodal prototypes with sentiment semantics. To ensure both cross-modal consistency and intra-modal diversity of multimodal sentiment prototypes, we design a cross-modal prototype alignment constraint and a distance-weighted prototype diversity constraint. Extensive experiments across three LLMs and four benchmark datasets show that PaP achieves superior performance with only 0.09\%–0.26\% of trainable parameters, highlighting its effectiveness and parameter efficiency.

Poster

Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering

Yalan Qin ⋅ Hanzhou Wu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 150

Large-scale multi-view clustering aims to explore the complementary and consistent information among different views in an efficient manner. Despite the impressive performance of existing methods, they perform anchor learning only in a single space, with orthogonal or other constraints from the multi-view data, leading to undesired anchors. Anchors can occur in several spaces simultaneously, and the complementary information among these spaces can be exploited for learning anchors. Meanwhile, most existing works neglect the space whose basis is the anchored cluster center when learning anchors. In this work, we propose Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering (DOSFMVC). DOSFMVC conducts anchor learning in a dual orthogonal space, aiming to utilize the complementary information between the two spaces to produce high-quality anchors. DOSFMVC introduces the consensus anchored cluster center as the basis of the extra space, and the clustering indicator of anchors based on this basis, into anchor learning. Anchor learning and partition are integrated into a unified model, where the final cluster assignment can be adopted for clustering results. Extensive experiments confirm the superiority of our method compared with some state-of-the-art methods on several benchmark datasets.

Poster

EXOTIC: External Vision-driven Incomplete Multi-view Classification

Shilin Xu ⋅ Dezhong Peng ⋅ Zhenwen Ren ⋅ Yuan Sun

Jun 7, 11:45 AM - 1:45 PM ExHall F 149

Due to sensor failures and occlusions during data acquisition, multi-view data often suffer from partially missing samples, thereby producing incomplete multi-view data. Recently, Incomplete Multi-View Classification (IMVC) has become a hot research topic, and numerous IMVC methods have been proposed. Although these methods have achieved promising performance by exploiting internal semantic information from partially observed data, they primarily rely on limited internal supervision for view completion, which largely constrains their performance ceiling. To overcome this limitation, we propose an EXternal visiOn-driven incomplete mulTi-vIew Classification (EXOTIC) paradigm that incorporates external vision knowledge as semantic guidance, thereby assisting in imputing incomplete views. To the best of our knowledge, it is the first work that leverages external vision knowledge as supervision signals, thereby guiding missing-view completion. Specifically, we first introduce an external vision knowledge library based on a pre-trained vision–language model. Then, we design a Knowledge Filtering module to adaptively select task-relevant knowledge. Afterwards, we present a Knowledge Purification module to align external knowledge with internal representations. Finally, we propose External Completion that leverages the refined knowledge to impute missing views, thereby enhancing the classification decision ability. Extensive experiments on multiple incomplete multi-view datasets demonstrate that the proposed EXOTIC consistently outperforms existing methods, especially under high missing rates.

Poster

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Qihao Liu ⋅ Chengzhi Mao ⋅ Yaojie Liu ⋅ Alan L. Yuille ⋅ Wen-Sheng Chu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 152

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce **AuditDM**, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.

Poster

Cross-View Distillation and Adaptive Masking for Incomplete Multi-View Multi-Label Classification

Yadong Liu ⋅ Qiaoqi Li ⋅ Yueying Wang ⋅ Lunke Fei ⋅ Jie Wen

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 151

While existing incomplete multi-view multi-label learning methods have achieved promising performance, few studies have focused on the issue of multi-view imbalance. Existing methods using gradient modulation or alternating optimization strategies alleviate this problem but often oversimplify the interaction between views, resulting in persistently suboptimal performance. In response to this challenge, we propose the Cross-view Distillation and Adaptive Masking (CDAM) framework, a novel approach designed to achieve balanced multi-view optimization for the challenging double incomplete multi-view multi-label learning tasks. First, to overcome the performance bottleneck of views, we design a cross-view distillation module. This module aligns low-quality student representations with high-quality teacher representations, thereby effectively mitigating the multi-view imbalance problem. Second, recognizing that distillation may not rectify all low-quality views, we introduce a subsequent adaptive masking module to perform an explicit quality assessment. This module dynamically identifies and masks out any remaining unreliable representations before multi-view fusion, thus preventing low-quality information from corrupting the fused representation. Extensive comparisons with nine state-of-the-art methods on six datasets validate the effectiveness and stability of our method.

Poster

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

yusheng dai ⋅ Zehua Chen ⋅ Yuxuan Jiang ⋅ Qiuhong Ke ⋅ Jianfei Cai ⋅ Jun Zhu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 154

Training a unified model for the generation of video-to-audio (V2A), text-to-audio (T2A) and joint video-text-to-audio (VT2A) offers significant flexibility, but is hindered by critical and unexplored challenges. We identify two foundational problems: (1) the scarcity of high-quality audio captions that feature a tight A-V-T alignment, leading to severe semantic conflict in multimodal training data, and (2) cross-task and intra-task competition during joint multi-task training, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce **SoundAtlas**, the first large-scale, human-expert-level audio caption dataset, augmenting VGGSound and AudioSet with semantically rich and temporally detailed captions. Powered by a novel, multi-turn agentic annotation pipeline (using advanced foundation models) that operates cost-effectively, SoundAtlas features a tight A-V-T alignment and a much lower hallucination rate than existing datasets. Second, we propose **Omni2Sound**, a diffusion-based unified VT2A model that supports flexible modality combinations. To address cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct **VGGSound-Omni**, a comprehensive benchmark for unified evaluation of VT2A, V2A and T2A, including challenging off-screen tracks. As a result, with a vanilla DiT backbone, Omni2Sound achieves unified state-of-the-art performance in all three tasks within a single model. It also demonstrates strong generalization across multiple benchmarks with different caption and video styles. Demonstrations are provided in the Appendix.

View full details

Poster

The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis ⋅ Michaela Areti Zervou ⋅ Konstantinos Kontras ⋅ Maarten De Vos ⋅ Panagiotis Tsakalides ⋅ Grigorios Tsagkatakis

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 154

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
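
The XOR case mentioned in the abstract can be checked on a toy example. The sketch below is illustrative only and is not part of ConFu: it treats two binary variables and their XOR as three stand-in "modalities" and measures mutual information on the four equally likely outcomes. The function name `mutual_information` and the toy data are assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Mutual information (in bits) between the two coordinates of `pairs`,
    treating the list as an empirical joint distribution."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


# Toy data: two binary "modalities" and a third that is their XOR.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

mi_x1_y = mutual_information([(x1, y) for x1, _, y in samples])
mi_x2_y = mutual_information([(x2, y) for _, x2, y in samples])
mi_joint_y = mutual_information([((x1, x2), y) for x1, x2, y in samples])

print(mi_x1_y, mi_x2_y, mi_joint_y)
```

Each single variable carries zero bits about the XOR target, while the pair carries one full bit. This is the kind of dependency that a purely pairwise objective has no signal to learn from and that a fused two-to-one term can capture.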

View full details

Poster

Text-Driven 3D Hand Motion Generation from Sign Language Data

Léore Bensabath ⋅ Mathis Petrovich ⋅ Gul Varol

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 155

Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, and finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels at unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model (HandMDM) that is robust across domains, including unseen sign categories from the same sign language, signs from another sign language, and non-sign hand movements. We contribute an extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.

View full details

Poster

Is the Modality Gap a Bug or a Feature? A Robustness Perspective

Rhea Chowers ⋅ Oshri Naparstek ⋅ Udi Barzelay ⋅ Yair Weiss

Jun 7, 11:45 AM - 1:45 PM ExHall F 156

Many modern multi-modal models (e.g., CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists or whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss will lead to a representation in which the two modalities are separated by a global gap vector that is orthogonal to the embeddings of both modalities. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when small, semantically inconsequential changes are made to the input. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss in clean accuracy.
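
A minimal sketch of such a mean-shift post-processing step is given below, assuming the shift is a plain translation along the difference of the two modality means. The helper names (`shift_towards`, `gap_norm`), the `alpha` parameter, and the toy embeddings are hypothetical; whether the embeddings should be re-normalized afterwards is not stated in the abstract and is left out here.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def shift_towards(source, target, alpha=1.0):
    """Translate every `source` embedding by alpha * (mean(target) - mean(source)).

    alpha=0 leaves the embeddings unchanged; alpha=1 makes the two means coincide.
    No re-normalisation is applied (an assumption of this sketch).
    """
    src_mean, tgt_mean = mean_vector(source), mean_vector(target)
    gap = [t - s for s, t in zip(src_mean, tgt_mean)]
    return [[x + alpha * g for x, g in zip(row, gap)] for row in source]


def gap_norm(a, b):
    """Euclidean distance between the two modality means."""
    return sum((x - y) ** 2 for x, y in zip(mean_vector(a), mean_vector(b))) ** 0.5


# Made-up 3-D embeddings, separated along the last axis only.
image_emb = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.6, 0.8, 0.5]]
text_emb = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5], [0.6, 0.8, -0.5]]

shifted_text = shift_towards(text_emb, image_emb, alpha=1.0)
print(gap_norm(image_emb, text_emb), gap_norm(image_emb, shifted_text))
```

Because the shift is a single translation applied to every text embedding, distances between text embeddings are unchanged; only the offset between the two modalities moves.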

View full details

Poster

SAMTok: Representing Any Mask with Two Words

yikang zhou ⋅ Tao Zhang ⋅ Dengxian Gong ⋅ Yuanzheng Wu ⋅ Ye Tian ⋅ Haochen Wang ⋅ Haobo Yuan ⋅ Jiacong Wang ⋅ Lu Qi ⋅ Hao Fei ⋅ Shunping Ji ⋅ Anran Wang ⋅ Zhuochen Wang ⋅ Yujing Wang ⋅ Cheng CHEN ⋅ Xiangtai Li

Jun 7, 3:30 PM - 5:30 PM ExHall A 156

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two textual special tokens and reconstructs masks from these tokens with high fidelity. By treating masks as a new language, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a simple and scalable paradigm for equipping MLLMs with strong pixel-wise capabilities. Code and models will be available.
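
To illustrate how a residual vector quantizer can turn one continuous vector into two discrete tokens, here is a generic two-stage sketch. It is not SAMTok's implementation: the codebooks, their sizes, and the 2-D `latent` standing in for a mask-encoder output are all made up for the example.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to `vec` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantise `vec` stage by stage; each stage encodes the previous residual."""
    indices, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical 2-D codebooks: a coarse stage and a fine stage.
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]

latent = [0.9, 0.3]  # stand-in for a mask-encoder output
tokens = rvq_encode(latent, [coarse, fine])
recon = rvq_decode(tokens, [coarse, fine])
print(tokens, recon)
```

The two indices in `tokens` play the role of the "two words": the first picks a coarse code, the second corrects the leftover error, and decoding simply adds the two selected entries back together.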

View full details

Poster

GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection

Xuan Huang ⋅ Mochu Xiang ⋅ Zhelun Shen ⋅ Jinbo Wu ⋅ Chenming Wu ⋅ Chen Zhao ⋅ Kaisiyuan Wang ⋅ Hang Zhou ⋅ Shanshan Liu ⋅ Haocheng Feng ⋅ Wei He ⋅ Jingdong Wang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 157

Hand–Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing competitors.

View full details

Poster

Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling

Xingyu Liu ⋅ Pengfei Ren ⋅ Qi Qi ⋅ Haifeng Sun ⋅ Zirui Zhuang ⋅ Jianxin Liao ⋅ Jingyu Wang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 158

Understanding hand-object interaction from monocular videos is crucial for immersive and dexterous interactions in AR/VR and robotic applications. However, existing monocular reconstruction methods primarily assume rigid grasping and static object geometry. When applied to articulated manipulations, the continuous joint rotations and frequent component deformations introduce a strong coupling between shape and motion, leading to severe ambiguity and instability in articulation optimization under monocular observation. To address this challenge, we propose a Clay-to-Stone dual-phase framework, modeling the articulated manipulation at hierarchical granularities, enabling a progression from flexible semantic exploration to structured articulation recovery. In the CLAY phase, our method performs fine-grained control over geometric deformation, guided by inter-part semantic correlation learning. As semantic and motion priors emerge, the STONE phase enforces rigid constraints to consolidate articulated structures and explicitly estimates motion parameters. Experiments on a real-world manipulation dataset show that our method achieves state-of-the-art reconstruction quality and plausible articulation modeling from monocular videos.

View full details

Poster

Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Tim Engelbracht ⋅ René Zurbrügg ⋅ Matteo Wohlrapp ⋅ Martin Büchner ⋅ Abhinav Valada ⋅ Marc Pollefeys ⋅ Hermann Blum ⋅ Zuria Bauer

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 159

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments - (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper - where the tool embodiments provide synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints and to investigate underexplored modalities such as force sensing and prediction.

View full details

Poster

TouchDream: 3D Object Completion through Imagined Touch

Yuanbo Wang ⋅ Xinning Wang ⋅ Zhaoxuan Zhang ⋅ Changlong Wang ⋅ qianchen xia ⋅ Xiaopeng Wei ⋅ Xin Yang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 161

Point cloud completion is crucial for robust 3D perception but remains challenging due to its ill-posed nature. Coarse-to-fine methods can lead to unconstrained local guesses in the absence of key structures, whereas diffusion-based approaches may introduce geometric inconsistencies. To overcome these limitations, we present TouchDream, a novel framework that leverages a diffusion model to 'dream' of tactile sensing on object surfaces, which reformulates the sensing process as a learnable generative modeling task. Unlike visual cues, tactile data provides rich local geometry that can be directly converted into 3D space for point fusion, offering a powerful guide for detail-aware completion. Specifically, our approach generates compact tactile latent representations conditioned on coarse points and sampled touch poses. A touch-guided refinement module then leverages touch features to optimize coarse points. Extensive experiments show that our TouchDream model achieves state-of-the-art performance, significantly enhancing the recovery of local details.

View full details

Poster

ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

Zikai Wang ⋅ Zhilu Zhang ⋅ Yiqing Wang ⋅ Hui Li ⋅ Wangmeng Zuo

Jun 6, 11:45 AM - 1:45 PM ExHall F 164

Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical implausibility of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, which uses contact reasoning information as constraints for hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions. The code and datasets will be made publicly available.

View full details

Poster

VideoCoF: Unified Video Editing with Temporal Reasoner

xiangpeng yang ⋅ Ji Xie ⋅ Yiyuan Yang ⋅ Yue Ma ⋅ Yan Huang ⋅ Min Xu ⋅ Qiang Wu

Jun 7, 3:30 PM - 5:30 PM ExHall A 164

Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "seeing, reasoning, then editing" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach.

View full details

Poster

Scalable Trajectory Generation for Whole-Body Mobile Manipulation

Yida Niu ⋅ Xinhai Chang ⋅ Xin Liu ⋅ Ziyuan Jiao ⋅ Yixin Zhu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 166

Mobile robots need coordinated whole-body motion to perform household tasks effectively. Current mobile manipulation datasets rely on expensive teleoperation or slow planning methods, limiting available data to hundreds of demonstrations. This data scarcity severely constrains the development of generalizable learning-based policies. Here, we demonstrate that GPU-accelerated planning generates up to 5,000 episodes per GPU hour, over $80\times$ faster than existing methods. Our AutoMoMa pipeline produces 500K diverse physically valid whole-body motions across 300 household scenes and multiple robot embodiments, compared to previous datasets limited to narrow robot-scene pairs with a few hundred demonstrations. Downstream validation demonstrates consistent policy improvements with large-scale training data. This work provides the first scalable solution to the mobile manipulation data bottleneck. By enabling massive dataset generation, AutoMoMa accelerates progress toward general-purpose household robots capable of complex coordination tasks.

View full details

Poster

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Qingyan Bai ⋅ Qiuyu Wang ⋅ Hao Ouyang ⋅ Yue Yu ⋅ Hanlin Wang ⋅ Wen Wang ⋅ Ka Leong Cheng ⋅ Shuailei Ma ⋅ Yanhong Zeng ⋅ Zichen Liu ⋅ Yinghao Xu ⋅ Yujun Shen ⋅ Qifeng Chen

Jun 7, 3:30 PM - 5:30 PM ExHall A 167

Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing. We will release our dataset and models for reproducibility.

View full details

Poster

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

Xinyu Zhang ⋅ Ziyi Kou ⋅ Chuan Qin ⋅ Mia Huang ⋅ Ergys Ristani ⋅ Ankit Kumar ⋅ Lele Chen ⋅ Kun He ⋅ Abdeslam Boularias ⋅ Li Guan

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 169

Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information, such as contact forces and motion dynamics, and are prone to frequent occlusions. To address these challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove data in HOI videos into photorealistic bare-hand representations, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures both temporal and multi-view rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we introduce HandSense, the first multi-modal HOI dataset featuring multi-view bare-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

View full details

Poster

Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

Tae Eun Choi ⋅ Sumin Shim ⋅ Junhyeok Kim ⋅ Seong Jae Hwang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 168

Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle, producing inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, the task requires additional guidance from the keyframes and text to specify the intended path. Thus, we inject semantic and temporal guidance from the keyframes and text into each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.

View full details

Poster

Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation

Yuanpeng Tu ⋅ Yunpeng Chen ⋅ Xinyu Zhang ⋅ Chao Liao ⋅ Hengshuang Zhao

Jun 6, 11:45 AM - 1:45 PM ExHall F 170

MeanFlow is a powerful few-step generative framework that can be trained from scratch, but its performance degrades significantly when the one-step loss uses a large portion of training data. This stems from a temporal scale imbalance: gradients from different stages of generation contribute unevenly, leading to unstable optimization that shows up as blurry samples and high FID scores. The core issue is a conflict between two opposing forces: terms that amplify variance over long time spans and strong constraints needed near the start of generation, which a fixed sampling strategy cannot reconcile. To resolve this, we propose Temporal Equilibrium MeanFlow (TEMF), which balances these competing demands through two simple yet effective components: (1) a temporal equilibrium weighting function that equalizes gradient influence across all time scales, and (2) a dynamic boundary scheduler that gradually shifts training focus from stabilizing early steps to refining the full trajectory as training progresses. Without changing the model architecture, TEMF retains true one-step generation with classifier-free guidance and achieves a state-of-the-art FID of 2.62 on ImageNet 256×256, the best result among diffusion- and flow-based one-step methods.

View full details

Poster

DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO

Henglin Liu ⋅ Huijuan Huang ⋅ Jing Wang ⋅ Chang Liu ⋅ Xiu Li ⋅ Xiangyang Ji

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 172

Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs that lack creativity and visual diversity, which restricts the model's application scenarios. This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality–diversity trade-off. Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves a $13\%\sim18\%$ improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.
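
The interaction between group-relative advantages and a size-dependent bonus can be sketched as follows. The normalization in `group_relative_advantages` is the usual GRPO-style mean and standard deviation within a group. The bonus form `beta * (1 - cluster_size / group_size)` is an assumption made for this example, not the paper's formula, and the cluster ids are given by hand rather than produced by spectral clustering.

```python
from collections import Counter
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalisation within one prompt's sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def add_diversity_bonus(rewards, cluster_ids, beta=0.5):
    """Add a bonus that grows as a sample's cluster gets smaller.

    The form beta * (1 - cluster_size / group_size) is an assumption of this
    sketch, not the paper's formula.
    """
    sizes = Counter(cluster_ids)
    total = len(cluster_ids)
    return [r + beta * (1 - sizes[c] / total) for r, c in zip(rewards, cluster_ids)]


# Six samples for one caption: equal quality scores except one weaker sample.
# The cluster ids stand in for the output of a clustering step.
quality = [1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
clusters = [0, 0, 0, 0, 1, 2]

plain = group_relative_advantages(quality)
shaped = group_relative_advantages(add_diversity_bonus(quality, clusters))
print(plain)
print(shaped)
```

With quality alone, the sample in the singleton cluster (index 4) gets the same advantage as the four near-duplicates in cluster 0. With the bonus, it is ranked above them, which is the direction of pressure the abstract describes.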

View full details

Poster

Improved Mean Flows: On the Challenges of Fastforward Generative Models

ZHENGYANG GENG ⋅ Yiyang Lu ⋅ Zongze Wu ⋅ Eli Shechtman ⋅ Zico Kolter ⋅ Kaiming He

Jun 7, 11:45 AM - 1:45 PM ExHall F 172

MeanFlow provides a principled framework for fastforward generative modeling. However, the original MeanFlow has key limitations in both the training objective and the guidance. First, the original MeanFlow prediction depends not only on the noisy state but also explicitly on the noise and data, causing the training target to drift with the network. We reformulate it as velocity prediction, predicting the instantaneous velocity solely from the noisy state and reducing training to a standard regression problem. Second, on the guidance side, the original MeanFlow fixes the guidance scale during training by directly learning a guided field, achieving 1-NFE sampling but losing the flexibility to adjust the guidance at inference. Instead, we condition the model on the guidance scale and train it on a range of guidance scales, enabling flexible guidance at inference, as in diffusion/flow models, while preserving one-step sampling. On ImageNet 256$\times$256, our improved MeanFlow (iMF) achieves a 1-step FID of 2.74 with a model of 118M parameters, and our largest model further pushes the 1-step FID to 1.72, establishing a new state of the art for one-step generative modeling.

View full details

Poster

Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner

Haotian Dong ⋅ Wenjing Wang ⋅ Chen Li ⋅ Jing LYU ⋅ Di Lin

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 174

Generating RGB-A videos, which include alpha channels for transparency, has wide applications. However, current methods often suffer from low quality due to confusion between RGB and alpha. In this paper, we address this problem by learning shiftable RGB-A distributions. We adjust both the latent space and noise space, shifting the alpha distribution outward while preserving the RGB distribution, thereby enabling stable transparency generation without compromising RGB quality. Specifically, for the latent space, we propose a transparency-aware bidirectional diffusion loss during VAE training, which shifts the RGB-A distribution according to likelihood. For the noise space, we propose shifting the mean of diffusion noise sampling and applying a Gaussian ellipse mask to provide transparency guidance and controllability. Additionally, we construct a high-quality RGB-A video dataset. Compared to state-of-the-art methods, our model excels in visual quality, naturalness, transparency rendering, inference convenience, and controllability.

View full details

Poster

MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

Xiaoyu Han ⋅ Chenyang Wang ⋅ Jing Wang ⋅ Shunyuan Zheng ⋅ Quanling Meng ⋅ Shengping Zhang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 175

Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To explore more fashion possibilities with fine-grained adaptations in virtual try-on, we propose a novel virtual try-on method, termed MOFA-VTON, which lets users adjust clothing adaptations in try-on results through simple sketches. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.

View full details

Poster

Inference-time Physics Alignment of Video Generative Models with Latent World Models

Jianhao Yuan ⋅ Zhang Xiaofeng ⋅ Felix Friedrich ⋅ Nicolas Beltran-Velez ⋅ Melissa Hall ⋅ Reyhane Askari ⋅ Xiaochuang Han ⋅ Nicolas Ballas ⋅ Michal Drozdzal ⋅ Adriana Romero-Soriano

Jun 6, 11:45 AM - 1:45 PM ExHall F 175

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling test-time compute scaling for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, on the challenging PhysicsIQ benchmark we achieve a final score of 62.00%, outperforming the previous state of the art by 6.78%. Our work demonstrates the viability of using latent world models to improve the physical plausibility of video generation, beyond this specific instantiation or parameterization.
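
The simplest form of reward-guided search over candidates is best-of-N selection, sketched below. This covers only the "search" half of the idea (no steering of intermediate denoising steps), and everything in it is a toy stand-in: `toy_generate` replaces a video generator with a noisy list of ball heights, and `toy_reward` replaces the world-model reward with a hand-written check for constant acceleration.

```python
import random


def best_of_n(generate, reward, n, seed=0):
    """Draw n candidates and return the one with the highest reward."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy stand-ins: a "video" is a list of ball heights, and the "reward" penalises
# deviation from constant downward acceleration (second difference of -1).
def toy_generate(rng):
    return [10 - 0.5 * t * t + rng.gauss(0, 0.5) for t in range(5)]


def toy_reward(heights):
    second_diffs = [
        heights[i + 1] - 2 * heights[i] + heights[i - 1]
        for i in range(1, len(heights) - 1)
    ]
    return -sum((d + 1.0) ** 2 for d in second_diffs)


_, score_1 = best_of_n(toy_generate, toy_reward, n=1)
_, score_16 = best_of_n(toy_generate, toy_reward, n=16)
print(score_1, score_16)
```

Because both calls use the same seed, the 16 candidates include the single candidate from the first call, so the selected score can only stay the same or improve as `n` grows. That monotonicity is what makes extra inference-time compute useful in this setting.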

View full details

Poster

STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows

Jiatao Gu ⋅ Ying Shen ⋅ Tianrong Chen ⋅ Laurent Dinh ⋅ Yuyang Wang ⋅ Miguel Ángel Bautista ⋅ David Berthelot ⋅ Joshua Susskind ⋅ Shuangfei Zhai

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 178

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models.
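
The general idea behind Jacobi-style decoding, replacing a strictly sequential recurrence with repeated sweeps that update all positions at once, can be shown on a toy recurrence. This is a generic fixed-point sketch, not STARFlow-V's video-aware scheme; `toy_step` and the function names are hypothetical.

```python
def sequential_decode(step, length):
    """Reference: x[t] = step(x[:t]) computed one position at a time."""
    xs = []
    for _ in range(length):
        xs.append(step(xs))
    return xs


def jacobi_decode(step, length, init=0.0, max_iters=None):
    """Fixed-point iteration: refresh every position from the previous guess.

    Each sweep only reads the old guess, so the `length` updates inside one
    sweep are independent and could run in parallel. Position t is correct
    after at most t + 1 sweeps, so `length` sweeps always suffice.
    """
    guess = [init] * length
    for sweep in range(max_iters or length):
        new = [step(guess[:t]) for t in range(length)]
        if new == guess:
            return new, sweep + 1
        guess = new
    return guess, max_iters or length


# Toy causal update: each value depends only on the most recent one.
def toy_step(prefix):
    return 1.0 if not prefix else 0.5 * prefix[-1] + 1.0


reference = sequential_decode(toy_step, 8)
parallel, sweeps = jacobi_decode(toy_step, 8)
print(reference == parallel, sweeps)
```

The result matches sequential decoding exactly, and causality is preserved because position `t` only ever reads positions before it. This toy needs the worst-case number of sweeps; any practical gain depends on the iteration converging in far fewer sweeps than the sequence length, which is where a domain-specific scheme would have to do the work.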

Poster

Not All Birds Look The Same: Identity-Preserving Generation For Birds

Aaron Sun ⋅ Oindrila Saha ⋅ Subhransu Maji

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 183

Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data (especially videos or multi-view observations of the same subject), making them difficult both to evaluate and to improve upon. Yet such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex (used as a proxy for identity) substantially improves performance on both seen and unseen species.

Poster

LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

yushi Huang ⋅ Xingtong Ge ⋅ RUIHAO GONG ⋅ Chengtao Lv ⋅ Jun Zhang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 183

Video diffusion models (DMs) have enabled high-quality video synthesis, but their computation costs scale quadratically with sequence length due to the nature of self-attention. While linear attention offers a more efficient alternative, fully replacing quadratic attention demands costly pretraining. This is largely because linear attention lacks sufficient expressiveness and struggles with the complex spatiotemporal dynamics inherent to video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose a selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and even inefficiency of existing objectives in optimizing this challenging transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is highly efficient and recovers model performance. Extensive experiments show that LinVideo achieves a $\mathbf{1.43\text{-}1.71\times}$ speedup while preserving generation quality, and the 4-step distilled models further reduce latency by $\mathbf{15.9\text{-}20.9\times}$ with only a minor drop in visual quality.
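
To make the quadratic-versus-linear distinction concrete, the sketch below computes one generic form of kernelized attention in two ways: by visiting every query-key pair, and by summarizing the keys and values once and reusing that summary for each query. It is an assumption-laden illustration, not LinVideo's method: the feature map `phi` (elu + 1) is one common choice, attention is non-causal, and all function names are hypothetical.

```python
import math

def phi(v):
    """Positive feature map elu(x) + 1, a common choice for linear attention."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    """Kernelized attention the O(n^2) way: every query visits every key."""
    out = []
    for q in Q:
        fq = phi(q)
        weights = [dot(fq, phi(k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same result in O(n): accumulate key/value statistics once, reuse per query."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # S = sum over keys of phi(k) v^T
    z = [0.0] * d                        # z = sum over keys of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for j in range(dv):
                S[a][j] += fk[a] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[a] * S[a][j] for a in range(d)) / denom
                    for j in range(dv)])
    return out

Q = [[0.1, -0.5], [1.2, 0.3], [-0.7, 0.9]]
K = [[0.4, 0.2], [-1.0, 0.6], [0.8, -0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

The two functions agree up to floating-point rounding because the kernel factorizes as a dot product of feature maps. Softmax attention does not factorize this way, which is one reason a direct swap changes model behavior.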

Poster

EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

YANG FU ⋅ Yike Zheng ⋅ Ziyun Dai ⋅ Henghui Ding

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 185

Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce **VOR** (**V**ideo **O**bject **R**emoval), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60k high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose ***EffectErase***, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. In addition, an insertion–removal consistency objective encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

Poster

Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization

Xingyue Lin ⋅ Shuai Peng ⋅ Xiangyu Xie ⋅ Jianhua Zhu ⋅ Yuxuan Zhou ⋅ Liangcai Gao

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 185

Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent complex real-world images, often producing fragmented shapes at the cost of semantic conciseness. In this paper, we propose COVec, an illumination-aware vectorization framework inspired by the Clair-Obscur principle of light–shade contrast. COVec is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers in a unified vector representation. A semantic-guided initialization and two-stage optimization refine these layers with differentiable rendering. Experiments on various datasets demonstrate that COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.

Poster

RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution

Ali Mosleh ⋅ Faraz Ali ⋅ Fengjia Zhang ⋅ Stavros Tsogkas ⋅ Junyong Lee ⋅ Michael S. Brown ⋅ Alex Levinshtein

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 186

Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via "unprocessing" pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image RAW-to-RGB SR model and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations. We will make our calibrated kernels and noise models publicly available, to facilitate research on image enhancement for mobile photography.

Poster

VENI: Variational Encoder for Natural Illumination

Paul Walker ⋅ James A. D. Gardner ⋅ Andreea Ardelean ⋅ William A. P. Smith ⋅ Bernhard Egger

Jun 6, 11:45 AM - 1:45 PM ExHall F 187

Inverse rendering is an ill-posed problem, but priors, such as illumination priors, can simplify it. Existing work either disregards the spherical and rotation-equivariant nature of illumination environments or does not provide a well-behaved latent space. We propose a rotation-equivariant variational autoencoder that models natural illumination on the sphere without relying on 2D projections. To preserve the SO(2)-equivariance of environment maps, we use a novel Vector Neuron Vision Transformer (VN-ViT) as encoder and a rotation-equivariant conditional neural field as decoder. In the encoder, we reduce the equivariance from SO(3) to SO(2) using a novel SO(2)-equivariant fully connected layer, an extension of Vector Neurons. We show that our SO(2)-equivariant fully connected layer outperforms standard Vector Neurons when used in our SO(2)-equivariant model. Compared to previous methods, our variational autoencoder enables smoother interpolation in latent space and offers a better-behaved latent space.

Poster

HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution

Chao Yang ⋅ Boqian Zhang ⋅ Jinghao Xu ⋅ Guang Jiang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 189

Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. The code will be available after acceptance.
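
The claim that wavelet invertibility gives low-loss reconstruction can be checked on the simplest case, a single-level 2D Haar transform. The sketch below is a generic illustration under stated assumptions, not the HDW-SR network: the function names are hypothetical, the input must have even height and width, and the subband naming follows one of several conventions in use.

```python
def haar2d(image):
    """Single-level orthonormal 2D Haar transform of a list-of-lists image.
    Returns four half-resolution subbands: low-pass plus three detail bands."""
    h, w = len(image), len(image[0])
    ll = [[0.0] * (w // 2) for _ in range(h // 2)]
    lh = [[0.0] * (w // 2) for _ in range(h // 2)]
    hl = [[0.0] * (w // 2) for _ in range(h // 2)]
    hh = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(h // 2):
        for j in range(w // 2):
            a, b = image[2 * i][2 * j], image[2 * i][2 * j + 1]
            c, d = image[2 * i + 1][2 * j], image[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # local average (low frequency)
            lh[i][j] = (a - b + c - d) / 2  # left-right difference
            hl[i][j] = (a + b - c - d) / 2  # top-bottom difference
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d: rebuilds each 2x2 block from its four coefficients."""
    h, w = len(ll), len(ll[0])
    image = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            image[2 * i][2 * j] = (s + x + y + z) / 2
            image[2 * i][2 * j + 1] = (s - x + y - z) / 2
            image[2 * i + 1][2 * j] = (s + x - y - z) / 2
            image[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return image

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
subbands = haar2d(image)
restored = ihaar2d(*subbands)
```

Unlike strided convolution or pooling, this downsampling discards nothing: the four subbands together hold exactly as many values as the input, and the detail bands isolate the high-frequency content that the abstract uses for guidance.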

Poster

Bidirectional Normalizing Flow: From Data to Noise and Back

Yiyang Lu ⋅ Qiao Sun ⋅ Xianbang Wang ⋅ Zhicheng Jiang ⋅ Hanhong Zhao ⋅ Kaiming He

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 191

Normalizing Flows (NFs) are a principled framework for generative modeling, consisting of a forward process and a reverse process. The forward process maps data to a simple prior distribution, while the reverse process generates samples by inverting this mapping. Traditional approaches focus on designing expressive forward transformations under the strict requirement of explicit invertibility, so that the reverse process can serve as their exact analytic inverse. Recent advances such as TARFlow enhance the forward model with Transformers and autoregressive structures, achieving state-of-the-art generation quality, but at the expense of slow sampling due to autoregressive decoding. In this work, we introduce Bidirectional Normalizing Flow ($\textbf{BiFlow}$), a new framework that removes the need for an exact analytic inverse by learning a flexible, data-driven reverse model to $\textbf{approximate}$ the inverse mapping. This relaxation enables richer architectures and loss formulations while preserving the probabilistic foundation of NFs. BiFlow performs direct, single-forward (1-NFE) generation, eliminating autoregressive bottlenecks and achieving up to two orders of magnitude faster sampling with improved generation quality. We hope this work encourages rethinking Normalizing Flows as direct, flexible, and efficient generative models.
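
For reference, the exact likelihood that an invertible forward map provides is the standard change-of-variables identity. The symbols here are introduced for illustration and are not taken from the paper: $f$ is the forward map from data $x$ to latent $z$, $p_Z$ is the prior, and $g$ is a learned reverse model.

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|, \qquad x = f^{-1}(z) \ \text{with} \ z \sim p_Z \ \text{for sampling.}$$

Read this way, the relaxation described above amounts to sampling with $g(z) \approx f^{-1}(z)$ rather than the analytic inverse, while the likelihood term still uses $f$.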

Poster

Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching

Guangxun Zhang ⋅ Mason Haberle ⋅ Davi Geiger

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 191

The Mean Flow Matching algorithm is the state-of-the-art for one-step generative models. Building on this idea, we propose the Stable Mean Flow algorithm and introduce a Lyapunov-inspired stability regularizer that enforces local non-expansivity of the single-step transport map. This design guarantees uniqueness of characteristics and bounds trajectory drift. We conduct experiments that show improved output quality and convergence speed over Mean Flow. Moreover, we establish explicit upper bounds on error growth for both one-step and multi-step generation.

Poster

CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation

Bohao Li ⋅ Zhicheng Cao ⋅ Huixian Li ⋅ Yangming Guo

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 191

State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model's reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5% AP, demonstrating superior robustness and data efficiency.

Poster

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Sirui Xu ⋅ Samuel Schulter ⋅ Morteza Ziyadi ⋅ Xialin He ⋅ Xiaohan Fei ⋅ Yu-Xiong Wang ⋅ Liang-Yan Gui

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 194

Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified control policy, i.e., an interaction motion prior, through large-scale imitation pretraining followed by reinforcement learning post-training. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multi-modal and partially specified goal cues. A targeted diversity process, combining data augmentation and physical perturbations, broadens exposure to varied contact and object conditions, producing a motion prior that generalizes beyond the training data. To address the vast configuration space of large-scale human-object interaction, reinforcement learning finetuning improves competence on unseen goals and enables recovery from unsuccessful grasps. The resulting policy acts as a reusable motion prior that can absorb new behaviors, including interactions with unseen objects. We also show its effectiveness in user-interactive control and across different embodiments.

Poster

Thermal Diffusion Matters: Infrared Spatial-Temporal Video Super-Resolution through Heat Conduction Priors

Mingxuan Zhou ⋅ Shuang Li ⋅ Yutang Zhang ⋅ Jing Geng ⋅ Yirui Shen ⋅ Jingxuan Kang ⋅ Fuzhen Zhuang ⋅ Shuigen Wang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 198

Infrared video acquisition inherently suffers from low spatial resolution and limited frame rates due to the physical constraints of thermal imaging sensors. These limitations make infrared video enhancement uniquely challenging, as it requires restoring spatial details and temporal continuity from highly undersampled thermal signals. To address this challenge, we propose `THERIS`, a unified **THER**mal-physics inspired framework for **I**nfrared spatial-temporal video **S**uper-resolution. Grounded in the physical principles of thermal diffusion, `THERIS` leverages heat conduction dynamics that govern the spatiotemporal evolution of infrared pixel intensities. Specifically, the proposed Thermal Diffusion Interpolation Module (TDIM) treats temporal feature sequences as one-dimensional heat fields and performs frequency-domain diffusion to synthesize temporally coherent intermediate frames. Building on this foundation, the Thermo-Aware State Space Module (TSSM) refines spatiotemporal representations through learnable spectral filtering and selective state-space modeling, while maintaining consistency guided by the thermodynamic prior inherited from TDIM. Additionally, a Temperature Field Modeling Loss is introduced to enforce adherence to the heat conduction equation, promoting temporal coherence and spatial stability in the generated results. Extensive experiments demonstrate that `THERIS` achieves state-of-the-art performance while producing visually coherent results. To facilitate further research in the infrared video processing domain, we also introduce **IRVAL**, a high-resolution dataset comprising 108,512 video frames at 512$\times$512 resolution.
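
The notion of frequency-domain diffusion of a one-dimensional heat field can be illustrated with the textbook spectral solution of the heat equation $u_t = \alpha u_{xx}$ on a periodic grid, where each Fourier mode with angular frequency $\omega$ decays by $e^{-\alpha \omega^2 t}$. The sketch below is a generic illustration under stated assumptions, not the TDIM module: the function names, the unit grid spacing, and the periodic boundary are all choices made here.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a short demo signal)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def heat_diffuse(signal, alpha, t):
    """Evolve a periodic 1D heat field for time t by damping each Fourier mode."""
    n = len(signal)
    damped = []
    for k, coeff in enumerate(dft(signal)):
        freq = k if k <= n // 2 else k - n   # signed frequency index
        omega = 2 * math.pi * freq / n       # angular frequency on a unit-spaced grid
        damped.append(coeff * math.exp(-alpha * omega ** 2 * t))
    # The input is real and the damping is symmetric in frequency, so the
    # imaginary parts are rounding noise and can be dropped.
    return [value.real for value in idft(damped)]

spike = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
smoothed = heat_diffuse(spike, alpha=1.0, t=0.5)
```

The zero-frequency mode is untouched, so total heat is conserved while sharp features spread out. Intermediate states at times between two observations follow from the same formula, which is the intuition behind using heat dynamics for temporally smooth interpolation.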

Poster

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

Sicheng Xu ⋅ Yu Deng ⋅ Shoukang Hu ⋅ Yichuan Wang ⋅ Yizhong Zhang ⋅ Zhan Chen ⋅ Jiaolong Yang ⋅ Baining Guo

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 197

Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an auto-regressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.

Poster

Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset

Yang Zou ⋅ Jun Ma ⋅ Zhidong Jiao ⋅ Xingyuan Li ⋅ Zhiying Jiang ⋅ Jinyuan Liu

Jun 6, 11:45 AM - 1:45 PM ExHall F 198

Infrared Image Super-Resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments on real and synthetic datasets demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking.

Poster

Unified Number-Free Text-to-Motion Generation Via Flow Matching

Guanhe Huang ⋅ Oya Celiktutan

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 199

Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize to a variable number of agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby reducing computational overhead. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF’s effectiveness as a generalist model for multi-person motion generation from text. We will release the code.

Poster

CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

Fengyi Fang ⋅ Sicheng Yang ⋅ Wenming Yang

Jun 7, 11:45 AM - 1:45 PM ExHall F 199

Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker is the first to explore gesture understanding and captioning as a way to tackle the semantic gap in gesture generation, while offering a novel perspective on bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches. Code and demo video are in the supplementary material and will be released upon paper acceptance.

Poster

Generative Diffusion Priors for 3D Mapping of the Dark Universe

Brandon Zhao ⋅ Diana Scognamiglio ⋅ Olivier Doré ⋅ Katherine L. Bouman

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 200

Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simulations to build a new dataset \texttt{Conicus3D}, which enables us to learn a data-driven diffusion-model prior capturing the full 3D distribution of dark matter structure across cosmic time. Building on recent plug-and-play approaches, we adapt a diffusion-based posterior sampling scheme to the 3D weak-lensing setting, combining the learned prior with a differentiable physical forward model. On realistic simulations targeting a modern weak-lensing survey, our approach yields substantially improved 2D and 3D reconstruction accuracy over baseline methods. Moreover, it produces posterior samples whose statistics closely track the underlying simulations, while remaining robust to moderate shifts in cosmology.

Poster

IntrinsicWeather: Controllable Weather Editing in Intrinsic Space

Yixin Zhu ⋅ Zuo-Liang Zhu ⋅ Jian Yang ⋅ Milos Hasan ⋅ Jin Xie ⋅ Beibei Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 200

We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.

Poster

FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation

yuchen zou ⋅ Huikai Shao ⋅ Lihuang Fang ⋅ Zhipeng Xiong ⋅ Dexing Zhong

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 201

Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks. Notably, FlowPalm achieves a higher TAR at FAR=1e-4 than the best generative model does at FAR=1e-3.
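
For readers unfamiliar with the metric in the last sentence, TAR at a fixed FAR is the fraction of genuine comparisons accepted when the decision threshold is set so that at most the stated fraction of impostor comparisons is accepted. The sketch below shows one common way to compute it from similarity scores; the function name, the strict-inequality acceptance rule, and the toy scores are assumptions made here, not details from the paper.

```python
import math

def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at a target false accept rate.
    A comparison is accepted when its similarity score is strictly above the threshold."""
    ranked = sorted(impostor_scores, reverse=True)
    # Number of impostor comparisons we may accept without exceeding the target FAR.
    # The small epsilon guards against float products such as 0.29 * 100 < 29.
    allowed = math.floor(far * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        threshold = -math.inf
    else:
        # Everything strictly above the (allowed + 1)-th highest impostor score passes.
        threshold = ranked[allowed]
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores)

impostor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
genuine = [0.85, 0.92, 0.97, 0.99]
loose = tar_at_far(genuine, impostor, far=0.1)
strict = tar_at_far(genuine, impostor, far=0.0)
```

Lowering the target FAR raises the threshold, so TAR can only stay the same or drop. That is why a higher TAR at FAR=1e-4 than a competitor reaches at FAR=1e-3 is a strong result: the comparison is made under a tenfold stricter operating point.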

Poster

Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

Binxu Wang ⋅ Jingxuan Fan ⋅ Xu Pan

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 203

Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.

Poster

Geometric Neural Distance Fields for Learning Human Motion Priors

Zhengdi Yu ⋅ Simone Foti ⋅ Linguang Zhang ⋅ Amy Zhao ⋅ Cem Keskin ⋅ Stefanos Zafeiriou ⋅ Tolga Birdal

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 206

We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to “roll out” realistic motion trajectories during test-time-optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.

View full details

Poster

End-to-End Language-Action Model for Humanoid Whole Body Control

Yuxuan Wang ⋅ Haobin Jiang ⋅ Shiqing Yao ⋅ Ziluo Ding ⋅ Zongqing Lu

Jun 7, 3:30 PM - 5:30 PM ExHall A 207

Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language–action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole body controller, combined with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into texts.

View full details

Poster

GeoRK2: Geometry-Guided Runge–Kutta Integration for Diffusion Transformer Acceleration

Chaoqun Sun ⋅ Zongjing Fu ⋅ Powei Chang ⋅ Jinpeng Zhang ⋅ JianXiang Xiang ⋅ Yukang Gao ⋅ Chenyu Wang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 208

Diffusion transformer models deliver state-of-the-art image synthesis quality but suffer from prohibitively slow iterative sampling. Fewer sampling steps accelerate inference but inevitably distort intermediate features and degrade visual fidelity, while offering little relief in computational cost. To address these limitations, we present GeoRK2, a training-free framework that bridges numerical analysis and information geometry. GeoRK2 couples second-order Runge–Kutta (RK2) integration with a curvature-aware geometric flow derived from the model's noise predictions, establishing provably stable feature evolution dynamics under manifold-aware integration. By leveraging an empirical feature covariance–induced metric estimated from gradient covariances to capture intrinsic feature geometry and applying parallel transport along the manifold connection, GeoRK2 constrains error propagation under large-step integration, ensuring both numerical stability and structural fidelity. As a fully plug-and-play method, GeoRK2 requires no retraining and is compatible with mainstream pretrained diffusion transformers. Comprehensive experiments on image generation and super-resolution tasks across representative diffusion backbones (e.g., DiT-XL, HunyuanVideo, and FLUX.1-dev) demonstrate that GeoRK2 achieves 4–5× faster inference than baseline frameworks (FORA, TaylorSeer) with only marginal perceptual differences (∆FID ≈ 0.81), confirming its effectiveness and generality. All implementation details and code are provided in the supplementary material.
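
For orientation, the generic second-order Runge–Kutta (midpoint) step that the abstract builds on can be sketched as follows. This is only the textbook integrator applied to a toy ODE, not GeoRK2 itself: the curvature-aware flow, the covariance-induced metric, and the parallel transport described above are all omitted, and the names `rk2_step`, `euler_step`, `integrate`, and `decay` are hypothetical.

```python
import math

def rk2_step(velocity, x, t, dt):
    """One midpoint-rule (RK2) step for dx/dt = velocity(x, t)."""
    k1 = velocity(x, t)                    # slope at the start of the step
    x_mid = x + 0.5 * dt * k1              # provisional half step
    k2 = velocity(x_mid, t + 0.5 * dt)     # slope at the midpoint
    return x + dt * k2                     # full step using the midpoint slope

def euler_step(velocity, x, t, dt):
    """First-order Euler step, included only for comparison."""
    return x + dt * velocity(x, t)

def integrate(step, velocity, x0, t0, t1, n_steps):
    """Apply `step` n_steps times with a uniform step size."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = step(velocity, x, t, dt)
        t += dt
    return x

def decay(x, t):
    """Toy dynamics dx/dt = -x, whose exact solution is exp(-t)."""
    return -x

exact = math.exp(-1.0)
rk2_error = abs(integrate(rk2_step, decay, 1.0, 0.0, 1.0, 5) - exact)
euler_error = abs(integrate(euler_step, decay, 1.0, 0.0, 1.0, 5) - exact)
```

With the same five large steps, the second-order update tracks the exact solution more closely than Euler on this toy problem, which is the general motivation for higher-order integrators when few sampling steps are available.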

View full details

Poster

MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters

Soomin Park ⋅ Eunseong Lee ⋅ Kwang Bin Lee ⋅ Sung-Hee Lee

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 211

We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.

View full details

Poster

Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers

Yifan Zhou ⋅ Zeqi Xiao ⋅ Tianyi Wei ⋅ Shuai Yang ⋅ Xingang Pan

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 210

Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-$K$ sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representations and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing $K$ required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-$K$ selection, progressively adopting sparse Top-$K$ selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by $28.27\times$ and DiT training by $6.09\times$ on $256 \times 256$ pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently.
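
The coarse-to-fine idea behind hierarchical Top-$K$ selection can be illustrated with a minimal two-level sketch in plain Python. It is not the LLSA algorithm or its GPU kernel: it handles a single query, uses mean pooling as an assumed block summary, and leaves out Hierarchical KV Enrichment entirely. All function names and the `block` and `k` parameters are hypothetical.

```python
def mean_pool(vectors, size):
    """Average consecutive groups of `size` vectors into one summary each."""
    pooled = []
    for start in range(0, len(vectors), size):
        group = vectors[start:start + size]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(indices, scores, k):
    """Return (sorted) the k indices with the highest scores."""
    ranked = sorted(zip(scores, indices), reverse=True)
    return sorted(i for _, i in ranked[:k])

def hierarchical_top_k(query, keys, block=2, k=2):
    """Pick k fine key blocks by first picking k coarse groups of blocks."""
    fine = mean_pool(keys, block)      # level 1: one summary per key block
    coarse = mean_pool(fine, block)    # level 2: one summary per group of blocks
    coarse_ids = top_k(range(len(coarse)), [dot(query, c) for c in coarse], k)
    # Only the children of the selected coarse groups are scored at the fine level.
    candidates = [c * block + j
                  for c in coarse_ids
                  for j in range(block)
                  if c * block + j < len(fine)]
    return top_k(candidates, [dot(query, fine[i]) for i in candidates], k)

# Toy data: 16 keys whose first coordinate grows with position.
keys = [[float(i), 0.0] for i in range(16)]
query = [1.0, 0.0]
selected_blocks = hierarchical_top_k(query, keys)
```

Because each level only scores the children of the groups kept at the level above, the number of scored summaries per query grows with the depth of the hierarchy rather than with the total number of blocks.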

View full details

Poster

FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

YIYI CAI ⋅ Yuhan Wu ⋅ Kunhang Li ⋅ YOU ZHOU ⋅ Bo Zheng ⋅ Haiyang Liu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 212

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that, to guarantee modeling the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower-triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying way. With these improvements, we demonstrate for the first time that a diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available.

View full details

Poster

GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models

Korada Sri Vardhana ⋅ Soma Biswas

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 215

Text-to-Image (T2I) diffusion models power modern creative tools, but their open-ended generative nature raises safety, ethical, and copyright concerns. Retraining or fine-tuning to remove every unsafe or copyrighted concept is impractical, motivating training-free interventions that suppress specific semantics while preserving general visual quality. Existing guard-railing methods face a core trade-off: they are either rigid, failing to generalize to paraphrased or context-shifted prompts, or coarse, distorting unrelated content and fidelity. We present GenErase (GENeralizable ERAsure with SEmantic Awareness), a training-free, geometry-grounded framework for robust concept removal in diffusion models. GenErase enforces semantic orthogonality in the cross-attention value space via an explicit erase-and-replace operation, guided by a per-token preserve projector and a hard geometric gate. This design enables precise erasure, explicit protection of critical semantics, and stability across layers, paraphrases, and multi-concept cases. Extensive experiments on identity, object, and style erasure, together with a new GenBench-40 benchmark, show that GenErase achieves state-of-the-art erasure fidelity and superior paraphrase-level generalization, establishing it as a practical and principled guard-rail for safe, real-time diffusion deployment. Code will be released upon acceptance.

View full details

Poster

DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

Xinglong Luo ⋅ Ao Luo ⋅ Zhengning Wang ⋅ Yueqi Yang ⋅ Chaoyu Feng ⋅ Lei Lei ⋅ Bing Zeng ⋅ Shuaicheng Liu

Jun 6, 11:45 AM - 1:45 PM ExHall F 214

Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Code and dataset will be released.

View full details

Poster

VL-RouterBench: A Benchmark for Vision–Language Model Routing

Zhehao Huang ⋅ Baijiong Lin ⋅ Jingyuan Zhang ⋅ Jingying Wang ⋅ Yuhang Liu ⋅ Ning Lu ⋅ Tao Li ⋅ Xiaolin Huang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 218

Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision–language models (VLMs). We present **VL-RouterBench** to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample–model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample–model pairs and a total input–output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.
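
The scale figures and the ranking idea in this abstract can be checked with a few lines of arithmetic. The sketch below is an illustration only: the abstract does not specify how cost and accuracy are normalized, so the min-max cost normalization (cheapest maps to 1) and the names `harmonic_mean`, `normalized_cost_score`, and `rank_score` are assumptions, not the benchmark's actual definitions.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0 if both are 0."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)

def normalized_cost_score(cost, min_cost, max_cost):
    """Map a cost to [0, 1], where 1 is the cheapest (assumed normalization)."""
    return (max_cost - cost) / (max_cost - min_cost)

def rank_score(accuracy, cost, min_cost, max_cost):
    """Combine accuracy in [0, 1] with the normalized cost score."""
    return harmonic_mean(accuracy, normalized_cost_score(cost, min_cost, max_cost))

# Scale figures quoted in the abstract.
num_samples = 30_540
num_models = 15 + 2                      # open-source models plus API models
num_pairs = num_samples * num_models     # sample-model pairs

# Hypothetical router: 80% accuracy at cost 2.0 on a cost range of [1.0, 5.0].
example_score = rank_score(0.8, 2.0, 1.0, 5.0)
```

The product of samples and models reproduces the 519,180 pairs stated above. A harmonic mean penalizes routers that do well on only one axis: a router at the maximum cost scores 0 regardless of its accuracy under this assumed normalization.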

View full details

Poster

InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior

Weimin Bai ⋅ Suzhe Xu ⋅ Yiwei Ren ⋅ Jinhua Hao ⋅ Ming Sun ⋅ Wenzheng Chen ⋅ He Sun

Jun 6, 11:45 AM - 1:45 PM ExHall F 218

Video inverse problems such as inpainting, deblurring and super-resolution are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers—leading to temporal artifacts—or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher’s strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the standard VAE in video diffusion backbone with a highly efficient LeanVAE, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100x speedups over iterative video diffusion priors. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.

View full details

Poster

CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

Marc-Antoine Lavoie ⋅ Anas Mahmoud ⋅ Aldo Zaimi ⋅ Arsene Fansi Tchango ⋅ Steven L. Waslander

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 219

CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP’s pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
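
A minimal sketch of the caption preprocessing described above (dropping the opening summary sentence, then sub-sampling the remaining sentences) might look like the following. It is an assumption-laden illustration rather than the DeBias-CLIP pipeline: the regex sentence splitter, the `keep_ratio` parameter, and the function names are hypothetical, and the text token padding step is not shown.

```python
import random
import re

def split_sentences(caption):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]

def debias_caption(caption, keep_ratio=0.5, rng=None):
    """Drop the first (summary) sentence and keep a random subset of the rest."""
    rng = rng or random.Random()
    sentences = split_sentences(caption)
    # Keep a lone sentence as-is; otherwise discard the opening summary.
    details = sentences[1:] if len(sentences) > 1 else sentences
    if not details:
        return ""
    n_keep = max(1, round(keep_ratio * len(details)))
    kept = sorted(rng.sample(range(len(details)), n_keep))  # preserve order
    return " ".join(details[i] for i in kept)

caption = ("A dog in a park. The dog is brown. It holds a red ball. "
           "Trees line the path. The sky is cloudy.")
augmented = debias_caption(caption, keep_ratio=0.5, rng=random.Random(0))
```

Sampling a different subset each epoch means no single sentence position is always present, which is one way to spread supervision beyond the opening tokens.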

View full details

Poster

Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding

Ziyao He ⋅ Yingjie Liu ⋅ Zhang Yangrui ⋅ Mingsong Chen ⋅ Xuan Tang ⋅ Xian Wei

Jun 7, 11:45 AM - 1:45 PM ExHall F 221

Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel Curvature-Aware Captioning framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions. Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.

View full details

Poster

Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding

Shawn Huang ⋅ Brian Price ⋅ Yifei Fan ⋅ Bryan Morse

Jun 7, 3:30 PM - 5:30 PM ExHall A 222

Automatic album organization has been studied extensively over the past decades due to significant progress in digital photography. Recent vision-language models (VLMs) have shown strong performance on multi-image understanding, making them natural candidates for automating album organization workflows. While VLMs' abilities in multi-image understanding have been widely studied, their performance on album organization remains underexplored. To bridge this gap, we introduce AlbumBench, the first comprehensive benchmark for automatic album organization. Specifically, we (1) define album organization tasks as photo selection for album-specific user objectives, photo rating according to how well user intents are fulfilled, and album-specific photo grouping given a user query which requires contextual understanding of the album; (2) establish AlbumBench, a benchmark dataset containing 27051 images across 641 albums with 5 annotations per image; and (3) evaluate mainstream open-source and proprietary VLMs on AlbumBench. We show that AlbumBench presents unique challenges compared to traditional multi-image understanding benchmarks due to its requirement for understanding album context and user intent. Our findings reveal a significant performance gap between open-source and proprietary VLMs on album organization tasks. Despite this gap, even the best-performing proprietary models sometimes struggle with tasks that humans find relatively easy. We hope that AlbumBench can serve as a foundation for unifying album organization research and motivate improvements in VLMs' performance on these tasks.

View full details

Poster

ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding

Ao Cheng ⋅ Xingming Li ⋅ Xuanyu Ji ⋅ Xixiang He ⋅ Qiyao Sun ⋅ Chunping Qiu ⋅ Runke Huang ⋅ Qingyong Hu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 224

Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure, so interpreting them requires specialized maritime expertise. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning under multiple constraints). All samples are generated from raw S-57 data through a calibrated vector-to-image pipeline with automated consistency checks and expert review. We evaluate 10 state-of-the-art MLLMs, including GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, under a unified zero-shot protocol. The best model achieves only 47.88% accuracy, with systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations. By establishing the first rigorous ENC benchmark, we open a new research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, providing essential infrastructure for advancing MLLMs toward professional maritime applications.

View full details

Poster

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Haoning Wu ⋅ Xiao Huang ⋅ Yaohui Chen ⋅ Ya Zhang ⋅ Yanfeng Wang ⋅ Weidi Xie

Jun 7, 11:45 AM - 1:45 PM ExHall F 224

Existing studies on multimodal large language models (MLLMs) in spatial understanding are typically limited by fragmented assessments. This work considers a comprehensive evaluation of the spatial understanding abilities of existing MLLMs. Concretely, we make the following contributions in this paper: (i) we propose **SpatialScore**, the most comprehensive and diverse multimodal spatial intelligence benchmark to date, encompassing various visual data types, input modalities, and QA formats with around 5K manually verified samples across 30 distinct tasks; (ii) we construct **SpatialCorpus**, a large-scale training resource with 331K multimodal QA samples for supervised fine-tuning of Qwen3-VL on spatial understanding; (iii) we develop **SpatialAgent**, a multi-agent system incorporating 12 specialized spatial perception tools that supports both *Plan-Execute* and *ReAct* reasoning paradigms and improves spatial reasoning in a training-free manner; and (iv) we conduct extensive evaluations on 40 representative MLLMs, revealing persistent challenges in spatial intelligence while demonstrating the effectiveness of our data-driven and agent-based solutions. All data, code, and models will be publicly available.

View full details

Poster

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

Cheng Cui ⋅ Ting Sun ⋅ Suyin Liang ⋅ Tingquan Gao ⋅ Zelun Zhang ⋅ Jiaxuan Liu ⋅ Xueqing Wang ⋅ Changda Zhou ⋅ Hongen Liu ⋅ Manhui Lin ⋅ Yue Zhang ⋅ yubo zhang ⋅ Jing Zhang ⋅ Jun Zhang ⋅ Xing Wei ⋅ Yi Liu ⋅ Dianhai Yu ⋅ Yanjun Ma

Jun 6, 11:45 AM - 1:45 PM ExHall F 225

Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy among visual regions in document images, such as backgrounds. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding.

View full details

Poster

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Chiao-An Yang ⋅ Ryo Hachiuma ⋅ Sifei Liu ⋅ Subhashree Radhakrishnan ⋅ Raymond A. Yeh ⋅ Yu-Chiang Frank Wang ⋅ Min-Hung Chen

Jun 7, 11:45 AM - 1:45 PM ExHall F 225

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench.

View full details

Poster

SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation

Jiale Huang ⋅ Shangfei Wang

Jun 7, 3:30 PM - 5:30 PM ExHall A 228

3D Referring Expression Segmentation (3D-RES) aims to segment objects in point clouds according to language descriptions. Unlike common practices in 2D that utilize learnable query embeddings, recent 3D-RES methods typically generate queries directly from 3D points. However, this direct coupling of queries to raw point clouds introduces new challenges: an impractically large number of queries derived from massive point cloud data and a reliance on non-deterministic sampling algorithms. In this paper, we propose a Semantic-based Adaptive Query Network (SAQN), which introduces a novel query strategy for 3D-RES. Instead of generating queries from points, SAQN employs a learnable query vector for each semantic class. This approach drastically reduces the number of queries while maintaining the advantage of avoiding Hungarian matching through implicit class alignment. Additionally, to address potential cross-object ambiguity within semantic classes, we introduce supplementary queries that are adaptively fused with each class query to disambiguate and enrich representations. Comprehensive experiments show that SAQN achieves state-of-the-art performance while reducing the number of queries.

View full details

Poster

Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis

Yuanzhe Li ⋅ Hao Chen ⋅ Rui Yin ⋅ Juyan Ba ⋅ Yu Zhang ⋅ Sheng Lu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 230

Recent vision language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis. Each case in Gastric-X includes paired resting and dynamic CT scans, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.

View full details

Poster

SpatialTree: How Spatial Intelligence Branches Out in MLLMs

Yuxi Xiao ⋅ longfei li ⋅ Shen Yan ⋅ Xinhang Liu ⋅ Sida Peng ⋅ Yunchao Wei ⋅ Xiaowei Zhou ⋅ Bingyi Kang

Jun 6, 11:45 AM - 1:45 PM ExHall F 229

Spatial Intelligence (SI) has emerged as a critical frontier for MLLMs, encompassing a hierarchy of skills from foundational perception to high-level spatial reasoning. However, how these abilities are acquired, emerge, and transfer remains largely unknown. To investigate this, we propose SpatialTree, a hierarchical taxonomy that organizes SI into a capability tree, from low-level perception (L1) through mental mapping (L2) and mental simulation (L3) to agentic competence (L4). Building on this, we construct a hierarchical, capability-centric benchmark using our proposed Spatial Engine, annotating each ability according to its level. Guided by the benchmark's correlation analysis, we conduct targeted supervised fine-tuning (SFT) and prompting experiments on key abilities. The results confirm the independence of abilities at the same level, reveal cross-level transfer, and further demonstrate a multi-ability synergy when these abilities are trained jointly. Our work provides a novel framework for analyzing SI in MLLMs, offering a comprehensive methodology to study how foundational abilities emerge and support higher-level competencies.

View full details

Poster

PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

Zekai Lin ⋅ Xu Zheng

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 230

360° panoramic images are increasingly used in VR, autonomous driving, and robotics for holistic scene understanding. However, current Vision–Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations (depth, segmentation, and bounding boxes). Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, with the best reaching only 49.34% overall and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward combining five geometry-aware strategies (e.g., distance tolerance, spatial consistency). A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (T/F, MCQ), and Stage 2 fine-tunes on mixed OE data for generalization. Our 7B model sets a new state of the art, improving total accuracy to 52.93% (+3.59%) and OE accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.

View full details

Poster

Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

Yuxin Yang ⋅ Yinan Zhou ⋅ Yuxin Chen ⋅ Ziqi Zhang ⋅ Zongyang Ma ⋅ Chunfeng Yuan ⋅ Bing Li ⋅ Jun Gao ⋅ Weiming Hu

Jun 7, 11:45 AM - 1:45 PM ExHall F 235

Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible, multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose **O**bject-**A**nchored **C**omposed **I**mage **R**etrieval (**OACIR**), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct **OACIRR** (**OACIR** on **R**eal-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To perform the OACIR task, we propose ***AdaFocal***, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that ***AdaFocal*** substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.

View full details

Poster

Physical Object Understanding with a Physically Controllable World Model

Rahul Venkatesh ⋅ Klemen Kotar ⋅ Lilian Naing Chen ⋅ Wanhee Lee ⋅ Gia Ancone ⋅ Seungwoo Kim ⋅ Luca Thomas Wheeler ⋅ Jared Watrous ⋅ Honglin Chen ⋅ Daniel Bear ⋅ Stefan Stojanov ⋅ Daniel L.K. Yamins

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 239

A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations -- capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract coherent physical objects and articulated object subparts, achieving state-of-the-art results on SpelkeBench and DragAMove. Having discovered these objects, our world model can manipulate them in 3D, emerging as the strongest performer on 3DEditBench. Finally, we demonstrate that physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.

View full details

Poster

EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding

Yinuo Jing ⋅ Jinyan Wu ⋅ Zixi Yang ⋅ Kongming Liang ⋅ Xiatian Zhu ⋅ Zhanyu Ma

Jun 7, 11:45 AM - 1:45 PM ExHall F 239

Vision-language models (VLMs) have achieved remarkable success across numerous domains, yet they lag significantly in animal behavior understanding due to severe data scarcity. Annotated animal behavior videos are prohibitively expensive and time-consuming to collect, requiring domain expertise and controlled observation conditions. To address this challenge, we leverage structured domain knowledge as an inductive bias from the Neuro Behavior Ontology (NBO), which provides professional annotations, hierarchical behavior structures, and comprehensive semantic coverage. We present EthoCLIP, an ontology-enhanced vision–language contrastive learning framework that explicitly embeds ontology semantics through an ontology-aware graph module to capture hierarchical relationships among behaviors and learn structured semantic dependencies. Incorporating ontological information reduces the burden of learning purely from data, thereby alleviating requirements for large-scale datasets. To enhance EthoCLIP training, we construct AnimalBand, an NBO-consistent dataset integrating 74,671 videos across multiple species and behaviors with semantic standardization and extended knowledge coverage. Extensive experiments validate both our method and dataset. Results demonstrate that EthoCLIP pretrained on AnimalBand substantially improves behavior recognition accuracy and transfer learning performance across diverse benchmarks, confirming that ontology-driven semantic enrichment effectively addresses data scarcity in animal behavior understanding.

View full details

Poster

ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

Luigi Seminara ⋅ Davide Moltisanti ⋅ Antonino Furnari

Jun 7, 11:45 AM - 1:45 PM ExHall F 243

Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample efficiency and high computational cost. In this work, we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly into the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that performance gains arise from our differentiable structure-aware training rather than post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies by establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.
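One common way to smooth Viterbi decoding is to replace each hard `max` with a temperature-scaled log-sum-exp. The sketch below shows that relaxation on a toy two-state problem. It is an assumption about what a "smooth relaxation" can look like, not the paper's DVL, and the emission and transition scores are made up for illustration.

```python
import math


def logsumexp(values, temperature):
    """Smooth maximum: approaches max(values) as temperature goes to 0."""
    m = max(values)
    return m + temperature * math.log(sum(math.exp((v - m) / temperature) for v in values))


def viterbi_score(emissions, transitions, temperature=None):
    """Best-path score; hard max when temperature is None, smooth relaxation otherwise."""
    n_states = len(transitions)
    pool = max if temperature is None else (lambda vs: logsumexp(vs, temperature))
    scores = list(emissions[0])
    for step in emissions[1:]:
        scores = [
            step[j] + pool([scores[i] + transitions[i][j] for i in range(n_states)])
            for j in range(n_states)
        ]
    return pool(scores)


# Toy log-scores: 3 time steps, 2 actions; transitions play the role of graph edge scores.
emissions = [[0.0, -2.0], [-1.0, 0.0], [-3.0, 0.0]]
transitions = [[0.0, -1.0], [-5.0, 0.0]]

hard = viterbi_score(emissions, transitions)
soft = viterbi_score(emissions, transitions, temperature=0.01)
```

The smooth score upper-bounds the hard one and converges to it as the temperature shrinks, while remaining differentiable with respect to every emission and transition score.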

View full details

Poster

Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval

Hao Sun ⋅ Yadong Huo ⋅ Qibing Qin ⋅ Wenfeng Zhang ⋅ Lei Huang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 246

Recent cross-modal hashing methods have introduced sample generation strategies to enrich training signals. Despite these advances, sample generation-driven hashing still faces two major challenges: (1) Interpolation-based methods adopt deterministic and class-independent generation that restricts synthetic samples to a small region around the original data. Consequently, intra-class diversity is limited, which weakens the model’s ability to learn discriminative binary codes. (2) Generation network-based methods leverage a complex generative model to produce synthetic samples, which adds model complexity. To address these issues, we propose a novel Intra-class Distribution-guided Generative Hashing (IDGH) that adaptively generates synthetic samples directly from estimated intra-class distributions. Specifically, we suggest an Intra-class Distribution Estimation (IDE) scheme to model the characteristic distribution of each class, providing essential support for adaptive sample generation. Meanwhile, by utilizing the distribution information from neighboring classes, we design a Neighbor-guided Distribution Refinement (NDR) mechanism to correct flawed estimations for classes. With refined intra-class distributions, we propose a Distribution-aware Adaptive Generation (DAG) strategy that synthesizes informative training samples by shifting features along diverse directions guided by intra-class distribution patterns. The proposed approach is plug-and-play and can be seamlessly integrated into various objective functions, providing semantically diverse training samples, thus enhancing similarity learning. Extensive experiments on benchmark datasets demonstrate that IDGH outperforms existing methods.

View full details

Poster

Learning Effective Sign Features without Text for Gloss-free Sign Language Translation

Shiwei Gan ⋅ Xiao Liu ⋅ Yafeng Yin ⋅ Nan Liu ⋅ Kuizhuang Liu ⋅ Desibieer Tuerdaken ⋅ Zhiwei Jiang ⋅ Lei Xie ⋅ Sanglu Lu ⋅ Hongkai Wen

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 247

Self-supervised learning (SSL) has achieved remarkable success across both NLP and CV domains. However, sign language translation (SLT) models still heavily rely on gloss annotations in gloss-based SLT or text annotations in gloss-free SLT (GFSLT) during pretraining, aiming to ensure that the backbone provides effective sign language (SL) features for the translation model. Such reliance restricts the scalability and generalization ability of the SLT model. One natural question arises: \textbf{Can existing SSL methods be directly applied to the SL domain to train an effective sign feature extractor for downstream GFSLT tasks, eliminating the need for text annotations?} In this paper, we propose a simple yet effective pretraining framework with two goals: (1) decoupling the pretraining process from gloss or text annotations, relying purely on sign frames; and (2) requiring only global frames during inference for simplicity. We show that directly applying existing SSL methods yields suboptimal performance, as SL features involve subtle motion patterns and discriminative cues that are often confined to local regions. To address this, we introduce SignDINO, a sign-aware DINO training strategy that learns effective and semantically meaningful representations from global frames without any textual supervision. Specifically, a student–teacher architecture is employed, where the teacher model receives the global sign frame, while the student model learns from masked local views that preserve only the hand and facial regions. Such a simple design encourages the model to infer global semantics from discriminative local cues, allowing the teacher model to extract SL-related features during inference solely based on global views. Extensive experiments on public SL datasets show that SignDINO achieves highly competitive performance on the GFSLT task without relying on extra cues or additional SL-related pretraining.

View full details

Poster

V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan ⋅ Runze Wang ⋅ Tianwen Qian ⋅ Mohammad Mahdi ⋅ Yanwei Fu ⋅ Xiangyang Xue ⋅ Xiaomeng Huang ⋅ Luc Van Gool ⋅ Danda Paudel ⋅ Yuqian Fu

Jun 6, 11:45 AM - 1:45 PM ExHall F 248

Cross-view object correspondence, exemplified by the representative task of ego–exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V$^{2}$-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V$^{2}$-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V$^{2}$-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego–exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V$^{2}$-SAM, achieving new state-of-the-art performance on Ego-Exo4D (Ego–Exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence). Codes will be released upon acceptance.

View full details

Poster

Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition

Jakob Paul Zimmermann ⋅ Georg Loho

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 249

It has been demonstrated in various contexts that monotonicity leads to better explainability in neural networks. However, not every function can be well approximated by a monotone neural network. We demonstrate that monotonicity can still be used in two ways to boost explainability. First, we use an adaptation of the decomposition of a trained ReLU network into two monotone and convex parts, thereby overcoming numerical obstacles from an inherent blowup of the weights in this procedure. Our proposed saliency methods, SplitCAM and SplitLRP, improve on state-of-the-art results on both VGG16 and ResNet18 networks on ImageNet-S across all Quantus saliency metric categories. Second, we show that training a model as the difference between two monotone neural networks results in a system with strong self-explainability properties.
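The textbook difference-of-convex (DC) decomposition of a ReLU network can be sketched as follows. Each weight matrix is split into non-negative parts, and the network output is tracked as a pair (g, h) with output = g - h, where g and h are both convex and non-decreasing in the input. This is the standard construction, not the adaptation proposed in the paper, and the toy weights are assumptions chosen for illustration.

```python
def split(weights):
    """Split a matrix into non-negative parts with W = W_pos - W_neg."""
    pos = [[max(w, 0.0) for w in row] for row in weights]
    neg = [[max(-w, 0.0) for w in row] for row in weights]
    return pos, neg


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu_net(layers, x):
    """Plain ReLU network: ReLU after every layer except the last."""
    for idx, (weights, bias) in enumerate(layers):
        x = [a + b for a, b in zip(matvec(weights, x), bias)]
        if idx < len(layers) - 1:
            x = [max(v, 0.0) for v in x]
    return x


def dc_forward(layers, x):
    """Return (g, h) with relu_net(layers, x) == g - h, using only non-negative weights."""
    g, h = list(x), [0.0] * len(x)
    for idx, (weights, bias) in enumerate(layers):
        pos, neg = split(weights)
        new_g = [p + q + max(b, 0.0) for p, q, b in zip(matvec(pos, g), matvec(neg, h), bias)]
        new_h = [p + q + max(-b, 0.0) for p, q, b in zip(matvec(neg, g), matvec(pos, h), bias)]
        g, h = new_g, new_h
        if idx < len(layers) - 1:
            # relu(g - h) = max(g, h) - h, and max keeps g convex and monotone.
            g = [max(a, b) for a, b in zip(g, h)]
    return g, h


# Toy two-layer network (weights, bias) per layer.
layers = [([[1.0, -2.0], [-1.0, 0.5]], [0.5, -0.25]), ([[2.0, -3.0]], [0.1])]
x = [1.0, 2.0]
g, h = dc_forward(layers, x)
output = relu_net(layers, x)
```

In this toy case g and h are both close to 12 while their difference is 0.1, a small-scale view of the weight blowup the abstract mentions.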

View full details

Poster

RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval

Yijiang Li ⋅ Kunal Kotian ⋅ Ali Marjaninejad ⋅ Meir Friedenberg ⋅ Kaushik Pavani ⋅ Sunny Dasgupta

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 251

Current multimodal image retrieval benchmarks focus on relatively simple queries where target images are either described directly or by simple composition with an input image. When retrieval requires complex reasoning to determine the target image, the task becomes significantly more challenging, yet standardized benchmarks for this setting do not exist. To fill this gap, we introduce RMIR, a benchmark dataset of $1,634$ queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs that demand robust visual understanding together with logical inference, beyond surface-level matching, to identify correct target images. Evaluation of state-of-the-art models on RMIR reveals significant performance gaps, with the best model achieving only $46.53$\% recall@$20$ averaged across reasoning categories. Our systematic analysis exposes fundamental limitations in current multimodal retrieval systems and establishes RMIR as a challenging testbed for developing multimodal, reasoning-capable retrieval models.
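For readers unfamiliar with the metric, the sketch below computes recall@k per query and then averages over reasoning categories. It assumes recall@k is the fraction of a query's relevant targets found in the top k results and that categories are weighted equally; the benchmark's exact protocol may differ, and the toy queries are invented.

```python
from statistics import fmean


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant targets that appear among the top-k ranked results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def category_averaged_recall(results, k):
    """Average recall@k within each category, then average the category means."""
    per_category = {}
    for category, ranked_ids, relevant_ids in results:
        per_category.setdefault(category, []).append(recall_at_k(ranked_ids, relevant_ids, k))
    return fmean([fmean(scores) for scores in per_category.values()])


# Toy queries: (category, ranked retrieval list, ground-truth targets).
results = [
    ("functional", ["a", "b", "c"], ["b"]),
    ("functional", ["a", "b", "c"], ["z"]),
    ("temporal", ["x", "y"], ["x", "q"]),
    ("causal", ["m", "n"], ["n"]),
]
score = category_averaged_recall(results, k=2)
```

Averaging per category first keeps a category with many queries from dominating the headline number.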

View full details

Poster

Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Samyak Rawlekar ⋅ Amitabh Swain ⋅ Yujun Cai ⋅ Yiwei Wang ⋅ Ming-Hsuan Yang ⋅ Narendra Ahuja

Jun 7, 11:45 AM - 1:45 PM ExHall F 251

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in \texttt{[CLS]} token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the \texttt{[CLS]} token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the \texttt{[CLS]} token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.

View full details

Poster

Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts

Yang Liu ⋅ Jiajin Zhang ⋅ Yaojun Hu ⋅ Bingguang Hao ⋅ Xin Cao ⋅ Yingda Xia ⋅ Danyang Tu ⋅ Shi Gu ⋅ Ling Zhang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 252

A faithful decision-making process requires models to ground human-understandable concepts both spatially (where they appear in the image) and causally (how they influence the prediction). Recent advances in Vision–Language Models (VLMs) enable concept-level alignment and have inspired Concept Bottleneck Models (CBMs), which explain predictions by mapping image representations to human-understandable concepts, allowing users to trace decisions through explicit semantic reasoning. However, existing CBMs suffer from two key inconsistencies. First, semantic inconsistency: VLMs often fail to localize fine-grained part–attribute concepts, producing noisy or incomplete masks. Second, object inconsistency: object-agnostic concepts such as "head: streamlined front profile" may describe multiple categories (e.g., fish or human); without enforcing object identity, non-targeted regions can introduce spurious evidence that corrupts the bottleneck representation. To address these challenges, we propose a new Object-Aware Concept Bottleneck Model (OA-CBM) that jointly enforces semantic- and object-level consistency. Specifically, (1) we redefine concepts as part–attribute pairs to enhance VLM robustness at the semantic level, and (2) introduce class-agnostic object clustering to suppress irrelevant visual evidence. We further annotate two grounding datasets with part–attribute descriptions and conduct extensive experiments. Results demonstrate that OA-CBM produces more faithful and robust explanations while maintaining competitive predictive performance.

View full details

Poster

HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling

Joungbin An ⋅ Kristen Grauman

Jun 6, 11:45 AM - 1:45 PM ExHall F 252

Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or using fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba’s selective scanning to produce compact anchor tokens summarizing video content across scales. We further introduce anchor-conditioned and segment-pooled contrastive losses, two complementary objectives that encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state of the art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.

View full details

Poster

RiskProp: Collision-Anchored Self-Supervised Risk Propagation For Early Accident Anticipation

Yiyang Zou ⋅ Tianhao Zhao ⋅ Peilun Xiao ⋅ Hongyu Jin ⋅ Longyu Qi ⋅ Yuxuan Li ⋅ Liyin Liang ⋅ Yifeng Qian ⋅ Chunbo Lai ⋅ Yutian Lin ⋅ Zhihui Li ⋅ Yu Wu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 255

Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated “anomaly onset” frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose Risk Propagation (RiskProp), a collision-anchored supervised framework enhanced with self-supervised temporal constraints, which removes the need for anomaly onset annotations by leveraging only the reliably labeled collision frame. RiskProp models temporal risk evolution through two observation-driven losses: first, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model’s next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals; second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP-DATA and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.
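The two temporal losses can be sketched in a few lines. The forms below are assumptions: a hinge penalty on any drop in risk between consecutive frames, and a squared error that pulls each frame's risk toward the next frame's prediction treated as a fixed soft target. The abstract does not give the exact formulas, and the adaptive part of the monotonic constraint is not modeled here.

```python
def monotonic_penalty(risks, margin=0.0):
    """Hinge penalty on every drop in risk between consecutive frames."""
    drops = [max(0.0, earlier - later + margin) for earlier, later in zip(risks, risks[1:])]
    return sum(drops) / len(drops)


def future_frame_loss(risks):
    """Squared error between each frame's risk and the next frame's risk (used as a soft target).

    In a real training loop the target would be detached from the gradient graph.
    """
    pairs = list(zip(risks, risks[1:]))
    return sum((current - target) ** 2 for current, target in pairs) / len(pairs)


# Per-frame risk scores leading up to the labeled collision frame (toy values).
rising = [0.1, 0.2, 0.5, 0.9]
noisy = [0.1, 0.5, 0.3, 0.9]

rising_penalty = monotonic_penalty(rising)
noisy_penalty = monotonic_penalty(noisy)
noisy_future = future_frame_loss(noisy)
```

A steadily rising curve incurs no monotonic penalty, while the dip in the noisy curve is penalized in proportion to its size.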

View full details

Poster

Mechanisms of Object Localization in Vision–Language Models

Timothy Schaumlöffel ⋅ Martina G. Vilas ⋅ Gemma Roig

Jun 7, 11:45 AM - 1:45 PM ExHall F 254

Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while internal structure is largely ignored. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early–mid layers for LLaVA and mid–late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.

View full details

Poster

LitePT: Lighter Yet Stronger Point Transformer

Yuanwen Yue ⋅ Damien Robert ⋅ Jianyuan Wang ⋅ Sunghwan Hong ⋅ Jan D. Wegner ⋅ Christian Rupprecht ⋅ Konrad Schindler

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 255

Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6× fewer parameters, runs 2× faster, and uses 2× less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets.

View full details

Poster

SuP: Sub-cloud Driven Point Cloud Registration

Sheldon Fung ⋅ Wei Pan ⋅ Ling Cao ⋅ Fei Hou ⋅ Ling Chen ⋅ Shasha Mao ⋅ Hongdong Li ⋅ Xuequan Lu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 256

While existing point cloud registration methods handle high-overlap scenarios well, they often struggle with low-overlap scenarios due to inevitable geometric and semantic ambiguities in the non-overlapping regions. In this paper, we introduce SuP, a novel framework that reformulates low-overlap registration as a problem of mining high-overlap sub-cloud pairs (anchor pairs). Central to SuP is our Dual-phase Sub-cloud Anchor Mining (DSAM) module, which first subdivides the source and target point clouds into multiple sub-clouds and then applies a dual-phase weighting pipeline: 1) an efficient overlap-guided prior-weighting scheme (OPS) that leverages feature salience to identify candidate anchor pairs, and 2) a multi-scale post-weighting network (MPN) that exploits neighborhood feature consensus to further identify anchor pairs. Subsequently, final correspondences are generated through a merge-to-match module using the anchor pairs. To train DSAM, we design an alignment-aware weighting loss that uses on-the-fly alignment errors as supervision. Comprehensive experiments on the color-enhanced 3DMatch and 3DLoMatch demonstrate that SuP significantly outperforms state-of-the-art methods, achieving higher registration recall and more accurate alignment, especially under challenging low-overlap conditions.

View full details

Poster

MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations

Raghav Magazine ⋅ Xingjian Li ⋅ Min Xu

Jun 7, 3:30 PM - 5:30 PM ExHall A 256

Saliency-based explainability methods are widely used to interpret deep learning models in medical imaging, yet many existing approaches rely on white-box access to models, which is not always possible due to privacy concerns. In this work, we introduce **MedLIME**, a novel, model-agnostic explanation framework designed to enhance the robustness and fidelity of saliency maps for medical imaging abnormality localization. Building upon the Local Interpretable Model-agnostic Explanations (LIME) paradigm, MedLIME integrates three key components: (1) **Generative Masking** (GM), (2) **Supervised Test-Time Adaptation** (STT), and (3) **Evidence-based Regularization** (EBR), which together improve the accuracy of LIME's saliency maps. Extensive experiments on multiple medical datasets and three model architectures demonstrate that MedLIME consistently outperforms gradient-based and perturbation-based baselines in abnormality localization as measured by AUPRC. Our results highlight that incorporating generative reconstruction, adaptive perturbation, and data-driven regularization improves the reliability and interpretability of medical imaging models.

View full details

Poster

LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization

Jianshi Wu ⋅ Minghang Zhu ⋅ dq Liu ⋅ Wen Li ⋅ Sheng Ao ⋅ Siqi Shen ⋅ Chenglu Wen ⋅ Cheng Wang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 257

LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes because they treat all predicted points equally, which leaves them vulnerable to noise and outliers. In this paper, we propose **LEADER**, a robust LiDAR-based localization framework enhanced by a simple yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture that captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that LEADER outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. Source code will be released soon.

View full details

Poster

Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

Yunxiang Peng ⋅ Mengmeng Ma ⋅ Ziyu Yao ⋅ Xi Peng

Jun 7, 3:30 PM - 5:30 PM ExHall A 257

Reliable generalization metrics are fundamental to both the development and evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluating models' generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable, label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model outputs while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using a model’s inner workings, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models' generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model's generalization under different distribution shifts. Across diverse tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 11.0% and 45.3%, respectively.

View full details

Poster

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Baifeng Shi ⋅ Stephanie Fu ⋅ Long Lian ⋅ Hanrong Ye ⋅ David Eigen ⋅ Aaron Reite ⋅ Jan Kautz ⋅ Boyi Li ⋅ David Chan ⋅ Trevor Darrell ⋅ Pavlo Molchanov ⋅ Danny Yin

Jun 6, 11:45 AM - 1:45 PM ExHall F 258

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos because they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that reconstructs the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 66.5% on VideoMME). Furthermore, we introduce HLVid, the first high-resolution, long-form video QA benchmark with multi-minute 4K videos, where an MLLM scaled with AutoGaze outperforms the previous SOTA MLLM by 6.3%.

Poster

Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

Yehonatan Elisha ⋅ Oren Barkan ⋅ Noam Koenigstein

Jun 6, 11:45 AM - 1:45 PM ExHall F 259

Vision Transformers (ViTs) often fail under distribution shifts because they learn spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, typically relying on simple foreground/background masks, overlook the fine-grained semantic concepts that truly define an object (e.g., "long beak" and "wings" for a "bird"). To address this, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps (via AttnLRP) to align with spatially-grounded concept masks. These guidance masks are generated automatically and without manual annotation: class-relevant concepts are first proposed using an LLM-driven, label-free method, and then segmented using a Vision-Language Model (GroundingSAM). The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas and preserving classifier confidence via a dedicated loss term. This process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution (OOD) benchmarks show that our method significantly enhances model robustness across multiple ViT-based models and an additional CNN model. Furthermore, we validate that the resulting relevance maps exhibit improved alignment with semantic object parts, providing a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective guidance for model robustness than conventional segmentation maps, validating our hypothesis.

Poster

One-Shot Flow, Any-Time Frame: A Bidirectional Warping Framework for Event-Based Video Frame Interpolation

Linghui Fu ⋅ Yuhan Liu ⋅ Hao Chen ⋅ Zhen Yang ⋅ Yongjian Deng

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 261

Video Frame Interpolation (VFI) is a crucial task in video processing. Flow-based methods, despite their success, are constrained by a fundamental dilemma: forward warping is efficient but prone to artifacts, while backward warping yields higher quality at a significant computational cost, especially for multi-frame interpolation. This trade-off is a major bottleneck. To overcome this, we introduce "One-Shot Flow, Any-Time Frame," a novel framework for Event-based VFI (E-VFI) that achieves both high efficiency and superior quality for arbitrary-time interpolation. Our framework uniquely computes a comprehensive motion trajectory representation in a single pass using a Bidirectional Flow Estimation Block (BiFEB), leveraging the high temporal resolution of event data. Subsequently, our Flow Query (FQ) module can instantly retrieve the bidirectional optical flow for any timestamp, enabling the generation of any number of frames without repeated computation. Finally, a novel Bidirectional Warping (BiW) mechanism intelligently fuses the strengths of both warping directions, effectively mitigating artifacts and producing high-fidelity results. Extensive experiments show that our approach consistently surpasses state-of-the-art E-VFI methods in both reconstruction quality and inference efficiency, representing a substantial advance in efficient and high-quality event-based video interpolation. *The code will be released after acceptance.*

Poster

Explaining Object Detectors via Collective Contribution of Pixels

Toshinori Yamauchi ⋅ Hiroshi Kera ⋅ Kazuhiko Kawamoto

Jun 6, 11:45 AM - 1:45 PM ExHall F 260

Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code will be publicly available soon.
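
As a rough illustration of why collective contributions matter, the sketch below computes exact Shapley values for a toy scoring function. The patch names and `toy_score` are hypothetical stand-ins for a detector's confidence on masked inputs, and exact enumeration is only feasible for a handful of players; this is not the authors' implementation.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_score(patches):
    """Hypothetical detector score: 'a' and 'b' only help together."""
    score = 0.0
    if "a" in patches and "b" in patches:
        score += 1.0
    if "c" in patches:
        score += 0.5
    return score


phi = shapley_values(["a", "b", "c"], toy_score)

# Pairwise interaction of a and b: the joint gain beyond their solo gains.
interaction_ab = (
    toy_score({"a", "b"}) - toy_score({"a"}) - toy_score({"b"}) + toy_score(set())
)
```

In this toy case, patch `a` alone adds nothing to the score, yet it receives a non-zero Shapley value because of its joint effect with `b`; the interaction term makes that dependence explicit.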

Poster

TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection

Yearang Lee ⋅ Ho-Joong Kim ⋅ Seong-Whan Lee

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 262

Zero-Shot Temporal Action Detection (ZSTAD) aims to localize and recognize action instances from unseen action categories in untrimmed videos. Although existing methods have shown effectiveness by advancing architectural text-video alignment, they still struggle with capturing semantic distinctions between action classes, resulting in text-irrelevant predictions. To address this issue, we propose a Text-Foreground Concentrated Alignment for zero-shot temporal action DEtector (TF-CADE) that explicitly aligns textual information with action-relevant foreground regions. Specifically, we introduce Action Concentrate Aggregation (ACA), which extracts action concentrate scores to aggregate temporally informative video segments into a foreground-weighted video embedding. This foreground concentrated alignment enhances the semantic consistency between text and video features and improves inter-class discriminability. In addition, a Certainty-based Confidence Re-weighting (CCR) strategy refines per-snippet confidence scores by leveraging foreground-aware similarity, effectively suppressing irrelevant action classes during inference. Extensive evaluations show that our TF-CADE not only achieves state-of-the-art performance under in-distribution settings but also excels in cross-dataset generalization to unseen action classes.

Poster

GM-R^2: Generative Matching Learning for Unsupervised Geometric Representation and Registration

Haobo Jiang ⋅ Liang Yu ⋅ Jianmin Zheng

Jun 7, 11:45 AM - 1:45 PM ExHall F 261

This paper proposes GM-R^2, a novel Generative Matching Learning framework for unsupervised geometric descriptor learning and correspondence matching. By reformulating descriptor learning as geometry-conditioned cross-view image generation, GM-R^2 leverages the proxy supervisory signal from structurally aligned view synthesis to implicitly enforce feature consistency across correspondences, enabling robust 3D matching. To instantiate GM-R^2, we introduce a Denoising-Agnostic Coupled ControlNet conditioned on depth maps as the required geometry-conditioned cross-view generator. It effectively extends the single-view generation of naive ControlNet to the cross-view setting via a coupled depth-map input design and further removes the latent noise dependency to support geometry-only inference (as required by 3D matching). Moreover, we present Zoomable Equirectangular Projection for intrinsics-free point cloud-to-depth mapping, which adaptively zooms into the angular region occupied by the narrow-FOV input for dense range-map acquisition. Extensive experiments on 3DMatch and ScanNet datasets verify the superior precision of our GM-R^2, even surpassing supervised methods.

Poster

PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition

Anni Yu ⋅ Yu-Bin Yang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 263

Prototype-based methods enhance interpretability in image recognition by establishing intermediate part prototypes to build interpretable classifiers, enabling transparent reasoning through part-level attention and reference to prototypical examples. However, existing methods typically depend on unimodal visual supervision and constrain prototypes within the visual embedding space, which inherently restricts their semantic alignment with human-interpretable concepts. In this work, we present PRISM (Prototype-based Reasoning with Inter-modal Semantic Mining), an interpretable image recognition framework that leverages natural language as an auxiliary modality to guide the learning of class-specific part prototypes. PRISM introduces an information-theoretic attribution mechanism that identifies semantically salient image regions conditioned on textual descriptions. By aligning these attribution maps with prototype activation patterns, PRISM implicitly anchors visual part prototypes to conceptually meaningful image regions, enhancing interpretability without requiring explicit concept modeling. To further enhance the distinctiveness and localization of prototypes, we introduce a spatial compactness constraint that encourages each prototype to attend to specific, non-overlapping image regions. Extensive experiments on fine-grained benchmarks demonstrate that the proposed PRISM not only improves classification performance but also provides faithful and semantically grounded visual explanations.

Poster

Task-Driven Implicit Representations for Automated Design of LiDAR Systems

Nikhil Behari ⋅ Aaron Young ⋅ Tzofi Klinghoffer ⋅ Akshat Dave ⋅ Ramesh Raskar

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 262

Imaging system design is a complex, time-consuming, and largely manual process; LiDAR design, ubiquitous in mobile devices, autonomous vehicles, and aerial imaging platforms, adds further complexity through unique spatial and temporal sampling requirements. In this work, we propose a framework for automated, task-driven LiDAR system design under arbitrary constraints. To achieve this, we represent LiDAR configurations in a continuous six-dimensional design space and learn task-specific implicit densities in this space via flow-based generative modeling. We then synthesize new LiDAR systems by modeling sensors as parametric distributions in 6D space and fitting these distributions to our learned implicit density using expectation-maximization, enabling efficient, constraint-aware LiDAR system design. We validate our method on diverse tasks in 3D vision, enabling automated LiDAR system design across real-world-inspired applications in face scanning, robotic tracking, and object detection.

Poster

Evaluating Generative Models via One-Dimensional Code Distributions

Zexi Jia ⋅ Pengcheng Luo ⋅ Yijia Zhong ⋅ Jinchao Zhang ⋅ Jie Zhou

Jun 6, 11:45 AM - 1:45 PM ExHall F 263

Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of *discrete* visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance (CHD), a training-free distribution metric in token space, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 11 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.
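
To make the idea of a histogram-based distance in token space concrete, here is a minimal sketch. The abstract does not specify the distance used by CHD, so total variation is an assumption chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter


def code_histogram(token_seqs, codebook_size):
    """Normalized frequency of each codebook index over a set of token sequences."""
    counts = Counter(t for seq in token_seqs for t in seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("token_seqs contains no tokens")
    return [counts.get(i, 0) / total for i in range(codebook_size)]


def histogram_distance(p, q):
    """Total variation distance between two histograms (an assumed choice)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy token sequences over a codebook of size 4.
real_tokens = [[0, 1, 1, 2]]
close_tokens = [[1, 0, 2, 1]]   # same code usage, different order
far_tokens = [[3, 3, 3, 3]]     # collapses onto a single code

real_hist = code_histogram(real_tokens, 4)
d_close = histogram_distance(real_hist, code_histogram(close_tokens, 4))
d_far = histogram_distance(real_hist, code_histogram(far_tokens, 4))
```

A histogram ignores token order, so `d_close` is zero here; a metric that should react to ordering would need richer statistics than this sketch uses.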

Poster

U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

Xiang Xu ⋅ Ao Liang ⋅ Youquan Liu ⋅ Linfeng Li ⋅ Lingdong Kong ⋅ Ziwei Liu ⋅ Qingshan Liu

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 266

Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present **U4D**, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) *uncertainty-region modeling*, which reconstructs high-entropy regions with fine geometric fidelity, and (2) *uncertainty-conditioned completion*, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.
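
The first step, locating high-entropy regions from a segmentation model's outputs, can be sketched as follows. The per-point logits, the threshold value, and the function names are illustrative assumptions rather than details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_uncertainty(logits_per_point, threshold):
    """Return (hard, easy) index lists based on predictive entropy."""
    hard, easy = [], []
    for idx, logits in enumerate(logits_per_point):
        h = entropy(softmax(logits))
        (hard if h >= threshold else easy).append(idx)
    return hard, easy


# Three toy points with 3-class logits: confident, uniform, two-way ambiguous.
toy_logits = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 5.0, -5.0]]
hard_idx, easy_idx = split_by_uncertainty(toy_logits, threshold=0.5)
```

Under a "hard-to-easy" schedule, the points in `hard_idx` would be generated first and the ones in `easy_idx` completed afterwards.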

Poster

SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models

Sofian Chaybouti ⋅ Sanath Narayan ⋅ Yasser Dahou ⋅ Phúc H. Lê Khắc ⋅ Ankit Singh ⋅ Ngoc Dung Huynh ⋅ Wamiq Reyaz Para ⋅ Hilde Kuehne ⋅ Hakim Hacid

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 270

Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, (3) hierarchical clustering and sampling of training data—typically reserved for self-supervised learning—substantially improves sample efficiency over random sampling for multi-teacher distillation, and (4) the resulting representations transfer effectively to early-fusion Grounding-VLMs, outperforming models trained from scratch. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts, our AMoE initializes an early-fusion Grounding-VLM that replaces the conventional ViT→LLM stack, demonstrating improved performance compared to a model trained from scratch. We release OpenLVD200M and distilled checkpoints.
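
Token-balanced batching, as described in point (2), amounts to packing images with different token counts into sequences that share one token budget. A minimal sketch using first-fit-decreasing bin packing is shown below; the packing heuristic and the function name are assumptions, since the abstract does not say how sequences are filled.

```python
def pack_by_token_budget(token_counts, budget):
    """Group image indices into sequences whose token totals stay within `budget`.

    Uses first-fit decreasing: largest images are placed first, each into the
    first sequence that still has room.
    """
    order = sorted(range(len(token_counts)), key=lambda i: token_counts[i], reverse=True)
    bins = []  # each entry: [remaining_budget, [image indices]]
    for i in order:
        n = token_counts[i]
        if n > budget:
            raise ValueError(f"image {i} needs {n} tokens, budget is {budget}")
        for b in bins:
            if b[0] >= n:
                b[0] -= n
                b[1].append(i)
                break
        else:
            bins.append([budget - n, [i]])
    return [b[1] for b in bins]


# Token counts for six images at different resolutions (hypothetical values).
counts = [256, 1024, 576, 196, 784, 64]
sequences = pack_by_token_budget(counts, budget=1024)
```

Every packed sequence then carries a comparable number of tokens regardless of how many images it holds.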

Poster

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

Ray (Rui) Zhang ⋅ Carl Greiff ⋅ Thomas Lew ⋅ John Subosits

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 272

We propose a fast and correspondence-free point cloud registration method that leverages local geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The proposed method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order methods used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR registration task in the driving domain, we achieve a reduction of $>55\%$ in both translational and rotational drift in challenging feature-sparse environments.
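
The correspondence-free objective can be illustrated with a simplified sketch: each point cloud is treated as a sum of kernels, and alignment quality is the RKHS inner product between the two functions. The version below assumes an isotropic Gaussian kernel and a fixed length scale for brevity, whereas the paper uses point-wise anisotropic kernels; all names are hypothetical.

```python
import math


def gaussian_kernel(p, q, lengthscale):
    """Isotropic Gaussian kernel between two 3D points (simplifying assumption)."""
    sq_dist = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def rkhs_inner_product(cloud_a, cloud_b, lengthscale=0.5):
    """Inner product of the two kernel-sum functions; no correspondences needed."""
    return sum(gaussian_kernel(a, b, lengthscale) for a in cloud_a for b in cloud_b)


def translate(cloud, offset):
    """Shift every point of a cloud by a fixed offset."""
    return [tuple(c + o for c, o in zip(p, offset)) for p in cloud]


source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
aligned_score = rkhs_inner_product(source, source)
shifted_score = rkhs_inner_product(source, translate(source, (0.3, 0.0, 0.0)))
```

Registration then searches over rigid transforms for the pose that maximizes this score; the double loop is quadratic in the number of points, so practical implementations restrict the sum to nearby pairs.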

Poster

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Ruoxiang Huang ⋅ Zhen Yuan

Jun 7, 11:45 AM - 1:45 PM ExHall F 271

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in multimodal understanding, yet their positional encoding mechanisms remain fundamentally limited. Current approaches apply uniform positional indices across all tokens, failing to account for dramatic variations in information density between and within modalities. This uniform treatment leads to suboptimal attention allocation and inefficient cross-modal fusion. We introduce MODIX (Multimodal Information-Driven Positional Index Scaling), a training-free framework that dynamically adapts positional granularity based on information-theoretic analysis of modality contributions. By jointly quantifying intrinsic information density within each modality and cross-modal interaction strength, MODIX assigns finer positional strides to information-rich content and coarser strides to redundant regions. Operating purely at inference time, our method requires no architectural modifications or retraining, enabling plug-and-play integration with existing VLMs. Comprehensive experiments across multiple state-of-the-art architectures and six benchmarks demonstrate that MODIX consistently improves multimodal reasoning, achieving up to 8.4% gains on ScienceQA and 6.8% on RealWorldQA, while dynamically adapting positional resolution to task-specific information distributions.
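
The notion of variable positional strides can be sketched as follows. The linear mapping from an information score to a stride, the stride bounds, and the function name are illustrative assumptions; the abstract does not describe how MODIX computes its scores or strides.

```python
def scaled_positions(info_scores, min_stride=0.5, max_stride=1.5):
    """Assign positional indices with smaller strides after information-rich tokens.

    Scores are min-max normalized; the highest score maps to `min_stride`
    and the lowest to `max_stride` (a hypothetical linear rule).
    """
    lo, hi = min(info_scores), max(info_scores)
    span = hi - lo
    positions, pos = [], 0.0
    for s in info_scores:
        norm = (s - lo) / span if span > 0 else 0.5
        stride = max_stride - norm * (max_stride - min_stride)
        positions.append(pos)
        pos += stride  # distance to the next token's index
    return positions


# Four tokens: rich, redundant, rich, moderately informative.
positions = scaled_positions([1.0, 0.0, 1.0, 0.5])
```

With uniform indexing the four tokens would sit at 0, 1, 2, 3; here the spacing after each token depends on its score, so the indices are no longer evenly spaced.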

Poster

StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation

Mengmeng Liu ⋅ Jiuming Liu ⋅ Michael Ying Yang ⋅ Chaokang Jiang ⋅ Jiangtao Li ⋅ Yunpeng Zhang ⋅ Hesheng Wang ⋅ Francesco Nex ⋅ Hao Cheng

Jun 7, 3:30 PM - 5:30 PM ExHall A 271

We propose StreamVLO, a streaming visual–LiDAR odometry framework that performs unified spatio-temporal correlation with Mamba models and tackles the long-standing cumulative drift problem via an online Cumulative Drift Compensation scheme for localization in 4D dynamic environments. Specifically, StreamVLO introduces a unified spatio-temporal correlation module built on Mamba to fuse heterogeneous visual and LiDAR cues across multi-frame clips, overcoming the limited temporal exploration of prior pairwise methods. Furthermore, a Cumulative Drift Compensation module minimizes cumulative drift by iteratively learning residual corrections from multiple historical frames in a causal manner. To strengthen spatial feature representation on salient regions, we adopt a Keypoint-Aware Auxiliary Loss with a winner-takes-all strategy. StreamVLO achieves state-of-the-art performance on two commonly used autonomous driving datasets, reducing errors by 19% ($t_{\text{rel}}$) and 22% ($r_{\text{rel}}$) on KITTI, and by 18% ATE and 16% RPE on Argoverse, while remaining suitable for real-time deployment.

Poster

Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding

Jincen Jiang ⋅ Qianyu Zhou ⋅ Yuhang Li ⋅ Kui Su ⋅ Meili Wang ⋅ Jian Chang ⋅ Jian Jun Zhang ⋅ Xuequan Lu

Jun 7, 3:30 PM - 5:30 PM ExHall A 272

While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.

Poster

NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code

Seemandhar Jain ⋅ Keshav Gupta ⋅ Kunal Gupta ⋅ Manmohan Chandraker

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 275

The proliferation of neural radiance field (NeRF) research demands significant effort to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel regularizers or architectural optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (±0.5 dB PSNR, ±0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, which could accelerate and democratize reproducible research. Code, data and implementations will be publicly released.

Poster

Mirror Illusion Art

Xiaopei Zhu ⋅ Zeyuan Li ⋅ Jun Zhu ⋅ Xiaolin Hu

Jun 7, 11:45 AM - 1:45 PM ExHall F 275

Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD) optimization that balances shape and color optimization. AutoMIA successfully generates diverse, smooth Mirror Illusion artworks in both the digital and physical worlds, with only around 76s design time and 2.6 GB memory on average using a single RTX 3090, advancing inverse graphics and computational design.

Poster

BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation

Miaowei Wang ⋅ Qingxuan Yan ⋅ Zhi Cao ⋅ Yayuan Li ⋅ Oisin Mac Aodha ⋅ Jason J. Corso ⋅ Amir Vaxman

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 277

Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points. Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity losses for motion-restoration quality. To train our model, we collate BIMO, a new dataset containing diverse variable-length 3D motion sequences with rich, high-quality text annotations. Extensive evaluations show that our feed-forward framework BiMotion generates more expressive, higher-quality, and better prompt-aligned motions than existing state-of-the-art methods, while also achieving faster generation. The code is available in the supplemental material and will be made publicly available upon publication.
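
One plausible form of a closed-form, Laplacian-regularized B-spline fit is a ridge-style least-squares problem. The notation below is an assumption introduced for illustration and is not taken from the paper: $X \in \mathbb{R}^{T \times D}$ stacks the $T$ frames of a motion sequence, $B \in \mathbb{R}^{T \times K}$ holds the B-spline basis functions evaluated at the frame times, $C \in \mathbb{R}^{K \times D}$ holds the $K$ control points, $L$ is a discrete Laplacian (second-difference) operator on the control points, and $\lambda > 0$ is the regularization weight.

$$
C^\star = \arg\min_{C} \; \lVert B C - X \rVert_F^2 + \lambda \lVert L C \rVert_F^2
= \left( B^\top B + \lambda L^\top L \right)^{-1} B^\top X
$$

The inverse exists whenever $B^\top B + \lambda L^\top L$ is positive definite, for example when $B$ has full column rank. Because $K$ is fixed while $T$ varies, sequences of any length map to the same number of control points.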

Poster

Towards Human-Like Robot Handwriting via Contour-Aware Generation

Yutao Qin ⋅ Gang Dai ⋅ Yifan Zhang ⋅ Youwei Han ⋅ Qisheng He ⋅ Shuangping Huang

Jun 7, 11:45 AM - 1:45 PM ExHall F 277

Empowering machines to simulate human handwriting is a promising research direction. Most existing methods, however, primarily focus on reproducing the writing trajectory to capture the overall character structure, while neglecting the critical aspect of stroke contour modeling. Consequently, these methods struggle to generate visually realistic, human-like handwriting, limiting their applicability in scenarios such as calligraphy robots. To address this issue, we propose a new task, called Contour-aware Handwriting Trajectory Reconstruction (CHTR). This task presents two major challenges: 1) Existing handwriting datasets lack stroke contour annotations, making supervised learning difficult; 2) Previous methods are unable to recover stroke contour and preserve the overall character structure jointly. To address the dataset limitation, we present CHTR-110K, a large-scale character dataset with refined stroke contour annotations. To tackle the technical challenge, we propose Graph-based Handwriting Trajectory Reconstruction (G-HTR), a novel method using contour-aware graphs to jointly model stroke contour and character structure. We use a Graph Neural Network to capture structural relationships among nodes and introduce a multi-scale graph learning strategy to encode both fine-grained stroke details and global character structure. Extensive experiments verify the effectiveness of G-HTR, outperforming previous state-of-the-art methods on both our CHTR-110K and the widely-used CASIA-OLHWDB dataset. G-HTR further shows strong real-world results when deployed on robots, confirming its practical value. To support future research, we will release source code and dataset.

Poster

LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation

Xucong Wang ⋅ Pengkun Wang ⋅ Zhe Zhao ⋅ Liheng Yu ⋅ Rui Mao ⋅ Yang Wang

Jun 7, 3:30 PM - 5:30 PM ExHall A 277

Prompt Learning (PL) has emerged as a parameter-efficient technique for adapting Vision-Language Models (VLMs) to downstream tasks. However, almost all existing PL methods are designed and evaluated on well-curated datasets, overlooking a critical post-deployment phenomenon, i.e., the intrinsic connection between input resolution and storage-memory consumption. Specifically, to satisfy the stringent storage-memory constraints on edge devices, models are often limited to low-resolution inputs (e.g., $\le$ 224$\times$224 for CLIP-ViT/B-16) and generate fewer tokens (with the position embedding resized), which poses a unique challenge to performance robustness. To tackle this issue, we propose LOREAL, an efficient prompt self-distillation framework that learns resolution-invariant representations by excavating attribute semantics. At the heart of LOREAL is a dual-student architecture, i.e., two student models fed with inputs at different resolutions synergistically learn from each other. Building upon this, we contextualize the students' prompts with resolution-invariant attributes queried from the LLM, then leverage cross-modality meta-nets to generate attribute semantics. These meta-nets are bridged between the different encoders of the two students, wherein we introduce Low-Level Distillation (LLD) and High-Level Distillation (HLD) to facilitate the learning of more cross-resolution representations. Extensive experiments show that LOREAL significantly improves VLMs' performance and robustness under varied resolution settings, underscoring its practical utility.

Poster

Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

Muyang Li ⋅ Yucheng Liu ⋅ Jianbo Ma ⋅ Elliot Osborne ⋅ Bo Han ⋅ Tongliang Liu

Jun 6, 11:45 AM - 1:45 PM ExHall F 278

Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding raises a fundamental question: What factors of vision encoders matter in VLMs? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
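
For reference, the standard discrete form of the squared Gromov-Wasserstein distance is given below. The reading of the symbols is an assumption for this setting: $x_1, \dots, x_n$ would be embeddings from a vision encoder with weights $\mu$, $y_1, \dots, y_m$ embeddings from the language side with weights $\nu$, and $d_X$, $d_Y$ the within-modality distances.

$$
\mathrm{GW}_2^2(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \sum_{i,k=1}^{n} \sum_{j,l=1}^{m} \left| d_X(x_i, x_k) - d_Y(y_j, y_l) \right|^2 \pi_{ij} \, \pi_{kl}
$$

Here $\Pi(\mu, \nu)$ is the set of couplings $\pi \in \mathbb{R}_{\ge 0}^{n \times m}$ with row sums $\mu$ and column sums $\nu$. The objective compares only pairwise distances within each space, so the two embedding spaces may have different dimensions and need no shared coordinate system.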

Poster

GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents

Mengtian Li ⋅ Fan Yang ⋅ Ruixue Xiong ⋅ Yiyan Fan ⋅ Zhifeng Xie ⋅ Zeyu Wang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 278

Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens.

Poster

DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs

Nikhil Behari ⋅ Diego Rivero ⋅ Luke Apostolides ⋅ Suman Ghosh ⋅ Paul Pu Liang ⋅ Ramesh Raskar

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 281

Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space–time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.

Poster

Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models

Vishal Pramanik ⋅ Maisha Maliha ⋅ Susmit Jha ⋅ Alvaro Velasquez ⋅ Olivera Kotevska ⋅ Sumit Jha

Jun 7, 11:45 AM - 1:45 PM ExHall F 283

We study concept-level forgetting in pretrained vision models: removing an entire semantic category so the system no longer recognizes that object in unseen images and contexts, rather than merely forgetting specific training examples. Prior work either applies blunt global projections or fine-tunes parameters, which can introduce collateral damage to unrelated features, add compute, and become unstable as forgetting strength increases. We introduce Contrastive Subnet Erasure (CSE), a training-free, encoder-centric edit that targets a compact set of channels most responsible for the class and attenuates them in a calibrated manner. The modification is algebraically folded into the subsequent layer, yielding no inference-time overhead and leaving task heads unchanged. To evaluate whether forgetting generalizes beyond the data used to specify the class, we introduce a cross-dataset protocol in which the class is defined on a source dataset and performance is measured on a disjoint target dataset drawn from a different distribution with no shared images. This setup tests whether the model still fails to recognize the object when it looks different or appears in new scenes, and it avoids overfitting to patterns of the source dataset. Across CIFAR-10, CIFAR-100, and ImageNet under this protocol, CSE achieves stronger forgetting of the target class while better preserving non-target utility than existing baselines in both single-class and multi-class settings. Overall, CSE provides a simple, stable, and deployment-ready mechanism for class-level unlearning in vision.
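
The statement that the edit is "algebraically folded into the subsequent layer" can be illustrated with a minimal sketch. It assumes the attenuated channels feed directly into a linear layer with weight matrix `W`; the helper names `matvec` and `fold_channel_scales` and the scale values are hypothetical, and the paper's channel selection and calibration are not reproduced here.

```python
def matvec(W, h):
    """Plain matrix-vector product: one output per row of W."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def fold_channel_scales(W, scales):
    """Return W' whose column j is W's column j times scales[j].

    Then W' @ h equals W @ (scales * h), so the per-channel attenuation
    needs no extra operation at inference time.
    """
    return [[w * s for w, s in zip(row, scales)] for row in W]


# Hypothetical example: attenuate channel 1 to 25% of its activation.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
h = [1.0, -2.0, 0.5]
scales = [1.0, 0.25, 1.0]

explicit = matvec(W, [x * s for x, s in zip(h, scales)])  # scale, then apply W
folded = matvec(fold_channel_scales(W, scales), h)        # apply pre-scaled W
print(explicit, folded)
```

Both paths produce the same output, which is why a channel-wise attenuation of this kind adds no inference-time cost once the scales are absorbed into the next layer's weights.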

Poster

Vocabulary Scaling Law: Tuning Open-vocabulary Predictors for Their Openness

Ziliang Chen ⋅ Yulu Li ⋅ Liangda Fang ⋅ jusheng zhang ⋅ Yongsen Zheng ⋅ Quanlong Guan ⋅ Xipeng Chen

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 285

Open-vocabulary learning on CLIP provides remarkable generalization on diverse concepts but falters under realistic streaming open-world evaluations of Stability against distractor classes and Extensibility to novel classes. Current fine-tuning methods often fail these tests since they are mainly designed for closed-set conditions, leading to performance gaps as the target vocabulary progressively scales. We formalize a "vocabulary scaling law" showing that these openness measures can be lower-bounded by performance on the full class-name universe, implying that robust fine-tuning should: (i) account for the entire vocabulary, (ii) tune class-name embeddings rather than context, and (iii) enforce orthogonality between prompt embeddings, including those of training and open-set class names. Guided by our analysis, we propose Submodular-Vocabulary Fine-tuning (SVFT), a bi-level optimization framework that approximates the intractable objective of tuning all class-name embeddings by greedily selecting a small, informative subset of class names via constrained submodular maximization. This allows an efficient greedy algorithm to select a near-optimal class-name subset for fine-tuning CLIP instead of using all open classes. Across extensive experiments, SVFT consistently improves both stability and extensibility, advancing the openness and practical robustness of CLIP-based vision–language models.

Poster

Same or Not? Enhancing Visual Perception in Vision-Language Models

Damiano Marsili ⋅ Aditya Mehta ⋅ Ryan Y. ⋅ Georgia Gkioxari

Jun 6, 11:45 AM - 1:45 PM ExHall F 284

Vision–language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition (“Is it a cat or a dog?”) over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models.

Poster

Frequency-domain Manipulation for Face Obfuscation

Jintae Kim ⋅ Keunsoo Ko ⋅ Chang-Su Kim

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 285

Facial image datasets have become essential resources for various face analysis tasks, but their use raises significant privacy concerns. To address this issue, face obfuscation has emerged as a practical approach to hide identity from humans while retaining cues decipherable by machines. However, existing methods often leave exploitable visual traces, making them vulnerable to reconstruction attacks that restore hidden identity. To this end, we propose a frequency-domain manipulation framework, called FreM, which adjusts frequency subbands differently to hide identity, retain machine-decipherable cues, and improve robustness against reconstruction attacks. Specifically, FreM first decomposes a facial image into frequency subbands and applies subband-adaptive modulation that regulates information according to the characteristics of each subband. The modulation parameters are then refined to yield a reliable obfuscated result. Extensive experiments across multiple face analysis benchmarks demonstrate that FreM achieves superior obfuscation quality and strong robustness against reconstruction attacks. The source code will be made publicly available.

Poster

Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

Jooyeol Yun ⋅ Jaegul Choo

Jun 6, 11:45 AM - 1:45 PM ExHall F 285

Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision–language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance on which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.

Poster

Unified Vector Floorplan Generation via Markup Representation

Kaede Shiohara ⋅ Toshihiko Yamasaki

Jun 7, 3:30 PM - 5:30 PM ExHall A 287

Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next token prediction task. Leveraging FML, we develop a transformer-based generative model, Floorplan Markup Language Model (FMLM), capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.

Poster

LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

Soumyaratna Debnath ⋅ Bui Manh Duc ⋅ Zinan Liu ⋅ Lin Wang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 289

Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static. It is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is a Bio-inspired Adaptive Sampling Strategy (BASS), which enables us to design a Möbius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align the perceptual saliency with the textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering (VQA) benchmarks. The results show that LLMind achieves dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, the results reveal that LLMind can retain up to 82%, 92%, and 97% of the full-resolution performance with only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.

Poster

SPDMark: Selective Parameter Displacement for Robust Video Watermarking

Samar Fares ⋅ Nurbek Tastan ⋅ Karthik Nandakumar

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 291

The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced 'SpeedMark') based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.
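
The frame-level mechanism described above (hash-derived per-frame messages, then bipartite matching to recover frame order) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: SHA-256, the 64-bit message length, the `max_errors` threshold, and the use of Kuhn's augmenting-path algorithm are all choices made here for the example, and the function names are hypothetical.

```python
import hashlib


def frame_message(key: bytes, index: int, n_bits: int = 64) -> list:
    """Derive a per-frame bit message by hashing the base key with the frame index."""
    digest = hashlib.sha256(key + index.to_bytes(4, "big")).digest()
    bits = [(byte >> shift) & 1 for byte in digest for shift in range(8)]
    return bits[:n_bits]


def hamming(a, b):
    """Number of differing bit positions."""
    return sum(x != y for x, y in zip(a, b))


def recover_order(extracted, key, n_frames, max_errors=4):
    """Assign each extracted message to an original frame index.

    Builds a bipartite graph (extracted frame i -- index j when their
    messages differ in at most max_errors bits) and finds a maximum
    matching with Kuhn's augmenting-path algorithm.
    Returns order[i] = original index of extracted frame i, or -1.
    """
    n_bits = len(extracted[0])
    expected = [frame_message(key, j, n_bits) for j in range(n_frames)]
    adj = [[j for j in range(n_frames) if hamming(msg, expected[j]) <= max_errors]
           for msg in extracted]
    owner = [-1] * n_frames  # owner[j] = extracted frame matched to index j

    def try_assign(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if owner[j] == -1 or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(len(extracted)):
        try_assign(i, set())

    order = [-1] * len(extracted)
    for j, i in enumerate(owner):
        if i != -1:
            order[i] = j
    return order
```

In this toy, a shuffled clip whose extracted messages contain a few bit errors still maps back to its original frame indices, and a frame whose message matches no index within the threshold is reported as `-1`, which hints at how temporal tampering could be localized.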

Poster

HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning

Zihao Peng ⋅ Nan Zou ⋅ Jiandian Zeng ⋅ Guo Li ⋅ Ke Chen ⋅ Boyuan Li ⋅ Tian Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 291

Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication-heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering generalization to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA’s design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.

Poster

Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning

Xinghao Wu ⋅ Jianwei Niu ⋅ Xuefeng Liu ⋅ Guogang Zhu ⋅ Jiayuan Zhang ⋅ Shaojie Tang ⋅ Wei Chen

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 292

Federated Prototype Learning (FedCL) has emerged as an effective strategy for handling data heterogeneity in Federated Learning (FL). In FedCL, clients collaboratively construct a set of global feature centers (prototypes), and let local features align with these prototypes to mitigate the effects of data heterogeneity. The performance of FedCL highly depends on the quality of prototypes. Existing methods assume that larger inter-class distances among prototypes yield better performance, and thus design different methods to increase these distances. However, we observe that while these methods increase prototype distances to enhance class discrimination, they inevitably disrupt essential semantic relationships among classes, which are crucial for model generalization. This raises an important question: how to construct prototypes that inherently preserve semantic relationships among classes? Directly learning these relationships from limited and heterogeneous client data can be problematic in FL. Recently, the success of pre-trained language models (PLMs) demonstrates their ability to capture semantic relationships from vast textual corpora. Motivated by this, we propose FedTSP, a novel method that leverages PLMs to construct semantically enriched prototypes from the textual modality, enabling more effective collaboration in heterogeneous data settings. We first use a large language model (LLM) to generate fine-grained textual descriptions for each class, which are then processed by a PLM on the server to form textual prototypes. To address the modality gap between client image models and the PLM, we introduce trainable prompts, allowing prototypes to adapt better to client tasks. Extensive experiments demonstrate that FedTSP mitigates data heterogeneity while significantly accelerating convergence.

Poster

The Invisible Gorilla Effect in Out-of-distribution Detection

Harry Anthony ⋅ Ziyun Liang ⋅ Hermione Warr ⋅ Konstantinos Kamnitsas

Jun 7, 3:30 PM - 5:30 PM ExHall A 292

Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model’s ROI and drops when it does not, a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show that the Mahalanobis Score method achieves a 31.5% higher AUROC when detecting OOD red ink annotations (similar to the ROI) than black ink annotations (dissimilar). We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations will be released upon acceptance.

Poster

FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning

Zhiqiang Kou ⋅ Junxiang Wu ⋅ Wenke Huang ⋅ Wenwen He ⋅ Ming-Kun Xie ⋅ Changwei Wang ⋅ Yuheng Jia ⋅ Di Jiang ⋅ Yang Liu ⋅ Xin Geng ⋅ Qiang Yang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 293

Multi-label representations encode higher-order label dependencies, yet in federated settings the local estimates of these dependencies are statistically inconsistent, causing structural drift across clients and rendering naive quantity-weighted aggregation suboptimal. We propose FedHarmony, a federated multi-label learning framework that harmonizes heterogeneous label correlations without sharing raw data. A Correlation Expert is formed by leave-one-out consolidation of clients’ label–label correlation statistics to provide a round-wise global consensus. Guided by this expert, each client performs consensus-guided correction that aligns its local correlation to the consensus within clusters of strongly related labels obtained via spectral clustering of the expert matrix. This block-wise alignment targets dense, high-signal subspaces. We establish two guarantees: (i) restricting alignment to in-cluster pairs strictly improves optimization curvature and linear convergence rate; (ii) ignoring cross-cluster entries incurs only a bounded, quantitatively small information loss when the consensus is near block-diagonal. Finally, a correlation-aware central aggregation combines data quantity with a dynamic measure of correlation learning quality, using a dynamic balance factor that transitions from quantity-driven weighting in early rounds to structure-driven weighting later. Extensive experiments under diverse non-IID regimes (varying label distributions, client heterogeneity, and client counts) show consistent gains over federated baselines in mAP/F1/Hamming Loss, with improved stability and communication efficiency.

Poster

Fully Decentralized Certified Unlearning

Hithem Lamri ⋅ Michail Maniatakos

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 293

Machine unlearning (MU) seeks to remove the influence of specified data from a trained model in response to privacy requests or data poisoning. While certified unlearning has been analyzed in centralized and server-orchestrated federated settings (via guarantees analogous to differential privacy, DP), the decentralized setting, where peers communicate without a coordinator, remains underexplored. We study certified unlearning in decentralized networks with fixed topologies and propose a random-walk procedure that performs one projected gradient ascent step on the forget set at the unlearning client and a geometrically distributed number of projected descent steps on the retained data elsewhere, combined with subsampled Gaussian noise and projection onto a trust region around the original model. We provide (i) convergence guarantees in the convex case and stationarity guarantees in the nonconvex case, (ii) $(\varepsilon,\delta)$ network-unlearning certificates on client views via subsampled Gaussian Rényi DP (RDP) with segment-level subsampling, and (iii) deletion-capacity bounds that scale with the forget-to-local data ratio and quantify the effect of decentralization (network mixing and randomized subsampling) on the privacy–utility trade-off. Empirically, on image benchmarks (MNIST, CIFAR-10), our method matches a given $(\varepsilon,\delta)$ while achieving higher test accuracy than decentralized DP baselines and reducing forget accuracy to random guessing ($\approx 10\%$).

Poster

Designing to Forget: Deep Semi-parametric Models for Unlearning

Amber Yija Zheng ⋅ YU-SHAN TAI ⋅ Raymond A. Yeh

Jun 6, 11:45 AM - 1:45 PM ExHall F 294

Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve task performance competitive with parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by $11\%$ and achieve over $10\times$ faster unlearning compared to existing approaches on parametric models.

Poster

Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift

Heewon Park ⋅ Mugon Joe ⋅ Miru Kim ⋅ Kyungjin Im ⋅ Minhae Kwon

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 294

Federated learning (FL) in post-deployment settings must adapt to non-stationary data streams across heterogeneous clients without access to ground-truth labels. A major challenge is learning rate selection under client-specific, time-varying distribution shifts, where fixed learning rates often lead to underfitting or divergence. We propose Fed-ADE (Federated Adaptation with Distribution Shift Estimation), an unsupervised federated adaptation framework that leverages lightweight estimators of distribution dynamics. Specifically, Fed-ADE employs uncertainty dynamics estimation to capture changes in predictive uncertainty and representation dynamics estimation to detect covariate-level feature drift, combining them into a per-client, per-timestep adaptive learning rate. We provide theoretical analyses showing that our dynamics estimation approximates the underlying distribution shift and yields dynamic regret and convergence guarantees. Experiments on image and text benchmarks under diverse distribution shifts (label, covariate, and concept) demonstrate consistent improvements over strong baselines. These results highlight that distribution shift-aware adaptation enables effective and robust federated post-adaptation under real-world non-stationarity.

Poster

OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning

Hengrui Kang ⋅ Zhuangcheng Gu ⋅ Zhiyuan Zhao ⋅ Zichen Wen ⋅ Bin Wang ⋅ Weijia Li ⋅ Conghui He

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 296

Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, layout generation, remains underexplored. Distinct from traditional graphic layout design and room layout planning, document layout generation typically involves a larger number of elements per page and exhibits greater structural diversity and complexity. Currently, a major obstacle lies in the scarcity of diverse document layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniDocLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniDocLayout-LLM, a 0.5B model with a two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from our dataset with coarse category definitions, and 2) transferring the knowledge to a specific domain with a few fine-grained annotated samples. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains of the M$^6$Doc dataset, substantially surpassing both existing layout generation experts and several recent general-purpose LLMs. Our code, dataset, and models will be publicly released.

Poster

When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Ye Leng ⋅ Junjie Chu ⋅ Mingjie Li ⋅ Chenhao Lin ⋅ Chao Shen ⋅ Michael Backes ⋅ Yun Shen ⋅ Yang Zhang

Jun 7, 3:30 PM - 5:30 PM ExHall A 297

Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Our work shows that MLLMs pair usability with higher risks, highlighting the need for adaptive safeguards to mitigate real-world harms. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLM-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.

Poster

A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images

Sungik Choi ⋅ Hankook Lee ⋅ Jaehoon Lee ⋅ Robin Kim ⋅ Stanley Jungkyu Choi ⋅ Moontae Lee

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 300

As recent AI models can generate high-resolution photorealistic images, detecting whether an image is AI-generated has become socially important. Since training data for the detection task is often not available due to the diversity of generative models, training-free detection approaches are of practical interest. A common approach is to utilize the image-level reconstruction error from the latent diffusion model (LDM). However, we find this score suffers from instance-specific biases, particularly in images with simple backgrounds. To this end, we propose a novel image-level debiasing score function that cancels out background contribution by normalizing the reconstruction error on the augmented images with similar background information. Specifically, we show that rotation and low-pass filtering are effective augmentation strategies. To promote generalization to broader generative models, we newly explore latent-level reconstruction error as an additional training-free signal. However, we observe that the latent-level score also suffers from latent-specific bias. To mitigate this, we introduce a rotation-based latent-level debiasing score based on the normalization of the rotated latent. We combine the aforementioned scores into a single debiasing score, RDD, which achieves state-of-the-art training-free detection performance across diverse generative models. Furthermore, our framework can be robust to corruption of the examined images.

View full details

Poster

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Shihao Wang ⋅ Guo Chen ⋅ De-An Huang ⋅ Zhiqi Li ⋅ Minghan LI ⋅ Guilin Liu ⋅ Jan Kautz ⋅ Jose M. Alvarez ⋅ Lei Zhang ⋅ Zhiding Yu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 299

While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, how to efficiently select the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short in handling complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address the above challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework aiming to adaptively customize frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned captions, retrieving relevant video segments, and selecting key frames to enable efficient supervision. Using VidThinker, we build the VideoITG-40K dataset with 40K videos and 500K temporal grounding annotations. Our plug-and-play VideoITG model leverages Video-LLMs’ visual-language alignment and reasoning for discriminative frame selection. VideoITG consistently boosts the performance on multiple multimodal video understanding benchmarks, demonstrating its effectiveness and potential.

View full details

Poster

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

Yuhua Wang ⋅ Qinnan Zhang ⋅ Xiaodong Li ⋅ Huan Zhang ⋅ Yifan Sun ⋅ Wangjie Qiu ⋅ Hainan Zhang ⋅ Yongxin Tong ⋅ Zhiming Zheng

Jun 7, 3:30 PM - 5:30 PM ExHall A 301

Prototype-based Personalized Federated Learning (ProtoPFL) enables efficient cross-domain adaptation by communicating compact class prototypes, but directly sharing prototypes raises privacy risks. A common defense involves per-example $\ell_2$ clipping before prototype computation to limit sensitivity, followed by the addition of isotropic Gaussian noise during upload to enforce Local Differential Privacy (LDP). However, this Isotropic Gaussian Prototype Perturbation (IGPP) often over-perturbs key discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. We propose VPDR, a client-side privacy plug-in that can be seamlessly integrated into existing ProtoPFL frameworks. Motivated by the statistical prior that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which uses groupwise calibration to apply less noise to discriminative subspaces, preserving semantic separability while ensuring privacy. We further design Distillation-guided Clipping Regularization (DCR), which enables feature norms to adaptively concentrate near the predefined clipping threshold while maintaining prediction consistency. Theoretical analysis shows that our groupwise noise provides privacy guarantees no weaker than those of the isotropic mechanism under the same privacy constraints. Extensive experiments on multiple cross-domain benchmarks demonstrate that VPDR achieves a superior privacy-utility trade-off, outperforming IGPP in personalized federated fine-tuning while maintaining strong privacy protection under realistic attack scenarios.
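
The isotropic baseline that the abstract calls IGPP can be sketched in a few lines: clip each example's feature vector to a fixed $\ell_2$ norm, average the clipped vectors into a class prototype, then add Gaussian noise with the same scale on every dimension. This is a minimal illustration only. The names `clip_l2` and `noisy_prototype`, the toy feature values, and the choices `clip_norm=1.0` and `sigma=0.1` are assumptions made here; in particular, `sigma` is not calibrated to any privacy budget, and the variance-adaptive mechanism (VPP) proposed in the paper is not reproduced.

```python
import math
import random

def clip_l2(vec, clip_norm):
    """Rescale vec so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def noisy_prototype(features, clip_norm, sigma, rng):
    """Mean of per-example clipped features plus isotropic Gaussian noise."""
    clipped = [clip_l2(f, clip_norm) for f in features]
    dim = len(clipped[0])
    mean = [sum(f[j] for f in clipped) / len(clipped) for j in range(dim)]
    # The same noise scale sigma is used on every dimension ("isotropic").
    # sigma is a hypothetical value, not derived from a privacy budget.
    return [m + rng.gauss(0.0, sigma) for m in mean]

rng = random.Random(0)
# Toy features for a single class: 32 examples, 8 dimensions (assumed sizes).
features = [[rng.gauss(0.0, 5.0) for _ in range(8)] for _ in range(32)]
prototype = noisy_prototype(features, clip_norm=1.0, sigma=0.1, rng=rng)
```

Because every dimension receives the same noise scale, dimensions that separate classes well are perturbed as strongly as uninformative ones, which is the over-perturbation the abstract attributes to IGPP.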

View full details

Poster

From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity

Zhuang Qi ⋅ Yingpeng Tang ⋅ Lei Meng ⋅ Guoqing Chao ⋅ Lei Wu ⋅ Han Yu ⋅ Xiangxu Meng

Jun 6, 11:45 AM - 1:45 PM ExHall F 302

Exemplar replay has become an effective strategy for mitigating catastrophic forgetting in federated continual learning (FCL) by retaining representative samples from past tasks. Existing studies focus on designing sample-importance estimation mechanisms to identify information-rich samples. However, they typically overlook strategies for effectively utilizing the selected exemplars, which limits their performance under continual dynamic heterogeneity across clients and tasks. To address this issue, this paper proposes a federated geometry-aware correction method, termed FEAT, which alleviates imbalance-induced representation collapse that drags rare-class features toward frequent classes across clients. Specifically, it consists of two key modules: 1) the Geometric Structure Alignment module performs structural knowledge distillation by aligning the pairwise angular similarities between feature representations and their corresponding Equiangular Tight Frame prototypes, which are fixed and shared across clients to serve as a class-discriminative reference structure. This encourages geometric consistency across tasks and helps mitigate representation drift; 2) the Energy-based Geometric Correction module removes task-irrelevant directional components from feature embeddings, which reduces prediction bias toward majority classes. This improves sensitivity to minority classes and enhances the model's robustness under class-imbalanced data distributions. Extensive experiments on three benchmark datasets demonstrate that FEAT achieves a substantial 4%–8% improvement in Top-1 accuracy compared to nine state-of-the-art methods.
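
To make the notion of fixed Equiangular Tight Frame prototypes concrete, the sketch below builds a simplex ETF, one common construction in which all $K$ class prototypes have unit norm and every pair has cosine similarity $-1/(K-1)$. Whether FEAT uses exactly this construction is an assumption; the abstract only says the prototypes form an ETF and are shared across clients. The function names and the choice of embedding dimension equal to the number of classes are also illustrative.

```python
import math

def simplex_etf(num_classes):
    """Return num_classes unit vectors (rows) forming a simplex ETF.

    Assumed construction: sqrt(K / (K - 1)) * (I - (1 / K) * ones), with the
    embedding dimension set equal to the number of classes K for simplicity.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

prototypes = simplex_etf(5)
# Every distinct pair of prototypes has the same cosine similarity.
pairwise = [cosine(prototypes[i], prototypes[j])
            for i in range(5) for j in range(i + 1, 5)]
```

Since the prototypes depend only on the number of classes, every client can construct the same reference structure locally without communicating it.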

View full details

Poster

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Tao Chen ⋅ Kun Zhang ⋅ Qiong Wu ⋅ Xiao Chen ⋅ Chao Chang ⋅ Xiaoshuai Sun ⋅ Yiyi Zhou ⋅ Rongrong Ji

Jun 7, 11:45 AM - 1:45 PM ExHall F 303

Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanism, and propose a novel and training-free approach, termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic the human behavior of watching video, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite lengths, unlike previous methods that process all video information at once and have an upper limit on input length. Concretely, FlexMem first considers the visual KV caches as the memory sources, and realizes effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for the diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs, and conduct extensive experiments on five long video tasks and one streaming video task. The experimental results show that on a single 3090 GPU, our FlexMem can achieve clear improvements over existing efficient video understanding methods and process more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs on some benchmarks, e.g., GPT-4o and Gemini-1.5 Pro. Our code project is given in the supplementary materials.

View full details

Poster

Unsafe2Safe: Controllable Image Anonymization for Downstream Utility

Minh Dinh ⋅ SouYoung Jin

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 307

Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across Caltech101 and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.

View full details

Poster

CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models

Chenxi Du ⋅ Yongheng Deng ⋅ Jiani Liu ⋅ Yujia Zhang ⋅ Xi Chen ⋅ Ju Ren

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 309

Large Multimodal Models (LMMs) have shown remarkable success in visual understanding tasks. LMMs encode visual and textual inputs into tokens, which are then processed by Large Language Models (LLMs). However, the large number of visual tokens poses a major bottleneck for inference efficiency and memory usage. Reducing visual tokens is a promising training-free solution, but existing methods remain limited. Importance-based approaches suffer from poor generalization, are incompatible with kernel-level inference optimizations, and only consider information from a single modality. Diversity-based strategies typically focus on pairwise token redundancy and treat all tokens as equally important. Recent attempts to sequentially combine importance and diversity criteria still fail to address the intrinsic drawbacks of their underlying metrics. To address these limitations, we reformulate visual token reduction as an optimal subset selection problem jointly guided by two complementary objectives: informativeness and coverage. Informativeness is quantified through per-token intrinsic saliency and visual–textual alignment, while coverage is enforced via a volume-based subset selection criterion that ensures global representativeness in the visual feature space. This joint formulation effectively integrates visual saliency, cross-modal alignment, and global coverage in an end-to-end token selection process, yielding a computationally efficient, model-agnostic framework compatible with modern inference accelerators. Extensive experiments demonstrate that CoIn substantially reduces computation and memory cost while maintaining strong task performance. We will release our code once accepted.

View full details

Poster

FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction

Jiaqi Liu ⋅ Zihan Tan ⋅ Guancheng Wan ⋅ Wenke Huang ⋅ He Li ⋅ Mang Ye

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 312

Federated Graph Learning (FGL) has emerged as a principled framework for decentralized training of Graph Neural Networks (GNNs) while preserving data privacy. In subgraph-FL scenarios, however, structural noise arising from data collection and storage can damage the GNN message-passing scheme of clients, leading to conflicts in collaboration. Existing approaches exhibit two critical limitations: 1) Globally, they fail to identify corrupted clients, causing destructive knowledge inconsistencies. 2) Locally, the global GNN performs poorly on these clients due to structural noise, limiting their ability to benefit from federated collaboration. To address these challenges, we propose $\textbf{FedSDR}$, a spectra-based FGL framework against high-structural-noise scenarios. Specifically, Structural Noise-Aware Aggregation (SNAA) introduces a noise evaluation metric to detect corrupted clients and reduce their contributions, thereby mitigating the impact of noise on the global GNN. Furthermore, Robust Local Structure Reconstruction (RLSR) leverages the knowledge from the healthy global model to repair locally corrupted graph structures. Extensive experiments demonstrate that FedSDR outperforms state-of-the-art methods across various scenarios under structural noise.

View full details

Poster

HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

Masatoshi Tateno ⋅ Gido Kato ⋅ Hirokatsu Kataoka ⋅ Yoichi Sato ⋅ Takuma Yagi

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 319

Hand–object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focused either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs require recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.

View full details

Poster

One Layer’s Trash is Another Layer’s Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs

Yongru Chen ⋅ Kai Zhang ⋅ Zeliang Zong ⋅ Yuchen Lu ⋅ Wenming Tan ⋅ Ye Ren ⋅ Jilin Hu

Jun 6, 11:45 AM - 1:45 PM ExHall F 319

Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
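
The routing idea described above, where selected tokens pass through a layer, the remaining tokens skip it, and both streams are merged back in their original order, can be sketched as follows. This is an illustration under simplifying assumptions, not the ALVTS implementation: `route_tokens` and its arguments are hypothetical names, the importance scores are given rather than produced by the paper's low-rank selector, and the stand-in layer acts on each token independently, whereas a real transformer layer mixes information across the selected tokens.

```python
def route_tokens(tokens, scores, keep, layer_fn):
    """Apply layer_fn to the `keep` highest-scoring tokens; the rest skip the layer.

    Returns a list with the same length and order as `tokens`.
    """
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:keep])
    # Reintegration: processed and skipped tokens return to their original
    # positions, so a later layer can still select a token skipped here.
    return [layer_fn(t) if i in selected else t for i, t in enumerate(tokens)]

# Toy example with four 2-dimensional tokens and hypothetical scores.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
scores = [0.1, 0.9, 0.3, 0.8]

def double(token):
    """Stand-in for a transformer layer (assumption for illustration)."""
    return [2.0 * x for x in token]

out = route_tokens(tokens, scores, keep=2, layer_fn=double)
```

Unlike static pruning, no token is discarded: the sequence length is unchanged after the layer, and only the compute spent inside the layer shrinks.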

View full details

Poster

SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World

Jungho Kim ⋅ Jiyong Oh ⋅ Seunghoon Yu ⋅ Hongjae Shin ⋅ Donghyuk Kwak ⋅ Jun Won Choi

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 319

The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.6% driving score.

View full details

Poster

Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning

Rui Zhao ⋅ Bin Shi ⋅ Kai Sun ⋅ Bo Dong

Jun 7, 3:30 PM - 5:30 PM ExHall A 324

Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive theoretical and experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance.

View full details

Poster

CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Jiacheng Tang ⋅ Zhiyuan Zhou ⋅ Zhuolin He ⋅ Jia Zhang ⋅ Kai Zhang ⋅ Jian Pu

Jun 7, 11:45 AM - 1:45 PM ExHall F 326

Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS first constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby purifying the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method also demonstrates superior robustness against both data bias and noisy scenarios specifically configured to induce causal confusion. We will release our code upon paper acceptance.

View full details

Poster

The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery

Haiyang Zheng ⋅ Nan Pu ⋅ Yaqi Cai ⋅ Teng Long ⋅ Wenjing Li ⋅ Nicu Sebe ⋅ Zhun Zhong

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 329

Generalized Category Discovery (GCD) aims to categorize unlabeled samples that may belong to either known or unknown categories by leveraging the knowledge from labeled data. Most previous methods jointly optimize supervised and unsupervised objectives and achieve promising results. However, inherent optimization interference still limits their ability to improve further. Through quantitative analysis, we identify a key issue, *i.e.*, **gradient entanglement**, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation-subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy-Aware Gradient Coordinator (EAGC), a plug-and-play gradient-level module that explicitly regulates the optimization process. EAGC comprises two components: Anchor-based Gradient Alignment (AGA) and Energy-aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, preserving the discriminative structure of known classes against the interference of unlabeled gradients. EEP softly projects unlabeled gradients onto the complement of the known-class subspace and derives an energy-based coefficient to adaptively scale the projection for each unlabeled sample according to its degree of alignment with the known subspace, thereby reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes. EAGC can be seamlessly integrated with both parametric and non-parametric GCD methods. Experiments show that EAGC consistently boosts existing approaches and establishes new state-of-the-art results on multiple GCD benchmarks.

View full details

Poster

Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation

Ting Yang ⋅ Qilong Wang ⋅ Qibin Hou ⋅ Qinghua Hu

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 328

The rise of vision-language models (VLMs) has driven the initial exploration of open-vocabulary remote sensing image semantic segmentation (OVRSIS), enabling recognition of unseen categories in complex Earth observation scenes. However, existing methods primarily focus on enhancing visual representations of domain-specific remote sensing images, while overlooking the effect of textual information. In this paper, we argue that there exists a crucial issue of textual ambiguity in the OVRSIS task, limiting the final segmentation performance. Therefore, we propose a plug-and-play yet effective Test-time Multi-Prompt Adaptation (TMPA) method to mitigate textual ambiguity in OVRSIS. Specifically, our TMPA first generates a group of diverse, context-aware descriptions for each category instead of the naive class name by executing a large language model with a task-driven prompt, which can effectively avoid some textual ambiguity, e.g., the background class has different meanings in various tasks. Furthermore, TMPA develops a visual-guided test-time adaptation strategy for the generated multi-prompts, which adaptively refines the prompt representations of each category with high-confidence visual features for the uncertain predictions with high entropy, making our TMPA better applicable to different scenarios. Particularly, a pixel-level loss with entropy minimization is proposed to optimize the text prompt with a bias during inference, where prompt bias is constructed based on a weighted combination of high-confidence visual features. Our TMPA can be flexibly integrated into existing methods for boosting their performance. Extensive experiments are conducted on 17 remote sensing datasets, and the results show our TMPA can significantly improve its counterparts, while achieving state-of-the-art performance.

View full details

Poster

Plug-and-Play Incomplete Multi-View Clustering via Janus-Faced Affinity Learning with Topology Harmonization

Shengju Yu ⋅ Suyuan Liu ⋅ Wenhao SHAO ⋅ Siwei Wang ⋅ KE LIANG ⋅ Xihong Yang ⋅ Tiejun Li ⋅ Xinwang Liu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 332

Prevailing incomplete multi-view clustering (IMVC) approaches typically fail to account for the interference of view-exclusive artifacts when learning view-consensus representations, which could compromise the fidelity of the resulting similarity measure. Moreover, inconsistencies in anchor order across views may distort the graph structure, impairing the clustering performance. The reliance on carefully-tuned regularization hyper-parameters also usually undermines the model's practical utility. To alleviate these issues, we propose a plug-and-play IMVC framework named PJFTH that incorporates Janus-faced affinity learning with topology harmonization. It explicitly models the exclusive-to-consensus interplay, derives a view-private graph from each view, and adaptively integrates them into a global consensus affinity according to the respective view's intrinsic characteristics. Furthermore, a permutation transformation with unary encoding constraints is applied to the anchor matrix, realigning anchor topology while preserving the values. This process synchronizes anchor order prior to similarity integration and maintains original anchor properties. Notably, all components are coupled seamlessly and optimized in a joint manner. Also, the provable overall linear complexity further enhances its scalability and practicality. Experimental results confirm that PJFTH achieves competitive performance compared to several leading methods.

View full details

Poster

Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

Tolga Dimlioglu ⋅ Nadine Chang ⋅ Maying Shen ⋅ Rafid Mahmood ⋅ Jose M. Alvarez

Jun 6, 11:45 AM - 1:45 PM ExHall F 331

Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models, and correspondingly the training data, must address the different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80% less data.
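
Step (iii) of the framework above, iteratively adding data from whichever domain is predicted to improve the metric most, can be illustrated with a small greedy loop. Everything specific here is an assumption for illustration: the saturating power-law form in `predicted_score`, the domain names, the fitted coefficients, and the use of a single scalar metric (the paper evaluates an aggregate of several). The sketch also assumes the scaling laws have already been fitted in step (ii).

```python
def predicted_score(n, a, b):
    """Hypothetical saturating power law: score gained from n samples of a domain."""
    return a * (1.0 - (1.0 + n) ** (-b))

def marginal_gain(n, step, a, b):
    """Predicted score increase from adding `step` more samples to a domain."""
    return predicted_score(n + step, a, b) - predicted_score(n, a, b)

def greedy_mixture(laws, budget, step):
    """Allocate `budget` samples across domains in chunks of `step`.

    laws maps a domain name to its (a, b) coefficients, assumed already fitted.
    """
    counts = {name: 0 for name in laws}
    for _ in range(budget // step):
        # Pick the domain whose next chunk is predicted to help the most.
        best = max(laws, key=lambda d: marginal_gain(counts[d], step, *laws[d]))
        counts[best] += step
    return counts

# Made-up domains and coefficients, for illustration only.
laws = {"urban": (1.0, 0.5), "highway": (0.6, 0.3), "parking": (0.2, 0.8)}
mixture = greedy_mixture(laws, budget=1000, step=100)
```

Because each assumed law saturates, a domain's marginal gain shrinks as it accumulates samples, so the loop naturally spreads the budget across domains instead of exhausting it on one.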

View full details

Poster

ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP

Xin Niu ⋅ Manqi Zhao ⋅ Dongsheng Jiang ⋅ Yingying Wu ⋅ Bing Su

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 331

Remote sensing image segmentation is critical for a range of applications, including natural disaster monitoring and precision agriculture. Open-vocabulary segmentation enhances flexibility by removing fixed category constraints, enabling more fine-grained and adaptive scene understanding. Unlike CLIP’s original pretraining objective, which emphasizes global image-text alignment, segmentation tasks require accurate and discriminative patch-level representations to support precise pixel-wise predictions. As a result, the quality of attention maps, particularly those generated in the final transformer layers, plays a pivotal role in guiding inter-region interactions. However, current methods generate suboptimal representations when capturing the complex spatial hierarchies in remote sensing. We address this gap by optimizing CLIP's 197×197 attention matrix through three key modifications: (1) substituting the 196×196 patch-to-patch submatrix with intermediate-layer feature similarities to preserve spatial structures; (2) prioritizing intermediate-layer attention for global-to-local (class-to-patch) token alignment to reduce classification interference; (3) disabling the [CLS] token's self-attention to mitigate bias. Extensive experiments on eight remote sensing benchmarks and two building/road extraction datasets demonstrate that our method achieves state-of-the-art performance among existing training-free approaches.
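
Modifications (1) and (3) above can be sketched on a toy attention matrix. With one [CLS] token at index 0 followed by $P$ patch tokens, the matrix is $(1+P)\times(1+P)$, which gives 197×197 for $P = 196$. The sketch below overwrites the patch-to-patch block with cosine similarities of intermediate-layer patch features and masks the [CLS] self-attention entry before a row-wise softmax. The function names, the use of cosine similarity, applying the changes to pre-softmax logits, and the tiny $P = 3$ example are assumptions made here. Modification (2) is omitted, and no temperature is applied to reconcile the similarity scale with the remaining logits.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def redefine_attention(logits, patch_feats):
    """Return row-normalized attention after modifications (1) and (3).

    logits: (1+P) x (1+P) attention logits with [CLS] at index 0.
    patch_feats: P intermediate-layer patch features (hypothetical input).
    """
    p = len(patch_feats)
    new = [row[:] for row in logits]
    # (1) Replace the patch-to-patch block with feature similarities.
    for i in range(p):
        for j in range(p):
            new[i + 1][j + 1] = cosine(patch_feats[i], patch_feats[j])
    # (3) Disable [CLS] self-attention by masking its diagonal entry.
    new[0][0] = -math.inf
    return [softmax(row) for row in new]

# Toy example with P = 3 patches instead of 196.
logits = [[0.0] * 4 for _ in range(4)]
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = redefine_attention(logits, patch_feats)
```

After masking, the [CLS] row distributes all of its attention over patch tokens, while each patch row weights other patches by how similar their intermediate features are.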

View full details

Poster

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Ehsan Ahmadi ⋅ Hunter Schofield ⋅ Behzad Khamidehi ⋅ Fazel Arasteh ⋅ Jinjun Shan ⋅ Lili Mou ⋅ Dongfeng Bai ⋅ Kasra Rezaee

Jun 7, 3:30 PM - 5:30 PM ExHall A 331

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism enhancement, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning.

View full details

Poster

CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography

Gasser Elazab ⋅ Frank Neuhaus ⋅ Tilman Koß ⋅ Malte Splietker ⋅ Aditya Date ⋅ Michael Unterreiner ⋅ Maximilian Jansen ⋅ Olaf Hellwich

Jun 6, 11:45 AM - 1:45 PM ExHall F 333

Autonomous driving must operate reliably across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans ~110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we introduce a standardized evaluation protocol for road surface irregularities and a stereo-guided depth completion variant that achieves leading performance on CARD. Moreover, we benchmark state-of-the-art depth estimation models to establish strong baselines. We host CARD on Hugging Face with an open source SDK and standardized splits to enable public leaderboards and reproducible evaluation.

View full details

Poster

Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features

Junbo Ke ⋅ Yangyang Xu ⋅ Chao Wang ⋅ You-Wei Wen

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 337

Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods.
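
A short worked identity may help show why an element-wise (Hadamard) product of Fourier features can synthesize new frequencies. This is the standard product-to-sum rule, not a formula taken from the paper; $\omega_1$ and $\omega_2$ are illustrative frequencies:

$$\sin(\omega_1 x)\,\sin(\omega_2 x) = \tfrac{1}{2}\left[\cos\big((\omega_1 - \omega_2)\,x\big) - \cos\big((\omega_1 + \omega_2)\,x\big)\right]$$

Multiplying two sinusoidal features therefore produces the sum and difference frequencies directly, rather than leaving the MLP to compose them across layers.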

View full details

Poster

Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation

Yaowen Chang ⋅ Zhen Cao ⋅ Xu Zheng ⋅ Xiaoxin Mi ⋅ Zhen Dong

Jun 7, 11:45 AM - 1:45 PM ExHall F 336

Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performance on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.

View full details

Poster

ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving

Han Lu ⋅ Xiaosong Jia ⋅ Yichen Xie ⋅ Siyu Sun ⋅ Wenlong Liao ⋅ Xiaokang Yang ⋅ Junchi Yan

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 338

End-to-end differentiable learning has emerged as a prominent paradigm in autonomous driving (AD). A significant bottleneck in this approach is its substantial demand for high-quality labeled data, such as 3D bounding boxes and semantic segmentation, which are especially expensive to annotate manually. This challenge is exacerbated by the long-tailed distribution in AD datasets, where a substantial portion of the collected data might be trivial (e.g., simply driving straight on a straight road) and only a minority of instances are critical to safety. In this paper, we propose ActiveAD, a planning-oriented active learning strategy designed to enhance sampling and labeling efficiency in end-to-end autonomous driving. ActiveAD progressively annotates parts of collected raw data based on our newly developed metrics. We design innovative diversity metrics to enhance initial sample selection, addressing the cold-start problem. Furthermore, we develop uncertainty metrics to select valuable samples for the ultimate purpose of route planning during subsequent batch selection. Empirical results demonstrate that our approach significantly surpasses traditional active learning methods. Remarkably, our method achieves results comparable to state-of-the-art end-to-end AD methods while using only 30% of the data in both open-loop nuScenes and closed-loop CARLA evaluation.

View full details

Poster

Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction

Hao Zhou ⋅ Lu Qi ⋅ Xiangtai Li ⋅ Jie Zhang ⋅ Yi Liu ⋅ Xu Yang ⋅ Mingyu Fan ⋅ Fei Luo

Jun 6, 11:45 AM - 1:45 PM ExHall F 337

Trajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed-length observations. However, real-world driving often yields variable-length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a $\textbf{P}$rogressive $\textbf{R}$etrospective $\textbf{F}$ramework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features. Moreover, we propose a Rolling-Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug-and-play with existing methods. Extensive experiments on datasets Argoverse 2 and Argoverse 1 demonstrate the effectiveness of PRF. Code will be released.

View full details

Poster

TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation

Qingwen Zhang ⋅ Chenhan Jiang ⋅ Xiaomeng Zhu ⋅ Yunqi Miao ⋅ Yushan Zhang ⋅ Olov Andersson ⋅ Patric Jensfelt

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 339

Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals. In this paper, we present TeFlow, which enables multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames. Extensive evaluations demonstrate that TeFlow establishes a new state of the art for self-supervised feed-forward methods, achieving performance gains of up to 33% on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods while running 150 times faster.

View full details

Poster

SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data

Zixuan Duan ⋅ Zeyu Zhang ⋅ Fengyuan Lu ⋅ Shaofeng Zhang ⋅ Wenbin Li ⋅ Qi Fan ⋅ Yang Gao

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 338

Existing Incremental Learning (IL) methods are primarily evaluated under either a single-domain class-incremental setting, or a multi-domain task-incremental setting with known task identifiers. However, these assumptions often fail to hold in real-world applications. To bridge this gap, we introduce Heterogeneous Incremental Learning (HIL), a new setting for evaluating IL methods under realistic and challenging conditions, where task boundaries are ambiguous or unknown, class distributions shift dynamically across environments, and training data is limited. Model editing is inherently well-suited for this challenging HIL, as it allows for the efficient integration of new knowledge while preserving model capabilities. Thus, we propose a novel Sparse and Anchored Model Editing (SAME) for addressing HIL. Specifically, SAME sparsely and selectively updates task-relevant model parameters to extract compact, task-specific key–value knowledge pairs from limited data. Using these task knowledge pairs, the model performs knowledge injection for new tasks under double-anchor constraints. The knowledge anchor aligns the updated and original model features, while the parameter anchor constrains parameter magnitudes, ensuring stable and consistent knowledge injection. Our method can efficiently solve HIL using only a few labeled examples, without introducing additional model parameters. Extensive experiments on 11 diverse visual-language datasets across 22 sequential tasks show that our method outperforms existing continual learning approaches by 6.8% in average accuracy, while retaining 95.8% of the oracle model performance, demonstrating strong stability and cross-domain generalization.

View full details

Poster

Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

Chamuditha Jayanga Galappaththige ⋅ Jason Lai ⋅ Lloyd Windrim ⋅ Donald Dansereau ⋅ Niko Suenderhauf ⋅ Dimity Miller

Jun 7, 11:45 AM - 1:45 PM ExHall F 338

Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines. Code will be released upon acceptance.

View full details

Poster

Exemplar-Free Continual Learning for State Space Models

ISAAC NING LEE ⋅ Leila Mahmoodi ⋅ Trung Le ⋅ Mehrtash Harandi

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 340

State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose unique challenges in Continual Learning (CL). Without access to the full distribution of previous tasks, updates to the state-space dynamics become unconstrained, leading to catastrophic forgetting. To address this, we propose $\textbf{Inf-SSM}$, a geometry-aware regularization framework for CL in SSMs. It constrains state evolution via the infinite-dimensional Grassmannian of SSM observability subspaces, without requiring any exemplars from past tasks. Unlike classical CL methods that restrict weight updates, Inf-SSM directly regularizes the infinite-horizon state evolution encoded by the extended observability subspace of the SSM. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs $\mathcal{O}(n^3)$ complexity. Thus, we develop an $\mathcal{O}(n^2)$ solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks of ImageNet-R, CIFAR-100, and Caltech-256 demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.
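
To illustrate how structure can lower the cost of a Sylvester equation $AX + XB = Q$, the sketch below assumes, purely for illustration, that both coefficient matrices are diagonal. The abstract does not say which structure Inf-SSM exploits, so this is not the paper's algorithm; the function names `solve_sylvester_diagonal` and `residual` are hypothetical.

```python
def solve_sylvester_diagonal(a, b, q):
    """Solve A X + X B = Q, assuming A = diag(a) and B = diag(b).

    Entry-wise, (a[i] + b[j]) * X[i][j] = Q[i][j], so each entry is a
    single division and the total cost is O(n * m) instead of cubic.
    """
    x = []
    for i, a_i in enumerate(a):
        row = []
        for j, b_j in enumerate(b):
            denom = a_i + b_j
            if denom == 0:
                raise ValueError("singular system: a[i] + b[j] == 0")
            row.append(q[i][j] / denom)
        x.append(row)
    return x


def residual(a, b, x, q):
    """Largest absolute entry of A X + X B - Q, used as a sanity check."""
    return max(
        abs(a[i] * x[i][j] + x[i][j] * b[j] - q[i][j])
        for i in range(len(a))
        for j in range(len(b))
    )


# Illustrative values only.
a = [1.0, 2.0]
b = [3.0, 5.0]
q = [[4.0, 12.0], [10.0, 14.0]]
x = solve_sylvester_diagonal(a, b, q)
```

For dense, unstructured matrices a general-purpose solver would be needed instead; the point here is only that a diagonal assumption reduces the solve to one division per entry.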

View full details

Poster

AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning

S Divakar Bhat ⋅ Amit Popat More ⋅ Mudit Soni ⋅ Bhuvan Aggarwal

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 341

Long-Tail Class Incremental Learning (LTCIL) combines two fundamental challenges: \textit{catastrophic forgetting} of past tasks and \textit{severe class imbalance}. Existing approaches mitigate one challenge at a time, through rehearsal, reweighting, or classifier alignment, but they typically assume \emph{static priors} and rely on multi-stage training. In contrast, we propose \textbf{AdaPrior}, a simple Bayesian framework that treats LTCIL as a problem of \emph{dynamic prior misalignment}. Our key idea is to estimate model-induced priors online via an exponential moving average and use them for (i) debiasing during training (\textbf{AdaPrior Loss}), and (ii) lightweight post-hoc correction at inference. The combined approach unifies loss-level and inference-level debiasing without additional stages or heavy computation. We provide theoretical analysis showing that AdaPrior’s prior estimator converges to the true model prior and that its logit adjustment yields well-calibrated posteriors under mild assumptions. Extensive experiments on CIFAR100-LT, Food-101-LT, ImageNet-LT-subset, and iNaturalist18-subset demonstrate consistent gains over recent LTCIL baselines. Beyond accuracy, AdaPrior improves calibration and forgetting curves, making it a practical and scalable solution for long-tail continual learning.
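
The two ingredients named in the abstract, an exponential-moving-average prior estimate and a post-hoc logit correction, can be sketched as follows. This is a minimal illustration under assumptions: the prior is taken to be the mean softmax output over a batch, and the correction subtracts the log prior from the logits. The function names, the `momentum` parameter, and the example values are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(prior, batch_logits, momentum=0.9):
    """Blend the running prior with the batch-mean predicted distribution."""
    n_classes = len(prior)
    mean_probs = [0.0] * n_classes
    for logits in batch_logits:
        for k, p in enumerate(softmax(logits)):
            mean_probs[k] += p / len(batch_logits)
    return [
        momentum * prior[k] + (1.0 - momentum) * mean_probs[k]
        for k in range(n_classes)
    ]


def adjust_logits(logits, prior, eps=1e-12):
    """Post-hoc correction (assumed form): subtract the log estimated prior."""
    return [z - math.log(p + eps) for z, p in zip(logits, prior)]


# A model biased toward class 0; values are illustrative only.
prior = [0.5, 0.5]
batch = [[2.0, 0.0], [1.0, 0.0], [3.0, 1.0]]
prior = ema_update(prior, batch, momentum=0.5)

# A borderline sample that the raw logits assign to class 0.
adjusted = adjust_logits([0.2, 0.0], prior)
predicted = adjusted.index(max(adjusted))
```

In this toy run the estimated prior favors class 0, so the correction shifts the borderline sample toward the under-predicted class 1.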

View full details

Poster

Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation

Nikolay Kormushev ⋅ Josip Šarić ⋅ Matej Kristan

Jun 6, 11:45 AM - 1:45 PM ExHall F 341

Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision–language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP’s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code will be made available.

View full details

Poster

GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving

Lin Liu ⋅ Caiyan Jia ⋅ Guanyi Yu ⋅ Ziying Song ⋅ Junqiao Li ⋅ Feiyang Jia ⋅ Peiliang Wu ⋅ Xiaoshuai Hao ⋅ Yadan Luo

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 344

Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, GuideFlow unifies the training of the flow matching with the Energy-Based Model (EBM) to enhance the model's autonomous optimization capability to robustly satisfy physical constraints. In addition, GuideFlow parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, nuScenes, NavSim and ADV-NuScenes) validate the effectiveness of GuideFlow. Notably, on the NavSim test hard split (Navhard), GuideFlow achieves state-of-the-art performance with an EPDMS score of 43.0. The code will be released.

View full details

Poster

BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling

Jiachen Yang ⋅ Xianhui Lin ⋅ Yi Dong ⋅ Zebiao Zheng ⋅ Xing Liu ⋅ Hong Gu ⋅ Yanmei Fang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 343

Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences. Our code will be made publicly available.

View full details

Poster

RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation

Anuvab Sen ⋅ Mir Sayeed ⋅ Saibal Mukhopadhyay

Jun 6, 11:45 AM - 1:45 PM ExHall F 344

We introduce RAVEN, a deep learning architecture for processing frequency-modulated continuous-wave (FMCW) radar data that is designed for high computational efficiency. RAVEN reduces computation by using a learnable antenna mixer module on independent receiver state space encoders (SSM) to compress the virtual MIMO array into a compact set of learned features and by performing per-chirp inference with a calibrated early-exit rule, so the model reaches a decision using only a subset of chirps in a radar frame. These design choices yield up to 170× lower computation and 4× lower end-to-end latency than conventional frame-based radar backbones, while achieving state-of-the-art detection and BEV free-space segmentation performance on automotive radar datasets.
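
The idea of reaching a decision from only a subset of chirps can be sketched with a simple early-exit loop. This is a stand-in under stated assumptions, not RAVEN's calibrated rule: each chirp is assumed to yield a scalar confidence in [0, 1], and the exit test is a hypothetical threshold on the running mean. The function name and the parameters `threshold` and `min_chirps` are illustrative.

```python
def classify_with_early_exit(chirp_scores, threshold=0.9, min_chirps=2):
    """Consume per-chirp confidences and stop once the running mean is decisive.

    Returns (decision, chirps_used), where decision is True when the
    running mean confidence is at least 0.5.
    """
    if not chirp_scores:
        raise ValueError("at least one chirp score is required")
    total = 0.0
    for used, score in enumerate(chirp_scores, start=1):
        total += score
        mean = total / used
        decisive = mean >= threshold or mean <= 1.0 - threshold
        if used >= min_chirps and decisive:
            return mean >= 0.5, used  # exit early, skipping remaining chirps
    # No early exit: fall back to the mean over the whole frame.
    return total / len(chirp_scores) >= 0.5, len(chirp_scores)


# Illustrative scores: the first frame is decisive early, the second is not.
confident = classify_with_early_exit([0.95, 0.97, 0.6, 0.4])
ambiguous = classify_with_early_exit([0.6, 0.5, 0.55, 0.6])
```

The compute saving in such a scheme comes from the chirps that are never processed once the exit condition holds.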

View full details

Poster

SyncDreamer: Controllable and Expressive Avatar Generation Beyond the Talking Head

Fatemeh Nazarieh ⋅ Zhenhua Feng ⋅ Diptesh Kanojia ⋅ Josef Kittler ⋅ Muhammad Awais

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 344

Generating realistic and expressive audio-driven talking avatars remains a central challenge in digital human synthesis. Existing methods often depend on intermediate representations such as pose estimations for natural body motion, which restricts flexibility and adds visual distortions. Moreover, most audio-driven approaches rely on discrete emotion classifiers or text labels to regulate facial expression, reducing complex affective dynamics to coarse categories such as happy, sad, or angry. Such categorical supervision fails to capture the continuous and fine-grained speech dynamics (rhythm, energy, intensity), resulting in limited synchronization and emotionally shallow motion. To overcome these limitations, we present SyncDreamer, a unified Diffusion Transformer framework that generates identity-preserving and emotionally expressive talking avatars from only a single image, speech audio, and text prompt. We propose a visual adapter with Attention Localization Loss to maintain identity fidelity, further incorporating an audio dynamics encoder for rhythm- and emotion-aware motion, and an RL-based Cross-Modal Prompt Enhancer grounding textual cues in visual context for fine-grained motion control. Extensive experiments on portrait and full-body benchmarks demonstrate state-of-the-art performance in realism, synchronization accuracy, and semantic controllability, establishing a scalable foundation for expressive digital avatars in interactive and creative applications.

View full details

Poster

Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models

Zizhi Chen ⋅ Yizhen Gao ⋅ Minghao Han ⋅ Yizhou Liu ⋅ Zhaoyu Chen ⋅ Dingkang Yang ⋅ Lihua Zhang

Jun 7, 11:45 AM - 1:45 PM ExHall F 344

Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.

View full details

Poster

Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting

Jinhyeok Jang ⋅ Jaehong Kim ⋅ Jung Uk Kim

Jun 7, 3:30 PM - 5:30 PM ExHall A 346

Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce KNowledge-Overflowed Weights (KNOW) prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our KNowledge-Overflowed Weights Nowcaster (KNOWN) acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms naïve fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer.

View full details

Poster

Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure

Xin Zhang ⋅ Liang Bai ⋅ Guanchao Wang ⋅ Xian Yang

Jun 6, 11:45 AM - 1:45 PM ExHall F 348

Exemplar-Free Class Incremental Learning (EFCIL) aims to enable models to learn new classes sequentially without retaining samples from previous tasks. While recent approaches leverage pre-trained models with parameter-efficient tuning to mitigate forgetting, they often overlook a crucial cause of forgetting: the collapse of the class-discriminative structure. This structure comprises two interdependent components: intra-class structure, which characterizes the shape of individual classes, and inter-class structure, which characterizes the global geometric relationships among class prototypes. We reveal that catastrophic forgetting stems from the simultaneous deterioration of both intra-class and inter-class structures. To address this, we propose a unified framework that preserves the class-discriminative structure. It preserves the intra-class structure by reshaping class means and covariances to preserve each class’s shape during migration, and maintains inter-class structure by stabilizing angular relationships between samples and old prototypes. Extensive experiments demonstrate that our framework outperforms existing leading methods on multiple EFCIL benchmarks, validating that preserving the class-discriminative structure is crucial for mitigating catastrophic forgetting.

View full details

Poster

PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning

Yuanhang Lei ⋅ Tao Cheng ⋅ Xingxuan Li ⋅ Boming Zhao ⋅ Siyuan Huang ⋅ Ruizhen Hu ⋅ Peter Yichen Chen ⋅ Hujun Bao ⋅ Zhaopeng Cui

Jun 7, 11:45 AM - 1:45 PM ExHall F 348

Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with the subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer-based encoder and a cross-attention decoder. Furthermore, we develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints. PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.
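
For readers unfamiliar with the reference point, classical Linear Blend Skinning deforms a rest position $x$ by blending $m$ handle transformations with spatially varying weights. The form below is the textbook version, given as background rather than as PhysSkin's exact formulation; the symbols $w_i$, $T_i$, and $m$ are chosen here for illustration:

$$x' = \sum_{i=1}^{m} w_i(x)\, T_i \begin{bmatrix} x \\ 1 \end{bmatrix}, \qquad T_i \in \mathbb{R}^{3 \times 4}$$

Here the weights $w_i(x)$ play the role of the skinning fields, and the handle transformations $T_i$ are the low-dimensional subspace coordinates that a simulator would evolve.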

View full details

Poster

Unifying Language-Action Understanding and Generation for Autonomous Driving

Xinyang Wang ⋅ Qian Liu ⋅ WENJIE DING ⋅ Zhao Yang ⋅ Wei Li ⋅ Chang Liu ⋅ Bailin Li ⋅ Kun Zhan ⋅ XianPeng Lang ⋅ Wei Chen

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 351

Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language–action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, alongside reduced inference latency.
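
As a quick worked conversion, and assuming the 86% figure is a reduction relative to the auto-regressive baseline time $T_{\text{AR}}$ (the abstract does not state the baseline explicitly), the implied speedup is:

$$T_{\text{C2F}} = (1 - 0.86)\,T_{\text{AR}} = 0.14\,T_{\text{AR}}, \qquad \frac{T_{\text{AR}}}{T_{\text{C2F}}} = \frac{1}{0.14} \approx 7.1\times$$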

View full details

Poster

Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions

Shiqin Wang ⋅ Haoyang Chen ⋅ Huaizhou Huang ⋅ Yinkan He ⋅ Dongfang Sun ⋅ Xiaoqing Chen ⋅ Xingyu Liu ⋅ Zheng Wang ⋅ Kaiyan Zhao

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 353

The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model's evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an \emph{autonomous class scheduler}. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source–target supervision, the learned class rankings direct the network’s focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. Our method achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation (i.e., SYNTHIA $\rightarrow$ Cityscapes).

View full details

Poster

HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation

Chengjie Fan ⋅ Cong Pan ⋅ Zijian Liu ⋅ Ningzhong Liu ⋅ Jie Qin

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 353

Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has drawn widespread attention, owing to its significant application value in areas such as logistics delivery and urban inspection. However, existing methods in complex urban environments face several challenges, including insufficient generalization to unknown scenes, suboptimal performance in long-distance path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that blends Imitation Learning (IL) and Reinforcement Learning (RL) into a hybrid IL-RL paradigm. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance at all levels of scenes and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.

View full details

Poster

DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations

Yuxiang Shi ⋅ Zhe Li ⋅ Yanwen Wang ⋅ Hao Zhu ⋅ Xun Cao ⋅ Ligang Liu

Jun 7, 3:30 PM - 5:30 PM ExHall A 356

Portrait animation from a single source image and a driving video is a long-standing problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.

View full details

Poster

SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting

Bo Li ⋅ Jiahao Kang ⋅ Yubo Ma ⋅ Feng-Lin Liu ⋅ Bin Liu ⋅ Fang-Lue Zhang ⋅ Lin Gao

Jun 7, 3:30 PM - 5:30 PM ExHall A 357

3D Gaussian representations have emerged as a powerful paradigm for digital head modeling, achieving photorealistic quality with real-time rendering. However, intuitive and interactive creation or editing of 3D Gaussian head models remains challenging. Although 2D sketches provide an ideal interaction modality for fast, intuitive conceptual design, they are sparse, depth-ambiguous, and lack high-frequency appearance cues, making it difficult to infer dense, geometrically consistent 3D Gaussian structures from strokes—especially under real-time constraints. To address these challenges, we propose SketchFaceGS, the first sketch-driven framework for real-time generation and editing of photorealistic 3D Gaussian head models from 2D sketches. Our method uses a feed-forward, coarse-to-fine architecture. A Transformer-based UV feature-prediction module first reconstructs a coarse but geometrically consistent UV feature map from the input sketch, and a 3D UV feature enhancement module refines it with high-frequency, photorealistic detail to produce a high-fidelity 3D head. For editing, we introduce a UV Mask Fusion technique combined with a layer-by-layer feature-fusion strategy, enabling precise, real-time, free-viewpoint modifications. Extensive experiments show that SketchFaceGS outperforms existing methods in both generation fidelity and editing flexibility, producing high-quality, editable 3D heads from sketches in a single forward pass.

View full details

Poster

Globally Optimal Pose from Orthographic Silhouettes

Agniva Sengupta ⋅ Dilara Kus ⋅ Jianning Li ⋅ Stefan Zachow

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 358

We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t. trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches.
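
As an illustration of why the silhouette area varies continuously with orientation, the sketch below evaluates the orthographic silhouette area of a convex polyhedron with Cauchy's projection formula, $A(d) = \tfrac{1}{2}\sum_i A_i\,|n_i \cdot d|$. This is a restricted stand-in, not the method of the abstract: it only holds for convex shapes, whereas the abstract covers arbitrary convexity and genus. The function name and the cube data are hypothetical.

```python
import math

def silhouette_area_convex(face_normals, face_areas, view_dir):
    """Orthographic silhouette area of a CONVEX polyhedron (Cauchy's formula).

    face_normals: unit outward normals, one per face.
    face_areas:   area of each face.
    view_dir:     viewing direction (normalised inside the function).
    """
    norm = math.sqrt(sum(c * c for c in view_dir))
    d = [c / norm for c in view_dir]
    total = 0.0
    for normal, area in zip(face_normals, face_areas):
        cosine = sum(n_c * d_c for n_c, d_c in zip(normal, d))
        total += area * abs(cosine)
    # Each projected point is covered by exactly one front and one back face.
    return 0.5 * total

# Hypothetical test shape: a unit cube.
CUBE_NORMALS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
CUBE_AREAS = [1.0] * 6

area_face_on = silhouette_area_convex(CUBE_NORMALS, CUBE_AREAS, (0, 0, 1))
area_corner_on = silhouette_area_convex(CUBE_NORMALS, CUBE_AREAS, (1, 1, 1))
```

For the cube, the face-on view gives an area of 1 and the corner-on view gives $\sqrt{3}$; intermediate directions interpolate smoothly between such values, which is the kind of response surface a signature lookup could be built on.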

View full details

Poster

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

Hezhen Hu ⋅ Wangbo Zhao ⋅ Lanqing Guo ⋅ Hanwen Jiang ⋅ Jonathan C. Liu ⋅ Zhiwen Fan ⋅ Kai Wang ⋅ Zhangyang Wang ⋅ Georgios Pavlakos

Jun 6, 11:45 AM - 1:45 PM ExHall F 359

In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions.

View full details

Poster

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Zhongxing Xu ⋅ Zhonghua Wang ⋅ Zhe Qian ⋅ Dachuan Shi ⋅ feilong tang ⋅ Ming Hu ⋅ Shiyan Su ⋅ Xiaocheng Zou ⋅ Wei Feng ⋅ Dwarikanath Mahapatra ⋅ Yifan Peng ⋅ Minquan Lin ⋅ Zongyuan Ge

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 361

Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
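
The entropy-aware mode switching described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the LEAD implementation: the function names, the fixed `threshold`, and the toy embedding table are all hypothetical, and the visual anchor injection step is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def next_input_embedding(logits, embedding_table, threshold):
    """Choose the embedding fed to the next decoding step.

    High entropy: probability-weighted mix of all token embeddings ("latent").
    Low entropy:  embedding of the single most likely token ("discrete").
    `threshold` is a hypothetical hyperparameter.
    """
    probs = softmax(logits)
    if entropy(probs) > threshold:
        dim = len(embedding_table[0])
        mixed = [
            sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)
        ]
        return "latent", mixed
    best = max(range(len(probs)), key=probs.__getitem__)
    return "discrete", list(embedding_table[best])
```

With a two-token vocabulary and one-hot embeddings, equal logits yield the averaged embedding, while a sharply peaked distribution falls back to the ordinary discrete token embedding.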

View full details

Poster

Same Attention, Different Truths: Put Logit-Lens over Visual Attention to Detect and Mitigate LVLM Object Hallucination

Zichuan Wang ⋅ Songlin Yang ⋅ Bo Peng ⋅ Zhenchen Tang ⋅ Yang Li ⋅ BeibeiDong BeibeiDong ⋅ Beibei Dong

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 362

Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating objects that are absent from the image. Prior work largely attributes this to insufficient visual attention. However, in this work, we are surprised to find that both real and hallucinated objects receive equally strong visual attention in the model’s mid-to-late layers. This indicates that the key issue may not be how much the model attends, but **what it attends to and why**. To this end, we decode the visual features of high-attention regions using Logit Lens, and observe that high-attention regions corresponding to real objects can be correctly decoded to the target object token, whereas those for hallucinated objects cannot. Building on this, we identify two distinct hallucination mechanisms: **(i) visual uncertainty**, triggered by semantically similar or confusable regions, for which masking these regions eliminates the hallucination; and **(ii) contextual prior**, triggered by strong co-occurrence priors, for which the hallucination persists and attention drifts to other regions even when the initially attended region is masked. Based on these findings, we propose a simple yet effective training-free **Detect–Mitigate framework** comprising a Logit-Lens Consistency Check to detect hallucination and targeted remedies: High-Attention Regions Masking (HARM) for visual uncertainty hallucination, and Visual Evidence Enhanced Decoding (VEED) for contextual prior hallucination, which leverages genuine visual evidence to suppress erroneous priors. Our approach achieves state-of-the-art results on multiple hallucination benchmarks.
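
A rough sketch of a logit-lens style consistency check is given below. It assumes the general logit-lens idea of projecting an intermediate hidden vector through the output (unembedding) matrix and reading off the top token. The function names, the "any region matches" rule, and the toy matrices are hypothetical, and the final normalisation layer a real model would apply before unembedding is left out.

```python
def logit_lens_top_token(hidden, unembedding):
    """Project a hidden vector through a (vocab x dim) matrix; return the argmax token id."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]
    return max(range(len(scores)), key=scores.__getitem__)

def is_consistent(region_hiddens, unembedding, target_token_id):
    """Hypothetical check: does any high-attention region decode to the generated object token?"""
    return any(
        logit_lens_top_token(hidden, unembedding) == target_token_id
        for hidden in region_hiddens
    )

# Toy setup: 2-token vocabulary, 2-dimensional hidden states.
UNEMBEDDING = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.9, 0.1], [0.7, 0.2]]

real_object = is_consistent(regions, UNEMBEDDING, target_token_id=0)
hallucinated_object = is_consistent(regions, UNEMBEDDING, target_token_id=1)
```

In this toy case the attended regions decode to token 0, so a generated token 1 would be flagged as inconsistent with the visual evidence.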

View full details

Poster

SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

Chang-Hsun Wu ⋅ Kai-Po Chang ⋅ Yu-Yang Sheng ⋅ Hung-Kai Chung ⋅ Kuei-Chun Wang ⋅ Yu-Chiang Frank Wang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 364

Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
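
For intuition, contrastive decoding against two negatives can be sketched as below. This uses the common contrastive form, in which the original logits are amplified and the negative logits subtracted. It is an assumption, not the SEASON formulation: the abstract does not state how the per-token weights are diagnosed, so `w_temporal` and `w_spatial` are hypothetical inputs here.

```python
def contrastive_logits(original, temporal_neg, spatial_neg, w_temporal, w_spatial):
    """Combine logits from the original input with logits from two degraded inputs.

    original:     logits given the unmodified video.
    temporal_neg: logits given a temporally corrupted video (hypothetical negative).
    spatial_neg:  logits given a spatially corrupted video (hypothetical negative).
    The weights would be set per token; here they are plain arguments.
    """
    scale = 1.0 + w_temporal + w_spatial
    return [
        scale * o - w_temporal * t - w_spatial * s
        for o, t, s in zip(original, temporal_neg, spatial_neg)
    ]

# Token 0 stays likely even when temporal order is destroyed, so it is
# penalised relative to token 1, which depends on the temporal evidence.
adjusted = contrastive_logits(
    original=[2.0, 1.0],
    temporal_neg=[2.0, 0.0],
    spatial_neg=[1.0, 1.0],
    w_temporal=1.0,
    w_spatial=0.0,
)
```

Setting both weights to zero recovers the original logits, so the adjustment can be switched off for tokens judged to be at low risk.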

View full details

Poster

PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation

Jiahao Zhan ⋅ Zizhang Li ⋅ Hong-Xing Yu ⋅ Jiajun Wu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 367

We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.

View full details

Poster

Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling

shikun zhang ⋅ Yong Li ⋅ Yiqun Wang ⋅ Qiuhong Ke ⋅ Cunjian Chen

Jun 7, 3:30 PM - 5:30 PM ExHall A 367

We propose Fresco, a unified optimization paradigm designed to mitigate early over-sharpening and cross-view drifting in head avatar reconstruction. Fresco combines a Laplacian-pyramid-based frequency curriculum with UV-space consistency regularization to progressively enhance reconstruction quality. The optimization begins by stabilizing low-frequency appearance in the image domain, which suppresses spurious details and promotes reliable convergence. As learning proceeds, consistency across different viewpoints is reinforced through pixel-level alignment on shared UV texture coordinates. Finally, high-frequency components are refined under explicit frequency-band constraints, and seam boundary regularization is applied to preserve local continuity. By optimizing in a frequency- and UV-aligned space, Fresco achieves robust convergence without pseudo high-frequency artifacts and yields consistent, high-fidelity results across views. Experiments on the NeRSemble dataset validate the effectiveness of our design. Our method outperforms previous state-of-the-art methods while avoiding additional training overhead through frequency scheduling and UV-bake caching.
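
The idea of a Laplacian-pyramid frequency curriculum can be illustrated on a 1D signal, as in the sketch below. It makes several simplifying assumptions that are not taken from the abstract: a box filter replaces the usual Gaussian filter, the signal length must be divisible by 2 to the power of `levels`, and the curriculum is modelled simply as the number of detail bands admitted into the target (`num_detail`). All names are hypothetical.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs (box filter)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the resolution by repeating each sample."""
    return [value for value in signal for _ in range(2)]

def laplacian_pyramid(signal, levels):
    """Return [finest detail band, ..., coarsest detail band, low-pass residual]."""
    bands = []
    current = list(signal)
    for _ in range(levels):
        low = downsample(current)
        bands.append([c - u for c, u in zip(current, upsample(low))])
        current = low
    bands.append(current)
    return bands

def reconstruct(bands, num_detail):
    """Rebuild the signal using only the `num_detail` coarsest detail bands."""
    current = list(bands[-1])
    for index, band in enumerate(reversed(bands[:-1])):
        up = upsample(current)
        if index < num_detail:
            current = [u + b for u, b in zip(up, band)]
        else:
            current = up
    return current

bands = laplacian_pyramid([1.0, 3.0, 5.0, 7.0], levels=2)
early_target = reconstruct(bands, num_detail=0)  # low frequencies only
late_target = reconstruct(bands, num_detail=2)   # all bands, exact signal
```

Supervising against `early_target` first and admitting finer bands later is one simple way to read "stabilize low frequencies, then refine high frequencies".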

View full details

Poster

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Hyunsoo Cha ⋅ Wonjung Woo ⋅ Byungjun Kim ⋅ Hanbyul Joo

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 369

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front–back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment–posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.

View full details

Poster

CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Kaiyi Huang ⋅ Yukun Huang ⋅ Yu Li ⋅ Jianhong Bai ⋅ Xintao Wang ⋅ Zinan Lin ⋅ Xuefei Ning ⋅ Jiwen Yu ⋅ Yu Wang ⋅ Xihui Liu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 368

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subjects while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.

View full details

Poster

UIKA: Fast Universal Head Avatar from Pose-Free Images

Zijian Wu ⋅ Boyao Zhou ⋅ Liangxiao Hu ⋅ Hongyu Liu ⋅ Yuan Sun ⋅ Xuan Wang ⋅ Xun Cao ⋅ Yujun Shen ⋅ Hao Zhu

Jun 6, 11:45 AM - 1:45 PM ExHall F 370

We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise UV coordinate estimation. Such UV coordinate estimation allows us to project each valid pixel from screen space to UV space, which is independent of camera pose and character expression. We thus leverage this UV space to represent our Gaussian head avatar. To this end, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV token can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. Such a Gaussian avatar is directly animatable via standard linear blend skinning and supports real-time rendering. To train our large avatar model, we further prepare a large-scale, identity-rich training dataset with controllable views and motions, synthesized with a 3D GAN and a state-of-the-art image animation model. Our proposed method significantly outperforms existing approaches in rendering quality, 3D consistency, and inference efficiency on both single-view and multi-view input data.

View full details

Poster

Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning

Zhenghao Peng ⋅ Wenhao Ding ⋅ Yurong You ⋅ Yuxiao Chen ⋅ Wenjie Luo ⋅ Thomas Tian ⋅ Yulong Cao ⋅ Apoorva Sharma ⋅ Danfei Xu ⋅ Boris Ivanovic ⋅ Boyi Li ⋅ Yan Wang ⋅ Marco Pavone

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 372

Recent reasoning-augmented Vision-Language-Action (VLA) models have improved the interpretability of end-to-end autonomous driving by generating intermediate reasoning traces. Yet these models primarily describe what they perceive and intend to do, rarely questioning whether their planned actions are safe or appropriate. This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, then performs a counterfactual reasoning pass conditioned on both the meta-actions and the visual input. This step simulates potential outcomes, identifies unsafe behaviors, and outputs corrected meta-actions that guide the final trajectory generation. To efficiently obtain such self-reflection capabilities, we propose a rollout–filter–label pipeline that mines high-value scenes from a base (non-counterfactual) VLA's rollouts and labels counterfactual reasoning traces for subsequent counterfactual training rounds. Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6\%, enhances safety metrics, and exhibits adaptive thinking: it only enables counterfactual reasoning in challenging scenarios. By transforming reasoning traces from one-shot descriptions to causal self-correction signals, CF-VLA takes a step toward self-reflective autonomous driving agents that learn to think before they act.

View full details

Poster

WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification

Yuxuan Zhao ⋅ Zhongao Zhou ⋅ Bin Yang ⋅ He Li ⋅ Jian Liang ⋅ Jun Chen ⋅ Bo Du ⋅ Mang Ye

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 375

Recent person re-identification (ReID) leverages heterogeneous sensing with multiple modalities and viewpoints to improve robustness across diverse conditions. However, most approaches target predefined scenario pairs (e.g., visible-infrared or aerial-ground) and train separate task-specific models. In contrast, real-world applications require retrieving identities from galleries that cover all scenarios, making such designs inefficient and complex to deploy. To bridge this gap, we introduce Any-Scenario ReID (AS-ReID): given a query from any (modality, viewpoint) scenario, a single model retrieves the same identity from a heterogeneous gallery spanning all scenarios. Progress toward AS-ReID is limited by two factors: (i) the lack of a real-world-aligned benchmark with broad scenario coverage, and (ii) the challenge of learning representations that are cohesive within identities and strongly discriminative across identities under diverse scenarios. To this end, we construct MSAG, a Multispectral Aerial-Ground benchmark with 2,337 identities and 434,620 images captured by RGB, near-infrared, and thermal infrared cameras on both ground and UAV platforms. MSAG spans day-night, multiple seasons, and varied weather conditions, and supports AS-ReID as well as conventional ReID tasks. We further propose the Unified Alignment and Discrimination (UAD) framework. Progressive Center Alignment (ProCA) aggregates multi-view features into modality centers and then aligns them toward identity centers to reduce scenario bias. Global Prototype Discrimination (GPD) contrasts samples against global identity prototypes to enforce large-margin discrimination. Extensive experiments highlight the challenges of MSAG and demonstrate the effectiveness of UAD on AS-ReID. The dataset and code will be released.

View full details

Poster

Text-guided Feature Disentanglement for Cross-modal Gait Recognition

Zhiyang Lu ⋅ Ming Cheng

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 377

Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the $top\text{-}k$ matched textual descriptions to reconstruct modality-specific representations and derive modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.

View full details

Poster

Vista4D: Video Reshooting with 4D Point Clouds

Kuan Heng Lin ⋅ Zhizheng Liu ⋅ Pablo Salamanca ⋅ Yash Kant ⋅ Ryan Burgert ⋅ Yuancheng Xu ⋅ Koichi Namekata ⋅ Yiwei Zhao ⋅ Bolei Zhou ⋅ Micah Goldblum ⋅ Paul Debevec ⋅ Ning Yu

Jun 7, 11:45 AM - 1:45 PM ExHall F 377

We present **Vista4D**, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. Results are best viewed as videos in the Supplement.

View full details

Poster

PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

Antonio Oroz ⋅ Matthias Nießner ⋅ Tobias Kirschstein

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 379

We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network. A segmentation map controls geometry and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI.

View full details

Poster

Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding

Nando Metzger ⋅ Prune Truong ⋅ Goutam Bhat ⋅ Konrad Schindler ⋅ Federico Tombari

Jun 7, 11:45 AM - 1:45 PM ExHall F 379

The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (respectively, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion.

View full details

Poster

Portable Active Learning for Object Detection

Rashi Sharma ⋅ Justin Timothy C. Bersamin ⋅ Karthikk Subramanian

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 380

Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning (AL) methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
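
The entropy-based scoring step can be sketched as follows, assuming each detection already has a probability of being a true positive (in the abstract this comes from lightweight class-specific logistic classifiers, which are not reproduced here). Aggregating per image with `max` and the function names are assumptions for illustration; the diversity and similarity refinements are omitted.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def rank_images(image_to_probs, budget):
    """Return up to `budget` image ids, most uncertain first.

    image_to_probs maps an image id to the true-positive probabilities of its
    detections. An image is scored by its most uncertain detection
    (hypothetical aggregation rule).
    """
    scores = {
        image_id: max((binary_entropy(p) for p in probs), default=0.0)
        for image_id, probs in image_to_probs.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]

selected = rank_images(
    {"a": [0.5], "b": [0.99], "c": [0.7, 0.95], "d": []},
    budget=2,
)
```

Because the score depends only on detector outputs, a selection step of this kind needs no access to model features or training internals.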

View full details

Poster

MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

Ruijie Zhu ⋅ Jiahao Lu ⋅ Wenbo Hu ⋅ Xiaoguang Han ⋅ Jianfei Cai ⋅ Ying Shan ⋅ Chuanxia Zheng

Jun 7, 3:30 PM - 5:30 PM ExHall A 382

We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. To represent them effectively in latent space, we propose a 4D VAE that encodes point maps and scene flows as a unified latent compatible with pretrained video generators. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents—despite their fundamentally different distributions—we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in joint 4D geometry reconstruction and dense scene flow estimation, delivering 38.64\% and 25.0\% improvements in geometry and motion reconstruction, respectively, all without any post-optimization.

Poster

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

Yuxue Yang ⋅ Lue Fan ⋅ Ziqi Shi ⋅ Junran Peng ⋅ Feng Wang ⋅ Zhaoxiang Zhang

Jun 7, 3:30 PM - 5:30 PM ExHall A 385

In this paper, we propose **NeoVerse**, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks.

Poster

Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination

Songyuan Yang ⋅ Guijian Tang ⋅ Kun Hu ⋅ Haotian Wang ⋅ Shixuan Liu ⋅ Wenjing Yang ⋅ Long Lan ⋅ Huibin Tan

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 387

Hallucination limits the reliability of multimodal large language models (MLLMs), and it is particularly damaging in video, where errors manifest as distorted narratives rather than single-frame mistakes. We introduce a frame-first study of **Chimera Hallucination**: the model stitches visual segments that exist in space and time but do not belong to the same event chain, producing a spurious continuous story. We introduce **CH-Risk**, a single-forward, reference-free risk estimate tailored to this failure mode. CH-Risk combines two complementary signals: SegCoverage@$\alpha$ (SCR@$\alpha$) measures how many event segments are needed to cover most text-to-frame support, exposing long-range stitching; Alignment with Early Temporal Pathway (AETP) measures rank consistency between support and the temporal pathway formed in early–middle layers, exposing stage mismatch. To turn risk into correction, we further propose **CH-M(itigation)**, a train-free two-stage intervention. Segment-aligned Stage-Aligned Frame Routing (sSAFR) re-weights frames before the mid-layer softmax to route attention toward a small set of pathway-aligned segments. Residual Token Calibration (RTC) then stabilizes token usage within selected segments. Extensive experiments across 9 benchmarks and 6 VideoLLMs show that CH-Risk can predict Chimera and that CH-M consistently reduces hallucination and improves task accuracy with negligible overhead (sub-5\% latency, sub-2.5\% memory, $\approx$1\% FLOPs).
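
The segment-coverage idea can be sketched as a small counting routine: given a per-segment support mass, find how few segments are needed to reach a fraction $\alpha$ of the total. This is an assumed reading of "how many event segments are needed to cover most text-to-frame support"; the function name, the greedy largest-first ordering, and the inputs are hypothetical, and the paper's actual definition may differ.

```python
def segments_to_cover(segment_support: list[float], alpha: float) -> int:
    """Smallest number of segments whose support sums to at least alpha of the total."""
    total = sum(segment_support)
    if total <= 0:
        return 0  # no support at all, so nothing needs covering
    covered = 0.0
    # Take the segments with the largest support first.
    for count, support in enumerate(sorted(segment_support, reverse=True), start=1):
        covered += support
        if covered >= alpha * total - 1e-12:  # small tolerance for float rounding
            return count
    return len(segment_support)
```

In this sketch, support concentrated in one segment yields a count of 1, while support spread evenly over many segments yields a large count, which is the pattern the abstract associates with long-range stitching.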

Poster

Building a Precise Video Language with Human–AI Oversight

Zhiqiu Lin ⋅ Siyuan Cen ⋅ Chancharik Mitra ⋅ Isaac Li ⋅ Yuhan Huang ⋅ Yu Tong Tiffany Ling ⋅ Hewei Wang ⋅ Irene Pi ⋅ Shihang Zhu ⋅ Yili Han ⋅ Yilun Du ⋅ Deva Ramanan

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 386

Video–language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, supported by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce a critique-based human–AI (CHAI) oversight framework, where trained human experts provide correctional critiques to revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for fine-tuning, improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through standard SFT, offline RL (DPO), online RL (GSPO), and inference-time scaling. With modest expert supervision, the resulting system outperforms even closed-source models such as Gemini-2.5-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of over 400 words, achieving finer control over camera motion, angle, lens, perspectives, and shot composition. Overall, our results show that precise specification and human–AI oversight are key to achieving professional-level video understanding and generation.

Poster

Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control

Chenxi Song ⋅ Yanming Yang ⋅ Tong Zhao ⋅ Ruibo Li ⋅ Chi Zhang

Jun 7, 3:30 PM - 5:30 PM ExHall A 386

Video diffusion models have rich world priors, but their use in spatial tasks is limited by poor control, spatio-temporally inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping strategies, are insufficient, often introducing visual artifacts, failing to generalize, or incurring high computational costs. We introduce a novel, training-free framework that operates purely at inference time to resolve these issues. Our method comprises three synergistic components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output to ensure strict adherence to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space. This allows our framework to selectively apply guidance, thereby decoupling motion from appearance and preserving visual fidelity. Third, a dual-path guidance strategy adaptively corrects for drift by comparing the guided generation against an unguided, reference denoising path, effectively neutralizing artifacts caused by misaligned structural inputs. These components work in concert to inject precise, trajectory-aligned control without any model retraining, achieving both accurate motion guidance and photorealistic synthesis. Our plug-and-play, model-agnostic solution demonstrates broad applicability for 3D/4D tasks. Extensive experiments confirm state-of-the-art performance in trajectory adherence and perceptual quality, outperforming both training-dependent and other inference-only methods.

Poster

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

Bingwen Zhu ⋅ Yuqian Fu ⋅ Qiaole Dong ⋅ Guolei Sun ⋅ Tianwen Qian ⋅ Yuzheng Wu ⋅ Danda Paudel ⋅ Yanwei Fu ⋅ Xiangyang Xue

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 387

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in visual-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.

Poster

URICA: A Uniformity Region Affine Identifier Capture Algorithm for Arbitrary Region Retrieval in Pathology Images

Ri Su ⋅ Zhao CHEN ⋅ Caleb Chen Cao ⋅ Lei Chen

Jun 7, 11:45 AM - 1:45 PM ExHall F 387

Whole slide image (WSI) region retrieval remains an open challenge in computational pathology, as existing methods struggle to represent and preserve information of all possible regions. Current approaches that rely on fixed-size patches or slide-level retrieval are misaligned with real clinical workflows, where pathologists often examine WSI regions of arbitrary orientations and sizes rather than predefined patches or slides. In this work, we redefine WSI retrieval as a semantically optimal matching problem between arbitrary regions under spatial transformations, which necessitates a region-level representation that maintains semantic consistency. To fulfill this requirement, we introduce semantic tessellation, which organizes patch units into flexible, geometry-aware region descriptors. Building on this representation, we develop the affine identifier, a semantic signature that enables rotation- and scale-consistent region matching. We further derive theoretical bounds between the tessellation-derived descriptors and the ideal pixel-level semantic mask objective, showing that they reliably approximate mask-based region similarity. Together, these components form URICA, a theoretically grounded algorithm for robust WSI region retrieval. Experiments on large public datasets demonstrate that URICA achieves strong and consistent performance across diverse WSI retrieval tasks.

Poster

Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance

William June Suk Choi ⋅ Kyungmin Lee ⋅ Sihyun Yu ⋅ Yisol Choi ⋅ Jinwoo Shin ⋅ Kimin Lee

Jun 7, 3:30 PM - 5:30 PM ExHall A 387

Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple training-free fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying a low-pass filter at the early stage of denoising. Extensive experiments show ALG significantly improves the temporal dynamics of generated videos, while preserving or even improving image fidelity and text alignment. For instance, on the VBench test suite, ALG achieves a 33\% average improvement across models in dynamic degree while maintaining the original video quality.
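
A minimal sketch of the idea of low-pass filtering the conditioning image only during early denoising is given below. Everything specific here is an assumption made for illustration: the linear schedule, the `cutoff` and `max_radius` parameters, and the box blur standing in for the low-pass filter are not taken from the paper.

```python
def blur_radius(step: int, total_steps: int, max_radius: int = 3, cutoff: float = 0.3) -> int:
    """Hypothetical schedule: strongest filtering at step 0, none once `cutoff` of the steps have passed."""
    progress = step / max(total_steps - 1, 1)
    if progress >= cutoff:
        return 0
    return round(max_radius * (1.0 - progress / cutoff))

def box_blur(image: list[list[float]], radius: int) -> list[list[float]]:
    """Mean filter (window truncated at the borders); a simple stand-in for a low-pass filter."""
    if radius <= 0:
        return [row[:] for row in image]
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out

def conditioning_for_step(image: list[list[float]], step: int, total_steps: int) -> list[list[float]]:
    """Return the (possibly blurred) conditioning image to use at a given denoising step."""
    return box_blur(image, blur_radius(step, total_steps))
```

With this schedule the sampler sees a smoothed reference image at the start and the unmodified image for the remaining steps.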

Poster

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo ⋅ Kangsan Kim ⋅ Jaehong Yoon ⋅ Sung Ju

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 388

Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

Poster

Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision

Zitang Sun ⋅ Masakazu Yoshimura ⋅ Junji Otsuka ⋅ Atsushi Irie ⋅ Takeshi Ohashi

Jun 7, 11:45 AM - 1:45 PM ExHall F 388

High-quality data has become a primary driver of progress under scaling laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model’s evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.

Poster

CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild

Alex Hoi Hang Chan ⋅ Neha Singhal ⋅ Onur Kocahan ⋅ Andrea Meltzer ⋅ Saverio Lubrano ⋅ Miya Warrington ⋅ Michael Griesser ⋅ Fumihiro Kano ⋅ Hemal Naik

Jun 6, 11:45 AM - 1:45 PM ExHall F 389

Long-term behavioural monitoring of individual animals is crucial for studying behavioural changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behaviour monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of coloured leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of colour rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.

Poster

VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection

Shuohao Shi ⋅ Qiang Fang ⋅ Xin Xu

Jun 6, 11:45 AM - 1:45 PM ExHall F 391

Closed-set object detection in remote sensing imagery has made significant progress, but achieving high detection accuracy remains challenging. Vision-Language Models (VLMs), which possess rich prior knowledge, offer a promising solution to this challenge. However, most existing VLMs are designed for open-vocabulary tasks and exhibit inherent limitations when directly applied to closed-set scenarios, such as notable accuracy degradation and high deployment costs. To address these issues, we propose VLM4RSDet, a novel collaborative training framework that leverages a vision-language model to enhance the performance of conventional closed-set remote sensing object detectors. Notably, during inference, VLM4RSDet only retains the standard object detection architecture, thus avoiding any additional deployment overhead. Furthermore, we introduce a Global–Local Cross-Attention (GLCA) module and a Learnable Hierarchical Prediction Strategy (LHPS) to further improve collaborative training performance. Extensive experiments on five benchmark datasets demonstrate the effectiveness and robustness of our approach. In particular, our method outperforms the state-of-the-art by 7.5\% in mAP$_{0.5:0.95}$ on the VisDrone2019 dataset. Our code will be made publicly available.

Poster

MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID

Xulin Li ⋅ Yan Lu ⋅ Bin Liu ⋅ Qinhong Yang ⋅ Qi Chu ⋅ Tao Gong ⋅ Nenghai Yu

Jun 6, 11:45 AM - 1:45 PM ExHall F 393

Visible-infrared person re-identification (VI-ReID) is a challenging task due to the significant modality discrepancy between visible and infrared images. We contend that the discrepancy primarily arises from the different lighting conditions of the two modalities, including differences in the wavelengths of light and the types of light source. Recently, frequency-based VI-ReID approaches have achieved notable success, since frequency information can more effectively extract contours and details pertinent to identity while excluding irrelevant lighting and color. However, existing methods do not distinguish different frequency bands or focus solely on a particular frequency band, which is insufficient for capturing the inherent variations in frequency under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different frequencies through a mixture-of-experts method. In addition, we introduce a Random Frequency Augmentation (RFA) and a Frequency Auxiliary Optimization (FAO) to effectively train the MFEN in mining frequency information. The three proposed frequency modules are complementary to each other and adaptively capture critical frequency domain details to achieve robust representations. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.

Poster

VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues

Sirnam Swetha ⋅ Rohit Gupta ⋅ Parth Parag Kulkarni ⋅ David G. ⋅ Jeffrey A. Chan-Santiago ⋅ Nyle Siddiqui ⋅ Joseph Fioresi ⋅ Mubarak Shah

Jun 7, 11:45 AM - 1:45 PM ExHall F 393

Video Question Answering (VideoQA) has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content - actions, objects, and events directly observable within individual frames or short clips. To truly understand videos as humans do, models must go beyond what is directly shown, inferring hidden relationships and contextual cues that are only implied across frames. Humans naturally excel at such implicit reasoning, seamlessly integrating partial visual cues over time to infer motion dynamics, spatial layout and context, constructing a coherent mental model of the scene even when such relationships are never explicitly depicted. Current benchmarks fail to capture this essential aspect of video understanding. To address this gap, we introduce VRR-QA, a benchmark for Visual Relational Reasoning Beyond Explicit Cues. We curate our benchmark from creative and cinematic videos, such as movies, that deliberately employ storytelling techniques which omit direct depictions of certain events or relations, requiring viewers to infer them. VRR-QA comprises $1K$ meticulously expert-annotated QA pairs drawn from $1K$ creative video clips covering $15$ genres across $7$ decades of content, from both live-action and animated titles. These annotations are deliberately challenging, crafted by authors, validated through multiple annotators, and benchmarked against human performance to ensure high quality. Our extensive evaluations on $11$ leading VideoQA models reveal consistent and significant performance degradation, underscoring their reliance on surface-level visual cues and highlighting the difficulty of implicit reasoning. Even the best model substantially underperforms human baselines with only 64% accuracy. Performance variations across models further illustrate the complexity and diversity of the challenges presented by VRR-QA. By releasing both dataset and data collection framework, VRR-QA establishes a rigorous, diverse, and reproducible testbed for advancing VideoQA.

Poster

FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

Mengtian Li ⋅ Kunyan Dai ⋅ Yi Ding ⋅ Ruobing Ni ⋅ Ying Zhang ⋅ Wenwu Wang ⋅ Zhifeng Xie

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 396

Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose \textbf{FoleyDesigner}, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporal controllable Foley generation, and professional mixing capabilities. Technically, FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate film industry-grade post-production practices. To address the lack of high-quality stereo Foley datasets in film, we introduce \textbf{FilmStereo}, the first professional stereo Foley dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For application, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with integration validated in film industrial-grade workflows.

Poster

CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Hanyang Wang ⋅ Yiyang Liu ⋅ Jiawei Chi ⋅ Fangfu Liu ⋅ Ran Xue ⋅ Yueqi Duan

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 395

Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called **CFG-Ctrl**, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional–unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity, especially at large guidance scales. To address this, we introduce Sliding Mode Control CFG (**SMC-CFG**), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales.
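
The proportional-controller view of vanilla CFG can be made concrete with the commonly used guidance formula, in which the conditional–unconditional difference acts as the error signal and the guidance scale acts as a fixed gain. The sketch below uses plain Python lists and hypothetical names; it illustrates only standard CFG, not the SMC-CFG controller, whose details are not given in the abstract.

```python
def cfg_velocity(v_cond: list[float], v_uncond: list[float], gain: float) -> list[float]:
    """Vanilla CFG: unconditional velocity plus a fixed gain times the error signal."""
    # error = v_cond - v_uncond is the conditional-unconditional discrepancy
    return [u + gain * (c - u) for c, u in zip(v_cond, v_uncond)]

# Hypothetical two-dimensional velocities, for illustration only.
v_c, v_u = [1.0, 2.0], [0.0, 1.0]
guided = cfg_velocity(v_c, v_u, gain=3.0)
```

A gain of 0 returns the unconditional prediction and a gain of 1 returns the conditional one; larger gains extrapolate linearly along the error direction, which is the regime where the abstract reports instability and overshooting.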

Poster

CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning

Darshan Singh S ⋅ Arsha Nagrani ⋅ Kawshik Manikantan ⋅ Harman Singh ⋅ Dinesh Tewari ⋅ Tobias Weyand ⋅ Cordelia Schmid ⋅ Anelia Angelova ⋅ Shachi Dave

Jun 7, 11:45 AM - 1:45 PM ExHall F 395

Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE, a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. We will release CURVE to foster the development of more equitable and capable multimodal foundation models.

Poster

TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space

Wenxuan Miao ⋅ Yulin Sun ⋅ Aiyue Chen ⋅ Jing Lin ⋅ Yiwu Yao ⋅ Yiming Gan ⋅ Jieru Zhao ⋅ Jingwen Leng ⋅ Minyi Guo ⋅ Yu Feng

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 396

The recent surge in video generation has shown the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT); however, they suffer from substantial inference delay due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly leverage sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT are primarily due to the dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85\%) compared to state-of-the-art techniques over 4 vDiTs, while preserving almost identical video quality ($<$0.06\% loss on VBench).

Poster

Refracting Reality: Generating Images with Realistic Transparent Objects

Yue Yin ⋅ Enze Tao ⋅ Dylan Campbell

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 398

Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image---a panorama centered at the object---using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
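
For reference, Snell's law relates the incidence and refraction angles by $n_1 \sin\theta_1 = n_2 \sin\theta_2$. The sketch below evaluates it for a single ray; the function name is hypothetical, the refractive indices in the usage line are illustrative values for air and a typical glass, and the paper's pixel warping procedure is not reproduced here.

```python
import math
from typing import Optional

def refraction_angle(theta_in: float, n_in: float, n_out: float) -> Optional[float]:
    """Refracted angle in radians from Snell's law, or None on total internal reflection."""
    s = n_in * math.sin(theta_in) / n_out
    if abs(s) > 1.0:
        return None  # no transmitted ray exists beyond the critical angle
    return math.asin(s)

# Illustrative case: a ray entering glass (n = 1.5) from air (n = 1.0) at 30 degrees.
theta_glass = refraction_angle(math.radians(30.0), 1.0, 1.5)
```

The ray bends toward the surface normal when entering the denser medium, and the `None` branch covers rays leaving the denser medium at steep angles.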

Poster

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Dahye Kim ⋅ Deepti Ghadiyaram ⋅ Raghudeep Gadde

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 397

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan $2.1$, respectively, without compromising the generation quality and prompt adherence.
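
To see why coarser patches reduce cost, note that the token count shrinks with the square of the patch size and that pairwise self-attention grows quadratically in the token count. The arithmetic is sketched below; the 64×64 latent size and the patch sizes 2 and 4 are hypothetical numbers chosen for illustration, and the quadratic cost model ignores the non-attention parts of the network.

```python
def token_count(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for a latent of size height x width."""
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def attention_cost_ratio(height: int, width: int, coarse_patch: int, fine_patch: int) -> float:
    """Pairwise-attention cost with coarse patches relative to fine patches."""
    coarse = token_count(height, width, coarse_patch)
    fine = token_count(height, width, fine_patch)
    return (coarse / fine) ** 2  # attention scales quadratically in the token count

# Hypothetical 64x64 latent: doubling the patch size from 2 to 4.
ratio = attention_cost_ratio(64, 64, coarse_patch=4, fine_patch=2)
```

Under these assumptions, doubling the patch size cuts the tokens by 4× and the attention cost by 16× for the steps that use it. End-to-end speedups are smaller, since only some steps use coarse patches and other layers scale linearly.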

Poster

HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph

Jinkai Zheng ⋅ jiaqing wei ⋅ Xinxiang Jin ⋅ Yaoqi Sun ⋅ Xichun Sheng ⋅ Ming Li ⋅ Liangqiong Qu ⋅ Xinchen Liu ⋅ Wu Liu

Jun 6, 11:45 AM - 1:45 PM ExHall F 397

In recent years, the gait parsing sequence has become increasingly popular due to its higher information entropy than the binary silhouette and the keypoint-based skeleton. However, existing parsing-based gait recognition methods have not fully explored the complex, non-linear relationships between features at the positional, semantic, and temporal-dynamics levels, i.e., higher-order correlations. To unleash the power of parsing between human body parts and temporal dynamics, this paper proposes a novel hypergraph-based gait recognition framework, named HyperGait. HyperGait contains a global head and two elaborately designed modules. In particular, the Spatial Hypergraph Convolutional Module (SHCM) and the Temporal Hypergraph Convolutional Module (THCM) are designed to explore the high-order spatial-level and temporal-level features, respectively. The SHCM extracts fine-grained relationships between human body parts through the hypergraph. The THCM models high-order temporal information between temporally related human body parts. Comprehensive experiments on two large-scale gait datasets, i.e., Gait3D and SUSTech1K, show the superior performance of our proposed HyperGait. In highly challenging real-world scenarios, with only parsing as input, HyperGait achieves a Rank-1 accuracy of 80.5\% on the Gait3D dataset.

Poster

Towards High-resolution and Disentangled Reference-based Sketch Colorization

Dingkun Yan ⋅ Xinrui Wang ⋅ Ru Wang ⋅ Zhuoru Li ⋅ Jinze Yu ⋅ Yusuke Iwasawa ⋅ Yutaka Matsuo ⋅ Jiaxian Guo

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 398

Sketch colorization models have been widely studied to automate and assist in the creation of animation frames and digital illustrations. However, current methods still fall short of industry-standard requirements for high-resolution synthesis and precise control of details. To further enhance the synthesis quality and controllability, we propose an image-referenced sketch colorization method based on the powerful SDXL backbone that leverages sketches as spatial guidance and RGB images as color references. A split cross-attention mechanism is coupled with spatial masks to separately colorize the foreground and background regions to avoid spatial entanglement. A tagger network trained on a massive anime-style image dataset is employed to extract attribute-level information from reference images and is integrated into the pipeline to provide precise control signals for synthesis. However, the increased resolution and number of attention layers in the SDXL backbone and precise reference information from the tagger network cause severe entanglement during colorization. We consequently combine a foreground encoder and a background encoder for disentanglement and better synthesis quality. Furthermore, a high-quality annotated and paired sketch colorization dataset is collected for fine-tuning. The proposed method is the first to achieve high-resolution, high-quality sketch colorization with precise control, and clearly outperforms existing methods in quantitative and qualitative validations, as well as in user studies on both quality and controllability. An ablation study reveals the influence of each component. Code and dataset will be made publicly available upon paper acceptance.

Poster

SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification

Huiyuan Huang ⋅ SANG MIN YOON

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 403

Person re-identification (Re-ID) requires a balance between discriminative capability and computational efficiency for real-world deployment. However, even the Visual State Space Model (SSM), despite its linear complexity, suffers from redundant computation due to dense token processing. We propose SSM-aware Token-Efficient VMamba (TE-VMamba), which integrates adaptive patch pruning and merging modules to reduce redundant tokens while preserving identity-discriminative cues. The layer-adaptive pruning strategy removes low-importance tokens in shallow layers to enhance efficiency, whereas the depth-aware merging strategy consolidates semantically similar tokens in deeper layers to improve representation compactness. Learnable layer-wise thresholds dynamically balance accuracy and computational cost across the network. On the Market-1501 benchmark, TE-VMamba reduces FLOPs by over 60\%, achieving significant computational savings while maintaining competitive accuracy. These results highlight the potential of structured token reduction in state-space models for efficient and powerful person re-identification.

Poster

FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing

Yilei Jiang ⋅ Zhen Wang ⋅ Yanghao Wang ⋅ Jun Yu ⋅ Yueting Zhuang ⋅ Jun Xiao ⋅ Long Chen

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 402

With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has improved remarkably, especially for **simple editing** that contains only a single editing target. However, to satisfy rapidly growing editing requirements, **complex editing** that contains multiple editing targets poses a more challenging task. Current complex editing solutions, single-round and multi-round editing, are limited by long text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency. In this paper, we propose **FlowDC**, which decouples the complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observe that the velocity component orthogonal to the editing displacement harms the preservation of source structure. Thus, we decompose the velocity and decay the orthogonal part for better source consistency. To evaluate effectiveness in complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.
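
The decompose-and-decay idea can be illustrated with a plain vector projection. This is a sketch under assumptions, not the FlowDC implementation: it treats the velocity and the editing displacement as flat vectors, and the function name `decay_orthogonal` and the `decay` factor are hypothetical.

```python
def decay_orthogonal(velocity, displacement, decay=0.5):
    """Split velocity into parts parallel and orthogonal to displacement,
    then scale only the orthogonal part by `decay`."""
    dd = sum(d * d for d in displacement)
    if dd == 0.0:
        return list(velocity)  # no editing direction defined; leave velocity unchanged
    scale = sum(v * d for v, d in zip(velocity, displacement)) / dd
    parallel = [scale * d for d in displacement]
    orthogonal = [v - p for v, p in zip(velocity, parallel)]
    return [p + decay * o for p, o in zip(parallel, orthogonal)]

# The component along the displacement (3.0) is kept; the orthogonal one (4.0) is halved.
print(decay_orthogonal([3.0, 4.0], [1.0, 0.0], decay=0.5))  # [3.0, 2.0]
```

With `decay=1.0` the velocity is returned unchanged, and with `decay=0.0` only the component along the editing displacement survives.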

Poster

CADC: Content Adaptive Diffusion-Based Generative Image Compression

Xihua Sheng ⋅ lingyu ZHU ⋅ Tianyu Zhang ⋅ Dong Liu ⋅ Shiqi Wang ⋅ Jing Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 402

Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck---arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input---prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec (CADC) with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization (UGAQ) method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration (ADGIC) method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning (BFATC) method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost. 
Comprehensive experimental results show that our codec achieves state-of-the-art perceptual quality at ultra-low bitrates.

Poster

Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers

Youngjun Jun ⋅ Seil Kang ⋅ Woojung Han ⋅ Seong Jae Hwang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 403

Video Diffusion Transformers (DiTs) synthesize high-quality video with high fidelity to text descriptions involving motion. However, the understanding of how Video DiTs convert motion words into video lags behind. Furthermore, prior studies on interpretable saliency maps primarily target objects, leaving unexplored how Video DiTs behave with respect to motion. In this paper, we inquire into concrete motion features that specify which object moves and at what time for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively renders per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose an automatic motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motions spatially and temporally. Our methods discover concept saliency maps without the need for any gradient-based training or parameters. Experimentally, our methods show standout localization capability in the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.

Poster

Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Xingyu Zhou ⋅ Qifan Li ⋅ Xiaobin Hu ⋅ Hai Chen ⋅ Shuhang Gu

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 404

The diffusion model presents a powerful ability to obtain the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn, the model will be penalized for failing to cover low-probability areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. The alternative line of guiding a diffusion model with a bad version of itself is limited by carefully designed degradation strategies, extra training, and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces auxiliary supervision on an intermediate layer during training and extrapolates the outputs of the intermediate and deep layers to obtain generative results during sampling. This simple strategy yields significant improvements in both training efficiency and generation quality on DiTs and SiTs. On ImageNet 256×256, SiT-XL/2+IG achieves FID=5.31 and FID=1.88, which already improves on the FID of the vanilla SiT-XL and REPA. More impressively, LightningDiT-XL/1+IG achieves FID=1.41, outperforming all of these methods by a large margin. Combined with classifier-free guidance, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.23.

Poster

3D-Object Perception Transformer (3PT)

Agastya Kalra ⋅ Tim Salzmann ⋅ Guy Stoppi ⋅ Dmitrii Marin ⋅ Rishav Agarwal ⋅ Vage Taamazyan ⋅ Martin Bokeloh ⋅ Stefan Hinterstoisser ⋅ Anton Boykov ⋅ Alberto Dall'Olio ⋅ Pravin Dangol ⋅ Kartik Venkataraman ⋅ Huaijin Chen

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 404

Current approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models. This limits deep object understanding and cross-domain generalization, making performance inadequate for real-world deployment. The 3D-Object Perception Transformer (3PT) addresses this limitation by unifying detection, segmentation, and 6DoF pose estimation in a single framework, directly trained for 3D-object perception. Based on two large-scale trained Transformers that specialize in 2D and 3D object-centric scene understanding respectively, 3PT continuously refines its object representations without depth input, enhancing 3D understanding by incorporating multi-view information. 3PT surpasses task-specialized models for detection and pose estimation, often achieving double-digit percentage improvements on the diverse BOP benchmarks. Achieving high accuracy and robustness, 3PT is well-suited for practical industrial robotics applications such as bin picking and precise insertion.

Poster

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Handong Li ⋅ Zikang Liu ⋅ Longteng Guo ⋅ Tongtian Yue ⋅ Yepeng Tang ⋅ Xinxin Zhu ⋅ Chuanyang Zheng ⋅ Ziming Wang ⋅ Zhibin Wang ⋅ Jun Song ⋅ Cheng Yu ⋅ Bo Zheng ⋅ Jing Liu

Jun 7, 3:30 PM - 5:30 PM ExHall A 404

Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which dynamically selects a subset of relevant video cubes to attend to for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark reduces FLOPs by up to 57% while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
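
A Top-p rule selects the smallest set of candidates whose probability mass reaches a threshold, so the number of selected items grows with the entropy of the score distribution. The sketch below shows that generic rule over a list of relevance scores; it is not AdaSpark's implementation, and the function name `top_p_select` and the use of a softmax over raw scores are assumptions.

```python
import math

def top_p_select(scores, p=0.9):
    """Return indices of the smallest set of items whose softmax mass reaches p."""
    if not scores:
        return []
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return selected

peaked = top_p_select([10.0, 0.0, 0.0, 0.0])  # one dominant cube: a single index suffices
flat = top_p_select([0.0, 0.0, 0.0, 0.0])     # uniform scores: every index is needed
```

A peaked (low-entropy) distribution keeps few candidates and a flat (high-entropy) one keeps many, which is how a fixed `p` yields an input-dependent compute budget.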

Poster

RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection

Hyeonjeong Park ⋅ Peixi Xiong ⋅ Xiaoqian Ruan ⋅ Dian Jia ⋅ Pei Yu ⋅ Wei Tang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 406

Monocular 3D object detection from a single RGB image remains difficult due to two fundamental challenges: the ill-posed nature of 3D localization, where multiple plausible configurations can correspond to the same 2D observation, and unreliable confidence estimation that fails to reflect true localization accuracy. Existing methods predict deterministic 3D boxes that often collapse to implausible mean estimates and rely on absolute confidence scores that are highly sensitive to localization errors. This paper introduces RARE, a unified framework that addresses both challenges through learning to rank and retrieve. RARE formulates confidence estimation as a ranking problem, learning to order detections by their relative quality rather than regressing absolute values. It provides more robust and stable confidence estimates that are less sensitive to localization uncertainty. Building on this improved confidence estimator, RARE learns to construct a query set for each object that predicts multiple diverse and plausible 3D configurations, and retrieves the top-ranked prediction. It explicitly models the multimodal nature of monocular 3D perception and produces more plausible localizations. Extensive experiments demonstrate the effectiveness of RARE. We will make the code publicly available.

Poster

NaTex: Seamless Texture Generation as Latent Color Diffusion

Zeqiang Lai ⋅ Yunfei Zhao ⋅ Zibo Zhao ⋅ Xin Yang ⋅ Xin Huang ⋅ Jingwei Huang ⋅ Xiangyu Yue ⋅ Chunchao Guo

Jun 6, 11:45 AM - 1:45 PM ExHall F 407

We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE–DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.

Poster

Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

jian ma ⋅ Qirong Peng ⋅ Xujie Zhu ⋅ Peixing Xie ⋅ Chen Chen ⋅ Haonan Lu

Jun 6, 11:45 AM - 1:45 PM ExHall F 409

Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50\% reduction in parameter count compared to the full model, with less than 3\% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments.

Poster

SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors

Aixuan Li ⋅ Mochu Xiang ⋅ Bosen Hou ⋅ Zhexiong Wan ⋅ Jing Zhang ⋅ Yuchao Dai

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 410

Adversarial robustness of BEV 3D object detectors is critical for autonomous driving (AD). Existing invasive attacks require altering the target vehicle itself (*e.g.* attaching patches), making them unrealistic and impractical for real-world evaluation. While non-invasive attacks that place adversarial objects in the environment are more practical, current methods still lack the multi-view and temporal consistency needed for physically plausible threats. In this paper, we present the first framework for generating universal, non-invasive, and 3D consistent adversarial objects that expose fundamental vulnerabilities for BEV 3D object detectors. Instead of modifying target vehicles, our method inserts rendered objects into scenes with an occlusion-aware module that enforces physical plausibility across views and time. To maintain attack effectiveness across views and frames, we optimize adversarial object appearance using a BEV spatial feature-guided optimization strategy that attacks the detector's internal representations. Extensive experiments demonstrate that our learned universal adversarial objects can consistently degrade multiple BEV detectors from various viewpoints and distances. More importantly, the new environment-manipulation attack paradigm exposes models' over-reliance on contextual cues and provides a practical pipeline for robustness evaluation in AD systems.

Poster

A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

SuYeon Kim ⋅ Wongyu Lee ⋅ MyeongAh Cho

Jun 7, 11:45 AM - 1:45 PM ExHall F 411

3D anomaly detection targets the detection and localization of defects in 3D point clouds with models trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE), where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art performance for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.

Poster

Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations

Sudong Cai ⋅ Shuai Yuan ⋅ Bingzhi Chen ⋅ Rui Mao ⋅ Bing Wang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 412

Self-attention with separate pre- and post-projections can be a universal approximator (on compact domains) under mild conditions. Yet we observe a striking gap: an attention-only Transformer (w/o FFN layers) exhibits a marked accuracy drop relative to its standard interleaved attention–FFN baseline. We term this the **weak-independence** challenge of attention. We study this through a new conceptual lens, **Selection-as-Nonlinearity (SaN)**, which interprets effective nonlinearity as directed, cost-constrained selection, offering a coherent account of attention as context-gated activation. In this joint game–decision view, attention performs a resource-constrained cooperative allocation over values: each query distributes a unit-mass weight budget over shared values to optimize representational utility, under a normalizer (e.g., $\mathrm{softmax}$), and guided by context-derived scores (e.g., q-k similarities). SaN interprets *weak-independence* as a structural tension: the value weights can hardly attain the decoupled per-query (row-wise) and per-value (column-wise) optima simultaneously under shared budgets, thereby limiting attention's stand-alone capacity. Guided by SaN, we introduce **CSaN**, an interpretable, efficient attention compensation paradigm with two key insights: **1) hierarchical budget calibration,** *re-allocating* row budgets via inter-query correction signals; and **2) public-private cooperation,** enhancing the *public* attention pathway with a per-token *private* value pathway to decouple conflicting demands. CSaN is evaluated on various vision benchmarks and demonstrates *level-jump gains* across popular Transformer families (Swin, ViT, Hiera), enabling models to rival much heavier same-family counterparts $\sim2\times$ as large.

Poster

Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers

Wongi Jeong ⋅ Kyungryeol Lee ⋅ Hoigi Seo ⋅ Se Young Chun

Jun 6, 11:45 AM - 1:45 PM ExHall F 412

Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a great challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We find that naive latent upsampling for spatial acceleration introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatching from noise-timestep discrepancies. Based on these findings and analyses, we propose a training-free spatial acceleration framework, dubbed Region-Adaptive Latent Upsampling (RALU), which mitigates those artifacts while achieving spatial acceleration of DiTs through mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration with early upsampling only on artifact-prone edge regions and noise-timestep matching for different latent resolutions, leading to up to 7.0$\times$ speedup on FLUX-1.dev and 3.0$\times$ on Stable Diffusion 3 with negligible quality degradation. Furthermore, RALU is complementary to existing temporal acceleration methods and timestep-distilled models, leading to up to 15.9$\times$ speedup.

Poster

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

Seng Nam Chen ⋅ Hao Chen ⋅ Chenglam Ho ⋅ Xinyu Mao ⋅ Jinping Wang ⋅ Yu Zhang ⋅ Chao Li

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 417

Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision–language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: Can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by +7.11%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.

Poster

Scene Reconstruction as Mapping Priors for 3D Detection

Yang Fu ⋅ Yuliang Zou ⋅ Hao Xiang ⋅ Xin Huang ⋅ Yijing Bai ⋅ Chen Song ⋅ Weijing Shi ⋅ Govind Thattai ⋅ Dragomir Anguelov ⋅ Mingxing Tan ⋅ Yingwei Li

Jun 6, 11:45 AM - 1:45 PM ExHall F 418

In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks like 3D object detection. Maps can provide robust structural priors of the static environment, suited to resolving ambiguities and correcting for sensor data sparsity or noise, issues especially prevalent for distant objects or during adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for achieving efficient, large-scale deployment. In this paper, we propose a scalable solution to systematically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Prior Augmented 3D detection (MPA3D) framework to effectively integrate the mapping priors with the distinct modalities of sensor data. Our extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, proving the effectiveness of using scalable, reconstructed scene priors to enhance 3D detection.

Poster

EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

Taegyoon Yoon ⋅ Yegyu Han ⋅ Seojin Ji ⋅ Jaewoo Park ⋅ Sojeong Kim ⋅ Taein Kwon ⋅ Hyung-Sin Kim

Jun 7, 3:30 PM - 5:30 PM ExHall A 418

Smart glasses are emerging as a useful device because they provide plenty of insights in hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios (industrial maintenance, sports, and emergency rescue) designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no improvement in extreme conditions. In contrast, a tracking-based approach yields performance gains, implying that using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset will be publicly available.

Poster

Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

Xuanpu Zhao ⋅ Zhentao Tan ⋅ Dianmo Sheng ⋅ Tianxiang Chen ⋅ Yao Liu ⋅ Yue Wu ⋅ Tao Gong ⋅ Qi Chu ⋅ Nenghai Yu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 419

To enhance the perception and reasoning capabilities of multimodal large language models (MLLMs) in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize image cropping to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning (SFT) and reinforcement learning (RL), have made significant progress, our empirical analysis reveals a key limitation. When random noise is added to the cropped images, the models still maintain most of their performance, especially those trained with reinforcement learning only, indicating a heavy reliance on the global input and a weak dependence on details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning about fine-grained details in MLLMs.

Poster

Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection

Zhihao Zhang ⋅ Abhinav Kumar ⋅ Girish Chandar Chandar ⋅ Xiaoming Liu

Jun 6, 11:45 AM - 1:45 PM ExHall F 420

Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image. Without auxiliary sensors such as LiDAR, this task is inherently ill-posed since the 3D-to-2D projection introduces depth ambiguity. Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection. However, simply enforcing such correlations through sequential prediction can propagate errors across attributes, especially when objects are occluded or truncated, where inaccurate size or orientation predictions can further amplify depth errors. Therefore, neither parallel nor sequential prediction is optimal. In this paper, we propose MonoCoP, an adaptive framework that learns when and how to leverage inter-attribute correlations with two complementary designs. A Chain-of-Prediction (CoP) explores inter-attribute correlations through feature-level learning, propagation, and aggregation, while an Uncertainty-Guided Selector (GS) dynamically switches between CoP and parallel paradigms for each object based on the predicted uncertainty. By combining their strengths, MonoCoP achieves state-of-the-art (SOTA) performance on KITTI, nuScenes, and Waymo, significantly improving depth accuracy, particularly for distant objects.
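
The coupling between size and depth described in this abstract can be illustrated with a minimal sketch under a plain pinhole-camera assumption. This is not MonoCoP's implementation: the function name `depth_from_height` and all numbers are hypothetical. Under a pinhole model, an object of physical height H at depth z projects to an image height h = f·H/z, so a depth derived from a predicted size inherits that size's relative error.

```python
def depth_from_height(focal_px: float, height_m: float, height_px: float) -> float:
    """Pinhole relation h = f * H / z, solved for the depth z (in metres)."""
    if height_px <= 0:
        raise ValueError("projected height must be positive")
    return focal_px * height_m / height_px


# Illustrative numbers (assumptions, not taken from any dataset).
focal_px = 700.0      # focal length in pixels
true_height_m = 1.5   # true 3D object height
height_px = 35.0      # observed 2D box height

true_depth = depth_from_height(focal_px, true_height_m, height_px)
# A 10% overestimate of the 3D height yields a 10% overestimate of depth.
biased_depth = depth_from_height(focal_px, 1.1 * true_height_m, height_px)
relative_depth_error = (biased_depth - true_depth) / true_depth
```

In this toy setting the size error carries over to depth one-to-one, which is the kind of error propagation that a purely sequential predictor would be exposed to.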

Poster

Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation

Nassim Ali Ousalah ⋅ Peyman Rostami ⋅ Vincent Gaudillière ⋅ Emmanuel Koumandakis ⋅ Anis Kacem ⋅ Enjie Ghorbel ⋅ Djamila Aouada

Jun 7, 3:30 PM - 5:30 PM ExHall A 421

In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-$n$-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
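
As a generic illustration of covariance pooling (the abstract does not give the exact formulation, so the regularizer $\epsilon$ and the notation below are assumptions), let $x_1, \dots, x_N \in \mathbb{R}^C$ be the feature vectors at the $N$ spatial positions of a $C$-channel feature map:

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\Sigma = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^{\top} + \epsilon I, \quad \epsilon > 0 .
$$

The sample covariance is symmetric positive semi-definite, and adding $\epsilon I$ makes $\Sigma$ strictly positive definite. Any SPD matrix admits a unique Cholesky factorization

$$
\Sigma = L L^{\top},
$$

with $L$ lower triangular and strictly positive diagonal entries, so an SPD matrix can be parameterized by the $C(C+1)/2$ free entries of $L$.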

Poster

Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices

He Zhu ⋅ Xiaotong Huang ⋅ Zihan Liu ⋅ Weikai Lin ⋅ Xiaohong Liu ⋅ Zhezhi He ⋅ Jingwen Leng ⋅ Minyi Guo ⋅ Yu Feng

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 423

3D Gaussian Splatting (3DGS) has become a crucial rendering technique for many real-time applications. However, the limited hardware resources on today's mobile platforms hinder these applications, as they struggle to achieve real-time performance. In this paper, we propose Seele, a general framework designed to accelerate the 3DGS pipeline for resource-constrained mobile devices. Specifically, we propose two GPU-oriented techniques: hybrid preprocessing and contribution-aware rasterization. Hybrid preprocessing alleviates the GPU compute and memory pressure by reducing the number of irrelevant Gaussians during rendering. The key is to combine our view-dependent scene representation with online filtering. Meanwhile, contribution-aware rasterization improves the GPU utilization at the rasterization stage by prioritizing Gaussians with high contributions while reducing computations for those with low contributions. Both techniques can be seamlessly integrated into existing 3DGS pipelines with minimal fine-tuning. Collectively, our framework achieves up to 6.3$\times$ speedup and 39.1\% model reduction while delivering superior rendering quality compared to existing methods. Our code will be released upon publication.

Poster

MatLat: Material Latent Space for PBR Texture Generation

Kyeongmin Yeo ⋅ Yunhong Min ⋅ Jaihoon Kim ⋅ Minhyuk Sung

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 425

We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, **MatLat**, through targeted fine-tuning. Unlike prior methods that freeze the embedding network, which leads to distribution shifts when encoding additional PBR channels and hinders subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel–latent spatial correspondence. Ablation studies and comparisons with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.

Poster

VMonarch: Efficient Video Diffusion Transformers with Structured Attention

Cheng Liang ⋅ Haoxian Chen ⋅ Liang Hou ⋅ Qi Fan ⋅ Gangshan Wu ⋅ Xin Tao ⋅ Limin Wang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 426

The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal fine-tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of $17.5$, and achieves a speedup of over $5\times$ in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90\% sparsity.

Poster

Draft and Refine with Visual Experts

SungHeon Jeong ⋅ Ryozo Masukawa ⋅ Jihong Park ⋅ Sanggeon Yun ⋅ Wenjun Huang ⋅ Hanning Chen ⋅ Mahdi Imani ⋅ Mohsen Imani

Jun 6, 11:45 AM - 1:45 PM ExHall F 426

While recent Large Vision–Language Models (LVLMs) exhibit impressive multimodal reasoning abilities, they often produce ungrounded, *hallucinated* responses by over-relying on linguistic priors rather than visual evidence. This critical limitation arises from the lack of a quantitative measure of how much these models actually rely on visual inputs during reasoning. We propose **Draft and Refine (DnR)**, an agent framework driven by a novel *question-conditioned utilization metric*. This metric quantifies the model’s actual reliance on visual evidence by first constructing a *query-conditioned relevance map* to localize question-specific evidence, and then assessing dependence through relevance-based probabilistic masking. Guided by this metric, the DnR agent refines its initial *draft* through targeted feedback from external visual experts. Each expert’s output (e.g., boxes, masks) is rendered as visual cues on the image, and the VLM is re-queried to select the response that yields the greatest improvement in utilization. This process strengthens visual grounding of predictions without retraining or architectural changes. Experiments across a broad range of VQA and captioning benchmarks demonstrate consistent accuracy gains and reduced hallucination. These results show that quantifying visual utilization provides a principled path for designing more interpretable and evidence-driven multimodal agent systems that effectively leverage visual experts.

Poster

EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images

Minh-Quan Viet Bui ⋅ Jongmin Park ⋅ Juan Luis Gonzalez Bello ⋅ Jaeho Moon ⋅ Jihyong Oh ⋅ Munchurl Kim

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 426

Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT), where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF), where our model learns to rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks. Code and project page will be released.
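
The idea of meeting a target primitive count by ranking can be sketched as a plain top-K selection. This is only an illustration, not EcoSplat's IGF stage (which also adjusts the parameters of the retained primitives); the function `select_primitives` and the importance scores are hypothetical.

```python
import heapq
from typing import List, Sequence


def select_primitives(importance: Sequence[float], target_count: int) -> List[int]:
    """Return indices of the `target_count` highest-scoring primitives, in original order."""
    if target_count < 0:
        raise ValueError("target_count must be non-negative")
    k = min(target_count, len(importance))
    # Rank indices by score and keep the k best.
    top = heapq.nlargest(k, range(len(importance)), key=lambda i: importance[i])
    return sorted(top)


scores = [0.9, 0.1, 0.5, 0.7, 0.3]   # hypothetical per-primitive importance scores
kept = select_primitives(scores, 3)  # indices of the primitives kept under a budget of 3
```

Because the budget is an argument rather than a fixed property of the model, the same scores can serve any target count at inference time.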

Poster

Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction

Meng Wang ⋅ Changqun Xia ⋅ Yuze Wang ⋅ Junyi Wang ⋅ Wantong Duan ⋅ Xinxiong Xie ⋅ Yue Qi

Jun 7, 11:45 AM - 1:45 PM ExHall F 427

Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions. In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for the compact unified reconstruction and high-fidelity rendering of urban scenes from both aerial and street views. Specifically, we first develop an Aerial-Street Joint Adaptive Densification method to resolve the densification conflicts arising from large view coverage disparity. We then introduce a Contribution-based Anchor Pruning strategy to effectively mitigate the storage overhead from capturing multi-scale scene details. Furthermore, we propose a Global-to-Local Optimization strategy to refine the reconstruction of under-optimized regions resulting from imbalanced view distributions. Experiments across diverse urban scene datasets demonstrate that Urban-GS significantly outperforms the state-of-the-art method in novel-view rendering quality, while simultaneously reducing storage overhead by an average of 41\%.

Poster

HDR-VLM: HDR-Domain Adaptation of VLMs and Preference-Aligned Quality Assessment for HDR Video Color Grading

Hao Yuan ⋅ Jiabin Zhang ⋅ Yajing Wu ⋅ Ruixuan Pang ⋅ Jing Li

Jun 7, 3:30 PM - 5:30 PM ExHall A 427

Color grading is central to High Dynamic Range (HDR) video production, shaping the perceptual tone, contrast, and luminance of content across diverse displays. However, evaluating HDR color grading quality is particularly difficult due to its semantic, content-dependent nature and the lack of large-scale annotated data. While pre-trained Vision–Language Models (VLMs) offer strong semantic priors and generalization ability, their exposure is limited to Standard Dynamic Range (SDR) data, making them poorly equipped to handle HDR photometry and perceptual nuances. We propose HDR-VLM, the first method to adapt a VLM to the HDR domain for perceptual quality assessment. Specifically, HDR-VLM employs a two-stage design: it first bridges the domain gap using a unified HLG-based encoding and progressive adaptation; then it aligns model assessments with noisy, multi-scale human preferences via reinforcement learning with curriculum-inspired rewards. Experiments on a real-world, production-sourced HDR dataset show that HDR-VLM not only outperforms existing quality assessment methods but also produces interpretable attribution rationales. These rationales offer actionable guidance for content creators, enhancing the reliability and transparency of automated HDR quality evaluation.

Poster

FastGS: Training 3D Gaussian Splatting in 100 Seconds

Shiwei Ren ⋅ Tianci Wen ⋅ Yongchun Fang ⋅ Biao Lu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 434

The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.29× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-6× training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping.

Poster

Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

Xiangyu Sun ⋅ Haoyi Jiang ⋅ Liu Liu ⋅ Seungtae Nam ⋅ Gyeongjin Kang ⋅ Xinjie wang ⋅ Wei Sui ⋅ Zhizhong Su ⋅ Wenyu Liu ⋅ Xinggang Wang ⋅ Eunbyung Park

Jun 7, 11:45 AM - 1:45 PM ExHall F 434

Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce a novel feed-forward framework that reconstructs 3D scenes from unposed multi-view images. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction—all within a single, feed-forward pass. Extensive experiments demonstrate this method establishes a new state-of-the-art across multiple benchmarks, including RE10K and ScanNet. Our work signifies a novel paradigm towards generalizable 3D scene reconstruction.

Poster

TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction

Yihui Li ⋅ Chengxin Lv ⋅ Zichen Tang ⋅ Hongyu Yang ⋅ Di Huang

Jun 7, 3:30 PM - 5:30 PM ExHall A 436

We present **TokenSplat**, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a **Token-aligned Gaussian Prediction** module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an **Asymmetric Dual-Flow Decoder (ADF-Decoder)** that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods.

Poster

GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space

Yonghao Zhao ⋅ Yupeng Gao ⋅ Jian Yang ⋅ Jin Xie ⋅ Beibei Wang

Jun 7, 3:30 PM - 5:30 PM ExHall A 437

Recent advances in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made it standard practice to reconstruct 3D scenes from multi-view images. Removing objects from such 3D representations is a fundamental editing task that requires complete and seamless inpainting of occluded regions, ensuring consistency in geometry and appearance. Although existing methods have made notable progress in improving inpainting consistency, they often neglect global lighting effects, leading to physically implausible results. Moreover, these methods struggle with view-dependent non-Lambertian surfaces, where appearance varies across viewpoints, leading to unreliable inpainting. In this paper, we present 3D **G**aussian **O**bject **R**emoval in the **I**ntrinsic **S**pace (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. Our approach decomposes the scene into intrinsic components and explicitly models light transport to keep global lighting effects consistent. Furthermore, we introduce an intrinsic-space inpainting module that operates directly in the material and lighting domains, effectively addressing the challenges posed by non-Lambertian surfaces. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework substantially improves the physical consistency and visual coherence of object removal, outperforming existing methods by 13\% in perceptual similarity (LPIPS) and 2 dB in peak signal-to-noise ratio (PSNR).
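
For readers less familiar with the PSNR scale, the standard definition helps put a 2 dB gain in context. This is a general identity, not a result from the paper, and it applies to a single image with a fixed peak value $\mathrm{MAX}$:

$$
\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_{\text{new}}}{\mathrm{MSE}_{\text{old}}} = 10^{-\Delta\mathrm{PSNR}/10} .
$$

With $\Delta\mathrm{PSNR} = 2$ dB, the ratio is $10^{-0.2} \approx 0.63$, i.e. roughly a 37\% reduction in mean squared error. When PSNR is averaged over many images, this conversion holds only approximately.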

Poster

AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction

Tingyun Li ⋅ Xinyi Liu ⋅ Yongjun Zhang ⋅ Yi Wan ⋅ Xiaoan Liu ⋅ Weiwei Fan ⋅ Jiahao Liu

Jun 7, 3:30 PM - 5:30 PM ExHall A 438

Monocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects. Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation. We introduce AeroGS, a scale-aware Gaussian splatting framework that jointly recovers camera trajectories and reconstructs dynamic scenes from pose-free monocular videos. Central to our method are scale-aware spatio-temporal anchors (S$^2$A-Anchors), which enable a unified optimization via three key decoupling mechanisms: (i) separating ego-motion from object motion, (ii) isolating static geometry from temporal deformation, and (iii) adapting scale between distant terrain and nearby objects. This design effectively stabilizes optimization under large motion and scale imbalance. Extensive experiments on UAV and driving benchmarks show that AeroGS achieves state-of-the-art rendering quality (PSNR/LPIPS), precise trajectory recovery (ATE/RPE), and faithful motion reconstruction, consistently surpassing recent pose-free baselines.

Poster

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Alberto Compagnoni ⋅ Marco Morini ⋅ Sara Sarto ⋅ Federico Cocchi ⋅ Davide Caffagni ⋅ Marcella Cornia ⋅ Lorenzo Baraldi ⋅ Rita Cucchiara

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 439

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Source code and models will be made publicly available.

Poster

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Zichuan Lin ⋅ Yicheng Liu ⋅ Yang Yang ⋅ Lvfang Tao ⋅ Deheng Ye

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 441

Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
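
The contrast between a single coupled advantage and per-objective advantages can be sketched with group-normalized rewards, the normalization commonly associated with GRPO. This is an illustration under stated assumptions, not the DTPO algorithm itself: the reward values, the two reward definitions, and the function `group_normalized_advantages` are hypothetical, and the abstract does not specify how tokens are assigned to each objective.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
tool_rewards = [1.0, 0.0, 1.0, 0.0]       # assumed: was the crop tool used correctly?
accuracy_rewards = [1.0, 1.0, 0.0, 0.0]   # assumed: was the final answer correct?

# Decoupled: one advantage per objective, to be applied to that objective's tokens.
tool_adv = group_normalized_advantages(tool_rewards)
acc_adv = group_normalized_advantages(accuracy_rewards)

# Coupled baseline: a single advantage computed from the summed reward.
joint_adv = group_normalized_advantages(
    [t + a for t, a in zip(tool_rewards, accuracy_rewards)]
)
```

In this toy group, the second and third responses each get one objective right and one wrong. Their coupled advantage is zero, so they provide no learning signal, whereas the decoupled advantages still reward the part that succeeded and penalize the part that failed.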

Poster

PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

shuyan ke ⋅ Yifan Mei ⋅ Changli Wu ⋅ yonghan zheng ⋅ Jiayi Ji ⋅ Liujuan Cao ⋅ Rongrong Ji

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 441

Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data introduces fundamentally different challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these UAV-specific conditions, we formally define the UAV Reasoning Segmentation task and organize its semantic demands into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, the first large-scale UAV reasoning segmentation benchmark, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision covering all three reasoning types. We further propose PixDLM, a pixel-level multimodal language model equipped with a Dual-Path Vision Encoder that preserves fine-grained high-resolution cues while maintaining strong global semantic alignment. Extensive experiments on DRSeg demonstrate that PixDLM achieves superior semantic consistency and spatial localization accuracy compared with existing multimodal models, offering a unified and efficient baseline for UAV reasoning segmentation. All datasets, models, and code will be released.

Poster

Learning Differentiable Hierarchies in 3D Gaussian Splatting

Youqi Pan ⋅ Wugen Zhou ⋅ Hongbin Zha

Jun 7, 3:30 PM - 5:30 PM ExHall A 441

Although 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its unordered Gaussians make level-of-detail (LoD) construction and model compression highly challenging, limiting its applicability in customized scenarios. In this work, we propose a learning-based Gaussian hierarchy representation that ranks Gaussians by their contribution to the scene, enabling flexible LoD representations across arbitrary Gaussian counts. We first introduce a unified, continuous formulation and metric for Gaussian hierarchy. Then, we introduce a hierarchy-based modulated rendering method built upon a Differentiable Decreasing Step Function, which enables efficient hierarchy learning while maintaining approximately equivalent rendering. Moreover, we develop a PDF-Guided Active-Region Sampling strategy that encourages the learned hierarchy to become widely distributed within its value range. Our method requires no additional training stages and produces Gaussian hierarchies within training time comparable to classical 3DGS. Experiments on multiple datasets show that our approach achieves performance comparable to or surpassing state-of-the-art methods in both LoD rendering and model pruning.
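
The abstract does not define its Differentiable Decreasing Step Function, so the following is only a generic stand-in: a sigmoid that smooths the hard rule "keep a Gaussian if its hierarchy value is at least the LoD threshold". The function name, the temperature parameter, and the assumption that hierarchy values and thresholds lie in [0, 1] are all hypothetical.

```python
import math


def soft_keep_weight(hierarchy_value: float, threshold: float, temperature: float = 0.05) -> float:
    """Smooth surrogate for the hard rule `keep if hierarchy_value >= threshold`.

    For a fixed hierarchy value, the weight decreases monotonically as the
    threshold increases, and it is differentiable in both arguments.
    Inputs are assumed to lie in [0, 1].
    """
    return 1.0 / (1.0 + math.exp((threshold - hierarchy_value) / temperature))


# A Gaussian with hierarchy value 0.6, evaluated at increasingly strict thresholds.
weights = [soft_keep_weight(0.6, t) for t in (0.2, 0.6, 0.9)]
```

A smaller temperature brings the surrogate closer to a hard step, at the cost of vanishing gradients away from the threshold.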

Poster

Agile Deliberation: Concept Deliberation for Subjective Visual Classification

Leijie Wang ⋅ Otilia Stretcu ⋅ Wei Qiao ⋅ Thomas Denby ⋅ Krishnamurthy Viswanathan ⋅ Enming Luo ⋅ Chun-Ta Lu ⋅ Tushar Dogra ⋅ Ranjay Krishna ⋅ Ariel Fuxman

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 443

From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called Agile Deliberation that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user’s evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5\% higher $F_1$ scores than automated decomposition baselines and more than 3\% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.

Poster

Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM

Yunsong Wang ⋅ Gim Hee Lee

Jun 7, 11:45 AM - 1:45 PM ExHall F 442

Handling dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling the temporal centers of dynamic Gaussians at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency. Our code will be made publicly available upon paper acceptance.

Poster

TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking

Hanzhi Guo ⋅ dongdong weng ⋅ Mo Su ⋅ Yixiao Chen ⋅ Xiaonuo Dongye ⋅ Chenyu Xu

Jun 7, 3:30 PM - 5:30 PM ExHall A 444

Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable precise 3D keypoint tracking.

View full details

Poster

What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models

Yingqi Fan ⋅ Junlong Tong ⋅ Anhao Zhao ⋅ Xiaoyu Shen

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 447

Multimodal LLMs (MLLMs) convert images into visual tokens for language-model processing, yet how these tokens encode semantics remains unclear. In this paper, we identify a consistent token structure across models: visual tokens cluster into sink, dead, and alive groups, with only the alive tokens ($\approx60$%) carrying meaningful information. Sink and dead tokens can be removed without hurting performance. Using a patch-compression benchmark and our probing tool *EmbedLens*, we show that alive tokens already encode fine-grained cues (objects, colors, OCR) before entering the LLM. Internal visual computation (visual attention and FFNs) is redundant and offers limited benefit for most tasks. This redundancy also extends to the model's depth: our analysis shows that alive tokens align best with mid-layer LLM representations, while shallow layers contribute little. These findings provide a unified view of visual semantics in MLLMs and motivate architectures that use fewer visual tokens, reduced visual computation, and mid-layer injection for better efficiency and interpretability.

View full details

Poster

Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting

Qinzheng Zhou ⋅ Zaychik Liu ⋅ Lijing Lu ⋅ Zhihang Li

Jun 6, 11:45 AM - 1:45 PM ExHall F 448

Sparse-view 3D Gaussian Splatting (3DGS) reconstructs scenes using 3D Gaussians from sparse input views. Yet, this method is prone to overfitting, which is exacerbated at higher resolutions as the expanded dimensionality amplifies floating artifacts and reconstruction ambiguities. In this paper, we present a systematic study of 3DGS under sparse-view conditions and varying input resolutions. While prior work has overlooked resolution as a key factor in sparse-view performance, we identify and quantify a trade-off: lower-resolution inputs facilitate stable global geometry reconstruction, whereas higher-resolution inputs enable finer detail recovery but introduce high-frequency artifacts and instability. Building on this insight, we further propose **CAGS**, a Confidence-Guided Multi-Scale Aggregation method that reconstructs scenes through a coarse-to-fine hierarchical optimization process. Our approach employs a matching-based weighted aggregation strategy that anchors high-resolution reconstructions to stable structural priors and filters noise through cross-scale consistency, and a multi-scale pseudo-view regularization to refine local details without amplifying noise. Extensive experiments on the LLFF and Mip-NeRF360 datasets demonstrate that CAGS significantly outperforms existing methods, particularly under demanding high-resolution conditions. Moreover, our paradigm can be seamlessly integrated into other 3DGS-based pipelines, thereby extending the field from low-resolution reconstructions to high-fidelity outputs under real-world sparse-view constraints.

View full details

Poster

SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method

Wentao Yang ⋅ FanZhen KONG ⋅ Zejian Kang ⋅ Xiangru Huang

Jun 7, 3:30 PM - 5:30 PM ExHall A 448

3D Gaussian Splatting (3DGS) has gained tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-Lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods proposes to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based methods is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the interdependence among individual Gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as the active set method. To this end, we propose \textbf{SparseOIT}, an OIT-based 3DGS reconstruction algorithm that maintains an active set of Gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT-family by a large margin and also achieves comparable performance to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering.

View full details

Poster

RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting

Ji Shi ⋅ Xianghua Ying ⋅ Bowei Xing ⋅ Ruohao Guo ⋅ Wenzhen Yue

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 450

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present **RT-Splatting**, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing.

View full details

Poster

ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting

Yingdong Gu ⋅ Shaocheng Yan ⋅ Zhenjun Zhao ⋅ Yuan Kou ⋅ Jianxin Luo ⋅ Pengcheng Shi ⋅ Jiayuan Li

Jun 6, 11:45 AM - 1:45 PM ExHall F 449

Visual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted $\alpha$-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted feature fusion. We further introduce keypoint-consensus landmark sampling to select reliable Gaussians and local geometric consistency verification to reject mismatches caused by rendering artifacts. On the Cambridge Landmarks dataset, ULF-Loc reduces the mean median translation error by 17\% compared to the state-of-the-art, while achieving superior efficiency with only 1/10 the training time and 1/6 the GPU memory of STDLoc.

View full details

Poster

BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction

Alessio Mazzucchelli ⋅ María Naranjo Almeida ⋅ Jorge Bustos Sanchez ⋅ Mariella Dimiccoli ⋅ Francesc Moreno-Noguer ⋅ Jordi Sanchez-Riera ⋅ Adrian Penate-Sanchez

Jun 7, 3:30 PM - 5:30 PM ExHall A 452

Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene don't optimize the underlying 3D geometry of the scene. This makes object-level editing or asset extraction challenging. Recent methods, like COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the geometry of the scene to represent the underlying semantics. We go a step further and propose a novel solution that provides near-perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of: (1) modifying the geometry of visible Gaussians to respect semantic boundaries, and (2) modifying the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization to allow for seamless integration within the optimization of the Gaussian parameters. Our second loss also propagates gradients to the Gaussian parameters, but does so without passing through the rasterization. This allows it to modify the geometry of the scene even when little transmittance reaches a Gaussian (partially visible or non-visible). Exhaustive comparisons to 12 state-of-the-art methods over 4 datasets, using six metrics, demonstrate that our approach produces overall the best boundary segmentation to date.

View full details

Poster

E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction

Yunsoo Kim ⋅ Changki Sung ⋅ Dasol Hong ⋅ Hyun Myung

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 455

The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods still rely on known poses or depend on depth estimation models and auxiliary modalities such as RGB-D. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we exploit the distinct spatio-temporal characteristics of edges and non-edge regions. The event camera's movement induces consistent events along edges, while non-edge regions produce sparse noise. We leverage this through a patch-based temporal coherence analysis that measures local variance to extract edges while robustly suppressing noise. The extracted edges guide structure-aware Gaussian initialization and enable edge-weighted losses throughout initialization, tracking, and bundle adjustment. Extensive experiments on both synthetic and real datasets demonstrate that E2EGS achieves superior reconstruction quality and trajectory accuracy, establishing a fully pose-free paradigm for event-based 3D reconstruction.

View full details

Poster

ApET: Approximation-Error Guided Token Compression for Efficient VLMs

Qiankun Ma ⋅ Ziyao Zhang ⋅ Haofei Wang ⋅ Zhen Song ⋅ Jie Chen ⋅ Hairong Zheng

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 454

Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an \textbf{Ap}proximation-\textbf{E}rror \textbf{T}oken compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2\% of the original performance on image-understanding tasks and even attains 100.4\% on video-understanding tasks, while compressing the token budgets by 88.9\% and 87.5\%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical.
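
The reconstruct-then-rank idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not ApET itself: the abstract does not say how basis tokens are chosen, so the sketch picks them by uniform striding, always keeps them, and treats tokens with the smallest least-squares reconstruction error as the least informative. The function name `select_tokens` and its parameters are hypothetical.

```python
import numpy as np

def select_tokens(tokens, num_basis, num_keep):
    """Return sorted indices of tokens to keep; tokens has shape (n, d)."""
    n = tokens.shape[0]
    # Assumption: basis tokens are sampled at a uniform stride.
    basis_idx = np.unique(np.linspace(0, n - 1, num_basis).round().astype(int))
    basis = tokens[basis_idx]                                  # (k, d)
    # Least-squares coefficients so that coeffs.T @ basis approximates tokens.
    coeffs, *_ = np.linalg.lstsq(basis.T, tokens.T, rcond=None)  # (k, n)
    errors = np.linalg.norm(tokens - (basis.T @ coeffs).T, axis=1)
    errors[basis_idx] = -np.inf        # basis tokens are kept separately
    extra = max(num_keep - len(basis_idx), 0)
    ranked = np.argsort(-errors)[:extra]   # worst-approximated tokens first
    return np.sort(np.concatenate([basis_idx, ranked]))
```

No attention weights are read anywhere in the selection, which is why a scheme of this kind does not need access to the attention matrix that fused kernels such as FlashAttention never materialize.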

View full details

Poster

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

Xinyu Tian ⋅ Shu Zou ⋅ Zhaoyuan Yang ⋅ Mengqi He ⋅ Peter Henry Tu ⋅ Jing Zhang

Jun 7, 11:45 AM - 1:45 PM ExHall F 454

Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engage in deeper yet narrower reasoning, while base models, despite being less refined along individual paths, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks.
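
For context, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that standard group-relative normalization; it does not implement MUPO, whose grouping scheme is not described in the abstract. The function name and the `eps` stabilizer are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.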

View full details

Poster

DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

Zhengxian Yang ⋅ Fei Xie ⋅ Xutao Xue ⋅ Rui Zhang ⋅ Taicheng Huang ⋅ Yang Liu ⋅ Mengqi Ji ⋅ Tao Yu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 457

3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) black borders at image edges cause information loss and negate the fisheye’s large FOV advantage; 2) undistortion’s stretch-and-interpolate resampling spreads each pixel’s value over a larger area, diluting detail density and causing 3DGS to overfit these low-frequency zones, which produces blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: distortion increases toward the periphery, and 3DGS's original optimization, which selects a random view at each iteration, ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap–driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets.

View full details

Poster

What Matters in Practical Learned Image Compression

Kedar Tatwawadi ⋅ Parisa Rahimzadeh ⋅ Zhanghao Sun ⋅ Zhiqi Chen ⋅ Ziyun Yang ⋅ Sanjay Nair ⋅ Divija Hasteer ⋅ Oren Rippel

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 457

One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite this potential, a perceptual yet practical image codec is yet to be proposed. In this work, we aim to close this gap. We conduct a comprehensive study of the key modeling choices that govern the design of a practical learned image codec, jointly optimized for perceptual quality and runtime; the ablations include several novel techniques. We then perform performance-aware neural architecture search over millions of backbone configurations to identify models that achieve the target on-device runtime while maximizing compression performance as captured by perceptual metrics. We combine the various optimizations to construct a new codec that achieves a significantly improved tradeoff between speed and perceptual quality. Based on rigorous subjective user studies, it provides 2.3-3× bitrate savings against AV1, AV2, VVC, ECM and JPEG-AI, and 20-40% bitrate savings against the best learned codec alternatives. At the same time, on an iPhone 17 Pro Max, it encodes 12MP images in as little as 230 ms and decodes them in 150 ms, faster than most top ML-based codecs run on a V100 GPU.

View full details

Poster

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Hao Yan ⋅ Yuliang Liu ⋅ Xingchen Liu ⋅ Yuyi Zhang ⋅ Minghui Liao ⋅ Jihao Wu ⋅ Wei Chen ⋅ Xiang Bai

Jun 7, 3:30 PM - 5:30 PM ExHall A 460

Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured ``Analysis, Localization and Reasoning'' workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate the memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as an ideal foundation for their implementation.

View full details

Poster

SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation

Zixuan Pan ⋅ Kaiyuan Tang ⋅ Jun Xia ⋅ Yifan Qin ⋅ Lin Gu ⋅ Chaoli Wang ⋅ Jianxu Chen ⋅ Yiyu Shi

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 463

2D Gaussian Splatting has emerged as a novel image representation technique that can support efficient rendering on low-end devices. However, scaling to high-resolution images requires optimizing and storing millions of unstructured Gaussian primitives independently, leading to slow convergence and redundant parameters. To address this, we propose Structured Gaussian Image (SGI), a compact and efficient framework for representing high-resolution images. SGI decomposes a complex image into multi-scale local spaces defined by a set of seeds. Each seed corresponds to a spatially coherent region and, together with lightweight multi-layer perceptrons (MLPs), generates structured implicit 2D neural Gaussians. This seed-based formulation imposes structural regularity on otherwise unstructured Gaussian primitives, which facilitates entropy-based compression at the seed level to reduce the total storage. However, optimizing seed parameters directly on high-resolution images is challenging. Therefore, we design a multi-scale fitting strategy that refines the seed representation in a coarse-to-fine manner, substantially accelerating convergence. Quantitative and qualitative evaluations demonstrate that SGI achieves up to 7.5$\times$ compression over prior non-quantized 2D Gaussian methods and 1.6$\times$ over quantized ones, while also delivering 1.6$\times$ and 6.5$\times$ faster optimization, respectively, without degrading, and often improving, image fidelity. Uploaded code will be released upon acceptance.

View full details

Poster

Self-Attention Driven Tensor Representation for High-Order Data Recovery

Zhi-Wei SHI ⋅ Yu-Bang Zheng ⋅ Heng-Chao Li

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 463

Low-rank tensor representation (LRTR) is an effective tool for compactly modeling high-order data. While nonlinear LRTR models can better capture real-world nonlinear dependencies, most existing methods rely on fixed mappings of multilayer perceptrons (MLPs) or convolutional neural networks (CNNs), limiting their ability to model complex global dependencies. To overcome this limitation, we construct a novel paradigm called Self-Attention Driven Tensor Representation (SADTR), which is the first framework that models nonlinearity from the perspective of self-attention. Specifically, we design a factor self-representation mechanism to establish a dynamic global mapping, thereby adaptively capturing both local and non-local nonlinear dependencies. Moreover, we introduce an implicit sparse representation to impose a sparsity constraint while avoiding additional optimization problems. As a result, the proposed SADTR can achieve a more accurate low-rank representation. In theory, we provide a detailed analysis to demonstrate the recoverability of SADTR. To validate the effectiveness of SADTR, we apply it to three representative high-order data recovery tasks. Experimental results demonstrate that SADTR consistently outperforms existing state-of-the-art LRTR methods.

View full details

Poster

Homaloidal parametrization for detecting critical two-view configurations

Rakshith Madhavan ⋅ Matteo Forlivesi ⋅ Marina Bertolini ⋅ Cristina Turrini ⋅ Federica Arrigoni ⋅ Luca Magri

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 466

We consider the problem of identifying degenerate configurations while estimating the fundamental matrix from (at least) 8 point correspondences. It is known that such configurations correspond to an ill-posed estimation of the fundamental matrix, so it is important to identify them in practice. So far, a practical degeneracy test is only available for the cases of planar scenes and pure rotation, while the case of the general critical surface (e.g., a hyperboloid/cone/cylinder containing 3D points and camera centres) is less studied, and the only available method is highly unstable, involving a pre-computed fundamental matrix. In this paper, we propose a novel degeneracy test for detecting points on the critical surface. By exploiting the geometry of the so-called ``homaloidal net of conics'', we are able to design a simple and very practical test that requires the linear estimation of a quadratic transformation from image correspondences. Our test does not require a fundamental matrix in advance and turns out to be more stable than its closest competitor, as shown in our experiments on both synthetic and real-world degenerate configurations.

View full details

Poster

Compressed-Domain-Aware Online Video Super-Resolution

Yuhang Wang ⋅ Hai Li ⋅ Shujuan Hou ⋅ Zhetao Dong ⋅ Xiaoyao Yang

Jun 7, 11:45 AM - 1:45 PM ExHall F 466

In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual-map-guided gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed.

View full details

Poster

CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization

Xindong Mao ⋅ Hang Li ⋅ Yuchen Wu ⋅ Jiahe Li ⋅ Xiao Bai ⋅ Jin Zheng

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 467

Scene Coordinate Regression (SCR) has emerged as a memory-efficient paradigm for visual localization. While SCR has demonstrated performance comparable to classic feature-matching-based approaches in small-scale scenes, it has consistently underperformed in large-scale environments. Large-scale localization is hampered by two challenges: sparse co-visibility and local appearance ambiguity. In this work, we propose **CoLoR**, a novel training framework tailored for large-scale SCR. First, we explicitly and efficiently partition scene points into multi-view and single-view sets and introduce a two-stage bootstrapping paradigm to provide complete and strong supervision for all points. Second, we propose a multi-granularity retrieval feature, which unifies the conventional global and local features as retrieval-oriented representations at the image and pixel levels, respectively, to enforce feature consistency. Our method achieves state-of-the-art performance on multiple challenging large-scale datasets and significantly narrows the accuracy gap with classical feature-matching-based approaches while retaining a compact map size.

View full details

Poster

ROSE: Rotate Your Large Language Model to See

Tongtian Yue ⋅ Xuange Gao ⋅ Longteng Guo ⋅ Zijia Zhao ⋅ Zikang Liu ⋅ Jie Jiang ⋅ Hua Huang ⋅ Jing Liu

Jun 6, 11:45 AM - 1:45 PM ExHall F 467

Recent advances in multimodal large language models (MLLMs) have shown impressive progress in integrating visual and linguistic understanding. However, most existing MLLMs inject visual information into the input space of large language models (LLMs), which substantially increases context length and computational overhead, while often disrupting pretrained linguistic priors by forcing the LLM to optimize on vision-dominant multimodal sequences. In this work, we propose a rotation-based vision injection paradigm that aligns visual information with the parameter space of LLMs. Visual semantics are encoded as rotation matrices and applied directly to the pretrained parameters. This parameter-space injection eliminates the need for long input sequences, thus avoiding the quadratic computational overhead inherent in input-space injection. Besides, it preserves the linguistic competence of the LLM by maintaining the intrinsic geometric structure of the pretrained parameters. Building upon this paradigm, we develop ROSE, a 7B MLLM that achieves fine-grained vision–language alignment with remarkable computational efficiency. Extensive experiments across 12 multimodal benchmarks show that ROSE delivers superior or competitive performance compared with leading models. At comparable accuracy, ROSE reduces FLOPs by 80.7% and inference latency by 56.4% relative to Qwen2.5-VL-7B, demonstrating its effectiveness and scalability. All training code, model weights and data will be publicly released.
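
The claim that a rotation maintains the geometric structure of pretrained parameters can be checked numerically. The sketch below is a generic illustration, not ROSE: the abstract does not say how visual semantics are mapped to rotations, so a random rotation stands in, and right-multiplication of the weight matrix is an assumed convention. The function names are hypothetical.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random rotation matrix (orthogonal, det = +1) via QR decomposition."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    q = q * np.sign(np.diag(r))      # fix column signs; q stays orthogonal
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # flip one column to get det = +1
    return q

def rotate_weights(weight, rotation):
    """Rotate each row of weight (out_dim, in_dim) in input space."""
    return weight @ rotation
```

Since the rotation is orthogonal, the Gram matrix of the weight rows is unchanged, so row norms and the angles between rows are the same before and after injection.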

View full details

Poster

AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

Mohammad Omama ⋅ Gabriele Berton ⋅ Eric Foxlin ⋅ Yelin Kim

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 467

Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.

View full details

Poster

Affine Perspective-Three-Point Problem

Gaku Nakano

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 468

This paper addresses the Perspective-Three-Point (P3P) problem under affine camera models. We derive direct closed-form solvers for weak perspective and paraperspective, which are representative affine camera models. The affine P3P solution reduces to a bi-quadratic equation. Unlike exact P3P solvers that require solving a cubic or quartic equation, it allows for the simple and stable calculation of real solutions using the quadratic formula. Since affine approximations are valid only when scene depth variation is small, we further propose an iterative correction that upgrades the affine solution to the exact P3P solution. Through extensive comparisons using synthetic data and public datasets, we demonstrate that affine P3P solvers with two upgrade iterations achieve performance comparable to that of the state-of-the-art P3P solver.
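
To see why a bi-quadratic equation is convenient, note that substituting $y = x^2$ turns $a x^4 + b x^2 + c = 0$ into a quadratic in $y$, so real roots follow from the quadratic formula and two square roots. The sketch below is a generic solver under that substitution; the actual coefficients of the affine P3P equation are not given in the abstract, and the function name is hypothetical.

```python
import math

def solve_biquadratic(a, b, c):
    """Real roots of a*x**4 + b*x**2 + c = 0 (a != 0), in ascending order."""
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []                     # y = x**2 has no real solution
    sqrt_disc = math.sqrt(disc)
    roots = set()
    for y in ((-b + sqrt_disc) / (2 * a), (-b - sqrt_disc) / (2 * a)):
        if y > 0:
            x = math.sqrt(y)
            roots.update((x, -x))     # each positive y gives a +/- pair
        elif y == 0:
            roots.add(0.0)
    return sorted(roots)
```

For example, `solve_biquadratic(1, -5, 4)` solves $x^4 - 5x^2 + 4 = 0$, where $y \in \{1, 4\}$ and the real roots are $\pm 1$ and $\pm 2$.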

View full details

Poster

High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling

Feng Ye ⋅ Kai Zhang ⋅ Li zhang ⋅ Chuanmin Jia

Jun 7, 11:45 AM - 1:45 PM ExHall F 469

Exploiting bi-directional context prediction has long been recognized as a key direction for improving compression efficiency in neural video coding. However, existing neural B-frame codecs still exhibit limited performance gains, particularly in high-resolution videos with large motion, where optical flow estimation becomes unreliable and balanced prediction fusion introduces distortions. To address these challenges, we present the first High-Resolution bi-directional neural video coding method, termed HR-NVC, which non-uniformly integrates confidence-guided predictive cues from both temporal directions to achieve more reliable and efficient compression. Specifically, we propose Spatio-Temporal Anchored Motion Estimation, which introduces virtual anchor frames and low-resolution priors to significantly improve estimation robustness under large displacements. We further design a Hierarchical Motion Representation that converges multi-scale motion with temporal references, enabling compact and adaptive modeling of motion reliability across resolutions. Finally, a Bi-Contextual Asymmetric Harmonization module performs confidence-guided fusion of bidirectional references, effectively suppressing unreliable contexts and restoring structural consistency near occlusion and scene transition regions. Notably, our model is the first end-to-end-optimized video codec evaluated on 4K-resolution videos, establishing a new benchmark for higher-resolution NVC and achieving state-of-the-art performance among neural B-frame codecs.

View full details

Poster

Enhancing Video Vision Language Model with Hippocampal Sensing

Xu Cao

Jun 7, 11:45 AM - 1:45 PM ExHall F 472

Current video vision language models (VLMs) process information passively, lacking the ability to dynamically plan their analysis or perform joint reasoning across crucial modalities such as video and audio. To address this, we introduce Visual-Audio Supersensing (VAS), a learning paradigm that shifts the focus from temporal predictive sensing (e.g., Cambrian-S) to cross-modal prediction. The core objective of VAS is to train the model to anticipate audio-caption summarizations from video and vice versa. We present VA-R1, a VLM that operationalizes this paradigm. Instead of passively ingesting all data, VA-R1 actively reasons about its information needs using Chain-of-Thought (CoT). Our training process is twofold: we first finetune VA-R1 with VAS, and then apply a novel contrastive Reinforcement Learning (RL) algorithm, Video-Audio Negative-aware Optimization (VANAO), to optimize this selective co-reasoning process. This approach proves highly effective: despite their significantly smaller size, our VA-R1-7B and VA-R1-8B models achieve performance competitive with massive MLLMs such as GPT-4o and Gemini 1.5 Pro on multiple video VQA benchmarks.

View full details

Poster

SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning

Yong Xien Chng ⋅ Tao Hu ⋅ Wenwen Tong ⋅ Xueheng Li ⋅ Jiandong Chen ⋅ Haojia Yu ⋅ Jiefan Lu ⋅ Hewei Guo ⋅ Hanming Deng ⋅ Chengjun Xie ⋅ Gao Huang ⋅ Lewei Lu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 473

Vision-Language Models (VLMs) are limited by static knowledge and insufficient fine-grained visual analysis, hindering their performance on knowledge-intensive and visually complex tasks. While recent research has explored VLMs that employ external tools like search or cropping to enhance model performance, these models typically use tools in isolation and lack the ability to coordinate multiple tools effectively. To address this gap, we propose SenseSearch, the first agentic VLM for search-reasoning that supports adaptive multi-tool coordination via reinforcement learning (RL). Specifically, SenseSearch dynamically integrates image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. We first construct a high-quality cold-start dataset to instill basic tool-usage behaviors. In the subsequent RL stage, we introduce the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to enhance tool invocation and reasoning ability. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseSearch achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks, outperforming baselines by 19.18% on HR-MMSearch. SenseSearch provides a promising path toward agentic VLMs with effective and robust tool invocation capabilities. All code and data will be publicly released.

View full details

Poster

VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

Juhye Park ⋅ Wooju Lee ⋅ Dasol Hong ⋅ Changki Sung ⋅ Youngwoo Seo ⋅ DongWan Kang ⋅ Hyun Myung

Jun 7, 11:45 AM - 1:45 PM ExHall F 473

Accurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. To address this challenge, we propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.
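
To make the polar transformation step concrete, here is a minimal sketch of resampling a square top-down image so that columns index azimuth and rows index distance from the image center. It is a generic nearest-neighbor version written under assumptions: the function name, the axis conventions (azimuth 0 pointing to the top of the image, far range in the top row), and the sampling scheme are illustrative and are not taken from the paper.

```python
import numpy as np

def polar_transform(image, out_height, out_width):
    """Resample a square top-down image: rows index radius, columns index azimuth."""
    size = image.shape[0]
    center = (size - 1) / 2.0
    # Top output row is farthest from the center, bottom row is the center itself.
    radius = np.linspace(center, 0.0, out_height)[:, None]
    theta = np.linspace(0.0, 2 * np.pi, out_width, endpoint=False)[None, :]
    # Nearest-neighbor lookup of the source pixel for every (radius, azimuth) pair.
    ys = np.clip(np.rint(center - radius * np.cos(theta)), 0, size - 1).astype(int)
    xs = np.clip(np.rint(center + radius * np.sin(theta)), 0, size - 1).astype(int)
    return image[ys, xs]
```

After this resampling, a horizontal shift in the output corresponds to a rotation about the image center, which is the usual motivation for comparing polar-transformed overhead imagery with ground-level panoramas.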

View full details

Poster

Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

Xinlei Yu ⋅ Chengming Xu ⋅ Zhangquan Chen ⋅ Yudong Zhang ⋅ Shilin Lu ⋅ Cheng Yang ⋅ Jiangning Zhang ⋅ Shuicheng Yan ⋅ Xiaobin Hu

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 476

The dominant paradigm of monolithic scaling in Vision-Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with the inherent need of this domain for document-based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi-Agent Collaboration framework with agent-wise adaptive Test-time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual document understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self-correction loop for factual grounding. This collaborative architecture is amplified by an agent-wise adaptive test-time scaling strategy that intelligently allocates computational resources based on the complexity and redundancy of each functionality. Evaluated on multiple visual document understanding benchmarks, MACT achieves superior performance with a smaller parameter scale, adapting effectively to various document scenarios without compromising its general or mathematical reasoning capabilities. The three variants of MACT consistently attain top-three average performance rankings, with average performance enhancements of 9.9–11.5% over the base models. The source code will be released publicly.

View full details

Poster

Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery

Yiming Zeng ⋅ Xile Zhao ⋅ Wei-Hao Wu ⋅ Teng-Yu Ji ⋅ Chao Wang

Jun 6, 11:45 AM - 1:45 PM ExHall F 476

Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) the transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address the two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.

View full details

Poster

Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion

Yanglin Deng ⋅ Tianyang Xu ⋅ Chunyang Cheng ⋅ Hui Li ⋅ Xiao-Jun Wu ⋅ Josef Kittler

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 479

Infrared and visible image fusion (IVIF) aims to synthesise complementary information from the two source modalities while preserving natural textures and salient thermal signatures simultaneously. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting the generalisation performance. To this end, this work challenges the necessity of Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100$\times$ larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies.

View full details

Poster

FreqSIC: Frequency-aware Stereo Image Compression with Bi-directional Checkerboard Context Model

Shiyu Qin ⋅ Yongkang Lu ⋅ Yimin Zhou ⋅ Jiawei Li ⋅ Yifan Ren ⋅ Yuerong Xue ⋅ Shu-Tao Xia ⋅ Bin Chen

Jun 6, 11:45 AM - 1:45 PM ExHall F 479

Stereo image compression is essential for a wide range of 3D vision applications. Recent methods have demonstrated strong capabilities in eliminating inter-view redundancy and enabling compact entropy coding via spatial-domain stereo transformation and advanced autoregressive entropy models. However, these approaches often suffer from high-frequency information loss and incur considerable coding latency. To overcome these limitations, we propose a novel frequency stereo context transfer (FSCT) module. Unlike spatial-domain methods, the FSCT module separately captures inter-view redundancy in high- and low-frequency components and dynamically balances their contributions to preserve reconstruction quality. In addition, we replace the conventional autoregressive framework with a checkerboard strategy and integrate the FSCT module to model inter-view priors, enabling faster and more efficient entropy coding. Extensive experiments demonstrate that our method achieves state-of-the-art rate-distortion performance among existing stereo image compression approaches, while also attaining the lowest coding latency.
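
As a rough illustration of the checkerboard idea mentioned above, the sketch below splits a latent grid into two interleaved position sets. In a generic checkerboard context model, the first set (anchors) is decoded without spatial context and the second set is then decoded in parallel, conditioned on the anchors, which avoids the serial scan of an autoregressive model. The function name and the choice of which parity serves as anchors are assumptions; the bi-directional, cross-view variant described in the abstract is not reproduced here.

```python
import numpy as np

def checkerboard_masks(height, width):
    """Return boolean (anchor, non_anchor) masks for an H x W latent grid."""
    rows, cols = np.indices((height, width))
    anchor = (rows + cols) % 2 == 0  # parity choice is arbitrary in this sketch
    return anchor, ~anchor
```

Every non-anchor position has only anchor positions as its four direct neighbors, so a two-pass decode gives each non-anchor a full local context.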

View full details

Poster

Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework

Jingjie Shang ⋅ Tengyu Ma ⋅ Heng Zhang ⋅ Jinyuan Liu ⋅ Risheng Liu ⋅ Yuan Wang ⋅ Xiaochen Bo

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 479

Multi-Exposure Fusion (MEF) seeks to generate a single high-quality image from multiple inputs captured at different exposure levels. Despite substantial progress, most existing approaches depend on statistical metrics that poorly reflect human perceptual preferences. Electroencephalography (EEG) provides a direct physiological window into human cognition, yet its use in low-level vision remains limited due to scarce paired data and the absence of bio-signals during inference. We address these challenges through two key contributions. First, we introduce Cog-Expo, the first dataset capturing human cognitive responses to multi-exposure stimuli, establishing a bridge between neuroscience and computational photography. Second, we propose a bi-level coupled learning framework that leverages this cognitive information without requiring it during inference. A Mental Integrated Transformer serves as the Teacher, incorporating cognitive priors to guide visual feature learning, while a lightweight Student is trained to approximate these cues using only image inputs. Through bi-level optimization, the Teacher learns inherently distillable representations, enabling the Student to emulate cognitive guidance efficiently. Extensive experiments confirm that our method achieves state-of-the-art fusion performance and aligns more closely with human perception.

View full details

Poster

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

CHEN Yang ⋅ Xieyuanli Chen ⋅ Junxiang Li ⋅ Jie Tang ⋅ Tao Wu

Jun 6, 11:45 AM - 1:45 PM ExHall F 480

Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions, implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Code will be available upon acceptance.

View full details

Poster

GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

Modi Jin ⋅ Yiming Zhang ⋅ Bo-Yuan Sun ⋅ Dingwen Zhang ⋅ Mingming Cheng ⋅ Qibin Hou

Jun 7, 3:30 PM - 5:30 PM ExHall A 480

This paper presents GeoAgent, a model capable of reasoning in a human-like manner and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought (CoT) data and training strategies, which conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We further thoroughly explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward assessed by a consistency agent to assist training. This encourages the model to converge towards correct answers from a geographic perspective while ensuring the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple granularities, while generating reasoning that closely aligns with that of humans. The pretrained model and data will be openly available.

View full details

Poster

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

Shuhao Kang ⋅ Youqi Liao ⋅ Peijie Wang ⋅ Wenlong Liao ⋅ Qilin Zhang ⋅ Benjamin Busam ⋅ Xieyuanli Chen ⋅ Yun Liu

Jun 7, 3:30 PM - 5:30 PM ExHall A 481

Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset will be publicly released.

View full details

Poster

NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

Loick Chambon ⋅ Paul Couairon ⋅ Éloi Zablocki ⋅ Alexandre Boulch ⋅ Nicolas THOME ⋅ Matthieu Cord

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 482

Vision Foundation Models (VFMs) extract spatially downsampled representations, which poses challenges for pixel-level tasks that require fine-grained details. Existing approaches face a trade-off: classical filters are fast and broadly applicable but use fixed forms and feature-independent guidance, while modern upsamplers achieve stronger accuracy with learnable, VFM-specific guidance but require retraining per VFM. We introduce Neighborhood Attention Filtering (NAF), bridging classical filtering with modern upsamplers. Guided solely by the high-resolution input image, NAF learns adaptive content and spatial weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE). NAF is VFM-agnostic and zero-shot: once trained, it upsamples features from any VFM without retraining, being the first VFM-agnostic architecture to outperform VFM-specific upsamplers by achieving state-of-the-art scores on multiple downstream tasks. It remains highly efficient, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, showing its versatility. We open-source our code and checkpoints.

View full details

Poster

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

Songyuan Yang ⋅ Weijiang Yu ⋅ Jilin Ma ⋅ Ziyu Liu ⋅ Guijian Tang ⋅ Wenjing Yang ⋅ Huibin Tan ⋅ Nong Xiao

Jun 7, 11:45 AM - 1:45 PM ExHall F 482

Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce **Reinforce to Learn, Elect to Reason (RLER)**, a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In **RLER-Training**, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In **RLER-Inference**, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
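
One plausible reading of the evidence-weighted election is a weighted vote: each candidate receives a scalar score from the four criteria, and the answer with the largest total score wins. The sketch below implements that reading only. The function name, the linear combination, the weights, and the assumption that every criterion is already scored in [0, 1] are all hypothetical; the abstract does not specify how the scores are computed or aggregated.

```python
from collections import defaultdict

def elect_answer(candidates, weights=(0.4, 0.3, 0.2, 0.1)):
    """Pick the answer with the largest total score across candidates.

    Each candidate is (answer, (consistency, confidence, transparency,
    non_redundancy)), with every score assumed to lie in [0, 1].
    The weights are illustrative only.
    """
    totals = defaultdict(float)
    for answer, scores in candidates:
        # Collapse the four criteria into one vote weight for this candidate.
        totals[answer] += sum(w * s for w, s in zip(weights, scores))
    return max(totals, key=totals.get)
```

Under this scheme a single well-supported candidate can outvote several poorly supported ones, which is the behavior that separates an evidence-weighted election from plain majority voting.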

View full details

Poster

HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps

Xuchang Zhong ⋅ Xu Cao ⋅ Jinke Feng ⋅ Hao Fang

Jun 7, 3:30 PM - 5:30 PM ExHall A 482

Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.

View full details

Poster

TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement

Arian Sabaghi ⋅ Jose Oramas

Jun 7, 3:30 PM - 5:30 PM ExHall A 483

Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with DINOv2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be available upon paper acceptance at https://anonymousRepoURL.com.

View full details

Poster

Beyond Single Solution: Multi-Hypothesis Deep Unfolding Network for Image Compressive Sensing

Wenxue Cui ⋅ Hualin Li ⋅ Yuhang Qin ⋅ Yifu Xu ⋅ Xiaopeng Fan ⋅ Debin Zhao

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 489

Recent deep unfolding networks (DUNs) have advanced Compressive Sensing (CS) by effectively integrating iterative optimization with deep learning architectures. However, most CS approaches predominantly confine their inference to a single solution space, neglecting the inherent ill-posedness of CS problems that intrinsically permits multiple plausible candidate hypotheses. In this paper, a novel Multi-Hypothesis Collaborative Deep Unfolding CS Network (MHC-DUN) is proposed, which explicitly models and leverages multiple hypotheses by jointly optimizing across diverse solution spaces. Specifically, following the Proximal Gradient Descent algorithm, MHC-DUN jointly performs gradient descent and proximal mapping within this multi-hypothesis paradigm. i) For gradient descent, a well-designed AlphaNet is introduced to dynamically predict spatially varying step sizes for all hypotheses, enabling collaborative gradient updates across multiple solutions. ii) For the proximal operator, a sophisticated multi-hypothesis collaborative proximal mapping module is designed, which leverages both intra-hypothesis and inter-hypothesis correlation priors to jointly refine multiple solutions. To enable end-to-end training, a novel composite loss function is designed, which balances measurement fidelity, hypothesis diversity, and reconstruction accuracy, encouraging exploration of complementary solutions while maintaining reconstruction fidelity. Experimental results reveal that the proposed CS method outperforms existing CS networks.
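
For readers unfamiliar with the iteration that deep unfolding networks unroll, here is a minimal sketch of textbook proximal gradient descent (ISTA) for the sparse recovery problem min_x 0.5*||Ax - y||^2 + lam*||x||_1. It is a single-hypothesis baseline with a fixed scalar step size and a soft-thresholding proximal operator; it is not MHC-DUN, which per the abstract replaces the step size with learned, spatially varying predictions and the proximal operator with a learned multi-hypothesis module. The function names and default parameters are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam=0.1, num_iters=200):
    """Proximal gradient descent for 0.5*||A x - y||^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - y)                          # gradient descent step
        x = soft_threshold(x - step * grad, step * lam)   # proximal mapping step
    return x
```

Each loop iteration corresponds to one stage of an unfolded network: a gradient update on the measurement-fidelity term followed by a proximal mapping that encodes the prior.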

View full details

Poster

UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation

Xingyuan Li ⋅ Songcheng Du ⋅ Yang Zou ⋅ HaoYuan Xu ⋅ Zhiying Jiang ⋅ Jinyuan Liu

Jun 7, 11:45 AM - 1:45 PM ExHall F 489

Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose \textbf{UniFusion}, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion’s superior visual quality, generalization ability, and adaptability to real-world scenarios.

View full details

Poster

240FPS Stereo Vision from Monocular Mixed Spikes

Yeliduosi Xiaokaiti ⋅ Yakun Chang ⋅ Yang Bai ⋅ Zhaojun Huang ⋅ Peiqi Duan ⋅ Boxin Shi

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 490

Stereo vision is fundamental for enabling machines to perceive and interact with the world. While monocular stereo methods offer hardware compactness, they struggle with generalization due to reliance on data-driven priors. Binocular and multi-view systems improve accuracy but incur higher hardware complexity and data inefficiency. In this paper, we introduce a monocular solution for high-frame-rate stereo vision via temporal optical modulation. The modulation directs light from two views in a mixed manner while periodically attenuating one view at 60Hz. To capture the temporal variations introduced by this modulation, we employ a high-speed spike camera that records the mixed scene as temporally dense spikes. The high temporal resolution of these spikes enables the construction of a linear system for efficient binocular video decoupling. Consequently, we introduce a two-stage decoding methodology for achieving high-quality stereo vision: an efficient least-squares-based baseline reconstruction followed by a deep learning refinement module. Experimental results demonstrate that our approach achieves 240FPS binocular video reconstruction with superior accuracy compared to monocular systems, while maintaining hardware compactness and data efficiency.
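
To illustrate how a known temporal modulation can yield a linear system, the sketch below assumes a simplified mixing model in which each observed intensity frame is mixed[t] = view1 + attenuation[t] * view2, with a known gain attenuation[t] on the second view and a scene that is static within the window. Under those assumptions the two views can be separated per pixel with ordinary least squares. The mixing model, the function name, and the use of intensity frames rather than raw spikes are all assumptions made for illustration; the paper's actual formulation may differ.

```python
import numpy as np

def decouple_views(mixed, attenuation):
    """Separate two views from modulated mixtures by least squares.

    mixed: (T, H, W) intensities; attenuation: (T,) known gains of view 2.
    Assumed model: mixed[t] = view1 + attenuation[t] * view2.
    """
    T, H, W = mixed.shape
    design = np.stack([np.ones(T), attenuation], axis=1)  # (T, 2) system matrix
    # Solve all pixels at once: one right-hand-side column per pixel.
    solution, *_ = np.linalg.lstsq(design, mixed.reshape(T, -1), rcond=None)
    return solution[0].reshape(H, W), solution[1].reshape(H, W)
```

The system is solvable only if the attenuation takes at least two distinct values within the window, which is why the modulation has to vary over time.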

View full details

Poster

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Keliang Li ⋅ Yansong Li ⋅ Hongze Shen ⋅ Mengdi Liu ⋅ Hong Chang ⋅ Shiguang Shan

Jun 6, 11:45 AM - 1:45 PM ExHall F 491

The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.

View full details

Poster

Specificity-aware reinforcement learning for fine-grained open-world classification

Samuele Angheben ⋅ Davide Berasi ⋅ Alessandro Conti ⋅ Elisa Ricci ⋅ Yiming Wang

Jun 7, 3:30 PM - 5:30 PM ExHall A 491

Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. We will release both code and model.

View full details

Poster

Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

Yafei Zhang ⋅ Meng Ma ⋅ Huafeng Li ⋅ Yu Liu

Jun 6, 11:45 AM - 1:45 PM ExHall F 494

Infrared–visible (IR–VIS) image fusion is vital for perception and security, yet most methods rely on the availability of both modalities during training and inference. When the infrared modality is absent, pixel-space generative substitutes become hard to control and inherently lack interpretability. We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. The pipeline comprises three key components: (1) Joint Shared-dictionary Representation Learning (JSRL) learns a unified and interpretable atom space shared by both IR and VIS modalities; (2) VIS-Guided IR Inference (VGII) transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain and performs a one-step closed-loop refinement guided by a frozen large language model as a weak semantic prior; and (3) Adaptive Fusion via Representation Inference (AFRI) merges VIS structures and inferred IR cues at the atom level through window attention and convolutional mixing, followed by reconstruction with the shared dictionary. This \emph{encode$\rightarrow$transfer$\rightarrow$fuse$\rightarrow$reconstruct} pipeline avoids uncontrolled pixel-space generation while ensuring prior preservation within interpretable dictionary–coefficient representation. Experiments under missing-IR settings demonstrate consistent improvements in perceptual quality and downstream detection performance. To our knowledge, this represents the first framework that jointly learns a shared dictionary and performs coefficient-domain inference–fusion to tackle missing-IR fusion.

View full details

Poster

Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields

Berthy T. Feng ⋅ Andrew A. Chael ⋅ David Bromley ⋅ Aviad Levis ⋅ William Freeman ⋅ Katherine L. Bouman

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 496

With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose *PINeRF*, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs the 3D velocity field with the 4D emissivity field and enforces the velocity as a soft constraint on the dynamics of the estimated emissivity. In experiments on simulated data, we find significantly improved reconstruction accuracy over both BH-NeRF and a totally physics-agnostic approach. We demonstrate how our method can be used to estimate other physics parameters of the black hole, such as its spin.

View full details

Poster

PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization

Xiaoya Cheng ⋅ Long Wang ⋅ Yan Liu ⋅ Xinyi Liu ⋅ Hanlin Tan ⋅ Yu Liu ⋅ Maojun Zhang ⋅ Shen Yan

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 498

We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps), which enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion. Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running at over 25 FPS on an NVIDIA Jetson Orin platform. Our code and dataset will be made publicly available.

View full details

Poster

OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

Dongjian Yu ⋅ Weiqing Min ⋅ Qian Jiang ⋅ Xing Lin ⋅ Xin Jin ⋅ Shuqiang Jiang

Jun 7, 3:30 PM - 5:30 PM ExHall A 500

Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets focus on Western cuisines, with limited coverage of Chinese dishes, leading to limitations in accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food scenes with detailed nutritional annotations and multi-view images for each scene. In addition, to enhance models’ capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework to predict nutritional information from a single RGB image. We first predict a depth map from a single RGB image, then refine it using our Scale-Shift Residual Adapter (SSRA), which enforces global scale consistency and preserves local structural details. Second, the Frequency-Aligned Fusion Module (FAFM) hierarchically fuses RGB and adapted depth features, aligning multi-modal representations in the frequency domain across layers. Third, the Mask-based Prediction Head (MPH) emphasizes key ingredient regions via dynamic channel selection, improving prediction accuracy. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, providing a practical solution for daily dietary assessment.

View full details

Poster

TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis

Rui Peng ⋅ Ziru Liu ⋅ Lingyuan Ye ⋅ Yuxing Lu ⋅ Boxin Shi ⋅ Jinzhuo Wang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 501

Accurately modeling the relationship between perturbations, transcriptional responses, and phenotypic changes is essential for building an AI Virtual Cell (AIVC). However, existing methods are typically constrained to modeling direct associations, such as *Perturbation $\rightarrow$ RNA* or *Perturbation $\rightarrow$ Morphology*, and overlook the crucial causal link from RNA to morphology. To bridge this gap, we propose TRIDENT, a cascade generative framework that synthesizes realistic cellular morphology by conditioning on both the perturbation and the corresponding gene expression profile. To train and evaluate this task, we construct MorphoGene, a new dataset pairing L1000 gene expression with Cell Painting images for 98 compounds. TRIDENT significantly outperforms state-of-the-art approaches, achieving up to 7-fold improvement with strong generalization to unseen compounds. In a case study on docetaxel, we validate that RNA-guided synthesis accurately produces the corresponding phenotype. An ablation study further confirms that this RNA conditioning is essential for the model's high fidelity. By explicitly modeling transcriptome–phenome mapping, TRIDENT provides a powerful in silico tool and moves us closer to a predictive virtual cell.

View full details

Poster

Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal

Xiaolong Qian ⋅ Qi Jiang ⋅ Lei Sun ⋅ Zongxi Yu ⋅ Kailun Yang ⋅ Peixuan Wu ⋅ Jiacheng Zhou ⋅ Yao Gao ⋅ Yaoguang Ma ⋅ Ming-Hsuan Yang ⋅ Kaiwei Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 501

Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems, including single-lens and metalens designs, is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatially varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering the application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These results suggest that VeilGen reliably synthesizes realistic veiling glare, and that its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released.

View full details

Poster

Inter-Photon-Limited Videography

Andrew Xie ⋅ Dongyu Du ⋅ Sotiris Nousias ⋅ David B. Lindell ⋅ Kiriakos N. Kutulakos

Jun 7, 11:45 AM - 1:45 PM ExHall F 502

We consider the problem of imaging a dynamic scene when scene appearance variations can outpace photon arrivals. Under such conditions, a pixel is effectively ``blind'' to changes in appearance that occur within the timespan separating the photons it detects, and so the inter-photon interval presents a significant speed barrier to video acquisition systems. To analyze and advance imaging capabilities at the inter-photon limit, we introduce a novel reparameterization of time-varying flux that reveals the intrinsic difficulty of signal reconstruction by relating the Fourier decomposition of a flux function to the number of photons arriving within each oscillation period. We find that inter-photon-limited videography of general scenes is underexplored and beyond the reach of existing reconstruction techniques. To this end, we introduce Neural Flux Fields, a technique that combines statistical modeling of photon arrival with intrinsic priors of a neural network to achieve robust videography at the inter-photon limit. Using this approach, we demonstrate never-before-seen capabilities in video reconstruction across a range of captured single-photon video datasets spanning the inter-photon-limited regime.

View full details

Poster

LRHDR: Learning Representation-enhanced HDR Video Reconstruction

Chenzhuo Liao ⋅ Xin Chen ⋅ Bingchen Li ⋅ Yu Meng ⋅ Tao Yue ⋅ Xuemei Hu

Jun 7, 3:30 PM - 5:30 PM ExHall A 502

Reconstructing High Dynamic Range (HDR) video from alternately exposed Low Dynamic Range (LDR) frames is challenged by large motion, exposure-induced photometric inconsistency, and information loss in saturated or under-exposed regions. Prior HDR video pipelines typically follow an alignment–reconstruction paradigm, which is limited by the precision of alignment and the performance of the fusion module. We propose a new reconstruction framework called Learning Representation-enhanced HDR Video Reconstruction (LRHDR), which is built around two novel components: an Amalgamated Cross-exposure Consistent Representation (ACCR) network and an Adaptive Pixel-wise Sparse Weighted Fusion (APSWF). The ACCR includes an Exposure-aware Interleaved Context (EIC) encoder and a Representation Mapper (RM). The EIC couples a large-field path with a high-fidelity sub-pixel path and an exposure gate to produce exposure-aware features. The RM avoids geometric warping by mapping features from different exposures into a unified representation via per-pixel, per-channel linear modulation and decoding them into a calibrated linear HDR domain. The APSWF treats fusion as pixel-wise candidate selection, producing sparse weighted masks to form a normalized fusion in the linear HDR domain, thereby suppressing artifacts. Extensive experiments on standard benchmarks demonstrate that our LRHDR outperforms previous methods.

View full details

Poster

OVI-MAP: Open-Vocabulary Instance-Semantic Mapping

Zilong Deng ⋅ Federico Tombari ⋅ Marc Pollefeys ⋅ Johanna Wald ⋅ Daniel Barath

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 505

Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP, which decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks. The source code will be made publicly available.

View full details

Poster

Electromagnetic Inverse Scattering from a Single Transmitter

Yizhe Cheng ⋅ Chunxun Tian ⋅ Haoru Wang ⋅ Wentao Zhu ⋅ Xiaoxuan Ma ⋅ Yizhou Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 505

Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from the scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging, especially under sparse transmitter setups, e.g., with only one transmitter. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires time-consuming case-specific optimization and fails under sparse transmitter setups. To address these limitations, we revisit EISP from a data-driven perspective. The scarcity of transmitters leads to an insufficient amount of measured data, which fails to capture adequate physical information for stable inversion. Building on this insight, we propose a fully end-to-end and data-driven framework that predicts the relative permittivity of scatterers from measured fields, leveraging data distribution priors to compensate for the lack of physical information. This design enables data-driven training and feed-forward prediction of relative permittivity while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy and robustness. Notably, it achieves high-quality results even with a single transmitter, a setting where previous methods consistently fail. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.

View full details

Poster

ReMoT: Reinforcement Learning with Motion Contrast Triplets

Cong Wan ⋅ Zeyu Guo ⋅ Jiangyang Li ⋅ SongLin Dong ⋅ Yifan Bai ⋅ Lin Peng ⋅ Zhiheng Ma ⋅ Yihong Gong

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 508

We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (i) a rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation; and (ii) Group Relative Policy Optimization, which, as we empirically validate, yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves SOTA performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1 performance leap on spatio-temporal reasoning tasks.
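
The abstract names Group Relative Policy Optimization but not its details. As a rough illustration only, the core of the generic technique is to score each sampled response against the other responses drawn for the same prompt. The sketch below assumes a hypothetical rule-based reward (1.0 when the model picks the correct element of a motion-contrast triplet, 0.0 otherwise); the function name and the `eps` parameter are illustrative and not taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), the
    group-relative baseline used by GRPO-style methods. `eps` is an
    illustrative guard against a zero standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts on one motion-contrast triplet:
# 1.0 = correct choice, 0.0 = wrong choice (an assumed reward rule).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative, so no learned value function is needed as a baseline. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.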

View full details

Poster

SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Yunnan Wang ⋅ Kecheng Zheng ⋅ Jianyuan Wang ⋅ Minghao Chen ⋅ David Novotny ⋅ Christian Rupprecht ⋅ Yinghao Xu ⋅ Xing Zhu ⋅ Wenjun Zeng ⋅ Xin Jin ⋅ Yujun Shen

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 507

The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

View full details

Poster

B^3-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates

Hiromichi Kamata ⋅ Samuel Arthur Munro ⋅ Fuminori Homma

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 507

Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose \textbf{B$^3$-Seg (Beta--Bernoulli Bayesian Segmentation for 3DGS)}, a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under \textbf{camera-free} and \textbf{training-free} conditions. Our approach reformulates segmentation as sequential Beta--Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
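
To make the Beta--Bernoulli idea concrete, here is a minimal sketch of a conjugate update and a closed-form information gain for a single Bernoulli observation. It is not the paper's formulation: it assumes integer pseudo-counts, one binary foreground/background observation per element per view, and a hypothetical `visibility` table mapping each candidate view to the elements it sees. All names are illustrative.

```python
import math


def harmonic(n):
    """n-th harmonic number H_n; for integers, digamma(n + 1) = H_n - gamma."""
    return sum(1.0 / k for k in range(1, n + 1))


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def expected_information_gain(a, b):
    """Mutual information (nats) between the next Bernoulli observation and
    theta ~ Beta(a, b), for integer a, b >= 1.

    Uses E[theta * ln(theta)] = p * (H_a - H_{a+b}) with p = a / (a + b).
    """
    p = a / (a + b)
    h_ab = harmonic(a + b)
    expected_cond_entropy = (-p * (harmonic(a) - h_ab)
                             - (1.0 - p) * (harmonic(b) - h_ab))
    return binary_entropy(p) - expected_cond_entropy


def update(a, b, observed_foreground):
    """Conjugate Beta-Bernoulli update after one binary observation."""
    return (a + 1, b) if observed_foreground else (a, b + 1)


def pick_next_view(posteriors, visibility):
    """Greedy choice: the view whose visible elements have the largest summed EIG."""
    scores = {view: sum(expected_information_gain(*posteriors[i]) for i in ids)
              for view, ids in visibility.items()}
    return max(scores, key=scores.get)
```

Under these assumptions the gain shrinks as counts accumulate, so the greedy rule naturally steers toward views covering elements that are still uncertain.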

View full details

Poster

Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass

Liyi Chen ⋅ Pengfei Wang ⋅ Guowen Zhang ⋅ Zhiyuan Ma ⋅ Lei Zhang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 508

Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design of different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge to achieve our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model's representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.

View full details

Poster

TokenLight: Precise Lighting Control in Images using Attribute Tokens

Sumit Chaturvedi ⋅ Yannick Hold-Geoffroy ⋅ Mengwei Ren ⋅ Jingyuan Liu ⋅ He Zhang ⋅ Yiqun Mei ⋅ Julie Dorsey ⋅ ZHIXIN SHU

Jun 6, 11:45 AM - 1:45 PM ExHall F 512

This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly.

View full details

Poster

Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation

Qitong Yang ⋅ Mingtao Feng ⋅ Zijie Wu ⋅ Huixin Zhu ⋅ Weisheng Dong ⋅ Yaonan Wang ⋅ Ajmal Mian

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 513

3D shape generation has become increasingly important for graphics and vision applications. Current part-aware 3D generation usually overlooks hierarchical part relations or inefficiently encodes multi-level semantics in Euclidean space. We therefore propose a novel framework for hierarchical and efficient part-aware 3D generation in hyperbolic space. Our contributions are three-fold: (1) Hierarchical Hyperbolic Mixture Model (H$^2$MM): We propose a part-aware semantic representation of objects within a hyperbolic manifold, providing a high-fidelity hierarchical part-aware representation of object details and semantics. (2) Hyperbolic Semantically Consistent Diffusion Model: We design a geodesic diffusion process that preserves the hierarchical and semantic structure of H$^{2}$MM, progressively generates semantics from conditions, and generates objects under their joint guidance. We use an adaptive tree-structured neural network to loosen the constraint of jointly generating nodes and edges in previous hyperbolic diffusion. (3) Hyperbolic Diffusion Model Solver: We leverage higher-order Riemannian gradients on hyperbolic manifolds to design a fast, dedicated high-order solver for diffusion ODEs with a convergence-order guarantee. Extensive experiments demonstrate that our method achieves superior quality and efficiency. Code will be public.

View full details

Poster

Kaleidoscopic Scintillation Event Imaging

Alex Bocchieri ⋅ John Mamish ⋅ David Appleyard ⋅ Andreas Velten

Jun 6, 11:45 AM - 1:45 PM ExHall F 513

Scintillators are transparent materials that interact with high-energy particles and emit visible light as a result. They are used in state-of-the-art methods of measuring high-energy particles and radiation sources. Most existing methods use fast single-pixel detectors to detect and time scintillation events. Cameras provide spatial resolution but can only capture an average over many events, making it difficult to image the events associated with an individual particle. Emerging single-photon avalanche diode cameras combine speed and spatial resolution to enable capturing images of individual events. This allows us to use machine vision techniques to analyze events, enabling new types of detectors. The main challenge is the very low brightness of the events. Techniques have to work with a very limited number of photons. We propose a kaleidoscopic scintillator to increase light collection in a single-photon camera while preserving the event's spatial information. The kaleidoscopic geometry creates mirror reflections of the event in known locations for a given event location that are captured by the camera. We introduce theory for imaging an event in a kaleidoscopic scintillator and an algorithm to estimate the event's 3D position. We find that the kaleidoscopic scintillator design provides sufficient light collection to perform high-resolution event measurements for advanced radiation imaging techniques using a commercial CMOS single-photon camera.

View full details

Poster

Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy

Shuwei Shao ⋅ Kejin Zhu ⋅ Shixing Ma ⋅ Xinzhe Du ⋅ Baochang Zhang ⋅ Zhe Min

Jun 7, 11:45 AM - 1:45 PM ExHall F 513

Monocular depth estimation serves as a core technique in endoscopic applications such as 3D reconstruction and localization. However, most existing methods focus primarily on in-domain depth estimation, which limits their robustness and prevents them from delivering strong cross-domain performance, due to variations in depth distributions, illumination conditions, and texture patterns. In this work, we propose Depth Any Endoscopy (DAE), a novel self-supervised framework for generalizable depth estimation in monocular endoscopy. Specifically, we develop a dual-level Mixture-of-Experts (MoE) adaptation paradigm that effectively tailors Vision Foundation Models to diverse endoscopic procedures, such as laparoscopy and colonoscopy, accounting for the challenges posed by varying environments. Internally, we integrate LoRA and Adapter modules within the MoE architecture, allowing the model to flexibly adapt to the characteristics of input data. Externally, a mixture of domain-specific experts provides customized guidance to enhance the training stability. In addition, we introduce a learnable gradient harmonization mechanism to dynamically balance the optimization between the depth and pose networks, along with a semantic distribution calibration module that strengthens the semantic consistency of depth predictions. Extensive experiments demonstrate that the proposed DAE achieves state-of-the-art performance in both zero-shot and in-domain depth estimation scenarios.

View full details

Poster

SpiderCam: Low-Power Snapshot Depth from Differential Defocus

Marcos A. Ferreira ⋅ Tianao Li ⋅ John Mamish ⋅ Josiah Hester ⋅ Yaman Sangar ⋅ Qi Guo ⋅ Emma Alexander

Jun 7, 3:30 PM - 5:30 PM ExHall A 513

We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real-time at 32.5 FPS over a working range of 52 cm while consuming 611 mW of power in total. SpiderCam comprises a custom camera which simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.

View full details

Poster

MeshRipple: Structured Autoregressive Generation of Artist-Meshes

JunKai Lin ⋅ Hang Long ⋅ Huipeng Guo ⋅ Jielei Zhang ⋅ JiaYi Yang ⋅ Tianle Guo ⋅ Yang Yang ⋅ Jianwen Li ⋅ Wenxiao ZHANG ⋅ Matthias Nießner ⋅ Wei Yang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 514

Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface. MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies. This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.

View full details

Poster

gQIR: Generative Quanta Image Reconstruction

Aryan Garg ⋅ Sizhuo Ma ⋅ Mohit Gupta

Jun 6, 11:45 AM - 1:45 PM ExHall F 514

Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw $\textit{quanta frames}$ contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging $\textit{Deforming (XD)}$ video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing.
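
For readers unfamiliar with Bernoulli photon statistics, the following sketch shows the textbook single-pixel model, not the paper's method. It assumes each binary frame records 1 if at least one photon is detected, that photon arrivals are Poisson, and that dark counts are ignored; `flux` here means the expected number of detected photons per frame. The clamp for fully saturated pixels is an ad hoc illustrative choice.

```python
import math


def detection_probability(flux):
    """Probability that a binary frame reads 1, assuming Poisson arrivals
    with `flux` expected detected photons per frame and no dark counts."""
    return 1.0 - math.exp(-flux)


def flux_mle(binary_frames):
    """Maximum-likelihood flux estimate for one pixel from a burst of 0/1 frames.

    Inverts p = 1 - exp(-flux) at the empirical detection rate. If every
    frame is 1 the estimate would be infinite, so the count is clamped to
    n - 0.5 (an assumed, illustrative regularization).
    """
    n = len(binary_frames)
    ones = min(sum(binary_frames), n - 0.5)
    return -math.log(1.0 - ones / n)
```

The nonlinearity is the point: averaging binary frames underestimates bright regions, which is one reason restoration pipelines built for Gaussian-like noise transfer poorly to quanta frames.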

Poster

FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation

Hanxiao Wang ⋅ Yuanchen Guo ⋅ Ying-Tian Liu ⋅ Zi-Xin Zou ⋅ Biao Zhang ⋅ Weize Quan ⋅ Ding Liang ⋅ Yan-Pei Cao ⋅ Dong-Ming Yan

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 515

Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our ``one-face-one-token'' strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
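
The factor-of-nine reduction follows from simple counting: a triangle has three vertices with three coordinates each, so a coordinate-level serialization spends nine tokens per face, while a face-level one spends one. The sketch below only reproduces that arithmetic. The function names are hypothetical, it ignores special tokens and any vertex sharing, and it assumes the reported compression ratio is measured against the nine-token-per-face baseline.

```python
def coordinate_level_length(num_faces: int) -> int:
    """Tokens needed when every vertex coordinate is its own token."""
    vertices_per_face = 3
    coords_per_vertex = 3
    return num_faces * vertices_per_face * coords_per_vertex


def face_level_length(num_faces: int) -> int:
    """Tokens needed under a one-face-one-token scheme."""
    return num_faces


def compression_ratio(num_faces: int) -> float:
    """Face-level length divided by coordinate-level length (assumed definition)."""
    return face_level_length(num_faces) / coordinate_level_length(num_faces)
```

Under these assumptions the ratio is 1/9 for any mesh size, which rounds to the 0.11 quoted in the abstract.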

Poster

Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation

Haidong Wu ⋅ Snehal Bhayani ⋅ Janne Heikkilä

Jun 6, 11:45 AM - 1:45 PM ExHall F 515

Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problems demonstrate that the proposed solver achieves strong numerical stability and competitive runtime, particularly for small-scale problems, providing a practical alternative to traditional Gröbner-basis and resultant-based solvers.
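
The interpolation step can be illustrated on a toy problem. The sketch below is not the authors' solver: it uses a small dense matrix polynomial rather than a sparse hidden-variable resultant, and `det_poly_coeffs`, `solve_hidden_variable`, and `example_matrix` are hypothetical names. It shows only the general idea of evaluating the determinant numerically at roots of unity and recovering the polynomial coefficients with an inverse FFT, with no symbolic expansion.

```python
import numpy as np


def det_poly_coeffs(matrix_fn, degree_bound):
    """Coefficients c_0..c_D of det(M(x)), lowest degree first.

    matrix_fn(x) returns the numeric matrix M(x); degree_bound is an upper
    bound D on the degree of det(M(x)) in the hidden variable x.
    """
    n = degree_bound + 1
    # Sample points: the n-th roots of unity x_j = exp(-2*pi*i*j/n).
    xs = np.exp(-2j * np.pi * np.arange(n) / n)
    # Numeric determinant at each sample point.
    samples = np.array([np.linalg.det(matrix_fn(x)) for x in xs])
    # For these points, samples == fft(coeffs), so ifft recovers the coefficients.
    return np.fft.ifft(samples)


def solve_hidden_variable(coeffs):
    """Roots of the interpolated polynomial (np.roots expects highest degree first)."""
    return np.roots(coeffs[::-1])


def example_matrix(x):
    """Toy M(x) = A0 + x * A1 with det(M(x)) = 6 + 4x + x^2."""
    a0 = np.array([[2.0, 1.0], [0.0, 3.0]])
    a1 = np.array([[1.0, 0.0], [1.0, 1.0]])
    return a0 + x * a1
```

Calling `det_poly_coeffs(example_matrix, degree_bound=2)` recovers the coefficients of 6 + 4x + x^2 up to floating-point error. If `degree_bound` overestimates the true degree, the trailing coefficients come out near zero and would need trimming before root finding.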

Poster

Dark3R: Learning Structure from Motion in the Dark

Andrew Y. Guo ⋅ Anagh Malik ⋅ SaiKiran Tedla ⋅ Yutong Dai ⋅ Yiqian Qin ⋅ Zach Salehe ⋅ Benjamin Attal ⋅ Sotiris Nousias ⋅ Kiriakos N. Kutulakos ⋅ David B. Lindell

Jun 7, 11:45 AM - 1:45 PM ExHall F 516

We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB, a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher–student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy–clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson–Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.
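
A Poisson–Gaussian model of the kind mentioned above can be sketched in a few lines. This is a generic textbook form, not the paper's calibrated pipeline: the parameters `gain` and `read_std` and the function name are assumptions made for illustration, and the abstract does not state how the noise levels are chosen.

```python
import numpy as np


def add_poisson_gaussian_noise(clean, gain, read_std, rng):
    """Synthesize a noisy raw frame from a well-exposed one.

    clean    : array of non-negative linear raw intensities
    gain     : intensity units per detected photon (hypothetical parameter)
    read_std : standard deviation of signal-independent read noise
    rng      : numpy.random.Generator
    """
    # Signal-dependent shot noise: photon counts are Poisson distributed.
    photons = rng.poisson(clean / gain)
    # Signal-independent read noise: zero-mean Gaussian.
    read_noise = rng.normal(0.0, read_std, size=clean.shape)
    return gain * photons + read_noise
```

Under this model the noisy pixel has mean `clean` and variance `gain * clean + read_std**2`, so darker pixels have a lower signal-to-noise ratio.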

Poster

CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling

Binbin Huang ⋅ Haobin Duan ⋅ Yiqun Zhao ⋅ Zibo Zhao ⋅ Yi Ma ⋅ Shenghua Gao

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 517

We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10\% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.

Poster

From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images

Ruikun Zhang ⋅ Yan Yang ⋅ Liyuan Pan

Jun 6, 11:45 AM - 1:45 PM ExHall F 517

Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression: 1) each spot often contains multiple cells with distinct gene expression profiles; 2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets in multiple spatial scales. The source code will be publicly available.
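
The aggregation step described above can be sketched as pooling a dense per-pixel map inside a spot mask. The abstract does not say whether values are summed or averaged, nor what shape a spot has; the sketch assumes circular spots and mean pooling, and `aggregate_spot` is a hypothetical name.

```python
import numpy as np


def aggregate_spot(dense_map, center_yx, radius):
    """Pool a dense expression map inside one circular spot.

    dense_map : (H, W, G) array, per-pixel prediction for G genes
    center_yx : (row, col) of the spot centre, in pixels
    radius    : spot radius in pixels; varying it changes the spatial scale
    Returns a (G,) vector: the mean over pixels inside the spot (an assumption).
    """
    height, width, _ = dense_map.shape
    rows, cols = np.mgrid[0:height, 0:width]
    cy, cx = center_yx
    mask = (rows - cy) ** 2 + (cols - cx) ** 2 <= radius ** 2
    # Boolean indexing with an (H, W) mask yields a (num_pixels, G) array.
    return dense_map[mask].mean(axis=0)
```

Because the dense map is predicted once, spots of different sizes can be evaluated from the same map simply by changing `radius`.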

Poster

What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?

David Yan ⋅ Alexander Raistrick ⋅ Jia Deng

Jun 7, 11:45 AM - 1:45 PM ExHall F 517

Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We validate our findings by collecting the best settings and creating a large-scale dataset. Training only on this dataset achieves better performance than training on a mixture of widely used datasets, and is competitive with training on the FoundationStereo dataset, with the additional benefit of open-source generation code and an accompanying parameter analysis to enable further research. We open-source our system to enable further research on procedural stereo datasets.

Poster

RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion

Panjun Liu ⋅ Jiyuan Xia ⋅ YUANSHEN GUAN ⋅ Yong Li ⋅ Zhiqiang Lang ⋅ Ruikang Xu ⋅ Chang Chen ⋅ Dehua Song ⋅ Fenglong Song ⋅ Zhiwei Xiong

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 520

Extreme low-light Raw image restoration remains challenging due to overwhelming noise and severe detail loss. In this paper, we exploit the potential of the dual-exposure setting for this severely ill-posed problem. Existing methods suffer from unreliable cross-exposure alignment, resulting in degraded detail recovery and compromised color fidelity. To address these challenges, we propose RawMetaDiff, a novel generative diffusion framework that restores a high-fidelity Raw image from a short-exposure input, conditioned on a potentially misaligned long-exposure reference under the guidance of Raw metadata. At its core, we propose two complementary mechanisms: the Meta-Assistant Color Transfer (MACT) enforces color consistency by aligning global color statistics along the channel dimension, while the Meta-Normed Cross Attention (MNCA) leverages Raw metadata to establish robust cross-exposure spatial correspondences and inject shadow details. To support robust diffusion training, we first collect a 1K real-world, dual-exposure Raw dataset, namely DERaw, and then design a realistic degradation model to synthesize data that closely approximates real-world conditions. Extensive experiments on both synthetic and real-world datasets demonstrate that RawMetaDiff significantly outperforms existing methods, demonstrating an effective new solution for extreme low-light Raw image restoration from the generative perspective.

Poster

Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

ZIREN GONG ⋅ Xiaohan Li ⋅ Fabio Tosi ⋅ Jiawei Han ⋅ Stefano Mattoccia ⋅ Jianfei Cai ⋅ Matteo Poggi

Jun 7, 11:45 AM - 1:45 PM ExHall F 520

We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips alongside object-level semantics; and 2D–3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation — marking a step forward toward real-time, semantics-aware Spatial AI.

Poster

M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation

Yiheng Zhang ⋅ Zhuojiang Cai ⋅ Mingdao Wang ⋅ Meitong Guo ⋅ Tianxiao Li ⋅ Li Lin ⋅ Yuwang Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 521

In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 21,367 layouts and over 433k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using both a text-conditioned diffusion model and a text-conditioned autoregressive model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis. The full dataset and code will be made public upon acceptance.

Poster

Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction

David Novikov ⋅ Eilon Vaknin ⋅ Narek Tumanyan ⋅ Mark Sheinin

Jun 7, 3:30 PM - 5:30 PM ExHall A 521

The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30–60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult in general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific scenarios (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to camera optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed scene using only unaugmented low-speed cameras. Instead of modifying the hardware or optics of each individual camera, we encode high-speed scene dynamics by illuminating the scene with a rapid, sequential color sequence. This results in simultaneous multi-view capture of the scene, in which high-speed temporal information is encoded in the images' color. To construct a high-speed volumetric representation of the dynamic scene, we develop a novel dynamic Gaussian Splatting-based approach that decodes the temporal information from the images. We evaluate our approach on simulated scenes and real-world experiments using a multi-camera imaging setup, showing first-of-a-kind high-speed volumetric scene reconstructions.

Poster

UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents

Xufan He ⋅ Yushuang Wu ⋅ Xiaoyang Guo ⋅ Chongjie Ye ⋅ Jiaqing Zhou ⋅ Tianlei Hu ⋅ Xiaoguang Han ⋅ Dong Du

Jun 7, 11:45 AM - 1:45 PM ExHall F 522

Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry–segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.

Poster

NI-Tex: Non-isometric Image-based Garment Texture Generation

Hui Shan ⋅ Ming Li ⋅ Haitao Yang ⋅ Kai Zheng ⋅ Sizhe Zheng ⋅ Yanwei Fu ⋅ Xiangru Huang

Jun 6, 11:45 AM - 1:45 PM ExHall F 526

Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match the image poses, which significantly constrains the texture generation quality and flexibility. To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.

Poster

Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation

Xinyue Liu ⋅ Jin Liu ⋅ Hongbo Wang ⋅ Ran He ⋅ Huaibo Huang

Jun 7, 11:45 AM - 1:45 PM ExHall F 526

Recently, generating 3D assets using visual priors from pretrained diffusion models has shown remarkable results. However, due to the inherent lack of 3D geometric priors in 2D diffusion, the synthesized results often suffer from spatial hallucination and multi-view inconsistency. To address this limitation, we propose Thoughtful3D, a novel framework that enhances 3D content generation quality by introducing structural chain-of-thought (CoT) reasoning to alleviate inconsistency issues and mitigate hallucinations. Specifically, we design a dual-phase structural CoT strategy: (1) 3DBlueprint-CoT explicitly plans the 3D generation process through textual semantic parsing and logical deduction during the initialization phase; (2) 3DRefine-CoT dynamically evaluates latent inconsistencies by analyzing multiple renderings, employing a multi-round iterative refinement mechanism to suppress hallucinations and enhance cross-view consistency. To further promote consistency across views, we propose a Cross-view Semantic Appearance Alignment strategy that enhances multi-view consistency by establishing dynamic geometric associations between the same features from different viewpoints. Extensive experiments demonstrate that Thoughtful3D significantly improves the quality and consistency of generated 3D assets.

Poster

UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching

Qilin Huang ⋅ Quynh Anh Huynh ⋅ Long Le ⋅ Chen Wang ⋅ Chuhao Chen ⋅ Ryan Lucas ⋅ Eric Eaton ⋅ Lingjie Liu

Jun 6, 11:45 AM - 1:45 PM ExHall F 528

Recent progress in 3D reconstruction, such as NeRFs and 3D Gaussian Splatting, has made it easy to recover geometry and appearance from images. However, these static representations remain blind to the physics that govern how objects deform and respond to forces. Building interactive 3D worlds therefore requires predicting not only shape but the underlying material properties. Prior approaches either rely on slow test-time optimization or, more recently, a fast feed-forward predictor such as Pixie. However, these models produce only a single point estimate of physical parameters and are limited to a single simulation backend, restricting both expressiveness and portability. We introduce UniPixie, a generative physics-from-pixels framework that overcomes both limitations. UniPixie predicts a controllable, continuous soft-to-stiff distribution of plausible material properties from a single visual input, capturing inherent physical ambiguity. In addition, UniPixie is the first unified architecture to generate simulation-ready parameters for multiple physics solvers, including Material Point Method (MPM), Linear Blend Skinning (LBS), and Spring-Mass systems. Trained on our new PIXIEMULTIVERSE dataset of annotated material ranges, UniPixie produces diverse, physically consistent dynamics and achieves state-of-the-art accuracy, outperforming deterministic baselines by over 2x while inheriting the fast and generalizable inference from the prior feed-forward work.

Poster

2D-LFM: Lifting Foundation Model without 3D Supervision

Mosam Dabhi ⋅ Irhas Gill ⋅ László A. Jeni ⋅ Simon Lucey

Jun 7, 11:45 AM - 1:45 PM ExHall F 529

Recent vision foundation models give the impression that 3D reconstruction from RGB is largely solved. Yet these systems struggle with object-specific 3D structure: the fine-grained geometry implied by an object’s landmarks or skeleton. In this paper, we show that when a model is given only 2D landmarks, it can recover more accurate 3D structure than state-of-the-art depth-from-RGB foundation models. Classical lifting approaches such as PAUL demonstrate this principle but do not scale beyond single categories, while methods like 3D-LFM scale but require extensive 3D supervision. We present the first lifting foundation model that learns object-specific 3D geometry using only 2D supervision. The key idea is to inject correspondence structure into the model via a positional encoding inspired by classical structure-from-motion. This simple inductive bias enables robust, object-agnostic 3D lifting that rivals or exceeds recent 3D-supervised approaches, revealing that landmark-based lifting remains a powerful and under-exploited paradigm for 3D understanding.

Poster

HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction

Chen Zhang ⋅ Yilu An ⋅ Ying Chen ⋅ Hao Li ⋅ Xitong Ling ⋅ Lihao Liu ⋅ Junjun He ⋅ Yuxiang Lin ⋅ Zihui Wang ⋅ Rongshan Yu

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 531

Spatial Transcriptomics (ST) merges the benefits of pathology images and gene expression, linking molecular profiles with tissue structure to analyze spot-level function comprehensively. Predicting gene expression from histology images is a cost-effective alternative to expensive ST technologies. However, existing methods mainly focus on spot-level image-to-gene matching but fail to leverage the full hierarchical structure of ST data, especially on the gene expression side, leading to incomplete image-gene alignment. Moreover, a challenge arises from the inherent information asymmetry: gene expression profiles contain more molecular details that may lack salient visual correlates in histological images, demanding a sophisticated representation learning approach to bridge this modality gap. We propose HyperST, a framework for ST prediction that learns multi-level image-gene representations by modeling the data's inherent hierarchy within hyperbolic space, a natural geometric setting for such structures. First, we design a Multi-Level Representation Extractor to capture both spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module is introduced to unify these representations, performing spatial alignment while hierarchically structuring image and gene embeddings. This alignment strategy enriches the image representations with molecular semantics, significantly improving cross-modal prediction. HyperST achieves state-of-the-art performance on four public datasets from different tissues, paving the way for more scalable and accurate spatial transcriptomics prediction.

Poster

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Tanush Yadav ⋅ Reza Salehi ⋅ Jae Sung Park ⋅ Vivek Ramanujan ⋅ Hannaneh Hajishirzi ⋅ Yejin Choi ⋅ Ali Farhadi ⋅ Rohun Tripathi ⋅ Ranjay Krishna

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 530

Videos capture a rich array of subtleties in actions. While large video language models have advanced in understanding long videos, their ability to discern nuanced motions in domain-specific, fine-grained actions remains unclear. Current benchmarks evaluate fine-grained actions in a domain-agnostic manner, making it hard to evaluate models on this task. To address this gap, we introduce VideoNet, a comprehensive benchmark aimed at evaluating the domain-specific, fine-grained action understanding of video models. This benchmark covers $1,087$ distinct actions spanning $38$ domains, from bouldering to suturing. Our evaluations demonstrate that current video models encounter significant difficulties in recognizing these actions in a zero-shot scenario. We then examine how to improve model performance on this task. To this end, we collect a training dataset of 160K clips of fine-grained, domain-specific actions. Post-training a 4B model on this data, we surpass all Gemini models and GPT-4o on our benchmark. Next, we study the few-shot setting and demonstrate that even the best-performing model, GPT-5, struggles there. When given three in-context examples, the gap between model and human performance widens, with human accuracy improving by 13% while models only improve by 3%. This suggests that video language models are currently not effective few-shot learners, unlike their text-only counterparts, and that further gains may be elicited from improving these models' few-shot learning capabilities.

Poster

FabricGen: Microstructure-Aware Woven Fabric Generation

Yingjie Tang ⋅ Di Luo ⋅ Zixiong Wang ⋅ Xiaoli Ling ⋅ Jian Yang ⋅ Beibei Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 532

Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that stick to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.

Poster

Spherical Leech Quantization for Visual Tokenization and Generation

Yue Zhao ⋅ Hanwen Jiang ⋅ Zhenlin Xu ⋅ Chutong Yang ⋅ Ehsan Adeli ⋅ Philipp Krähenbühl

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 533

Lookup-free quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization ($\Lambda_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.

Poster

Lafite: A Generative Latent Field for 3D Native Texturing

Chia-Hao Chen ⋅ Yuanchen Guo ⋅ Zi-Xin Zou ⋅ Ze Yuan ⋅ Guan Luo ⋅ Xiaojuan Qi ⋅ Ding Liang ⋅ Yan-Pei Cao ⋅ Song-Hai Zhang

Jun 6, 11:45 AM - 1:45 PM ExHall F 533

Generating detailed and seamless textures for 3D meshes remains an open challenge. Recent image and video generation models, empowered by large-scale visual priors, are capable of producing highly detailed images and are thus promising for multi-view texture synthesis. However, evaluating texture quality involves multiple dimensions beyond visual fidelity. Multi-view back-projection often introduces seams and inconsistencies between different views or near occluded regions, while direct generation on UV-unwrapped maps suffers from UV distortions and ambiguities. Generating textures directly in 3D space offers an inherent advantage in ensuring continuity and spatial coherence, making it a critical and worthwhile research direction. Therefore, we systematically investigate 3D-native texture generation from the perspectives of representation and generation, and present current best practices for this approach. To this end, we employ a local vector field with a structured latent representation to model the joint distribution of texture and geometry. This design enables texture generation conditioned on high-fidelity geometric features within a unified latent space. Crucially, our approach is inherently free from occlusion artifacts, multi-view inconsistencies, and UV-related distortions caused by fragmented surface parameterizations. Extensive experiments demonstrate that our method produces high-quality, seamless textures and supports flexible downstream tasks such as editing and inpainting, marking a significant step forward in 3D-native texture generation.

Poster

MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention

Pedro M. P. Curvo ⋅ Jan-Willem van de Meent ⋅ Maksim Zhdanov

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 534

A key scalability challenge in neural solvers for industrial-scale physics simulations is efficiently capturing both fine-grained local interactions and long-range global dependencies across millions of spatial elements. We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. To partition the input domain into spatially-coherent patches, we employ ball trees, which handle irregular geometries efficiently. This dual-scale design enables MSPT to scale to millions of points on a single GPU. We validate MSPT on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets (ShapeNet-Car, Ahmed-ML), achieving state-of-the-art accuracy with substantially lower memory footprint and computational cost.
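
The dual-scale idea can be sketched with plain NumPy. This is an illustration under several assumptions, not the MSPT architecture: a recursive median split stands in for the ball tree, patch tokens are mean-pooled, there are no learned projections, local and global keys are simply concatenated, and patches are processed in a loop rather than in parallel. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention without learned weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def split_into_patches(points, indices, leaf_size):
    """Recursive median split along the widest axis (stand-in for a ball tree)."""
    if len(indices) <= leaf_size:
        return [indices]
    pts = points[indices]
    axis = np.argmax(pts.max(axis=0) - pts.min(axis=0))
    order = indices[np.argsort(pts[:, axis])]
    mid = len(order) // 2
    return (split_into_patches(points, order[:mid], leaf_size)
            + split_into_patches(points, order[mid:], leaf_size))


def dual_scale_attention(points, feats, leaf_size=32):
    """Each point attends to its own patch plus one coarse token per patch."""
    patches = split_into_patches(points, np.arange(len(points)), leaf_size)
    # Coarse representation: mean-pooled features of every patch.
    coarse = np.stack([feats[p].mean(axis=0) for p in patches])
    out = np.empty_like(feats)
    for p in patches:
        keys_values = np.concatenate([feats[p], coarse], axis=0)
        out[p] = attention(feats[p], keys_values, keys_values)
    return out, patches
```

With N points and patches of size at most L, each point attends to at most L local tokens plus one token per patch, rather than to all N points, which is where the memory saving in such a design comes from.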

View full details

Poster

LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation

Yusheng Li ⋅ Lizhi LOU ⋅ Yan Tang ⋅ Zekai Miao ⋅ shaoming zhang ⋅ Jianmei Wang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 536

Metric depth estimation aims to recover depth maps with absolute scale, high resolution, and cross-scene consistency from visual observations. Existing approaches either rely on large-scale models or costly sensors to preserve metric accuracy and generalization, both ill-suited to resource-constrained deployment. In this paper, we propose **LiteSense**, a lightweight RGB-ToF fusion framework that leverages compact normalized histogram (CNH) signals together with RGB cues to achieve efficient and reliable metric depth estimation. Specifically, LiteSense uses a U-Net-style encoder-decoder that forms an RGB-D input by concatenating RGB with upsampled ToF depth, providing explicit metric priors. To address resolution disparity and recover fine details, we introduce the **Patch-wise CNH Spatial Injection (PCSI)** module, which exploits zone-wise histogram measurements via cross-attention to guide high-level feature fusion. Extensively evaluated on NYUv2 and SUN RGB-D, LiteSense consistently outperforms monocular baselines and DELTAR with substantially lower computational cost, and demonstrates promising zero-shot generalization. We further introduce **THDR3K**, the first indoor RGB-ToF-CNH dataset, where LiteSense achieves real-world accuracy comparable to—and in challenging cases surpassing—Intel RealSense. All relevant source code and the collected dataset will be released.

View full details

Poster

SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

Phuc Pham ⋅ Uy Dieu Tran ⋅ Binh-Son Hua ⋅ Phong Nguyen

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 535

Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed via an efficient inverse mapping process, incorporating remeshing and dynamic stitching algorithms, thereby eliminating the need for physical re-simulation. Extensive experiments on the Multimodal GarmentCodeData benchmark demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

View full details

Poster

Drainage: A Unifying Framework for Addressing Class Uncertainty

Yasser Taha ⋅ Grégoire Montavon ⋅ Nils Körber

Jun 7, 3:30 PM - 5:30 PM ExHall A 535

Modern deep learning faces significant challenges with noisy labels, class ambiguity, as well as the need to robustly reject out-of-distribution or corrupted samples. In this work, we propose a unified framework based on the concept of a "drainage node" which we add at the output of the network. The node serves to reallocate probability mass toward uncertainty, while preserving desirable properties such as end-to-end training and differentiability. This mechanism provides a natural escape route for highly ambiguous, anomalous, or noisy samples, particularly relevant for instance-dependent and asymmetric label noise. In systematic experiments involving the addition of varying proportions of instance-dependent noise or asymmetric noise to CIFAR-10/100 labels, our drainage formulation achieves an accuracy increase of up to 9% over existing approaches in the high-noise regime. Our results on real-world datasets, such as mini-WebVision, mini-ImageNet and Clothing-1M, match or surpass existing state-of-the-art methods. Qualitative analysis reveals a denoising effect, where the drainage neuron consistently absorbs corrupt, mislabeled, or outlier data, leading to more stable decision boundaries. Furthermore, our drainage formulation enables applications well beyond classification, with immediate benefits for web-scale, semi-supervised dataset cleaning, and open-set applications.
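
The idea of an extra output that absorbs probability mass can be illustrated with a minimal sketch. It assumes the drainage node is simply one additional logit appended after the class logits, and that a sample is rejected when the drainage probability exceeds every class probability. The function names and the rejection rule are illustrative assumptions, not the authors' formulation or training loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_drainage(logits):
    """Return (class index or None, drainage probability).

    Assumption: the last logit belongs to the drainage node and the
    preceding logits belong to the ordinary classes.
    """
    probs = softmax(logits)
    class_probs, drain = probs[:-1], probs[-1]
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    # Hypothetical rejection rule: abstain when drainage dominates.
    if drain > class_probs[best]:
        return None, drain
    return best, drain
```

For example, `predict_with_drainage([2.0, 0.1, 0.0])` returns class `0` with a small drainage probability, while `predict_with_drainage([0.1, 0.2, 3.0])` abstains because most of the mass sits on the drainage output.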

View full details

Poster

3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Hongcan Xiao ⋅ Xinyue Xiao ⋅ Yilin Wang ⋅ Yue Zhang ⋅ Yonggang Qi

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 536

Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bézier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that tailors the recently proposed Group Relative Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, i.e., each pair consisting of a relatively better and worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model’s 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bézier sketches from textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for training-free 3D sketch intelligence.

View full details

Poster

MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Chieh-Yun Chen ⋅ Zhonghao Wang ⋅ Qi Chen ⋅ Zhifan Ye ⋅ Min Shi ⋅ Yue Zhao ⋅ Yinan Zhao ⋅ Hui Qu ⋅ Wei-An Lin ⋅ Yiru Shen ⋅ Ajinkya Kale ⋅ Irfan Essa ⋅ Humphrey Shi

Jun 7, 11:45 AM - 1:45 PM ExHall F 536

Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax—improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.

View full details

Poster

The Midas Touch for Metric Depth

Yu Ma ⋅ Zizhan Guo ⋅ Zuyi Xiong ⋅ Haoran Zhang ⋅ Yi Feng ⋅ Hongbo Zhao ⋅ Hanli Wang ⋅ Rui Fan

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 538

Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D modeling tasks.
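
For context on converting relative depth to metric depth from sparse measurements, the sketch below shows the simplest global baseline: a single least-squares scale and shift fitted to a few sparse metric samples. This is not MTD's segment-wise graph optimization or its geodesic refinement. It assumes the relative prediction is related to metric depth by one global affine map, which is exactly the assumption that local scale inconsistencies violate. All names are hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares (scale, shift) so that scale * relative + shift ~ metric.

    `relative` and `metric` are equal-length lists holding the relative
    depth prediction and the sparse metric depth at the same pixels.
    """
    n = len(relative)
    mean_r = sum(relative) / n
    mean_d = sum(metric) / n
    cov = sum((r - mean_r) * (d - mean_d) for r, d in zip(relative, metric))
    var = sum((r - mean_r) ** 2 for r in relative)
    scale = cov / var  # requires at least two distinct relative values
    shift = mean_d - scale * mean_r
    return scale, shift

def to_metric(relative_map, scale, shift):
    """Apply the fitted global affine map to every pixel of a 2D list."""
    return [[scale * v + shift for v in row] for row in relative_map]
```

A segment-wise variant would fit one such pair per region rather than one per image, which is closer in spirit to what the abstract describes.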

View full details

Poster

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu ⋅ Yanhong Zeng ⋅ Haobo Li ⋅ Hao Ouyang ⋅ Qiuyu Wang ⋅ Ka Leong Cheng ⋅ Jiapeng Zhu ⋅ Hengyuan Cao ⋅ Zhipeng Zhang ⋅ Xing Zhu ⋅ Yujun Shen ⋅ Min Zhang

Jun 7, 11:45 AM - 1:45 PM ExHall F 537

Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
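
The EMA-Sink update described above can be sketched as follows. This is a minimal illustration under stated assumptions: each frame is reduced to a single token vector, the sink is initialised from the first frame, and a scalar `decay` controls the exponential moving average. The names `ema_update`, `stream_with_ema_sink`, and `decay` are hypothetical and do not come from the paper.

```python
from collections import deque

def ema_update(sink, evicted, decay):
    """Blend an evicted token into the sink: decay * sink + (1 - decay) * evicted."""
    return [decay * s + (1.0 - decay) * e for s, e in zip(sink, evicted)]

def stream_with_ema_sink(frames, window_size, decay=0.9):
    """Slide a fixed-size window over `frames`, folding evicted tokens into the sink.

    Returns the final sink token and the tokens still inside the window.
    """
    sink = list(frames[0])  # assumption: sink starts as a copy of the first frame
    window = deque()
    for frame in frames:
        if len(window) == window_size:
            evicted = window.popleft()  # oldest token leaves the window
            sink = ema_update(sink, evicted, decay)
        window.append(frame)
    return sink, list(window)
```

Because the sink has a fixed size and each update is a single weighted sum, its cost does not grow with the length of the stream, while a static sink would keep only the initial frames.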

View full details

Poster

C^2FG: Control Classifier-Free Guidance via Score Discrepancy Analysis

Jiayang Gao ⋅ Tianyi Zheng ⋅ Jiayang Zou ⋅ Fengxiang Yang ⋅ Shice Liu ⋅ Luyao Fan ⋅ Zheyu Zhang ⋅ Hao Zhang ⋅ Jinwei Chen ⋅ Peng-Tao Jiang ⋅ Bo Li ⋅ Jia Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 538

Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on the fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of the Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce **Control Classifier-Free Guidance (C$^2$FG)**, a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.
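
A time-dependent guidance weight can be sketched on top of the standard CFG combination. The abstract does not give the exact control function, so the form `w0 * exp(-rate * s)`, the normalised progress variable `s` in [0, 1], and the default values below are assumptions for illustration only. In particular, which end of the sampling trajectory `s = 0` corresponds to is left open here.

```python
import math

def guidance_weight(s, w0=7.5, rate=2.0):
    """Hypothetical exponentially decaying guidance weight for progress s in [0, 1]."""
    return w0 * math.exp(-rate * s)

def guided_prediction(eps_uncond, eps_cond, w):
    """Standard CFG combination: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With a fixed weight, `guided_prediction` is called with the same `w` at every step. The time-dependent variant would call `guidance_weight(s)` first and pass the result in, leaving the model and the sampler untouched, which is what makes such a scheme training-free and plug-in.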

View full details

Poster

Spectral Conformal Risk Control: Distribution-Free Tail Guarantees via Bayesian Quadrature

Mohammad Mahdi Kazemi Esfeh ⋅ Qi Yan ⋅ Yongxing Zhang ⋅ Zahra Gholami ⋅ Renjie Liao ⋅ Purang Abolmaesumi

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 539

Modern vision systems are deployed in settings where occasional catastrophic failures matter more than average accuracy—for example in medical imaging, autonomous driving, and safety monitoring. While conformal prediction gives distribution-free uncertainty guarantees, most existing methods only control mean error and are hard to tune toward rare but high-cost mistakes. We propose Bayesian-Quadrature Spectral Risk Control (BQ-SRC), a general framework for controlling tail-focused risks (such as conditional value at risk (CVaR)-style objectives) in a distribution-free way. BQ-SRC views conformal prediction through a Bayesian-quadrature lens and replaces mean-risk control with a flexible family of risk-averse criteria, while keeping the same black-box access to a trained model. A binomial testing scheme reduces the Monte Carlo conservatism of prior approaches, leading to tighter sets without sacrificing guarantees. We evaluate BQ-SRC across diverse vision tasks, including synthetic regression, closed-set and zero-shot image classification, multilabel classification, and semantic segmentation. Across these settings, BQ-SRC consistently maintains finite-sample risk guarantees and often yields smaller or otherwise more informative prediction sets than existing conformal and risk-controlling baselines, sometimes trading a modest amount of efficiency for stronger tail-risk control. We will make our implementation publicly available upon acceptance.
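
To make the contrast between mean risk and tail risk concrete, the sketch below computes a plain empirical CVaR: the average of the worst `(1 - alpha)` fraction of observed losses. This is only the textbook sample estimator, shown for intuition. It is not the BQ-SRC procedure and carries no distribution-free guarantee by itself. The function name and the rounding guard are illustrative choices.

```python
import math

def empirical_cvar(losses, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of losses (at least one sample)."""
    ordered = sorted(losses, reverse=True)  # largest losses first
    # round() guards against floating-point noise such as 1.9999999999999996
    tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    worst = ordered[:tail]
    return sum(worst) / len(worst)
```

For ten losses where eight are `0.0` and two are `1.0` and `3.0`, the mean is `0.4`, but `empirical_cvar(..., alpha=0.8)` averages only the two worst values, which is why a method that controls the mean can still leave rare high-cost mistakes unchecked.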

View full details

Poster

Generative Modeling of Weights: Generalization or Memorization?

Boya Zeng ⋅ Yida Yin ⋅ Zhiqiu Xu ⋅ Zhuang Liu

Jun 7, 3:30 PM - 5:30 PM ExHall A 539

Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative, well-known methods in this emerging area on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Contrary to claims in prior work, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. Our further results suggest that the memorization potentially resulted from limited data, overparameterized models, and the underuse of structural priors specific to weight data. Our findings highlight the need for more careful design and evaluation of generative models in new domains.

View full details

Poster

MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

Weiyu Li ⋅ Antoine Toisoul ⋅ Tom Monnier ⋅ Roman Shapovalov ⋅ Rakesh Ranjan ⋅ Ping Tan ⋅ Andrea Vedaldi

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 542

We present MeshFlow, a new method for compressing and generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh connectivity, which, however, scales poorly due to the inference cost being quadratic in mesh size. AR methods also require discretizing the vertex coordinates, which introduces quantization errors and can cause vertex collapse. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified-Flow transformer, which generates all mesh vertices and edges in parallel. This model samples meshes $26\times$ faster than the fastest AR generator while also achieving state-of-the-art accuracy across standard mesh-generation metrics.

View full details

Poster

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

Weifu Fu ⋅ Jinyang Li ⋅ Bin-Bin Gao ⋅ Jialin Li ⋅ Yuhuan Lin ⋅ Hanqiu Deng ⋅ Wenbing Tao ⋅ Yong Liu ⋅ Chengjie Wang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 545

Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text paired samples for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, extending the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal object detector supporting both text and visual prompts. Our visual prompt generation scheme builds on an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, parallel alignment with diverse real-world usage scenarios, and improved classification. Extensive experiments demonstrate that our visual prompt generation scheme, based on text-prompt-based detection pretraining, achieves a higher performance ceiling compared to using visual prompts alone. Our method achieves significant zero-shot detection performance on COCO, LVIS, and ODinW, and excels across various prompt-based detection protocols. In-domain evaluations also demonstrate robust localization performance.

View full details

Poster

Temporal Representation Enhancement (TRE): Learning to Forget Dominant Patterns for Enhanced Temporal Spiking Features

Wei Liu ⋅ Li Yang ⋅ Yufei Wang ⋅ Han Xiao ⋅ Boyu Cai ⋅ Weiming Hu

Jun 7, 11:45 AM - 1:45 PM ExHall F 546

Spiking Neural Networks (SNNs) naturally process visual inputs across multiple timesteps, offering rich temporal dynamics and energy-efficient computation. However, the temporally invariant supervision commonly used in training tends to reinforce the same dominant response patterns across timesteps, leading to redundant representations and limiting temporal discriminability. To overcome this limitation, we introduce \emph{Temporal Representation Enhancement} (TRE), a novel learning-to-forget paradigm that encourages more diverse and complementary temporal representations. TRE identifies high-contribution semantic patterns through class-specific contribution estimation and temporal accumulation, and selectively suppresses them using a dynamic modulation strategy. By redirecting the model’s attention toward alternative yet informative semantic cues, TRE promotes the learning of complementary features across timesteps. This approach not only strengthens the temporal discriminative capacity of SNNs but also enables more effective multi-timestep learning by leveraging richer semantic information. Extensive experiments on both static image datasets and dynamic neuromorphic datasets demonstrate that TRE consistently improves classification accuracy and feature diversity across different SNN backbones.

View full details

Poster

Reading Your Actions: Learning Generalizable Action Representations via Pre-training AEMG

Zhenghao Huang ⋅ Kaikai Wang ⋅ HUILIN YAO ⋅ Lin Shu

Jun 6, 11:45 AM - 1:45 PM ExHall F 547

Electromyography (EMG) is crucial for decoding human motor intentions and achieving natural human-computer interaction, but its generalization ability across subjects, devices, and tasks has long been limited by data heterogeneity, scarce annotations, and the lack of a unified representation paradigm. In this work, we introduce a novel perspective on EMG signals, treating muscle contractions as words and activation sequences as sentences. Based on this perspective, we design a Neuromuscular Contraction Tokenizer (NCT) that generates semantically consistent EMG sentences from raw signals. Building on this, we propose the first large-scale pre-training framework for EMG—Any Electromyography (AEMG), a general EMG representation learning framework based on self-supervised pre-training. Furthermore, we construct the largest cross-device EMG vocabulary to date, which supports seamless transfer across arbitrary channel topologies and sampling rates. Extensive experiments demonstrate that AEMG outperforms state-of-the-art baselines by 5.79–9.25% in zero-shot leave-one-subject-out accuracy, and achieves over 90% few-shot adaptation performance with only 5% of the target user’s data. Our work proposes treating electromyography signals as a cross-device physiological language, learns their grammar from massive amounts of data, and lays the groundwork for a single-training, universally applicable EMG foundation model.

View full details

Poster

NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation

Yi He ⋅ Tao Wang ⋅ Yi Jin ⋅ Congyan Lang ⋅ Yidong Li ⋅ Haibin Ling

Jun 7, 3:30 PM - 5:30 PM ExHall A 547

Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS and LERF-OVS benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU.

View full details

Poster

Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements

Genki Kinoshita ⋅ Shu Nakamura ⋅ Ryo Kawahara ⋅ Shohei Nobuhara ⋅ Yasutomo Kawanishi ⋅ Ko Nishino

Jun 6, 11:45 AM - 1:45 PM ExHall F 550

Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms which capture the atomic joint movements and Action Motifs which are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with a unified pretext task of masked token prediction in their respective latent spaces. We also introduce Action Motif Dataset (AMD), a large-scale dataset of multiview human behavior videos with full SMPL annotations. We introduce a novel use of cameras by mounting them on the feet to achieve their frame-wise annotations despite frequent and heavy body occlusions. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Atoms and Action Motifs that significantly benefit human behavior modeling tasks including action recognition, motion prediction and synthesis.

View full details

Poster

When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks

Minhyeok Lee

Jun 7, 11:45 AM - 1:45 PM ExHall F 552

Neural networks are often treated as monolithic black boxes that process all inputs uniformly through all layers. However, researchers intuitively wonder: do simple images require all 50 layers of ResNet-50, or is the prediction effectively decided much earlier? We investigate when pretrained models make up their minds during a forward pass by training linear probes at each layer of ResNet variants on ImageNet, without modifying the base model. Our findings reveal substantial computational heterogeneity across architectures: ResNet-50 and ResNet-101 exhibit mean decision depths of 5.5--5.6 layers (k=2 stability), while ResNet-18 requires deeper relative processing at 7.4 layers. We discover pronounced bimodal patterns with distinct populations of early and late deciders, where 39--43\% of samples in deeper ResNets achieve stability within the first third of the network, while 39--54\% require processing beyond 70\% depth. The decision layer is highly sensitive to stability criteria, with mean depths increasing from 2.6--4.1 (k=1) to 9.0--10.0 (k=4). Linear probe accuracy exhibits sharp jumps in final residual stages, reaching 73--75\% for ResNet-50/101 and 65\% for ResNet-18, indicating that semantic consolidation occurs late. These findings expose computational heterogeneity in standard inference and provide actionable guidance for early exit strategies.
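
The notion of a decision depth under a k-stability criterion can be sketched as below. The abstract does not spell out the definition, so this assumes one plausible reading: the decision layer is the first layer at which the probe prediction matches the final-layer prediction and keeps matching for `k` consecutive layers. The function name and the 1-indexed return value are illustrative assumptions.

```python
def decision_depth(layer_preds, k=2):
    """First 1-indexed layer whose probe prediction equals the final prediction
    and stays equal for k consecutive layers.

    `layer_preds` holds one predicted class index per probed layer.
    Falls back to the last layer when no such run exists.
    """
    final = layer_preds[-1]
    n = len(layer_preds)
    for i in range(n - k + 1):
        if all(p == final for p in layer_preds[i:i + k]):
            return i + 1
    return n
```

Under this reading, a stricter `k` can only push the decision layer later, never earlier, which is consistent with the reported increase in mean depth from k=1 to k=4.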

View full details

Poster

GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Jing Wang ⋅ Jiajun Liang ⋅ Jie Liu ⋅ Henglin Liu ⋅ Gongye Liu ⋅ Jun Zheng ⋅ Wanyuan Pang ⋅ Ao Ma ⋅ Zhenyu Xie ⋅ Xintao Wang ⋅ Meng Wang ⋅ Pengfei Wan ⋅ Xiaodan Liang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 555

Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution—its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an **implicit over-optimization stage**—while the proxy reward continues to increase, essential metrics such as image quality and text–prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce **GRPO-Guard**, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality. We provide detailed demonstrations of the over-optimization process and corresponding visualizations in **Supplementary Materials. 5**.
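
The clipping failure described above follows from the shape of the standard PPO-style clipped surrogate, sketched here for a single sample. Only the well-known clipped objective is shown. The ratio normalization and gradient reweighting of GRPO-Guard are not reproduced, and the helper `is_clipped` is a hypothetical name used for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def is_clipped(ratio, advantage, eps=0.2):
    """True when the clip is active, so the sample stops contributing gradient."""
    if advantage > 0:
        return ratio > 1.0 + eps
    if advantage < 0:
        return ratio < 1.0 - eps
    return False
```

For a positive-advantage sample the clip engages only when the ratio exceeds `1 + eps`. If the ratio distribution is shifted below 1, such samples rarely reach that bound, so their updates are never constrained, which matches the failure mode the abstract reports.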

View full details

Poster

Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge

Minhyeok Lee

Jun 7, 11:45 AM - 1:45 PM ExHall F 554

Modern computer vision models from different architecture families--CNNs, Vision Transformers, and MLP-Mixers--achieve remarkably similar aggregate performance on standard benchmarks, masking potential systematic differences in how they process visual information. We introduce a simple yet revealing framework to identify where architectural inductive biases truly matter: by systematically mapping controversial images where pretrained models strongly disagree versus consensus images where all models agree. Analyzing 12 pretrained models spanning three architecture families on ImageNet validation set, we discover that controversial images exhibit approximately 4.5$\times$ higher disagreement than consensus images (Controversy Score: 4.46). Despite mean accuracy around 80\%, models show structured disagreement patterns: within-family agreement exceeds cross-family agreement, with CNNs and ViTs forming distinct clusters while MLPs show lower overall alignment. Crucially, only the top 10\% most controversial images drive the majority of architectural divergence, constituting a small but informationally dense subset that reveals fundamental differences masked by aggregate metrics. Our analysis demonstrates that architectural choice matters most on this concentrated controversy space, providing researchers with actionable guidance for model selection and ensemble construction.

View full details

Poster

Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

Vanessa Emanuela Guarino ⋅ Claudia Winklmayr ⋅ Jannik Franzen ⋅ Josef Rumberger ⋅ Manuel Pfeuffer ⋅ Sonja Greven ⋅ Klaus Maier-Hein ⋅ Dagmar Kainmueller ⋅ Christoph Karg ⋅ Carsten T. Lüth

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 555

Uncertainty quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. UQ generates pixel-wise uncertainty maps that must be aggregated into scalar scores for downstream tasks like OoD- or failure-detection. Despite widespread use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of uncertainty estimates. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, performance of individual aggregators is highly dependent on dataset characteristics, thus we propose a meta aggregator that integrates multiple aggregators and shows robust performance across datasets. To foster reproducibility, we release an open-source Python package for benchmarking uncertainty aggregation methods.
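
The difference between a global average and a spatially aware aggregator can be illustrated with a small sketch. It compares the global mean of a pixel-wise uncertainty map with the maximum over non-overlapping patch means. The patch-based rule shown is a generic example of the patch-based family mentioned above, not one of the aggregators proposed in the paper or the API of the released package. It assumes non-negative uncertainties and a patch size no larger than the map.

```python
def global_average(umap):
    """Mean uncertainty over all pixels of a 2D list."""
    values = [v for row in umap for v in row]
    return sum(values) / len(values)

def max_patch_average(umap, patch):
    """Largest mean over non-overlapping patch x patch tiles.

    Rows and columns that do not fill a complete tile are ignored.
    Assumes non-negative uncertainty values.
    """
    height, width = len(umap), len(umap[0])
    best = 0.0
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tile = [umap[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            best = max(best, sum(tile) / len(tile))
    return best
```

On a 4×4 map that is zero everywhere except a fully uncertain 2×2 corner, the global average is `0.25` while the patch-based score is `1.0`, showing how averaging can dilute a small but confidently flagged region.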

Poster

Deep Feature Deformation Weights

Richard Liu ⋅ Itai Lang ⋅ Rana Hanocka

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 555

Handle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by the choice of control handles, requiring a user to know a priori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage the data prior to obtain semantic edits, at the cost of fine-grained control and speed. We propose a technique that achieves the best of both worlds by leveraging the semantic prior of data and the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proximity makes for smooth and semantic deformation weights, with no need for additional regularization. Importantly, these weights can be computed in real-time for any surface point, whereas all prior methods require optimization of these weights. Moreover, the semantic prior from deep features enables co-deformation of semantic parts. We introduce an improved feature distillation pipeline, barycentric feature distillation, which leverages the full visual signal from shape renders to make the compute cost robust to mesh resolution. This allows deep feature weights to be computed for even high resolution meshes in under a minute, in contrast to potentially hours for both classical and neural methods. We preserve and extend existing functionality of classical methods through feature space constraints and locality weighting. Our field representation allows for automatic detection of semantic symmetries, which we use to produce symmetry-preserving deformations. We show a proof-of-concept application which can produce deformations for meshes up to 1 million faces in real-time on a consumer-grade machine.
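
The idea that feature proximity can act directly as deformation weights might be sketched as below. The abstract does not give the weight formula, so the softmax over cosine similarity, the `temperature` parameter, and the translation-only `deform` helper are all assumptions made for illustration.

```python
import numpy as np

def feature_proximity_weights(point_feats, handle_feats, temperature=0.1):
    """Weights from feature similarity (assumed form, not the paper's).

    point_feats: (N, D) per-point deep features.
    handle_feats: (H, D) features at the control handles.
    Returns an (N, H) matrix whose rows sum to 1.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    h = handle_feats / np.linalg.norm(handle_feats, axis=1, keepdims=True)
    logits = (p @ h.T) / temperature              # cosine similarity, sharpened
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def deform(vertices, weights, handle_translations):
    """Blend per-handle translations (H, 3) into per-vertex offsets."""
    return vertices + weights @ handle_translations
```

Because the weights are a closed-form function of the features, they can be evaluated for any surface point without a per-handle optimization, which matches the real-time claim in spirit.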

Poster

Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression

Liangzu Peng ⋅ Aditya Chattopadhyay ⋅ Luca Zancato ⋅ Elvis Nunez ⋅ Wei Xia ⋅ Stefano Soatto

Jun 6, 11:45 AM - 1:45 PM ExHall F 557

As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented settings. We propose Gated KalmaNet, a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. Gated KalmaNet achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize on modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention; and (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, Gated KalmaNet shows strong language understanding capabilities on short-context tasks, outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context benchmarks, Gated KalmaNet excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than $10$\% relative improvement over other fading memory baselines.
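
A minimal sketch of the two ingredients named above: ridge regression maintained online with constant memory, and a Chebyshev iteration solver. It assumes scalar targets, a fixed regularizer `lam`, and no gating or fading, so it is far simpler than the layer described; `OnlineRidge` and `chebyshev_solve` are hypothetical names.

```python
import numpy as np

def chebyshev_solve(A, b, lam_min, lam_max, iters=100):
    """Chebyshev iteration for a symmetric positive-definite system A x = b.

    Requires the eigenvalues of A to lie inside [lam_min, lam_max].
    """
    d = (lam_max + lam_min) / 2.0
    c = (lam_max - lam_min) / 2.0
    x = np.zeros_like(b)
    r = b - A @ x
    p = np.zeros_like(b)
    alpha = 0.0
    for i in range(iters):
        if i == 0:
            p = r.copy()
            alpha = 1.0 / d
        else:
            beta = 0.5 * (c * alpha) ** 2 if i == 1 else (c * alpha / 2.0) ** 2
            alpha = 1.0 / (d - beta / alpha)
            p = r + beta * p
        x = x + alpha * p
        r = b - A @ x
    return x

class OnlineRidge:
    """Keeps only the sufficient statistics, so memory is O(dim^2)."""

    def __init__(self, dim, lam=1.0):
        self.lam = lam
        self.H = np.zeros((dim, dim))  # running sum of k k^T
        self.g = np.zeros(dim)         # running sum of v * k

    def update(self, k, v):
        self.H += np.outer(k, k)
        self.g += v * k

    def solve(self, iters=100):
        A = self.H + self.lam * np.eye(len(self.g))
        # H is positive semi-definite, so eig(A) lies in [lam, lam + trace(H)].
        return chebyshev_solve(A, self.g, self.lam, self.lam + np.trace(self.H), iters)
```

The regularizer sets the lower eigenvalue bound, which is one way to see why controlling it also controls the condition number and hence the solver's convergence rate.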

Poster

Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

Dailan He ⋅ Guanlin Feng ⋅ Xingtong Ge ⋅ Yazhe Niu ⋅ Yi Zhang ⋅ Bingqi Ma ⋅ Guanglu Song ⋅ Yu Liu ⋅ Hongsheng Li

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 559

Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
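
One way to read "softmax distance-based surrogate policy" is sketched below. The abstract does not specify the objective, so the squared-distance logits, the `temperature`, and the way advantages weight the log-probabilities are assumptions; only the group-relative normalization of rewards follows the standard GRPO recipe.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Standard GRPO normalization: center and scale rewards within a group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def log_softmax_distance_policy(current, candidates, temperature=1.0):
    """Log-probabilities over candidates from negative squared distances.

    current: (D,) output for the unperturbed initial noise.
    candidates: (G, D) outputs for perturbed initial noises (assumed setup).
    """
    d2 = ((candidates - current) ** 2).sum(axis=1)
    logits = -d2 / temperature
    logits = logits - logits.max()
    return logits - np.log(np.exp(logits).sum())

def surrogate_objective(current, candidates, rewards, temperature=1.0):
    """Advantage-weighted log-likelihood; larger is better."""
    log_p = log_softmax_distance_policy(current, candidates, temperature)
    return float((group_relative_advantages(rewards) * log_p).sum())
```

Maximizing this objective pulls the current sample toward high-reward neighbors and away from low-reward ones, which is the contrastive reading the abstract describes, without injecting noise along the ODE trajectory.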

Poster

See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions

Jiayi Wang ⋅ Zhihong Tan ⋅ Hongchen Wei ⋅ Daiqing Yang ⋅ Zhenzhong Chen

Jun 7, 3:30 PM - 5:30 PM ExHall A 559

Object counting in remote sensing imagery becomes challenging when visual cues are obscured by clouds, fog, shadows, or low-light conditions. Yet earth observation inherently provides complementary geo-modalities, including land-use and map data, which offer stable structural and contextual priors that remain available when appearance cues fail. In this paper, we introduce \textbf{GROC}, the first large-scale dataset for \textbf{G}eo-guided \textbf{R}easoning in \textbf{O}bject \textbf{C}ounting under adverse earth observation conditions. GROC contains 1.2 million point annotations over 14K images, each aligned with 3 modalities that preserve original geospatial information. We also provide a data engine to collect a large-scale object counting dataset with multiple geo-modalities, realistic degradations, and reliable annotations. We further present a counting agent that adaptively leverages geo-modalities to produce reliable estimates. Extensive experiments show that existing models struggle to “see” through adverse conditions, whereas geo-modalities improve robustness. GROC establishes the first benchmark that explicitly challenges models to \textbf{see what they cannot see}, charting a new direction for geo-guided amodal reasoning in earth observation.

Poster

SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

Zepeng Xin ⋅ Kaiyu Li ⋅ Luodi Chen ⋅ Wanchen Li ⋅ Xiao Yuchen ⋅ Hui Qiao ⋅ Weizhan Zhang ⋅ Deyu Meng ⋅ Xiangyong Cao

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 560

Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism that specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released.

Poster

Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening

Junfeng Li ⋅ Wenyang Zhou ⋅ Xueheng Li ⋅ Xuanhua He ⋅ Jianhou Gan ⋅ Wenqi Ren

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 563

In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a KV-sharing RWKV architecture for efficient global modeling and coupled with a novel tri-token prompting mechanism derived from semantic clustering to steer the fusion process. The design adheres to the following principles: 1) Multigrain-aware Semantic Prototype Scanning. While the RWKV model offers an efficient linear alternative, its recurrent scanning mechanism often introduces positional bias and lacks semantic guidance. To address this, we introduce a semantic-driven scanning strategy. Local hashing is first employed to generate semantic prototypes via clustering, segmenting the image into coherent regions. Our scanning mechanism is then explicitly aware of multi-grain semantic structures, allowing the model to focus on contextually relevant regions during fusion, thereby enhancing spectral integrity and spatial coherence beyond sequence-agnostic approaches. 2) Tri-token Prompt Learning. The core of our framework is a tri-token prompting mechanism: (i) a globally-sourced token to encapsulate the holistic image context, (ii) cluster-derived prototype tokens to represent distinct semantic regions, and (iii) a learnable token register that acts as a dynamic buffer to explicitly identify and eliminate the noisy feature artifacts that commonly arise from standard global modeling. The global and prototype tokens are broadcast as semantic prompts to guide RWKV's processing, while the register continuously refines the intermediate features. 3) Invertible Q-Shift. To counteract the loss of spatial detail, we tailor two key designs: applying a center difference convolution on the value pathway within the RWKV block, which actively injects high-frequency information to preserve fine textures, and moving beyond parameter-heavy receptive field expansion via an invertible-neural-network-empowered multi-scale Q-shift operation. This module performs efficient, lossless feature transformation and shifting across split channels, significantly enriching feature representation. Experimental results demonstrate the superiority of our method.

Poster

MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

Runxun Zhang ⋅ Yizhou Liu ⋅ Dongrui Li ⋅ Bo XU ⋅ Jingwei Wei

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 563

Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR–CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.

Poster

Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection

Yongchao Xu ⋅ Jiawei Liu ⋅ Junfeng Wang ⋅ Sen Tao ⋅ Na Jiang ⋅ Zheng-Jun Zha

Jun 7, 11:45 AM - 1:45 PM ExHall F 564

Open-Vocabulary Human–Object Interaction (OV-HOI) detection aims to recognize novel HOI categories beyond the training set. Existing OV-HOI detection approaches typically leverage CLIP to extract global visual representations and perform cross-attention between learnable queries and global features to localize human–object pairs. However, such one-stage paradigms tend to overfit seen interactions, limiting their generalization to unseen categories, while the coarse spatial awareness of CLIP also hinders the localization of fine-grained interaction cues. To address these issues, we propose a novel Semantic-Diversified and Interaction-Focused framework (SD-IF), which integrates reinforcement-guided adaptive optimization to jointly enhance semantic generalization and spatial discrimination. Specifically, we introduce a Semantic Diversification (SD) module that applies reinforcement-driven stochastic semantic perturbations and dual-level semantic exploration, expanding the semantic coverage of queries while maintaining visual coherence and effectively encouraging exploration beyond the seen semantic clusters. Furthermore, we design an Interaction Focusing (IF) module that formulates an actor–critic optimization scheme to adaptively refine attention distributions based on detection features and interaction representations, guided by a hybrid reward combining spatial focusing and semantic consistency. This cooperative learning paradigm enables the model to capture discriminative interaction cues and achieve spatially interpretable reasoning. Extensive experiments on two widely used benchmarks demonstrate that SD-IF achieves state-of-the-art performance, significantly surpassing existing OV-HOI detection methods.

Poster

Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

Houston H. Zhang ⋅ TAO ZHANG ⋅ Baoze Lin ⋅ Yuanqi Xue ⋅ Yincheng Zhu ⋅ Huan Liu ⋅ Li Gu ⋅ Linfeng Ye ⋅ Ziqiang Wang ⋅ Xinxin Zuo ⋅ Yang Wang ⋅ YUANHAO YU ⋅ Zhixiang Chi

Jun 6, 11:45 AM - 1:45 PM ExHall F 565

User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and UI composition modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.

Poster

RINO: Rotation-Invariant Non-Rigid Correspondences

Maolin Gao ⋅ Shao Jie Hu-Chen ⋅ Congyue Deng ⋅ Riccardo Marin ⋅ Leonidas Guibas ⋅ Daniel Cremers

Jun 7, 11:45 AM - 1:45 PM ExHall F 565

Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.

Poster

DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors

Mengyang Li ⋅ Pinlong Zhao

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 568

The efficiency of hyperparameter optimization (HPO) is critical for deep learning, yet state-of-the-art methods share a fundamental flaw: they are difficulty-agnostic, treating all hyperparameter configurations homogeneously. This approach leads to inefficient resource allocation, wasting budget in simple regions while under-exploring complex, rugged landscapes, and thereby critically undermining both search efficiency and final performance. To address this universal challenge, we introduce DABO, a framework that pioneers difficulty-aware tuning within the efficient context of Freeze-Thaw Bayesian Optimization. We first model optimization difficulty hierarchically. Then, departing from hand-crafted priors, we train a conditional diffusion model on 120,000 real learning curves, generating synthetic data with 2.3$\times$ higher fidelity. This data trains our difficulty-aware surrogate model and acquisition function to dynamically adapt the search strategy. Across 75 tasks, DABO reduces regret by 11-18\% compared to the leading difficulty-agnostic method, ifBO. Our work establishes a new paradigm for HPO, shifting the focus from configuration-centric to difficulty-aware resource allocation to enable more robust and efficient optimization.

Poster

Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision

Wang Ma ⋅ Hanjing Wang ⋅ Yufei Zhang ⋅ Darsha Udayanga ⋅ Qiang Ji

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 569

Bayesian deep learning (BDL) integrates Bayesian inference with deep learning, improving predictive performance while enabling principled uncertainty quantification. However, existing BDLs often rely on non-informative random priors, limiting the benefits of Bayesian inference. In contrast, knowledge-augmented deep learning explicitly injects domain knowledge during training, yet lacks a probabilistic foundation. In this paper, we propose a knowledge-augmented BDL framework that integrates domain knowledge both as an informative prior and as an adaptive likelihood under a unified two-stage hybrid formulation. In the first stage, we learn a knowledge-informed prior $p(\theta \mid \mathcal{K})$ by pre-training a model to satisfy domain-specific constraints. In the second stage, we perform Bayesian inference on task data with an adaptive knowledge likelihood $p(\mathcal{K} \mid \theta, \mathcal{D})$, which dynamically enforces these constraints during optimization. This unified framework enables knowledge to guide both initialization and training, significantly improving prediction accuracy, robustness, adaptation and uncertainty estimation. Experiments on various computer vision tasks, including semi-synthetic and real-knowledge scenarios, demonstrate that our two-stage framework consistently outperforms state-of-the-art Bayesian and knowledge-augmented baselines.

Poster

PGA: Prior-free Generative Attack for Practical No-box Scenario

hongyu peng ⋅ Xiang Yuan ⋅ Gong Cheng

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 569

The unrealistic reliance on abundant prior information in traditional transferable attacks has spurred the Practical No-box Scenario (PNS), where attackers can access only limited unlabeled images. However, existing methods rely on iterative optimization to produce adversarial examples with inherently limited inference speed and transferability. Conversely, faster generative attacks fundamentally conflict with the PNS due to their critical dependence on abundant prior information that is explicitly absent in this scenario. To bridge this gap, we propose Prior-free Generative Attack (PGA), the first generative attack tailored for the PNS. Specifically, we introduce the Curriculum-Guided Micro-Robust Optimization that progressively incorporates more challenging discriminative tasks to mitigate the degenerate solutions common in self-supervised learning with limited data, yielding robust and transferable surrogates for downstream attacks. Furthermore, the Region-Aware Consistent Perturbation Learning guides the generator to produce fine-grained and spatially coherent perturbations, mitigating the common pitfall of generative attacks falling into local optima under insufficient supervision. Extensive experiments demonstrate that our PGA achieves remarkable transferability across various settings with high inference speed. This work provides a more practical benchmark for future research on transferable attacks, revealing the great potential of generative attacks under the PNS.

Poster

Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation

Yongchan Chun ⋅ Chanhee Park ⋅ Jeongho Yoon ⋅ Jaehyung Seo ⋅ Heuiseok Lim

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 571

Pretrained models have become standard in both vision and language, yet they typically do not provide reliable measures of confidence. Existing uncertainty estimation methods, such as deep ensembles and MC dropout, are often too computationally expensive to deploy in practice. Evidential Deep Learning (EDL) offers a more efficient alternative, but it requires models to be trained to output evidential quantities from the start, which is rarely true for pretrained networks. To enable EDL-style uncertainty estimation in pretrained models, we propose the Evidential Transformation Network (ETN), a lightweight post-hoc module that converts a pretrained predictor into an evidential model. ETN operates in logit space: it learns a sample-dependent affine transformation of the logits and interprets the transformed outputs as parameters of a Dirichlet distribution for uncertainty estimation. We evaluate ETN on image classification and large language model question-answering benchmarks, under both in-distribution and out-of-distribution settings. ETN consistently improves uncertainty estimation over post-hoc baselines, while preserving accuracy and adding only minimal computational overhead.
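
The logit-space transformation can be illustrated with a short sketch. In ETN the affine parameters would be predicted per sample by a learned module; here they are passed in as arrays. The mapping `alpha = softplus(z) + 1` and the vacuity measure `K / sum(alpha)` follow a common EDL convention and are assumptions about how the transformed logits are interpreted, not details confirmed by the abstract.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def evidential_from_logits(logits, scale, shift):
    """Turn pretrained logits into Dirichlet parameters (assumed mapping).

    logits: (N, K) outputs of the frozen pretrained model.
    scale:  (N, 1) positive per-sample scale.
    shift:  (N, 1) per-sample shift, shared across classes.
    """
    z = scale * logits + shift                    # sample-dependent affine map
    alpha = softplus(z) + 1.0                     # Dirichlet concentrations > 1
    strength = alpha.sum(axis=1, keepdims=True)
    probs = alpha / strength                      # expected class probabilities
    vacuity = logits.shape[1] / strength[:, 0]    # K / S, in (0, 1)
    return alpha, probs, vacuity
```

With a positive scale and a shift shared across classes, the transformation is monotone in each logit, so the predicted class is unchanged; that is one simple way a post-hoc module can preserve accuracy while still adjusting the reported uncertainty.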

Poster

Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack

Wenwen He ⋅ Wenke Huang ⋅ Yiyang Fang ⋅ Wenjie Qu ⋅ Jiaheng Zhang ⋅ Mang Ye

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 571

Federated Learning (FL), a distributed learning paradigm that enables local training on user-held data across decentralized devices, is vulnerable to backdoor attacks due to limited visibility into client updates. Exploiting this opacity, adversaries induce targeted misbehavior on trigger inputs without affecting overall performance, thereby compromising the trust and integrity of collaborative training in federated learning systems. Existing federated backdoor attacks mainly pursue benign knowledge alignment through trigger-surface design or representation guidance to evade defense mechanisms. However, trigger-surface attacks suffer from insufficient alignment, leaving malicious knowledge distinguishable from benign updates. In contrast, representation-guided attacks attempt to obscure the boundary between benign and malicious behaviors. Nevertheless, excessive incorporation of benign knowledge within a shared parameter space leads to over-alignment, ultimately degrading attack effectiveness. To overcome the shared parameter space dilemma in backdoor attacks, we propose Batman, a novel backdoor attack that aligns benign knowledge within the malicious null space, which effectively decouples the malicious space from the shared parameter space and enables benign alignment in an orthogonal direction that does not interfere with attack effectiveness. To further enhance stealthiness, we combine both clean and global models to guide the alignment perturbation within this null space to evade detection. Experiments on four benchmark datasets demonstrate that Batman consistently achieves strong backdoor performance while remaining stealthy under various defenses.

Poster

Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation

Yunbei Zhang ⋅ Chengyi Cai ⋅ Feng Liu ⋅ Jihun Hamm

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 573

Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8\% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5\% for VLMs, +15.6\% for standard VMs) while reducing API calls by over 99.99\%. AReS thus provides a robust and practical solution for adapting modern closed-box models.

Poster

D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation

Shengzhe Chen ⋅ Hao Yan

Jun 7, 11:45 AM - 1:45 PM ExHall F 572

Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on quasi-concavity of the network output mask function $u$. Instead of constraining a single binary segmentation, we require all super-level sets of $u$ to be convex, transforming global shape constraints into local, differentiable inequalities on $u$ and its derivatives. From this principle we derive zeroth-, first-, and second-order characterizations, yielding respectively a local midpoint convexification operator, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed by a quadratic form on the tangent plane. The first- and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing prior shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1–0–1 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors, under one continuous, differentiable framework.
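
For reference, the standard definition of quasi-concavity that links the mask function to convex super-level sets can be written as follows. The domain symbol $\Omega$ and the level $c$ are notation introduced here; the paper's own first- and second-order conditions are not reproduced because the abstract does not state them.

```latex
% Quasi-concavity of u on a convex domain \Omega:
u\bigl(t x + (1-t) y\bigr) \;\ge\; \min\{u(x),\, u(y)\}
\quad \forall\, x, y \in \Omega,\; t \in [0,1]
\;\Longleftrightarrow\;
\{x \in \Omega : u(x) \ge c\} \ \text{is convex for every } c \in \mathbb{R}.

% Midpoint special case (t = 1/2):
u\!\left(\tfrac{x+y}{2}\right) \;\ge\; \min\{u(x),\, u(y)\}.
```

Because the condition holds for every level $c$ at once, no single threshold has to be chosen, which is what makes the prior threshold-free.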

Poster

Batch Loss Score for Dynamic Data Pruning

Qing Zhou ⋅ Bingxuan Zhao ⋅ Tao Yang ⋅ Hongyuan Zhang ⋅ Junyu Gao ⋅ Qi Wang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 574

Dynamic data pruning accelerates deep learning by selectively omitting less informative samples during training. While per-sample loss is a common importance metric, obtaining it can be challenging or infeasible for complex models or loss functions, often requiring significant implementation effort. This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. We frame the batch loss, from the perspective of a single sample, as a noisy measurement of its scaled individual loss, with noise originating from stochastic batch composition. It is formally shown that the EMA mechanism functions as a first-order low-pass filter, attenuating high-frequency batch composition noise. This yields a score approximating the smoothed and persistent contribution of the individual sample to the loss, providing a theoretical grounding for BLS as a proxy for sample importance. BLS demonstrates remarkable code integration simplicity (\textbf{three-line injection}) and readily adapts existing per-sample loss-based methods (\textbf{one-line proxy}). Its effectiveness is demonstrated by enhancing two such methods to losslessly prune \textbf{20%-50%} of samples across \textit{14 datasets}, \textit{11 tasks} and \textit{18 models}, highlighting its utility and broad applicability, especially for complex scenarios where per-sample loss is difficult to access.
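
The EMA update at the heart of BLS is small enough to sketch directly. The class name `BatchLossScore`, the `momentum` default, and the zero initialization are assumptions; the abstract only states that an EMA of batch losses is assigned to the samples in each batch.

```python
import numpy as np

class BatchLossScore:
    """Per-sample score as an EMA of the losses of batches containing it."""

    def __init__(self, num_samples, momentum=0.9):
        self.scores = np.zeros(num_samples)  # assumed zero initialization
        self.momentum = momentum

    def update(self, batch_indices, batch_loss):
        # Every sample in the batch receives the same (scalar) batch loss;
        # the EMA smooths out noise from random batch composition.
        idx = np.asarray(batch_indices)
        self.scores[idx] = (self.momentum * self.scores[idx]
                            + (1.0 - self.momentum) * batch_loss)

# Usage inside a hypothetical training loop:
bls = BatchLossScore(num_samples=4, momentum=0.5)
bls.update([0, 1], batch_loss=2.0)
bls.update([1, 2], batch_loss=4.0)
```

Only the scalar batch loss and the batch's sample indices are needed, so the score works even when per-sample losses are not exposed by the model or loss function.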

Poster

Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

Dong Zhao ⋅ Qi Zang ⋅ Nan Pu ⋅ Wenjing Li ⋅ Nicu Sebe ⋅ Zhun Zhong

Jun 6, 11:45 AM - 1:45 PM ExHall F 574

Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text–image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S$^2$-Corr, a state-space-driven text–image correlation refinement mechanism that can mitigate domain-induced distortions and produce a more consistent text–image correlation under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.

Poster

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Tilemachos Aravanis ⋅ Vladan Stojnić ⋅ Vasileios Psomas ⋅ Nikos Komodakis ⋅ Giorgos Tolias

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 578

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision–language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and adapts seamlessly to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

Poster

Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection

Xu Wang ⋅ Zihan Lin ⋅ Yixin Zhang ⋅ Zilei Wang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 580

Incremental Object Detection (IOD) aims to equip detectors with the ability to handle dynamic environments and emerging object categories, and the rise of vision-language models has substantially advanced this goal. However, existing studies often oversimplify real-world scenarios by assuming the incremental tasks come from a single general domain. To better investigate vision-language models under IOD, it is necessary to explore more generalized scenarios that encompass both novel categories and domains. To this end, we propose Cross-Domain Incremental Object Detection (CDIOD), a new benchmark that assesses the ability to continuously adapt to diverse object detection tasks across domains. CDIOD reveals that existing methods struggle to balance between adaptivity and stability under substantial domain shifts. To tackle this challenge, we propose Dynamic Group Subspace (DGS), a novel framework that dynamically groups tasks by distribution to promote knowledge sharing and prevent task collisions; progressively consolidates adapters to build shared subspaces and control parameter growth; and implements a dynamic training pipeline to maintain a proper stability-adaptivity balance. DGS enables vision-language models to effectively handle task streams of various distribution shifts. Extensive experiments across three benchmarks demonstrate that DGS achieves state-of-the-art performance, highlighting its robustness in diverse incremental learning scenarios.

Poster

GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

Aoran Xiao ⋅ Shihao Cheng ⋅ Yonghao Xu ⋅ Yexian Ren ⋅ Hongruixuan Chen ⋅ Naoto Yokoya

Jun 7, 11:45 AM - 1:45 PM ExHall F 580

Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models (LLMs), uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning—capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.

Poster

Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels

Junda Xu ⋅ Yanmeng Liu ⋅ Xiangqiang Zeng ⋅ Jinrong Wu ⋅ Ying Qu ⋅ Libao Zhang

Jun 7, 11:45 AM - 1:45 PM ExHall F 581

Google Earth imagery, combined with building footprint databases, offers an efficient way to construct localized building datasets. However, the lack of orthorectification in these images leads to spatial misalignments between annotations and their corresponding roof locations. Adopting such misaligned data directly for model training can severely degrade segmentation performance. To address this challenge, we propose an Object-based Multi-stage Alignment Framework (OMAF) that generates high-quality corrected labels with minimal manual intervention. OMAF first employs a prior-regularized self-alignment method to produce high-confidence, object-level offset pseudo-labels, which are then used to train an instance-level offset regression model for label refinement. Experimental results on the challenging Islahiye and Antakya datasets demonstrate that OMAF effectively corrects misalignments and consistently boosts the mIoU of all baseline models by up to $40.6\%$. Ablation experiments also demonstrate that each module in OMAF improves the final alignment performance. Among them, the self-alignment algorithm contributes $9.22\%$ to the mIoU metric, demonstrating the strong effectiveness of this unsupervised alignment method. This work provides a practical and cost-effective solution for large-scale dataset construction and domain adaptation.

Poster

BiPreManip: Learning Affordance-Based Bimanual Pre-Manipulation through Anticipatory Collaboration

Yan Shen ⋅ Feng Jiang ⋅ Zichen He ⋅ Xiaoqi Li ⋅ Yuchen Liu ⋅ Zhiyu Li ⋅ Ruihai Wu ⋅ Hao Dong

Jun 7, 3:30 PM - 5:30 PM ExHall A 581

Many everyday objects are difficult to directly grasp (e.g., a flat iPad) or manipulate functionally (e.g., opening the cap of a pen lying on a desk). Such tasks require sequential, asymmetric coordination between two arms, where one arm performs preparatory manipulation that enables the other’s goal-directed action—for instance, pushing the iPad to the table’s edge before picking it up, or lifting the pen body to allow the other hand to remove its cap. In this work, we introduce Collaborative Preparatory Manipulation, a class of bimanual manipulation tasks that demand understanding object semantics and geometry, anticipating spatial relationships, and planning long-horizon coordinated actions between the two arms. To tackle this challenge, we propose a visual affordance-based framework that first envisions the final goal-directed action and then guides one arm to perform a sequence of preparatory manipulations that facilitate the other arm’s subsequent operation. This affordance-centric representation enables anticipatory inter-arm reasoning and coordination, generalizing effectively across various objects spanning diverse categories. Extensive experiments in both simulation and the real world demonstrate that our approach substantially improves task success rates and generalization compared to competitive baselines.

Poster

ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

Ruixun Liu ⋅ Bowen Fu ⋅ Jiayi Song ⋅ Kaiyu Li ⋅ Wanchen Li ⋅ Lanxuan Xue ⋅ Hui Qiao ⋅ Weizhan Zhang ⋅ Deyu Meng ⋅ Xiangyong Cao

Jun 7, 11:45 AM - 1:45 PM ExHall F 583

Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping–zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.
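
The group-relative part of GRPO mentioned above can be illustrated with a short sketch. It shows only the usual advantage normalisation within one group of sampled responses; the reward values are placeholders, since the abstract does not specify how the Region-Guided reward is computed, and the function name is hypothetical.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise the rewards of one sampled group to zero mean and unit scale."""
    r = np.asarray(rewards, dtype=float)
    # Each response is scored relative to the other responses in its group.
    return (r - r.mean()) / (r.std() + eps)


# Placeholder rewards for four responses sampled for the same query.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.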

Poster

SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

Chong Xia ⋅ Kai Zhu ⋅ Zizhuo Wang ⋅ Fangfu Liu ⋅ Zhizheng Zhang ⋅ Yueqi Duan

Jun 7, 3:30 PM - 5:30 PM ExHall A 583

Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which makes it natively applicable to simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline for cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. Specifically, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.

Poster

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Lei Xiao ⋅ Jifeng Li ⋅ Juntao Gao ⋅ Feiyang Ye ⋅ Yan Jin ⋅ Jingjing Qian ⋅ Jing Zhang ⋅ Yong Wu ⋅ Xiaoyuan Yu

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 584

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). Such a history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights that actively process task-relevant visual tokens based on the historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
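
A rough sketch of what "soft weights computed from the recurrent state" could look like is given below. Everything in it is an assumption made for illustration: the projection matrix `W`, the scaled dot-product scoring, and the sigmoid gate are hypothetical choices, and the abstract does not say how the AVA module is actually parameterised.

```python
import numpy as np


def soft_token_weights(recurrent_state, visual_tokens, W):
    """Score each visual token against the recurrent (belief-like) state.

    recurrent_state: (d_h,) state carried over from the previous decision step
    visual_tokens:   (n, d_v) visual tokens of the current frame
    W:               (d_h, d_v) hypothetical learned projection
    """
    query = recurrent_state @ W                               # (d_v,)
    logits = visual_tokens @ query / np.sqrt(visual_tokens.shape[1])
    return 1.0 / (1.0 + np.exp(-logits))                      # sigmoid gate in (0, 1)


def modulate(visual_tokens, weights):
    """Scale every token by its soft weight before further processing."""
    return visual_tokens * weights[:, None]


tokens = np.ones((3, 4))
state = np.zeros(2)            # an uninformative state gives neutral weights of 0.5
weights = soft_token_weights(state, tokens, np.ones((2, 4)))
gated = modulate(tokens, weights)
```

The point of the sketch is only the data flow: the weights depend on history through the recurrent state, so the same frame can be weighted differently at different steps of a task.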

Poster

STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation

Hao Ren ⋅ Zetong Bi ⋅ Yiming Zeng ⋅ Zhaoliang Wan ⋅ Lu Qi ⋅ Hui Cheng

Jun 7, 3:30 PM - 5:30 PM ExHall A 584

Visual navigation requires the robot to reach a specified goal such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. The code will be released to the public.

Poster

RaUF: Learning the Spatial Uncertainty Field of Radar

Shengpeng Wang ⋅ Kuangyu Wang ⋅ Wei Wang

Jun 7, 3:30 PM - 5:30 PM ExHall A 585

Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios. Our dataset will be released to the community.

Poster

PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting

Stephen Price ⋅ Danielle L. Cote ⋅ Elke A. Rundensteiner

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 587

High-quality segmentations are critical in vision tasks where boundary accuracy is important (e.g., medical diagnostics and quality control). Recently, promptable vision models have emerged as effective backbones for segmentation refinement frameworks. However, their performance hinges not only on prompt quality; they must also overcome noisy input masks and semantically ambiguous outputs from promptable models. Existing prompt-based refiners rely on fixed prompt rules, making them brittle to changing failure modes and new tasks or domains. We propose PromptMoE, a model-agnostic MoE-driven prompting refiner effective in segmentation refinement across tasks and domains. PromptMoE features three collaborative modules to refine an initial mask: our MoE-based Image-Informed Prompting framework (IIP) takes an image and coarse mask and produces a set of expert score maps to guide prompt generation, the Dynamic Expert Selector (DES) activates only the most relevant experts and fuses their maps to avoid dense evaluation and signal dilution, and the Prompt-Placement Explorer (PPE) explores the fused guidance map to place high-confidence, spatially diverse point prompts. Across five benchmark datasets (BIG, VOC, DAVIS585, ECSSD, MSRA-B), PromptMoE achieves statistically significant gains over SOTA methods CascadePSP, SegRefiner, and SAMRefiner on semantic, instance, and salient tasks, with mean improvements of +6.24 IoU / +8.99 BIoU.

Poster

Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering

Sebin Lee ⋅ Jumin Lee ⋅ Taeyeon Kim ⋅ Youngju Na ⋅ Woobin Im ⋅ Sung-Eui Yoon

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 587

Rapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (1) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (2) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consistent gradient exploitation. Extensive experiments across various robot manipulators including Franka, UR5e, and Fetch demonstrate that vRRT achieves effective visual-goal planning in both simulated and real-world settings, bridging the gap between sampling-based planning and vision-centric robot applications. Our code will be released publicly.

Poster

Cross-Hand Latent Representation for Vision-Language-Action Models

Guangqi Jiang ⋅ Yutong Liang ⋅ Jianglong Ye ⋅ Jia-Yang Huang ⋅ Changwei Jing ⋅ Yan Duan ⋅ Pieter Abbeel ⋅ Xiaolong Wang ⋅ Xueyan Zou

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 588

Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception (vision, sound, and language-guided intent) to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that our method consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.

Poster

Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems

Jiahuan Long ⋅ Tingsong Jiang ⋅ Hanqing Liu ⋅ Chao Ma ⋅ Weien Zhou ⋅ Yang Yang ⋅ Wen Yao

Jun 7, 11:45 AM - 1:45 PM ExHall F 588

Adversarial patches have emerged as a popular privacy-preserving approach for resisting AI-driven surveillance systems. However, their conspicuous appearance makes them difficult to deploy in real-world scenarios. In this paper, we propose a thermally activated adversarial wearable designed to ensure adaptability and effectiveness in complex real-world environments. The system integrates thermochromic dyes with flexible heating units to induce visually dynamic adversarial patterns on clothing surfaces. In its default state, the clothing appears as an ordinary black T-shirt. Upon heating via an embedded thermal unit, hidden adversarial patterns on the fabric are activated, allowing the wearer to effectively evade detection across both visible and infrared modalities. Physical experiments demonstrate that the adversarial wearable achieves rapid texture activation within 50 seconds and maintains an adversarial success rate above 80% across diverse real-world surveillance environments. This work demonstrates a new pathway toward physically grounded, user-controllable anti-AI systems, highlighting the growing importance of proactive adversarial techniques for privacy protection in the age of ubiquitous AI surveillance.

Poster

Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics

Najibul Haque Sarker ⋅ Zaber Ibn Abdul Hakim ⋅ Ali Asgarov ⋅ Chia-Wei Tang ⋅ Alvi Md Ishmam ⋅ Chris Thomas

Jun 7, 11:45 AM - 1:45 PM ExHall F 589

Fine-tuning has become the default way to adapt powerful foundation models, but this also enables low-cost repurposing for harmful objectives. Existing immunization methods try to optimize local geometry or simulate short attacker horizons, and penalize observed loss drops. However, in practice, downstream tuners run thousands of updates and overcome these short-horizon defenses. In this paper, we propose CLAMP (Contractive Long-horizon Attacker Mitigation via Progress-bounding), an immunization method that traps harmful fine-tuning by shaping the attacker's optimization dynamics rather than only the initial landscape. Our key idea is to make harmful training locally contractive, making each update smaller than the last. This yields a closed-form bound on the attacker's training beyond the attacker's simulated training steps. We also introduce a Hessian-free directional curvature penalty to create adversarial landscapes along harmful descent directions. Our bi-level objective minimizes the attacker's predicted improvement from training step zero to infinity. Experiments show that our method withstands long-horizon fine-tuning across classification, generative, and autoregressive settings and substantially reduces harmful task adaptation, while preserving benign utility and fine-tuneability.
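
To see why contractive updates give a bound that holds beyond any simulated horizon, consider the following simplified argument. It assumes a single fixed contraction factor $\rho \in [0, 1)$ and writes $\Delta_t = \theta_{t+1} - \theta_t$ for the attacker's $t$-th update; both are notation introduced here, and the paper's actual bound may take a different form.

$$
\|\Delta_{t+1}\| \le \rho\, \|\Delta_t\|
\;\Longrightarrow\;
\|\theta_T - \theta_0\| \le \sum_{t=0}^{T-1} \|\Delta_t\| \le \|\Delta_0\| \sum_{t=0}^{T-1} \rho^{t} = \|\Delta_0\|\, \frac{1 - \rho^{T}}{1 - \rho} \le \frac{\|\Delta_0\|}{1 - \rho}.
$$

The right-hand side does not depend on $T$, so under this assumption the total parameter movement stays bounded however many fine-tuning steps are run.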

Poster

Tracking by Predicting 3-D Gaussians Over Time

Tanish Baranwal ⋅ Himanshu Singh Singh ⋅ Jathushan Rajasegaran ⋅ Jitendra Malik

Jun 7, 3:30 PM - 5:30 PM ExHall A 590

We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pre-training a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches.

Poster

GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Jingjing Qian ⋅ Boyao Han ⋅ Chen Shi ⋅ Lei Xiao ⋅ Long Yang ⋅ Shaoshuai Shi ⋅ Li Jiang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 591

Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

Poster

GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective

Daixun Li ⋅ Zirui Li ⋅ Sibo He ⋅ Jiayun Tian ⋅ Mingxiang Cao ⋅ Weiying Xie ⋅ Yunke Wang ⋅ Xin Zhang ⋅ Yusi Zhang ⋅ Yunsong Li ⋅ Chang Xu ⋅ Leyuan Fang

Jun 6, 11:45 AM - 1:45 PM ExHall F 591

Multimodal Large Language Models (MLLMs) have shown strong potential in remote sensing (RS) through multi-task reasoning and cross-modal generalization. However, existing RS-MLLMs mainly rely on a single shared expert for all tasks, making it hard to produce reliable results. Meanwhile, the intrinsic redundancy and homogeneity of RS images bring substantial difficulties for both training and inference. These challenges directly conflict with the demands of remote sensing, which values task precision and trustworthy reasoning. To address these limitations, we propose GeoCoT, a manifold-driven mixture-of-experts (MoE) system with Chain-of-Thought (CoT) reasoning. GeoCoT introduces Mani-MoE, a sparse expert architecture grounded in local manifold mapping. It adaptively projects high-dimensional tokens onto low-rank subspaces to eliminate redundancy and uncover intrinsic structure, and then routes them through a sparse expert pathway, where gating decisions are guided by the manifold structure of the input. To optimize this architecture, we adopt a CoT-driven multi-stage training strategy. It leverages a cold-start phase for domain adaptation, followed by our RS Vision Group Relative Policy Optimization (RSV-GRPO) to systematically strengthen structured reasoning from global to objectives. Furthermore, we build the RS-CoT-20k dataset for task-specific supervision. Extensive experiments on multi-task datasets demonstrate that GeoCoT outperforms prior approaches, achieving $5.27\%$ higher average accuracy than the state-of-the-art method. Our code will be made available.

Poster

STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting

Hao Chen ⋅ Tao Han ⋅ Jie ZHANG ⋅ Song Guo ⋅ Lei Bai

Jun 6, 11:45 AM - 1:45 PM ExHall F 592

To gain finer regional forecasts, many works have explored the regional integration from the global atmosphere, e.g., by solving boundary equations in physics-based methods or cropping regions from global forecasts in data-driven methods. However, the effectiveness of these methods is often constrained by static and imprecise regional boundaries, resulting in poor generalization ability. To address this issue, we propose Spatial-Temporal Weather Forecasting (STCast), a novel AI-driven framework for adaptive regional boundary optimization and dynamic monthly forecast allocation. Specifically, our approach employs a Spatial-Aligned Attention (SAA) mechanism, which aligns global and regional spatial distributions to initialize boundaries and adaptively refines them based on attention-derived alignment patterns. Furthermore, we design a Temporal Mixture-of-Experts (TMoE) module, where atmospheric variables from distinct months are dynamically routed to specialized experts using a discrete Gaussian distribution, enhancing the model’s ability to capture temporal patterns. Beyond global and regional forecasting, STCast is evaluated on extreme event prediction and ensemble forecasting. Experimental results demonstrate consistent superiority over state-of-the-art methods across all four tasks.

Poster

Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Görkay Aydemir ⋅ Fatma Güney ⋅ Weidi Xie

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 593

Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher predictions, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce Verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions to construct refined pseudo-label trajectories. When applied during fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods.
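
The per-frame selection step described above can be sketched as follows. In this illustration the reliability scores are simply passed in as an array; in the paper they come from the learned Verifier, whose architecture is not described in the abstract. Taking the arg-max per frame is an assumed selection rule, and the function name is hypothetical.

```python
import numpy as np


def select_pseudo_labels(candidates, reliability):
    """Build one pseudo-label trajectory from several trackers' predictions.

    candidates:  (K, T, 2) point positions from K trackers over T frames
    reliability: (K, T) per-frame reliability scores (assumed given)
    """
    candidates = np.asarray(candidates, dtype=float)
    reliability = np.asarray(reliability, dtype=float)
    best = reliability.argmax(axis=0)              # most trusted tracker per frame, (T,)
    frames = np.arange(candidates.shape[1])
    return candidates[best, frames], best          # (T, 2) trajectory and chosen indices


# Two toy trackers over three frames: tracker 0 predicts (0, 0), tracker 1 predicts (1, 1).
cands = np.stack([np.zeros((3, 2)), np.ones((3, 2))])
scores = np.array([[0.9, 0.1, 0.8],
                   [0.2, 0.7, 0.3]])
trajectory, chosen = select_pseudo_labels(cands, scores)
```

Selecting per frame rather than per video lets the pseudo-label switch trackers wherever one of them becomes unreliable, which matches the observation that teacher quality varies across frames and scenes.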

Poster

Fractal Camouflage: A Bio-Inspired Approach for Multi-Scale Adversarial Attacks in the Infrared Domain

Chengyin Hu ⋅ Xin wang ⋅ Rui Qiu ⋅ Zhe Jia ⋅ Yingying Zhao ⋅ Kai Wang ⋅ Xu Kang ⋅ Yiwei Wei

Jun 7, 11:45 AM - 1:45 PM ExHall F 593

Infrared pedestrian detection is crucial in safety-critical systems but remains vulnerable to adversarial attacks. Existing physical attacks often rely on fixed, static patterns. However, they often lack robustness across scales, as their hand-crafted or uniformly generated structures are fundamentally limited by a fixed receptive field and fail to adapt to varying distances and scene contexts. In light of this, we propose AdvFractal, a black-box attack that exploits the innate self-similarity and structural richness of fractal geometry to naturally generate multi-scale, physically realizable adversarial perturbations. By modeling perturbations with H-type fractals and optimizing parameters via Particle Swarm Optimization, AdvFractal seamlessly coordinates attacks across scales, progressively disrupting detector features from local textures to global shapes. Experiments show AdvFractal achieves an attack success rate (ASR) of 97.54% in the physical domain and 99.16% cross-dataset, significantly outperforming state-of-the-art methods. The perturbations are highly effective in the infrared spectrum while remaining stealthy in visible light, offering a novel approach for evaluating and understanding the security of infrared detection systems.

Poster

Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

Filip Wolf ⋅ Blaz Rolih ⋅ Luka Cehovin Zajc

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 596

Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student’s pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources.

Poster

StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

Mingyu Liu ⋅ Jiuhe Shu ⋅ Hui Chen ⋅ Zeju Li ⋅ Canyu Zhao ⋅ Jiange Yang ⋅ Shenyuan Gao ⋅ Hao Chen ⋅ Chunhua Shen

Jun 7, 11:45 AM - 1:45 PM ExHall F 596

A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 11.6% on LIBERO and 31% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, which is encoded from static images, challenging the prevalent dependence of latent-action learning on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.

Poster

Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

Hongtao Yang ⋅ Bineng Zhong ⋅ Qihua Liang ⋅ Yaozong Zheng ⋅ Xiantao Hu ⋅ Yuanliang Xue ⋅ Shuxiang Song

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 599

Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: a spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and a prediction-level distillation that enhances spatial localization by learning the teacher’s capability of accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher’s target modeling capacity to the student. While the asymmetric architecture improves efficiency, it limits temporal adaptability. To mitigate this, a temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed, with EATrack-DeiT improving average success rate by 1.2\% over the previous SOTA while running at 241.9 FPS on GPU.

Poster

PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

Danyal Maqbool ⋅ Changhee Lee ⋅ Zachary Huemann ⋅ Samuel D. Church ⋅ Matthew E. Larson ⋅ Scott B. Perlman ⋅ Tomas A. Romero ⋅ Joshua D. Warner ⋅ Meghan Lubner ⋅ Xin Tie ⋅ Jameson Merkow ⋅ Junjie Hu ⋅ Steve Y. Cho ⋅ Tyler J. Bradshaw

Jun 7, 3:30 PM - 5:30 PM ExHall A 600

Generating automated reports for 3D positron emission tomography (PET) is an important and challenging task in medical imaging. PET plays a vital role in oncology, but automating report generation is difficult due to the complexity of whole-body 3D volumes, the wide range of potential clinical findings, and the limited availability of annotated datasets. To address these challenges, we first introduce PETARSeg-11K, the first large-scale, publicly available dataset that provides lesion-level correspondence between 3D PET/CT volumes and free-text radiological findings. It comprises 11,356 lesion descriptions paired with 3D segmentations. Second, we propose PETAR-4B, a 3D vision-language model designed for mask-aware, spatially grounded PET/CT reporting. PETAR-4B jointly encodes PET, CT, and 3D lesion segmentation masks, using a 3D focal prompt to capture fine-grained details of lesions that normally comprise less than 0.1\% of the volume. Evaluations using automated metrics show PETAR-4B substantially outperforming all 2D and 3D baselines. A human study involving five physicians, the first of its kind for automated PET reporting, confirms the model's clinical utility and establishes correlations between automated metrics and expert judgment. This work provides a foundational dataset and a novel architecture, advancing 3D medical vision-language understanding in PET.

Poster

MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

Albert Dominguez Mantes ⋅ Gioele La Manno ⋅ Martin Weigert

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 602

Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modeling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.
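
The idea of placing patches from different resolutions in one shared world-coordinate system can be sketched as follows. This is an illustration under stated assumptions, not MuViT's implementation: the function name `patch_world_coords`, the (row, column) convention, and the use of patch centers as positions are all hypothetical choices.

```python
from typing import List, Tuple


def patch_world_coords(
    origin: Tuple[float, float],
    scale: float,
    patch_size: int,
    grid: Tuple[int, int],
) -> List[Tuple[float, float]]:
    """Return the world-coordinate center of every patch in a crop.

    origin     -- world (row, col) position of the crop's top-left corner
    scale      -- world units covered by one pixel of this crop
                  (1.0 for full resolution, 2.0 for 2x downsampled, ...)
    patch_size -- patch edge length in pixels of the crop
    grid       -- number of patches along (rows, cols)
    """
    origin_row, origin_col = origin
    rows, cols = grid
    step = patch_size * scale  # world units spanned by one patch
    return [
        (origin_row + (r + 0.5) * step, origin_col + (c + 0.5) * step)
        for r in range(rows)
        for c in range(cols)
    ]
```

With these assumptions, a single 16-pixel patch from a 2x downsampled crop is centered at (16, 16), the midpoint of the four full-resolution 16-pixel patches it covers. Positions like these, rather than per-crop grid indices, are what a coordinate-based positional embedding would consume.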

Poster

SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance

Minghan Yang ⋅ LAN YANG ⋅ Ke Li ⋅ Honggang Zhang ⋅ Kaiyue Pang ⋅ Yi-Zhe Song

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 603

Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.

Poster

Verifying Neural Network Robustness with Dual Perturbations

Hai Duong ⋅ Son Vu ⋅ Thanh Le ⋅ ThanhVu Nguyen

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 605

Safety-critical deep learning systems must be robust against real-world corruptions combining spatially correlated distortions and independent noise. Current deep neural network verification methods handle these perturbations separately, either checking independent pixel-wise perturbations or restricted convolutional transformations using predefined patterns. This gap prevents assessing robustness under realistic conditions where both perturbation types occur simultaneously. To address these limitations, we propose VeriDou, a framework that introduces: (i) universal convolutional perturbations that enable verification across continuous spatial distortion spaces, and (ii) dual perturbations that capture both convolutional distortions and independent pixel-level variations. Our evaluation on a set of diverse benchmarks with 14340 instances shows VeriDou's dual perturbations approach found substantially more adversarial examples on networks that existing methods claimed to be highly robust. This shows that VeriDou is able to explore a broader range of unsafe regions and thus enhances formal assessment of robustness.
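
A dual perturbation, as described above, applies a spatially correlated distortion and an independent per-pixel variation to the same input. The sketch below only constructs one such perturbed image; it does not perform verification. It assumes a single-channel image, zero padding, a cross-correlation form of the convolution, and an L-infinity bound `eps` on the pixel noise. The names `conv2d_same` and `dual_perturb` are hypothetical.

```python
from typing import List

Image = List[List[float]]


def conv2d_same(image: Image, kernel: Image) -> Image:
    """Cross-correlate a single-channel image with an odd-sized kernel.

    Pixels outside the image are treated as zero, so the output has the
    same height and width as the input.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    pad_h, pad_w = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    y, x = i + u - pad_h, j + v - pad_w
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[u][v] * image[y][x]
            out[i][j] = acc
    return out


def dual_perturb(image: Image, kernel: Image, noise: Image, eps: float) -> Image:
    """Apply a convolutional distortion, then add pixel noise clipped to [-eps, eps]."""
    distorted = conv2d_same(image, kernel)
    return [
        [distorted[i][j] + max(-eps, min(eps, noise[i][j])) for j in range(len(image[0]))]
        for i in range(len(image))
    ]
```

With the identity kernel and zero noise the input is returned unchanged, so the unperturbed image is one point of the perturbation space. A verifier would have to reason over every kernel and every noise pattern within the chosen bounds, not over individual samples as done here.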

Poster

Hyperbolic Relational Prompts for Intersectional Fairness in Medical VLMs

Jiayu Qian ⋅ Zongxian Yang ⋅ Guanxing Chen ⋅ Pengwei Hu ⋅ KC Tan ⋅ Yan Wang ⋅ Yu-An Huang ⋅ Zhi-An Huang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 608

Ensuring fairness in medical vision-language models (VLMs) is essential for equitable healthcare, yet existing models amplify biases across demographic subgroups such as race and gender. Traditional fairness mitigation approaches, which rely on broad distribution alignment, fall short in addressing these nuanced intersectional disparities. We propose fairness-aware relational prompting (FRP), a novel framework that reformulates prompt generation as a dynamic, fairness-aware reasoning process. FRP constructs a relational graph to capture fine-grained, sample-level similarities and employs a hyperbolic graph layer to explicitly model the hierarchical structure of intersectional identities. Leveraging hyperbolic geometry enables reasoning over complex attribute combinations, effectively reducing entrenched biases. Evaluations on the FairVLMed and Harvard-GF datasets demonstrate that FRP achieves state-of-the-art diagnostic performance, with an area under the curve of 77.50\% and 85.94\% respectively, while substantially improving the demographic parity difference and equalized odds difference. This work moves beyond post-hoc bias correction toward inherently fair VLM architectures, offering a scalable solution for high-stakes medical applications.

Poster

PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Wenlong Huang ⋅ Yu-Wei Chao ⋅ Arsalan Mousavian ⋅ Ming-Yu Liu ⋅ Dieter Fox ⋅ Kaichun Mo ⋅ Li Fei-Fei

Jun 6, 11:45 AM - 1:45 PM ExHall F 609

Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or a few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots, crucial for contact reasoning, while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild.

Poster

Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

Haochen Niu ⋅ Kanyu Zhang ⋅ Shuyu Yin ⋅ Qinghai Guo ⋅ Peilin Liu ⋅ Fei Wen

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 609

In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state, there exists a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, thus leading to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvement in sample efficiency and success rate in both in-distribution and out-of-distribution (OOD) scenarios. By aligning with the intrinsic action tolerance of physical manipulation, FAN-guided regularization provides a principled and practical method for sample-efficient and generalizable VLA adaptation. Code is provided in supplemental material.
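
One way to picture a Gaussian prior over a feasible action neighborhood is on a single discretized action dimension. The sketch below is an assumption-laden illustration, not the paper's regularizer: the binning, the KL direction (prior to prediction), the width `sigma`, and the names `gaussian_prior`, `kl_divergence`, and `fan_regularizer` are all hypothetical.

```python
import math
from typing import List


def gaussian_prior(num_bins: int, center: int, sigma: float) -> List[float]:
    """Discretized Gaussian over action bins, normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def fan_regularizer(pred_probs: List[float], target_bin: int, sigma: float = 1.0) -> float:
    """Penalty that is zero when the prediction equals the Gaussian prior
    centered on the demonstrated action bin (assumed form of the prior)."""
    prior = gaussian_prior(len(pred_probs), target_bin, sigma)
    return kl_divergence(prior, pred_probs)
```

Under these assumptions, a one-hot prediction on the demonstrated bin is penalized more than a smooth, unimodal prediction around it, which matches the intent of tolerating near-equivalent actions. In training, such a term would be added to the task loss with a weighting coefficient.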

Poster

CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding

Yi-Lin Wei ⋅ Haoran Liao ⋅ Yuhao Lin ⋅ Pengyue Wang ⋅ Zhizhao Liang ⋅ Guiliang Liu ⋅ Wei-Shi Zheng

Jun 6, 11:45 AM - 1:45 PM ExHall F 610

In this paper, we explore an important yet underexplored task in robot manipulation: cycle-based manipulation, where robots need to perform cyclic or repetitive actions with an expected terminal time. These tasks are crucial in daily life, such as shaking a bottle or knocking a nail. However, few prior works have explored this task, leading to two main challenges: 1) the imitation methods often fail to complete these tasks within the expected terminal time due to the ineffective utilization of history; 2) the absence of a benchmark with sufficient data and automatic evaluation tools hinders development of effective solutions in this area. To address these challenges, we firstly propose the CycleManip framework to achieve cycle-based task manipulation in an end-to-end imitation manner without requiring any extra models, hierarchical structure or significant computational overhead. The core insight is to enhance effective history perception by a cost-aware sampling strategy and to improve historical understanding by multi-task learning. Secondly, we introduce a cycle-based task manipulation benchmark, which provides diverse cycle-based tasks, and an automatic evaluation method. Extensive experiments conducted in both simulation and real-world settings demonstrate that our method achieves high success rates in cycle-based task manipulation. The results further show strong adaptation performance in general manipulation, and the plug-and-play ability on imitation policies such as Vision-Language-Action (VLA) models. Moreover, the results show that our approach can be applied across diverse robotic platforms, including bi-arm grippers, dexterous hands, and humanoid robots.

Poster

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Tommie Kerssies ⋅ Gabriele Berton ⋅ Ju He ⋅ Qihang Yu ⋅ Wufei Ma ⋅ Daan de Geus ⋅ Gijs Dubbelman ⋅ Liang-Chieh Chen

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 611

Anticipating diverse future states is a central challenge in video world modeling. A key limitation lies in the computational cost of generating multiple plausible futures with existing world models. Recent work demonstrates that predicting the future in the latent space of a vision foundation model (VFM), rather than in raw pixel space, greatly improves efficiency. Despite this progress, efficient VFM-based world models are still predominantly discriminative, producing predictions that implicitly average over many possible futures. To explicitly and efficiently model diverse plausible futures, we introduce DeltaWorld, the first VFM-based world model which shifts from deterministic prediction to the ability to generate multiple plausible futures in a single forward pass. At the core of DeltaWorld is DeltaTok, a tokenizer that encodes feature differences between consecutive frames into a single compact “delta” token, effectively reducing redundancy among temporally adjacent feature maps. By representing futures as delta tokens, DeltaWorld efficiently generates multiple diverse predictions in parallel. Experiments on dense forecasting tasks demonstrate that DeltaWorld is capable of predicting futures that more closely align with real-world outcomes, while being orders of magnitude more efficient than existing generative world models. Code will be made publicly available.
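
The notion of summarizing the change between two consecutive frames in a single token can be illustrated with a deliberately simple stand-in. DeltaTok is described as a tokenizer and is presumably learned; the sketch below replaces it with mean pooling of per-patch feature differences, which is an assumption made only for illustration. The names `delta_token` and `apply_delta` are hypothetical.

```python
from typing import List

Features = List[List[float]]  # one feature vector per patch


def delta_token(prev_feats: Features, next_feats: Features) -> List[float]:
    """Compress the frame-to-frame feature change into one vector.

    Mean pooling over patches is a stand-in for a learned tokenizer.
    """
    num_patches = len(prev_feats)
    dim = len(prev_feats[0])
    token = [0.0] * dim
    for prev_patch, next_patch in zip(prev_feats, next_feats):
        for d in range(dim):
            token[d] += (next_patch[d] - prev_patch[d]) / num_patches
    return token


def apply_delta(prev_feats: Features, token: List[float]) -> Features:
    """Naive decoder: add the same delta vector to every patch feature."""
    return [[value + shift for value, shift in zip(patch, token)] for patch in prev_feats]
```

This stand-in reconstructs the next frame exactly only when every patch changes by the same amount, and it is lossy otherwise. It still shows why the representation is compact: a future frame is described by one vector relative to the current features instead of a full feature map, so several candidate futures can be produced by proposing several delta vectors.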

Poster

Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation

Qinglun Zhang ⋅ Shen Cheng ⋅ Tian Dan ⋅ Haoqiang Fan ⋅ Guanghui Liu ⋅ Shuaicheng Liu

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 612

While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from the MimicGen benchmark and further conduct 4 real-world experiments to validate its effectiveness in physical environments. Simulation results show that E3Flow achieves a 3.12\% improvement in average success rate over the state-of-the-art Spherical Diffusion Policy (SDP) while simultaneously delivering a 7$\times$ inference speedup. E3Flow thus demonstrates a new and highly effective trade-off between performance, efficiency, and data efficiency for robotic policy learning. Code and videos will be released.

Poster

FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing

Hanxi Liu ⋅ Yifang Men ⋅ Zhouhui Lian

Jun 7, 3:30 PM - 5:30 PM ExHall A 612

Single-image 3D human reconstruction holds significant promise due to its convenience and high demand in various applications. Previous methods have garnered tremendous progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization capabilities. In this paper, we present FISHuman, which aims to generate fine-grained, high-fidelity, and content-wise diverse 3D humans from a single-view input, providing production-ready 3D assets. We propose an elaborately designed workflow that reconstructs dynamic 3D meshes from multi-view inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively employing static 3D reconstruction can lead to geometric distortions and texture blurriness, due to the lack of 3D awareness within the generated frames. To address this, we introduce a novel 4D remeshing module that explicitly disentangles the learning of the globally shared canonical mesh and transient variations by tracking per-vertex deformations under different viewpoints. The topological consistency of the deformed meshes inherently enables the optimization of a unified UV representation that effectively integrates appearance attributes across frames. Both qualitative and quantitative experimental results demonstrate the superiority of our method over prior works in terms of appearance realism, geometric fineness, and generalization diversity. We also showcase the applicability of our reconstructed avatars for downstream applications including animation and 3D editing.

Poster

Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines

Yusen Cai ⋅ Qing Lin ⋅ BHARGAVA SATYA NUNNA ⋅ Mengmi Zhang

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 613

Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged “visual diets”, we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture–shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants’ visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.

Poster

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

Lianyu Wang ⋅ Meng Wang ⋅ Huazhu Fu ⋅ Daoqiang Zhang

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 615

The rapid adoption of vision-language models (VLMs) in visual recognition and multimodal reasoning has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on predefined and static authorized domains during training, limiting flexibility in dynamic real-world environments. In addition, they often produce opaque and unsafe responses to unauthorized inputs, lacking explicit alerts for illegal usage. To address these limitations, we propose a novel dynamic authorization with legality-aware intellectual property protection (AoD-IP) for VLMs, a framework that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts input legality and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.

Poster

GraspALL: Adaptive Structural Compensation from Illumination Variation for Robotic Garment Grasping in Any Low-Light Conditions

Haifeng Zhong ⋅ Wenshuo Han ⋅ Zhouyu Wang ⋅ Runyang Feng ⋅ Fan Tang ⋅ Tong-yee Lee ⋅ zipei fan ⋅ Ruihai Wu ⋅ Yuran Wang ⋅ Hao Dong ⋅ Hechang Chen ⋅ Hyung Jin Chang ⋅ Yixing Gao

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 616

Achieving accurate garment grasping under dynamically changing illumination is crucial for all-day operation of service robots. However, the reduced illumination in low-light scenes severely degrades garment structural features, leading to a significant drop in grasping robustness. Existing methods typically enhance RGB features by exploiting the illumination-invariant properties of non-RGB modalities, yet they overlook the varying dependence on non-RGB features under varying lighting conditions, which can introduce misaligned non-RGB cues and thereby weaken the model’s adaptability to illumination changes. To address this problem, we propose GraspALL, an illumination-structure interactive compensation model. The innovation of GraspALL lies in encoding continuous illumination changes into quantitative references to guide adaptive feature compensation between RGB and non-RGB modalities, thereby generating illumination-consistent grasping representations. Experiments on the self-built multimodal garment grasping (MIGG) dataset demonstrate that GraspALL improves grasping accuracy by 32-44% over baseline methods under diverse illumination conditions.

Poster

Real-Time Neural Video Compression with Unified Intra and Inter Coding

Hui Xiang ⋅ Yifan Bian ⋅ Li Li ⋅ Jingran Wu ⋅ Xianguo Zhang ⋅ Dong Liu

Jun 7, 11:45 AM - 1:45 PM ExHall F 615

Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1\% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.

Poster

EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models

Mingchen Song ⋅ Xiang Deng ⋅ Jie Wei ⋅ Dongmei Jiang ⋅ Liqiang Nie ⋅ Weili Guan

Jun 6, 11:45 AM - 1:45 PM ExHall F 616

Recent advances in unimanual manipulation policies have achieved remarkable success across diverse robotic tasks through abundant training data and well-established model architectures. However, extending these capabilities to bimanual manipulation remains challenging due to the lack of bimanual demonstration data and the complexity of coordinating dual-arm actions. Existing approaches either rely on extensive bimanual datasets or fail to effectively leverage pre-trained unimanual policies. To address this limitation, we propose EnergyAction, a novel framework that compositionally transfers unimanual manipulation policies to bimanual tasks through Energy-Based Models (EBMs). Specifically, our method incorporates three key innovations. First, we model individual unimanual policies as EBMs and leverage their compositional properties to compose left and right arm actions, enabling the fusion of unimanual policies into a bimanual policy. Second, we introduce an energy-based temporal-spatial coordination mechanism through energy constraints, ensuring the generated bimanual actions are both temporally coherent and spatially feasible. Third, we propose two different energy-aware denoising strategies that dynamically adapt denoising steps based on action quality assessment. These strategies ensure the generation of high-quality actions while maintaining superior computational efficiency compared to fixed-step denoising approaches. Experimental results demonstrate that EnergyAction effectively transfers unimanual knowledge to bimanual tasks, achieving superior performance on both simulated and real-world tasks with minimal bimanual data.
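
A toy illustration of the compositional idea, not the authors' method: if each arm's policy is scored by an energy function, a joint energy can be formed by summing the per-arm energies with a coordination term, and a joint action can be found by descending that sum. The quadratic energies, the `min_separation` penalty, and the finite-difference descent below are all assumptions made for this sketch.

```python
def left_energy(a):
    # Hypothetical left-arm energy: lowest when the left action a[0] equals 1.0.
    return (a[0] - 1.0) ** 2


def right_energy(a):
    # Hypothetical right-arm energy: lowest when the right action a[1] equals -1.0.
    return (a[1] + 1.0) ** 2


def coordination_energy(a, min_separation=1.0):
    # Hypothetical spatial constraint: penalize arms closer than min_separation.
    gap = abs(a[0] - a[1])
    return max(0.0, min_separation - gap) ** 2


def compose(*energies):
    """Compose energies by summation (a product of the underlying densities)."""
    return lambda a: sum(e(a) for e in energies)


def descend(energy, a, step=0.1, iters=200, eps=1e-4):
    """Plain gradient descent using central finite differences."""
    a = list(a)
    for _ in range(iters):
        grad = []
        for i in range(len(a)):
            hi = a[:i] + [a[i] + eps] + a[i + 1:]
            lo = a[:i] + [a[i] - eps] + a[i + 1:]
            grad.append((energy(hi) - energy(lo)) / (2 * eps))
        a = [x - step * g for x, g in zip(a, grad)]
    return a


joint = compose(left_energy, right_energy, coordination_energy)
action = descend(joint, [0.2, -0.2])
```

Here `action` ends up near the point where both per-arm energies are low and the separation constraint is inactive.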

Poster

Adapting Lightweight Image-based Counting Models for Video Crowd Counting

Weibo Shu ⋅ Antoni B. Chan

Jun 7, 11:45 AM - 1:45 PM ExHall F 616

Video crowd counting aims to predict the people count in each frame of a video. It requires effectively leveraging spatio-temporal (ST) information in videos while satisfying real-time constraints. However, most existing methods use ST information from neighboring frames through auxiliary extraction and fusion modules, resulting in high computational cost and the need to buffer multiple frames during inference. Such designs limit their practicality in real-world applications with limited computational resources or stringent real-time requirements. To address these issues, we revisit video crowd counting from the perspective of lightweight image-based counting models that enable real-time deployment under limited resources. We analytically define ST information in a model-independent and statistically interpretable manner, and incorporate it into training via a statistical regularizer that effectively enhances model performance without adding modules or inference overhead. Most framework hyperparameters are further formulated as statistical inference problems, allowing automatic estimation from data and consequently efficient adaptation to new scenarios. Our framework unifies video crowd counting and image-based counting models under a compact, principled formulation that is lightweight, portable, and efficient. We also establish theoretical foundations for adapting image-based counting models to video crowd counting and achieve state-of-the-art accuracy and efficiency across six benchmarks, including the challenging DRONECROWD and VSCROWD.

Poster

MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data

Hunor Laczko ⋅ Libang Jia ⋅ Loc-Phat Truong ⋅ Diego Hernández ⋅ Sergio Escalera ⋅ Jordi Gonzàlez ⋅ Meysam Madadi

Jun 7, 3:30 PM - 5:30 PM ExHall A 616

Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g., tucked shirts, rolled sleeves). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VTON applications, MV-Fashion provides paired data: multi-view synchronized captures of worn garments alongside their corresponding flat, catalogue images. We leverage this dataset to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis.

Poster

Representing 3D Faces with Learnable B-Spline Volumes

Prashanth Chandran ⋅ Daoye Wang ⋅ Timo Bolkart

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 618

We present CUBE (Control-based Unified B-Spline Encoding), a new geometric representation for digital humans that combines B-Spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-Spline representations that use 3D control points, CUBE is parametrized by a lattice (e.g., $8 \times 8 \times 8$) of high-dimensional control features, increasing the model's expressivity. These control features define a continuous mapping from a 3D parametric domain to 3D Euclidean space through an intermediate feature space, which is evaluated in two stages. First, high-dimensional control features are locally blended using the B-Spline bases, yielding a high-dimensional feature vector, where the first three values are the 3D coordinates of a coarse base mesh. Second, this feature vector is input to a small MLP to predict a residual from the base shape, resulting in refined 3D point coordinates. To reconstruct 3D surfaces in dense semantic correspondence, we query CUBE at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support of traditional B-spline representations, enabling us to locally edit the surface by updating individual control features. We demonstrate the strengths of this representation by training two transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent geometric and multi-view baselines.
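
The two-stage evaluation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes uniform cubic B-spline bases, a parametric domain of $[0, 1]^3$, and a placeholder `residual_mlp` that stands in for the learned MLP and simply returns zero.

```python
def cubic_bspline_weights(u, n):
    """Return (first control index, 4 uniform cubic B-spline weights) for u in [0, 1]."""
    t = u * (n - 3)            # n control points give n - 3 cubic segments
    i = min(int(t), n - 4)     # clamp so u == 1.0 stays in the last segment
    s = t - i
    weights = (
        (1 - s) ** 3 / 6.0,
        (3 * s**3 - 6 * s**2 + 4) / 6.0,
        (-3 * s**3 + 3 * s**2 + 3 * s + 1) / 6.0,
        s**3 / 6.0,
    )
    return i, weights


def blend_features(lattice, uvw):
    """Stage 1: blend the 4x4x4 neighbourhood of control features around uvw."""
    n = len(lattice)
    (i, wi), (j, wj), (k, wk) = (cubic_bspline_weights(c, n) for c in uvw)
    dim = len(lattice[0][0][0])
    out = [0.0] * dim
    for a in range(4):
        for b in range(4):
            for c in range(4):
                w = wi[a] * wj[b] * wk[c]
                feature = lattice[i + a][j + b][k + c]
                for d in range(dim):
                    out[d] += w * feature[d]
    return out


def residual_mlp(feature):
    """Hypothetical stand-in for the small learned MLP: predicts a zero residual."""
    return [0.0, 0.0, 0.0]


def evaluate(lattice, uvw):
    """Stage 2: base position (first three feature values) plus predicted residual."""
    feature = blend_features(lattice, uvw)
    base = feature[:3]
    residual = residual_mlp(feature)
    return [b + r for b, r in zip(base, residual)]
```

Because each query touches only a 4x4x4 block of control features, changing one control feature affects only nearby points, which is the local-support property the abstract relies on for editing.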

Poster

Scalable Feature Matching via State Space Modeling and Sparse Correlation

Choo Sin Wai ⋅ Bo Li

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 621

Efficient and robust feature matching is crucial for latency-sensitive and resource-constrained applications. However, current semi-dense feature matching approaches commonly suffer from quadratic complexity in spatial resolution due to transformer-based long-range context modeling or redundant full correlation computations. To overcome these limitations, we present a novel scalable feature matching method that delivers reliable correspondences with low memory footprint and latency, especially at high resolutions. Our approach introduces three key innovations: (1) a hybrid Conv-Mamba backbone for efficient cross-scale and cross-view feature extraction with linear complexity, (2) a training-free norm-based feature filtering mechanism, enabling sparse correlation that significantly reduces computation overhead during inference, and (3) a lightweight recurrent coordinate refinement that surpasses expectation-based regression in subpixel accuracy. Experimental results demonstrate our method's superior accuracy and efficiency over state-of-the-art (SOTA) approaches on both indoor and outdoor datasets. Notably, in resolution scaling tests, our method achieves 45\% lower memory usage and 2.4$\times$ faster inference than JamMa, while also outperforming Efficient LoFTR with 57\% memory reduction and 1.8$\times$ speedup at high resolution. These results demonstrate the strong scalability and practical efficiency of our method.
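
As a rough illustration of how norm-based filtering can make correlation sparse, the sketch below keeps only the highest-norm features in each image and correlates just those. The abstract does not specify the selection rule, so keeping a top fraction by L2 norm and the `keep_ratio` parameter are assumptions of this sketch.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def keep_by_norm(features, keep_ratio):
    """Return indices of the highest-norm features (at least one), in original order."""
    k = max(1, int(len(features) * keep_ratio))
    ranked = sorted(range(len(features)), key=lambda i: l2_norm(features[i]), reverse=True)
    return sorted(ranked[:k])


def sparse_correlation(feats_a, feats_b, keep_ratio=0.5):
    """Dot-product correlation restricted to the kept features of both images."""
    idx_a = keep_by_norm(feats_a, keep_ratio)
    idx_b = keep_by_norm(feats_b, keep_ratio)
    corr = {}
    for i in idx_a:
        for j in idx_b:
            corr[(i, j)] = sum(x * y for x, y in zip(feats_a[i], feats_b[j]))
    return corr
```

With a keep ratio of r on both sides, the number of correlation entries drops to roughly r squared of the full matrix, which is where the saving would come from.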

Poster

HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation

Jiawen Li ⋅ Fei Jiang ⋅ Dandan Zhu ⋅ Aimin Zhou

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 621

Unsupervised domain adaptation (UDA) for pose estimation promises transfer from synthetic to real domains but often suffers from instability under domain shift. Prior work attributes this deterioration to gradient interference between source supervision and target consistency. This conflict is distinct in pose estimation, where sparse and heterogeneous supervision signals cause gradients to be highly sensitive to small localization errors and lead to unstable updates. To address these challenges, we propose HamiPose, a Hamiltonian optimization framework that transports decoupled and confidence-calibrated gradients within a unified geometry to mitigate instability. HamiPose first refines gradient interaction through keypointwise geometry decomposition, orthogonally projecting target gradients to preserve the nonconflicting component. Channelwise gated alignment then calibrates the parallel component with confidence and alignment, producing decoupled, confidence-calibrated gradients. These gradients are advanced by a Hamiltonian optimizer with a symplectic integrator, providing controlled momentum that stabilizes updates. Extensive experiments demonstrate that HamiPose achieves state-of-the-art performance in UDA pose estimation while maintaining strong performance under domain generalization settings.
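
The projection step can be pictured with a small sketch: the target gradient is split into a component parallel to the source gradient and a component orthogonal to it, the orthogonal part is kept, and the parallel part is scaled by a gate. This is a generic illustration; the scalar `gate` is a hypothetical simplification of the channelwise gating, and the keypointwise decomposition and Hamiltonian optimizer are not modeled here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(g_target, g_source):
    """Split g_target into (parallel, orthogonal) parts relative to g_source."""
    denom = dot(g_source, g_source)
    if denom == 0.0:
        return [0.0] * len(g_target), list(g_target)
    coef = dot(g_target, g_source) / denom
    parallel = [coef * s for s in g_source]
    orthogonal = [t - p for t, p in zip(g_target, parallel)]
    return parallel, orthogonal


def calibrated_gradient(g_target, g_source, gate):
    """Keep the orthogonal part; scale the parallel part by gate in [0, 1]."""
    parallel, orthogonal = decompose(g_target, g_source)
    return [o + gate * p for o, p in zip(orthogonal, parallel)]
```

With `gate=0.0` the result has no component along the source gradient, and with `gate=1.0` the original target gradient is recovered.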

Poster

Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains

Qingwei Ben ⋅ Botian Xu ⋅ Kailin Li ⋅ Feiyu Jia ⋅ Wentao Zhang ⋅ Jingping Wang ⋅ Jingbo Wang ⋅ Dahua Lin ⋅ Jiangmiao Pang

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 621

Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents $\textbf{Gallant}$, a voxel-grid–based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant’s broader perceptual coverage facilitates the use of a single policy that goes beyond the limitations of previous methods confined to ground-level obstacles, extending to lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant is also the first to achieve near-100\% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms through improved end-to-end optimization. This project will be fully open-sourced.
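
To make the voxelized representation concrete, here is a minimal sketch that bins LiDAR points into a binary occupancy grid indexed as `[z][x][y]`, so that height slices could be fed to a 2D network as channels. The grid bounds, voxel size, and the reading of "z-grouped" as "z slices treated as channels" are assumptions of this sketch, not details taken from the paper.

```python
def voxelize(points, lower, upper, voxel_size):
    """Return a binary occupancy grid indexed [z][x][y] for points inside the bounds."""
    nx, ny, nz = (int(round((hi - lo) / voxel_size)) for lo, hi in zip(lower, upper))
    grid = [[[0] * ny for _ in range(nx)] for _ in range(nz)]
    for point in points:
        # Floor-divide each coordinate's offset from the lower bound by the voxel size.
        ix, iy, iz = (int((c - lo) // voxel_size) for c, lo in zip(point, lower))
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            grid[iz][ix][iy] = 1   # points outside the bounds are dropped
    return grid


def occupied_count(grid):
    return sum(cell for z_slice in grid for row in z_slice for cell in row)
```

Unlike an elevation map, this layout keeps overhead and multi-level structure, since several z slices can be occupied above the same (x, y) cell.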

Poster

Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining

Zhumei Wang ⋅ Zechen Hu ⋅ Ruoxi Guo ⋅ Huaijin Pi ⋅ Ziyong Feng ⋅ Liang Zhang ⋅ Mingtao Pei ⋅ Siyuan Huang

Jun 7, 3:30 PM - 5:30 PM ExHall A 621

Human motion recovery for real-world interaction demands both precise action details and metric-scale trajectories. Recovering absolute human pose from monocular input presents a viable solution, but faces two main challenges: (1) models' reliance on 3D training data from constrained environments limits their out-of-distribution generalization; and (2) the inherent difficulty of estimating metric-scale poses from monocular observations. This paper introduces Mocap-2-to-3, a novel framework that differs from prior HMR methods by recovering absolute poses from monocular input and leveraging abundant 2D data to enhance 3D motion recovery. To effectively utilize the action priors and diversity in large-scale 2D datasets, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: a single-view diffusion model is first pre-trained on extensive 2D data, followed by multi-view fine-tuning on public 3D data, thus achieving a combination of strong priors and geometric constraints. Furthermore, to recover absolute poses, we introduce a novel human motion representation that decouples the learning of local pose and global movements, while encoding ground geometric priors to accelerate convergence, thereby yielding more precise positioning in the physical world. Experiments on in-the-wild benchmarks show that our method outperforms state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting strong generalization capability. Our code will be made publicly available.

Poster

GenTract: Generative Global Tractography

Alec Sargood ⋅ Lemuel Puglisi ⋅ Elinor Thompson ⋅ Mirco Musolesi ⋅ Daniel C. Alexander

Jun 7, 11:45 AM - 1:45 PM ExHall F 622

Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract’s performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1$\times$ higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.

Poster

Learning to Track Instance from Single Nature Language Description

Yaozong Zheng ⋅ Bineng Zhong ⋅ Qihua Liang ⋅ Shuimu Zeng ⋅ Haiying Xia ⋅ Shuxiang Song

Jun 6, 11:45 AM - 1:45 PM ExHall F 623

How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence \textbf{without relying on any bounding-box ground truth}? In this work, we achieve this goal by tackling \textit{self-supervised VL tracking}, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \textbf{\tracker}, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token \textbf{unequally}. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that {\tracker} surpasses SOTA self-supervised methods, achieving an improvement of more than 11.2\%, 5\%, and 3.3\% in AUC score on the OTB99, LaSOT, and TNL2K datasets, respectively.

Poster

Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

Hongwei Fang ⋅ Jiahang Cai ⋅ Xun Wang ⋅ Wenwu Yang

Jun 7, 3:30 PM - 5:30 PM ExHall A 623

Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Source code will be released for research purposes.

Poster

Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing

Tian Ding ⋅ Hongtao Yang ⋅ Liangtao Shi ⋅ Jun Li ⋅ Xiantao Hu ⋅ Jian Yang ⋅ Ying Tai

Jun 6, 11:45 AM - 1:45 PM ExHall F 625

fails under night scenes, glare, fog, and partial occlusion. Despite notable accuracy gains, recent architectures emphasize deep fusion and large parameter counts, driving up FLOPs and bandwidth. This computational burden constrains real-time performance and limits scalability beyond high-end GPUs. To balance accuracy and efficiency, we propose Adaptive Early-Exit (AEE): we augment the backbone with anytime heads and pair them with a confidence-calibrated early-exit policy that halts inference at the earliest reliable layer, skipping redundant computation. For cross-modal interaction, we design a Holistic-Token-Guided Interaction (HTGI) module, where each modality is compressed into a compact set of holistic state tokens and injected into the other modality’s modeling stream without layer-wise alignment, enabling targeted information exchange at extremely low cost. On RGB-T benchmarks, the lightweight tracker substantially reduces latency while maintaining competitive accuracy; on LasHeR, it achieves 70.2% precision and 56.3% success, running at 148.3 FPS on GPU, 50.2 FPS on CPU, and 28.7 FPS on an edge device.
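
The early-exit policy can be sketched generically: run the backbone layer by layer, query a prediction head after each layer, and stop at the first head whose confidence clears a threshold. This is a minimal sketch under assumptions; the callables, the scalar confidence, and the fixed `threshold` are hypothetical, and the abstract's confidence calibration is not modeled.

```python
def run_with_early_exit(x, layers, heads, threshold):
    """Return (prediction, confidence, depth used), stopping at the first confident head."""
    prediction, confidence, depth = None, 0.0, 0
    for layer, head in zip(layers, heads):
        x = layer(x)
        depth += 1
        prediction, confidence = head(x)
        if confidence >= threshold:
            break   # skip the remaining, deeper layers
    return prediction, confidence, depth


# Toy stand-ins: each layer adds 1, each head grows more confident with depth.
toy_layers = [lambda v: v + 1] * 4
toy_heads = [lambda v: (v * 10, v / 4)] * 4
easy = run_with_early_exit(0, toy_layers, toy_heads, threshold=0.7)
hard = run_with_early_exit(0, toy_layers, toy_heads, threshold=2.0)
```

In this toy setup `easy` exits after three layers, while `hard` never reaches the threshold and falls through to the final head.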

Poster

Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning

Zhuofan Xie ⋅ Zishan Lin ⋅ Jinliang Lin ⋅ Jie Qi ⋅ Shaohua Hong ⋅ Shuo Li

Jun 6, 11:45 AM - 1:45 PM ExHall F 628

Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from cold-start when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text–image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.
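
Vacuity and dissonance have standard definitions in subjective logic, which the sketch below follows: evidence $e_k$ gives Dirichlet parameters $\alpha_k = e_k + 1$, belief masses $b_k = e_k / S$ with $S = \sum_k \alpha_k$, and vacuity $K / S$. How the Similarity Evidence Head maps similarities to evidence is not stated in the abstract, so the scaled softplus used here (and its `scale` value) is purely an assumption.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def dirichlet_opinion(similarities, scale=10.0):
    """Map similarities to (belief masses, vacuity) via non-negative evidence."""
    evidence = [softplus(scale * s) for s in similarities]   # assumed mapping
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    vacuity = len(alpha) / strength     # high when total evidence is low
    return belief, vacuity


def dissonance(belief):
    """Conflict among belief masses; high when several classes have similar support."""
    total = 0.0
    for k, bk in enumerate(belief):
        others = [bj for j, bj in enumerate(belief) if j != k]
        denom = sum(others)
        if bk == 0.0 or denom == 0.0:
            continue
        balance = sum(bj * (1.0 - abs(bj - bk) / (bj + bk)) for bj in others if bj > 0.0)
        total += bk * balance / denom
    return total
```

A sample with uniformly low similarities yields high vacuity, while a sample with two equally strong classes yields high dissonance, which matches the early-round and late-round acquisition roles described above.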

Poster

ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss

Jiaying Ying ⋅ Heming Du ⋅ Kaihao Zhang ⋅ Sean M. Tweedy ⋅ Xin Yu

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 629

Single-image human mesh recovery provides a compact 3D, person-centric representation that supports analysis, animation, AR and VR, rehabilitation, and human–computer interaction. However, prevailing systems impose an intact-limb prior and degrade on people with limb loss, because fixed-topology models cannot represent residual limbs. In this work, we present ResiHMR, a residual-limb aware framework for single-image 3D human modeling. ResiHMR adopts residual-limb keypoints and introduces two components: (i) a topology-adaptive Residual Anchor-Factor Optimization module that constrains estimation to the observed kinematic subgraph of anatomically valid structures, and (ii) a geometry-based Residual-Limb Reconstruction module that estimates residual-limb boundaries and convex limb-termination geometry. Together, these modules introduce topology-aware optimization and explicit termination geometry as tools for human mesh recovery under non-standard limb anatomy. Unlike joint-removal methods in a fixed topology, ResiHMR explicitly reconstructs residual-limb surfaces and aligns optimization with limb-loss topology, which better matches prosthetic biomechanics and real-world use. To the best of our knowledge, this is the first single-image HMR system that explicitly reconstructs residual-limb surfaces and performs topology-adaptive optimization for individuals with limb loss. On a curated dataset of real-world images with limb loss, compared with SMPLify-X, ResiHMR reduces intact-joint 2D MPJPE from 41.32 to 37.40, increases mIoU from 0.662 to 0.703, and improves anatomical plausibility in expert ratings.

Poster

OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis

Yuxuan Fan ⋅ JING HAO ⋅ Hong Chen ⋅ Jiahao Bao ⋅ Yihua Shao ⋅ Yuci Liang ⋅ Kuo Feng Hung ⋅ Hao Tang

Jun 7, 11:45 AM - 1:45 PM ExHall F 630

Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision–language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision–language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand–image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis. Code, benchmark, and models will be made publicly available.

Poster

RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation

Ganlin Feng ⋅ Yuxi Long ⋅ Hafsa Moontari Ali ⋅ Erin Lou ⋅ Fahad Butt ⋅ Qian Liu ⋅ Yang Wang ⋅ Pingzhao Hu

Jun 7, 3:30 PM - 5:30 PM ExHall A 631

Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision–language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.

Poster

Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers

Ufaq Khan ⋅ Umair Nawaz ⋅ Massimo Caputo ⋅ Muhammad Bilal ⋅ Junaid Qadir ⋅ Muhammad Haris Khan

Jun 6, 11:45 AM - 1:45 PM ExHall F 632

Medical imaging presents significant challenges due to acoustic shadows, motion blur, and indistinct boundaries. Addressing these issues is crucial for improving diagnostic accuracy. Many conventional vision models require extensive fine-tuning on task-specific data and often lose generalizability to natural-image domains. We propose DCRM-ViT, a domain-conditioned residual modulation framework for Vision Transformers that preserves general-vision capability while adapting to diverse domains. DCRM-ViT keeps the backbone frozen and augments each block with a lightweight Residual Modulation Block (RMB) whose parameters are synthesized per sample by a Domain Router (DR) and Parameter Synthesizer Network (PSN). The router outputs soft domain weights from input features, whereas the synthesizer maps these weights to low-rank residuals that modulate selected projections and, optionally, add a domain-aware bias to attention. Crucially, we learn routing and modulation via a bi-level optimization scheme: a short inner loop adapts RMB parameters to task supervision, while an outer loop updates DR, PSN, and RMB initializations/step sizes so the synthesized residuals generalize across medical and natural domains. Across fine-grained classification (Food101, SUN397, Stanford Cars) and medical segmentation (ultrasound, CT, MRI), DCRM-ViT improves over strong baselines while using modest trainable compute. The ablation studies confirm the benefits of our architectural enhancements, showing improved performance and adaptability. The results demonstrate DCRM-ViT's potential to offer high diagnostic performance with reduced computational overhead, using 4.7 GFLOPs and 0.3 training min/epoch. Our code will be publicly available upon acceptance.
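
One way to picture routing plus low-rank modulation of a frozen projection is sketched below: soft domain weights mix per-domain low-rank updates that are added to the frozen output. This is an illustrative simplification, not the paper's design; the linear mixing of fixed per-domain factors `down[d]` and `up[d]` is an assumption standing in for the Parameter Synthesizer Network, and all names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def modulated_projection(x, frozen_w, domain_logits, down, up):
    """Compute frozen_w @ x + sum_d w_d * up[d] @ (down[d] @ x); frozen_w is never changed."""
    weights = softmax(domain_logits)          # soft domain weights from the router
    out = matvec(frozen_w, x)
    for w, a, b in zip(weights, down, up):
        delta = matvec(b, matvec(a, x))       # rank-limited residual for one domain
        out = [o + w * d for o, d in zip(out, delta)]
    return out
```

Because the residual passes through a narrow inner dimension, each domain adds only a small number of parameters on top of the frozen weights.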

Poster

An Efficient Token Compression Framework for Visual Object Tracking

Weijing Wu ⋅ Qihua Liang ⋅ Bineng Zhong ⋅ Haiying Xia ⋅ Zhiyi Mo ⋅ Shuxiang Song

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 637

Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. This fusion is performed through a cascade of collaborative stages, where each stage executes a structured process of template enrichment via search context, unified feature learning, and search feature refinement to ensure precise target localization. Experiments on seven benchmarks demonstrate that our method significantly outperforms current state-of-the-art trackers.

Poster

Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis

Yuan Zhang ⋅ Sihao Dou ⋅ Kai Hu ⋅ Shuhua Deng ⋅ Chunhong Cao ⋅ Fen Xiao ⋅ Xieping Gao

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 637

Endoscopic video analysis is crucial for early gastrointestinal screening, but its progress is constrained by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods designed for natural videos tend to prioritize dense spatio-temporal modeling and exhibit motion bias, neglecting the static, structured semantics that are critical for clinical decision-making. To address this challenge, we propose **F**ocus-to-**P**erceive **R**epresentation **L**earning (***FPRL***), a cognition-inspired hierarchical framework that emulates the clinical examination process of endoscopic videos. ***FPRL*** first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, ***FPRL*** employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics through the application of teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that ***FPRL*** achieves state-of-the-art performance across diverse downstream tasks, demonstrating its effectiveness and strong generalization in endoscopic video representation learning.

View full details

Poster

CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

Di Zhang ⋅ Zhangpeng Gong ⋅ Xiaobo Pang ⋅ Jiashuai Liu ⋅ Junbo Lu ⋅ Hao Cui ⋅ Jiusong Ge ⋅ Zhi Zeng ⋅ Kai Yi ⋅ Yinghua Li ⋅ Si Liu ⋅ Tingsong Yu ⋅ Haoran Wang ⋅ Mireia Crispin-Ortuzar ⋅ Weimiao Yu ⋅ Chen Li ⋅ Zeyu Gao

Jun 6, 11:45 AM - 1:45 PM ExHall F 638

Foundation models have recently achieved impressive success in computational pathology, demonstrating strong generalization across diverse histopathology tasks. However, existing models overlook the heterogeneous and non-uniform organization of pathological regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they often fail to capture the coherent tissue architecture beyond isolated patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two-stage pretraining strategy: (1) a self-supervised unimodal pretraining stage that learns morphological representations from 34,277 whole-slide images (WSIs) without segmentation annotations, and (2) a cross-modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the most representative area as ROI. CARE supports a broad range of pathology-related tasks, using either the ROI feature or the slide-level feature obtained by aggregating adaptive regions. Based on only one-tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis, and outperforms other foundation model baselines overall.

View full details

Poster

FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy

Hyejin Park ⋅ Jiwon Yoon ⋅ Sumin Park ⋅ Suree Kim ⋅ Sinae Jang ⋅ Eunsoo Lee ⋅ Dongmin Kang ⋅ Dongbo Min

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 639

Accurate focus quality assessment (FQA) in fluorescence microscopy remains challenging, as the stain-dependent optical properties of fluorescent dyes cause abrupt and heterogeneous focus shifts. However, existing datasets and models overlook this variability, treating focus quality as a stain-agnostic problem. In this work, we formulate the task of **stain-aware FQA**, emphasizing that focus behavior in fluorescence microscopy must be modeled as a function of staining characteristics. Through quantitative analysis of existing datasets (FocusPath, BBBC006) and our newly curated FluoMix, we demonstrate that focus–rank relationships vary substantially across stains, underscoring the need for stain-aware modeling in fluorescence microscopy. To support this new formulation, we propose **FluoMix**, the first dataset for stain-aware FQA that encompasses multiple tissues, fluorescent stains, and focus variations. Building on this dataset, we propose **FluoCLIP**, a two-stage vision-language framework that leverages CLIP's alignment capability to interpret focus quality in the context of biological staining. In the **stain-grounding phase**, FluoCLIP learns general stain representations by aligning textual stain tokens with visual features, while in the **stain-guided ranking phase**, it optimizes stain-specific rank prompts for ordinal focus prediction. Together, our formulation, dataset, and framework establish the first foundation for **stain-aware FQA**, and **FluoCLIP** achieves strong generalization across diverse fluorescence microscopy conditions.

View full details

Poster

Editprint: General Digital Image Forensics via Editing Fingerprint with Self-Augmentation Training

Haiwei Wu ⋅ Kemou Li ⋅ Yuanman Li ⋅ Jiantao Zhou

Jun 7, 11:45 AM - 1:45 PM ExHall F 640

Digital image forensics can ensure information credibility in tasks like camera source identification (CSI), synthetic image detection (SID), and social network provenance (SNP). These tasks typically rely on image processing history clues left by in-camera operations, post-capture editing, or synthetic generation. However, most existing forensic methods have obvious limitations: 1) they often only focus on camera-specific traces (e.g., the well-known PRNU), and 2) they demand a substantial amount of annotated training data. To address these constraints, we propose Editprint, a novel general forensic feature that captures highly diverse in- and out-camera processing history clues with minimal unlabeled training data. Ideally, we expect that any images undergoing the same imaging, editing, and transmission processes would yield identical Editprints, and vice versa. To model the in- and out-camera operations, we devise an online editing pool based on self-augmentation strategies. Requiring only minimal (e.g., 10) training data, the editing pool can simulate massive (e.g., $10^7$) editing chains and traces arising from the in-camera processing and the subsequent out-camera operations. To ensure that Editprint exhibits high discriminative capabilities across various editing chains, we propose using textual descriptions of these chains as labels and supervising their Editprints through language-guided contrastive learning. Extensive experiments show Editprint outperforms existing self-supervised forensics, particularly in non-camera applications such as SNP and SID. We hope that Editprint will inspire the forensic community and serve as a novel benchmark for self-supervised forensics.

View full details

Poster

Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View

Mingwen Shao ⋅ Qiao Zhang ⋅ Xinyuan Chen ⋅ Xiang Lv ⋅ Lingzhuang Meng ⋅ Chang Liu ⋅ Qinglin Zhan ⋅ Ling Jian

Jun 7, 3:30 PM - 5:30 PM ExHall A 641

Pose-agnostic anomaly detection (PAD) achieves strong performance in localizing anomalies from arbitrary viewpoints when trained on densely sampled normal data. However, under sparse-view conditions, existing methods face two key challenges: (1) sparse observations lead to overfitting and geometric detail loss in 3D reconstruction; (2) limited visual cues lead to inaccurate pose estimation, compromising the reliability of subsequent anomaly localization. To address these challenges, we propose Wave-Pose3D, a wavelet-driven 3D anomaly detection framework tailored for PAD under sparse-view conditions. First, we design a structure-aware and wavelet-optimized Gaussian modeling strategy that dynamically filters unreliable regions via structural priors to mitigate overfitting and leverages high-frequency supervision to restore fine-grained geometric details. Second, to improve pose estimation under sparse views, we develop a wavelet-based pose estimator that integrates low-frequency structural cues and high-frequency details to enhance both initialization and refinement accuracy. Finally, we introduce a wavelet difference-aware anomaly detector that computes frequency-domain anomaly scores, improving localization robustness against pose and geometric variations. By integrating these strategies, Wave-Pose3D achieves robust and accurate anomaly localization under sparse views. Extensive experiments validate that the proposed approach achieves state-of-the-art performance under 10% and 20% sparse-view configurations.
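
To make the low-frequency/high-frequency split mentioned in this abstract concrete, the following is a minimal sketch of a one-level 2D Haar transform and a toy score that compares only the detail bands of a test image against a reference. It is an illustration under stated assumptions, not the Wave-Pose3D detector: the function names `haar2d` and `highfreq_score` and the absolute-difference score are hypothetical choices made here.

```python
def haar2d(img):
    """One-level 2D Haar transform of an image with even height and width.

    Returns (avg, dx, dy, dxy) at half resolution:
    avg is the local mean (low frequency); dx, dy, dxy are detail bands.
    """
    h, w = len(img), len(img[0])
    avg, dx, dy, dxy = [], [], [], []
    for i in range(0, h, 2):
        r_avg, r_dx, r_dy, r_dxy = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_avg.append((a + b + c + d) / 4.0)  # 2x2 block mean
            r_dx.append((a - b + c - d) / 4.0)   # left column minus right column
            r_dy.append((a + b - c - d) / 4.0)   # top row minus bottom row
            r_dxy.append((a - b - c + d) / 4.0)  # diagonal difference
        avg.append(r_avg)
        dx.append(r_dx)
        dy.append(r_dy)
        dxy.append(r_dxy)
    return avg, dx, dy, dxy


def highfreq_score(test, ref):
    """Toy per-block score (an assumption of this sketch): summed absolute
    difference of the three detail bands between test and reference."""
    _, tx, ty, txy = haar2d(test)
    _, rx, ry, rxy = haar2d(ref)
    return [
        [
            abs(tx[i][j] - rx[i][j])
            + abs(ty[i][j] - ry[i][j])
            + abs(txy[i][j] - rxy[i][j])
            for j in range(len(tx[0]))
        ]
        for i in range(len(tx))
    ]
```

In this sketch a change confined to one 2x2 block raises the score only in that block, which is the localization property that a frequency-domain comparison is meant to provide.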

View full details

Poster

Enabling Supervised Learning of Generative Signatures for Generalized Synthetic Image Detection

Jianwei Fei ⋅ Yunshu Dai ⋅ Xiaoyu Zhou ⋅ Zhihua Xia ⋅ Alessandro Piva

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 642

Extracting reliable generative traces from generated images is critical for AI-generated image (AIGI) detection. However, a fundamental challenge exists: AIGIs inherently contain generative traces with no trace-free counterpart available, making supervised extraction of these artifacts infeasible. In this work, we overcome this through a surrogate supervision framework. We design a dynamic reconstructor that simulates diverse generative traces on real images through stochastically varied architectures and parameters. The reconstruction residuals serve as supervision to train an extractor that learns to isolate traces, i.e., generative signatures (GenSign). A detector then fuses extracted GenSign with RGB features to distinguish real images from AIGIs. Our key insight is that sufficient architectural diversity in simulation enables effective transfer to real-world generators, resolving the absence of ground truth GenSign. Extensive experiments across four benchmarks demonstrate state-of-the-art generalization, confirming that our simulation-based learning paradigm is capable of extracting general and transferable forensic features.

View full details

Poster

EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

Zhenyu Li ⋅ Sai Kumar Dwivedi ⋅ Filip Maric ⋅ Carlos Chacón ⋅ Nadine Bertsch ⋅ Filippo Arcadu ⋅ Tomas Hodan ⋅ Michael Ramamonjisoa ⋅ Peter Wonka ⋅ Amy Zhao ⋅ Robin Kips ⋅ Cem Keskin ⋅ Anastasia Tkach ⋅ Chenhongyi Yang

Jun 6, 11:45 AM - 1:45 PM ExHall F 642

Egocentric 3D human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present XR-Poser, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. The proposed model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, and causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The proposed auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher–student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. In experiments on the EgoBody3M benchmark, XR-Poser outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%, respectively. Furthermore, our auto-labeling system improves the wrist MPJPE by 13.1%.

View full details

Poster

Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning

Menghao Zhang ⋅ Yiyan Zhu ⋅ Pengfei Ren ⋅ Haifeng Sun ⋅ Qi Qi ⋅ Zirui Zhuang ⋅ Huazheng Wang ⋅ Lei Zhang ⋅ Jianxin Liao ⋅ Jingyu Wang

Jun 7, 11:45 AM - 1:45 PM ExHall F 643

In this paper, we explore video anomaly detection (VAD) from a fine-grained perspective, which aims not only to detect anomalous events but also to identify their specific categories. Due to the limited number of examples per category, existing methods either fail to handle intra-class variation across diverse contexts or struggle with inter-class confusion caused by shared visual primitives. To address these challenges, we propose a progressive cross-granularity learning paradigm that leverages coarse- and fine-grained labels in a complementary manner to progressively refine representations from generic anomaly patterns to category-specific semantics. Building on this paradigm, we develop Fine-VAD, a progressive alignment framework that aligns video features with supervision signals at multiple granularities. Extensive experiments on two benchmark datasets demonstrate that Fine-VAD achieves up to a 48% improvement in fine-grained anomaly classification, while maintaining state-of-the-art performance in coarse-grained anomaly detection. Notably, our paradigm generalizes well across diverse model architectures, offering an adaptable and effective solution for real-world fine-grained VAD.

View full details

Poster

All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark

Junjiang Wu ⋅ Liejun Wang ⋅ Zhiqing Guo

Jun 5, 4:00 PM - 6:00 PM ExHall A & F 644

With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an "all-in-one" trifunctional forensic solution: the regression head underlies an "intrinsic-extrinsic" consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments demonstrate that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content.

View full details

Poster

PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose

Jebastin Nadar ⋅ Simone Foti ⋅ Tolga Birdal

Jun 6, 11:45 AM - 1:45 PM ExHall F 646

Generative pose priors have recently emerged as a powerful tool for inference under occlusion or noise. Yet today’s strongest generative paradigm, *flow matching*, remains unused for human pose due to two fundamental barriers: the absence of a pre-trained flow prior and the non-Euclidean nature of articulated poses. We overcome both by introducing **PoseD-Flow**, a novel framework that unifies Riemannian Flow Matching (RFM) with training-free guidance for 3D human pose recovery. PoseD-Flow is composed of two contributions: (i) **PoseRFM**, the first RFM model of human pose, defined directly on the product manifold of joint rotations, and (ii) **Riemannian D-Flow**, a principled guidance mechanism that, by differentiating through its ODE sampling dynamics, conditions PoseRFM at inference without any task-specific training. Our theoretical analysis shows that the induced dynamics are shaped by data covariance and manifold curvature, yielding a bias toward realistic poses. Across pose completion, denoising, and inverse kinematics, PoseD-Flow establishes a new state of the art, particularly under noise, occlusion, and partial observations.
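
Because joint rotations do not live in a Euclidean space, interpolating between two poses has to follow the manifold rather than a straight line. The sketch below shows geodesic interpolation between two rotations represented as unit quaternions (slerp), the kind of path that flow matching on a rotation manifold can use in place of linear interpolation. The quaternion parameterization and the function name `slerp` are assumptions of this sketch; the abstract does not state how PoseRFM represents rotations.

```python
import math


def slerp(q0, q1, t):
    """Geodesic interpolation between unit quaternions q0 and q1 at t in [0, 1].

    Quaternions are 4-element sequences (w, x, y, z), assumed normalized.
    """
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1 = [-b for b in q1]
        dot = -dot
    dot = min(1.0, dot)        # guard against rounding slightly above 1
    theta = math.acos(dot)     # angle between the two quaternions
    if theta < 1e-8:
        return list(q0)        # endpoints coincide; nothing to interpolate
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(q0, q1)]
```

For a full body pose, the same interpolation would be applied independently to each joint, which corresponds to moving along a geodesic of the product manifold.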

View full details

Poster

PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery

Elkhan Ismayilzada ⋅ Yufei Zhang ⋅ Zijun Cui

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 646

Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN–Transformer backbone, we formulate Euler–Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.

View full details

Poster

Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition

Zhanbo Huang ⋅ Dingqiang Ye ⋅ Xiaoming Liu ⋅ Yu Kong

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 648

Existing set-based gait recognition methods achieve remarkable performance by capturing global semantic context. However, their order-invariant nature prevents them from modeling the fine-grained kinematic patterns that unfold over time. To unify the global and process-level representations, we propose GaitMax, a framework that captures both semantic context and kinematic motion. GaitMax leverages attention-based spatiotemporal modeling to dynamically represent detailed part-level trajectories. While this detailed representation is more powerful, it also captures more nuisance factors (e.g., clothing, viewpoint), leading to potential shortcuts. To mitigate this, we introduce CDLoss, a Conditional Decorrelation Loss that explicitly disentangles the gait embeddings from nuisance factors using vision-language supervision. This loss requires high-quality nuisance descriptions. We therefore construct GCaption, a new resource that provides natural language annotations for multiple gait datasets, moving beyond simple categorical labels. GCaption not only enables CDLoss but also serves as a foundation for future context-aware gait analysis. The superiority of GaitMax is validated through extensive experiments on multiple large-scale gait benchmarks. Models, code, and resources will be released upon publication.

View full details

Poster

PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement

bo zhao ⋅ Dan Guo ⋅ Junzhe Cao ⋅ Yong Xu ⋅ Bochao Zou ⋅ Tao Tan ⋅ Yue Sun ⋅ Zitong YU

Jun 6, 11:45 AM - 1:45 PM ExHall F 649

Remote photoplethysmography (rPPG) measurement enables non-contact physiological monitoring but suffers from accuracy degradation under head motion and illumination changes. Existing deep learning methods are mostly heuristic and lack theoretical grounding, limiting robustness and interpretability. In this work, we propose a physics-informed rPPG paradigm derived from the Navier–Stokes equations of hemodynamics, showing that the pulse signal follows a second-order dynamical system whose discrete solution naturally leads to a causal convolution, justifying the use of a Temporal Convolutional Network (TCN). Based on this principle, we design PHASE-Net, a lightweight model with three key components: 1) a Zero-FLOPs Axial Swapper module that swaps or transposes a few spatial channels to mix distant facial regions, boosting cross-region feature interaction without changing temporal order; 2) an Adaptive Spatial Filter that learns a soft spatial mask per frame to highlight signal-rich areas and suppress noise for cleaner feature maps; and 3) a Gated TCN, a causal dilated TCN with gating that models long-range temporal dynamics for accurate pulse recovery. Extensive experiments demonstrate that PHASE-Net achieves state-of-the-art performance and strong efficiency, offering a theoretically grounded and deployment-ready rPPG solution.
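
The claim that a discrete second-order system leads to a causal convolution can be checked numerically in a generic setting. The sketch below simulates a linear second-order recursion and shows that its output equals the input convolved with the system's impulse response, using only past and present samples. The recursion form, the coefficients, and the function names are assumptions chosen for illustration; they are not the hemodynamic model derived in the paper.

```python
def simulate(u, a1, a2):
    """Run x[t] = a1*x[t-1] + a2*x[t-2] + u[t] from zero initial conditions."""
    x = []
    for t in range(len(u)):
        prev1 = x[t - 1] if t >= 1 else 0.0
        prev2 = x[t - 2] if t >= 2 else 0.0
        x.append(a1 * prev1 + a2 * prev2 + u[t])
    return x


def impulse_response(n, a1, a2):
    """First n samples of the response to a unit impulse at t = 0."""
    return simulate([1.0] + [0.0] * (n - 1), a1, a2)


def causal_conv(u, h):
    """y[t] = sum over k = 0..t of h[k] * u[t-k]; no future input is used."""
    return [sum(h[k] * u[t - k] for k in range(t + 1)) for t in range(len(u))]


# Illustrative damped oscillator (hypothetical coefficients) and input signal.
u = [1.0, 0.5, -0.25, 0.0, 2.0, -1.0, 0.3, 0.0]
a1, a2 = 1.6, -0.8
x_recursive = simulate(u, a1, a2)
x_convolved = causal_conv(u, impulse_response(len(u), a1, a2))
```

The two outputs agree up to floating-point rounding, because the recursion is linear and time-invariant. A TCN replaces the fixed impulse response with learned causal kernels.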

View full details

Poster

CMR-RD: Long-Tailed Adaptive VLM for Explainable CMR Diagnosis

Yansong Li ⋅ Zhongxi Qiu ⋅ Yun Tian ⋅ Zheng jinyu ⋅ Shuo Li

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 651

Cardiac magnetic resonance (CMR) is the clinical gold standard for assessing cardiovascular diseases, but its interpretation relies on expert experience and remains challenging, particularly for identifying rare diseases. Existing automated methods lack interpretable reasoning processes, limiting clinical adoption. Although vision-language models (VLMs) possess basic visual understanding and text generation capabilities, they still lack verifiable reasoning chains in medical diagnosis and underperform on minority classes in long-tail distributions. To address these challenges, we propose CMR-RD, to our knowledge the first VLM for interpretable diagnosis in CMR, capable of generating explicit diagnostic chains aligned with imaging evidence. We construct a CMR dataset that reflects real-world clinical distributions, comprising five disease categories (including two rare conditions) plus normal controls. Building on this, the general-purpose VLM is aligned to medical and CMR semantics using large-scale medical vision–text data, and cold-start training is used to enhance its understanding of medical concepts and basic reasoning. To enhance reasoning and performance on rare samples, we propose Group Phase Policy Optimization (GPPO), which combines online multi-stage reinforcement learning (RL) with adaptive sampling. GPPO enables the model to proactively explore rare and underperforming classes, thereby effectively mitigating long-tail bias. Experiments demonstrate that CMR-RD achieves state-of-the-art accuracy and reasoning-chain correctness compared with medical and general VLM baselines, shows stronger recognition of rare categories, and exhibits higher data efficiency. These results provide an interpretable pathway for automated CMR diagnosis.

View full details

Poster

Visual Diffusion Models are Geometric Solvers

Nir Goren ⋅ Shai Yehezkel ⋅ Omer Dahary ⋅ Andrey Voynov ⋅ Or Patashnik ⋅ Daniel Cohen-Or

Jun 7, 3:30 PM - 5:30 PM ExHall A 651

In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Maximum Area Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.

View full details

Poster

Human Interaction-Aware 3D Reconstruction from a Single Image

Gwanghyun Kim ⋅ Junghun James Kim ⋅ Suh Yoon Jeon ⋅ Jason Park ⋅ Se Young Chun

Jun 6, 11:45 AM - 1:45 PM ExHall F 654

Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image.

View full details

Poster

OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

Haochen Chang ⋅ Pengfei Ren ⋅ Buyuan Zhang ⋅ Da Li ⋅ Tianhao Han ⋅ HaoYang ZHANG ⋅ Liang Xie ⋅ Hongbo Chen ⋅ Erwei Yin

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 657

Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Our code is available in the supplementary material, and the dataset will be released later.

View full details

Poster

Recovering Physically Plausible Human-Object Interactions from Monocular Videos

Dingbang Huang ⋅ Etienne Vouga ⋅ Qixing Huang ⋅ Georgios Pavlakos

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 658

In this paper, we present a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physical artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework that begins with a kinematic estimate and then refines it through a reinforcement learning (RL) policy trained to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that automatically identifies the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods.

View full details

Poster

One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Haoxiang Rao ⋅ Zhao Wang ⋅ Chenyang Si ⋅ Yan LYU ⋅ Yuanyi Duan ⋅ Fang Zhao ⋅ Caifeng Shan

Jun 6, 4:45 PM - 6:45 PM ExHall A & F 658

Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

View full details

Poster

TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures

Hyeongjin Nam ⋅ Daniel Jung ⋅ Kyoung Mu Lee

Jun 5, 10:45 AM - 12:45 PM ExHall A-F 660

Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human–object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human–object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.

View full details

Poster

Image Diffusion Preview with Consistency Solver

Fu-Yun Wang ⋅ Hao Zhou ⋅ Liangzhe Yuan ⋅ Sanghyun Woo ⋅ Boqing Gong ⋅ Bohyung Han ⋅ Ming-Hsuan Yang ⋅ Han Zhang ⋅ Yukun Zhu ⋅ Ting Liu ⋅ Long Zhao

Jun 7, 3:30 PM - 5:30 PM ExHall A 659

The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. In this paper, we propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via reinforcement learning, which enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality.
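
For readers unfamiliar with linear multistep methods, the sketch below shows the generic explicit update $x_{n+1} = x_n + h \sum_j b_j f_{n-j}$, which reuses derivative evaluations from earlier steps. The coefficients here are fixed to the classical Euler and two-step Adams-Bashforth values and applied to a toy ODE. The abstract describes a trainable solver, so its coefficients presumably differ; the function name `multistep_solve` and the test problem are assumptions of this sketch.

```python
import math


def multistep_solve(f, x0, t0, t1, n_steps, coeffs):
    """Explicit linear multistep integration of dx/dt = f(t, x).

    coeffs[j] weights the derivative from j steps back (coeffs[0] is the
    current one). Until enough history exists, a plain Euler step is used.
    """
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    history = []  # most recent derivative first
    for _ in range(n_steps):
        history.insert(0, f(t, x))
        history = history[: len(coeffs)]
        if len(history) < len(coeffs):
            slope = history[0]  # bootstrap with Euler
        else:
            slope = sum(b * fk for b, fk in zip(coeffs, history))
        x = x + h * slope
        t = t + h
    return x


def decay(t, x):
    """Toy ODE dx/dt = -x with exact solution exp(-t)."""
    return -x


exact = math.exp(-1.0)
euler = multistep_solve(decay, 1.0, 0.0, 1.0, 10, [1.0])
ab2 = multistep_solve(decay, 1.0, 0.0, 1.0, 10, [1.5, -0.5])
```

With the same ten function evaluations, the two-step variant lands closer to the exact value than Euler, which is the reason higher-order multistep solvers are attractive when the step budget is small.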

View full details

Poster

GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception

Jianqiang Xu ⋅ Gensheng Pei ⋅ 刘华峰 Liu ⋅ Yazhou Yao

Jun 6, 11:45 AM - 1:45 PM ExHall F 669

Reliable 3D perception from multi-view roadside sensors hinges on the robust fusion of camera and LiDAR data, a task complicated by geometric misalignments and sensor calibration errors. This paper presents GSV2X, a fusion framework that tackles these challenges through two core contributions. First, to achieve robustness against spatial uncertainty, we lift 2D image features into a unified Bird's-Eye-View (BEV) space by representing them as 3D Gaussian distributions. By incorporating learnable perturbations guided by camera geometry, our model explicitly accounts for potential calibration inaccuracies. Second, to maximize the synergy between modalities, we propose a new orthogonal fusion module. This module employs constrained attention to enforce orthogonality between camera and LiDAR features, effectively disentangling redundant information and promoting the learning of complementary representations. Extensive experiments on the challenging RCooper dataset demonstrate that GSV2X sets a new state-of-the-art in multi-view roadside perception and exhibits remarkable robustness in complex, real-world scenarios.
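As a rough illustration of what enforcing orthogonality between two feature vectors can mean, the sketch below computes a squared-cosine penalty and removes the component of a camera feature that is parallel to a LiDAR feature. This is a generic construction, not the constrained-attention module described in the abstract; the function names and the toy vectors are assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def orthogonality_penalty(cam, lidar, eps=1e-8):
    """Squared cosine similarity: near 0 when orthogonal, near 1 when parallel."""
    return dot(cam, lidar) ** 2 / (dot(cam, cam) * dot(lidar, lidar) + eps)


def remove_shared_component(cam, lidar, eps=1e-8):
    """Project `cam` onto the orthogonal complement of `lidar`."""
    scale = dot(cam, lidar) / (dot(lidar, lidar) + eps)
    return [c - scale * l for c, l in zip(cam, lidar)]


cam_feature = [1.0, 1.0]    # hypothetical camera feature
lidar_feature = [1.0, 0.0]  # hypothetical LiDAR feature

penalty_before = orthogonality_penalty(cam_feature, lidar_feature)
residual = remove_shared_component(cam_feature, lidar_feature)
penalty_after = orthogonality_penalty(residual, lidar_feature)
```

In this toy case the residual keeps only the part of the camera feature that the LiDAR feature does not already explain, which is the intuition behind learning complementary rather than redundant representations.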

View full details

Poster

MacTok: Robust Continuous Tokenization for Image Generation

Hengyu Zeng ⋅ Xin Gao ⋅ Guanghao Li ⋅ Yuxiang Yan ⋅ Jiaoyang Ruan ⋅ Ma Junpeng ⋅ Haoyu Albert Wang ⋅ Jian Pu

Jun 7, 3:30 PM - 5:30 PM ExHall A 672

Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce **MacTok**, a **M**asked **A**ugmenting 1D **C**ontinuous **Tok**enizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256$\times$256 and a state-of-the-art 1.52 at 512$\times$512 with SiT-XL, while reducing token usage by up to 64$\times$. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.

View full details

Poster

PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

Hong-Phuc Lai ⋅ Phong Nguyen ⋅ Anh Tran

Jun 7, 11:45 AM - 1:45 PM ExHall F 684

Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, a $10\times$ to $35\times$ speedup over state-of-the-art methods, while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.

View full details

Poster

Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models

Ning Han ⋅ Zhenyu Ge ⋅ Feng Han ⋅ Yuhua Sun ⋅ Chengqing Li ⋅ Jingjing Chen

Jun 7, 3:30 PM - 5:30 PM ExHall A 684

Concept erasure aims to remove harmful, inappropriate, or copyrighted content from text-to-image diffusion models while preserving non-target semantics. However, existing methods either rely on costly fine-tuning or apply coarse semantic separation, often degrading unrelated concepts and lacking adaptability to evolving concept sets. To alleviate this issue, we propose Graph-Guided Online Concept Erasure (GrOCE), a training-free framework that performs precise and adaptive concept removal through graph-based semantic reasoning. GrOCE models concepts and their interrelations as a dynamic semantic graph, enabling principled reasoning over dependencies and fine-grained isolation of undesired content. It comprises three components: (1) Dynamic Topological Graph Construction for incremental graph building, (2) Adaptive Cluster Identification for multi-hop traversal with similarity-decay scoring, and (3) Selective Edge Severing for targeted edge removal while preserving global semantics. Extensive experiments demonstrate that GrOCE achieves state-of-the-art performance on Concept Similarity (CS) and Fréchet Inception Distance (FID) metrics, offering efficient, accurate, and stable concept erasure without retraining.
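The multi-hop traversal with similarity-decay scoring mentioned above could look roughly like the sketch below. The scoring rule (product of edge similarities times a per-hop decay), the threshold, and the toy graph are assumptions chosen for illustration; the abstract does not specify GrOCE's actual formulas.

```python
from collections import deque


def related_concepts(graph, target, decay=0.5, threshold=0.2, max_hops=3):
    """Collect concepts reachable from `target` whose decayed score stays high.

    `graph` maps each concept to a dict of {neighbor: similarity in [0, 1]}.
    A neighbor's score is the parent's score times the edge similarity times
    `decay`, so influence shrinks with every hop.
    """
    scores = {target: 1.0}
    queue = deque([(target, 1.0, 0)])
    while queue:
        node, score, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor, similarity in graph.get(node, {}).items():
            candidate = score * similarity * decay
            # Keep a neighbor only if it clears the threshold and improves
            # on the best score recorded for it so far.
            if candidate >= threshold and candidate > scores.get(neighbor, 0.0):
                scores[neighbor] = candidate
                queue.append((neighbor, candidate, hops + 1))
    return scores


# Hypothetical semantic graph around a concept to erase.
concept_graph = {
    "snoopy": {"beagle": 0.9, "cartoon": 0.6},
    "beagle": {"dog": 0.8},
    "cartoon": {},
}

cluster = related_concepts(concept_graph, "snoopy")
```

Here "beagle" and "cartoon" fall inside the cluster while "dog" drops below the threshold after two hops, which mirrors the stated goal of isolating the target concept without degrading broader, unrelated semantics.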

View full details

Poster

Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

Jaekyun Ko ⋅ Dongjin Kim ⋅ Soomin Lee ⋅ Guanghui Wang ⋅ Tae Hyun Kim

Jun 7, 11:45 AM - 1:45 PM ExHall F 685

Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.

View full details

Poster

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Zehong Ma ⋅ Longhui Wei ⋅ Shuai Wang ⋅ Shiliang Zhang ⋅ Qi Tian

Jun 7, 3:30 PM - 5:30 PM ExHall A 690

Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-**DeCo**upled pixel diffusion framework. Following the intuition of decoupling the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of **1.62** (256×256) and **2.22** (512×512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison.

View full details

Poster

Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Yuqing Wang ⋅ Chuofan Ma ⋅ Zhijie Lin ⋅ Yao Teng ⋅ Lijun Yu ⋅ Shuai Wang ⋅ Jiaming Han ⋅ Jiashi Feng ⋅ Yi Jiang ⋅ Xihui Liu

Jun 7, 11:45 AM - 1:45 PM ExHall F 696

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional VAE tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. Instead of treating spatial positions atomically, CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions through attention, transforming an intractable $O(hwd)$ sequential generation problem into $O(T)$ parallel iterations where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures.
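To make the idea of masking "any dimension at any position" concrete, here is a minimal sketch that masks individual entries of an $h \times w \times d$ grid of discrete tokens rather than whole spatial positions. The mask id, the uniform random sampling, and the function name are assumptions for illustration and are not taken from the paper.

```python
import random

MASK = -1  # hypothetical mask id; real token ids are assumed non-negative


def mask_per_dimension(tokens, ratio, rng):
    """Mask a fraction `ratio` of individual (row, col, dim) entries.

    `tokens` is a nested list of shape [h][w][d]. Unlike position-level
    masking, a single spatial position may end up partially observed.
    """
    h, w, d = len(tokens), len(tokens[0]), len(tokens[0][0])
    coords = [(i, j, k) for i in range(h) for j in range(w) for k in range(d)]
    chosen = set(rng.sample(coords, round(ratio * len(coords))))
    masked = [
        [
            [MASK if (i, j, k) in chosen else tokens[i][j][k] for k in range(d)]
            for j in range(w)
        ]
        for i in range(h)
    ]
    return masked, chosen


rng = random.Random(0)
# Toy 2 x 2 grid with 4 discrete dimensions per position (16 entries total).
grid = [[[rng.randrange(16) for _ in range(4)] for _ in range(2)] for _ in range(2)]
masked_grid, masked_coords = mask_per_dimension(grid, 0.5, rng)
num_masked = sum(v == MASK for row in masked_grid for cell in row for v in cell)
```

A model trained on such inputs predicts the masked entries from the visible ones; at sampling time, many entries can be filled in per iteration, which is where the reduction from $hwd$ sequential steps to $T$ parallel iterations comes from.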

View full details

Poster

Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Shufan Li ⋅ Jiuxiang Gu ⋅ Kangning Liu ⋅ Zhe Lin ⋅ Zijun Wei ⋅ Aditya Grover ⋅ Jason Kuen

Jun 7, 11:45 AM - 1:45 PM ExHall F 699

Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as sparse representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2$\times$ speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.

View full details

Poster

Efficient and Training-Free Single-Image Diffusion Models

Haojun Qiu ⋅ Kiriakos N. Kutulakos ⋅ David B. Lindell

Jun 7, 11:45 AM - 1:45 PM ExHall F 704

We consider the problem of generating images whose internal structure, defined by the distribution of patches across multiple scales, matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
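For a finite set of clean patches and additive Gaussian noise, the optimal denoiser has a well-known closed form: the posterior mean is a softmax-weighted average of the patches, with weights determined by squared distance to the noisy input. The sketch below implements that formula under the assumptions of a uniform prior over patches and noise of the form $y = x + \sigma \epsilon$; it is a textbook construction and not the authors' accelerated implementation, and the function name is hypothetical.

```python
import math


def optimal_denoise(noisy, patches, sigma):
    """Return E[x | noisy] for x uniform over `patches`, noise ~ N(0, sigma^2 I).

    Weights are softmax(-||noisy - patch||^2 / (2 sigma^2)), computed with a
    max-shift for numerical stability.
    """
    logits = [
        -sum((n - p) ** 2 for n, p in zip(noisy, patch)) / (2.0 * sigma ** 2)
        for patch in patches
    ]
    shift = max(logits)
    weights = [math.exp(value - shift) for value in logits]
    total = sum(weights)
    return [
        sum(w * patch[k] for w, patch in zip(weights, patches)) / total
        for k in range(len(noisy))
    ]


# Toy dataset of two 2-pixel patches.
patch_set = [[0.0, 0.0], [1.0, 1.0]]

# Equidistant input: both patches get equal weight.
midpoint = optimal_denoise([0.5, 0.5], patch_set, sigma=0.3)
# Low noise near the first patch: the estimate collapses onto it.
near_first = optimal_denoise([0.1, 0.0], patch_set, sigma=0.05)
```

Under this noise model, the score at a noisy patch follows from the denoised estimate as (denoised - noisy) / sigma^2, which is why no network training is needed when the patch dataset is small enough to enumerate.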

View full details