CVPR 2026 Schedule

TUE 2 JUN

2 p.m.

Registration / Badge Pickup

(ends 8:00 PM)

WED 3 JUN

7 a.m.

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

8 a.m.

Tutorial:

Tom Builds, Tom Breaks: Hands-On Attacks and Defenses for Vision-Language Systems

(ends 12:00 PM)

Tutorial:

The Principles of Diffusion Models: Real-Time Continuous & Discrete Diffusion

(ends 12:00 PM)

Tutorial:

Edge AI in Action: Mastering On-Device Inference

(ends 12:00 PM)

Tutorial:

Towards Safe Multi-Modal Learning: Evolving Threats and Safety Solutions

(ends 12:00 PM)

Workshop:

Workshop on "Bitter Lessons"

(ends 12:00 PM)

Workshop:

Generative AI for XR and Identity-based Applications

(ends 12:30 PM)

Workshop:

GRAIL-V: Grounded Retrieval & Agentic Intelligence for Vision-Language

(ends 12:00 PM)

Workshop:

The 3rd Workshop on Human Motion Generation - New Perspective on Simulation, Animation, and VR applications

(ends 12:00 PM)

Workshop:

LatinX in Computer Vision Research Workshop

(ends 12:00 PM)

Workshop:

Multimodal Foundation Models for Biomedicine: Challenges and Opportunities

(ends 12:00 PM)

Workshop:

The 2nd Workshop on Multimodal Spatial Intelligence

(ends 12:00 PM)

Workshop:

On Sensor Vision Workshop

(ends 12:45 PM)

Workshop:

22nd Workshop on Perception Beyond the Visible Spectrum

(ends 12:00 PM)

Workshop:

The 2nd International Workshop & Challenge on Subtle Visual Computing @CVPR 2026

(ends 12:00 PM)

Workshop:

1st Workshop on Video World Models: Interaction, Memory, and Efficiency

(ends 12:00 PM)

Workshop:

Women in Computer Vision

(ends 12:00 PM)

Workshop:

Workshop on World Models Meet Active Sensing and Closed-Loop Planning

(ends 12:00 PM)

Workshop:

The 5th Explainable AI for Computer Vision (XAI4CV) Workshop

(ends 12:30 PM)

Workshop:

PHAROS AI Factory for Medical Imaging & Healthcare

(ends 12:30 PM)

Workshop:

Workshop on Agentic AI for Visual Media

(ends 5:00 PM)

Workshop:

Bridging Vision, Language, and Action: What’s Missing in Actionable Visual Perception for Robotics

(ends 5:00 PM)

Workshop:

Autonomous Understanding Through Open-world Perception and Integrated Language models for On-road Tasks

(ends 5:00 PM)

Workshop:

Foundation Models for Autonomous Driving

(ends 6:00 PM)

Workshop:

From Lab Demos to Daily Tasks: Embodied Intelligence in the Wild

(ends 5:00 PM)

Workshop:

13th Workshop on Fine-grained Visual Categorization

(ends 5:00 PM)

Workshop:

4th Workshop on Vision Based Industrial Inspection

(ends 5:00 PM)

Workshop:

The 1st Workshop on Deployment of Foundation Models for Embodied AI

(ends 5:00 PM)

8:15 a.m.

Workshop:

Workshop on Vision-based Assistants in the Real-World

(ends 1:00 PM)

8:20 a.m.

Workshop:

Multimodal Alignment for a Pluralistic Society

(ends 12:30 PM)

8:25 a.m.

Workshop:

AI for Creative Visual Content Generation, Editing and Understanding

(ends 12:35 PM)

Workshop:

IPA: Interactive Physical AI Workshop

(ends 1:00 PM)

8:30 a.m.

Workshop:

AI for Content Creation

(ends 12:30 PM)

Workshop:

The 3rd AI for Visual Arts Workshop and Challenges

(ends 12:30 PM)

Workshop:

The 5th DataCV Workshop and Challenge

(ends 12:30 PM)

Workshop:

The 5th Workshop on Federated Learning for Computer Vision

(ends 11:59 AM)

Workshop:

Generative AI for Sign Language

(ends 12:30 PM)

Workshop:

Sense of Space: Multi-Sensory Modeling for Embodied Intelligence

(ends 5:00 PM)

Workshop:

Visual General Intelligence

(ends 6:00 PM)

Workshop:

AI4RWC: The 2nd International Workshop on Vision Intelligence for Real-world Challenges

(ends 12:30 PM)

8:45 a.m.

Workshop:

Computational Cameras and Displays

(ends 5:00 PM)

Workshop:

Third Joint Egocentric Vision (EgoVis) Workshop

(ends 5:50 PM)

8:50 a.m.

Workshop:

AERO-HPR: Human Perception and Recognition in Aerial Surveillance

(ends 12:30 PM)

Workshop:

2nd Workshop on Photorealistic 3D Head Avatars

(ends 12:30 PM)

Workshop:

Efficient Deep Learning for Computer Vision

(ends 3:30 PM)

9 a.m.

Tutorial:

Accelerated Diffusion Models: From Theory to Interactive World Models

(ends 12:15 PM)

Workshop:

The 3rd Workshop on AI for Content Generation, Quality Enhancement and Streaming

(ends 1:00 PM)

Workshop:

The 22nd Embedded Vision Workshop

(ends 12:30 PM)

Workshop:

The 3rd Workshop on Foundation Models for Medical Vision

(ends 12:30 PM)

Workshop:

12th Workshop on Medical Computer Vision

(ends 4:00 PM)

Workshop:

Urban Scene Modeling: Structured, Semantic, and Synthetic 3D Habitats

(ends 6:00 PM)

9:15 a.m.

Workshop:

Workshop on Autonomous Driving

(ends 6:00 PM)

10 a.m.

Break:

Coffee Break

(ends 11:00 AM)

1 p.m.

Tutorial:

Principled Interpretability in Vision Models: From Mechanistic Understanding to Interpretable Models by Design

(ends 5:00 PM)

Tutorial:

Monte Carlo physical simulation

(ends 5:00 PM)

Tutorial:

From Perception to Simulation: The Emergence of World Models in Multi-modal Reasoning

(ends 5:00 PM)

Tutorial:

Building GenAI based Simulation Environment for End-to-End Autonomous Driving

(ends 5:00 PM)

Workshop:

GigaBrain Challenge 2026: Workshop on World Models Empowering Vision Language Action Model

(ends 6:00 PM)

Workshop:

The Second CVPR Workshop on Foundation and Large Vision Models in Remote Sensing (MORSE)

(ends 5:45 PM)

Workshop:

The 2nd 3D-LLM/VLA Workshop: Bridging Language, Vision and Action in 3D Environments

(ends 6:15 PM)

Workshop:

10th Affective & Behavior Analysis in-the-wild

(ends 6:00 PM)

Workshop:

Authenticity & Provenance in the age of Generative AI

(ends 6:00 PM)

Workshop:

Auto-Annotation with Expert-Crafted Guidelines

(ends 5:00 PM)

Workshop:

Cognitive Foundations for Multimodal Models

(ends 5:00 PM)

Workshop:

Computer Vision for the Built World

(ends 6:00 PM)

Workshop:

Computer Vision with Small Data: Beyond Scale -- Toward Data-Efficient Dynamically-Aware Video Intelligence

(ends 5:00 PM)

Workshop:

Computer Vision for Biomechanics Workshop

(ends 6:00 PM)

Workshop:

Sixth Workshop on Neural Architecture Search

(ends 5:00 PM)

Workshop:

DataMFM: Emerging Directions in Data for Multimodal Foundation Models

(ends 6:05 PM)

Workshop:

End-to-End 3D Learning

(ends 6:00 PM)

Workshop:

3rd Workshop on Efficient and On-Device Generation (EDGE), CVPR 2026

(ends 6:00 PM)

Workshop:

1st Workshop on Multi-Agent Robotic Systems: Scaling with Compositional Intelligence

(ends 5:00 PM)

Workshop:

The 5th Workshop on “What is Next in Multimodal Foundation Models?”

(ends 6:00 PM)

Workshop:

Workshop on Multimodal Human Motion Analysis

(ends 6:00 PM)

Workshop:

The 1st Workshop on Monitoring the World through an Imperfect Lens

(ends 5:00 PM)

Workshop:

2nd Workshop on Multimodal Sign Language Recognition

(ends 5:30 PM)

Workshop:

The 3rd MetaFood Workshop (MTF)

(ends 6:00 PM)

Workshop:

Machine Unlearning for Vision

(ends 6:00 PM)

Workshop:

OpenSUN3D: 6th Workshop on Open-World 3D Scene Understanding with Foundation Models

(ends 5:00 PM)

Workshop:

Synthetic & Adversarial ForEnsics

(ends 6:00 PM)

Workshop:

3rd Workshop on ScanNet++ Novel View Synthesis and 3D Semantic Understanding Challenge

(ends 5:30 PM)

Workshop:

The 7th International Workshop and CVML Challenge on Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture

(ends 6:00 PM)

Workshop:

The 1st Workshop on Vision for Intelligent Task Assistants

(ends 5:00 PM)

1:15 p.m.

Workshop:

Second Workshop on Foundation and Generative Models in Biometrics

(ends 6:00 PM)

1:20 p.m.

Workshop:

Rediscovering Intelligence: Can AI Still Learn from Humans?

(ends 5:30 PM)

1:25 p.m.

Workshop:

The 2nd Workshop on Test-time Scaling for Computer Vision

(ends 5:30 PM)

1:30 p.m.

Tutorial:

3D Human Mesh Modeling and Recovery from RGB and LiDAR

(ends 4:45 PM)

Workshop:

Spatial Intelligence for Cultural Heritage

(ends 5:45 PM)

1:45 p.m.

Workshop:

The 5th Workshop on Transformers for Vision and Multimodal AI

(ends 5:40 PM)

2 p.m.

Workshop:

The 1st Workshop on AI-assisted Long Video Creation

(ends 5:00 PM)

3 p.m.

Break:

Coffee Break

(ends 4:00 PM)

THU 4 JUN

7 a.m.

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

7:30 a.m.

Workshop:

3D Geometry Generation for Scientific Computing (2nd Edition)

(ends 12:30 PM)

Workshop:

2nd Workshop on Knowledge-Intensive Multimodal Reasoning

(ends 12:30 PM)

8 a.m.

Tutorial:

The Full Stack of Physical AI: Simulation, Foundation Models, and Edge Deployment for Next-Generation Robotics Applications

(ends 12:00 PM)

Tutorial:

Recent Advances in AI for Medical Imaging: Progress, Challenges, and Future Directions

(ends 12:00 PM)

Tutorial:

Extending Computer Vision to Hidden Objects: A Tutorial on Millimeter-Wave Imaging and Reconstruction of Occluded Scenes

(ends 12:00 PM)

Tutorial:

Computer Vision at Scale: Multi-Camera Tracking, Calibration, and Event Detection for Checkout-Free Retail

(ends 12:00 PM)

Workshop:

Third Workshop for Learning 3D with Multi-View Supervision

(ends 12:30 PM)

Workshop:

6th Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics

(ends 12:00 PM)

Workshop:

Workshop on Any-to-any Multimodal Learning

(ends 12:00 PM)

Workshop:

The 3rd Workshop on New Trends in AI-Generated Media and Security

(ends 12:30 PM)

Workshop:

2nd Workshop on Computer Vision for Children

(ends 12:30 PM)

Workshop:

The 5th Workshop on Computer Vision in the Wild: Towards Unified Multimodal Agents For Reasoning in the Wild

(ends 12:00 PM)

Workshop:

The Second Workshop on the Evaluation of the Generative Foundation Models

(ends 12:00 PM)

Workshop:

Geometry-Free Novel View Synthesis and Controllable Video Models

(ends 12:30 PM)

Workshop:

Humans of Generative AI

(ends 12:00 PM)

Workshop:

The 1st Workshop on Low‑Level Vision Frontiers with Generative AI, Preference Optimization, and Agentic Systems

(ends 12:30 PM)

Workshop:

6th Omnidirectional Computer Vision Workshop

(ends 12:10 PM)

Workshop:

Open-World Vision

(ends 12:00 PM)

Workshop:

From Perception to Persuasion: Challenges and Advances in Misinformation Detection in Society

(ends 12:20 PM)

Workshop:

SPAR-3D: Security, Privacy, and Adversarial Robustness in 3D Generative Vision Models

(ends 12:00 PM)

Workshop:

Trustworthy, Robust, Uncertainty-Aware, and Explainable Visual Intelligence and Beyond

(ends 12:30 PM)

Workshop:

The 8th UG2+ Workshop and Challenge: Bridging the Gap between Computational Photography and Visual Perception

(ends 12:00 PM)

Workshop:

Unified Robotic Vision with Cross-Modal Sensing and Alignment

(ends 12:30 PM)

Workshop:

9th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues

(ends 12:00 PM)

Workshop:

11th Workshop on Computer Vision and Multimodal Microscopy Image Analysis

(ends 5:00 PM)

Workshop:

The Seventh Annual Embodied Artificial Intelligence Workshop

(ends 5:00 PM)

Workshop:

2nd Workshop on Agents in Interaction, from Humans to Robots

(ends 5:00 PM)

Workshop:

Mobile AI workshop and associated challenges, 6th edition

(ends 5:00 PM)

Workshop:

Multi-Agent Embodied Intelligent Systems Meet Agentic-AI era: Opportunities, Challenges and Futures

(ends 5:00 PM)

Workshop:

11th New Trends in Image Restoration and Enhancement Workshop and Challenges

(ends 6:00 PM)

Workshop:

Video Generative Models: Benchmarks and Evaluation

(ends 5:00 PM)

Workshop:

2nd Workshop on Video Large Language Models

(ends 5:00 PM)

Workshop:

Workshop on Visual Concepts

(ends 5:00 PM)

Workshop:

Sight and Sound

(ends 5:00 PM)

8:10 a.m.

Workshop:

4th Workshop on Maritime Computer Vision

(ends 12:30 PM)

8:30 a.m.

Tutorial:

Analytic understanding of diffusion models

(ends 5:00 PM)

Workshop:

6th Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling

(ends 12:30 PM)

Workshop:

Exploring the Next Generation of Data

(ends 12:00 PM)

Workshop:

Personalization in Generative AI Workshop

(ends 12:30 PM)

Workshop:

PhysHuman: Physically Grounded Human Perception and Modeling

(ends 12:30 PM)

Workshop:

Safe Artificial Intelligence for All Domains

(ends 12:30 PM)

8:45 a.m.

Workshop:

VizWiz Grand Challenge: Interpreting Images and Videos Taken by Blind People

(ends 12:20 PM)

Workshop:

4th Workshop on Generative Models for Computer Vision

(ends 5:00 PM)

Workshop:

9th Multimodal Learning and Applications Workshop

(ends 6:30 PM)

8:55 a.m.

Workshop:

Multimodal Algorithmic Reasoning Workshop

(ends 12:30 PM)

9 a.m.

Tutorial:

All You Need To Know About Self-Driving

(ends 5:30 PM)

Workshop:

The Eighth Workshop on Precognition: Seeing through the Future

(ends 12:15 PM)

Workshop:

The 6th Workshop of Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents

(ends 5:00 PM)

Workshop:

12th IEEE International Workshop on Computer Vision in Sports

(ends 5:30 PM)

Workshop:

EarthVision: Large Scale Computer Vision for Remote Sensing Imagery

(ends 5:00 PM)

Workshop:

Embodied Reasoning in Action: Workshop and Challenge on Embodied Reasoning for Robotic Manipulation

(ends 5:00 PM)

Workshop:

2nd Workshop on Human-Interactive Generation and Editing

(ends 5:30 PM)

Workshop:

How Do Vision Models Work?

(ends 5:00 PM)

10 a.m.

Break:

Coffee Break

(ends 11:00 AM)

1 p.m.

Tutorial:

Foundations and Frontiers of Watermarking: Algorithms, Multimodal Extensions, Benchmarks, and Authenticity Frameworks

(ends 5:00 PM)

Tutorial:

The Road to Convergence: Evolution of Unified Multimodal Models

(ends 5:00 PM)

Tutorial:

From Perception to Action: Building Efficient and Deployable Robot Intelligence Pipelines with Open-Source Edge AI Toolkits

(ends 5:00 PM)

Workshop:

1st Workshop on Generative 3D Reconstruction

(ends 6:00 PM)

Workshop:

Medical Reasoning with Vision Language Foundation Models

(ends 6:00 PM)

Workshop:

4D Digital Twins: Real-to-Sim-to-Real for Physical AI

(ends 6:00 PM)

Workshop:

2nd Workshop on 4D Vision: Modeling the Dynamic World

(ends 6:00 PM)

Workshop:

Artificial Intelligence for Space

(ends 5:00 PM)

Workshop:

2nd Workshop on GenAI for Storytelling

(ends 5:00 PM)

Workshop:

Big Model Adaptation In Computer Vision

(ends 5:00 PM)

Workshop:

CVPR 2026 Biometrics Workshop

(ends 5:00 PM)

Workshop:

Bridging AI and Medical Reality: Computer Vision for Real-world Clinical Translation

(ends 6:00 PM)

Workshop:

Computer Vision × Education: Building a Cross‑Community Agenda for Multimodal Vision in Classrooms

(ends 6:00 PM)

Workshop:

CV4Science: Using Computer Vision for the Sciences

(ends 5:45 PM)

Workshop:

Domain Generalization: Evolution, Breakthroughs, and Future Horizons (2nd Edition)

(ends 6:00 PM)

Workshop:

The 2nd CVPR Workshop on Foundation Models Meet Embodied Agents

(ends 6:00 PM)

Workshop:

The 7th International Workshop on Eye and Gaze in Computer Vision

(ends 5:00 PM)

Workshop:

Eighth Workshop on Image Matching: Local Features and Beyond

(ends 6:00 PM)

Workshop:

1st Workshop on Journey to the Awards: Generative AI for Movie-Grade Video Production (J2A), CVPR 2026

(ends 6:00 PM)

Workshop:

The 2nd Workshop on Multi-Modal Reasoning for Agentic Intelligence

(ends 5:00 PM)

Workshop:

4D World Models: Bridging Generation and Reconstruction

(ends 6:00 PM)

Workshop:

Third Workshop on Simulation for Autonomous Driving

(ends 5:00 PM)

Workshop:

ScaleBot: The First Workshop on Scalable Robot Learning Systems

(ends 5:00 PM)

Workshop:

The 3rd Workshop on Synthetic Data for Computer Vision

(ends 5:30 PM)

1:15 p.m.

Workshop:

Second Workshop on Skilled Activity Understanding, Assessment & Feedback Generation

(ends 6:00 PM)

1:30 p.m.

Workshop:

The Third Workshop on Anomaly Detection with Foundation Models

(ends 5:30 PM)

Workshop:

Appearance Understanding and Generation

(ends 5:30 PM)

Workshop:

Pixel-level Video Understanding in the Wild Challenge

(ends 4:30 PM)

Workshop:

Visual Anomaly and Novelty Detection - 4th Edition

(ends 6:00 PM)

2 p.m.

Workshop:

See the World in a Different Light: Physical Appearance Modeling and Relighting in the Age of Generative AI

(ends 5:40 PM)

Workshop:

6th International Workshop on Long-form Video Understanding, Generation and Action

(ends 5:30 PM)

3 p.m.

Break:

Coffee Break

(ends 4:00 PM)

FRI 5 JUN

7 a.m.

Findings Poster Session 1 [7:00-8:30]

BLMT-Stereo: Breaking the Local Minima Trap of Iterative Stereo Matching

SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction

ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

GauSDF: Signed Distance Embedded Gaussian Surfels for 3D Reconstruction

4D E-SloMo: 4D Reconstruction for High Speed Scene using a Hybrid RGB-Event Multi-View System

MADrive: Memory-Augmented Driving Scene Modeling

OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution

SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization with Multi-Video 4D Gaussian Splatting

AR4D: Autoregressive 4D Generation from Monocular Videos

CLLAP: Contrastive Learning-based LiDAR-Augmented Pretraining for Enhanced Radar-Camera Fusion

Point2Gaussian: Point-Cloud-to-Gaussian Conversion for Efficient 3D Scene Rendering

Speed3R: Sparse Feed-forward 3D Reconstruction Models

FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views

Generalizable Human Gaussian Splatting via Multi-view Semantic Consistency

HEDA: Hyperbolic-Euclidean Dual Adaptation for Robust Real-World Point Cloud Completion

WildAni4D: Towards 4D Animal Mesh Reconstruction

Instant Colorization of Gaussian Splats

Dynamic Scene Decomposition Beyond Moving Objects for High-Fidelity 3D Reconstruction in Autonomous Driving

LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images

FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting

Affine Bases for Affine Spaces

Improving Densification in 3D Gaussian Splatting for High-Fidelity Rendering

OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design

SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration

GEAR: GEometry-Motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting

GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes

2D Triangle Splatting for Direct Differentiable Mesh Training

ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph

3D Gaussian Splatting for Annular Dark Field Scanning Transmission Electron Microscopy Tomography Reconstruction

IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes

Stream3D: Streaming Zero-Shot 3D Instance Segmentation with Multi-View Noise Mask Filtering and Manifold Refining

Active Exploration for Sparse Visual Localization

GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis

3DFA: Aligning the Features Between Point Cloud and Query Image for Scene-Specific Visual Localization

Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment

From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks

VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction

AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization

Object Pose Transformer: Unifying Unseen Object Pose Estimation

SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes

CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery

PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting

Three-Step Conditional Diffusion 3D Reconstruction for Light-Field Microscopy

LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates

Learning a Particle Dynamics Model with Real-World Videos

Softmax-GS: Generalized Gaussians Learning When to Blend or Bound

G2I: Transitioning a Generalized Monocular Depth Estimation Model to In-Domain Metric Depth Prediction

3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework

HiDiGen: Hierarchical Diffusion for B-Rep Generation with Explicit Topological Constraints

RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video

RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes

WGS: Watertight Geometry Standardization for Scalable 3D Generation

Self-Evolving 3D Scene Generation from a Single Image

Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator

Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

UniVerse3D: Emerging Properties of Unified Multimodal Models in 3D Understanding and Generation

HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation

Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning

Defending CLIP via Noise-Induced Feature Dynamics for Training-Free, Zero-shot Adversarial Robustness

Jailbreaking Frontier Foundation Models Through Intention Deception

NSGuard: Null-Space Guided Robust Watermarking for Data Copyright Protection in Customized Generation

A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing

Phantasia: Context-Adaptive Backdoors in Vision Language Models

BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models

CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion

BadVLM: Towards Efficient and Resilient Backdoor Attacks on Large Vision-Language Models

Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

When Data is Scarce, Learn to Adapt: Robust Federated Learning via Adversarial Meta-Optimization

Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training

DRA: Structure-Preserving Backdoor Erasure via Diagnosing, Recalibrating, and Adapting

APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition

Cognitive Attack Detection in Augmented Reality (CADAR): A Neuro-Symbolic Approach with Particle Filtering on Perception Graphs

On Evaluating Stateful Defence Models against Query-Based Black-Box Attacks

Optimizing Certified Radius of Zero-shot Composed Image Retrieval via Text Guidance

When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers

Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

Phantom: A Unified Face-Swap Deepfake Protection Framework with Latent and Spatial Constraints

Tap, Scan, Exploit: The Hidden Vulnerabilities of Everyday QR Codes

DeepFakeShield: A Proactive Defense Against Malicious Face Swapping

MDG: Masked Denoising Generation for Multi-Agent Behavior Modeling in Traffic Environments

LiDAR-to-4D Radar Synthesis for Building Large-Scale Tensor Datasets

SurfaceGS: Dynamic Surface Gaussian Splatting for Urban Driving Scenes

JACoP: Joint Alignment for Compliant Multi-Agent Prediction

Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

Physics-Informed Reward Framework for Vision-Language Driven Safe Autonomous Driving

HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes

VESPA: Open-World Auto-Labeling for 3D Object Detection in Autonomous Driving

IRL-VLA: Vision-Language-Action Training via Reward World Model

KnowMTP: A Knowledge-Guided Framework for Multi-Agent Trajectory Prediction in Autonomous Driving

MapGPT: A Vision-Language Model for Large-Scale High-Definition Map Generation

RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies

PAVE: An End-to-End Dataset for Production Autonomous Vehicle Evaluation

RoadTones: Tone Controllable Text Generation from Road Event Videos

GRADE: Guiding Realistic Autonomous Driving with Adaptive Trajectory Evolution

SurfelOcc: Self-supervised Occupancy Prediction via 2D Surfel Splatting

dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

Learning Vision-Language-Action World Models for Autonomous Driving

AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

Pseudo-Expert Regularized Offline RL for End-to-End Autonomous Driving in Photorealistic Closed-Loop Environments

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

CoRT-Predictor: Chain of Risk Thought Autoregressive Trajectory Predictor for Autonomous Driving

C^2T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic–Vehicle Coordination

PEARL: A Lightweight Prompt-based Feature Interpreter Framework for Real-Time, Anonymous, and Heterogeneous Collaborative Perception

Variable-View Diffusion with Geometric Uncertainty Unlocks LiDAR Upsampling

RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection

On the Feasibility and Opportunity of Autoregressive 3D Object Detection

See Tomorrow, Act Today: Foresight-Driven Autonomous Driving

Spatial Transcriptomics as Images for Large-Scale Pretraining

DiffGradCAM: A Class Activation Map Using the Full Model Decision to Solve Unaddressed Adversarial Attacks

Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI

Fingerprint Fragment Expansion using Image Outpainting Approach based on Spectral Normalization PatchGAN

Improving Autoregressive Image Generation Through Coarse-to-Fine Token Prediction

Intelligent Photo Retouching with Language Model-Based Artist Agents

Guided Lensless Polarization Imaging

Blockwise Divide-and-Aggregate for Image Restoration using Diffusion Priors

Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras

Adaptive Continuous Kernel Networks for Image Reconstruction from Non-Uniform Sampling

FreqAdapt: Frequency-Adaptive Processing for RAW Object Detection

Stability and Non-Local Modeling in Hybrid Convolution–Transformer Networks for Snapshot Hyperspectral Reconstruction

Breaking Degradation Coupling: A Structural Entropy–Guided Decoupled Framework and Benchmark for Infrared Enhancement

Fast Generative DeOcclusion for Visual Geometry and Robotics

FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

Unlocking Single-View Constraints for Efficient Camera Relocalization with Keypoint-Level Multi-View Geometric Consistency in Training

Evolve Vision-Language-Action Model into an Agent with On-the-fly Tool-use

Retrieval-VLA: Training-Free In-Context Adaptation for Vision-Language-Action Models

Revisiting Articulated Parts Perception in Robot Manipulation

Re^2MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

Teleoperation, Simulation, or Human Video? Data Utilization Law for Robot Manipulation

ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-Based 3D Scene Understanding

RoboTransfer: Controllable Geometry-Consistent Video Diffusion for Manipulation Policy Transfer

Switch-JustDance: Benchmarking Whole-Body Motion Tracking Controllers Using a Commercial Console Game

OminiMAG-SLAM: Unified Online Dual Graph Optimization for Multi-Agent Gaussian SLAM

ReaAct: Bridging Robotic Reasoning and Action Generation Toward Real-World Spatial Generalization

Learning Multi-Task Robot Trajectory Segmentation from Visual and Kinematic Streams

LP3: LLM-based Potential Prediction Policy for Object Navigation using a Scene-Object Semantic Map

RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL

CoTFly: Making UAVs Think Where to Fly Next Through Visual Chain-of-Thought Reasoning

RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction

A1: Adaptive Truncated Vision-Language-Action Model from Affordance to Action

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

RACE-6D: Real-time Accurate Coarse-to-finE Object 6D Pose Transformer

MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training

Riemannian Score-Based Diffusion for Language-Conditioned Grasp and Affordance Detection

DINO-VO: Learning Where to Focus for Enhanced State Estimation

Temporally-Smooth Global Bundle Adjustment for Real-Time Dense Visual SLAM

Masked Next-Scale Prediction For Self-Supervised Scene Text Recognition

iTCTSL: Interpretable Tropical Cyclone Track and Intensity Forecasting via Task Sensitive Learning

Machine Vision-Oriented Appearance Design: Generate Natural And Robust Textures For 3D Meshes

Bridge Your Fields: MeteoNet for Efficient Non-Uniform Meteorological Field Reconstruction

Catalyst: Out-of-Distribution Detection via Elastic Scaling

LUMINA: Learning and Understanding of Multimodal Information for Narrative and Affect-based Virality Prediction

LOOPE: Learnable Optimal Patch Order for Positional Encoders in Vision Transformers

Watermarking Matters for Deepfake Detection: A Proactive Method for Detecting Forgeries under Conventional Attacks

CTFS: Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels

TPTransformer: Tensor–Tensor Product Transformer for Hyperspectral Image Super-Resolution

Co-Adaptive Graph Learning Through Coupled Spectral Refinement for 3D Anomaly Detection

The Mechanics of CNN Filtering with Rectification

AndroidLong: LLM-based Android Agents Struggle with Long Looping Tasks

Multimodal Large Language Models as Image Classifiers

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark

Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes

Vision Language Models are Confused Tourists

LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting

Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs

AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models

Name That Part: 3D Part Segmentation and Naming

Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior

Memorization in 3D Shape Generation: An Empirical Study

Shape and Texture Recognition in Large Vision-Language Models

U-SEG: Uncertainty in SEGmentation - A systematic multi-variable exploration

UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

GOVTrack: Towards Generative Open-Vocabulary Multi-Object Tracking

Towards Text-Guided Attribute-Disentangled Multimodal Representation Learning

The DeepSpeak Dataset

AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering

MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition

FinChart-Multimodal: A Dataset for Context-Injected Financial Chart Understanding with Aligned OHLCV Time Series

THEval. Evaluation Framework for Talking Head Video Generation

PolyReal: A Benchmark for Real-World Polymer Science Workflows

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

PureSpace: A Benchmark for Abstract Spatial Reasoning in Vision-Language Models

Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models

Beyond 3D Geometry: M3FD, a Large-Scale Dataset and Benchmark for Multimodal 3D Perceptual Understanding

Paper2SysArch: Structure‑Constrained System Architecture Generation from Scientific Papers

WildRelight: A Real-World Dataset and Benchmark for Single-Image Relighting

EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

VibraVerse: A Large-Scale Geometry–Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning

When Harmful Content Goes Invisible: Unveiling Perception Failure of LVLMs with CAMOUHARMTI

Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering

Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection

The Unwritten Benchmark: A New Challenge for Multimodal Machine Learning in Abstract Perceptual Reasoning

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

MathAll: A Real-World Benchmark for Mathematical Reasoning and Cross-Modal Understanding Evaluation in Omni-MLLMs

Safe-LLaVA: A Privacy-Preserving Vision Language Dataset and Benchmark for Biometric Safety

DR-DPO: Dual-Regularized DPO for Efficient Dataset Condensation

DrawingVQA: A Real-World Benchmark for Multi-Depth Visual–Textual Reasoning on Construction Drawings

SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

SuperGlasses: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

VEBench: Benchmarking Large Multimodal Models for Real-world Video Editing

CrowdVerse: A Bidirectional Reality-Calibrated Benchmark for Crowd Understanding and Simulation

Can Language Models Understand mmWave Data? Benchmarking Large Language Models for mmWave Radar-Based Human Understanding

From Static Snapshots to Dynamic Trajectories: Evaluating and Enhancing the Learning Pathways of Multimodal Large Language Models

Evaluating Dataset Watermarking for Fine-Tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

BMD-45: A Large-Scale CCTV Vehicle Detection Dataset for Urban Traffic in Developing Cities

SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Unlocking ImageNet’s Multi-Object Nature: Automated Large-Scale Multilabel Annotation

CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios

Seeing the Abstract: A Benchmark for Visual-Only Metaphor Understanding in Multimodal Large Language Models

Cross-Dimensional Forgery Pattern Extraction for Generalizable Forgery Localization Framework

Reliable Test-time Adaptation Via Evidential Uncertainty Modeling in Vision–Language Models

SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers

Do LLMs and VLMs Share Reasoning Neurons? Evidence and Mechanisms of Cross-Modal Transfer

Debiased One-Shot NAS Via Density-Aware Sampling

PSLIF: A Primary-Supplementary LIF Neuron for Spiking Neural Networks

SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense

Eigen-Value: Efficient Domain-Robust Data Valuation Via Eigenvalue-Based Approach

In2CLR: Joint Intra-Inter Curriculum Learning with Review for Degraded Fake Image Detection

HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

Latent Domain Modeling Improves Robustness to Geographic Shifts

Any-Class Presence Likelihood for Robust Multi-Label Classification with Abundant Negative Data

VideoMatGen: PBR Materials through Joint Generative Modeling

TransKV: A Data-Driven Pruning Method for Large Foundation Models

Rich Feature Learning via Diversification

Image Classification Using CNN-QNN Hybrid Model with Optimized Correlated Features

Dual Strategies for Test-Time Adaptation

CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

FLToM: Robust Federated Learning with Theory-of-Mind Structure

FedCVC: Federated Primal-Dual Learning with Client-Driven Virtual Compensation for Mitigating Dual Drift

Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration

PHATE-Net: Differentiable Pseudotime Learning for Trustworthy Disease Trajectories in PET

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Qinling-GFFE: A Novel Station-based Benchmark and Graph-Frequency Fusion Enhancer for Precipitation Forecasting

Deep Feedback ConvNets by Embedding the Working Memory Module for Image Classification

Channel Correlation Loss for Binary Neural Networks

MegAD: An Expert in Meta-Learning Guided Few-Shot Anomaly Detection

SGST-Transformer: A Spherical Geometry-Aware Spatio-Temporal Transformer for 360° Video Saliency Prediction

From Drops to Grid: Noise-Aware Spatio-Temporal Neural Process for Rainfall Estimation

AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens

Texture-Guided Multiscale Cross-Modal Fusion for AI-Generated Image Quality Assessment

Res2SPDNet: Multi-Granularity SPD Matrix Residual Learning for Signal Classification

From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity

MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing

Spectral-Aware Adaptive Convolution for Fine-Grained Cross-View Visual Localization

Context-Aware Semantic Segmentation via Stage-Wise Attention

MFI-ResNet: Efficient ResNet Architecture Optimization via MeanFlow Compression and Selective Incubation

AlphaMerging: Orthogonal Subspace Projection of Task Vectors to Reduce Task Interference for Multi-Task Model Merging

Rethinking Compact (<1M) Vision Models: Balancing Accuracy and Speed through Multi-Path Atrous Convolutions

Hi3Doc: Hierarchical Tri-Level Representations for Multimodal Long-Document Understanding

LongDocSpan: Extending LVLMs for Long Document Understanding

M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

InstructTable: Improving Table Structure Recognition Through Instruction

SciPostLayoutTree: A Dataset for Structural Analysis of Scientific Posters

Efficient Document Parsing via Parallel Token Prediction

ChartAgent: A Chart Understanding Framework with Tool Integrated Reasoning

RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

FREE-Switch: Frequency-Based Dynamic LoRA Switch for Style Transfer

FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning

What and Where to Adapt: Structure–Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters

A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

GM-Skip: Metric-Guided Transformer Block Skipping for Efficient Vision-Language Models

Dyna-ViT: Parameter-Free Pre-Encoder Token Pruning for Efficient Vision Transformers

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

MaMe: Matrix-Based Token Merging

Tiny Inference-Time Scaling with Latent Verifiers

DaMN: Deleting and Migrating Normalization Layers from Transformers

Enriching Knowledge Distillation with Cross-Modal Teacher Fusion

UNIFORM: Unifying Knowledge from Large-scale and Diverse Pre-trained Models

ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

MipKV: A Sparsify-then-Recover Paradigm for Accelerating Large Vision-Language Model Pre-Filling

Mix-to-Max: Optimizing Data Mixtures for Peak Vision-Language Efficiency

INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

SLAD: Shared LoRA Adapters for Task Specific Distillation

D4C: Data-Free Quantization for Contrastive Language-Image Pre-Training Models

ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

MPM: Mutual Pair Merging for Efficient Vision Transformers

Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

AlignFL: Adaptive Learning and Intelligent Generation of Networks for Federated Learning

Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment

Positive Divide and Negative Discrepancy: A New Perspective on Multi-Label Logit Distillation

Beyond Accuracy: An Empirical Study of Perception Stability in Multimodal Large Language Models

Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams

M^3A Policy: Mutable Material Manipulation Augmentation Policy through Photometric Re-rendering

AOMGen: Photoreal, Physics-Consistent Demonstration Generation for Articulated Object Manipulation

Environmental Understanding Vision-language Model for Embodied Agent

DEGround: An Effective Baseline for Ego-centric 3D Visual Grounding With a Homogeneous Framework

Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

Plug-and-Think: Structured Reasoning for Vision–Language–Action Models

World Model Robustness via Surprise Recognition

PlanGS: Active 3D Gaussian Reconstruction with Real-Time Planning

A Simple Framework for Visual Navigation

Event-Based Optical Flow Leveraging Precise Event Timing

Generative Event Pretraining with Foundation Model Alignment

HelixTrack: Event‑Based Tracking and RPM Estimation of Propeller-like Objects

PEPR: Privileged Event-based Predictive Regularization for Domain Generalization

Unleashing the Potential of Event-Based Stereo Via Coarse-to-Fine Bio-Inspired Regression

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

An Interpretable Alzheimer's Disease Diagnosis Model via Gray Matter Attention Guided Counterfactual Reasoning

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models

(ends 8:30 AM)

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

8:30 a.m.

Remarks:

Welcome & Awards

(ends 9:00 AM)

8:45 a.m.

Poster Setup:

Poster Setup

(ends 9:15 AM)

9 a.m.

Break:

Courtesy Break

(ends 9:15 AM)

9:15 a.m.

Oral Session 1A: Multimodal Vision [9:15-10:30]

Orals 9:15-10:30

[9:15] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

[9:30] ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning

[9:45] ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

[10:00] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

[10:15] ViT^3: Unlocking Test-Time Training in Vision

(ends 10:30 AM)

Oral Session 1B: Visual Security [9:15-10:30]

Orals 9:15-10:30

[9:15] Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

[9:27] Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets

[9:40] RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

[9:52] LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

[10:05] NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization

[10:17] Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization

(ends 10:30 AM)

Oral Session 1C: Efficient Reasoning [9:15-10:30]

Orals 9:15-10:30

[9:15] Advancing Image Classification with Discrete Diffusion Classification Modeling

[9:27] Does YOLO Really Need to See Every Training Image in Every Epoch?

[9:40] Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

[9:52] NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

[10:05] Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species

[10:17] Rethinking Dataset Distillation: Hard Truths about Soft Labels

(ends 10:30 AM)

Oral Session 1D: Computational Imaging [9:15-10:42]

Orals 9:15-10:42

[9:15] Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

[9:27] Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions

[9:40] MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging

[10:05] Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

[10:17] UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision

[10:30] Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

(ends 10:42 AM)

10:15 a.m.

Poster Setup:

Poster Setup

(ends 10:45 AM)

10:45 a.m.

Demonstration:

Demos Session 1

(ends 12:45 PM)

Art Exhibition [10:45-6:00]

(ends 6:00 PM)

Poster Session 1 & Exhibit Hall [10:45-12:45]

Posters 10:45-12:45

A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization

ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

ViT^3: Unlocking Test-Time Training in Vision

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets

RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization

Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization

Advancing Image Classification with Discrete Diffusion Classification Modeling

Does YOLO Really Need to See Every Training Image in Every Epoch?

Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species

Rethinking Dataset Distillation: Hard Truths about Soft Labels

Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions

MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging

Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision

AVGGT: Rethinking Global Attention for Accelerating VGGT

ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction

LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection

MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

JRM: Joint Reconstruction Model for Multiple Objects without Alignment

Inferring Compositional 4D Scenes without Ever Seeing One

FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation

Complet4R: Geometric Complete 4D Reconstruction

Unblur-SLAM: Dense Neural SLAM for Blurry Inputs

Learning Compact 3D Representations from Feed-Forward Novel View Synthesis

Fast Spatial Tracking with Visual Geometry Transformer

How Much 3D Do Video Foundation Models Encode?

MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes

RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations

Long-Tail Internet Photo Reconstruction

Emergent Outlier View Rejection in Visual Geometry Grounded Transformers

Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning

MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation

ConsistCompose: Unified Multimodal Layout Control for Image Composition

A Training-Free Style-Personalization via SVD-Based Feature Decomposition

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

SplitFlux: Learning to Decouple Content and Style from a Single Image

FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation

EmoStyle: Emotion-Driven Image Stylization

Text-Image Conditioned 3D Generation

IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework

AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

Reasoning Diffusion for Unpaired Test Time Out-of-distribution Text-Image to Video Generation

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

MTA: Multimodal Task Alignment for BEV Perception and Captioning

β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers

FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

EvoGraph-R1: Self-Evolving Multimodal Knowledge Hypergraphs for Agentic Retrieval

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks

Black-Box Domain Adaptation for Object Detection with Retention-Driven Knowledge Compression

Decoupled and Reusable Adaptation for Efficient Cross-Modal Transfer

Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy

Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation

Label-Free Cross-Task LoRA Merging with Null-Space Compression

Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation

GeCo: Geometry-Consistent Regularization for Domain Generalized Semantic Segmentation

Event-based Motion Deblurring with Unpaired Data

Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks

Event-based Visual Deformation Measurement

Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

SpikeTrack: High-performance and Energy-efficient Event-Based Object Tracking with Spiking Neural Network

Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus

Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios

Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control

CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy

DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning

GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching

Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux

Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning

Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

ORV: 4D Occupancy-centric Robot Video Generation

DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning

Language-Free Generative Editing from One Visual Example

Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models

LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

CompBench: Benchmarking Complex Instruction-guided Image Editing

Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories

Learning Personalized Photographic Style from Pairwise User Preferences

CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing

Efficient Weighted Sampling via Score-based Generative Models

MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments

REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting

Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer

FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors

Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals

Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists

DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum

WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

Retrieve-to-Restore: Efficient All-in-One Image Restoration with a Retrieval-Based Degradation Bank

MRI Contrast Enhancement Kinetics World Model

ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation

Rethinking Knowledge Transfer in Image Quality Assessment: A Perceptual Preference Structure Alignment Perspective

ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models

White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy

LF-BVN: Blind-View Network for Self-Supervised Light Field Denoising

rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training

Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices

Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment

Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos

MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation

OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement

VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering

SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation

Diffusion-Based Native Adversarial Synthesis for Enhanced Medical Segmentation Generalization

CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric

Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

Towards Human-Imperceptible Backdoor Attacks on Text-to-Image Diffusion Models

TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models

DualMirage: Hunting Stealthy Multimodal LLM Agents via CAPTCHAs with Contour and Adversarial Illusions

Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

Source Models Leak What They Shouldn’t: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

A Unified Perspective on Adversarial Membership Manipulation in Vision Models

Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack

OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning

Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering

Bootstrapping Multi-view Learning for Test-time Noisy Correspondence

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning

Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception

DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Models

Vision-Speech Models: Teaching Speech Models to Converse about Images

EMMA: Extracting Multiple physical parameters from Multimodal Data

MMGait: Towards Multi-Modal Gait Recognition

OSMO: Open-vocabulary Self-eMOtion Tracking

MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data

Scalable Trajectory Generation for Whole-Body Mobile Manipulation

Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping

Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos

ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation

DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner

MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

Scaling Multi-Identity Consistency for Image Customization via Multi-to-Multi Matching Paradigm

NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

Functional Mean Flow in Hilbert Space

Benchmarking Single-Factor Physical Video-to-Audio Generation

UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Refaçade: Editing Object with Given Reference Texture

Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Not All Birds Look The Same: Identity-Preserving Generation For Birds

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Clothe and Pose

FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement

The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

Bidirectional Normalizing Flow: From Data to Noise and Back

ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions

Are Image-to-Video Models Good Zero-Shot Image Editors?

FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters

Unified Latent Space for Understanding and Generation via Semantic Auto-encoder

AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness

Thermal Diffusion Matters: Infrared Spatial-Temporal Video Super-Resolution through Heat Conduction Priors

TextOVSR: Text-Guided Real-World Opera Video Super-Resolution

VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution

GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance

SignPR: A Progressive Vector-Quantized Diffusion Framework for Sign Language Production

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision

Geometric Neural Distance Fields for Learning Human Motion Priors

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Decoupled Generative Modeling for Human-Object Interaction Synthesis

LiveGesture: Streamable Co-Speech Gesture Generation Model

HandX: Scaling Bimanual Motion and Interaction Generation

MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters

FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

ProjFlow: Projection Sampling with Flow Matching for Zero‑Shot Exact Spatial Motion Control

Correspondence-Attention Alignment for Multi-View Diffusion Models

GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models

MatMart: Material Reconstruction of 3D Objects via Diffusion

Region-Adaptive Sampling for Diffusion Transformers

Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models

Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation

ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization

Heterogeneous Decentralized Diffusion Models

Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding

Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model

RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs

Fast SceneScript: Fast and Accurate Language‑Based 3D Scene Understanding via Multi‑Token Prediction

PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis

HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models

ARC Is a Vision Problem!

Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions

S^2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance

Learning Multi-View Spatial Reasoning from Cross-View Relations

Exploring Spatial Intelligence from a Generative Perspective

Physical Object Understanding with a Physically Controllable World Model

QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding

AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers

SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models

Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval

Language-driven Fine-grained Retrieval

MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

RetFormer: Multimodal Retrieval for Enhancing Image Recognition

DREAM: Document Recognition with Explicit Adaptive Memory

RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval

POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval

Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

RiskProp: Collision-Anchored Self-Supervised Risk Propagation For Early Accident Anticipation

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning

TRCoRSurg: Temporal-Relational Co-Reasoning for Surgical Video Triplet Recognition

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

One-Shot Flow, Any-Time Frame: A Bidirectional Warping Framework for Event-Based Video Frame Interpolation

TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection

PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition

Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

PhaseWin Search Framework Enable Efficient Object-Level Interpretation

Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability

From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification

Interpretable and Steerable Concept Bottleneck Sparse Autoencoders

C-LaV: Conditional Latent Velocity Field Denoising for Weather-Robust LiDAR Place Recognition

Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction

AnyPcc: Compressing Any Point Cloud with a Single Universal Model

CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion

Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis

C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion

PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding

FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting

Neural Distribution Prior for LiDAR Out-of-Distribution Detection

DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs

Concept-Aware Batch Sampling Improves Language-Image Pretraining

HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Vocabulary Scaling Law: Tuning Open-vocabulary Predictors for Their Openness

Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation

ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection

Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

Dynamic Logits Adjustment and Exploration for Test-Time Adaptation in Vision Language Models

CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

GenMatter: Perceiving Physical Objects with Generative Matter Models

Bidirectional Query-Driven Generation of Parametric CAD Sketch

The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning

Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Repurposing 3D Generative Model for Autoregressive Layout Generation

CAD-Refiner: A Unified Framework for CAD Generation and Iterative Editing

A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images

Global Information Thresholding for Sufficient and Necessary Circuits

PrivateEyes: Gaze-Preserving Anonymization for Data Sharing

From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal

Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models

pH-Strips for Selective Forgetting: A Blunt but Fast Diagnostic Baseline for Machine Unlearning

Decoupling Defense Strategies for Robust Image Watermarking

Unsafe2Safe: Controllable Image Anonymization for Downstream Utility

Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression

DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport

FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction

FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift

FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation

VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models

Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding

HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Token Warping Helps MLLMs Look from Nearby Viewpoints

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients

Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models

The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery

LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis

Coordinate Denoising for Non‑Equilibrium Molecular Representation Learning

Plug-and-Play Incomplete Multi-View Clustering via Janus-Faced Affinity Learning with Topology Harmonization

Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding

Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation

Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling

Harnessing the Power of Foundation Models for Accurate Material Classification

Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features

ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving

TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation

Think Before You Drive: World Model-Inspired Multimodal Grounding

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios

GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving

ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving

KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System

FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration

NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks

Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

Consistent Instance Field for Dynamic Scene Understanding

CLP: A Real-World Dataset of Contaminated Lens Protectors for Robust Semantic Segmentation

ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions

SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation

Reinforcing Video Reasoning Segmentation to Think Before It Segments

VideoMaMa: Mask-Guided Video Matting via Generative Prior

Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models

The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation

SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning

Is Parameter Isolation Better for Prompt-Based Continual Learning?

Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Affordance-First Decomposition for Continual Learning in Video–Language Understanding

Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning

Elastic Weight Consolidation Done Right for Continual Learning

On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation

Talking Together: Synthesizing Co-Located 3D Conversations from Audio

InfinityHuman: Towards Long-Term Audio-Driven Human Animation

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars

One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning

SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving

CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars

PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

PhysHead: Simulation-Ready Gaussian Head Avatars

ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction

FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures

Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction

Reallocating Attention Across Layers to Reduce Multimodal Hallucination

VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models

Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression

Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination

CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models

Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control

Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation

Spatia: Video Generation with Updatable Spatial Memory

Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context

EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses

CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

Physical Simulator In-the-Loop Video Generation

Refracting Reality: Generating Images with Realistic Transparent Objects

Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos

EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation

Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification

Mind the Gap: Transferring Labels to Align Object Detection Datasets

SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification

Tri-Modal Fusion Transformers for UAV-based Object Detection

View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification

RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection

X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection

FVBench: Benchmarking Deepfake Video Detection Capability of Large Multimodal Models

AKCMamba-YOLO: Selective State Space Models For Real-Time Object Detection

When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

Your One-Stop Solution for AI-Generated Video Detection

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model

MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models

One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Reflection Separation from a Single Image via Joint Latent Diffusion

MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

MatLat: Material Latent Space for PBR Texture Generation

VMonarch: Efficient Video Diffusion Transformers with Structured Attention

DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers

Transition Matching Distillation for Fast Video Generation

Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features

UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair

Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift

DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting

VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label

V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception

SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

A Causal Marriage between VLM and IRM from Understanding to Reasoning

Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

Learning to Select Visual Tools from Experience

Agile Deliberation: Concept Deliberation for Subjective Visual Classification

Tea-Adapter: Teacher Adapter for Efficient Conditional Generation

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule

FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks

SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction

RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting

Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors

Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting

Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction

DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction

Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction

DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes

GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression

A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction

RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection

Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment

ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain

Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field

SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting

NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather

SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Unlocking Token Rewards via Training-Free Reward Attribution

MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

When to Think and When to Look: Uncertainty-Guided Lookback

StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

Understanding Counting Mechanisms in Large Language and Vision-Language Models

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs

VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image

GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

Beyond Depth: Evaluating the Width-centric Reasoning Capability of MLLMs

GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension

CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

LoPrune: Efficient Data Pruning for LoRA-Based Fine-Tuning of Vision Transformer

Multi-Scale Local Speculative Decoding for Image Generation

Globscope: Toward a Global View of the Loss Landscape

RADAR: VQ-VAE Decoder of VAR is a Good Student for Restoring Against Degradation by Acceleration

Beyond Single Solution: Multi-Hypothesis Deep Unfolding Network for Image Compressive Sensing

FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers

MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model

Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments

Discovering Adaptive Task Dependencies for Efficient Multi-Task Representation Compression

OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data

Perceptual Neural Video Compression with Color Separation and Rank Chain

Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation

GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization

PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence

UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization

VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment

Watch and Learn: Learning to Use Computers from Online Videos

OneThinker: All-in-one Reasoning Model for Image and Video

Incentivizing Versatile Video Reasoning in MLLMs via Data-Efficient Reinforcement Learning

Act2See: Emergent Active Visual Perception for Video Reasoning

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

ViLoMem: Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

ReMoT: Reinforcement Learning with Motion Contrast Triplets

Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues

Semantic-Guided Global-Local Collaborative Prompt Learning for Few-Shot Class Incremental Learning

Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition

One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

Data-Centric Meta-Learning for Robust Few-Shot Generalization

Bridging the Modality Gap in Compositional Zero-Shot Learning via Sparse Alignment and Unimodal Memory Bank

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

WaDi: Weight Direction-aware Distillation for One-step Image Synthesis

Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models

Beyond Soft Label: Dataset Distillation via Orthogonal Gradient Matching

BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting

RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion

Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement

Lens Component Deletion based on Differentiable Ray Tracing

X-band Radar Non-Line-of-Sight Imaging

3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion

UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Polarization State Tracing for Reflection Removal and Color-Consistent Reconstruction

GFRRN: Explore the Gaps in Single Image Reflection Removal

Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation

Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference

Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images

HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction

SO(3)-Equivariant ViT-Adapter for Data-Efficient Zero-Shot Sim-to-Real Indoor Panoramic Depth Estimation

Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

XPaintNet: An eXtreme Lightweight Framework for Stereoscopic Conversion without Inpainting Network

MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation

LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation

3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

The Midas Touch for Metric Depth

Lifting Unlabeled Internet-level Data for 3D Scene Understanding

ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS

PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

WonderZoom: Multi-Scale 3D World Generation

SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

Extend3D: Town-Scale 3D Generation

Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image

MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation

CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation

CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion

LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning

MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation

Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback

GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

VISTA: A Test-Time Self-Improving Video Generation Agent

Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition

Hierarchical Action Learning for Weakly-Supervised Action Segmentation

Gamba: Mamba-based graph convolutional network with dynamic graph topology learning for action recognition

Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors

PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation

TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies

Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera

The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations

DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors

Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision

NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training

Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation

Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images

Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation

Batch Loss Score for Dynamic Data Pruning

Teacher-Guided Routing for Sparse Vision Mixture-of-Experts

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

MangoBench: A Benchmark for Multi-Agent Goal-Conditioned Offline Reinforcement Learning

iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception

MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents

Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection

UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting

Unlearning without Forgetting: Securely Removing Targeted Concepts from Large-Scale Vision-Language Open-Vocabulary Detectors

UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling

S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation

NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection

The Missing Point in Vision Transformers for Universal Image Segmentation

PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting

The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA

Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation

High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy

GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation

Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes

Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding

Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation

Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization

MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision

Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency

Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

Global Underwater Geolocation from Time-Lapse Polarization Imagery

Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments

PRUE: A Practical Recipe for Field Boundary Segmentation at Scale

SARMAE: Masked Autoencoder for SAR Representation Learning

LNEM: Lunar Neural Elevation Model

A Polarized Reflection and Material Dataset of Real World Objects

LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning

All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference

A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs

DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion

Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting

Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model

RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation

CamPI: Physical Adversarial Examples through Camera Power Signal Injection

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

GraspALL: Adaptive Structural Compensation from Illumination Variation for Robotic Garment Grasping in Any Low-Light Conditions

Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer

Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning

Chain of World: World Model Thinking in Latent Motion

Scalable Feature Matching via State Space Modeling and Sparse Correlation

Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

GeoDexGrasp: Geometry-aware Generation for Data-efficient and Physics-plausible Dexterous Grasping

Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment

From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning

Language-Grounded Decoupled Action Representation for Robotic Manipulation

Learning to Act Robustly with View-Invariant Latent Actions

ORBIT: Benchmarking SfM in the Wild with 360° Video

SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking

Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution

Envisioning the Future, One Step at a Time

FlowFM: Advancing Dark Optical Flow Estimation with Flow Matching

Drift-Resilient Temporal Priors for Visual Tracking

An Efficient Token Compression Framework for Visual Object Tracking

No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors

From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

Momentum Memory for Knowledge Distillation in Computational Pathology

Modeling the Brain’s Grammar: ROI-Guided fMRI Pretraining for Transferable and Interpretable Vision Decoding

Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding

X-WIN: Building Chest Radiograph World Model via Predictive Sensing

fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model

TIM: Temporal Decoupling with Iterative Mutual-Refinement Model for Longitudinal Radiology Report Generation

Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding

Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning

BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery

Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic

CMR-RD: Long-Tailed Adaptive VLM for Explainable CMR Diagnosis

Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis

FBTA: Enabling Single-GPU End-to-End Gigapixel WSI Classification with Feature Bridging and Translation Alignment

Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances

Egocentric Visibility-Aware Human Pose Estimation

Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation

OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

Recovering Physically Plausible Human-Object Interactions from Monocular Videos

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures

SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild

CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction

Gaussian-Mixture Latent Flow for Stochastic 3D Human Motion Prediction

SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals

Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding

Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection

(ends 12:45 PM)

Break:

Coffee

(ends 11:30 AM)

11 a.m.

Art Gallery Tour with Curator and Artists [11:00-11:30]

(ends 11:30 AM)

11:30 a.m.

Speed Mentorship:

Speed Mentorship Session

(ends 1:00 PM)

1 p.m.

Oral Session 2A: 3D Reconstruction [1:00-2:15]

Orals 1:00-2:15

[1:00] MAMMA: Markerless Accurate Multi-person Motion Acquisition

[1:12] Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos

[1:25] PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

[1:37] SAM 3D Body: Robust Full-Body Human Mesh Recovery

[1:50] SAM 3D: 3Dfy Anything in Images

[2:02] SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

(ends 2:15 PM)

Oral Session 2B: Materials & Lighting [1:00-2:15]

Orals 1:00-2:15

[1:00] 3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects

[1:12] GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport

[1:25] Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy

[1:37] PhyGaP: Physically-Grounded Gaussians with Polarization Cues

[1:50] PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction

[2:02] SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

(ends 2:15 PM)

Oral Session 2C: Gaussian Splatting & Reconstruction [1:00-2:15]

Orals 1:00-2:15

[1:00] Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow

[1:12] MeshSplatting: Differentiable Rendering with Opaque Meshes

[1:25] Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting

[1:37] RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting

[1:50] Selfi: Self-improving Reconstruction Engine via 3D Geometric Feature Alignment

[2:02] Z-Order Transformer for Feed-Forward Gaussian Splatting

(ends 2:15 PM)

Oral Session 2D: Spatio-Temporal Reconstruction [1:00-2:15]

Orals 1:00-2:15

[1:00] 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction

[1:12] Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

[1:25] FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement

[1:37] Residual Primitive Fitting of 3D Shapes with SuperFrusta

[1:50] SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models

[2:02] SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

(ends 2:15 PM)

1:30 p.m.

Art Panel [1:30-2:30]

(ends 2:30 PM)

2:15 p.m.

Break:

Courtesy Break

(ends 2:30 PM)

2:45 p.m.

Keynote:

Programmable Biology: Generative AI for Molecular Design

Simon Kohl

(ends 3:45 PM)

3:30 p.m.

Poster Setup:

Poster Setup

(ends 4:00 PM)

4 p.m.

Poster Session 2 & Exhibit Hall w/ Coffee Break [4:00-6:00]

Posters 4:00-6:00

MAMMA: Markerless Accurate Multi-person Motion Acquisition

Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

SAM 3D Body: Robust Full-Body Human Mesh Recovery

SAM 3D: 3Dfy Anything in Images

SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects

GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport

Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy

PhyGaP: Physically-Grounded Gaussians with Polarization Cues

PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow

MeshSplatting: Differentiable Rendering with Opaque Meshes

Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting

RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting

Selfi: Self-improving Reconstruction Engine via 3D Geometric Feature Alignment

Z-Order Transformer for Feed-Forward Gaussian Splatting

4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement

Residual Primitive Fitting of 3D Shapes with SuperFrusta

SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models

SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Affostruction: 3D Affordance Grounding with Generative Reconstruction

MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction

Unified Primitive Proxies for Structured Shape Completion

ART: Articulated Reconstruction Transformer

SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings

S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching

Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection

D-Prism: Differentiable Primitives for Structured Dynamic Modeling

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

LaS-Comp: Zero-shot 3D Completion with Latent–Spatial Consistency

Pano360: Perspective to Panoramic Vision with Geometric Consistency

EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion

OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation

MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion

OneHOI: Unifying Human-Object Interaction Generation and Editing

GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

TV2TV: A Unified Framework for Interleaved Language and Video Generation

Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models

PureCC: Pure Learning for Text-to-Image Concept Customization

Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation

Yume1.5: A Text-Controlled Interactive World Generation Model

PosterReward: Unlocking Accurate Evaluation for High-Quality Graphic Design Generation

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and VLM-Guided Optimization

FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits

LVLM-Aided Alignment of Task-Specific Vision Models

DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment

PG-VTON: Single-Pass Training-Free Virtual Try-On via Patch-Guided Reference Alignment

Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment

Scaling Spatial Intelligence with Multimodal Foundation Models

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization

AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning

FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation

Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement

Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation

Bridging Domain Expertise and Generalization for Performance Estimation

Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

Bridging Domains through Subspace-Aware Model Merging

DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

Scaling Dense Event-Stream Pretraining from Visual Foundation Models

Event Stream Filtering via Probability Flux Estimation

AIMDepth: Asymmetric Image-Event Mamba for Monocular Depth Estimation

Time-Specialized Event-Image Alignment for Blur-to-Video Decomposition

eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting

Unsupervised 3D Motion Estimation Using Event Camera

Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

ModularAgent: A Task-Aware Modular Framework for Joint Optimization of Multimodal Large Language Models and World Models

AstraNav-Memory: Contexts Compression for Long Memory

Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models

ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands

ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction

Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space

MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy

ShreddingNet: Coarse-to-Fine Restoration for Multi-Source Shredded Manuscripts

Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance

Radiance Meshes for Volumetric Reconstruction

Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field

CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis

ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes

Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness

Splatent: Splatting Diffusion Latents for Novel View Synthesis

ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation

Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons

DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification

Gyro-based Deep Video Deblurring

Residual Diffusion Bridge Model for Image Restoration

MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration

Rectifying Latent Space for Generative Single-Image Reflection Removal

Towards Generalized Multimodal Homography Estimation

Edit-aware RAW reconstruction

Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration

HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation

NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration

MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior

FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model

SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation

BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation

Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation

CROWn: A Unified Framework for Anti-Aliased Downsampling and Phase-Calibrated Fusion in 3D Medical Segmentation

Rethinking Box Supervision: Bias-Free Weakly Supervised Medical Segmentation

Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging

TANGO: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization

MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

Breaking Multimodal LLM Safety via Video-Driven Prompting

When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters

RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces

A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models

UniDef: Universal Defense Against Unauthorized Image Manipulation

Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals

Rethinking Cross-Modal Anchor Alignment for Mitigating Error Accumulation

SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts

Omni-MMSI: Toward Identity-attributed Social Interaction Understanding

Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization

MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning

Illuminating Visual Identity in Universal Multimodal Embeddings

Anti-Degradation Lifelong Multi-View Clustering

The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts

Efficient and High-Fidelity Omni Modality Retrieval

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis

HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction

Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance

UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation

ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos

Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator

TouchDream: 3D Object Completion through Imagined Touch

ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation

TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction

Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects

MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis

LogCD: Local-to-global Consistency Distillation for Few-step Image Generation

EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

LightMover: Generative Light Movement with Color and Intensity Controls

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing

CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

EchoVDiff: Cardiac-Cycle Echocardiography Video Generation from Arbitrary Frame

Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets

Frequency-Aware Flow Matching for High-Quality Image Generation

STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows

MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision

Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models

OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing

PositionIC: Unified Position and Identity Consistency for Image Customization

P-Flow: Prompting Visual Effects Generation

Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization

SURF: Signature-Retained Fast Video Generation

The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection

Lynx: Towards High-Fidelity Personalized Video Generation

VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis

ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering

Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching

OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation

First Frame Is the Place to Go for Video Content Customization

Scaling Zero-Shot Reference-to-Video Generation

MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

VDOT: Efficient Unified Video Creation via Optimal Transport Distillation

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

RunawayEvil: Jailbreaking the Image-to-Video Generative Models

MultiAnimate: Pose-Guided Image Animation Made Extensible

Translating Signals to Languages for sEMG-Based Activity Recognition

Open the Motion Door: Atomic Motion Decomposition and Recomposition for Open-Vocabulary Motion Generation

Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment

MotionHiFlow: Text-to-Motion via Hierarchical Flow Matching

LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

GVIS: Generative Vector Image Steganography

MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding

GeoRK2: Geometry-Guided Runge–Kutta Integration for Diffusion Transformer Acceleration

Test-time Sparsity for Extreme Fast Action Diffusion

Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers

A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation

When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models

ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

A³: Towards Advertising Aesthetic Assessment

GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention

VL-RouterBench: A Benchmark for Vision–Language Model Routing

CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions

LangField4D: Learning Identity-Adaptive and Spatio-Temporal Continuous 4D Language Fields for Dynamic Scenes

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning

Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models

Geometry-Guided 3D Visual Token Pruning for Video-Language Models

Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Direction-aware 3D Large Multimodal Models

CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining

Beyond Global Similarity: Multi-Conditional Retrieval for Fine-Grained Cross-Modal Understanding

Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval

What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F1

Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence

PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance

Memory Matters: Boosting Training-Free Zero-Shot Temporal Action Localization with a Learnable Lookup Table

TVHighlights: LLM-Guided Human-Free Collaborative Training for Video Highlight Detection in Movies and TV Dramas

Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing

Reinforcing Structured Chain-of-Thought for Video Understanding

FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding

MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos

Learning Effective Sign Features without Text for Gloss-free Sign Language Translation

META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding

GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos

Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers

Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts

CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers

Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

Make it SING: Analyzing Semantic Invariants in Classifiers

Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization

UniCorrn: Unified Correspondence Transformer Across 2D and 3D

Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion

TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection

Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning

Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis

R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment

Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation

PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning

U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR

Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models

Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network

CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

Role-SynthCLIP: A Role-Play Driven Diverse Synthetic Data Approach

BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation

PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models

MapRoute: Precise-Concept Erasing Mappers via Semantic Routing

PhotoFramer: Multi-modal Image Composition Instruction

Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models

Frequency-domain Manipulation for Face Obfuscation

Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models

POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse

Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks

Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure

SPDMark: Selective Parameter Displacement for Robust Video Watermarking

Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning

FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning

FedSST: Rethinking Fair Federated Graph Learning under Structural Shift

GDFA: Geometry-Driven Federated Unlearning with Directional Task Vector Alignment

FedARA: Resource-adaptive Low-rank Personalized Federated Learning via Anchor-driven Representation Alignment on Heterogeneous Edge Devices

InterRVOS: Interaction-Aware Referring Video Object Segmentation

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection

Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding

GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

MeToM: Metadata-Guided Token Merging for Efficient Video LLMs

Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models

Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models

TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis

Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities

CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization

Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

LRDUN: A Low-Rank Deep Unfolding Network for Efficient Spectral Compressive Imaging

Neural Collapse in Test-Time Adaptation

CLEX: Complementary Label Exchange Learning for Noisy Facial Expression Recognition

TruckDrive: Long-Range Autonomous Highway Driving Dataset

Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction

Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Beyond Rule-Based Agents: Active Markov Games for Realistic Multi-Agent Interaction in Autonomous Driving

Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation

ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes

CrackSSM: Reviving SSMs for Crack Segmentation via Dynamic Scanning

BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation

RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation

Scene-Centric Unsupervised Video Panoptic Segmentation

Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation

GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment

Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity

Dual-Estimator: Decoupling Global and Local Semantic Shift for Drift Compensation in Class-Incremental Learning

Continual Distillation of Teachers from Different Domains

Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance

Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning

AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning

An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning

HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning

U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation

StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars

FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs

WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering

EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization

DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models

VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation

CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation

STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction

CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image

OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar

Globally Optimal Pose from Orthographic Silhouettes

AvatarPointillist: AutoRegressive 4D Gaussian Avatarization

COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models

SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination

EgoX: Egocentric Video Generation from a Single Exocentric Video

SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion

SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation

ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos

PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer

SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network

HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification

ID-Sim: An Identity-Focused Similarity Metric

Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces

Towards Cross-Modal Preservation, Consistency and Alignment for Privacy-Preserving Visible-Infrared Person Re-Identification

Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification

Assignment-Driven Hash Learning in a Hyper-Semantic Space for On-the-Fly Category Discovery

DyFCLT: Dynamic Frequency-Decoupled Cross-Modal Learning Transformer for Multimodal Tiny Object Detection

EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer

Building a Precise Video Language with Human–AI Oversight

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

Towards Sparse Video Understanding and Reasoning

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding

TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation

What Are You Doing? A Closer Look at Controllable Human Video Generation

Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring

CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Towards High-resolution and Disentangled Reference-based Sketch Colorization

MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation

Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping

COT-FM: Cluster-wise Optimal Transport Flow Matching

Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers

Guiding a Diffusion Transformer with the Internal Dynamics of Itself

CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions

RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection

COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation

Learnability-Driven Submodular Optimization for Active Roadside 3D Detection

Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception

Dynamics-Aware Preference Optimization for Vision-Language Models

Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations

Learning What Helps: Task-Aligned Context Selection for Vision Tasks

Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

NeuroRule: Bridging Vision and Logic with Differentiable Rule Induction

Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter

Ego: Embedding-Guided Personalization of Vision-Language Models

JoPPO: Hierarchical Photography Assessment via Contrastive Joint Conditional Probabilistic Reinforcement Learning

AeroAgent: A Vision–Physics–Decision Framework for Aerodynamic Vehicle Design

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives

MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting

P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction

iSplat: Iterative Learning for Fine-Grained Gaussian Splatting

Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting

MAPo: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction

FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario

HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views

SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation

Physically Inspired Gaussian Splatting for HDR Novel View Synthesis

PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting

4C4D: 4 Camera 4D Gaussian Splatting

SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting

GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance

MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting

PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage

TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion

Disco-GS: Gaussian Splatting in Dynamic Color Lighting

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?

GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking

Visual Grounding for Object Questions

CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning

What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models

Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

Stable and Efficient Single-Rollout RL for Multimodal Reasoning

Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

Monet: Reasoning in Latent Visual Space Beyond Image and Language

STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

S2D: Selective Spectral Decay for Quantization-Friendly Conditioning of Neural Activations

OneSparse: A Unified Framework for Sparse Activation Layers in Vision Models

What Matters in Practical Learned Image Compression

BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers

Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder

LazyVAR: Accelerating Visual Autoregressive Models via Scale-wise Token Pruning and Parallel Group Decoding

Spk2VidNet: A Hierarchical Recurrent Architecture for High-Fidelity Video Reconstruction from Long Spike-Camera Streams

Adaptive Learned Image Compression with Graph Neural Networks

SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation

VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment

CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization

Affine Perspective-Three-Point Problem

Sky2Ground: A Benchmark for Site Modeling under Varying Altitude

SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning

WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks

Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

APPO: Attention-guided Perception Policy Optimization for Video Reasoning

RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global–Local Feature Fusion

Bridging Human Evaluation to Infrared and Visible Image Fusion

Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion

Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion

Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation

From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification

STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

FlowComposer: Composable Flows for Compositional Zero-Shot Learning

ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models

UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization

Leveraging Multispectral Sensors for Color Correction in Mobile Cameras

Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance

Optical Diffraction-based Convolution for Semiconductor Lithography

GSNR: Graph Smooth Null-Space Representation for Inverse Problems

MatE: Material Extraction from Single-Image via Geometric Prior

αMatte4K & µMatting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting

Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints

Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields

Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark

Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis

SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment

Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective

PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts

Any Resolution Any Geometry: From Multi-View To Multi-Patch

Paparazzo: Active Mapping of Moving 3D Objects

DepthFocus: Controllable Depth Estimation for See-Through Scenes

OVI-MAP: Open-Vocabulary Instance-Semantic Mapping

PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency

SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass

Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling

Variational Graph-based Normal Integration

Vinedresser3D: Towards Agentic Text-guided 3D Editing

MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts

Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation

MeshRipple: Structured Autoregressive Generation of Artist-Meshes

FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation

Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow

CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

DRM: Diffusion-based Reward Model With Step-wise Guidance

Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models

AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References

Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling

LAMP: Language-Assisted Motion Planning for Controllable Video Generation

Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization

Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation

DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning

Learning a Unified Latent Action Space from Videos with Action-centric Cycle Consistency

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning

Dynamic Momentum Recalibration in Online Gradient Learning

Spherical Leech Quantization for Visual Tokenization and Generation

MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention

GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy

E^2-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia

HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork

NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity

Spectral Conformal Risk Control: Distribution-Free Tail Guarantees via Bayesian Quadrature

Edge-RecViT: Efficient Vision Transformer via Semantic-Refined Dynamic Recursion

ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

GUI-SAGE: Enhancing GUI Automation with Self-Explanatory Learning

GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks

HiconAgent: History Context-aware Policy Optimization for GUI Agents

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection

Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

Common Inpainted Objects In-N-Out of Context

Prompt-Free Universal Region Proposal Network

Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation

PaNDaS: Learnable Shape Interpolation Modeling with Localized Control

Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network

Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels

SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains

Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

CDICS: Delving Into Fine-Grained Attribute for In-Context Segmentation via Compositional Prompts and Phased Decoupling

Discriminative Perception via Anchored Description for Reasoning Segmentation

SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark

CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening

ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

Rotation Invariant and Symmetry Aware Pixel Difference Network for Remote Sensing Object Detection

F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation

RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction

PGA: Prior-free Generative Attack for Practical No-box Scenario

Lipschitz Optimization for Formal Verification of Homographies

Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack

Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation

Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models

Mitigating Error Amplification in Fast Adversarial Training

Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern

What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs

Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation

Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising

Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning

Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation

Contact-Aware Neural Dynamics

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs

SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering

Cross-Hand Latent Representation for Vision-Language-Action Models

Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention

Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts

GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation

Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Rethinking Occlusion Modeling for UAV Tracking

Adaptive Capacity Autoregressive Visual Tracking

Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking

Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions

TrackMAE: Video Representation Learning via Track Mask and Predict

Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers

MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance

Multimodal Causality-Driven Representation Learning for Generalizable Medical Image Segmentation

Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization

TopoSlide: Topologically-Informed Histopathology Whole Slide Image Representation Learning

Beyond the Static-World: Lifelong Learning for All-in-One Medical Image Restoration

Hyperbolic Relational Prompts for Intersectional Fairness in Medical VLMs

RNED: Rotary Number Encoding and Decoding for Quantitative Medical VLM Analysis

MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding

Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision

BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation

Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines

UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation

Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration

SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement

Representing 3D Faces with Learnable B-Spline Volumes

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling

HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation

KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization

AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion

Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning

ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals

Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections

VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery

Occluded Human Body Capture with Frequency Domain Denoising Prior

ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss

OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery

MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer

Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition

DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning

MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs

CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation

Bézier Degradation Modeling for LiDAR-based Human Motion Capture

UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass

Illumination-Consistent Human-Scene Reconstruction from Monocular Video

Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution

Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection

Enabling Supervised Learning of Generative Signatures for Generalized Synthetic Image Detection

DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization

All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark

Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification

The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection

Omni-AD: A Large-scale and Versatile Benchmark for Industrial Anomaly Detection

Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection

LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception

Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction

OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness

An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving

OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera

ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction

(ends 6:00 PM)

Demonstration:

Demos Session 2

(ends 6:00 PM)

5 p.m.

Art Gallery Tour with Curator and Artists [5:00-5:30]

(ends 5:30 PM)

SAT 6 JUN

7:30 a.m.

Findings Poster Session 2 [7:30-9:00]

Posters 7:30-9:00

Beyond Top-1: Forensic Analysis of Full Prediction Distributions Reveals Hidden Model Reasoning

Zero-Shot Textual Explanations via Translating Decision-Critical Features

DMin: Scalable Training Data Influence Estimation for Diffusion Models

A Framework for Evaluating Zero-Shot Image Generation in Concept-Based Explainability

Self-Guided Integrated Gradient Method for Attribution

Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking

Discovering Attention Head Interactions in Vision Transformers

Value bounds and Convergence Analysis for Averages of LRP attributions

MReactor: Offline Multiple Appropriate Facial Reaction Generation with Hierarchical Cognitive Disentanglement

B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition

Learning by Neighbor-Aware Semantics, Deciding by Open-Form Flows: Towards Robust Zero-Shot Skeleton Action Recognition

Actionable Human Motion Generation via Latent Imitation and Fine-Grained Text Completion

GHOST: Fast Category-Agnostic Hand-Object Interaction Reconstruction from RGB Videos Using Gaussian Splatting

Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D

CoherentHand: Temporally Consistent 3D Hand Trajectory Synthesis with Semantic Motion Priors

Weakly Supervised Micro-Expression Spotting based on Boundary Refinement Mechanism and Cross-subject Learning Representation

FUSION: Full-body Unified Motion Prior for Body and Hands Via Diffusion

BridgeDiffusion: Latent Space Optimization for Independent Body-Part Generation with Motion Consistency Bridges in Interactive Dance

MARIO: Motion-Augmented Real-Time Multi-Sensor Inertial Odometry

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

How2Sign-Synth3D: Markerless Holistic Sign Language Performance Capture and Synthetic Data for Dense Landmark Tracking

SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

VoxFace: Streaming Audio-Visual Synthesis via Relay-Style Multi-Token Prediction for Interactive Conversation

OmniHead: A Unified Model for Dynamic Nonverbal Facial Behaviors

Detecting Precise Hand Touch Moments in Egocentric Video

Less is More: Multimodal Human Pose Estimation with Selective Fusion

PHYLOMAN: Generative Behavior Control via Fusing LLM Planning and Physics-based Control

Contact Matrix: Enhancing Dance Motion Synthesis with Precise Interaction Modeling

Learning Predictive Visuomotor Coordination

FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-Calibration for Cattle Mounting Pose Estimation

Bootstrapping Sign Language Annotations with Sign Language Models

OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

THOM: Generating Physically Plausible Hand-Object Meshes From Text

ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

All-Age Human Mesh Recovery

GeneFlow: Modeling Heredity and Variation via Flow Matching Transformers for Kinship Verification

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning

Fast-HOI: Fast Human-Object Interaction Synthesis via Distilled Interaction Prior and Physical Constraints

HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

GeoHOI: Geometry-Enhanced Human-Object Interaction Video Generation via Hierarchical Multi-Modal Injection

TAUE: Training-free Noise Transplant and Cultivation Diffusion Model

GR-Diffusion: Graph-Guided Relational-Aware Diffusion via Attention Alignment

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

FREESTYLE: An Anchor-Free Mechanism for Training-Free Style-Aligned Image Generation

Is Your Text-to-Image Model Robust to Caption Noise?

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

RectifiedHR: Enable Efficient High-Resolution Synthesis via Energy Rectification

ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don’t Know Galileo’s Principle...for now

Group Relative Attention Guidance for Image Editing

ControlPose: High-Fidelity Pose-Controlled Image Generation with Multi-Faceted Pose Disentanglement

FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation

Latent-Compressed Variational Autoencoder for Video Diffusion Models

Deep Parameter Interpolation for Scalar Conditioning

Mining Real-World Image Relations for Large-Scale Controllable Generation and Editing

Disentangle Once, Control All: A Unified and Efficient Framework for Disentangling Multi-Condition Control in Human Video Generation

HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models

Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations

Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching

Stochastic Perturbations Improve Distribution-to-Distribution Generative Models

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

FA-MoE: Improving Medical Image Generation Through Frequency-Aware Mixture of Experts

Generated Reality: Human-Centric World Simulation Using Interactive Video Generation with Hand and Camera Control

VHOI: Controllable Video Generation of Human–Object Interactions from Sparse Trajectories via Motion Densification

LoViC: Efficient Long Video Generation with Context Compression

FedErase: Personalized Federated Unlearning for Text-to-Image Diffusion Models

Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Models

Earthquake-Bench: Video Generation Benchmark for Earthquake Simulation

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Block Cascading: Training Free Acceleration of Block-Causal Video Models

Activation-Norm Maximization to Accelerate Training in Flow-Matching Transformers

FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

No Cache Left Idle: Accelerating diffusion model via Extreme-Slimming Caching

Inference-Time Alignment of Diffusion Models with Evolutionary Algorithms

TokenErase: Robust Concept Erasure via Visual-Injected Token Optimization

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Animated-ART: Multi-Layer Transparent Video Generation

Rethinking Conditioning in Diffusion Models: Dynamic Token Scheduling for Efficient and Aligned Text-to-Image Generation

Attention-Guided Energy Optimization for Label-Aligned Anomaly Generation

USV: Unified Sparsification for Accelerating Video Diffusion Models

OminPSD: Layered PSD Generation with Diffusion Transformer

Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA

Depth Adaptive Efficient Visual Autoregressive Modeling

Cross-Resolution Diffusion Models Via Network Pruning

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

Understanding Reward Hacking in Text-to-Image Reinforcement Learning

OminiControl2: Efficient Conditioning for Diffusion Transformers

Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting

AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction

MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer

InstaDA: Augmenting Instance Segmentation Data with Dual-Agent System

One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion

Adversarial Concept Distillation for One-Step Diffusion Personalization

DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation

Anomaly Agent: Unified Anomaly Retrieval and Synthesis Before Manufacturing

S^2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation

ColorMam: Color-Aware State Space Model for Image Color Style Transfer

NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing

Towards Source-Aware Object Swapping with Initial Noise Perturbation

SyntheticManga: Training-Free Manga Generation with Phased Diffusion

Fast Autoregressive Video Generation with Diagonal Decoding

E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models

Bind-Your-Avatar: Multi-Character-Talking Video Generation with Dynamic 3D-mask-based Embedding Router

SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

PEDRA: Evaluating the Realism of Pedestrian Dynamics in Video Generation

VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Jano: Adaptive Diffusion Generation with Early-Stage Convergence Awareness

Low-Bitrate Video Compression through Semantic-Conditioned Diffusion

Decoupled Scale-wise Autoregressive Modeling for Visual Generation

TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Future Optical Flow Prediction Improves Robot Control and Video Generation

Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

Drive-Cascade: Autoregressive Occupancy to LiDAR and Video Synthesis

ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Concept Erasure via Attention Redirection

Loom: Diffusion-Transformer for Interleaved Generation

Rethinking Training Dynamics in Scale-Wise Autoregressive Generation

HiStream: Efficient High-Resolution Video Generation via Redundancy Eliminated Streaming

Eevee: Towards Close-up High-resolution Video-based Virtual Try-on

Consistent Video Editing as Flow-Driven Image-to-Video Generation

IM-Animation: An Implicit Motion Representation for Identity-Decoupled Character Animation

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Generative Visual Chain-of-Thought for Image Editing

UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation

Blend-Aware Latent Diffusion: Mitigating Stitched Seams in Image Inpainting

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation

SwiftPie: Lightning-fast Subject-driven Image Personalization via One step Diffusion

Video Generation Models are Good Latent Reward Models

Harnessing Layered Graphic Designs with Real Intentions for Text-to-Design Generation

VeCoR — Velocity Contrastive Regularization for Flow Matching

CETCam: Camera-Controllable Video Generation via Consistent and Extensible Tokenization

SafetyBPO: Bidirectional Preference Optimization for Safe Text-to-Image Generation

Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

DebFilter: Eradicating Biases Stashed in Value

PEdit: Pareto-Guided Image Editing via Dynamic Latent Trajectory Control

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline

Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement

Beyond Pixel Loss: Video-INRs Prefer Perceptual Optimization

MVSSM: Motion-aware Visual State Space Model for Efficient Video Deblurring

PrismNet: Semantic-Aware Image Enhancement via Vision Transformer and Zero-Cost Gating

FLAIR: Frequency- and Locality-Aware Implicit Neural Representations

CtrlISP: Rescuing Low-Light RAW Images via Controllable Neural ISP

Deepfake-Agent: Aggregating Semantic Forgery Clues for Generalizable Detection

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

PCSTracker: Long-term Scene Flow Estimation for Point Cloud Sequences

POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP

Semantic-Aware Spectral Reconstruction: A Spectral Library-Aided Unsupervised Method Based on the Diffusion Model

Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution

RodNet: Visual Pathway-Inspired Adaptive Sparse Network for Efficient Low-Light Image Enhancement

LWTformer: A Detail-Aware, Learnable Wavelet-Transformer for Ancient Chinese Character Image Restoration

SAT: Selective Aggregation Transformer for Image Super-Resolution

PhyFusion: Physics-Aware Infrared and Visible Image Fusion via Modality-Specific Physical Priors

UnfoldIR: Rethinking Deep Unfolding Network in Illumination Degradation Image Restoration

Evaluating Low-Light Image Enhancement Across Multiple Intensity Levels

FALCON: Fast Adaptive Lightweight Computation of Intensities and Events for Depth Estimation

Learning to Translate Noise for Robust Image Denoising

QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution

AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution

Optical Tolerance-Compensated Diffusion Model for Image Restoration

TinySR: Shallow Diffusion Transformers for Real-World Image Super-Resolution

Inf-Dehaze: Beyond GPU Memory Constraints for Ultra-High-Resolution Image Dehazing

DenoiseGS: Gaussian Reconstruction Model for Burst Denoising

FlowSteer: Conditioning Flow Field for Consistent Image Restoration

P^2CS: Parallel Point Cloud Pre-Training with Semantic Consistency

Towards Calibrated Gradient-based Multi-Task Learning

Brain-Inspired Multimodal Spike Neural Network for Image-Text Retrieval

Conformal Cross-Modal Active Learning

Deep-to-Shallow Knowledge Transfer: Multi-Scale Self-Distillation with Bidirectional Aware for 3D Brain Segmentation

MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation

Rethinking Whole-Body CT Image Interpretation: An Abnormality-Centric Approach

Generative Vision-Language Multiple Instance Learning for Weakly Supervised Neonatal Fundus Screening and Reporting

Mitigating Batch Effects in Histopathology via Language-Mediated Robust Embedding Generation

PTF-CT: Polar-Aware Temporal-Frequential Iterative Reconstruction for Sparse-View CT

Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM

Towards Noise-Robust Medical Segmentation via Chebyshev-Attention-Based Asymmetric UNet

Two-Stage 3D Pulmonary Vessel Reconstruction via Trunk–Expansion Coupled Point Cloud Generation

A Simple yet Effective Data Scaling Strategy for Semi-Supervised Medical Image Segmentation

DepthScopy: Decoupling Frequency for Endoscopic Depth Estimation in Sparsely-Textured Regions

ReCliFF: Adaptive Orthogonal Decoupling for Federated Fine-tuning of Medical MLLMs

Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI

Vision-Language Models for Automated 3D PET/CT Report Generation

PaM-MIL: Proliferation and Metastasis Enhanced Localization for Multiple Instance Learning on Pathology Images

Surgical Procedural Planning as 3D World Modelling: Towards Automated Pulmonary Resection

From Adaptation to Generalization: Adaptive Visual Prompting for Medical Image Segmentation

AceMIL: Ordinal-Aware Multiple Instance Learning for Pathological Progression Analysis

PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal

Anatomy-CoT: Teaching MLLMs to Reason in Radiology

DELRER: Disease Evolution-Informed Longitudinal Radiology Report Generation

M^4Fuse: Lightweight State-Space MoE with a Cross-Scale Gating Bridge for Brain Tumor Segmentation

DynaMind: Reconstructing Dynamic Visual Scenes from EEG by Aligning Temporal Dynamics and Multimodal Semantics to Guided Diffusion

MAE-XNT: A Foundation Model for Segmenting Neuronal Tissue Volumes Generated with X-Ray Nanotomography

NAKUL-Med: Spectral-Graph State Space Models with Dynamics Kernels for Medical Signals

Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

M^3D-BFS: a Multi-Stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis

Multimodal Decoupled Dynamic Graph Learning for Brain Disease Diagnosis

TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

TP-Seg: Task-Prototype Framework for Unified Medical Lesion Segmentation

C3-Diff: Super-resolving Spatial Transcriptomics via Cross-modal Cross-content Contrastive Diffusion Modelling

MeMix: Multi-Encoder Mixture Framework for Medical Report Generation

Learning Spatial-Preserving Hierarchical Representations for Digital Pathology

Open-Set Spatial Gene Expression Prediction from Histological Images via Retrieval-Augmented Generation

Personalized Functional Brain Network Modeling with Adaptive Auto-Weighted Learning for Automatic Brain Disorder Diagnosis

Do Vision Models Perceive Illusory Motion in Static Images Like Humans?

Meta-CDMTransNet: Cross-Domain Multi-Scale Transformer Meta-Learning Framework for Few-Shot Breast Histopathological Image Classification

PLCReg: Correlation-Aware Polar-Linear Attention for Guiding Medical Image Registration

A Denoising-Enhanced Multimodal Learning Framework for Robust Nasal Endoscopy Report Generation

When Models Learn to Ask Why: Adaptive Causal Reasoning for Trustworthy Medical Vision–Language Models

PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation

Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation

Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning

PGDM: Physics-Guided Noise-Free Diffusion Model Based on Point Spread Function for Light-Scattering Removal in Unpaired Biomedical Images

Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios

Anatomy-Aware Adaptive Feature Perturbation Framework for Semi-Supervised MRI Segmentation

EI: Early Intervention for Multimodal Imaging Based Disease Recognition

Rethinking Medical High-Modality Learning Under Missingness — A Long-Tailed Distribution Perspective

HazeMatching: Dehazing Light Microscopy Images with Guided Conditional Flow Matching

Learning Priors via Hybrid Visual Autoregressive Modeling for Medical Image to Image Translation

BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis

RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference

UGLMM: Towards Unified Vision Grounding with Large Multimodal Model

FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

Training-Free Cross-Modal Alignment via Anchor Profiles with Statistical Significance Testing

CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension

LLM Guided Multi Style Typography and Layout Generation via Dynamic Direct Preference Optimization

FusionBridge: An Efficient Fusion Via Feature Disentanglement for Multi-Modal Object Re-Identification

LlamaRG: A Multi-View Large Language Model for Radiology Report Generation

Mitigating Information Forgetting via Entropy-Driven Progressive Retrospection for Multimodal Long Reasoning

InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression

R²MoE: Representation and Expert Selection Dual-Regularized Mixture-of-Experts for Multimodal Clinical Data

DUALVISION: RGB–Infrared Multimodal Large Language Models for Robust Visual Reasoning

Parallel In-context Learning for Large Vision Language Models

Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Prototype and Sample Level Semantic Alignment for Incomplete Multi-View Clustering

Rethinking VLMs for Image Forgery Detection and Localization

DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

OTPrune: Distribution-Aligned Visual Token Pruning Via Optimal Transport

Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering

Materialistic RIR: Material Conditioned Realistic RIR Generation

From Coarse to Precise: Rethinking and Bridging Localization in Multimodal Large Language Models

Do Audio-Visual Large Language Models Really See and Hear?

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Anticipatory Planning for Multimodal AI Agents

Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs

Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometric Problem Solving

HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding

A Diagnostic Study of Region-Based Representations in Multimodal LLMs

HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model

UMI-HOI: Unifying Multimodal Information with Semantic Multi-Head Attention for Human–Object Interaction Detection

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

Visual2Echo Compositional Contrastive Learning (V2E-CCL): Binaural Knowledge Distilled Network for Depth Prediction

TextBind: Your Vision-Language Models are Naturally Unified Multimodal Models

Learning to Walk the Right Paths: Task-Responsive Graph Reasoning for Multimodal Inference

CLASH: A Benchmark for Cross-Modal Contradiction Detection

DA-CLIP: Mitigating Granularity Mismatch in Zero-Shot Anomaly Detection via Decoupled Text-Visual Alignment

HAIT: Hybrid Adversarial Iterative Training for Mitigating Object Hallucination in Large Vision–Language Models

Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

CP-IMoE: Collaborative Prompt-Guided Interactive Mixture-of-Experts for Incomplete Multimodal Learning

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions

LiteEmbed: Adapting CLIP to Rare Classes

CADReasoner: Iterative Program Editing for CAD Reverse Engineering

COSTA: Collaborative Open-Set Test-Time Adaptation Through Robust Prototype Learning

Perturb and Recover: Fine-Tuning for Effective Backdoor Removal from CLIP

PrismPrune: Decoupling Saliency and Diversity in Attention for Efficient Visual Token Pruning in VLMs

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

HAFM: A Post-Fusion Gating Module for Haze-Aware RGB–Thermal Object Detection

CaptAin: Caption-driven Alignment for Bridging Modality Gaps in Partially Relevant Video Retrieval

Dual Anchors, Do It Better: Hierarchical Group Merging for Zero-Shot Anomaly Detection

HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal–Image Modeling and Understanding

Unbiased Dynamic Multimodal Fusion

Video Reasoning Without Training

Efficient Discrete Diffusion Model for Scalable Multi-Objective Traveling Salesman Problem

EpiMask: Leveraging Epipolar Distance Based Masks in Cross-Attention for Satellite Image Matching

S³O: Selective Spatial-Spectral Operator for Cross-Scale Fusion

Fast Kernel-Space Diffusion for Remote Sensing Pansharpening

Unified Urban Tuning: Co-Enhancing Satellite and Street View Reasoning with a Progressive Tuning Framework

GReD-RSITR: A Generative Re-Examined Discriminative Framework for Remote Sensing Image-Text Retrieval

ZODS-RS — Zero-Training Oriented Detection & Segmentation for Remote Sensing

Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

Optimal-Transport-based Feature Alignment for Multimodal Change Detection

HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition

CrossWeaver: Towards Efficient Cross-Modal Interweaving and Decoupling for Weakly-Aligned Multispectral Object Detection

ProSM: Progressive Soft Masking for Fine-Grained Remote Image Segmentation

UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Shared–Private Multimodal Decomposition

OffNadirLoc: Benchmark and Framework for Challenging UAV-to-Satellite Geo-Localization under Large Off-Nadir Views

M-PhyGs: Multi-Material Object Dynamics from Video

Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps

Controllable Radar Simulation with Waveform Parameter Embedding

mmDiff: A Noise-Robust Differentiable Ray-Tracing Framework for mmWave Scene Calibration and Channel Prediction

GLOW: Global Illumination-Aware Inverse Rendering of Indoor Scenes Captured with Dynamic Co-Located Light & Camera

Scene-Level Heterogeneous Physics Simulation with 3D Gaussian Splats

How to Achieve Prototypical Birth and Death for OOD Detection?

Uncertainty-Aware Cross-Modal Opinion Interaction: A General Framework for Visible-Infrared Vehicle and Person Re-Identification

EIRES: Training-free AI-Generated Image Detection via Edit-Induced Reconstruction Error Shift

Vote-in-Context: VLMs as Explainable Zero-Shot Rank Fusers

PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images

HypHOI: Exploring Hierarchical Hyperbolic Embeddings for Human-Object Interaction Detection

A Low-Rank Learning Framework Integrating Detection, Masking, and Recovery for Occluded Facial Expression Recognition

DSAA: Dual-Stage Attribute Activation for Fine-Grained Open Vocabulary Detection

ConSel: Concept-Aware Self-supervised Learning for Regression Beyond Ordinal Tasks

Rolling and Denoising: Rethinking Dynamic Modal Fusion for Multi-Modal Object Re-Identification

Adapting with an Open Mind: Leveraging Open-Vocabulary Detectors for Closed Set Source-Free Domain Adaptive Object Detection

SFS-DETR: Spatial-Frequency Selection for UAV Object Detection

ForenDeX: Unlocking Forensic Insights for Explainable AI-Generated Image Detection

Long-Tailed Out-of-Distribution Detection with Refined Separate Class Learning

Bridging Day and Night: Unsupervised Cross-Domain Re-Identification with Synergistic Prompt and Prototype Learning

(ends 9:00 AM)

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

9 a.m.

Oral Session 3A: Generative Diffusion Modeling [9:00-10:15]

Orals 9:00-10:15

[9:00] Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation

[9:12] Guiding a Diffusion Model by Swapping Its Tokens

[9:25] PixelDiT: Pixel Diffusion Transformers for Image Generation

[9:37] SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

[9:50] SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

[10:02] Streaming Diffusion Model for Fast Infrared and Visible Video Fusion

(ends 10:15 AM)

Oral Session 3B: Spatial Understanding [9:00-10:15]

Orals 9:00-10:15

[9:00] ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

[9:12] CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling

[9:25] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

[9:37] RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video

[9:50] S^2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds

[10:02] Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance

(ends 10:15 AM)

Oral Session 3C: Generative Editing [9:00-10:15]

Orals 9:00-10:15

[9:00] 3D-LATTE: Latent Space 3D Editing from Textual Instructions

[9:12] AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows

[9:25] ChordEdit: One-Step Low-Energy Transport for Image Editing

[9:37] Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface

[9:50] Native and Compact Structured Latents for 3D Generation

[10:02] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control

(ends 10:15 AM)

Oral Session 3D: Multimodal Modeling [9:00-10:27]

Orals 9:00-10:15

[9:00] Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

[9:12] FINER: MLLMs Hallucinate under Fine-grained Negative Queries

[9:25] MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction

[9:37] PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs

[9:50] PAVAS: Physics-Aware Video-to-Audio Synthesis

[10:02] ProPhy: Progressive Physical Alignment for Dynamic World Simulation

(ends 10:27 AM)

10:15 a.m.

Break:

Courtesy Break

(ends 10:30 AM)

10:30 a.m.

Keynote:

Transforming Computing with Quantum-Centric Supercomputing

Jerry Chow

(ends 11:30 AM)

11:15 a.m.

Poster Setup:

Poster Setup

(ends 11:45 AM)

11:45 a.m.

Doctoral Consortium:

Doctoral Consortium (By invitation only)

(ends 1:45 PM)

Demonstration:

Demos Session 3

(ends 1:45 PM)

Art Exhibition [11:45-6:00]

(ends 6:00 PM)

Poster Session 3 & Exhibit Hall [11:45-1:45]

Posters 11:45-1:45

Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation

Guiding a Diffusion Model by Swapping Its Tokens

PixelDiT: Pixel Diffusion Transformers for Image Generation

SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Streaming Diffusion Model for Fast Infrared and Visible Video Fusion

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling

GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video

S^2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds

Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance

3D-LATTE: Latent Space 3D Editing from Textual Instructions

AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows

ChordEdit: One-Step Low-Energy Transport for Image Editing

Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface

Native and Compact Structured Latents for 3D Generation

SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control

Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

FINER: MLLMs Hallucinate under Fine-grained Negative Queries

MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction

PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs

PAVAS: Physics-Aware Video-to-Audio Synthesis

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

V-DPM: 4D Video Reconstruction with Dynamic Point Maps

Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence

Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors

Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers

SMVRT: Implicit Human 3D Modeling Using Sparse Multi-View Volumetric Reconstruction with Transformer Fusion

LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving

Any4D: Unified Feed-Forward Metric 4D Reconstruction

Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers

Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting

AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend

AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

Parallelised Differentiable Straightest Geodesics for 3D Meshes

Geometry-Aligned and Anomaly-Aware Reconstruction for 3D Anomaly Detection

DVGT: Driving Visual Geometry Transformer

FMPose3D: monocular 3D pose estimation via flow matching

MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts

Foundation Encoders Are All You Need for Preference-Aware Personalization

Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

ThinkGen: Generalized Thinking for Visual Generation

CoLoGen: Progressive Learning of Concept–Localization Duality for Unified Image Generation

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Multimodal Semantic Bias Mitigation for Diverse Text-To-3D Generation

Visual Personalization Turing Test

Composing Concepts from Images and Videos via Concept-prompt Binding

Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Hierarchical Enhancement of Semantic Priors for Disentangled Text-Driven Motion Generation

Simpleposter: A Simple Baseline For Product Poster Generation

Prompt Yourself: Awakening Textual Semantics in 1D Visual Tokenizers

SkyReels-Text: Fine-Grained Font-Controllable Text Editing for Poster Design

Image Generation from Contextually-Contradictory Prompts

PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward

Aligning Text, Images and 3D Structure Token-by-Token

RefTon: Reference person shot assist virtual Try-on

GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting

Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

Gravitation-Driven Semantic Alignment for Text Video Retrieval

MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models

M^3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing

PersonaVLM: Long-Term Personalized Multimodal LLMs

MR-RAG: Multimodal Relevance-Aware Retrieval-Augmented Generation for Medical Visual Question Answering

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning

Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective

CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection

CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference

Towards Multimodal Domain Generalization with Few Labels

Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning

Event6D: Event-based Novel Object 6D Pose Tracking

EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network

AE2VID: Event-based Video Reconstruction via Aperture Modulation

From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation

Spike-driven Discrete Aggregation for Event-based Object Detection

x^2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space

FloVerse: Floor Plan-Guided Multi-Modal Navigation

TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation

History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation

DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes

CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

Rethinking Visual Rearrangement from A Diffusion Perspective

APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs

When Robots Should Say "I Don’t Know": Benchmarking Abstention in Embodied Question Answering

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

Towards Training-free Scene Text Editing

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding

Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All

Region-Wise Correspondence Prediction between Manga Line Art Images

WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models

TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering

RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes

Voxify3D: Pixel Art Meets Volumetric Rendering

Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs

FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction

GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator

LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields

BiProLoRA: Bilevel Prompt LoRA for Real Scene Recovery

Degradation-Consistent Test-Time Adaptation for All-in-One Image Restoration

CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation

2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition

Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration

It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal

Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration

Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos

Dynamic Exposure Burst Image Restoration

FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration

ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization

VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba

Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models

EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy

Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network

Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities

MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction

Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration

PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation

Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training

PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

Hidden Dangers of Compositional Generation: Diagnosing Semantic Safety Failures in Text-to-Image Models

VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation

JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation

Few-shot Acoustic Synthesis with Multimodal Flow Matching

CLIP-like Model as a Foundational Density Ratio Estimator

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

EgoAVU: Egocentric Audio-Visual Understanding

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Adaptive Confidence Regularization for Multimodal Failure Detection

Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis

PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction

Conflict-Aware Adaptive Cross-Reconstruction for Multimodal Sentiment Analysis

EduDiag: A Benchmark for Educational Diagnostic Reasoning with Error Tracing and Correction on Large Multimodal Models

UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Cross-Modal Guided Visual Synthesis for Data-Efficient Multimodal Depression Recognition

AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis

PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence

HandWorld: Hand-Centric Unified Video Action Generation

HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis

ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

LAM: Language Articulated Object Modelers

Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes

Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing

From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition

Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation

PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On

Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

UniSER: A Foundation Model for Unified Soft Effects Removal

EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation

Inference-time Physics Alignment of Video Generative Models with Latent World Models

SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation

Plenoptic Video Generation

PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference

AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Flowception: Temporally Expansive Flow Matching for Video Generation

Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Linear Image Generation by Synthesizing Exposure Brackets

Low-Resolution Editing is All You Need for High-Resolution Editing

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

VENI: Variational Encoder for Natural Illumination

SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

MoCha: End-to-End Video Character Replacement without Structural Guidance

Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling

Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

VOSR: A Vision-Only Generative Model for Image Super-Resolution

Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution

DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution

VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment

Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution

Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset

Next-Scale Autoregressive Models for Text-to-Motion Generation

Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds

Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control

RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation

FrankenMotion: Part-level Human Motion Generation and Composition

HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human–Scene Interaction

SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Progressive Guessing to Fixed Point: Rethinking Human Motion Prediction with Deep Equilibrium Models

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data

Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots

PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion

Prototype-Guided Concept Erasure in Diffusion Models

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

CARD: Correlation Aware Restoration with Diffusion

DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease

Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?

CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models

InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior

MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation

M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models

Towards Policy-Adaptive Image Guardrail: Benchmark and Method

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

TextFM: Robust Semi-dense Feature Matching with Language Guidance

What’s Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

Point Cloud as a Foreign Language for Multi-modal Large Language Model

Grounded 3D-Aware Spatial Vision-Language Modeling

SpatialTree: How Spatial Intelligence Branches Out in MLLMs

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation

SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models

ReMatch: Boosting Representation through Matching for Multimodal Retrieval

RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval

Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach

Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Modeling the Visual Ambiguity of Human Sketches

SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval

ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

WeaveTime: Streaming from Earlier Frames into Emergent Memory in VideoLLMs

Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning

Efficient Frame Selection for Long Video Understanding via Reinforcement Learning

HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling

InternVideo-Next: Towards World-Understanding Video Models

Condensed Test-Time Adaptation of VLMs for Action Recognition

Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency

A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking

SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

Explaining Object Detectors via Collective Contribution of Pixels

Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

Evaluating Generative Models via One-Dimensional Code Distributions

TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds

LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space

Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds

Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors

QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment

L3DR: 3D-aware LiDAR Diffusion and Rectification

Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal

Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain

MS^2Gait: A Multi-Scale Spatio-Temporal Fusion Network for LiDAR-based Gait Recognition

Foundry: Distilling 3D Foundation Models for the Edge

Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation

Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering

FedMPT: Federated Multi-Label Prompt Tuning of Vision-Language Models

Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

NTK-Guided Implicit Neural Teaching

SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis

Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Same or Not? Enhancing Visual Perception in Vision-Language Models

Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

Animator-Centric Skeleton Generation on Objects with Fine-Grained Details

Synthesizing Visual Concepts as Vision-Language Programs

Self-Consistency for LLM-Based Motion Trajectory Generation and Verification

Semantic Scale Space: A Framework for Controllable Image Abstraction

Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection

DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime

SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models

Designing to Forget: Deep Semi-parametric Models for Unlearning

Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking

PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data

Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment

Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models

A Polynomial Chaos Framework for Causal Discovery in Nonlinear Uncertain Systems

Domain-Skewed Federated Learning with Feature Decoupling and Calibration

From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity

Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning

Few-for-Many Personalized Federated Learning

ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning

Domain Sensitive Federated Learning with Fisher-Informed Pruning

SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

Bridging Facial Understanding and Animation via Language Models

AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

CVA: Context-aware Video-text Alignment for Video Temporal Grounding

OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference

Towards Unified Human Perception and Machine Understanding: Token Flow Guided Compression Framework

A More Word-like Image Tokenization for MLLMs

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention

One Layer’s Trash is Another Layer’s Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs

OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

Tunable Soft Equivariance with Guarantees

Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score

Cluster-aware Anchor Learning for Multi-View Clustering

Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning

Weight Space Representation Learning via Neural Field Adaptation

Recurrent Video Masked Autoencoders

Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning

Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery

From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

Spatial Retrieval Augmented Autonomous Driving

Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography

MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

WPT: World-to-Policy Transfer via Online World Model Distillation

ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction

URScenes: A Multi-scenario Dataset for Unstructured Road Environments

MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving

SAMosaic3D: Modular Scene Assembly for Real-Time 3D Segment Anything

Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation

MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation

PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation

RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation

SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation

MARSS: Radar Semantic Segmentation via Modular Attention and State Space Models

MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure

Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge

PACT: Phase-Like Transition Constraints in Adapter-Based Continual Learning of Vision-Language Models

Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models

Re-evaluating Continual VQA: Toward Fair and Robust Evaluation for Multimodal Continual Learning

Distilling Balanced Knowledge from a Biased Teacher

Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting

Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning

EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

D^3FER: Dual Channel and Dual Branch Network for Robust Facial Expression Recognition under Dual Challenges

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

ExpPortrait: Expressive Portrait Generation via Personalized Representation

PersonaLive! Expressive Portrait Image Animation for Live Streaming

ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation

OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives

CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving

TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations

AURA: Multi-modal Shared Autonomy for Urban Navigation

Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image

FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

UIKA: Fast Universal Head Avatar from Pose-Free Images

FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation

First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models

Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models

PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models

MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

Fine-Grained Multi Image Object Hallucination Benchmark

Generative Video Motion Editing with 3D Point Tracks

BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Learning to Generate Highly Dynamic Videos using Synthetic Motion Data

Stereo World Model: Camera-Guided Stereo Video Generation

CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation

MAD: Motion Appearance Decoupling for efficient Driving World Models

VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency

Endless World: Real-Time 3D-Aware Long Video Generation

SpatialDiff: 3D-Aware Object Movement via Implicit Spatial Modeling

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection

CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild

YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection

VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection

WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing

MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID

Object-Generalized Re-Identification: A Step Towards Universal Instance Perception

When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection

Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification

HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph

Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

Beyond Caption-Based Queries in Video Moment Retrieval

Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference

VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos

VAST: Video Ability-Stratified Taxonomy for Data-Efficient Video Reasoning

An Empirical Study on How Video-LLMs Answer Video Questions

FPSBench: A Benchmark for Video Understanding at High Frame Rates

UniComp: Rethinking Video Compression Through Informational Uniqueness

NaTex: Seamless Texture Generation as Latent Color Diffusion

Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models

Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping

Delta Rectified Flow Sampling for Text-to-Image Editing

Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers

SpotEdit: Selective Region Editing in Diffusion Transformers

All-in-One Slider for Attribute Manipulation in Diffusion Models

DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment

From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution

CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception

Scene Reconstruction as Mapping Priors for 3D Detection

CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection

Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection

R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

Learning from Synthetic Data via Provenance-Based Input Gradient Guidance

Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

Draft and Refine with Visual Experts

R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII

VQ-VA World: Towards High-Quality Visual Question-Visual Answering

Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning

Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

μVLM: A Vision Language Model for μNPUs

Gaussian Mapping for Evolving Scenes

Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting

AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors

SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM

Faster-GS: Analyzing and Improving Gaussian Splatting Optimization

Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes

GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance

PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation

3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors

Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting

AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction

More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain

Eulerian Gaussian Splatting using Hashed Probability Pyramids

Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting

ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting

Robust3DGSW: Toward Robust Watermarking for Quantization-Aware 3D Gaussian Splatting

ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking

L^2DGS: Low-Light Dynamic Gaussian Splatting

Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images

HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation

CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning

Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

Mario: Multimodal Graph Reasoning with Large Language Models

Boosting Reasoning in Large Multimodal Models via Activation Replay

Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

ROSE: Rotate Your Large Language Model to See

OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe

SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs

Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

Suppressing Non-Semantic Noise in Masked Image Modeling Representations

Block-based Learned Image Compression without Blocking Artifacts

DeDelayed: Deleting Remote Inference Delay via On-Device Correction

AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception

Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery

Precise Object and Effect Removal with Adaptive Target-Aware Attention

Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression

FreqSIC: Frequency-aware Stereo Image Compression with Bi-directional Checkerboard Context Model

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

Fusion of Depth and Semantics for Probabilistic Floorplan Localization

A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors

Geo2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis

Coverage Optimization for Camera View Selection

Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

LensWalk: Agentic Video Understanding by Planning How You See in Videos

DPGF-Net: Dual-Prior Guided Fusion Network for Joint Assessment of Perceptual Quality and Semantic Consistency in AI-Generated Images

RegionFuse: Region-Adaptive Pixel Distribution Learning for Infrared and Visible Image Fusion

Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion

TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification

Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process

Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification

Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

Learning to Learn Weight Generation via Local Consistency Diffusion

Balanced Dataset Distillation via Modeling Multiple Visual Pattern Distribution

Grid Distillation: Compositional Image Distillation via Structured Generative Grids

Dataset Distillation by Influence Matching

StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning

Seeing Through Blur: Tackling Defocus in Spike-Based Imaging

Distilling Quasi-Conformal Mapping: A Generalizable and Efficient Solution for Wide-Angle Correction

Lighting in Motion: Spatiotemporal HDR Lighting Estimation

LightRR: A Lightweight Network for Single Image Reflection Removal

HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter

Coded-E2LF: Coded Aperture Light Field Imaging from Events

TokenLight: Precise Lighting Control in Images using Attribute Tokens

Kaleidoscopic Scintillation Event Imaging

gQIR: Generative Quanta Image Reconstruction

Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation

Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling

From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images

Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views

Zero-Shot Depth Completion with Vision-Language Model

FE2E: From Editor to Dense Geometry Estimator

Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision

Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

NI-Tex: Non-isometric Image-based Garment Texture Generation

Velox: Learning Representations of 4D Geometry and Appearance

UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching

UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes

Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion

LoST: Level of Semantics Tokenization for 3D Shapes

Lafite: A Generative Latent Field for 3D Native Texturing

Image-Guided Geometric Stylization of 3D Meshes

LATTICE: Democratize High-Fidelity 3D Generation at Scale

Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement

MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly

TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation

DynamicsBoost: Dynamic Plausible Video Generation via Annotation-Free Continuation Preference Optimization

Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition

Fine-Grained GRPO for Precise Preference Alignment in Flow Models

Lighting-grounded Video Generation with Renderer-based Agent Reasoning

RewardFlow: Generate Images by Optimizing What You Reward

Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Self-Corrected Image Generation with Explainable Latent Rewards

Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning

Reading Your Actions: Learning Generalizable Action Representations via Pre-training AEMG

MA-Bench: Towards Fine-grained Micro-Action Understanding

OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments

Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements

DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions

Protect to Adapt: Subspace-Constrained Adaptation with Ranked Negative Prompt Feedback for Few-Shot Action Recognition

SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition

InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

S^2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain

Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model

Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression

Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data

AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks

TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts

QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification

Learning to Solve PDEs on Neural Shape Representations

Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning

Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses

Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Streamlined Open-Vocabulary Human-Object Interaction Detection

Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting

Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection

WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

Open-Vocabulary Domain Generalization in Urban-Scene Segmentation

OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery

Annotation-Efficient Coreset Selection for Context-dependent Segmentation

ALLNet: Multi-task Dense Prediction for Degraded Images

Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting

Volumetric Functional Maps

GenMask: Adapting DiT for Segmentation via Direct Mask Generation

Frequency-Aware Affinity for Weakly Supervised Semantic Segmentation

Learning and Aligning Click-Aware Shape Prior for Interactive Amodal Instance Segmentation

Beyond Reassembly: Fractured Object Recovery with Missing Parts

Best Segmentation Buddies for Image-Shape Correspondence

RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments

Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts

Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data

SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images

SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery

ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective

STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting

NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting

Improving Adversarial Transferability with Local Perturbation Augmentation

Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs

Stealing Split Learning Bottom Models by Recovering Embedding Geometry

PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems

No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation

Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks

Where, What, Why: Toward Explainable 3D-GS Watermarking

Robust Spiking Neural Networks by Temporal Mutual Information

TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

Obstruction Reasoning for Robotic Grasping

PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding

SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

Localizing, Structuring, and Rendering: Bridging 3D and 2D Vision-Language-Action Models for Robotic Manipulation

NIL: No-data Imitation Learning

Humanoid Generative Pre-Training for Zero-Shot Motion Tracking

EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models

CUBic: Coordinated Unified Bimanual Perception and Control Framework

RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model

GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping

UETrack: A Unified and Efficient Framework for Single Object Tracking

ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy

Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel

Learning to Track Instance from Single Nature Language Description

MV-TAP: Tracking Any Point in Multi-View Videos

Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing

Content-Adaptive Hierarchical Hyperprior for Neural Video Coding

UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking

Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning

From Infusion to Assimilation Distillation for Medical Image Segmentation

IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework

Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers

Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code

MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding

Turning Pre-Trained Vision Transformers into End-to-End Histopathology Whole Slide Image Models for Survival Prediction

A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation

KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation

CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification

OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging

Learning complete and explainable visual representations from itemized text supervision

EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images

Differentially Private 2D Human Pose Estimation

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose

SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking

HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations

PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets

Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation

OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance

Human Interaction-Aware 3D Reconstruction from a Single Image

Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning

SAGA: Source Attribution of Generative AI Videos

VMD-FACT: A New Video Dataset and MLLM-based method for Detecting Realistic AI-Generated Video Misinformation

ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation

A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World

PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection

Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection

Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

CHAL: Causal-guided Hierarchical Anomaly-aware Learning for Moving Infrared Small Target Detection

RAID: Retrieval-Augmented Anomaly Detection

ADSeeker: A Knowledge-Grounded Reasoning Framework for Industry Anomaly Detection and Reasoning

InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception

Grounded Latents for Entity-Centric 4D Scene Generation

(ends 1:45 PM)

Art Gallery Tour with Curator and Artists [11:45-12:15]

(ends 12:15 PM)

1:45 p.m.

Art Panel [1:45-2:45]

(ends 2:45 PM)

2 p.m.

Oral Session 4A: Geometric Understanding [2:00-3:15]

Orals 2:00-3:15

[2:00] Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

[2:12] Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

[2:25] From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

[2:37] Linear Fundamental Matrix Estimation from 7 or 5 Points

[2:50] OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

[3:02] VGGT-Ω

(ends 3:15 PM)

Oral Session 4B: Embodied & Agentic Intelligence [2:00-3:15]

Orals 2:00-3:15

[2:00] CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

[2:12] NitroGen: An Open Foundation Model for Generalist Gaming Agents

[2:25] PAI-Bench: A Comprehensive Benchmark For Physical AI

[2:37] RefAV: Towards Planning-Centric Scenario Mining

[2:50] SoccerMaster: A Vision Foundation Model for Soccer Understanding

[3:02] VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

(ends 3:15 PM)

Oral Session 4C: Spatial Reasoning [2:00-3:15]

Orals 2:00-3:15

[2:00] Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras

[2:12] GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials

[2:25] InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

[2:37] MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping

[2:50] Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation

[3:02] Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

(ends 3:15 PM)

Oral Session 4D: Visual Segmentation [2:00-3:15]

Orals 2:00-3:15

[2:00] INSID3: Training-Free In-Context Segmentation with DINOv3

[2:12] MARCO: Navigating the Unseen Space of Semantic Correspondence

[2:25] PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

[2:37] R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

[2:50] The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

[3:02] VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

(ends 3:15 PM)

3:15 p.m.

Break:

Courtesy Break

(ends 3:30 PM)

3:30 p.m.

Meeting:

PAMI TC

(ends 4:30 PM)

4:15 p.m.

Poster Setup:

Poster Setup

(ends 4:45 PM)

4:45 p.m.

Demonstration:

Demos Session 4

(ends 6:45 PM)

Poster Session 4 & Exhibit Hall w/ Coffee Break [4:45-6:45]

Posters 4:45-6:45

Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Linear Fundamental Matrix Estimation from 7 or 5 Points

OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

VGGT-Ω

CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

NitroGen: An Open Foundation Model for Generalist Gaming Agents

PAI-Bench: A Comprehensive Benchmark For Physical AI

RefAV: Towards Planning-Centric Scenario Mining

SoccerMaster: A Vision Foundation Model for Soccer Understanding

VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras

GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials

InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping

Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

INSID3: Training-Free In-Context Segmentation with DINOv3

MARCO: Navigating the Unseen Space of Semantic Correspondence

PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion

Lite Any Stereo: Efficient Zero-Shot Stereo Matching

MuM: Multi-View Masked Image Modeling for 3D Vision

ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors

TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference

Sparse–View Localization via Online Neural 3D Regression

Dynamic Visual SLAM using a General 3D Prior

Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization

FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency

UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling

Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations

TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction

Global Structure-from-Motion Meets Feedforward Reconstruction

POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

DuoGen: Towards Autonomous Interleaved Multimodal Generation

Vibe Spaces for Creatively Connecting and Expressing Visual Concepts

StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

CREward: A Type-Specific Creativity Reward Model

LumiX: Structured and Coherent Text-to-Intrinsic Generation

Synthetic Curriculum Reinforces Compositional Text-to-Image Generation

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models

LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models

UniVerse: Empower Unified Generation with Reasoning and Knowledge

UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

TGT: Text-Grounded Trajectories for Locally Controlled Video Generation

RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

FlowFixer: Towards Detail-Preserving Subject-Driven Generation

TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

FEAT: Fashion Editing and Try-On from Any Design

Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation

PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

PowerCLIP: Powerset Alignment for Contrastive Pre-Training

MoBind: Motion Binding for Fine-Grained IMU–Video Pose Alignment

The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models

Tackling Model Bias via Game-theoretic Multi-agent Collaboration Framework for Hateful Meme Classification

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction

Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

Hierarchical Process Reward Models are Symbolic Vision Learners

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

SG-LoRA: Semantic-guided LoRA Parameters Generation

AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation

Reframing Long-Tailed Learning via Loss Landscape Geometry

Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning

DC-Merge: Improving Model Merging with Directional Consistency

TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery

Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset

NEC-Diff: Noise-Robust Event-RAW Complementary Diffusion for Seeing Motion in Extreme Darkness

Towards Persistence: Learning Topological Constraints for Event-based Small Object Detection

Geometric-Photometric Event-based 3D Gaussian Ray Tracing

EventDrive: Event Cameras for Vision-Language Driving Intelligence

EventGait: Towards Robust Gait Recognition with Event Streams

MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics

SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

Semantic Audio-Visual Navigation in Continuous Environments

Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation

FLARE: A Failure-Aware Framework for Autonomous Correction and Recovery in Visual-Language Robotic Manipulation

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

General Process Reward Modeling for Robotic Reinforcement Learning

DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation

Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation

Thinking in 360°: Humanoid Visual Search in the Wild

Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation

MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues

Cycle-Consistent Tuning for Layered Image Decomposition

RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification

NEAF: Natural Image Editing with Attention Fusion for Generalizable Test-time Optimization in Text-Guided Image Editing

OntoAug: Rethinking Generative Data Augmentation via Ontology Guidance

Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere

4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction

Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering

Intrinsic Image Fusion for Multi-View 3D Material Reconstruction

PackUV: Packed Gaussian UV Maps for 4D Volumetric Video

Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects

PhysGaia: A Physics-aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis

MatSpray: Fusing 2D Material World Knowledge on 3D Geometry

OMoBlur: An Object Motion Blur Dataset and Benchmark for Real-World Local Motion Deblurring

Hybrid Agents for Image Restoration

Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation

Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration

Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising

PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors

UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement

FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping

MDS-VQA: Model-Informed Data Selection for Video Quality Assessment

Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events

Disentanglement-wise Image Dehazing through Cross-Domain Manifold Consensus

Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model

EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction

Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning

LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding

TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning

Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation

Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning

PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation

Test-Time Attention Purification for Backdoored Large Vision Language Models

AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

Towards Robust Multimodal Large Language Models Against Jailbreak Attacks

R^2TUA: Reconstruction-residual Based Targeted and Untargeted Attack Against Text-Image Person Re-Identification

When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

Enhancing Part-Level Point Grounding for Any Open-Source MLLMs

MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction

YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction

How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?

Modeling Cross-vision Synergy for Unified Large Vision Model

Beyond Missing Modalities: Hypergraph Conditioned Diffusion for Uncertainty-Aware Multimodal Emotion Recognition

Rosetta Stone For Unified MLLMs: A Unified Tokenizer to Decipher Understanding and Generation

MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

AMusE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis

CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification

EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer’s Disease

Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

Cross-View Distillation and Adaptive Masking for Incomplete Multi-View Multi-Label Classification

Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation

Multimodal Distribution Matching for Vision-Language Dataset Distillation

M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

Text-Driven 3D Hand Motion Generation from Sign Language Data

Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection

Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling

Training-free Motion Factorization for Compositional Video Generation

Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing

V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

PoseAnything: General Pose-guided Video Generation with Part-aware Temporal Coherence

FastHybrid: Accelerating Hybrid Autoregressive Image Generation with Lookahead and Guided Decoding

DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation

AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Flow Matching for Multimodal Distributions

From Scale to Speed: Adaptive Test-Time Scaling for Image Editing

ReasonEdit: Towards Reasoning-Enhanced Image Editing Models

Cross-Subject EEG-to-Video Reconstruction and Beyond

Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation

Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution

VABench: A Comprehensive Benchmark for Audio-Video Generation

Relightful Video Portrait Harmonization

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

DVAR: Dynamic Visual Autoregressive Modeling for Image Super-Resolution

Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers

LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

EMR-Diff: Edge-aware Multimodal Residual Diffusion Model for Hyperspectral Image Super-resolution

RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution

One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution

HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution

Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion

CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation

Pressure2Motion: Hierarchical Human Motion Reconstruction from Ground Pressure with Text Guidance

From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation

W2W: Language-Model-Based Trajectory Prediction with Reinforcement Learning

ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis

Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models

Unified Number-Free Text-to-Motion Generation Via Flow Matching

Generative Diffusion Priors for 3D Mapping of the Dark Universe

FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation

DiffuView: Multi-View Diffusion Pretraining for 3D Aware Robotic Manipulation

Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

Dual Ascent Diffusion for Inverse Problems

Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Diffusion Transformers

Spatial-Spectral Residuals Informed Diffusion Neural Operator for Pan-sharpening

PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion

Pixel Motion Diffusion is What We Need for Robot Control

ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models

M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding

UNICBench: UNIfied Counting Benchmark for MLLM

CaptionQA: Is Your Caption as Useful as the Image Itself?

EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy

VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding

Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

Socratic-Geo: Synthetic Data Generation and Cross-Modal Geometric Reasoning via Multi-Agent Interaction

HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

ReLaGS: Relational Language Gaussian Splatting

3D-IDE: 3D Implicit Depth Emergent

FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning

Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory

4DP-QA: Scalable QA for 4D Perception in Vision Language Models

LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval

PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation

CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale

Camouflage-aware Image-Text Retrieval via Expert Collaboration

TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval

TIGER: A Unified Framework for Time, Images and Geo-location Retrieval

Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos

VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions

Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation

AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting

Unique Lives, Shared World: Learning from Single-Life Videos

Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding

Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding

SVAgent: Storyline-guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding

Structural Graph Probing of Vision–Language Models

Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition

MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

Relational Visual Similarity

PointCNN++: Performant Convolution on Native Points

Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching

LitePT: Lighter Yet Stronger Point Transformer

SuP: Sub-cloud Driven Point Cloud Registration

PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration

Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination

MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration

GEM: Generating LiDAR World Model via Deformable Mamba

Hybrid Robust Collaborative Perception with LiDAR-4D Radar Fusion under Adverse Weather Conditions

Task-Driven Implicit Representations for Automated Design of LiDAR Systems

Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models

GazeShift: Unsupervised Gaze Estimation and Dataset for VR

Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

Soft Modality-Guided Expert Specialization in MoE-VLMs

CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models

Retrieving Counterfactuals Improves Visual In-Context Learning

AutoRegressive Generation with B-rep Holistic Token Sequence Representation

VecGlypher: Unified Vector Glyph Generation with Language Models

NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code

Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework

ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement

GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents

ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

End-to-End Hyper-Relational Information Extraction for Engineering Diagrams via Dynamically Tokenized Relation Transformer

When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards

Bias at the End of the Score

PECCVAI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks

Dynamic Token Reweighting for Robust Vision-Language Models

COPYLENS: Towards Copyrighted Characters Infringement Detection via Copyright-Aware Prompt Learning

Closed-Form Concept Erasure via Double Projections

Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning

Stake the Points: Structure-Faithful Instance Unlearning

Federated Active Learning Under Extreme Non-IID and Global Class Imbalance

FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients

FedCART: Tackling Long-Tailed Distributions in Federated Adversarial Training via Classifier Refinement

Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations

Fully Decentralized Certified Unlearning

Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift

Towards Streaming Referring Video Segmentation via Large Language Model

Multi-speaker Attention Alignment for Multimodal Social Interaction

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation

UniCompress: Token Compression for Unified Vision–Language Understanding and Generation

StreamingTOM: Streaming Token Compression for Efficient Video Understanding

SCoRe: Salience-Coverage Reduction for Vision Token Pruning in Vision-Language Models

VLM-PTQ: Efficient Post-Training Quantization for Large Vision-Language Models

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization

Rethinking Token Reduction for Large Vision-Language Models

Prototype-based Causal Intervention for Multi-Label Image Classification

FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization

Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty

Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods

Revisiting Learning with Noisy Labels: Active Forgetting and Noise Suppression

PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning

Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data

Enhancing Out-of-Distribution Detection with Extended Logit Normalization

Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures

Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World

RAG-TP: A General Framework for Vehicle Trajectory Prediction via Retrieval-Augmented Generation

Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving

Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction

WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving

PlannerRFT: Reinforcing Diffusion Planners through Closed-Loop and Sample-Efficient Fine-Tuning

MARIS: Marine Open-Vocabulary Instance Segmentation

XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening

Training-Free Open-Vocabulary Camouflaged Object Segmentation via Fine-Grained Object Binding and Adaptive Hybrid Prompt

M⁴-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP

Mixture of Prototypes for Test-time Adaptive Segmentation

Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning

ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark

Decouple Your Discovery and Memory in Continual Generalized Category Discovery

Beyond the Static World: Continual Category Discovery under Visual Drift

Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data

CHEEM: Continual Learning by Reuse, New, Adapt and Skip - A Hierarchical Exploration-Exploitation Approach

Exemplar-Free Continual Learning for State Space Models

A Faster Path to Continual Learning

Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay

BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling

SyncDreamer: Controllable and Expressive Avatar Generation Beyond the Talking Head

PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing

UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving

Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning

Unifying Language-Action Understanding and Generation for Autonomous Driving

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

CGHair: Compact Gaussian Hair Reconstruction with Card Clustering

HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars

Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction

RelightAnyone: A Generalized Relightable 3D Gaussian Head Model

Feed-forward Gaussian Registration for Head Avatar Creation and Editing

Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models

Same Attention, Different Truths: Put Logit-Lens over Visual Attention to Detect and Mitigate LVLM Object Hallucination

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

Lyapunov Probes for Hallucination Detection in Large Foundation Models

Captain Safari: A World Engine with Pose-Aligned 3D Memory

Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation

CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes

RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space

RigMo: Unifying Rig and Motion Learning for Generative Animation

LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models

WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification

Detect Anything via Next Point Prediction

Text-guided Feature Disentanglement for Cross-modal Gait Recognition

Distribution-Aligned Multimodal Fusion for Robust Object Detection

PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection

Portable Active Learning for Object Detection

Efficiency Follows Global-Local Decoupling

VRCLIP: Multimodal Canonical Correlation Alignment for CLIP-Driven Vision-Radio Person Re-Identification

EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection

Expert-Teacher-Student Collaborative Learning for Domain Adaptive Object Detection

CI-VID: A Coherent Interleaved Text-Video Dataset

Generalizable Video Quality Assessment via Weak-to-Strong Learning

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Compositional Transformation Reasoning for Composed Video Retrieval

UniVBench: Towards Unified Evaluation for Video Foundation Models

NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers

InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space

ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers

MeanFlow Transformers with Representation Autoencoders

DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

FARMER: Flow AutoRegressive Transformer over Pixels

Probabilistic Precipitation Nowcasting with Rectified Flow Transformers

FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing

High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning

3D-Object Perception Transformer (3PT)

SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

Spe-BEVHead: Rethinking the Detection Head Design for Bird’s-Eye-View Object Detection

Unsupervised Multi-agent and Single-agent Perception from Cooperative Views

Zoo3D: Zero-Shot 3D Object Detection at Scene Level

Beyond Appearance: Camouflaged Object Detection via Geometric Structure

SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors

AceTone: Bridging Words and Colors for Conditional Image Grading

Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

UI-Lens: Assessing General MLLMs’ Potential to Automate UI Display Quality Assurance

Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

Linking Perception, Confidence and Accuracy in MLLMs

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward

Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty

Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction

Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices

GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing

PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes

EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images

SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering

GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction

FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting

TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM

SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping

FastGS: Training 3D Gaussian Splatting in 100 Seconds

BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting

ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM

BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction

FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes

ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection

VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

UVU: Improving Multimodal Understanding via Vision-Language Unified Autoregressive Paradigm

PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model

OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning

HoneyBee: Data Recipes for Vision-Language Reasoners

VisPlay: Self-Evolving Vision-Language Models

Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts

Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

ApET: Approximation-Error Guided Token Compression for Efficient VLMs

Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

Vision Transformers Need More Than Registers

Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation

PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion

AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models

ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning

HTTM: Head-wise Temporal Token Merging for Faster VGGT

Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery

Self-Attention Driven Tensor Representation for High-Order Data Recovery

PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching

MOGeo: Beyond One-to-One Cross-View Object Geo-localization

Homaloidal parametrization for detecting critical two-view configurations

AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

Asking like Socrates: Socrates helps VLMs understand remote sensing images

GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

Let VLMs Grade Their Own Thoughts: A Self-Quantification Approach to Reasoning-Aware Reward Modeling

SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning

Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

VideoSSR: Video Self-Supervised Reinforcement Learning

Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion

MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement

Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps

NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

Universal-to-Specific: Dynamic Knowledge-Guided Multiple Instance Learning for Few-Shot Whole Slide Image Classification

SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models

Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation

Streamlined Knowledge Distillation

Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation

Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis

240FPS Stereo Vision from Monocular Mixed Spikes

D^2-FOSA: Dual-Diffusion Guided EEG-to-Image Reconstruction with Frequency-Oriented Semantic Alignment

Self-Diffusion Driven Blind Imaging

Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation

Solvability of the Viewing Graph Under the Affine Camera Model

DiffBMP: Differentiable Rendering with Bitmap Primitives

Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling

Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels

Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis

Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics

FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics

TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis

OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation

Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition

Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Seeing Depth Through Frequency and Motion: A Progressive Training Paradigm for Monocular Depth Estimation

GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation

B^3-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates

PE3R: Perception-Efficient 3D Reconstruction

GS-ASM: 2DGS-Supervised Active Stereo Matching

Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery

Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

AirSim360: A Panoramic Simulation Platform within Drone View

Radar-Guided Polynomial Fitting for Metric Depth Estimation

UniDAC: Universal Metric Depth Estimation for Any Camera

SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation

I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training

EI-Part: Explode for Completion and Implode for Refinement

MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing

Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning

FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation

X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion

Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification

Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion

TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond

Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D

WorldGen: From Text to Traversable and Interactive 3D Worlds

ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation

SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

HiFi-BRep: High-Fidelity Latent Representation for Robust B-Rep Generation

PhysGen: Physically Grounded 3D Shape Generation for Industrial Design

Perceptual 3D Simulation With Physical World Modeling

EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Active Intelligence in Video Avatars via Closed-loop World Modeling

Enhancing Spatial Understanding in Image Generation via Reward Modeling

Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts

Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization

Learning Latent Proxies for Controllable Single-Image Relighting

MoVie: Broaden Your Views with Human Motion for Action Detection

MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics

LAOF: Robust Latent Action Learning with Optical Flow Constraints

DarkAct: A RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition

Random Wins All: Rethinking Grouping Strategies for Vision Tokens

Steering Where to Diffuse: Generative Modeling of Phenotypic Response Simulation with Steered Diffusion Bridge

Deep Feature Deformation Weights

Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment

RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models

Coupling Liquid Time-Constant Encoders with Modern Hopfield Memory

Stronger Normalization-Free Transformers

HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm

Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

Convolutional Neural Networks Driven by Content Similarity

MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

MVP: Multiple View Prediction Improves GUI Grounding

Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence On Mobile Devices

OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles

Beyond Weak Supervision: MLLMs-Guided Graded Knowledge Distillation for Unsupervised Camouflaged Object Detection

Detecting Unknown Objects via Energy-based Separation for Open World Object Detection

Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection

SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models

Parameterized Prompt for Incremental Object Detection

SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation

Partial Weakly-Supervised Oriented Object Detection

Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation

Towards Robust Multi-Modal Semantic Segmentation with Teacher-Student Framework and Hybrid Prototype Distillation

REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion

Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation

From Softmax to Dirichlet: Evidential Learning for Semi-supervised Semantic Segmentation

Particulate: Feed-Forward 3D Object Articulation

HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement

Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild

MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction

MUFASA: A Multi-Layer Framework for Slot Attention

ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing

MOMO: Mars Orbital MOdel Foundation Model for Mars Orbital Applications

Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective

GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

GeoSANE: Learning Geospatial Representations from Models, Not Data

Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image

RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition

SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples

AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents

Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors

Verifying Neural Network Robustness with Dual Perturbations

Defending Unauthorized Model Merging via Dual-Stage Weight Protection

AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal

On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks

Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

Exploring Conditions for Diffusion Models in Robotic Control

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation

TSTM: Temporal Segmentation for Task-relevant Mask in Visual Reinforcement Learning Generalization

Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs

AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking

GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion

MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence

Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

IGen: Scalable Data Generation for Robot Learning from Open-World Images

Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

TGTrack: Temporal Generative Learning for Unified Single Object Tracking

GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry

Generalizable Structure-Aware Keypoint Correspondence for Category-Unified 3D Single Object Tracking

Generative Point Tracking and Forecasting

RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition

GMT: Effective Global Framework for Multi-Target Multi-Camera Tracking

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies

DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification

Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning

Phrase-grounded APO for Improving Chest X-ray Report Generation

Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis

OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy

CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction

Building Robust Vision Encoders for Cross-Dataset Evaluation in Immunofluorescent Microscopy

H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction

Dual-Level Hypergraph Generation for Addressing Feature Scarcity in Whole-Slide Image Classification

Temporal Inversion for Learning Interval Change in Chest X-Rays

JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction

PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery

Anatomical Domain Shifts: Test-time Heterogeneous Adaptation for 3D Human Pose Prediction

Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition

Learning 3D Shape Fidelity Metric from Real-world Distortions

BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors

FUN REC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network

Bringing Your Portrait to 3D Presence

FLOW: Feature-Level Optimal Warping for Generalized Remote Physiological Measurement

One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression

BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection

Multi-Prototype Compactness and Boundary-Aware Synthesis for Unsupervised Anomaly Detection

PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection

Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning

SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

Learning Spatial-Temporal Consistency for 3D Semantic Scene Completion

Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction

Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation

OccAny: Generalized Unconstrained Urban 3D Occupancy

Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving

ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation

(ends 6:45 PM)

5 p.m.

Art Gallery Tour with Curator and Artists [5:00-5:30]

(ends 5:30 PM)

7 p.m.

Reception:

Reception

(ends 9:00 PM)

SUN 7 JUN

7:30 a.m.

Findings Poster Session 3 [7:30-9:00]

Posters 7:30-9:00

Advancing Open-Set Detection and Segmentation via Disentangled Representations

Disrupting Positional Encoding for Effective Open Set Recognition

ODOV: Benchmark the Open-Domain Open-Vocabulary Object Detection

Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection

Region-Aware Hierarchical Sub-Feature Alignment for Robust EEG-Based Visual Decoding

Super Sparse DETR: YOLO-Competitive Convergence and Acceleration

Bi-Level Optimization for Single Domain Generalization

SA-Matching DETR: A Lightweight Transformer Detector with Enhanced Scale Adaptive Matching

Asymmetric Collaborative Distillation for Asymmetric Image Retrieval

OKGraph: Online Knowledge Graph Probing for Open-vocabulary Recognition

Large Multimodal Models as General In-Context Classifiers

Indexing Multimodal Language Models for Large-scale Image Retrieval

EvoPrompt-ReID: A Bilevel Optimization Framework for Prompt-Encoder Co-evolution in Image Re-Identification

Leveraging Arbitrary Data Sources for AI-Generated Image Detection Without Sacrificing Generalization

OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

PTAD: Pose and Texture Agnostic Anomaly Detection

Mitigating the ID–OOD Tradeoff in Open-Set Test-Time Adaptation

Towards Universal Open-Set Visual Font Recognition Via Augmented Synthetic Similarity

VR-CLIP: Visual Refinement of CLIP for Zero-Shot Semantic Segmentation

Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection

Once for All: An End-to-End Paradigm for VLM-Based Domain-Generalized Object Detection

SoREL: Soft-Label Refurbishment with Ensemble Learning for Noisy Long-Tailed Classification

Unsupervised Graph Partitioning Framework for Background Suppression in Multi-Query Vehicle Re-Identification

Revisiting Real-Time Detection Transformer with Efficient Encoder Design

PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views

Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking

DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer

SpHOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Neural Networks

Complexity of Linear Regions in Self-supervised Deep ReLU Networks

Decoupled Sub-Feature Uncertainty Modeling for Robust Multimodal Representation Learning

Pre-trained Models Can Count (Almost): Exploring Quantitative Structure in Visual Representations

A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning

HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping

Seeing Through Fog: Towards Fog-Invariant Action Recognition

Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models

FedAR: Attribute-Guided Representation Learning for Heterogeneous Federated Learning

ZeroDiff++: Balancing Semantic Diffusion Dynamics for Robust Zero-Shot Learning

Equivariant Unsupervised Object Detection with Learnable Riesz Transform and Composite Spatial Transformers

MART: Mechanism-disentanglement Anchor-Routed Training for Learning with Open-World Noisy Data

Online Interpretable Matrix Decomposition for Large-Scale Streaming Data

Object-Centric Vision Token Pruning for Vision Language Models

BrainStack: Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding

BiomedHELIX: HiErarchical-Local Interaction eXploration for Biomedical Vision-Language Models

From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness

Seeing Helps Reasoning in Language Models

Layer Embedding Deep Fusion Graph Neural Network

From Horizontal to Rotated: Cross-View Object Geo-Localization with Orientation Awareness

LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition

GaussFiller: Unleashing VLM-Expert Guidance for 3D Scene Completion with 3D Gaussian Splatting

GEODE: Geometry-Guided Discrete Diffusion for Open-Vocabulary 3D Scene Graph Generation

Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps

SCP: Spatial Causal Prediction in Video

SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery

Entropy-Constrained Information Optimal Transport for Multi-View Geo-Localization

Revisiting Image Manipulation Localization under Realistic Manipulation Scenarios

Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning

CADRNet: Cognitively-Inspired Active Vision for 3D Reasoning Segmentation via Differentiable Rendering

Direct Language Embedding Enables Gaussian Splatting for Large Scenes

CogNet: Multi-Agent Collaborative Reasoning and Verification for Salient Object Ranking

MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

Towards Generalization of Scene Text Tampering Localization via Causal Invariance

Background-Compensated Audio-Visual Semantic Modulation Framework for Audio-Visual Event Localization

POMA-3D: The Point Map Way to 3D Scene Understanding

Gazemo: Mimicking Human Saccades via Foveal-Peripheral Feature Modeling for Lightweight Semantic Segmentation

MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation

Prompt-driven Small Object Instance Segmentation in Earth Observation

OV-Stitcher: A Global Context-Aware Framework for Training-Free Open Vocabulary Semantic Segmentation

SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

Towards Complete Activation: Foreground-Background Multi-Perspective Guided Cross-Support for Few-Shot Segmentation

MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation

ROSE: Retrieval-Oriented Segmentation Enhancement

ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation

Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation

Autoregressive Universal Video Segmentation Model

FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning

Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Weakly-Supervised Referring Video Object Segmentation Through Text Supervision

TALENT: Target-Aware Efficient Tuning for Referring Image Segmentation

DeepDP-TGMM: Amortized Non-Parametric Clustering for Hyperspherical Self-Supervised Representations

Proto-SaGa: Prototype-based 3D Scene Segmentation with Semantic-aware Gaussian Grouping

RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation

Instruction-Focus-Prompt: Semantics-Driven Structural Prompts for Universal SAM Segmentation

Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning

VirPro: Visual-Referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection

A Single Pixel is All You Need: Weakly Supervised Medical Image Segmentation using Discrete Denoising Diffusion Models

AdaMeta: Adaptive Meta-Learning with Dynamic Task Relational Inference for Few-shot learning

NRFP: A Noise-Robust Feature Plugin for Source-Free Domain Adaptation

Label-Agnostic Category Discovery

Learning from Label Proportion with Dual-Proportion Constraints

Test-Time Distillation for Continual Model Adaptation

Another BRIXEL in the Wall: Towards Cheaper Dense Features

Task-Specific Knowledge Improves Generalization: A Logits-Based Framework for Continual Learning of Vision-Language Models

DARN: Dynamic Adaptive Regularization Networks for Efficient and Robust Foundation Model Adaptation

Training-Free Uncertainty-guided Logit Adjustment for Few-Shot Class-Incremental Learning

Model Merging on Loss Landscapes: A Geometric Perspective

DGD: Density Gradient-guided Diffusion for Long-Tailed Clustering

DGP: Dynamic Gradient Projection for Task-Adaptive Continual Learning

Bootstrap Your Own Classifier: Your Pretrained Vision Models are Secretly Strong Continual Learners

Memory-efficient Continual Learning with Prototypical Exemplar Condensation

Continual Adaptation of Vision Foundational Models for Semantic Segmentation in Adverse Weather

ReMem: A Dynamic Memory Evolution Detector for Zero-Shot Anomaly Detection

CurrMix: Curriculum-Enhanced MixUp for Long-Tailed Visual Recognition

Class-Aware Drift Compensation for Non-Uniform Semantic Shift in Continual Learning

Onboarding Without Forgetting: Hypernetwork Personalization with Data-Free Replay for Personalized Federated Learning

FedNPC: Stochastic Noise-driven Post-hoc Classifier Calibration Method for Federated Long-tailed Learning

Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection

MuSCM: Mutual Spatial Correlation Mapping for Class Incremental Detection Transformer

AFCL: Achieving Spatio-Temporal Invariance to Data Heterogeneity in Federated Continual Learning

SAGA: Semantic Anchor-Guided Alignment for Multi-Source Domain Adaptive Object Detection

DEED: Dual-Channel Enhanced Ensemble Distillation for Uncertainty-Aware Recognition

Wake the Sleeping Weights: Sparsely-Activated Continual Test-Time Adaptation for Medical Image Segmentation

Dynamic Pseudo-Label Assignment and Consistent Prototypical Learning for Few-Shot Class-Incremental Learning

Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters

Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery

Frequency-Guided Iterative Bi-directional Exchange Network for Cross-Domain Few-Shot Segmentation

Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss

SCOPE: Spatially Ordered Continual Learning for 3D Segmentation

Learning to Propose Pose for Category-Agnostic Objects via Joint Refinement with Co-Matching Supervision

Is Prompt Selection Necessary for Task-Free Online Continual Learning?

ReConText3D: Replay-based Continual Text-to-3D Generation

Now You See It, Now You Don't: Instant Concept Erasure for Safe Text-to-Image and Video Generation

ECOC-IL: Robust and Efficient Label LDP for Imbalanced Learning

Safe Codebook: Token-Level Moderation for Safer Visual Autoregressive Generation

Towards Universal and Lightweight Coverless Image Steganography with Multimodal Large Language Models Assistance

A Visual Semantic Adaptive Watermark Grounded by Prefix-Tuning for Large Vision-Language Model

TriGuard-FL: A User-Centric Trust Triad in Federated Learning via Auditable Data, Verifiable Contributions, and Antidote-Driven Mitigation

Assessing the Reliability of Image Quality Metrics and Mitigating Quality Bias in Generative Models

Efficient Unlearning through Maximizing Relearning Convergence Delay

Robust Continual Unlearning against Knowledge Erosion and Forgetting Reversal

Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models

FedOrtho: Efficient Federated Unlearning Via Orthogonal Convolution and Adaptive Soft Pruning

Improving Synthesized Image Detection by Disentangling Generator-Shared and Generator-Specific Image Artifacts

PLR-Gate: Real-Time Gradient Privacy Assessment and Gated Transmission for Secure Federated Learning

A Unified Privacy-Utility Framework for Collaborative Inference via Randomized Smoothing

Verify Claimed Text-to-Image Models Via Boundary-Aware Prompt Optimization

Towards Robust Content Watermarking Against Removal and Forgery Attacks

Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment

CBDC: Clean Bias Direction Construction for Unsupervised Debiasing in Vision-Language Models

Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Leveraging Unlabeled Data from Unknown Sources via Dual-Path Guidance for Deepfake Face Detection

SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

When Agents Steer Human Perception: How AI-Selected Images Can Covertly Alter Disagreements

UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

On the Group Disparities Arising from Machine Unlearning

Count What Repeats: Period-Adaptive Multi-Scale Consistency for Self-Supervised Repetitive Action Counting

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

ConfDiff: Confidence-Guided Representation Diffusion for Video Moment Retrieval

Evolutionary Multi-Agent Collaboration for Real-World Video Face Restoration

STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding

HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression

D^2-STX: Decoupling Spatial-Temporal Cross-Attention for Dual-branch Repetitive Action Counting

Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning

Mamba-VMR: Multimodal Query Augmentation Via Generated Videos for Precise Temporal Grounding

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

TP^2-DETR: Unlocking Deformable DETR for Zero-Shot Temporal Action Proposal Generation with Temporal Feature Pyramids

QENN: A Quantum Entanglement-Inspired Neural Network for Interaction and Relationship Prediction in Story Videos

FineGrade: A Rule-Consistent Scoring Framework for Fine-Grained Action Quality Assessment

One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition

REBA: Residual Mixture-of-Experts and Bidirectional Video–Text Alignment for Better Fine-grained Weakly Supervised Video Anomaly Detection

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

VIDEOP2R: Video Understanding from Perception to Reasoning

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models Via Spatial-Temporal Forest Modeling

HARP: Hierarchical Adaptive Ranking with Probabilistic Modeling for Skill Determination

STORM: End-to-End Referring Multi-Object Tracking in Videos

Extending Segment Anything Model 2 to Multi-Object Tracking by Optimizing Hierarchical Trajectory Memory

NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation

MOSSTrack: Modality-Specific Spatio-Temporal Context Learning for RGB-T Tracking

Temporally Consistent Long-Term Memory for 3D Single Object Tracking

DM^3T: Harmonizing Modalities via Diffusion for Multi-Object Tracking

IRDINO: Adapting DINOv3 with Second-Order Motion Awareness for Moving Infrared Small Target Detection

SemanticMoments: Training-Free Motion Similarity via Third Moment Features

TAPNext++: What’s Next for Tracking Any Point (TAP)?

ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction

100Editor: 100+ Views per Batch and Minute-Scale View-Consistent 3D Editing

DIAMOND-SSS: Diffusion-Augmented Multi-View Optimization for Data-efficient SubSurface Scattering

Reason-SVG: Enhancing Structured Reasoning for Vector Graphics Generation with Reinforcement Learning

Harmonized Multi-Layer Text-to-Image Generation with Generative Priors

StabiGS: Video Stabilization through Rendering-Aware Trajectory Optimization in 3DGS-Reconstructed Scenes

More Traces Better: Unified Artifact Modeling for Generalizable and Robust AI-generated Image Detection

Predicting Gene Expression in Spatially Resolved Transcriptomics Across Samples Through Probabilistic Fusion of Hierarchical Histology and Spatial Information

Don't Let the Information Slip Away

FraQAT: Quantization Aware Training with Fractional Bits

MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness

Video Inspector: An Agentic-RL Framework and Benchmark for Human-Aligned Generative Video Evaluation

From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage

CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

PSIM: Perceptual Similarity Index Measure

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

GreenPlanner: Practical Floorplan Layout Generation via an Energy-Aware and Function-Feasible Generative Framework

Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Spatial and Temporal Representation

WideEye: Achieving Wide Field-of-view Traffic Video Analytics With Dynamic Orientation Adaptation

Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback

Pose-dIVE: Pose-Diversified Augmentation for Person Re-Identification

Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation

BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery

CLIPtone-GO: Geometry-Aware, Gradient-Orthogonalized Text-Guided Color Tone Adjustment

Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems

Exploiting the Source-Asymmetry Confidence Gap for Generalizable AI-Generated Image Detection

CineMatte: Background Matting for Virtual Production and Beyond

GATE: Gaussian-Attentive Transformer for Uncertainty-Aware Age Estimation

GRAFT: Graph-Based Affordance Transfer via Part Correspondence

Face Time Traveller: Travel Through Ages Without Losing Identity

KGGAT: Knowledge-Guided Graph Attention Network for Multi-Label Image Classification

IntentEdit: Multi-Agent Reasoning for Intent-Driven Complex Image Editing

Gen-n-Val: Agentic Image Data Generation and Validation

SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding Via Functional Structure Units

DARTS: Distance-Aware Robust Training for Selective Classification

Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection

PestVL-Net: Enabling Multimodal Pest Learning Via Fine-grained Vision-Language Interaction

Plug-and-Play Dynamic In-context Learning with Stochastic Regularization for Screen Content Image Super-Resolution

EscherNet++: A Scalable Multi-View Framework for Amodal Completion, Novel View Synthesis and Feed-Forward 3D Reconstruction

Human-Intervention Segmentation via Federated Intent Embedding and Multi-Mask Recommendation

Di3PO - Diptych Diffusion DPO for Targeted Improvements in Image Generation

Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling

Learning to Select, Learning to Judge: Active Preference Alignment for Mars Terrain Segmentation

Attention Never Lie: Visual Attention Defocus Reveals and Rectifies Hallucinations in MLLMs

Organizing Unstructured Image Collections using Natural Language

Thinking with Blueprints: Assisting Vision–Language Models in Spatial Reasoning via Structured Object Representation

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

Visual Reasoning Through Tool-Supervised Reinforcement Learning

VSI: Visual–Subtitle Integration for Keyframe Selection to Enhance Long Video Understanding

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

Myopia Rectification: KV Cache Pruning for MLLMs Via Dynamic Attention Subsidy and Token Reclamation

NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

Logical Consistency Optimization for Few-Shot Weakly Supervised Video Anomaly Detection

VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering

COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

CoVCR: Bridging Visual Narrative Gaps via Context Generation for Robust Commonsense Reasoning

MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

VoQA: Visual-only Question Answering

Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

Language-Augmented Semantic Priors for B-Spline Surface Fitting

Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models

Trajectory-Diversity-Driven Robust Vision-and-Language Navigation

V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

Distilling Counterfactual Reasoning from Language to Vision: Causal Graph-Guided Post-Training for Video Understanding

Exploring Physics-aware Video Generation through Reinforcement Learning with Autoregressive Tokens

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models

GDP: Graph-Based Dynamic Personalization for Multimodal Large Language Models

AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Experts

Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

RVLF: A Reinforcing Vision–Language Framework for Gloss-Free Sign Language Translation

Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning

AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

Fine-Grained Visual Prompt and Region Self-Distillation for Retrieval-Augmented VQA

RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Modality-Aware Bit Allocation for Mixed-Precision Quantization of Vision-Language Models

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Analyzing and Enhancing Visual Learning in LLM-based Radiology Report Generation

DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

Semantic Guided Feature Disentanglement and Reconstruction for Domain Adaptive Object Detection

Dual-Modality Anchor-Guided Filtering for Test-Time Prompt Tuning

Towards Efficient Multimodal Unified Reasoning Model via Model Merging

DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

Can Textual Reasoning Improve the Performance of MLLMs on Fine-Grained Visual Classification?

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

MASS: Motion-Aware Spatial–temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Beyond Syntax: Action Semantics Learning for App Agents

Learning to Select Visual In-Context Demonstrations

CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

Mull-Tokens: Modality-Agnostic Latent Thinking

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

It’s Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

Uncertainty-Guided Graph Formulation via MWIS for Token Pruning in LVLMs

From Alignment to Reason: Multi-Agent Debate for Tactical Badminton Video Retrieval

Distilling Out-of-Distribution Knowledge from Large Language Models for CLIP Generalization

Multimodal Reasoning with Explicit Reasoning Patterns and Rewards

VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection

MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Model

Recursive Think-Answer Process for LLMs and VLMs

GenSRL: Generative Spatiotemporal Representation Learning for Ophthalmic Prognosis Prediction

Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

Mitigating Vision-Text Order Bias in Vision-Language Model

MARS-RL: Enhancing Multi-Agent RAG Systems for Multi-Modal Documents via Strategic Reasoning with Reinforcement Learning

Beyond Single Object: Learning 3D Relations with Large Language Models

CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Attention-Space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs

UnrealSpace: Analyzing Spatial Understanding and Reasoning in Controllable Simulation

Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning

Hierarchical Textual Knowledge for Enhanced Image Clustering

Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?

Entropy-Based Visual Re-perception Inference for Multimodal Models

VACoT: Rethinking Visual Data Augmentation with VLMs

Open World Image Aesthetic Assessment

coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation

PosterGen: Aesthetic-Aware Multi-Modal Paper-to-Poster Generation Via Multi-Agent LLMs

Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Why MLLMs Struggle to Determine Object Orientations

VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal Reinforcement Learning

Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Alleviating Hallucinations in Large Vision-Language Models via Decoding-Time Perturbation Adaptation

RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning

(ends 9:00 AM)

Registration / Badge Pickup

(ends 1:00 PM)

Break:

Breakfast

(ends 9:00 AM)

9 a.m.

Oral Session 5A: Dynamic Perception [9:00-10:15]

Orals 9:00-10:15

[9:00] Evidential Neural Radiance Fields

[9:12] Global-Aware Edge Prioritization for Pose Graph Initialization

[9:25] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

[9:37] Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics

[9:50] SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

[10:02] U^2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation

(ends 10:15 AM)

Oral Session 5B: Generalization and Adaptation [9:00-10:15]

Orals 9:00-10:15

[9:00] AToken: A Unified Tokenizer for Vision

[9:12] Confusion-Aware Spectral Regularizer for Long-Tailed Recognition

[9:25] Learning Latent Concepts for Detecting Out-of-Distribution Objects

[9:37] Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery

[9:50] Understanding and Enforcing Weight Disentanglement in Task Arithmetic

[10:02] Understanding Task Transfer in Vision-Language Models

(ends 10:15 AM)

Oral Session 5C: Geometry and Robotics [9:00-10:15]

Orals 9:00-10:15

[9:00] AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

[9:12] Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization

[9:25] QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

[9:37] SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation

[9:50] Structural Action Transformer for 3D Dexterous Manipulation

[10:02] TESO: Online Tracking of Essential Matrix by Stochastic Optimization

(ends 10:15 AM)

Oral Session 5D: Human-Centric Modeling & Lighting [9:00-10:15]

Orals 9:00-10:15

[9:00] BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer

[9:12] ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications

[9:25] OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control

[9:37] OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data

[9:50] POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

[10:02] Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views

(ends 10:15 AM)

10:15 a.m.

Break:

Courtesy Break

(ends 10:30 AM)

10:30 a.m.

Keynote:

Scaling Laws vs. Neural Laws: Toward More Natural Artificial Vision

Thomas Serre

(ends 11:30 AM)

11:15 a.m.

Poster Setup:

Poster Setup

(ends 11:45 AM)

11:45 a.m.

Art Exhibition [11:45-3:00]

(ends 3:00 PM)

Poster Session 5 & Exhibit Hall [11:45-1:45]

Posters 11:45-1:45

Evidential Neural Radiance Fields

Global-Aware Edge Prioritization for Pose Graph Initialization

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics

SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

U^2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation

AToken: A Unified Tokenizer for Vision

Confusion-Aware Spectral Regularizer for Long-Tailed Recognition

Learning Latent Concepts for Detecting Out-of-Distribution Objects

Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery

Understanding and Enforcing Weight Disentanglement in Task Arithmetic

Understanding Task Transfer in Vision-Language Models

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization

QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation

Structural Action Transformer for 3D Dexterous Manipulation

TESO: Online Tracking of Essential Matrix by Stochastic Optimization

BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer

ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications

OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control

OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data

POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views

Scaling View Synthesis Transformers

WildPose: A Unified Framework for Robust Pose Estimation in the Wild

MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer

Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling

Minimal Constraint Relaxation for Multiview Autocalibration

Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

GGPT: Geometry-Grounded Point Transformer

MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

KV-Tracker: Real-Time Pose Tracking with Transformers

InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization

From Rays to Projections: Better Inputs for Feed-Forward View Synthesis

SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

Parallel Rigidity Matters for Bundle Adjustment

Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization

VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement

Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives

StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References

DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation

Agentic Retoucher for Text-To-Image Generation

StyleDoctor: Towards Specialist Reward Model for Style-centric Generation Tasks

SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation

Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper

Adapting In-context Generation for Enhanced Composed Image Retrieval

Transition Models: Rethinking the Generative Learning Objective

Rethinking Glyph Spatial Information in Font Generation

StreamDiT: Real-Time Streaming Text-to-Video Generation

ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control

Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

3D Space as a Scratchpad for Editable Text-to-Image Generation

Aligning Multi-Character Narrative Image Generation with Multi-Aspect Human Preferences

FoleyDirector: Directing Temporal Controllable Video-to-Audio Generation via Fine-Grained Temporal Scripts

DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation

DreamOmni2: Multimodal Instruction-based Generation and Editing

AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models

PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation

IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Boosting Visual Reprogramming for CLIP with Dual Granularity Alignment

Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL

PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning

Unified Personalized Understanding, Generating and Editing

MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport

Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection

Decision Boundary-aware Generation for Long-tailed Learning

Towards Stable Federated Continual Test-Time Adaptation in Wild World

HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation

High-Quality and Efficient Turbulence Mitigation with Events

Tracking through Severe Occlusion via Event-Derived Transient Cues

FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera

Event-Based Motion Deblurring Using Task-Oriented 3D Gaussian Event Representations

From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras

Extending Embodied Question Answering from Perception to Decision

Dejavu: Towards Experience Feedback Learning for Embodied Intelligence

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis

Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering

VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents

Rethinking Intermediate Representation for VLM-based Robot Manipulation

Dexterous World Models

FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation

UniLight: A Unified Representation for Lighting

MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Hist2Style: Histogram-Guided Stylization with Bilateral Grids

Harmonic Canvas: Inversion-Free Editing for Visually-Guided Music Style Transfer

How to Take a Memorable Picture? Empowering Users with Actionable Feedback

UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models

GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers

HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction

Semantic Foam: Unifying Spatial and Semantic Scene Decomposition

NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting

NeAR: Coupled Neural Asset–Renderer Stack

Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis

PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis

Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling

TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising

Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex

Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration

ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction

Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters

Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising

UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration

Beyond the Ground Truth: Enhanced Supervision for Image Restoration

ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration

Bilevel Layer-Positioning LoRA for Real Image Dehazing

SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation

GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation

SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation

Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models

Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting

Simple-ViLMedSAM: Simple Text Prompts Meet Vision-Language Models for Medical Image Segmentation

NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization

Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models

TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

Jailbreaking Vision-Language Models via Dissonance-Guided Suffix Optimization and Image–Phrase Injection

BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models

Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation

LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

Transform to Transfer: Boosting Adversarial Attack Transferability on Vision-Language Pre-training Models

Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast

Reliable Clustering Number Estimation for Contrastive Multi-View Clustering

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis

SonoWorld: From One Image to a 3D Audio-Visual Scene

MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

EXOTIC: External Vision-driven Incomplete Multi-view Classification

Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning

OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration

Information-Theoretic Decomposition for Multimodal Interaction Learning

Is the Modality Gap a Bug or a Feature? A Robustness Perspective

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification

SeD-UD: An Influence-Driven and Hierarchically-Decoupled Information Bottleneck for Multimodal Intent Recognition

MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

LacTokGen: Latent Consistency Tokenizer for 1024-pixel Image Generation by 256 Tokens

FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories

Visual Autoregressive Modeling via Next Focus Prediction

Semantic Context Matters: Improving Conditioning for Autoregressive Models

TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction

FlashIn: Fast and Accurate Image Inversion for Real-time Image Editing

EasyV2V: A High-quality Instruction-based Video Editing Framework

One Algorithm to Align Them All

VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

Improved Mean Flows: On the Challenges of Fastforward Generative Models

SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

Match-and-Fuse: Consistent Generation from Unstructured Image Sets

Mixture of Style Experts for Diverse Image Stylization

Mirai: Autoregressive Visual Generation Needs Foresight

Align Images Before You Generate

Bridging the Perception Gap in Image Super-Resolution Evaluation

Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance

IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution

Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction

Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution

Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning

FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

MotionMaster: Generalizable Text-Driven Motion Generation and Editing

OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models

PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects

Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation

ViHOI: Human-Object Interaction Synthesis with Visual Priors

CLEP: Contrastive Language-Pose Pretraining

OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene

Beyond Mimicry: Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations

PHAC: Promptable Human Amodal Completion

CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

IntrinsicWeather: Controllable Weather Editing in Intrinsic Space

Outlier-Robust Diffusion Solvers for Inverse Problems

Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models

ReasonX: MLLM-Guided Intrinsic Image Decomposition

Diff-SemiER: Transparency-Aware Adaptive Fusion Diffusion Model with Generative Prior for Semi-Transparent Eyeglasses Removal

KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems

Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models

Taming Generative Diffusion Model for Task-Oriented Infrared Imaging

Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models

RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs

SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

Time Blindness: Why Video-Language Models Can’t See What Humans Can?

Spot The Ball: A Benchmark for Visual Social Inference

MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning

E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought

GeoWorld: Geometric World Models

ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding

Benchmarking PhD-Level Coding in 3D Geometric Computer Vision

MonoVLM: Monocular 3D Visual Grounding with Vision Language Models

Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding

SPREAD: Spatial-Physical REasoning via geometry Aware Diffusion

ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval

Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models

FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection

Model Merging in the Essential Subspace

Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures

Progressive Cross-Modal Causal Intervention for Long-Term Action Recognition

EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding

TrajTok: Learning Trajectory Tokens Enhances Video Understanding

Streaming Video Instruction Tuning

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning

Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Video Panels for Long Video Understanding

Gaze Target Estimation Anywhere with Concepts

Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation

Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Explaining CLIP Zero-shot Predictions Through Concepts

See Through the Noise: Improving Domain Generalization in Gaze Estimation

Mechanisms of Object Localization in Vision–Language Models

mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds

From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching

Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence

BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images

SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching

LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration

GM-R^2: Generative Matching Learning for Unsupervised Geometric Representation and Registration

4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis

PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction

MORE-STEM: Long-Short MemOry REcall and Spatio-TEmporal Consistency Model for Query-Driven 3D/4D Point Cloud Segmentation

Low-Rank Test-Time Training for Pre-Trained Point Cloud Models

STAR: Test-Time Adaptation Can Enhance Universal Prompt Learning for Vision-Language Models

Exploring Visual Pretraining for Learning Language Intelligence

VL-Eraser: Vacuum Distillation for Machine Unlearning in Vision-Language Models

DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles

SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

ORION: ORthonormal Text Encoding for Universal VLM AdaptatION

CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models

Mirror Illusion Art

HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

Towards Human-Like Robot Handwriting via Contour-Aware Generation

MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering

Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models

A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

Rank-Guided Pseudo-Bias Learning for Robust Black-Box Adaptation

Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection

WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

DSO: Direct Steering Optimization for Bias Mitigation

SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

SineProject: Machine Unlearning for Stable Vision-Language Alignment

HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning

OS-Fed: One Snapshot Is All You Need

FedAlign: Differentially Private Distribution Alignment for Non-IID Federated Learning

Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning

Personalized Federated Training of Diffusion Models with Privacy Guarantees

FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning

Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability

Small Object, Great Challenge: A Benchmark for Small Object Visual Grounding

UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Hybrid Token Compression for Vision-Language Models

Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding

When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

BiGain: Unified Token Compression for Joint Generation and Classification

Hi-Lo Prune: Look at What You'll Lose before Pruning with Hierarchical Token Selection

VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization

In Pursuit of Pixel Supervision for Visual Pre-training

GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency

TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery

Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

The Universal Normal Embedding

Bypassing the Transport Plan: Dynamic Reweighting for Out-of-Distribution Detection with Optimal Transport

Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET

Debiased Sample Selection for Learning with Noisy Labels

Driving on Registers

Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles

EE-RL: Vision Language Guided Reinforcement Learning with Explorer and Expert model for End-to-End Autonomous Driving

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting

DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving

CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning

Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration

Efficient Equivariant Transformer for Self-Driving Agent Modeling

Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation

Saliency-Driven Token Merging for Vision Transformers

RISE: Single Static Radar-based Indoor Scene Understanding

Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation

TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection

Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation

SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation

Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

Subspace Alignment for CLIP-based Continual Learning via Canonical Correlation Analysis

DGS: Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation for Class Incremental Learning

Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification

Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models

Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning

Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models

Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging

Few-Shot Hybrid Incremental Learning: Continually Learning under Data Scarcity and Task Uncertainty

High-Fidelity Mobile Avatars with Pruned Local Blendshapes

PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning

Bridging Privacy and Provenance: Traceable Virtual Identity Generation

PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

Dynamic Label Noise Suppression with Optimal Teacher Pool for Facial Expression Recognition

MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation

DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation

HybridDriveVLA: Vision-Language-Action Model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving

NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction

LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation

MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation

FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes

GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction

4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video

PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video

ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars

ZINA: Multimodal Fine-grained Hallucination Detection and Editing

Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection

HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration

KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding

Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

Circular-DPO: Aligning Multi-Stage 3D Generative Models via Preference Feedback Loop

Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models

PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation

Choreographing a World of Dynamic Objects

SounDiT: Geo-Contextual Soundscape-to-Landscape Generation

Vista4D: Video Reshooting with 4D Point Clouds

CamDirector: Towards Long-Term Coherent Video Trajectory Editing

Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding

Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection

Target-Aware Invertible Encoder with Reconstruction Guidance for Infrared Small Target Detection

BDNet: Bio-Inspired Dual-Backbone Small Object Detection Network

ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer

RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework

FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition

Free-Grained Hierarchical Visual Recognition

URICA: A Uniformity Region Affine Identifier Capture Algorithm for Arbitrary Region Retrieval in Pathology Images

Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision

DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video

Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning

Video-CoE: Reinforcing Video Event Prediction via Chain of Events

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues

Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding

CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning

SVBench: Evaluation of Video Generation Models on Social Reasoning

Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search

LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks

Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

CADC: Content Adaptive Diffusion-Based Generative Image Compression

FG-Portrait: 3D Flow Guided Editable Portrait Animation

ResCa: Residual Caching for Diffusion Transformers Acceleration

IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation

SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training

InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation

Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer

DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR

A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection

PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving

STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection

Exploring 6D Object Pose Estimation with Deformation

SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving

Improving Vision-language Models with Perception-centric Process Reward Models

X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction

PhysInOne: Visual Physics Learning and Reasoning in One Suite

AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety–Critical Cloud Forecasts

TTRV: Test-Time Reinforcement Learning for Vision Language Models

Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR

QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction

Generalizable Sparse-View 3D Reconstruction from Unconstrained Images

RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks

SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting

GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting

OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos

MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing

Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting

RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing

Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction

PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

Scene Grounding in the Wild

Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM

Revisiting 3D Reconstruction Kernels as Low-Pass Filters

SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting

GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes

VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

IPR-1: Interactive Physical Reasoner

VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension

Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models

Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Latent Implicit Visual Reasoning

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning

Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective

Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing

Grounded Chain-of-Thought for Multimodal Large Language Models

LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers

SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference

Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights

Compressed-Domain-Aware Online Video Super-Resolution

CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model

Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective

High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling

Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

Task-Aware Image Signal Processor for Advanced Visual Perception

Enhancing Video Vision Language Model with Hippocampal Sensing

VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery

SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization

EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement

Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent

ReLaX: Reasoning with Latent Exploration for Large Reasoning Models

VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Think, Then Verify: A Hypothesis–Verification Multi-Agent Framework for Long Video Understanding

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Multi-Modal Image Fusion via Intervention-Stable Feature Learning

ReCoFuse: Ultra-Robust Image Fusion via Restorative Multi-Modal Diffusion Reciprocal Coupling

Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios

DF^2-VB: Dual-level Fuzzy Fusion with View-specific Boosting for Multi-view Multi-label Classification

UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation

Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification

Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs

Graph Attention Prototypical Network for Robust Few-Shot Classification

Mitigating The Distribution Shift of Diffusion-based Dataset Distillation

EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation

Flow Map Distillation Without Data

F^2HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling

Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal

Inter-Photon-Limited Videography

A Bit is All You Need! Efficient Video Capture via Single Bit Imaging

From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing

Electromagnetic Inverse Scattering from a Single Transmitter

Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging

Physics-Guided Multistep Deformation Reversal for Ancient Bamboo Slip Restoration

cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold

SGDE: Self-supervised Geometry Degradation Estimation Framework for Coded Aperture Compressive Spectral Imaging

Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors

UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition

GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis

Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy

RoSAMDepth: Robust Self-supervised Depth Estimation Leveraging Segment Anything Model

AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

Dark3R: Learning Structure from Motion in the Dark

What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?

TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast

Iris: Integrating Language into Diffusion-based Monocular Depth Estimation

Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation

UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents

Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement

Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation

Order Matters: 3D Shape Generation from Sequential VR Sketches

Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation

ArtLLM: Generating Articulated Assets via 3D LLM

PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation

2D-LFM: Lifting Foundation Model without 3D Supervision

ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

FabricGen: Microstructure-Aware Woven Fabric Generation

Leveraging Verifier-Based Reinforcement Learning in Image Editing

PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

C^2FG: Control Classifier-Free Guidance via Score Discrepancy Analysis

Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation

Unified Customized Generation by Disentangled Reward Modeling

Region-Aware Instance Consistency Learning for Micro-Expression Recognition

MPL: Match-guided Prototype Learning for Few-shot Action Recognition

LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation

LA-Pose: Latent Action Pretraining Meets Pose Estimation

RAAS: LLM Agentic System Architecture Search with GRPO

Temporal Representation Enhancement (TRE): Learning to Forget Dominant Patterns for Enhanced Temporal Spiking Features

Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

Unlocking Pre-trained Weights: Parameter Inheritance for Zero-Shot Initialization

Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis

Progressive Neural Architecture Generation

A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling

When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks

Temporal Interaction in Spiking Transformers with Multi-Delay Mixer

Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge

Sparsely Timing the Change: A Spiking Temporal Framework for Remote Sensing Interpretation

ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments

BAMI: Training-Free Bias Mitigation in GUI Grounding

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

Geometry-driven OOD Detectors Are Class-Incremental Learners

Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

Prompt-Free Unknown Label Generation for Open World Detection in Remote Sensing

Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection

RINO: Rotation-Invariant Non-Rigid Correspondences

Hyperbolic Prototype Learning with Uncertainty-Aware Consistency for Continual Test-Time Segmentation

DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval

Leveraging Class Distributions in CLIP for Weakly Supervised Semantic Segmentation

CompetitorFormer: Mitigating Query Conflicts for 3D Instance Segmentation via Competitive Strategy

D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping

Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation

D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation

Fast Reasoning Segmentation for Images and Videos

Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation

CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration

FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle

OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis

Regulating Rather than Constraining: Adaptive Guidance for Complex Spectral Reconstruction in Pansharpening

GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model

Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation

Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats

Logit-Margin Repulsion for Backdoor Defense

Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems

Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics

Towards Stealthy and Effective Backdoor Attacks on Lane Detection: A Naturalistic Data Poisoning Approach

Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases

Latent Diffusion Inversion Requires Understanding the Latent Space

Fractal Camouflage: A Bio-Inspired Approach for Multi-Scale Adversarial Attacks in the Infrared Domain

EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment

Describe Anything Anywhere At Any Moment

StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation

INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion

MM-ACT: Learn from Multimodal Parallel Generation to Act

HQC-NBV: A Hybrid Quantum-Classical View Planning Approach

Motus: A Unified Latent Action World Model

SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph

MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking

Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields

Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning

Progressive Multi-cue Alignment for Unaligned RGBT Tracking

Real-Time Neural Video Compression with Unified Intra and Inter Coding

Adapting Lightweight Image-based Counting Models for Video Crowd Counting

Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis

MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis

MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration

Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data

CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks

GenTract: Generative Global Tractography

LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol

Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance

Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?

IEBGL: An Interpretability-Enhanced Brain Graph Learning Framework with LLM-Instructed Topology and Literature-Augmented Semantics

F^2-Assist: Multi-Phase Fetal Growth Forecast and Report Generation from Ultrasound Examination

Sparse Spectral LoRA: Routed Experts for Medical VLMs

SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization

OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis

Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization

Forensic-Friendly Image Manipulation via Controllable Latent Diffusion

IncreFA: Breaking the Static Wall of Generative Model Attribution

AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

Detecting Compressed AI-Generated Images via Phase Spectrum Robustness

Detect Any AI-Counterfeited Text Image

DeepfakeImpact: A Two-Stage Benchmark with Real-World Impact in Deepfake Detection

Enhancing the Security of Visual Speaker Authentication Based on Dynamic Lip-Print Analysis

SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Editprint: General Digital Image Forensics via Editing Fingerprint with Self-Augmentation Training

Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification

Goldilocks Test Sets for Face Verification

Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning

DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting

MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection

Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection

AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection

Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection

GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection

Defect Cue-Preserved Structural Feature Refinement for Few-Shot Anomaly Detection

Anomaly-Related Residual Fields for Cross-domain Anomaly Detection

From Attraction to Equilibrium: Physics-Inspired Semantic Gravitons for Zero-Shot Anomaly Detection

Joint Learning of General and Diverse Patterns with Mixture of Memory Experts for Weakly-Supervised Video Anomaly Detection

No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection

FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement

DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving

GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Test-Time 3D Occupancy Prediction

Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration

Diffusion Mental Averages

dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

RegionRoute: Regional Style Transfer with Diffusion Model

Low-Rank Residual Diffusion Models

RDF-MIG: A Robust Diffusion Framework for Masked Image Generation to Augment Semantic Segmentation and Change Detection

TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration

Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation

Guiding Token-Sparse Diffusion Models

Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

High-Fidelity Virtual Try-On beyond Paired Data Scarcity via Diffusion-based Cycle-Consistent Learning

Sampling-Aware Quantization for Diffusion Models

CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

Scale Space Diffusion

Making Training-Free Diffusion Segmentors Scale with the Generative Power

Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models

Few-Step Diffusion Sampling Through Instance-Aware Discretizations

SpeeDiff: Scalable Pixel-Anchored End-to-End Latent Diffusion Model

Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation

Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production

Diffusion Probe: Generated Image Result Prediction Using CNN Probes

Content-Aware Dynamic Patchification for Efficient Video Diffusion

PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

GROW: Watermark Generation with Progressive Guidance for Diffusion Models

MotionV2V: Editing Motion in a Video

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models

DreamStyle: A Unified Framework for Video Stylization

Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering

Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage

Reward Sharpness-Aware Fine-Tuning for Diffusion Models

DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation

Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration

Cross-modal Representation Learning for Diffusion-generated Image Detection

Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Back to Basics: Let Denoising Generative Models Denoise

CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

DiP: Taming Diffusion Models in Pixel Space

RAPID: Reusing Attention Sparsity with Inter-step Adaptation for Efficient Video Diffusion

Efficient and Training-Free Single-Image Diffusion Models

(ends 1:45 PM)

Demonstration:

Demos Session 5

(ends 1:45 PM)

Art Gallery Tour with Curator and Artists [11:45-12:15]

(ends 12:15 PM)

2 p.m.

Oral Session 6A: Geometric Learning [2:00-3:15]

Orals 2:00-3:15

[2:00] Differentiable Laplacian Matrix Guided Superpixel Segmentation

[2:15] FILTR: Extracting Topological Features from Pretrained 3D Models

[2:30] Learning Convex Decomposition via Feature Fields

[2:45] Learning Eigenstructures of Unstructured Data Manifolds

[3:00] Mapping Networks

(ends 3:15 PM)

Oral Session 6B: Multimodal Reasoning [2:00-3:15]

Orals 2:00-3:15

[2:00] CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation

[2:15] Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack

[2:30] SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks

[2:45] Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

[3:00] Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

(ends 3:15 PM)

Oral Session 6C: Medical Vision [2:00-3:15]

Orals 2:00-3:15

[2:00] CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

[2:12] DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging

[2:25] Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation

[2:37] LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs

[2:50] Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence

[3:02] SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation

(ends 3:15 PM)

Oral Session 6D: Large-Scale Neural Modeling [2:00-3:15]

Orals 2:00-3:15

[2:00] Efficient Unrolled Networks for Large-Scale 3D Inverse Problems

[2:15] FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization

[2:30] SimScale: Learning to Drive via Real-World Simulation at Scale

[2:45] Texvent: Asynchronous Event Data Simulation via Text Prompt

[3:00] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

(ends 3:15 PM)

3 p.m.

Poster Setup:

Poster Setup

(ends 3:30 PM)

3:15 p.m.

Break:

Courtesy Break

(ends 3:30 PM)

3:30 p.m.

Poster Session 6 [3:30-5:30]

Posters 3:30-5:30

Differentiable Laplacian Matrix Guided Superpixel Segmentation

FILTR: Extracting Topological Features from Pretrained 3D Models

Learning Convex Decomposition via Feature Fields

Learning Eigenstructures of Unstructured Data Manifolds

Mapping Networks

CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation

Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack

SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging

Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation

LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs

Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence

SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation

Efficient Unrolled Networks for Large-Scale 3D Inverse Problems

FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization

SimScale: Learning to Drive via Real-World Simulation at Scale

Texvent: Asynchronous Event Data Simulation via Text Prompt

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

BuildingGPT: Auto-Regressive Building Wireframe Reconstruction Model with Reinforcement Learning

Emergent Extreme-View Geometry in 3D Foundation Models

LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery

Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals

VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale

SEA-Flow3D: Simplified, Efficient, and Accurate Scene Flow via Spatial Vector Sampling and Multi-scale Refinement

OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer

DROID-SLAM in the Wild

HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT

Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors

Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences

Learning 3D Reconstruction with Priors in Test Time

ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild

PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Hint2Gen: Bridging Understanding and Generation via Code-structured Hints

Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization

Learning by Analogy: A Causal Framework for Compositional Generalization

ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation

GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

TempoControl: Temporal Attention Guidance for Text-to-Video Models

Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment

Resolving the Identity Crisis in Text-to-Image Generation

DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation

Gloria: Consistent Character Video Generation via Content Anchors

DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

M4V: Multimodal Mamba for Efficient Text-to-Video Generation

Property-Informed Diffusion-Based Text-to-Microstructure Generation

DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models

Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

CoV-Align: Efficient Fine-grained Cross-Modal Alignment with Cohesive Visual Semantics Priority

TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization

PhyCritic: Multimodal Critic Models for Physical AI

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks

InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection

Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition

Decoupling Vision and Language: Codebook Anchored Visual Adaptation

MemFlow: A Lightweight Forward Memorizing Framework for Quick Domain Adaptive Feature Mapping

Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

Vision-Language Model Guided Source-Free Domain Adaptation via Optimal Transport

Masked Representation Modeling for Domain-Adaptive Segmentation

TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer

ARES: Unifying Asymmetric RGB-Event Stereo for Probabilistic Scene Flow Estimation

MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras

Moving Border Ownership for Event-based Motion Segmentation

TTAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors

Seeing Motion Through Polarity for Event-based Action Recognition

Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning

Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size

AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance

Experience Transfer for Multimodal LLM Agents in Minecraft Game

MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robotics

MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

RealAppiance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals

ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation

MERIT: Multi-domain Efficient RAW Image Translation

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment

EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories

Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity

WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing

UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing

PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis

LumiMotion: Improving Gaussian Relighting with Scene Dynamics

Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation

iLRM: An Iterative Large 3D Reconstruction Model

MVInverse: Feed-forward Multiview Inverse Rendering in Seconds

From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis

MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification

Multi-view Pyramid Transformer: Look Coarser to See Broader

CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling

RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment

Benchmarking Endoscopic Surgical Image Restoration and Beyond

SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control

HiDRA: Hierarchical Degradation Representation and Adaptation with Generative Priors for Enhancing Infrared Vision

BluRef: Unsupervised Image Deblurring with Dense-Matching References

Bi-Bridge: Bidirectional Diffusion Bridges for Low-Light Image Enhancement

UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration

MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

SelfHVD: Self-Supervised Handheld Video Deblurring

Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor

Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency

Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation

Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

Photo-Guided Tooth Segmentation on 3D Oral Scan Model

Breaking the Continuum: Discrete Distribution Learning for Structural MRI Reconstruction

Uni-Hema: Unified Model for Digital Hematopathology

Post-training Feature Pruning for Fundus Images Classification

Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation

SafeLogo: Turning Your Logos into Jailbreak Shields via Micro-Regional Adversarial Training

Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Hierarchically Robust Zero-shot Vision-language Models

Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration

AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers

TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration

SO-Bench: A Structural Output Evaluation of Multimodal LLM

Chain-of-Thought Guided Multi-Modal Object Re-Identification

When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence

CountGD++: Generalized Prompting for Open-World Counting

AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Parameter-Efficient Adaptation for MLLMs via Implicit Modality Decomposition

Hyperbolic Gramian Volumes for Multimodal Alignment

Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping

AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation

StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets

CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion

Lenses: Toward Polysemous Vision–Language Understanding

CoRiM: Conflict-driven Risk Minimization for Dynamic Multimodal Fusion

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

CICA: Coupling Confidence-Aware Pretraining with Confidence-Informed Attention for Robust Multimodal Sentiment Analysis

SAMTok: Representing Any Mask with Two Words

Multi-Metric Representation Learning Strategy Based on Clustering for Fine-Grained Multimodal Sentiment Analysis

Cinematic Audio Source Separation Using Visual Cues

MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection

Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning

PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing

VideoCoF: Unified Video Editing with Temporal Reasoner

Progressive Supernet Training for Efficient Visual Autoregressive Modeling

CoT-Edit: Let CoT Guide Instruction Video Editing

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling

Understanding, Accelerating, and Improving MeanFlow Training

Meta-CoT: Enhancing Granularity and Generalization in Image Editing

Dual-Granularity Memory for Efficient Video Generation

Unified Camera Positional Encoding for Controlled Video Generation

EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene

PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories

Object-WIPER: Training-Free Object and Associated Effect Removal in Videos

Mobile-VTON: High-Fidelity On-Device Virtual Try-On

Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

Towards Robust Sequential Decomposition for Complex Image Editing

Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection

Chain of Event-Centric Causal Thought for Physically Plausible Video Generation

LoL: Longer than Longer, Scaling Video Generation to Hour

FlowMotion: Training-Free Flow Guidance for Video Motion Transfer

Learning Straight Flows: Variational Flow Matching for Efficient Generation

SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

DNF-SR: Dual-Input and Negative-Aware Feature Fine-Tuning for Real-World Image Super-Resolution

IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution

Edge-Focused Super-Resolution for Omnidirectional Images with Spherical Geometric Augmentation

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion

Disentangled Textual Priors for Diffusion-based Image Super-Resolution

Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework

Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution

STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions

Human Geometry Distribution for 3D Animation Generation

A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation

Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation

Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation

Causal Motion Diffusion Models for Autoregressive Motion Generation

Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions

MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation

End-to-End Language-Action Model for Humanoid Whole Body Control

Toward Early Quality Assessment of Text-to-Image Diffusion Models

CoD: A Diffusion Foundation Model for Image Compression

Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)

Language-Guided One-Step Diffusion Model for Nighttime Flare Removal

SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras

PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems

Landscape-Awareness for Geometric View Diffusion Model

Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism

REACH: Explicit Recovery Behavior for Diffusion Policies

OralGPT-Omni: A Versatile Dental Multimodal Large Language Model

CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks

Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding

LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models

Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding

LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting

Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation

EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding

Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Geometrically-Constrained Agent for Spatial Reasoning

PARSE: Part-Aware Relational Spatial Modeling

R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

MCHDoc: A Comprehensive Benchmark for Reading Multi-Carrier Chinese Historical Documents

Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval

DiT-Distill: Open-Set Fine-Grained Retrieval via Generative Curriculum Knowledge

ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval

Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning

Rethinking BCE Loss for Multi-Label Image Recognition with Fine-Tuning

CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval

PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild

Seeing Conversations: Communication Context Identification in Egocentric Video

Interactive Episodic Memory with User Feedback

Seeing without Pixels: Perception from Camera Trajectories

PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

StreamRAG: Enhancing Real-Time Video Understanding with Retrieval Augmentation

ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

SkillSight: Efficient First-Person Skill Assessment with Gaze

BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations

Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

Language Models Can Explain Visual Features via Steering

Making the Classification Explanation Faithful to the Confidence Score

Intrinsic Concept Extraction Based on Compositional Interpretability

Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors

Measuring the (Un)Faithfulness of Concept-Based Explanations

Deformation-based In-Context Learning for Point Cloud Understanding

ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders

ESAM++: Efficient Online 3D Perception on the Edge

DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration

Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs

Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment

Adaptive 3D Perception for Small Aerial Targets Under Sparse Sampling via Reinforcement Learning

3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation

Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding

Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising

Hyper-PCN: Hypergraph-Based Point Cloud Completion via High-Order Correlation Modeling

Towards Calibrating Prompt Tuning of Vision-Language Models

DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks

LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

Language-guided Frequency Modulation for Large Vision-Language Models

TANGO: Text-Anchored Guided Optimization for Robust Fine-tuning Vision-Language Models under Label Noise

Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

Reconstructing CLIP for Open-Vocabulary Dense Perception

DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities

BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep

NeuROK: Generative 4D Neural Object Kinematics

BrickNet: Graph-Backed Generative Brick Assembly

Unified Vector Floorplan Generation via Markup Representation

CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation

OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

The Invisible Gorilla Effect in Out-of-distribution Detection

Interpretable Debiasing of Vision-Language Models for Social Fairness

Image-based Outlier Synthesis With Training Data

SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning

Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework

When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness

Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting

Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

FedMOP: Achieving Enhanced Privacy and Performance in Federated Learning via Momentum Orthogonal Projection

HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation

Single-Round Scalable Analytic Federated Learning

Controllable Federated Prompt Learning at Test Time

FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning

Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

Spatial Matters: Position-Guided 3D Referring Expression Segmentation

Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach

Collaborative Multi-Mode Pruning for Vision-Language Models

ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs

Imbalanced View Contribution Evaluation and Refinement for Deep Incomplete Multi-View Clustering

Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering

SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning

Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction

Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning

Residual Connections Harm Generative Representation Learning

Neural Mixture Density Processes

Large-scale Robust Enhanced Ensemble Clustering via Outlier Decoupling

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving

Latent Chain-of-Thought World Modeling for End-to-End Driving

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation

Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them

Linking Modality Isolation in Heterogeneous Collaborative Perception

LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving

DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance

Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving

DIMOS: Disentangling Instance-level Moving Object Segmentation

EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision

Live Interactive Training for Video Segmentation

Robust Promptable Video Object Segmentation

Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models

Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation

BEV-CAR: Enhancing Monocular Bird’s Eye View Segmentation with Context-Aware Rasterization

Exploring the Underwater World Segmentation without Extra Training

Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting

Cross-Architecture Adaptation: Cloud-Edge Continual Test-Time Adaptation with Dynamic Sampling and Heterogeneous Distillation

Towards Dynamic Modality Alignment in Multimodal Continual Learning

ϕ-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

Incremental Object Detection via Future-Aware Decoupled Cross-Head Distillation

Smart Replay: Adaptive Scheduling of Memory Rehearsal for Computational Resource-Aware Incremental Learning

ReBaPL: Repulsive Bayesian Prompt Learning

Spectral Mixture-of-Experts for Continual Learning

ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations

SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting

MIBURI: Towards Expressive Interactive Gesture Synthesis

Personalized Image Descriptions from Attention Sequences

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence

OctoNav: Towards Generalist Embodied Navigation

WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

SMAP: Semantic Route Planning with Map-Grounded Multimodal Alignment

IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations

Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling

Motion-Aware Animatable Gaussian Avatars Deblurring

ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models

Cross-Modal Attention Calibration for LVLM Hallucination Mitigation

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Exposing and Evaluating Hallucinations for GUI Grounding

Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models

Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

AniMimic: Imitating 3D Animation from Video Priors

VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation

MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis

WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control

Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance

SANER: Switchable Adapter with Non-parametric Enhanced Routing for Person De-Reidentification

BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification

Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification

Diversity over Uniformity: Rethinking Representation in Generated Image Detection

Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection

FSLoRA: Harmonizing Detection and Re-Identification via Freq-Spatial Low-Rank Adapter for One-Stage Person Search

EEGiT: Teaching Vision Transformers to Understand the EEG signal

FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts

Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification

UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection

TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

StreamReady: Learning What to Answer and When in Long Streaming Videos

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Agentic Video Summarization via Self-Reflecting Multimodal Understanding

Self-Critical Distillation Network for Video-based Commonsense Captioning

Ego-Grounding for Personalized Question-Answering in Egocentric Videos

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

EarlyTom: Early Token Compression Completes Fast Video Understanding

VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding

DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers

RenderFlow: Single-Step Neural Rendering via Flow Matching

ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers

Masked Region Transformer for Layered Image Generation and Editing at Scale

DDT: Decoupled Diffusion Transformer

Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers

Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion

ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

RecTok: Reconstruction Distillation along Rectified Flow

EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection

H^2A^2: Homogeneity-Aware and Heterogeneity-Aware Feature Perception for Unified Indoor 3D Object Detection

Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation

Towards Intrinsic-Aware Monocular 3D Object Detection

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection

DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration

HDR-VLM: HDR-Domain Adaptation of VLMs and Preference-Aligned Quality Assessment for HDR Video Color Grading

RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations

BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models

DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures

From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing

Trust-calibrated Collaborative Learning for Long-Tailed Visual Recognition

SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting

TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction

GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space

AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction

Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting

AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM

Learning Differentiable Hierarchies in 3D Gaussian Splatting

WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking

Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting

Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation

3D Gaussian Splatting from Unposed Spike Stream

SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method

ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction

Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping

MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes

BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction

EDGS: Eliminating Densification for Efficient Convergence of 3DGS

ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps

Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

DialogueVPR: Towards Conversational Visual Place Recognition

Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

Grounding Everything in Tokens for Multimodal Large Language Models

Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering

Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning

VKG-QA: Visual Knowledge Graph-based Question Answer for Large Multimodal Models

Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning

Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop

VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment

Generative Video Compression with One-Dimensional Latent Representation

Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

Learned Image Compression via Sparse Attention and Adaptive Frequency

UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders

VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference

Ultra-Fast Neural Video Compression

Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression

Scaling Parallel Sequence Models to Vision Foundation Models

Revisiting Model Stitching In the Foundation Model Era

GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps

TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement

GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings

Towards Visual Query Localization in the 3D World

OVOD-Agent: A Markov–Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection

Pixel2Phys: Distilling Governing Laws from Visual Dynamics

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

Dynamic Important Example Mining for Reinforcement Finetuning

Specificity-aware reinforcement learning for fine-grained open-world classification

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection

Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization

Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion

Multi-modal Frequency Decomposition Network for Semantic Scene Completion

BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement

FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration

OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning

LRHDR: Learning Representation-enhanced HDR Video Reconstruction

Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation

Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation

AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection

SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection

Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment

Learnability-Guided Diffusion for Dataset Distillation

Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

Progressive Mask Distillation for Self-supervised Video Representation

HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation

SpiderCam: Low-Power Snapshot Depth from Differential Defocus

Computational Speckle Pattern Interferometry

DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging

Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors

Nonlinear Color Transfer via Learnable Bezier Flows

VT-Intrinsic: Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair

GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT

Computer Vision with a Superpixelation Camera

Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction

Multi-Scale Gradient-Guided Unrolling Architecture with Adaptive Mamba for Compressive Sensing

Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning

Bulk RNA-seq Guided Multi-modal Detection of Anomalous Regions in Human Cancer via Spatial Transcriptomics

Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics

ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction

PhysVid: Physics Aware Local Conditioning for Generative Video Models

PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment

EvoID: Reinforced Evolution for Identity-Preserving Video Generation

Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning

PhyCo: Learning Controllable Physical Priors for Generative Motion

Unified Multimodal Models as Auto-Encoders

Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models

ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

Drainage: A Unifying Framework for Addressing Class Uncertainty

Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity

DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging

SASNet: Spatially-Adaptive Sinusoidal Networks for INRs

Generative Modeling of Weights: Generalization or Memorization?

Vision-Oriented Lightweight Neural Architecture Search with Budget-Adaptive Evaluation

Improving Sparse Autoencoder with Dynamic Attention

Stepwise Credit Assignment for GRPO on Flow-Matching Models

FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models

Hyperbolic Busemann Neural Networks

FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

Image-to-Point Cloud Feature Back-Projection for Multimodal Training of 3D Semantic Segmentation

NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation

Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach

SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons

MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision

SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation

Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation

MatchMask: Mask-Centric Generative Data Augmentation for Label-Scarce Semantic Segmentation

Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation

Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

UniChange: Unifying Change Detection with Multimodal Large Language Model

Spatiotemporal Pyramid Flow Matching for Climate Emulation

See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions

MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation

Fourier Angle Alignment for Oriented Object Detection in Remote Sensing

Learning to Infer Parameterized Representations of Plants from 3D Scans

Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling

V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

FeatureFool: Zero-Query Fooling of Video Models via Feature Map

RankOOD - Class Ranking-based Out-of-Distribution Detection

AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples

The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers

Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation

Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception

Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations

Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols

RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation

GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer

ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model

BiPreManip: Learning Affordance-Based Bimanual Pre-Manipulation through Anticipatory Collaboration

Learning Surgical Robotic Manipulation with 3D Spatial Priors

SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation

RaUF: Learning the Spatial Uncertainty Field of Radar

SIR: Structured Image Representations for Explainable Robot Learning

Instance-level Visual Active Tracking with Occlusion-Aware Planning

Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

AnthroTAP: Learning Point Tracking with Real-World Motion

Tracking by Predicting 3-D Gaussians Over Time

Toward Low-Cost yet Effective Temporal Learning for UAV Tracking

Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again

Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking

CoWTracker: Tracking by Warping instead of Correlation

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Beyond Explicit Language: Plug-and-Play Visual-to-Linguistic Modeling Toward General Object Tracking

FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training

PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature

LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings

D2T2 - Multimodal Automated Planning for Brachytherapy

TopoCL: Topological Contrastive Learning for Medical Imaging

Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction

Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation

Decoding 3D Perception via BrainSSD: Synergistic Fusion of EEG Representations from Static and Dynamic Visual Streams

Duala: Dual-Level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding

OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks

Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

MedFG-VQA: Low-Frequency Memory and Graph Attention for Lightweight Medical VQA

FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing

DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

RAM: Recover Any 3D Human Motion in-the-Wild

From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction

MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data

Forecasting 3D Scanpaths in Egocentric Video

M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction

ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding

Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation

Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining

SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training

Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion

Learning Forgery-Aware Lip Representations Without Forgery Priors

Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection

GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement

TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery

Unleashing Vision-Language Semantics for Deepfake Video Detection

A Difference-in-Difference Approach to Detecting AI-Generated Images

RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

Zero-shot Detection of AI-Generated Image via RAW-RGB Alignment

Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes

Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection

TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking

FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection

RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation

Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View

Hunting Normality from Query Sample via Residual Learning for Generalist Anomaly Detection

GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

RebRL: Reinforcing Discrete Visual Diffusion Models with Rebalanced Timestep Credits

Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

Towards Fine-Grained Attribution: Instance-Aware Preference Optimization for Aligning Diffusion Models

SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior-Guided Multimodal LLMs

UniPercept: A Unified Diffusion Model for Generalizable Visual Perception

Visual Diffusion Models are Geometric Solvers

You Only Erase Once: Erasing Anything without Bringing Unexpected Content

Smoothing the Score Function to Enhance Generalization in Diffusion Models

NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning

PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models

Generative Neural Video Compression via Video Diffusion Prior

AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

Image Diffusion Preview with Consistency Solver

The Drift Kernel: Why Diffusion Models Change Even When Told Not To

Interpretable Prompts made Edit-Friendly: Token-to-Token Similarity Reduction in dLLMs for Edit-Friendly Hard Prompt Inversion

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation

EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition

Hierarchical Codec Diffusion for Video-to-Speech Generation

Semantic Alignment for Pose-Invariant Identity Preserving Diffusion

Causality in Video Diffusers is Separable from Denoising

2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching

Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization

MacTok: Robust Continuous Tokenization for Image Generation

Group Editing: Edit Multiple Images in One Go

Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation

Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training

Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

What Is It Like to Be a Noise? An Entropy-based Gaussian Noise Regularization for Diffusion Models

FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution

DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

GDRO: Group-level Reward Post-training Suitable for Diffusion Models

RFDM: Residual Flow Diffusion Models for Video Editing

FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing

Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration

Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers

DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing

D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective

Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models

Guiding Diffusion Models with Semantically Degraded Conditions

Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Reviving ConvNeXt for Efficient Convolutional Diffusion Models

Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance

Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models

BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models

Accelerating Autoregressive Video Diffusion via History-Guided Cache and Residual Correction

MusicInfuser: Making Video Diffusion Listen and Dance

(ends 5:30 PM)