CVPR 2026 Events with Videos

Live Streams and Virtual Content

During the main conference, the following four rooms will be live streamed. The same stream is active for the entire day. Specific events are also listed by type (Keynote, Oral Sessions, etc.) below. The streams are available either from the table or from the event pages.

Video recordings and live streams for all events, including workshops and tutorials, will appear below as soon as they become available. Live streams are available during the event, and recordings from the live streams will be posted to the website within about 48 hours. Workshop and tutorial recordings will be posted in the week following the conference.

Report issues here.

Keynotes

Programmable Biology: Generative AI for Molecular Design
Transforming Computing with Quantum-Centric Supercomputing
Scaling Laws vs. Neural Laws: Toward More Natural Artificial Vision

Meetings

PAMI TC

Oral Sessions

Oral Session 1A: Multimodal Vision
Oral Session 1B: Visual Security
Oral Session 1C: Efficient Reasoning
Oral Session 1D: Computational Imaging
Oral Session 2A: 3D Reconstruction
Oral Session 2B: Materials & Lighting
Oral Session 2C: Gaussian Splatting & Reconstruction
Oral Session 2D: Spatio-Temporal Reconstruction
Oral Session 3A: Generative Diffusion Modeling
Oral Session 3B: Spatial Understanding
Oral Session 3C: Generative Editing
Oral Session 3D: Multimodal Modeling
Oral Session 4A: Geometric Understanding
Oral Session 4B: Embodied & Agentic Intelligence
Oral Session 4C: Spatial Reasoning
Oral Session 4D: Visual Segmentation
Oral Session 5A: Dynamic Perception
Oral Session 5B: Generalization and Adaptation
Oral Session 5C: Geometry and Robotics
Oral Session 5D: Human-Centric Modeling & Lighting
Oral Session 6A: Geometric Learning
Oral Session 6B: Multimodal Reasoning
Oral Session 6C: Medical Vision
Oral Session 6D: Large-Scale Neural Modeling

Posters

VideoMaMa: Mask-Guided Video Matting via Generative Prior
Scalable Feature Matching via State Space Modeling and Sparse Correlation
MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
Does YOLO Really Need to See Every Training Image in Every Epoch?
Data-Centric Meta-Learning for Robust Few-Shot Generalization
FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
Momentum Memory for Knowledge Distillation in Computational Pathology
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Fast SceneScript: Fast and Accurate Language‑Based 3D Scene Understanding via Multi‑Token Prediction
Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
S^2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
Reinforcing Video Reasoning Segmentation to Think Before It Segments
MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs
HandX: Scaling Bimanual Motion and Interaction Generation
OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
GenMatter: Perceiving Physical Objects with Generative Matter Models
Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
Refaçade: Editing Object with Given Reference Texture
GFRRN: Explore the Gaps in Single Image Reflection Removal
The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision
Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval
Globscope: Toward a Global View of the Loss Landscape
PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images
JRM: Joint Reconstruction Model for Multiple Objects without Alignment
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
Recovering Physically Plausible Human-Object Interactions from Monocular Videos
VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
Act2See: Emergent Active Visual Perception for Video Reasoning
DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors
Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
Long-Tail Internet Photo Reconstruction
Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
Learning Multi-View Spatial Reasoning from Cross-View Relations
DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
Functional Mean Flow in Hilbert Space
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
Consistent Instance Field for Dynamic Scene Understanding
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
Source Models Leak What They Shouldn’t: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
Learning to Act Robustly with View-Invariant Latent Actions
Refracting Reality: Generating Images with Realistic Transparent Objects
Agile Deliberation: Concept Deliberation for Subjective Visual Classification
Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery
SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
Heterogeneous Decentralized Diffusion Models
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
Vision-Speech Models: Teaching Speech Models to Converse about Images
EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images
MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
Bidirectional Normalizing Flow: From Data to Noise and Back
Bootstrapping Multi-view Learning for Test-time Noisy Correspondence
The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
Region-Adaptive Sampling for Diffusion Transformers
How Much 3D Do Video Foundation Models Encode?
Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection
Transition Matching Distillation for Fast Video Generation
Correspondence-Attention Alignment for Multi-View Diffusion Models
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Physical Simulator In-the-Loop Video Generation
FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars
Exploring Spatial Intelligence from a Generative Perspective
ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
Advancing Image Classification with Discrete Diffusion Classification Modeling
SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
Bidirectional Query-Driven Generation of Parametric CAD Sketch
Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
EMMA: Extracting Multiple physical parameters from Multimodal Data
HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
AnyPcc: Compressing Any Point Cloud with a Single Universal Model
OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
ProjFlow: Projection Sampling with Flow Matching for Zero‑Shot Exact Spatial Motion Control
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Event-based Motion Deblurring with Unpaired Data
Concept-Aware Batch Sampling Improves Language-Image Pretraining
Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection
VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
A Training-Free Style-Personalization via SVD-Based Feature Decomposition
Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
Extend3D: Town-Scale 3D Generation
WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
The Midas Touch for Metric Depth
Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
X-WIN: Building Chest Radiograph World Model via Predictive Sensing
Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
Chain of World: World Model Thinking in Latent Motion
FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
Neural Distribution Prior for LiDAR Out-of-Distribution Detection
Affordance-First Decomposition for Continual Learning in Video–Language Understanding
MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Scalable Trajectory Generation for Whole-Body Mobile Manipulation
Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction
Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models
Not All Birds Look The Same: Identity-Preserving Generation For Birds
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Are Image-to-Video Models Good Zero-Shot Image Editors?
CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
Harnessing the Power of Foundation Models for Accurate Material Classification
DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
Language-Free Generative Editing from One Visual Example
Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection
Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
Understanding Counting Mechanisms in Large Language and Vision-Language Models
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision
Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning
From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
When to Think and When to Look: Uncertainty-Guided Lookback
Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification
Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
Learning Personalized Photographic Style from Pairwise User Preferences
MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
Hierarchical Action Learning for Weakly-Supervised Action Segmentation
Geometric Neural Distance Fields for Learning Human Motion Priors
RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion
Rethinking Dataset Distillation: Hard Truths about Soft Labels
Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances
Efficient Weighted Sampling via Score-based Generative Models
Mind the Gap: Transferring Labels to Align Object Detection Datasets
NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
EmoStyle: Emotion-Driven Image Stylization
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
LNEM: Lunar Neural Elevation Model
Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
X-band Radar Non-Line-of-Sight Imaging
Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
CompBench: Benchmarking Complex Instruction-guided Image Editing
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
Beyond Depth: Evaluating the Width-centric Reasoning Capability of MLLMs
FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
Inferring Compositional 4D Scenes without Ever Seeing One
DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning
SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
ORBIT: Benchmarking SfM in the Wild with 360° Video
MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Multi-Scale Local Speculative Decoding for Image Generation
Decoupled Generative Modeling for Human-Object Interaction Synthesis
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
VISTA: A Test-Time Self-Improving Video Generation Agent
GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching
Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS
Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
ViT^3: Unlocking Test-Time Training in Vision
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Global Underwater Geolocation from Time-Lapse Polarization Imagery
VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
Envisioning the Future, One Step at a Time
Global Information Thresholding for Sufficient and Necessary Circuits
QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
Physical Object Understanding with a Physically Controllable World Model
RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
Label-Free Cross-Task LoRA Merging with Null-Space Compression
MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models
VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
Reflection Separation from a Single Image via Joint Latent Diffusion
AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
PhysHead: Simulation-Ready Gaussian Head Avatars
Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking
UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
VL-RouterBench: A Benchmark for Vision–Language Model Routing
DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
Gyro-based Deep Video Deblurring
DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
E^2-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
Splatent: Splatting Diffusion Latents for Novel View Synthesis
GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
Affostruction: 3D Affordance Grounding with Generative Reconstruction
TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models
PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
Variational Graph-based Normal Integration
ID-Sim: An Identity-Focused Similarity Metric
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
Affine Perspective-Three-Point Problem
SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
PGA: Prior-free Generative Attack for Practical No-box Scenario
DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
Breaking Multimodal LLM Safety via Video-Driven Prompting
PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Lipschitz Optimization for Formal Verification of Homographies
Make it SING: Analyzing Semantic Invariants in Classifiers
SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
DRM: Diffusion-based Reward Model With Step-wise Guidance
VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
Continual Distillation of Teachers from Different Domains
Rethinking Occlusion Modeling for UAV Tracking
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
Visual Grounding for Object Questions
DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
TopoSlide: Topologically-Informed Histopathology Whole Slide Image Representation Learning
Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
Building a Precise Video Language with Human–AI Oversight
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation
Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
Illumination-Consistent Human-Scene Reconstruction from Monocular Video
Residual Primitive Fitting of 3D Shapes with SuperFrusta
ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
Scaling Dense Event-Stream Pretraining from Visual Foundation Models
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment
PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification
Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
LVLM-Aided Alignment of Task-Specific Vision Models
Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification
DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
COT-FM: Cluster-wise Optimal Transport Flow Matching
InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
Rectifying Latent Space for Generative Single-Image Reflection Removal
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and VLM-Guided Optimization
Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
Towards Sparse Video Understanding and Reasoning
Contact-Aware Neural Dynamics
AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
Occluded Human Body Capture with Frequency Domain Denoising Prior
LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
PhyGaP: Physically-Grounded Gaussians with Polarization Cues
Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network
Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
Yume1.5: A Text-Controlled Interactive World Generation Model
ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment
Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Anti-Degradation Lifelong Multi-View Clustering
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
Common Inpainted Objects In-N-Out of Context
UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement
Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
TANGO: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
Edit-aware RAW reconstruction
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
Bridging Domains through Subspace-Aware Model Merging
Reinforcing Structured Chain-of-Thought for Video Understanding
OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
Representing 3D Faces with Learnable B-Spline Volumes
HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation
EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
Globally Optimal Pose from Orthographic Silhouettes
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
Lynx: Towards High-Fidelity Personalized Video Generation
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
MeshRipple: Structured Autoregressive Generation of Artist-Meshes
Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
SAM 3D: 3Dfy Anything in Images
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation
Learnability-Driven Submodular Optimization for Active Roadside 3D Detection
Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
Towards Generalized Multimodal Homography Estimation
MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
Z-Order Transformer for Feed-Forward Gaussian Splatting
SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
MAMMA: Markerless Accurate Multi-person Motion Acquisition
SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR
Scene-Centric Unsupervised Video Panoptic Segmentation
CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement
Translating Signals to Languages for sEMG-Based Activity Recognition
Vinedresser3D: Towards Agentic Text-guided 3D Editing
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
What Matters in Practical Learned Image Compression
Dynamic Momentum Recalibration in Online Gradient Learning
The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection
PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
UniCorrn: Unified Correspondence Transformer Across 2D and 3D
An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
TruckDrive: Long-Range Autonomous Highway Driving Dataset
BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
First Frame Is the Place to Go for Video Content Customization
SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
Optical Diffraction-based Convolution for Semiconductor Lithography
Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
Residual Diffusion Bridge Model for Image Restoration
What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F1
NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
Prompt-Free Universal Region Proposal Network
OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks
Frequency-domain Manipulation for Face Obfuscation
POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse
Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence
HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification
SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
D-Prism: Differentiable Primitives for Structured Dynamic Modeling
DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Paparazzo: Active Mapping of Moving 3D Objects
An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos
HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering
Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
InterRVOS: Interaction-Aware Referring Video Object Segmentation
Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
TrackMAE: Video Representation Learning via Track Mask and Predict
BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching
A³: Towards Advertising Aesthetic Assessment
PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing
CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization
P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
Ego: Embedding-Guided Personalization of Vision-Language Models
Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior
D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
Learning What Helps: Task-Aligned Context Selection for Vision Tasks
APPO: Attention-guided Perception Policy Optimization for Video Reasoning
ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
SPDMark: Selective Parameter Displacement for Robust Video Watermarking
ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
What Are You Doing? A Closer Look at Controllable Human Video Generation
ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy
Radiance Meshes for Volumetric Reconstruction
Cross-Hand Latent Representation for Vision-Language-Action Models
Disco-GS: Gaussian Splatting in Dynamic Color Lighting
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing
Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
RAID: Retrieval-Augmented Anomaly Detection
Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
Dynamic Exposure Burst Image Restoration
Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
M^3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
Any4D: Unified Feed-Forward Metric 4D Reconstruction
DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
Kaleidoscopic Scintillation Event Imaging
Self-Consistency for LLM-Based Motion Trajectory Generation and Verification
CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding
CARD: Correlation Aware Restoration with Diffusion
Event6D: Event-based Novel Object 6D Pose Tracking
Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors
Learning complete and explainable visual representations from itemized text supervision
Foundry: Distilling 3D Foundation Models for the Edge
UniSER: A Foundation Model for Unified Soft Effects Removal
Condensed Test-Time Adaptation of VLMs for Action Recognition
GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
L^2DGS: Low-Light Dynamic Gaussian Splatting
Grid Distillation: Compositional Image Distillation via Structured Generative Grids
ExpPortrait: Expressive Portrait Generation via Personalized Representation
Delta Rectified Flow Sampling for Text-to-Image Editing
When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
Best Segmentation Buddies for Image-Shape Correspondence
FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
Beyond Caption-Based Queries in Video Moment Retrieval
An Empirical Study on How Video-LLMs Answer Video Questions
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
Linear Image Generation by Synthesizing Exposure Brackets
SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment
Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
WPT: World-to-Policy Transfer via Online World Model Distillation
Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images
LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
Human Interaction-Aware 3D Reconstruction from a Single Image
Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
Suppressing Non-Semantic Noise in Masked Image Modeling Representations
Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
AE2VID: Event-based Video Reconstruction via Aperture Modulation
FloVerse: Floor Plan-Guided Multi-Modal Navigation
Next-Scale Autoregressive Models for Text-to-Motion Generation
PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
Recurrent Video Masked Autoencoders
HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter
PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
Domain-Skewed Federated Learning with Feature Decoupling and Calibration
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
Synthesizing Visual Concepts as Vision-Language Programs
CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
Voxify3D: Pixel Art Meets Volumetric Rendering
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Explaining Object Detectors via Collective Contribution of Pixels
Foundation Encoders Are All You Need for Preference-Aware Personalization
CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis
A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Same or Not? Enhancing Visual Perception in Vision-Language Models
Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach
Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
FrankenMotion: Part-level Human Motion Generation and Composition
FE2E: From Editor to Dense Geometry Estimator
Lighting in Motion: Spatiotemporal HDR Lighting Estimation
Aligning Text, Images and 3D Structure Token-by-Token
Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
RewardFlow: Generate Images by Optimizing What You Reward
AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
Coded-E2LF: Coded Aperture Light Field Imaging from Events
Bridging Facial Understanding and Animation via Language Models
Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
Visual Personalization Turing Test
BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Parallelised Differentiable Straightest Geodesics for 3D Meshes
PersonaLive! Expressive Portrait Image Animation for Live Streaming
CLIP-like Model as a Foundational Density Ratio Estimator
SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
Inference-time Physics Alignment of Video Generative Models with Latent World Models
JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression
DeDelayed: Deleting Remote Inference Delay via On-Device Correction
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
Towards Multimodal Domain Generalization with Few Labels
Coverage Optimization for Camera View Selection
FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
PersonaVLM: Long-Term Personalized Multimodal LLMs
gQIR: Generative Quanta Image Reconstruction
SMVRT: Implicit Human 3D Modeling Using Sparse Multi-View Volumetric Reconstruction with Transformer Fusion
NIL: No-data Imitation Learning
QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
Few-for-Many Personalized Federated Learning
A More Word-like Image Tokenization for MLLMs
Image-Guided Geometric Stylization of 3D Meshes
Boosting Reasoning in Large Multimodal Models via Activation Replay
Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
Robust Spiking Neural Networks by Temporal Mutual Information
V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
Semantic Scale Space: A Framework for Controllable Image Abstraction
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification
Image Generation from Contextually-Contradictory Prompts
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Spike-driven Discrete Aggregation for Event-based Object Detection
Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
VQ-VA World: Towards High-Quality Visual Question-Visual Answering
RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
FMPose3D: monocular 3D pose estimation via flow matching
Eulerian Gaussian Splatting using Hashed Probability Pyramids
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
Learning to Solve PDEs on Neural Shape Representations
Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
Few-shot Acoustic Synthesis with Multimodal Flow Matching
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Designing to Forget: Deep Semi-parametric Models for Unlearning
UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds
Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
Differentially Private 2D Human Pose Estimation
Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe
2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
Fine-Grained Multi Image Object Hallucination Benchmark
Adaptive Confidence Regularization for Multimodal Failure Detection
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
Endless World: Real-Time 3D-Aware Long Video Generation
Volumetric Functional Maps
TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
LoST: Level of Semantics Tokenization for 3D Shapes
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
Gaussian Mapping for Evolving Scenes
Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
MV-TAP: Tracking Any Point in Multi-View Videos
InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
ChordEdit: One-Step Low-Energy Transport for Image Editing
InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection
Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
Lafite: A Generative Latent Field for 3D Native Texturing
Point Cloud as a Foreign Language for Multi-modal Large Language Model
Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
Distilling Balanced Knowledge from a Biased Teacher
Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
TextFM: Robust Semi-dense Feature Matching with Language Guidance
Obstruction Reasoning for Robotic Grasping
Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
Improving Adversarial Transferability with Local Perturbation Augmentation
VOSR: A Vision-Only Generative Model for Image Super-Resolution
MAD: Motion Appearance Decoupling for efficient Driving World Models
MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
Generative Video Motion Editing with 3D Point Tracks
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors
PAVAS: Physics-Aware Video-to-Audio Synthesis
A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking
Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
FPSBench: A Benchmark for Video Understanding at High Frame Rates
PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph
Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data
V-DPM: 4D Video Reconstruction with Dynamic Point Maps
3D-LATTE: Latent Space 3D Editing from Textual Instructions
EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Composing Concepts from Images and Videos via Concept-prompt Binding
DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
Self-Corrected Image Generation with Explainable Latent Rewards
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
URScenes: A Multi-scenario Dataset for Unstructured Road Environments
When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
Modeling the Visual Ambiguity of Human Sketches
PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose
Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
Precise Object and Effect Removal with Adaptive Target-Aware Attention
AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
Block-based Learned Image Compression without Blocking Artifacts
MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Tunable Soft Equivariance with Guarantees
Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
Geo2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
S^2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
VENI: Variational Encoder for Natural Illumination
RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning
Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All
UIKA: Fast Universal Head Avatar from Pose-Free Images
Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs
Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
Scene Reconstruction as Mapping Priors for 3D Detection
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
TokenLight: Precise Lighting Control in Images using Attribute Tokens
EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
Weight Space Representation Learning via Neural Field Adaptation
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy
Low-Resolution Editing is All You Need for High-Resolution Editing
Draft and Refine with Visual Experts
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
Mario: Multimodal Graph Reasoning with Large Language Models
Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification
CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation
Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective
Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
Grounded Latents for Entity-Centric 4D Scene Generation
InternVideo-Next: Towards World-Understanding Video Models
MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction
DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease
Deep Feature Deformation Weights
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
Temporal Inversion for Learning Interval Change in Chest X-Rays
Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions
GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
Defending Unauthorized Model Merging via Dual-Stage Weight Protection
Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
FARMER: Flow AutoRegressive Transformer over Pixels
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
Stake the Points: Structure-Faithful Instance Unlearning
VecGlypher: Unified Vector Glyph Generation with Language Models
GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
Dual Ascent Diffusion for Inverse Problems
PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
Zoo3D: Zero-Shot 3D Object Detection at Scene Level
Generative Diffusion Priors for 3D Mapping of the Dark Universe
FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
Multi-speaker Attention Alignment for Multimodal Social Interaction
LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
MUFASA: A Multi-Layer Framework for Slot Attention
R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
Structural Graph Probing of Vision–Language Models
VABench: A Comprehensive Benchmark for Audio-Video Generation
Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
SVAgent: Storyline-guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
VGGT-Ω
FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
INSID3: Training-Free In-Context Segmentation with DINOv3
IGen: Scalable Data Generation for Robot Learning from Open-World Images
Exemplar-Free Continual Learning for State Space Models
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
RefAV: Towards Planning-Centric Scenario Mining
FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
ReLaGS: Relational Language Gaussian Splatting
OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
WorldGen: From Text to Traversable and Interactive 3D Worlds
InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
GS-ASM: 2DGS-Supervised Active Stereo Matching
Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
Generalizable Video Quality Assessment via Weak-to-Strong Learning
Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
FEAT: Fashion Editing and Try-On from Any Design
RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution
Multimodal Distribution Matching for Vision-Language Dataset Distillation
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
DC-Merge: Improving Model Merging with Directional Consistency
Coupling Liquid Time-Constant Encoders with Modern Hopfield Memory
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
Feed-forward Gaussian Registration for Head Avatar Creation and Editing
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Rethinking Token Reduction for Large Vision-Language Models
Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
HTTM: Head-wise Temporal Token Merging for Faster VGGT
Relightful Video Portrait Harmonization
Self-Diffusion Driven Blind Imaging
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis
Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition
MuM: Multi-View Masked Image Modeling for 3D Vision
Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
PAI-Bench: A Comprehensive Benchmark For Physical AI
DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
LitePT: Lighter Yet Stronger Point Transformer
MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
Pixel Motion Diffusion is What We Need for Robot Control
Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion
PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code
RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty
GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection
Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
Thinking in 360°: Humanoid Visual Search in the Wild
CaptionQA: Is Your Caption as Useful as the Image Itself?
Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data
PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
CREward: A Type-Specific Creativity Reward Model
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
MARCO: Navigating the Unseen Space of Semantic Correspondence
PECCVAI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
NEAF: Natural Image Editing with Attention Fusion for Generalizable Test-time Optimization in Text-Guided Image Editing
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
Dynamic Token Reweighting for Robust Vision-Language Models
When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
SG-LoRA: Semantic-guided LoRA Parameters Generation
MoVie: Broaden Your Views with Human Motion for Action Detection
Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs
DiffBMP: Differentiable Rendering with Bitmap Primitives
Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
OccAny: Generalized Unconstrained Urban 3D Occupancy
GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
AirSim360: A Panoramic Simulation Platform within Drone View
Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
Global Structure-from-Motion Meets Feedforward Reconstruction
AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
Task-Driven Implicit Representations for Automated Design of LiDAR Systems
Text-Driven 3D Hand Motion Generation from Sign Language Data
Captain Safari: A World Engine with Pose-Aligned 3D Memory
Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
CHEEM: Continual Learning by Reuse, New, Adapt and Skip - A Hierarchical Exploration-Exploitation Approach
LumiX: Structured and Coherent Text-to-Intrinsic Generation
4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
Beyond the Static World: Continual Category Discovery under Visual Drift
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction
FUN REC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Soft Modality-Guided Expert Specialization in MoE-VLMs
A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
Enhancing Spatial Understanding in Image Generation via Reward Modeling
SoccerMaster: A Vision Foundation Model for Soccer Understanding
3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
Perceptual 3D Simulation With Physical World Modeling
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Portable Active Learning for Object Detection
Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
Learning 3D Shape Fidelity Metric from Real-world Distortions
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
HoneyBee: Data Recipes for Vision-Language Reasoners
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
PE3R: Perception-Efficient 3D Reconstruction
ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking
Geometric-Photometric Event-based 3D Gaussian Ray Tracing
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
Dynamic Visual SLAM using a General 3D Prior
ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
Unified Number-Free Text-to-Motion Generation Via Flow Matching
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
Particulate: Feed-Forward 3D Object Articulation
Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
Hierarchical Process Reward Models are Symbolic Vision Learners
Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
DuoGen: Towards Autonomous Interleaved Multimodal Generation
Modeling Cross-vision Synergy for Unified Large Vision Model
NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
B^3-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
Fully Decentralized Certified Unlearning
MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis
UniDAC: Universal Metric Depth Estimation for Any Camera
Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
Text-guided Feature Disentanglement for Cross-modal Gait Recognition
Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
Linking Perception, Confidence and Accuracy in MLLMs
EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
Verifying Neural Network Robustness with Dual Perturbations
M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
GeoSANE: Learning Geospatial Representations from Models, Not Data
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
Sparse-View Localization via Online Neural 3D Regression
Retrieving Counterfactuals Improves Visual In-Context Learning
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains
Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
Solvability of the Viewing Graph Under the Affine Camera Model
Detecting Unknown Objects via Energy-based Separation for Open World Object Detection
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
GazeShift: Unsupervised Gaze Estimation and Dataset for VR
VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
Streamlined Knowledge Distillation
UniVBench: Towards Unified Evaluation for Video Foundation Models
Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
TIGER: A Unified Framework for Time, Images and Geo-location Retrieval
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics
AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification
BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Linear Fundamental Matrix Estimation from 7 or 5 Points
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels
RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling
BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
Mixture of Prototypes for Test-time Adaptive Segmentation
Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
Phrase-grounded APO for Improving Chest X-ray Report Generation
Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
240FPS Stereo Vision from Monocular Mixed Spikes
When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion
Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
3D-IDE: 3D Implicit Depth Emergent
Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
CI-VID: A Coherent Interleaved Text-Video Dataset
OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
Unique Lives, Shared World: Learning from Single-Life Videos
Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
TSTM: Temporal Segmentation for Task-relevant Mask in Visual Reinforcement Learning Generalization
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations
CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
Stronger Normalization-Free Transformers
Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
Enhancing Out-of-Distribution Detection with Extended Logit Normalization
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
Cycle-Consistent Tuning for Layered Image Decomposition
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
OS-Fed: One Snapshot Is All You Need
Latent Diffusion Inversion Requires Understanding the Latent Space
EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
Adapting In-context Generation for Enhanced Composed Image Retrieval
A Bit is All You Need! Efficient Video Capture via Single Bit Imaging
Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
Free-Grained Hierarchical Visual Recognition
VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
GGPT: Geometry-Grounded Point Transformer
MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
UniLight: A Unified Representation for Lighting
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Spot The Ball: A Benchmark for Visual Social Inference
EasyV2V: A High-quality Instruction-based Video Editing Framework
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Video Panels for Long Video Understanding
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
Hist2Style: Histogram-Guided Stylization with Bilateral Grids
MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration
PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
Global-Aware Edge Prioritization for Pose Graph Initialization
Adapting Lightweight Image-based Counting Models for Video Crowd Counting
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
BAMI: Training-Free Bias Mitigation in GUI Grounding
A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
GM-R^2: Generative Matching Learning for Unsupervised Geometric Representation and Registration
Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL
BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
HybridDriveVLA: Vision-Language-Action Model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle
Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
Efficient and Training-Free Single-Image Diffusion Models
Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments
SVBench: Evaluation of Video Generation Models on Social Reasoning
Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
Scale Space Diffusion
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
Anomaly-Related Residual Fields for Cross-domain Anomaly Detection
DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
High-Quality and Efficient Turbulence Mitigation with Events
FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
Beyond the Ground Truth: Enhanced Supervision for Image Restoration
Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models
Compressed-Domain-Aware Online Video Super-Resolution
Agentic Retoucher for Text-To-Image Generation
Content-Aware Dynamic Patchification for Efficient Video Diffusion
MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
Understanding Task Transfer in Vision-Language Models
Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
Region-Aware Instance Consistency Learning for Micro-Expression Recognition
Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
Dark3R: Learning Structure from Motion in the Dark
Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
Explaining CLIP Zero-shot Predictions Through Concepts
TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
Progressive Multi-cue Alignment for Unaligned RGBT Tracking
NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
Align Images Before You Generate
PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
Sampling-Aware Quantization for Diffusion Models
Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
Driving on Registers
PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
Goldilocks Test Sets for Face Verification
Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
FG-Portrait: 3D Flow Guided Editable Portrait Animation
Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
Personalized Federated Training of Diffusion Models with Privacy Guarantees
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
F^2HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
Mirror Illusion Art
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
EXOTIC: External Vision-driven Incomplete Multi-view Classification
SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
Efficient Equivariant Transformer for Self-Driving Agent Modeling
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
Sparse Spectral LoRA: Routed Experts for Medical VLMs
Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent
Match-and-Fuse: Consistent Generation from Unstructured Image Sets
Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
Rethinking Glyph Spatial Information in Font Generation
Describe Anything Anywhere At Any Moment
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
One Algorithm to Align Them All
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs
VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
The Universal Normal Embedding
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
MM-ACT: Learn from Multimodal Parallel Generation to Act
Decision Boundary-aware Generation for Long-tailed Learning
ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation
Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
IPR-1: Interactive Physical Reasoner
Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
2D-LFM: Lifting Foundation Model without 3D Supervision
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
LA-Pose: Latent Action Pretraining Meets Pose Estimation
CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks
Finding Distributed Object-Centric Properties in Self-Supervised Transformers
SineProject: Machine Unlearning for Stable Vision-Language Alignment
From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation
Order Matters: 3D Shape Generation from Sequential VR Sketches
FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
GenTract: Generative Global Tractography
DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
Is the Modality Gap a Bug or a Feature? A Robustness Perspective
VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis
LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion
Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters
Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels
cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
Minimal Constraint Relaxation for Multiview Autocalibration
Vista4D: Video Reshooting with 4D Point Clouds
CADC: Content Adaptive Diffusion-Based Generative Image Compression
ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
How to Take a Memorable Picture? Empowering Users with Actionable Feedback
TESO: Online Tracking of Essential Matrix by Stochastic Optimization
LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
SonoWorld: From One Image to a 3D Audio-Visual Scene
When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks
QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
DSO: Direct Steering Optimization for Bias Mitigation
Parallel Rigidity Matters for Bundle Adjustment
UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
Reward Sharpness-Aware Fine-Tuning for Diffusion Models
Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
Bridging the Perception Gap in Image Super-Resolution Evaluation
FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
Inter-Photon-Limited Videography
ReasonX: MLLM-Guided Intrinsic Image Decomposition
Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
C^2FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
High-Fidelity Mobile Avatars with Pruned Local Blendshapes
Latent Implicit Visual Reasoning
MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
BiGain: Unified Token Compression for Joint Generation and Classification
ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer
SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
MPL: Match-guided Prototype Learning for Few-shot Action Recognition
PHAC: Promptable Human Amodal Completion
Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport
Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling
ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
Visual Autoregressive Modeling via Next Focus Prediction
LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
CamDirector: Towards Long-Term Coherent Video Trajectory Editing
KV-Tracker: Real-Time Pose Tracking with Transformers
MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation
WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
FabricGen: Microstructure-Aware Woven Fabric Generation
StreamDiT: Real-Time Streaming Text-to-Video Generation
SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
Mirai: Autoregressive Visual Generation Needs Foresight
Mechanisms of Object Localization in Vision–Language Models
SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
Dexterous World Models
Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
Guiding Token-Sparse Diffusion Models
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
Scene Grounding in the Wild
GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
Time Blindness: Why Video-Language Models Can’t See What Humans Can?
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs
Evidential Neural Radiance Fields
Diffusion Mental Averages
See Through the Noise: Improving Domain Generalization in Gaze Estimation
Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
Uni-Hema: Unified Model for Digital Hematopathology
Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
Human Geometry Distribution for 3D Animation Generation
Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Interactive Episodic Memory with User Feedback
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models
LumiMotion: Improving Gaussian Relighting with Scene Dynamics
ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
Learning Forgery-Aware Lip Representations Without Forgery Priors
PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
Landscape-Awareness for Geometric View Diffusion Model
Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection
SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting
Hyperbolic Busemann Neural Networks
GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
REACH: Explicit Recovery Behavior for Diffusion Policies
Guiding Diffusion Models with Semantically Degraded Conditions
IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
D2T2 - Multimodal Automated Planning for Brachytherapy
SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior-Guided Multimodal LLMs
Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach
ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
Benchmarking Endoscopic Surgical Image Restoration and Beyond
ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
LoL: Longer than Longer, Scaling Video Generation to Hour
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
MERIT: Multi-domain Efficient RAW Image Translation
ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection
The Drift Kernel: Why Diffusion Models Change Even When Told Not To
NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
Emergent Extreme-View Geometry in 3D Foundation Models
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
iLRM: An Iterative Large 3D Reconstruction Model
Unified Camera Positional Encoding for Controlled Video Generation
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
Learning 3D Reconstruction with Priors in Test Time
ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
Revisiting Model Stitching In the Foundation Model Era
When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
Personalized Image Descriptions from Attention Sequences
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
UniChange: Unifying Change Detection with Multimodal Large Language Model
PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation
SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
AnthroTAP: Learning Point Tracking with Real-World Motion
DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution
RenderFlow: Single-Step Neural Rendering via Flow Matching
Smoothing the Score Function to Enhance Generalization in Diffusion Models
WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
PARSE: Part-Aware Relational Spatial Modeling
Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering
R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
A Difference-in-Difference Approach to Detecting AI-Generated Images
BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
Seeing Motion Through Polarity for Event-based Action Recognition
Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection
Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
StreamReady: Learning What to Answer and When in Long Streaming Videos
3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
RFDM: Residual Flow Diffusion Models for Video Editing
Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation
TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking
Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency
VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
Robust Promptable Video Object Segmentation
Generative Modeling of Weights: Generalization or Memorization?
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
Controllable Federated Prompt Learning at Test Time
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
Progressive Mask Distillation for Self-supervised Video Representation
CoWTracker: Tracking by Warping instead of Correlation
DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks
BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
Progressive Supernet Training for Efficient Visual Autoregressive Modeling
Collaborative Multi-Mode Pruning for Vision-Language Models
AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature
DROID-SLAM in the Wild
DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities
PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
SkillSight: Efficient First-Person Skill Assessment with Gaze
SIR: Structured Image Representations for Explainable Robot Learning
VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
Generative Neural Video Compression via Video Diffusion Prior
GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection
Measuring the (Un)Faithfulness of Concept-Based Explanations
Scaling Parallel Sequence Models to Vision Foundation Models
Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
Causal Motion Diffusion Models for Autoregressive Motion Generation
MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
Computer Vision with a Superpixelation Camera
Learnability-Guided Diffusion for Dataset Distillation
The Invisible Gorilla Effect in Out-of-distribution Detection
Nonlinear Color Transfer via Learnable Bezier Flows
CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
Language Models Can Explain Visual Features via Steering
Cinematic Audio Source Separation Using Visual Cues
Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Towards Intrinsic-Aware Monocular 3D Object Detection
DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
TopoCL: Topological Contrastive Learning for Medical Imaging
CoT-Edit: Let CoT Guide Instruction Video Editing
Language-guided Frequency Modulation for Large Vision-Language Models
Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
Semantic Alignment for Pose-Invariant Identity Preserving Diffusion
Exposing and Evaluating Hallucinations for GUI Grounding
Image-based Outlier Synthesis With Training Data
Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
Geometrically-Constrained Agent for Spatial Reasoning
Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
TempoControl: Temporal Attention Guidance for Text-to-Video Models
Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
EvoID: Reinforced Evolution for Identity-Preserving Video Generation
Photo-Guided Tooth Segmentation on 3D Oral Scan Model
Interpretable Debiasing of Vision-Language Models for Social Fairness
Post-training Feature Pruning for Fundus Images Classification
Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
MacTok: Robust Continuous Tokenization for Image Generation
Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning
Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
Hyperbolic Gramian Volumes for Multimodal Alignment
KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
Multi-view Pyramid Transformer: Look Coarser to See Broader
PhysVid: Physics Aware Local Conditioning for Generative Video Models
Learning by Analogy: A Causal Framework for Compositional Generalization
Visual Diffusion Models are Geometric Solvers
VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
RL‑ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification
MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
Improving Sparse Autoencoder with Dynamic Attention
When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
Unified Vector Floorplan Generation via Markup Representation
LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
Drainage: A Unifying Framework for Addressing Class Uncertainty
MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation
Decoupling Vision and Language: Codebook Anchored Visual Adaptation
Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
VideoCoF: Unified Video Editing with Temporal Reasoner
Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding
Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
DIMOS: Disentangling Instance-level Moving Object Segmentation
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection
Differentiable Laplacian Matrix Guided Superpixel Segmentation
Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
SpiderCam: Low-Power Snapshot Depth from Differential Defocus
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization
ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection
Deformation-based In-Context Learning for Point Cloud Understanding
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
Making the Classification Explanation Faithful to the Confidence Score
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
Learning to Infer Parameterized Representations of Plants from 3D Scans
Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation
ReBaPL: Repulsive Bayesian Prompt Learning
Specificity-aware reinforcement learning for fine-grained open-world classification
UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
CountGD++: Generalized Prompting for Open-World Counting
Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
Hyper-PCN: Hypergraph-Based Point Cloud Completion via High-Order Correlation Modeling
SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
SimScale: Learning to Drive via Real-World Simulation at Scale
Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
MusicInfuser: Making Video Diffusion Listen and Dance
ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
Learning Straight Flows: Variational Flow Matching for Efficient Generation
Mapping Networks
Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
Forecasting 3D Scanpaths in Egocentric Video
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
Latent Chain-of-Thought World Modeling for End-to-End Driving
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
BluRef: Unsupervised Image Deblurring with Dense-Matching References
From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
Spatiotemporal Pyramid Flow Matching for Climate Emulation
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
OctoNav: Towards Generalist Embodied Navigation
FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting
GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
Seeing Conversations: Communication Context Identification in Egocentric Video
MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model
TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
Learning Convex Decomposition via Feature Fields
Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection
RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
FILTR: Extracting Topological Features from Pretrained 3D Models
SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
RankOOD - Class Ranking-based Out-of-Distribution Detection
EDGS: Eliminating Densification for Efficient Convergence of 3DGS
ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
End-to-End Language-Action Model for Humanoid Whole Body Control
HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
Lenses: Toward Polysemous Vision–Language Understanding
D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
Towards Calibrating Prompt Tuning of Vision-Language Models
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Image-to-Point Cloud Feature Back-Projection for Multimodal Training of 3D Semantic Segmentation
Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
MIBURI: Towards Expressive Interactive Gesture Synthesis
Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning
IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
Resolving the Identity Crisis in Text-to-Image Generation
TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Remarks

Welcome & Awards

Tutorials

The Principles of Diffusion Models: Real-Time Continuous & Discrete Diffusion
Edge AI in Action: Mastering On-Device Inference
Tom Builds, Tom Breaks: Hands-On Attacks and Defenses for Vision-Language Systems
Accelerated Diffusion Models: From Theory to Interactive World Models
Building GenAI based Simulation Environment for End-to-End Autonomous Driving
From Perception to Simulation: The Emergence of World Models in Multi-modal Reasoning
Monte Carlo physical simulation
3D Human Mesh Modeling and Recovery from RGB and LiDAR
Recent Advances in AI for Medical Imaging: Progress, Challenges, and Future Directions
Computer Vision at Scale: Multi-Camera Tracking, Calibration, and Event Detection for Checkout-Free Retail
Extending Computer Vision to Hidden Objects: A Tutorial on Millimeter-Wave Imaging and Reconstruction of Occluded Scenes
The Full Stack of Physical AI: Simulation, Foundation Models, and Edge Deployment for Next-Generation Robotics Applications
Analytic understanding of diffusion models
All You Need To Know About Self-Driving
The Road to Convergence: Evolution of Unified Multimodal Models
From Perception to Action: Building Efficient and Deployable Robot Intelligence Pipelines with Open-Source Edge AI Toolkits
Foundations and Frontiers of Watermarking: Algorithms, Multimodal Extensions, Benchmarks, and Authenticity Frameworks

Workshops

PHAROS AI Factory for Medical Imaging & Healthcare
Generative AI for XR and Identity-based Applications
Foundation Models for Autonomous Driving
The 3rd Workshop on Human Motion Generation - New Perspective on Simulation, Animation, and VR applications
From Lab Demos to Daily Tasks: Embodied Intelligence in the Wild
The 5th Explainable AI for Computer Vision (XAI4CV) Workshop
Multimodal Foundation Models for Biomedicine: Challenges and Opportunities
Workshop on Agentic AI for Visual Media
Workshop on World Models Meet Active Sensing and Closed-Loop Planning
Autonomous Understanding Through Open-world Perception and Integrated Language models for On-road Tasks
On Sensor Vision Workshop
Workshop on Vision-based Assistants in the Real-World
Multimodal Alignment for a Pluralistic Society
IPA: Interactive Physical AI Workshop
AI for Content Creation
The 5th DataCV Workshop and Challenge
Sense of Space: Multi-Sensory Modeling for Embodied Intelligence
The 3rd AI for Visual Arts Workshop and Challenges
Third Joint Egocentric Vision (EgoVis) Workshop
2nd Workshop on Photorealistic 3D Head Avatars
AERO-HPR: Human Perception and Recognition in Aerial Surveillance
Efficient Deep Learning for Computer Vision
The 22nd Embedded Vision Workshop
Urban Scene Modeling: Structured, Semantic, and Synthetic 3D Habitats
Authenticity & Provenance in the age of Generative AI
The 1st Workshop on Monitoring the World through an Imperfect Lens
The Second CVPR Workshop on Foundation and Large Vision Models in Remote Sensing (MORSE)
Synthetic & Adversarial ForEnsics
The 1st Workshop on Vision for Intelligent Task Assistants
Computer Vision for the Built World
Sixth Workshop on Neural Architecture Search
DataMFM: Emerging Directions in Data for Multimodal Foundation Models
Computer Vision for Biomechanics Workshop
3rd Workshop on Efficient and On-Device Generation (EDGE), CVPR 2026
2nd Workshop on Multimodal Sign Language Recognition
10th Affective & Behavior Analysis in-the-wild
Workshop on Multimodal Human Motion Analysis
Cognitive Foundations for Multimodal Models
OpenSUN3D: 6th Workshop on Open-World 3D Scene Understanding with Foundation Models
The 3rd MetaFood Workshop (MTF)
3rd Workshop on ScanNet++ Novel View Synthesis and 3D Semantic Understanding Challenge
Auto-Annotation with Expert-Crafted Guidelines
The 5th Workshop on “What is Next in Multimodal Foundation Models?”
Machine Unlearning for Vision
The 2nd 3D-LLM/VLA Workshop: Bridging Language, Vision and Action in 3D Environments
The 7th International Workshop and CVML Challenge on Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture
1st Workshop on Multi-Agent Robotic Systems: Scaling with Compositional Intelligence
Second Workshop on Foundation and Generative Models in Biometrics
Rediscovering Intelligence: Can AI Still Learn from Humans?
3D Geometry Generation for Scientific Computing (2nd Edition)
2nd Workshop on Knowledge-Intensive Multimodal Reasoning
The 3rd Workshop on New Trends in AI-Generated Media and Security
2nd Workshop on Computer Vision for Children
The Seventh Annual Embodied Artificial Intelligence Workshop
From Perception to Persuasion: Challenges and Advances in Misinformation Detection in Society
Workshop on Visual Concepts
9th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues
6th Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics
Third Workshop for Learning 3D with Multi-View Supervision
Trustworthy, Robust, Uncertainty-Aware, and Explainable Visual Intelligence and Beyond
Humans of Generative AI
2nd Workshop on Video Large Language Models
Sight and Sound
2nd Workshop on Agents in Interaction, from Humans to Robots
The Second Workshop on the Evaluation of the Generative Foundation Models
SPAR-3D: Security, Privacy, and Adversarial Robustness in 3D Generative Vision Models
Video Generative Models: Benchmarks and Evaluation
Unified Robotic Vision with Cross-Modal Sensing and Alignment
The 8th UG2+ Workshop and Challenge: Bridging the Gap between Computational Photography and Visual Perception
4th Workshop on Maritime Computer Vision
Safe Artificial Intelligence for All Domains
6th Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling
PhysHuman: Physically Grounded Human Perception and Modeling
Exploring the Next Generation of Data
Personalization in Generative AI Workshop
9th Multimodal Learning and Applications Workshop
4th Workshop on Generative Models for Computer Vision
2nd Workshop on Human-Interactive Generation and Editing
12th IEEE International Workshop on Computer Vision in Sports
The 6th Workshop of Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents
How Do Vision Models Work?
Domain Generalization: Evolution, Breakthroughs, and Future Horizons (2nd Edition)
2nd Workshop on GenAI for Storytelling
CVPR 2026 Biometrics Workshop
Medical Reasoning with Vision Language Foundation Models
Computer Vision × Education: Building a Cross‑Community Agenda for Multimodal Vision in Classrooms
2nd Workshop on 4D Vision: Modeling the Dynamic World
1st Workshop on Generative 3D Reconstruction
The 3rd Workshop on Synthetic Data for Computer Vision
ScaleBot: The First Workshop on Scalable Robot Learning Systems
The 2nd CVPR Workshop on Foundation Models Meet Embodied Agents
CV4Science: Using Computer Vision for the Sciences
Artificial Intelligence for Space
The 7th International Workshop on Eye and Gaze in Computer Vision
Big Model Adaptation In Computer Vision
Bridging AI and Medical Reality: Computer Vision for Real-world Clinical Translation
4D Digital Twins: Real-to-Sim-to-Real for Physical AI
The 2nd Workshop on Multi-Modal Reasoning for Agentic Intelligence
1st Workshop on Journey to the Awards: Generative AI for Movie-Grade Video Production (J2A), CVPR 2026
Second Workshop on Skilled Activity Understanding, Assessment & Feedback Generation
Pixel-level Video Understanding in the Wild Challenge
The Third Workshop on Anomaly Detection with Foundation Models
Appearance Understanding and Generation
See the World in a Different Light: Physical Appearance Modeling and Relighting in the Age of Generative AI
