Timezone: America/Denver
TUE 2 JUN
2 p.m.
(ends 8:00 PM)
WED 3 JUN
8 a.m.
Tutorial:
(ends 12:00 PM)
Tutorial:
(ends 12:00 PM)
9 a.m.
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 1:00 PM)
10 a.m.
1 p.m.
Tutorial:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
1:30 p.m.
Tutorial:
(ends 6:00 PM)
Tutorial:
(ends 6:00 PM)
3 p.m.
THU 4 JUN
8 a.m.
Tutorial:
(ends 12:00 PM)
Tutorial:
(ends 12:00 PM)
Tutorial:
(ends 12:00 PM)
9 a.m.
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 1:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
10 a.m.
1 p.m.
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
Workshop:
(ends 6:00 PM)
1:30 p.m.
Tutorial:
(ends 6:00 PM)
Tutorial:
(ends 6:00 PM)
3 p.m.
FRI 5 JUN
8:30 a.m.
8:45 a.m.
9 a.m.
9:15 a.m.
Orals 9:15-10:30
[9:15]
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
[9:30]
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
[9:45]
ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
[10:00]
TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
[10:15]
ViT^3: Unlocking Test-Time Training in Vision
(ends 10:30 AM)
Orals 9:15-10:30
[9:15]
Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion
[9:27]
Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
[9:40]
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging
[9:52]
Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack
[10:05]
Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework
[10:17]
UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
(ends 10:30 AM)
Orals 9:15-10:30
[9:15]
Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
[9:27]
Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
[9:40]
RAVEN: Erasing Invisible Watermarks via Novel View Synthesis
[9:52]
LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
[10:05]
NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
[10:17]
Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
(ends 10:30 AM)
Orals 9:15-10:30
[9:15]
Advancing Image Classification with Discrete Diffusion Classification Modeling
[9:27]
Does YOLO Really Need to See Every Training Image in Every Epoch?
[9:40]
Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks
[9:52]
NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices
[10:05]
Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
[10:17]
Rethinking Dataset Distillation: Hard Truths about Soft Labels
(ends 10:30 AM)
10:15 a.m.
10:45 a.m.
Posters 10:45-12:45
Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
Fast SceneScript: Fast and Accurate Language‑Based 3D Scene Understanding via Multi‑Token Prediction
Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
(ends 12:45 PM)
(ends 6:00 PM)
11 a.m.
(ends 11:30 AM)
1 p.m.
Orals 1:00-2:15
[1:00]
4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
[1:12]
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
[1:25]
FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement
[1:37]
Residual Primitive Fitting of 3D Shapes with SuperFrusta
[1:50]
SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
[2:02]
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
(ends 2:15 PM)
Orals 1:00-2:15
[1:00]
3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
[1:12]
GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
[1:25]
Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
[1:37]
PhyGaP: Physically-Grounded Gaussians with Polarization Cues
[1:50]
PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
[2:02]
SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping
(ends 2:15 PM)
Orals 1:00-2:15
[1:00]
MAMMA: Markerless Accurate Multi-person Motion Acquisition
[1:12]
Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos
[1:25]
PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
[1:37]
SAM 3D Body: Robust Full-Body Human Mesh Recovery
[1:50]
SAM 3D: 3Dfy Anything in Images
[2:02]
SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
(ends 2:15 PM)
Orals 1:00-2:15
[1:00]
Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
[1:12]
MeshSplatting: Differentiable Rendering with Opaque Meshes
[1:25]
Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting
[1:37]
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
[1:50]
Selfi: Self-improving Reconstruction Engine via 3D Geometric Feature Alignment
[2:02]
Z-Order Transformer for Feed-Forward Gaussian Splatting
(ends 2:15 PM)
1:30 p.m.
(ends 2:30 PM)
2:15 p.m.
2:45 p.m.
3:30 p.m.
4 p.m.
Posters 4:00-6:00
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts
(ends 6:00 PM)
5 p.m.
(ends 5:30 PM)
SAT 6 JUN
9 a.m.
Orals 9:00-10:15
[9:00]
Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation
[9:12]
Guiding a Diffusion Model by Swapping Its Tokens
[9:25]
PixelDiT: Pixel Diffusion Transformers for Image Generation
[9:37]
SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
[9:50]
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
[10:02]
Streaming Diffusion Model for Fast Infrared and Visible Video Fusion
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
[9:12]
FINER: MLLMs Hallucinate under Fine-grained Negative Queries
[9:25]
MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
[9:37]
PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
[9:50]
PAVAS: Physics-Aware Video-to-Audio Synthesis
[10:02]
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation
[9:12]
CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling
[9:25]
GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
[9:37]
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
[9:50]
S^2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds
[10:02]
Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
3D-LATTE: Latent Space 3D Editing from Textual Instructions
[9:12]
AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows
[9:25]
ChordEdit: One-Step Low-Energy Transport for Image Editing
[9:37]
Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface
[9:50]
Native and Compact Structured Latents for 3D Generation
[10:02]
SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
(ends 10:15 AM)
10:15 a.m.
10:30 a.m.
11:15 a.m.
11:45 a.m.
Posters 11:45-1:45
ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
Progressive Guessing to Fixed Point: Rethinking Human Motion Prediction with Deep Equilibrium Models
From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
Distilling Quasi-Conformal Mapping: A Generalizable and Efficient Solution for Wide-Angle Correction
(ends 1:45 PM)
(ends 6:00 PM)
(ends 12:15 PM)
1:45 p.m.
(ends 2:45 PM)
2 p.m.
Orals 2:00-3:15
[2:00]
INSID3: Training-Free In-Context Segmentation with DINOv3
[2:12]
MARCO: Navigating the Unseen Space of Semantic Correspondence
[2:25]
PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
[2:37]
R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
[2:50]
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
[3:02]
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
(ends 3:15 PM)
Orals 2:00-3:15
[2:00]
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
[2:12]
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
[2:25]
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
[2:37]
Linear Fundamental Matrix Estimation from 7 or 5 Points
[2:50]
OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
[3:02]
VGGT-Ω
(ends 3:15 PM)
Orals 2:00-3:15
[2:00]
CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
[2:12]
NitroGen: An Open Foundation Model for Generalist Gaming Agents
[2:25]
PAI-Bench: A Comprehensive Benchmark For Physical AI
[2:37]
RefAV: Towards Planning-Centric Scenario Mining
[2:50]
SoccerMaster: A Vision Foundation Model for Soccer Understanding
[3:02]
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
(ends 3:15 PM)
Orals 2:00-3:15
[2:00]
Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
[2:12]
GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
[2:25]
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
[2:37]
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
[2:50]
Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation
[3:02]
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
(ends 3:15 PM)
3:15 p.m.
4:15 p.m.
4:45 p.m.
Posters 4:45-6:45
Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Let VLMs Grade Their Own Thoughts: A Self-Quantification Approach to Reasoning-Aware Reward Modeling
Generalizable Structure-Aware Keypoint Correspondence for Category-Unified 3D Single Object Tracking
Dual-Level Hypergraph Generation for Addressing Feature Scarcity in Whole-Slide Image Classification
(ends 6:45 PM)
5 p.m.
(ends 5:30 PM)
7 p.m.
SUN 7 JUN
9 a.m.
Orals 9:00-10:15
[9:00]
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
[9:12]
Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
[9:25]
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
[9:37]
SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
[9:50]
Structural Action Transformer for 3D Dexterous Manipulation
[10:02]
TESO: Online Tracking of Essential Matrix by Stochastic Optimization
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
Evidential Neural Radiance Fields
[9:12]
Global-Aware Edge Prioritization for Pose Graph Initialization
[9:25]
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
[9:37]
Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics
[9:50]
SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
[10:02]
U^2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
AToken: A Unified Tokenizer for Vision
[9:12]
Confusion-Aware Spectral Regularizer for Long-Tailed Recognition
[9:25]
Learning Latent Concepts for Detecting Out-of-Distribution Objects
[9:37]
Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
[9:50]
Understanding and Enforcing Weight Disentanglement in Task Arithmetic
[10:02]
Understanding Task Transfer in Vision-Language Models
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
[9:12]
ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications
[9:25]
OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
[9:37]
OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
[9:50]
POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling
[10:02]
Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
(ends 10:15 AM)
10:15 a.m.
10:30 a.m.
Keynote:
Thomas Serre
(ends 11:30 AM)
11:15 a.m.
11:45 a.m.
(ends 3:00 PM)
Posters 11:45-1:45
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes
CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model
Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation
(ends 1:45 PM)
(ends 12:15 PM)
2 p.m.
Orals 2:00-3:15
[2:00]
Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
[2:15]
FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
[2:30]
SimScale: Learning to Drive via Real-World Simulation at Scale
[2:45]
Texvent: Asynchronous Event Data Simulation via Text Prompt
[3:00]
WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
(ends 3:15 PM)
Orals 2:00-3:15
[2:00]
CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation
[2:15]
Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
[2:30]
SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks
[2:45]
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
[3:00]
Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
(ends 3:15 PM)
Orals 2:00-3:15
[2:00]
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
[2:12]
DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging
[2:25]
Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation
[2:37]
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
[2:50]
Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
[3:02]
SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation
(ends 3:15 PM)
Orals 2:00-3:15
[2:00]
Differentiable Laplacian Matrix Guided Superpixel Segmentation
[2:15]
FILTR: Extracting Topological Features from Pretrained 3D Models
[2:30]
Learning Convex Decomposition via Feature Fields
[2:45]
Learning Eigenstructures of Unstructured Data Manifolds
[3:00]
Mapping Networks
(ends 3:15 PM)
3 p.m.
3:15 p.m.
3:30 p.m.
Posters 3:30-5:30
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning
VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
Beyond Explicit Language: Plug-and-Play Visual-to-Linguistic Modeling Toward General Object Tracking
LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
(ends 5:30 PM)