CVPR 2026 Events with Videos
Live Streams and Virtual Content
During the main conference, the following four rooms will be live streamed. The same stream is active for the entire day. You can also find specific events listed by type (Keynotes, Oral Sessions, etc.) below. The streams are available either from the table or from the event pages.
Video recordings and live streams for all events, including workshops and tutorials, will appear below as soon as they become available. Live streams are available during the event, and recordings of the live streams will be posted to the website within about 48 hours. Workshop and tutorial recordings will be posted in the week following the conference.
Keynotes
Meetings
Oral Sessions
- Oral Session 1A: Multimodal Vision
- Oral Session 1B: Visual Security
- Oral Session 1C: Efficient Reasoning
- Oral Session 1D: Computational Imaging
- Oral Session 2A: 3D Reconstruction
- Oral Session 2B: Materials & Lighting
- Oral Session 2C: Gaussian Splatting & Reconstruction
- Oral Session 2D: Spatio-Temporal Reconstruction
- Oral Session 3A: Generative Diffusion Modeling
- Oral Session 3B: Spatial Understanding
- Oral Session 3C: Generative Editing
- Oral Session 3D: Multimodal Modeling
- Oral Session 4A: Geometric Understanding
- Oral Session 4B: Embodied & Agentic Intelligence
- Oral Session 4C: Spatial Reasoning
- Oral Session 4D: Visual Segmentation
- Oral Session 5A: Dynamic Perception
- Oral Session 5B: Generalization and Adaptation
- Oral Session 5C: Geometry and Robotics
- Oral Session 5D: Human-Centric Modeling & Lighting
- Oral Session 6A: Geometric Learning
- Oral Session 6B: Multimodal Reasoning
- Oral Session 6C: Medical Vision
- Oral Session 6D: Large-Scale Neural Modeling
Posters
- VideoMaMa: Mask-Guided Video Matting via Generative Prior
- Scalable Feature Matching via State Space Modeling and Sparse Correlation
- MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
- UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
- Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
- NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
- VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
- CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
- Does YOLO Really Need to See Every Training Image in Every Epoch?
- Data-Centric Meta-Learning for Robust Few-Shot Generalization
- FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
- IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
- Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
- LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
- Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
- Momentum Memory for Knowledge Distillation in Computational Pathology
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
- Fast SceneScript: Fast and Accurate Language‑Based 3D Scene Understanding via Multi‑Token Prediction
- Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
- SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
- CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
- S^2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
- Reinforcing Video Reasoning Segmentation to Think Before It Segments
- MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
- A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs
- HandX: Scaling Bimanual Motion and Interaction Generation
- OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
- GenMatter: Perceiving Physical Objects with Generative Matter Models
- Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
- Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
- Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
- NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
- VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
- Refaçade: Editing Object with Given Reference Texture
- GFRRN: Explore the Gaps in Single Image Reflection Removal
- The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
- MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision
- Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval
- Globscope: Toward a Global View of the Loss Landscape
- PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
- Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
- Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
- HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
- DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
- Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images
- JRM: Joint Reconstruction Model for Multiple Objects without Alignment
- Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
- LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
- Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
- Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
- FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
- MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
- DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
- Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
- Recovering Physically Plausible Human-Object Interactions from Monocular Videos
- VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
- Act2See: Emergent Active Visual Perception for Video Reasoning
- DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors
- Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
- HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
- Long-Tail Internet Photo Reconstruction
- Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
- Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
- Learning Multi-View Spatial Reasoning from Cross-View Relations
- DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
- OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
- 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
- Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
- Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
- Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
- Functional Mean Flow in Hilbert Space
- SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
- Consistent Instance Field for Dynamic Scene Understanding
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
- Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
- HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
- Source Models Leak What They Shouldn’t: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
- Learning to Act Robustly with View-Invariant Latent Actions
- Refracting Reality: Generating Images with Realistic Transparent Objects
- Agile Deliberation: Concept Deliberation for Subjective Visual Classification
- Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
- Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
- Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
- BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery
- SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
- Heterogeneous Decentralized Diffusion Models
- Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
- GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
- β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
- Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
- RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
- FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
- Vision-Speech Models: Teaching Speech Models to Converse about Images
- EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
- PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
- S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
- Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
- A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images
- MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
- World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
- ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
- MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
- Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
- SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
- CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
- Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
- SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
- Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
- Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
- Bidirectional Normalizing Flow: From Data to Noise and Back
- Bootstrapping Multi-view Learning for Test-time Noisy Correspondence
- The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
- TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
- IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
- iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
- DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
- Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
- rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
- HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
- UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
- Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
- Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
- From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
- SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
- Region-Adaptive Sampling for Diffusion Transformers
- How Much 3D Do Video Foundation Models Encode?
- Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection
- Transition Matching Distillation for Fast Video Generation
- Correspondence-Attention Alignment for Multi-View Diffusion Models
- FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
- Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
- VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
- A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
- Physical Simulator In-the-Loop Video Generation
- FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
- ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
- InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
- AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars
- Exploring Spatial Intelligence from a Generative Perspective
- ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
- Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
- Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
- Advancing Image Classification with Discrete Diffusion Classification Modeling
- SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
- Bidirectional Query-Driven Generation of Parametric CAD Sketch
- Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
- FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
- Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
- Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
- No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
- Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
- Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
- Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
- EMMA: Extracting Multiple physical parameters from Multimodal Data
- HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
- Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
- EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
- PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
- AnyPcc: Compressing Any Point Cloud with a Single Universal Model
- OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
- Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
- ProjFlow: Projection Sampling with Flow Matching for Zero‑Shot Exact Spatial Motion Control
- Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
- Event-based Motion Deblurring with Unpaired Data
- Concept-Aware Batch Sampling Improves Language-Image Pretraining
- Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
- RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection
- VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
- A Training-Free Style-Personalization via SVD-Based Feature Decomposition
- Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
- Extend3D: Town-Scale 3D Generation
- WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
- Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
- Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
- QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
- TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
- Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
- The Midas Touch for Metric Depth
- Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
- X-WIN: Building Chest Radiograph World Model via Predictive Sensing
- Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
- AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
- Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
- TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
- Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
- Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
- Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
- Chain of World: World Model Thinking in Latent Motion
- FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
- Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
- BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
- MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
- Neural Distribution Prior for LiDAR Out-of-Distribution Detection
- Affordance-First Decomposition for Continual Learning in Video–Language Understanding
- MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
- RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
- Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
- STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
- ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
- Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
- X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
- The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
- Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
- RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
- PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
- UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
- Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation
- MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
- Scalable Trajectory Generation for Whole-Body Mobile Manipulation
- Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction
- Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
- PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
- EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
- View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
- GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models
- Not All Birds Look The Same: Identity-Preserving Generation For Birds
- One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
- Are Image-to-Video Models Good Zero-Shot Image Editors?
- CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
- Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
- Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
- Harnessing the Power of Foundation Models for Accurate Material Classification
- DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
- ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
- PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
- RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
- Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
- MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
- Lifting Unlabeled Internet-level Data for 3D Scene Understanding
- Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation
- LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
- C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
- ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
- Language-Free Generative Editing from One Visual Example
- Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection
- Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
- Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
- Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
- Understanding Counting Mechanisms in Large Language and Vision-Language Models
- Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
- Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision
- Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
- Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
- fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
- Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning
- From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
- REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
- Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
- Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
- When to Think and When to Look: Uncertainty-Guided Lookback
- Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
- MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
- Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification
- Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
- Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
- GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
- DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
- 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
- VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
- Learning Personalized Photographic Style from Pairwise User Preferences
- MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
- ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
- Hierarchical Action Learning for Weakly-Supervised Action Segmentation
- Geometric Neural Distance Fields for Learning Human Motion Priors
- RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion
- Rethinking Dataset Distillation: Hard Truths about Soft Labels
- Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances
- Efficient Weighted Sampling via Score-based Generative Models
- Mind the Gap: Transferring Labels to Align Object Detection Datasets
- NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
- StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
- MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
- EmoStyle: Emotion-Driven Image Stylization
- FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
- Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
- AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
- NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
- LNEM: Lunar Neural Elevation Model
- Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
- LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
- EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
- LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
- DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
- From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
- DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
- MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
- EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
- Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
- MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
- E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
- Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
- X-band Radar Non-Line-of-Sight Imaging
- Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
- CompBench: Benchmarking Complex Instruction-guided Image Editing
- Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
- TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
- Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
- LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
- Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data
- Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
- TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
- DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
- Beyond Depth: Evaluating the Width-centric Reasoning Capability of MLLMs
- FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
- Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
- Inferring Compositional 4D Scenes without Ever Seeing One
- DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning
- SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
- PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
- ORBIT: Benchmarking SfM in the Wild with 360° Video
- MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
- Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
- LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
- Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
- Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding
- VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
- PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
- Multi-Scale Local Speculative Decoding for Image Generation
- Decoupled Generative Modeling for Human-Object Interaction Synthesis
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
- Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
- Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
- AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
- DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
- VISTA: A Test-Time Self-Improving Video Generation Agent
- GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching
- Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
- ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS
- Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation
- The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
- ViT^3: Unlocking Test-Time Training in Vision
- Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
- Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
- Global Underwater Geolocation from Time-Lapse Polarization Imagery
- VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
- Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
- Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
- InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
- Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
- PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
- FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
- OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
- GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
- Envisioning the Future, One Step at a Time
- Global Information Thresholding for Sufficient and Necessary Circuits
- QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
- Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
- Physical Object Understanding with a Physically Controllable World Model
- RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
- A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
- Label-Free Cross-Task LoRA Merging with Null-Space Compression
- MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
- UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
- Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
- AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models
- VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
- Reflection Separation from a Single Image via Joint Latent Diffusion
- AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
- Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
- PhysHead: Simulation-Ready Gaussian Head Avatars
- Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
- Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
- Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking
- UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs
- MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
- HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
- HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
- GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
- StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
- VL-RouterBench: A Benchmark for Vision–Language Model Routing
- DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
- STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
- When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
- Gyro-based Deep Video Deblurring
- DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
- Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
- E^2-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
- MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
- Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
- VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
- MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
- COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
- GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
- Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
- RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
- Splatent: Splatting Diffusion Latents for Novel View Synthesis
- GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
- Affostruction: 3D Affordance Grounding with Generative Reconstruction
- TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models
- PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
- Variational Graph-based Normal Integration
- ID-Sim: An Identity-Focused Similarity Metric
- AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
- SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
- Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
- SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
- Affine Perspective-Three-Point Problem
- SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
- PGA: Prior-free Generative Attack for Practical No-box Scenario
- DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
- Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
- GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
- Breaking Multimodal LLM Safety via Video-Driven Prompting
- PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
- Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
- Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
- Lipschitz Optimization for Formal Verification of Homographies
- Make it SING: Analyzing Semantic Invariants in Classifiers
- SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
- DRM: Diffusion-based Reward Model With Step-wise Guidance
- VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
- TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
- DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
- Continual Distillation of Teachers from Different Domains
- Rethinking Occlusion Modeling for UAV Tracking
- Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
- Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
- Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
- Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
- CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
- Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
- Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
- PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
- Visual Grounding for Object Questions
- DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
- TopoSlide: Topologically-Informed Histopathology Whole Slide Image Representation Learning
- Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
- Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
- All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
- HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
- Building a Precise Video Language with Human–AI Oversight
- TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation
- Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
- Illumination-Consistent Human-Scene Reconstruction from Monocular Video
- Residual Primitive Fitting of 3D Shapes with SuperFrusta
- ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
- Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
- OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
- Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
- HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation
- MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
- E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
- Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
- Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
- Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
- Scaling Dense Event-Stream Pretraining from Visual Foundation Models
- GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
- Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
- Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
- Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
- S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
- ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
- NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
- GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
- Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
- SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
- RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
- A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
- Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
- Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment
- PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
- Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
- Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
- Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
- COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification
- Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
- ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
- ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
- BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
- LVLM-Aided Alignment of Task-Specific Vision Models
- Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification
- DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
- COT-FM: Cluster-wise Optimal Transport Flow Matching
- InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
- FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
- Rectifying Latent Space for Generative Single-Image Reflection Removal
- STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
- Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
- Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
- MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
- HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
- Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
- VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
- SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
- STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
- Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
- ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
- Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
- OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
- SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
- UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
- PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and VLM-Guided Optimization
- Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
- SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
- Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
- UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
- Towards Sparse Video Understanding and Reasoning
- Contact-Aware Neural Dynamics
- AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
- EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
- Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
- VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
- The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
- GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
- Occluded Human Body Capture with Frequency Domain Denoising Prior
- LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
- From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
- The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
- RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
- PhyGaP: Physically-Grounded Gaussians with Polarization Cues
- Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
- MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
- Let Your Image Move with Your Motion! – Implicit Multi-Object Multi-Motion Transfer
- MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
- AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network
- Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
- Yume1.5: A Text-Controlled Interactive World Generation Model
- ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
- Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
- MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
- R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment
- Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
- Anti-Degradation Lifelong Multi-View Clustering
- DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
- Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
- Common Inpainted Objects In-N-Out of Context
- UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
- Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
- Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
- Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement
- Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
- Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
- Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
- TANGO: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
- Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
- EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
- Edit-aware RAW reconstruction
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
- OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
- HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
- MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
- SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
- Bridging Domains through Subspace-Aware Model Merging
- Reinforcing Structured Chain-of-Thought for Video Understanding
- OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
- CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
- Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
- Representing 3D Faces with Learnable B-Spline Volumes
- HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
- SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
- CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation
- EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
- CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
- BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
- Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
- GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
- Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
- Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
- Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
- PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
- Globally Optimal Pose from Orthographic Silhouettes
- Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
- LAMP: Language-Assisted Motion Planning for Controllable Video Generation
- Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
- Lynx: Towards High-Fidelity Personalized Video Generation
- Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
- Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
- RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
- Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
- Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
- MeshRipple: Structured Autoregressive Generation of Artist-Meshes
- Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
- SAM 3D: 3Dfy Anything in Images
- AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
- E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation
- Learnability-Driven Submodular Optimization for Active Roadside 3D Detection
- Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
- MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
- Towards Generalized Multimodal Homography Estimation
- MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
- UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
- EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
- Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
- UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
- Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
- LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
- Z-Order Transformer for Feed-Forward Gaussian Splatting
- SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
- CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
- MAMMA: Markerless Accurate Multi-person Motion Acquisition
- SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
- NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
- Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
- Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
- CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
- TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR
- Scene-Centric Unsupervised Video Panoptic Segmentation
- CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
- FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement
- Translating Signals to Languages for sEMG-Based Activity Recognition
- Vinedresser3D: Towards Agentic Text-guided 3D Editing
- Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
- TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
- Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
- Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
- What Matters in Practical Learned Image Compression
- Dynamic Momentum Recalibration in Online Gradient Learning
- The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection
- PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
- LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
- Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
- PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
- Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
- Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
- Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
- UniCorrn: Unified Correspondence Transformer Across 2D and 3D
- An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
- Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
- TruckDrive: Long-Range Autonomous Highway Driving Dataset
- BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
- PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
- First Frame Is the Place to Go for Video Content Customization
- SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
- Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
- OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
- Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
- Optical Diffraction-based Convolution for Semiconductor Lithography
- Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
- Residual Diffusion Bridge Model for Image Restoration
- What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F1
- NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
- R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
- Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
- Prompt-Free Universal Region Proposal Network
- OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
- WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks
- Frequency-domain Manipulation for Face Obfuscation
- POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse
- Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
- Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence
- HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification
- SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
- D-Prism: Differentiable Primitives for Structured Dynamic Modeling
- DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
- Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
- Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
- Paparazzo: Active Mapping of Moving 3D Objects
- An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
- FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
- Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos
- HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
- WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering
- Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
- Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
- Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
- GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
- Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
- Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
- UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
- PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
- TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
- InterRVOS: Interaction-Aware Referring Video Object Segmentation
- Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
- TrackMAE: Video Representation Learning via Track Mask and Predict
- BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
- CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
- RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection
- Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
- Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
- Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
- Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
- SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
- G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
- VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
- MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
- Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching
- A³: Towards Advertising Aesthetic Assessment
- PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
- OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing
- CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization
- P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
- Ego: Embedding-Guided Personalization of Vision-Language Models
- Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
- Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
- Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
- CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
- MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior
- D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
- META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
- PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
- Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
- Learning What Helps: Task-Aligned Context Selection for Vision Tasks
- APPO: Attention-guided Perception Policy Optimization for Video Reasoning
- ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
- SPDMark: Selective Parameter Displacement for Robust Video Watermarking
- ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
- CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
- ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
- 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
- What Are You Doing? A Closer Look at Controllable Human Video Generation
- ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
- STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
- Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
- MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
- Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
- GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy
- Radiance Meshes for Volumetric Reconstruction
- Cross-Hand Latent Representation for Vision-Language-Action Models
- Disco-GS: Gaussian Splatting in Dynamic Color Lighting
- Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
- ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
- Geometry-Guided 3D Visual Token Pruning for Video-Language Models
- From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
- Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
- Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
- Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
- WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing
- Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
- Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
- Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
- RAID: Retrieval-Augmented Anomaly Detection
- Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
- Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling
- FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
- TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
- Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
- Dynamic Exposure Burst Image Restoration
- Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
- Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
- M^3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
- Any4D: Unified Feed-Forward Metric 4D Reconstruction
- DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
- Kaleidoscopic Scintillation Event Imaging
- Self-Consistency for LLM-Based Motion Trajectory Generation and Verification
- CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding
- CARD: Correlation Aware Restoration with Diffusion
- Event6D: Event-based Novel Object 6D Pose Tracking
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
- SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
- OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
- YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection
- Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
- HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
- A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors
- Learning complete and explainable visual representations from itemized text supervision
- Foundry: Distilling 3D Foundation Models for the Edge
- UniSER: A Foundation Model for Unified Soft Effects Removal
- Condensed Test-Time Adaptation of VLMs for Action Recognition
- GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
- L^2DGS: Low-Light Dynamic Gaussian Splatting
- Grid Distillation: Compositional Image Distillation via Structured Generative Grids
- ExpPortrait: Expressive Portrait Generation via Personalized Representation
- Delta Rectified Flow Sampling for Text-to-Image Editing
- When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
- Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
- PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
- Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
- Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
- Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
- AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
- Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
- Best Segmentation Buddies for Image-Shape Correspondence
- FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
- Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
- Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
- CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
- LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
- Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
- Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
- ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
- Beyond Caption-Based Queries in Video Moment Retrieval
- An Empirical Study on How Video-LLMs Answer Video Questions
- MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
- Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
- Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
- Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
- HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
- From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
- SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
- Linear Image Generation by Synthesizing Exposure Brackets
- SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
- CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
- VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment
- Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
- Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
- FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
- M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
- VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
- WPT: World-to-Policy Transfer via Online World Model Distillation
- Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
- PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
- From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
- SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images
- LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
- WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
- Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
- Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
- Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
- ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
- Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
- CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
- Human Interaction-Aware 3D Reconstruction from a Single Image
- Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
- Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
- Suppressing Non-Semantic Noise in Masked Image Modeling Representations
- Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
- StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
- Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
- AE2VID: Event-based Video Reconstruction via Aperture Modulation
- FloVerse: Floor Plan-Guided Multi-Modal Navigation
- Next-Scale Autoregressive Models for Text-to-Motion Generation
- PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
- Recurrent Video Masked Autoencoders
- HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter
- PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
- Domain-Skewed Federated Learning with Feature Decoupling and Calibration
- Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
- Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
- Synthesizing Visual Concepts as Vision-Language Programs
- CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
- WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
- Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
- Voxify3D: Pixel Art Meets Volumetric Rendering
- OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
- Explaining Object Detectors via Collective Contribution of Pixels
- Foundation Encoders Are All You Need for Preference-Aware Personalization
- CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
- RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
- Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
- BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
- DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
- Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
- ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
- CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
- SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
- Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis
- A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation
- Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
- Same or Not? Enhancing Visual Perception in Vision-Language Models
- Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach
- Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
- PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
- FrankenMotion: Part-level Human Motion Generation and Composition
- FE2E: From Editor to Dense Geometry Estimator
- Lighting in Motion: Spatiotemporal HDR Lighting Estimation
- Aligning Text, Images and 3D Structure Token-by-Token
- Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
- Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution
- FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
- RewardFlow: Generate Images by Optimizing What You Reward
- AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
- CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
- Coded-E2LF: Coded Aperture Light Field Imaging from Events
- Bridging Facial Understanding and Animation via Language Models
- Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
- OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
- Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
- StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
- Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
- First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
- UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
- Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
- InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
- Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
- Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
- EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
- RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
- OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
- Visual Personalization Turing Test
- BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
- Parallelised Differentiable Straightest Geodesics for 3D Meshes
- PersonaLive! Expressive Portrait Image Animation for Live Streaming
- CLIP-like Model as a Foundational Density Ratio Estimator
- SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
- Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
- Inference-time Physics Alignment of Video Generative Models with Latent World Models
- JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
- Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression
- DeDelayed: Deleting Remote Inference Delay via On-Device Correction
- SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
- ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
- CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
- Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
- ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
- Towards Multimodal Domain Generalization with Few Labels
- Coverage Optimization for Camera View Selection
- FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
- NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
- PersonaVLM: Long-Term Personalized Multimodal LLMs
- gQIR: Generative Quanta Image Reconstruction
- SMVRT: Implicit Human 3D Modeling Using Sparse Multi-View Volumetric Reconstruction with Transformer Fusion
- NIL: No-data Imitation Learning
- QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
- Few-for-Many Personalized Federated Learning
- A More Word-like Image Tokenization for MLLMs
- Image-Guided Geometric Stylization of 3D Meshes
- Boosting Reasoning in Large Multimodal Models via Activation Replay
- Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
- HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
- No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
- Robust Spiking Neural Networks by Temporal Mutual Information
- V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
- Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
- SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
- Semantic Scale Space: A Framework for Controllable Image Abstraction
- ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
- Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
- QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification
- Image Generation from Contextually-Contradictory Prompts
- Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
- Spike-driven Discrete Aggregation for Event-based Object Detection
- Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
- Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
- From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
- VQ-VA World: Towards High-Quality Visual Question-Visual Answering
- RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
- FMPose3D: monocular 3D pose estimation via flow matching
- Eulerian Gaussian Splatting using Hashed Probability Pyramids
- DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
- GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
- Learning to Solve PDEs on Neural Shape Representations
- Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
- When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
- Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
- Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
- Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
- Few-shot Acoustic Synthesis with Multimodal Flow Matching
- HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
- RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
- Designing to Forget: Deep Semi-parametric Models for Unlearning
- UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
- Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
- AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
- AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
- Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds
- Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
- Differentially Private 2D Human Pose Estimation
- Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
- OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe
- 2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
- Fine-Grained Multi Image Object Hallucination Benchmark
- Adaptive Confidence Regularization for Multimodal Failure Detection
- A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
- KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
- A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
- AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
- VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
- EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
- Endless World: Real-Time 3D-Aware Long Video Generation
- Volumetric Functional Maps
- TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
- Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
- LoST: Level of Semantics Tokenization for 3D Shapes
- TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
- Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
- Gaussian Mapping for Evolving Scenes
- Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
- GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
- MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
- MV-TAP: Tracking Any Point in Multi-View Videos
- InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
- Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
- CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
- ChordEdit: One-Step Low-Energy Transport for Image Editing
- InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
- SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
- GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
- PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection
- Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
- Lafite: A Generative Latent Field for 3D Native Texturing
- Point Cloud as a Foreign Language for Multi-modal Large Language Model
- Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
- Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
- Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
- Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
- EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
- GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
- LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
- Distilling Balanced Knowledge from a Biased Teacher
- Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
- TextFM: Robust Semi-dense Feature Matching with Language Guidance
- Obstruction Reasoning for Robotic Grasping
- Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
- Improving Adversarial Transferability with Local Perturbation Augmentation
- VOSR: A Vision-Only Generative Model for Image Super-Resolution
- MAD: Motion Appearance Decoupling for efficient Driving World Models
- MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
- From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
- Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
- TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
- SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
- Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
- Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
- Generative Video Motion Editing with 3D Point Tracks
- CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
- RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
- Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
- 3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors
- PAVAS: Physics-Aware Video-to-Audio Synthesis
- A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking
- Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
- AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
- FPSBench: A Benchmark for Video Understanding at High Frame Rates
- PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
- Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
- DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
- HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph
- Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data
- V-DPM: 4D Video Reconstruction with Dynamic Point Maps
- 3D-LATTE: Latent Space 3D Editing from Textual Instructions
- EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
- Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
- Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
- Composing Concepts from Images and Videos via Concept-prompt Binding
- DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
- MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
- Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
- Self-Corrected Image Generation with Explainable Latent Rewards
- Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
- URScenes: A Multi-scenario Dataset for Unstructured Road Environments
- When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
- Modeling the Visual Ambiguity of Human Sketches
- PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose
- Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
- SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
- Precise Object and Effect Removal with Adaptive Target-Aware Attention
- AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
- AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
- Block-based Learned Image Compression without Blocking Artifacts
- MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
- Tunable Soft Equivariance with Guarantees
- Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
- REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
- MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
- SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
- Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
- PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
- Geo2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
- HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
- LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
- S^2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
- VENI: Variational Encoder for Natural Illumination
- RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
- VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
- SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
- Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning
- Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All
- UIKA: Fast Universal Head Avatar from Pose-Free Images
- Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs
- Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
- Scene Reconstruction as Mapping Priors for 3D Detection
- SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
- Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
- TokenLight: Precise Lighting Control in Images using Attribute Tokens
- EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
- History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
- MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
- Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
- Weight Space Representation Learning via Neural Field Adaptation
- Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
- DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
- ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy
- Low-Resolution Editing is All You Need for High-Resolution Editing
- Draft and Refine with Visual Experts
- Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
- Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
- Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
- Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
- Mario: Multimodal Graph Reasoning with Large Language Models
- Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
- Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification
- CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation
- Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective
- Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
- Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
- ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
- Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
- Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
- Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
- SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
- TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
- Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
- Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
- PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
- Grounded Latents for Entity-Centric 4D Scene Generation
- InternVideo-Next: Towards World-Understanding Video Models
- MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction
- DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease
- Deep Feature Deformation Weights
- MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
- MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
- Temporal Inversion for Learning Interval Change in Chest X-Rays
- Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
- Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions
- GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
- TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
- MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
- Defending Unauthorized Model Merging via Dual-Stage Weight Protection
- Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
- DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
- FARMER: Flow AutoRegressive Transformer over Pixels
- Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
- AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
- Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
- Stake the Points: Structure-Faithful Instance Unlearning
- VecGlypher: Unified Vector Glyph Generation with Language Models
- GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
- Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
- Dual Ascent Diffusion for Inverse Problems
- PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
- Zoo3D: Zero-Shot 3D Object Detection at Scene Level
- Generative Diffusion Priors for 3D Mapping of the Dark Universe
- FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
- Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
- Multi-speaker Attention Alignment for Multimodal Social Interaction
- LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
- MUFASA: A Multi-Layer Framework for Slot Attention
- R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
- Structural Graph Probing of Vision–Language Models
- VABench: A Comprehensive Benchmark for Audio-Video Generation
- Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
- Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
- FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
- SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
- Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
- SVAgent: Storyline-guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
- Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
- PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
- VGGT-Ω
- FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
- Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
- INSID3: Training-Free In-Context Segmentation with DINOv3
- IGen: Scalable Data Generation for Robot Learning from Open-World Images
- Exemplar-Free Continual Learning for State Space Models
- EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
- RefAV: Towards Planning-Centric Scenario Mining
- FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
- Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
- FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
- ReLaGS: Relational Language Gaussian Splatting
- OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
- WorldGen: From Text to Traversable and Interactive 3D Worlds
- InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
- DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
- GS-ASM: 2DGS-Supervised Active Stereo Matching
- Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
- Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
- Generalizable Video Quality Assessment via Weak-to-Strong Learning
- Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning
- InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
- MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
- Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
- OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
- FEAT: Fashion Editing and Try-On from Any Design
- RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution
- Multimodal Distribution Matching for Vision-Language Dataset Distillation
- Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
- DC-Merge: Improving Model Merging with Directional Consistency
- Coupling Liquid Time-Constant Encoders with Modern Hopfield Memory
- HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
- Feed-forward Gaussian Registration for Head Avatar Creation and Editing
- BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
- YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
- PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
- InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
- BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
- ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
- Rethinking Token Reduction for Large Vision-Language Models
- Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
- Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models
- Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
- Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
- HTTM: Head-wise Temporal Token Merging for Faster VGGT
- Relightful Video Portrait Harmonization
- Self-Diffusion Driven Blind Imaging
- SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
- Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
- AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
- VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
- SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
- TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis
- Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition
- MuM: Multi-View Masked Image Modeling for 3D Vision
- Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
- ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
- Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
- TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
- GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
- PAI-Bench: A Comprehensive Benchmark For Physical AI
- DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
- Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
- Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
- LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
- Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
- LitePT: Lighter Yet Stronger Point Transformer
- MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
- Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
- MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
- Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
- Pixel Motion Diffusion is What We Need for Robot Control
- Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
- Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
- TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
- AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
- PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion
- PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
- NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code
- RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
- ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
- SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
- Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty
- GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection
- Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
- One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
- OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
- Thinking in 360°: Humanoid Visual Search in the Wild
- CaptionQA: Is Your Caption as Useful as the Image Itself?
- Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
- SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
- SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data
- PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
- A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
- CREward: A Type-Specific Creativity Reward Model
- TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
- Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
- MARCO: Navigating the Unseen Space of Semantic Correspondence
- PECCVAI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
- NEAF: Natural Image Editing with Attention Fusion for Generalizable Test-time Optimization in Text-Guided Image Editing
- Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
- HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
- FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping
- Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
- Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
- Dynamic Token Reweighting for Robust Vision-Language Models
- When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
- FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
- SG-LoRA: Semantic-guided LoRA Parameters Generation
- MoVie: Broaden Your Views with Human Motion for Action Detection
- Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs
- DiffBMP: Differentiable Rendering with Bitmap Primitives
- Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
- GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
- Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
- OccAny: Generalized Unconstrained Urban 3D Occupancy
- GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
- Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
- AirSim360: A Panoramic Simulation Platform within Drone View
- Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
- Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
- Global Structure-from-Motion Meets Feedforward Reconstruction
- AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
- Task-Driven Implicit Representations for Automated Design of LiDAR Systems
- Text-Driven 3D Hand Motion Generation from Sign Language Data
- Captain Safari: A World Engine with Pose-Aligned 3D Memory
- Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
- CHEEM: Continual Learning by Reuse, New, Adapt and Skip - A Hierarchical Exploration-Exploitation Approach
- LumiX: Structured and Coherent Text-to-Intrinsic Generation
- 4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
- Beyond the Static World: Continual Category Discovery under Visual Drift
- SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
- MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
- How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
- Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
- CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction
- FUN REC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
- CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
- TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
- HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
- PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
- IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
- Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
- MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
- StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
- ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
- SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
- Soft Modality-Guided Expert Specialization in MoE-VLMs
- A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
- Enhancing Spatial Understanding in Image Generation via Reward Modeling
- SoccerMaster: A Vision Foundation Model for Soccer Understanding
- 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
- Perceptual 3D Simulation With Physical World Modeling
- SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
- Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
- Portable Active Learning for Object Detection
- Random Wins All: Rethinking Grouping Strategies for Vision Tokens
- Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
- CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
- LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
- WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
- Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
- Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
- DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
- Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
- Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
- ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
- Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
- Learning 3D Shape Fidelity Metric from Real-world Distortions
- Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
- HoneyBee: Data Recipes for Vision-Language Reasoners
- Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
- V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
- Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
- Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
- LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
- PE3R: Perception-Efficient 3D Reconstruction
- ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
- Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
- PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
- MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
- AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking
- Geometric-Photometric Event-based 3D Gaussian Ray Tracing
- FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
- Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
- TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
- Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
- Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
- TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
- The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
- PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
- PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
- Dynamic Visual SLAM using a General 3D Prior
- ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
- Unified Number-Free Text-to-Motion Generation Via Flow Matching
- Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
- Particulate: Feed-Forward 3D Object Articulation
- Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
- FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
- Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework
- DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
- GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
- Hierarchical Process Reward Models are Symbolic Vision Learners
- Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
- DuoGen: Towards Autonomous Interleaved Multimodal Generation
- Modeling Cross-vision Synergy for Unified Large Vision Model
- NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
- B^3-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
- High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
- Fully Decentralized Certified Unlearning
- MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
- Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
- Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
- Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis
- UniDAC: Universal Metric Depth Estimation for Any Camera
- Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
- Text-guided Feature Disentanglement for Cross-modal Gait Recognition
- Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
- From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
- Linking Perception, Confidence and Accuracy in MLLMs
- EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
- Verifying Neural Network Robustness with Dual Perturbations
- M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
- GeoSANE: Learning Geospatial Representations from Models, Not Data
- AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
- Sparse-View Localization via Online Neural 3D Regression
- Retrieving Counterfactuals Improves Visual In-Context Learning
- RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
- HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
- Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains
- Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
- H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
- UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
- DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
- PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
- Solvability of the Viewing Graph Under the Affine Camera Model
- Detecting Unknown Objects via Energy-based Separation for Open World Object Detection
- ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
- GazeShift: Unsupervised Gaze Estimation and Dataset for VR
- VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
- Streamlined Knowledge Distillation
- UniVBench: Towards Unified Evaluation for Video Foundation Models
- Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
- SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
- ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
- TIGER: A Unified Framework for Time, Images and Geo-location Retrieval
- PowerCLIP: Powerset Alignment for Contrastive Pre-Training
- ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
- DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation
- Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
- RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
- ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
- EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
- Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics
- AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
- Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
- WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification
- BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
- SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
- EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
- Linear Fundamental Matrix Estimation from 7 or 5 Points
- NitroGen: An Open Foundation Model for Generalist Gaming Agents
- Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels
- RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
- Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
- ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
- FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling
- BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
- Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
- Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
- Mixture of Prototypes for Test-time Adaptive Segmentation
- Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
- VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
- Phrase-grounded APO for Improving Chest X-ray Report Generation
- Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
- Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation
- See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
- MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
- GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
- DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
- Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
- PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
- REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
- 240FPS Stereo Vision from Monocular Mixed Spikes
- When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
- Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion
- Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
- REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
- Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
- On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
- RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
- Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
- Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
- EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
- EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
- 3D-IDE: 3D Implicit Depth Emergent
- Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
- CI-VID: A Coherent Interleaved Text-Video Dataset
- OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
- Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
- Unique Lives, Shared World: Learning from Single-Life Videos
- Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
- PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
- VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
- Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
- EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
- TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
- SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
- Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
- No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
- Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition
- OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
- TSTM: Temporal Segmentation for Task-relevant Mask in Visual Reinforcement Learning Generalization
- VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
- Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
- Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations
- CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
- Stronger Normalization-Free Transformers
- Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
- Enhancing Out-of-Distribution Detection with Extended Logit Normalization
- Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
- Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
- Cycle-Consistent Tuning for Layered Image Decomposition
- Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
- CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
- OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
- Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
- Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
- Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
- ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
- SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
- HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
- Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
- ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
- OS-Fed: One Snapshot Is All You Need
- Latent Diffusion Inversion Requires Understanding the Latent Space
- EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
- Adapting In-context Generation for Enhanced Composed Image Retrieval
- A Bit is All You Need! Efficient Video Capture via Single Bit Imaging
- Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
- Free-Grained Hierarchical Visual Recognition
- VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
- DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
- VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
- GGPT: Geometry-Grounded Point Transformer
- MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
- UniLight: A Unified Representation for Lighting
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
- Spot The Ball: A Benchmark for Visual Social Inference
- EasyV2V: A High-quality Instruction-based Video Editing Framework
- Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
- SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
- PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
- 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
- Video Panels for Long Video Understanding
- SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
- IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
- Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
- ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
- SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
- Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
- Hist2Style: Histogram-Guided Stylization with Bilateral Grids
- MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
- Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
- SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
- Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
- Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration
- PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
- Global-Aware Edge Prioritization for Pose Graph Initialization
- Adapting Lightweight Image-based Counting Models for Video Crowd Counting
- See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
- BAMI: Training-Free Bias Mitigation in GUI Grounding
- A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
- GM-R^2: Generative Matching Learning for Unsupervised Geometric Representation and Registration
- Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
- UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL
- BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
- CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
- 4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
- Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
- TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
- Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
- Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
- HybridDriveVLA: Vision-Language-Action Model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
- IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
- SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
- OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
- FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning
- BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
- Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
- BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
- FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle
- Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
- Efficient and Training-Free Single-Image Diffusion Models
- Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
- Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
- Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
- TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
- VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
- ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments
- SVBench: Evaluation of Video Generation Models on Social Reasoning
- Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
- Scale Space Diffusion
- MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
- Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
- Anomaly-Related Residual Fields for Cross-domain Anomaly Detection
- DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
- SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
- Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
- High-Quality and Efficient Turbulence Mitigation with Events
- FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
- Beyond the Ground Truth: Enhanced Supervision for Image Restoration
- Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models
- Compressed-Domain-Aware Online Video Super-Resolution
- Agentic Retoucher for Text-To-Image Generation
- Content-Aware Dynamic Patchification for Efficient Video Diffusion
- MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
- Understanding Task Transfer in Vision-Language Models
- Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
- Region-Aware Instance Consistency Learning for Micro-Expression Recognition
- Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
- Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
- Dark3R: Learning Structure from Motion in the Dark
- Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
- Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
- Explaining CLIP Zero-shot Predictions Through Concepts
- TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
- LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
- Progressive Multi-cue Alignment for Unaligned RGBT Tracking
- NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
- Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
- Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
- QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
- Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
- PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
- Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
- Align Images Before You Generate
- PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
- BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
- Sampling-Aware Quantization for Diffusion Models
- Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
- Driving on Registers
- PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects
- SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
- Goldilocks Test Sets for Face Verification
- Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
- FG-Portrait: 3D Flow Guided Editable Portrait Animation
- Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights
- AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
- Personalized Federated Training of Diffusion Models with Privacy Guarantees
- FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
- Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
- Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
- Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
- F^2HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
- Mirror Illusion Art
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
- GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
- EXOTIC: External Vision-driven Incomplete Multi-view Classification
- SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
- ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
- Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
- RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
- VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
- Efficient Equivariant Transformer for Self-Driving Agent Modeling
- SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
- Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
- KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
- Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
- WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
- ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
- VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
- Sparse Spectral LoRA: Routed Experts for Medical VLMs
- Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent
- Match-and-Fuse: Consistent Generation from Unstructured Image Sets
- Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
- TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
- Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
- Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
- IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
- Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
- SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
- HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
- Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
- Rethinking Glyph Spatial Information in Font Generation
- Describe Anything Anywhere At Any Moment
- X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
- More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
- GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
- MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
- TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
- GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
- CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
- Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
- DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
- One Algorithm to Align Them All
- LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
- PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
- Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
- Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
- Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs
- VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
- Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
- The Universal Normal Embedding
- Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
- MM-ACT: Learn from Multimodal Parallel Generation to Act
- Decision Boundary-aware Generation for Long-tailed Learning
- ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
- Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
- Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
- Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
- DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation
- Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
- SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
- MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
- Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
- RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
- IPR-1: Interactive Physical Reasoner
- Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
- 2D-LFM: Lifting Foundation Model without 3D Supervision
- No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
- LA-Pose: Latent Action Pretraining Meets Pose Estimation
- CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks
- Finding Distributed Object-Centric Properties in Self-Supervised Transformers
- SineProject: Machine Unlearning for Stable Vision-Language Alignment
- From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
- Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
- Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation
- Order Matters: 3D Shape Generation from Sequential VR Sketches
- FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
- GenTract: Generative Global Tractography
- DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
- Is the Modality Gap a Bug or a Feature? A Robustness Perspective
- VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
- RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
- MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
- MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis
- LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
- ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
- PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
- KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems
- Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
- Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
- ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
- Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
- INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion
- Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters
- Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
- WildPose: A Unified Framework for Robust Pose Estimation in the Wild
- NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
- Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
- A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
- Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
- Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
- Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels
- cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
- TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
- TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
- EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
- Minimal Constraint Relaxation for Multiview Autocalibration
- Vista4D: Video Reshooting with 4D Point Clouds
- CADC: Content Adaptive Diffusion-Based Generative Image Compression
- ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
- Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
- From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
- Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
- NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
- RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
- How to Take a Memorable Picture? Empowering Users with Actionable Feedback
- TESO: Online Tracking of Essential Matrix by Stochastic Optimization
- LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
- SonoWorld: From One Image to a 3D Audio-Visual Scene
- When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks
- QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
- DSO: Direct Steering Optimization for Bias Mitigation
- Parallel Rigidity Matters for Bundle Adjustment
- UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
- Reward Sharpness-Aware Fine-Tuning for Diffusion Models
- Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
- Bridging the Perception Gap in Image Super-Resolution Evaluation
- FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
- InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
- Inter-Photon-Limited Videography
- ReasonX: MLLM-Guided Intrinsic Image Decomposition
- Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
- DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
- C^2FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
- High-Fidelity Mobile Avatars with Pruned Local Blendshapes
- Latent Implicit Visual Reasoning
- MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
- PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
- BiGain: Unified Token Compression for Joint Generation and Classification
- ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer
- SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
- MPL: Match-guided Prototype Learning for Few-shot Action Recognition
- PHAC: Promptable Human Amodal Completion
- Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport
- Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
- Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
- InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
- OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
- InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
- Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
- TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
- Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
- OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
- DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
- High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling
- ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
- Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
- UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
- SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
- Visual Autoregressive Modeling via Next Focus Prediction
- LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
- MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
- AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
- ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
- Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
- MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
- CamDirector: Towards Long-Term Coherent Video Trajectory Editing
- KV-Tracker: Real-Time Pose Tracking with Transformers
- MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation
- WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
- Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
- G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
- RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
- FabricGen: Microstructure-Aware Woven Fabric Generation
- StreamDiT: Real-Time Streaming Text-to-Video Generation
- SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
- Mirai: Autoregressive Visual Generation Needs Foresight
- Mechanisms of Object Localization in Vision–Language Models
- SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
- OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
- Dexterous World Models
- Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
- mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
- Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
- GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
- VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
- Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
- Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
- Guiding Token-Sparse Diffusion Models
- LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
- CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
- Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
- Scene Grounding in the Wild
- GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
- HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
- IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution
- VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
- Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
- Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
- LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection
- MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
- D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
- CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
- Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
- Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
- MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
- Time Blindness: Why Video-Language Models Can’t See What Humans Can?
- When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs
- Evidential Neural Radiance Fields
- Diffusion Mental Averages
- See Through the Noise: Improving Domain Generalization in Gaze Estimation
- Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
- Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
- Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
- Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
- Uni-Hema: Unified Model for Digital Hematopathology
- Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
- Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
- VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
- TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
- HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
- Human Geometry Distribution for 3D Animation Generation
- Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
- Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
- SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
- Interactive Episodic Memory with User Feedback
- Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
- SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
- Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
- MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations
- Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
- Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
- BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models
- LumiMotion: Improving Gaussian Relighting with Scene Dynamics
- ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
- Learning Forgery-Aware Lip Representations Without Forgery Priors
- PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
- Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
- Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
- Landscape-Awareness for Geometric View Diffusion Model
- Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
- InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection
- SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
- Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting
- Hyperbolic Busemann Neural Networks
- GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer
- Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
- Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
- StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
- DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
- Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
- Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
- REACH: Explicit Recovery Behavior for Diffusion Policies
- Guiding Diffusion Models with Semantically Degraded Conditions
- IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
- D2T2 - Multimodal Automated Planning for Brachytherapy
- SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior-Guided Multimodal LLMs
- Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
- 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
- Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
- CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
- Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach
- ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
- tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
- DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
- Benchmarking Endoscopic Surgical Image Restoration and Beyond
- ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
- EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
- SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
- TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
- LoL: Longer than Longer, Scaling Video Generation to Hour
- DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
- MERIT: Multi-domain Efficient RAW Image Translation
- ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
- UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection
- The Drift Kernel: Why Diffusion Models Change Even When Told Not To
- NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
- Emergent Extreme-View Geometry in 3D Foundation Models
- FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
- PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
- M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
- iLRM: An Iterative Large 3D Reconstruction Model
- Unified Camera Positional Encoding for Controlled Video Generation
- Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
- Learning 3D Reconstruction with Priors in Test Time
- ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
- Revisiting Model Stitching In the Foundation Model Era
- When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
- Personalized Image Descriptions from Attention Sequences
- Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
- Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
- UniChange: Unifying Change Detection with Multimodal Large Language Model
- PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
- VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
- FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
- ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
- CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
- CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation
- SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
- DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
- AnthroTAP: Learning Point Tracking with Real-World Motion
- DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
- Ego-Grounding for Personalized Question-Answering in Egocentric Videos
- Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
- VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
- GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
- A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
- FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution
- RenderFlow: Single-Step Neural Rendering via Flow Matching
- Smoothing the Score Function to Enhance Generalization in Diffusion Models
- WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
- PARSE: Part-Aware Relational Spatial Modeling
- Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering
- R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
- A Difference-in-Difference Approach to Detecting AI-Generated Images
- BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment
- Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
- Seeing Motion Through Polarity for Event-based Action Recognition
- Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection
- Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
- Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
- Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
- Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
- StreamReady: Learning What to Answer and When in Long Streaming Videos
- 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
- RFDM: Residual Flow Diffusion Models for Video Editing
- Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
- FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
- MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
- Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation
- TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking
- Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
- ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
- The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
- Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency
- VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment
- Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
- AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
- Robust Promptable Video Object Segmentation
- Generative Modeling of Weights: Generalization or Memorization?
- See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
- Controllable Federated Prompt Learning at Test Time
- MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
- MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
- Progressive Mask Distillation for Self-supervised Video Representation
- CoWTracker: Tracking by Warping instead of Correlation
- DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks
- BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
- EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
- Progressive Supernet Training for Efficient Visual Autoregressive Modeling
- Collaborative Multi-Mode Pruning for Vision-Language Models
- AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
- From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature
- DROID-SLAM in the Wild
- DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities
- PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
- SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
- Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
- SkillSight: Efficient First-Person Skill Assessment with Gaze
- SIR: Structured Image Representations for Explainable Robot Learning
- VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
- FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
- Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
- RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
- Generative Neural Video Compression via Video Diffusion Prior
- GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
- FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
- HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
- Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
- Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection
- Measuring the (Un)Faithfulness of Concept-Based Explanations
- Scaling Parallel Sequence Models to Vision Foundation Models
- Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
- StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
- Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
- From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
- LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
- It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
- Causal Motion Diffusion Models for Autoregressive Motion Generation
- MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
- Computer Vision with a Superpixelation Camera
- Learnability-Guided Diffusion for Dataset Distillation
- The Invisible Gorilla Effect in Out-of-distribution Detection
- Nonlinear Color Transfer via Learnable Bezier Flows
- CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
- Language Models Can Explain Visual Features via Steering
- Cinematic Audio Source Separation Using Visual Cues
- Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
- Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
- LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
- Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
- Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
- PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
- Towards Intrinsic-Aware Monocular 3D Object Detection
- DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
- TopoCL: Topological Contrastive Learning for Medical Imaging
- CoT-Edit: Let CoT Guide Instruction Video Editing
- Language-guided Frequency Modulation for Large Vision-Language Models
- Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
- Semantic Alignment for Pose-Invariant Identity Preserving Diffusion
- Exposing and Evaluating Hallucinations for GUI Grounding
- Image-based Outlier Synthesis With Training Data
- Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
- DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
- Geometrically-Constrained Agent for Spatial Reasoning
- Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
- MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
- TempoControl: Temporal Attention Guidance for Text-to-Video Models
- Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
- RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
- TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
- GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
- MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
- Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
- TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
- Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
- HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
- Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
- Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
- FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
- MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
- LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
- EvoID: Reinforced Evolution for Identity-Preserving Video Generation
- Photo-Guided Tooth Segmentation on 3D Oral Scan Model
- Interpretable Debiasing of Vision-Language Models for Social Fairness
- Post-training Feature Pruning for Fundus Images Classification
- Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
- LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
- MacTok: Robust Continuous Tokenization for Image Generation
- Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
- Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
- Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning
- Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
- Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
- Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
- Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
- Hyperbolic Gramian Volumes for Multimodal Alignment
- KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
- Multi-view Pyramid Transformer: Look Coarser to See Broader
- PhysVid: Physics Aware Local Conditioning for Generative Video Models
- Learning by Analogy: A Causal Framework for Compositional Generalization
- Visual Diffusion Models are Geometric Solvers
- VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
- SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
- Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
- More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
- HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
- RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
- Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification
- MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
- HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
- Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
- DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
- 2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
- PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
- Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
- Improving Sparse Autoencoder with Dynamic Attention
- When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
- Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
- Unified Vector Floorplan Generation via Markup Representation
- LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
- Drainage: A Unifying Framework for Addressing Class Uncertainty
- MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation
- Decoupling Vision and Language: Codebook Anchored Visual Adaptation
- Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
- OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
- PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
- PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
- PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
- SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
- Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
- CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
- PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
- BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
- Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
- VideoCoF: Unified Video Editing with Temporal Reasoner
- Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding
- Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
- CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
- DIMOS: Disentangling Instance-level Moving Object Segmentation
- Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
- SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection
- Differentiable Laplacian Matrix Guided Superpixel Segmentation
- Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
- Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
- SpiderCam: Low-Power Snapshot Depth from Differential Defocus
- Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
- Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization
- ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
- AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
- TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
- CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
- PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
- FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection
- Deformation-based In-Context Learning for Point Cloud Understanding
- MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
- Making the Classification Explanation Faithful to the Confidence Score
- MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
- DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
- Learning to Infer Parameterized Representations of Plants from 3D Scans
- Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation
- ReBaPL: Repulsive Bayesian Prompt Learning
- Specificity-aware reinforcement learning for fine-grained open-world classification
- UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
- CountGD++: Generalized Prompting for Open-World Counting
- Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again
- Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
- Hyper-PCN: Hypergraph-Based Point Cloud Completion via High-Order Correlation Modeling
- SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
- SimScale: Learning to Drive via Real-World Simulation at Scale
- Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
- FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
- TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
- DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
- StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
- MusicInfuser: Making Video Diffusion Listen and Dance
- ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
- DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
- Learning Straight Flows: Variational Flow Matching for Efficient Generation
- Mapping Networks
- Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
- Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
- Forecasting 3D Scanpaths in Egocentric Video
- Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
- SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
- MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
- NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
- Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment
- SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
- TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
- Latent Chain-of-Thought World Modeling for End-to-End Driving
- AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
- GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
- BluRef: Unsupervised Image Deblurring with Dense-Matching References
- From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
- Spatiotemporal Pyramid Flow Matching for Climate Emulation
- EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
- PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
- When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
- IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
- Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
- Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
- OctoNav: Towards Generalist Embodied Navigation
- FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
- Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
- MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
- MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
- RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
- Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting
- GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
- EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
- Seeing Conversations: Communication Context Identification in Egocentric Video
- MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
- Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
- SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
- ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model
- TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
- Learning Convex Decomposition via Feature Fields
- Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
- AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
- TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
- GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection
- RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
- OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
- Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
- FILTR: Extracting Topological Features from Pretrained 3D Models
- SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
- Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
- RankOOD - Class Ranking-based Out-of-Distribution Detection
- EDGS: Eliminating Densification for Efficient Convergence of 3DGS
- ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
- EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
- Learning Long-term Motion Embeddings for Efficient Kinematics Generation
- DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
- End-to-End Language-Action Model for Humanoid Whole Body Control
- HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
- Lenses: Toward Polysemous Vision–Language Understanding
- D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
- Towards Calibrating Prompt Tuning of Vision-Language Models
- Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
- Image-to-Point Cloud Feature Back-Projection for Multimodal Training of 3D Semantic Segmentation
- Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
- RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
- CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
- MIBURI: Towards Expressive Interactive Gesture Synthesis
- Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning
- IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion
- OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
- Resolving the Identity Crisis in Text-to-Image Generation
- TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
- Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
- IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
- Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images
Remarks
Tutorials
- The Principles of Diffusion Models: Real-Time Continuous & Discrete Diffusion
- Edge AI in Action: Mastering On-Device Inference
- Tom Builds, Tom Breaks: Hands-On Attacks and Defenses for Vision-Language Systems
- Accelerated Diffusion Models: From Theory to Interactive World Models
- Building GenAI based Simulation Environment for End-to-End Autonomous Driving
- From Perception to Simulation: The Emergence of World Models in Multi-modal Reasoning
- Monte Carlo physical simulation
- 3D Human Mesh Modeling and Recovery from RGB and LiDAR
- Recent Advances in AI for Medical Imaging: Progress, Challenges, and Future Directions
- Computer Vision at Scale: Multi-Camera Tracking, Calibration, and Event Detection for Checkout-Free Retail
- Extending Computer Vision to Hidden Objects: A Tutorial on Millimeter-Wave Imaging and Reconstruction of Occluded Scenes
- The Full Stack of Physical AI: Simulation, Foundation Models, and Edge Deployment for Next-Generation Robotics Applications
- Analytic understanding of diffusion models
- All You Need To Know About Self-Driving
- The Road to Convergence: Evolution of Unified Multimodal Models
- From Perception to Action: Building Efficient and Deployable Robot Intelligence Pipelines with Open-Source Edge AI Toolkits
- Foundations and Frontiers of Watermarking: Algorithms, Multimodal Extensions, Benchmarks, and Authenticity Frameworks
Workshops
- Generative AI for XR and Identity-based Applications
- Foundation Models for Autonomous Driving
- The 3rd Workshop on Human Motion Generation - New Perspective on Simulation, Animation, and VR applications
- From Lab Demos to Daily Tasks: Embodied Intelligence in the Wild
- Multimodal Foundation Models for Biomedicine: Challenges and Opportunities
- Workshop on Agentic AI for Visual Media
- Workshop on World Models Meet Active Sensing and Closed-Loop Planning
- On Sensor Vision Workshop
- Workshop on Vision-based Assistants in the Real-World
- Multimodal Alignment for a Pluralistic Society
- IPA: Interactive Physical AI Workshop
- The 5th DataCV Workshop and Challenge
- The 3rd AI for Visual Arts Workshop and Challenges
- Third Joint Egocentric Vision (EgoVis) Workshop
- AERO-HPR: Human Perception and Recognition in Aerial Surveillance
- Efficient Deep Learning for Computer Vision
- The 22nd Embedded Vision Workshop
- Authenticity & Provenance in the age of Generative AI
- The 1st Workshop on Monitoring the World through an Imperfect Lens
- The Second CVPR Workshop on Foundation and Large Vision Models in Remote Sensing (MORSE)
- The 1st Workshop on Vision for Intelligent Task Assistants
- Computer Vision for Biomechanics Workshop
- 3rd Workshop on Efficient and On-Device Generation (EDGE), CVPR 2026
- 10th Affective & Behavior Analysis in-the-wild
- Workshop on Multimodal Human Motion Analysis
- Cognitive Foundations for Multimodal Models
- OpenSUN3D: 6th Workshop on Open-World 3D Scene Understanding with Foundation Models
- The 3rd MetaFood Workshop (MTF)
- 3rd Workshop on ScanNet++ Novel View Synthesis and 3D Semantic Understanding Challenge
- Auto-Annotation with Expert-Crafted Guidelines
- The 5th Workshop on “What is Next in Multimodal Foundation Models?”
- Machine Unlearning for Vision
- The 2nd 3D-LLM/VLA Workshop: Bridging Language, Vision and Action in 3D Environments
- 1st Workshop on Multi-Agent Robotic Systems: Scaling with Compositional Intelligence
- Second Workshop on Foundation and Generative Models in Biometrics
- Rediscovering Intelligence: Can AI Still Learn from Humans?
- 3D Geometry Generation for Scientific Computing (2nd Edition)
- 2nd Workshop on Knowledge-Intensive Multimodal Reasoning
- The 3rd Workshop on New Trends in AI-Generated Media and Security
- 2nd Workshop on Computer Vision for Children
- Workshop on Visual Concepts
- 9th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues
- 6th Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics
- Third Workshop for Learning 3D with Multi-View Supervision
- Trustworthy, Robust, Uncertainty-Aware, and Explainable Visual Intelligence and Beyond
- Humans of Generative AI
- Sight and Sound
- The Second Workshop on the Evaluation of the Generative Foundation Models
- Video Generative Models: Benchmarks and Evaluation
- Safe Artificial Intelligence for All Domains
- 6th Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling
- Exploring the Next Generation of Data
- Personalization in Generative AI Workshop
- 9th Multimodal Learning and Applications Workshop
- 4th Workshop on Generative Models for Computer Vision
- 2nd Workshop on Human-Interactive Generation and Editing
- 12th IEEE International Workshop on Computer Vision in Sports
- The 6th Workshop of Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents
- How Do Vision Models Work?
- Domain Generalization: Evolution, Breakthroughs, and Future Horizons (2nd Edition)
- 2nd Workshop on GenAI for Storytelling
- CVPR 2026 Biometrics Workshop
- Medical Reasoning with Vision Language Foundation Models
- Computer Vision × Education: Building a Cross-Community Agenda for Multimodal Vision in Classrooms
- 2nd Workshop on 4D Vision: Modeling the Dynamic World
- 1st Workshop on Generative 3D Reconstruction
- The 3rd Workshop on Synthetic Data for Computer Vision
- ScaleBot: The First Workshop on Scalable Robot Learning Systems
- The 2nd CVPR Workshop on Foundation Models Meet Embodied Agents
- CV4Science: Using Computer Vision for the Sciences
- The 7th International Workshop on Eye and Gaze in Computer Vision
- Big Model Adaptation In Computer Vision
- Bridging AI and Medical Reality: Computer Vision for Real-world Clinical Translation
- 4D Digital Twins: Real-to-Sim-to-Real for Physical AI
- 1st Workshop on Journey to the Awards: Generative AI for Movie-Grade Video Production (J2A), CVPR 2026
- Second Workshop on Skilled Activity Understanding, Assessment & Feedback Generation
- Pixel-level Video Understanding in the Wild Challenge
- The Third Workshop on Anomaly Detection with Foundation Models
- See the World in a Different Light: Physical Appearance Modeling and Relighting in the Age of Generative AI