CVPR 2026 Events with Videos
Posters
- Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
- X-WIN: Building Chest Radiograph World Model via Predictive Sensing
- RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
- Refracting Reality: Generating Images with Realistic Transparent Objects
- Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
- MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
- Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
- Inferring Compositional 4D Scenes without Ever Seeing One
- Scalable Feature Matching via State Space Modeling and Sparse Correlation
- PhysHead: Simulation-Ready Gaussian Head Avatars
- 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
- Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
- Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
- AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
- Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
- From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
- Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
- ORBIT: Benchmarking SfM in the Wild with 360° Video
- VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
- Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
- NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
- ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
- GFRRN: Explore the Gaps in Single Image Reflection Removal
- Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
- AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models
- Exploring Spatial Intelligence from a Generative Perspective
- BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery
- Global Underwater Geolocation from Time-Lapse Polarization Imagery
- Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
- A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
- Physical Object Understanding with a Physically Controllable World Model
- MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
- InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
- ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
- SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
- FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
- DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning
- Envisioning the Future, One Step at a Time
- STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
- PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
- Physical Simulator In-the-Loop Video Generation
- Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
- VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
- Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
- TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
- LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
- Heterogeneous Decentralized Diffusion Models
- FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
- Learning Personalized Photographic Style from Pairwise User Preferences
- X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
- The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
- Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
- RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
- RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
- Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
- Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection
- ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
- Region-Adaptive Sampling for Diffusion Transformers
- The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
- SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
- Scalable Trajectory Generation for Whole-Body Mobile Manipulation
- From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
- Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
- DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
- Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation
- GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
- Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
- Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
- FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
- RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion
- EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
- EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
- Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
- MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision
- Globscope: Toward a Global View of the Loss Landscape
- PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
- EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
- β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
- Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
- Not All Birds Look The Same: Identity-Preserving Generation For Birds
- One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
- Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
- Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
- CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
- Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
- Decoupled Generative Modeling for Human-Object Interaction Synthesis
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
- DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
- PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
- MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
- PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
- FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
- EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
- EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
- Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data
- UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
- Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
- PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
- HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
- Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
- Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
- Lifting Unlabeled Internet-level Data for 3D Scene Understanding
- Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
- rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
- Concept-Aware Batch Sampling Improves Language-Image Pretraining
- S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
- Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
- AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
- C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
- FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
- Beyond Depth: Evaluating the Width-centric Reasoning Capability of MLLMs
- A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images
- Language-Free Generative Editing from One Visual Example
- TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
- Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection
- Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
- DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
- Label-Free Cross-Task LoRA Merging with Null-Space Compression
- Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances
- Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
- Understanding Counting Mechanisms in Large Language and Vision-Language Models
- MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
- LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
- Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
- iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
- Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
- E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
- MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
- Efficient Weighted Sampling via Score-based Generative Models
- IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
- Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
- Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
- OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
- CompBench: Benchmarking Complex Instruction-guided Image Editing
- REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
- Functional Mean Flow in Hilbert Space
- LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
- Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
- X-band Radar Non-Line-of-Sight Imaging
- Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
- TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
- From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
- Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
- NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
- DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
- Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
- Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
- LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
- GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
- ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
- Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification
- Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
- Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
- Mind the Gap: Transferring Labels to Align Object Detection Datasets
- Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
- NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
- Bootstrapping Multi-view Learning for Test-time Noisy Correspondence
- Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
- Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
- MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
- RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
- MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
- Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
- Affordance-First Decomposition for Continual Learning in Video–Language Understanding
- MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
- MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
- Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
- Hierarchical Action Learning for Weakly-Supervised Action Segmentation
- Geometric Neural Distance Fields for Learning Human Motion Priors
- CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
- Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision
- Rethinking Dataset Distillation: Hard Truths about Soft Labels
- Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
- HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
- Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
- Neural Distribution Prior for LiDAR Out-of-Distribution Detection
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
- FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
- Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation
- MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
- DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
- ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
- VideoMaMa: Mask-Guided Video Matting via Generative Prior
- Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
- VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
- MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
- DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
- Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
- Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
- GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
- InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
- EmoStyle: Emotion-Driven Image Stylization
- DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors
- ViT^3: Unlocking Test-Time Training in Vision
- LNEM: Lunar Neural Elevation Model
- ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
- Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
- MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
- NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
- HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
- BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
- Long-Tail Internet Photo Reconstruction
- Harnessing the Power of Foundation Models for Accurate Material Classification
- IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
- Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
- World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
- FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
- Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
- Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
- When to Think and When to Look: Uncertainty-Guided Lookback
- MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
- UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
- Chain of World: World Model Thinking in Latent Motion
- QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
- SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
- OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
- FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
- 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
- HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
- SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
- VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
- Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
- Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
- PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
- Bidirectional Normalizing Flow: From Data to Noise and Back
- Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
- Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
- QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
- The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
- LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
- The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
- Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
- Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
- Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
- VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
- MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
- Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
- TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
- Reflection Separation from a Single Image via Joint Latent Diffusion
- GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching
- Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
- Momentum Memory for Knowledge Distillation in Computational Pathology
- AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
- Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
- VISTA: A Test-Time Self-Improving Video Generation Agent
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
- GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models
- Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction
- MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
- Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
- Does YOLO Really Need to See Every Training Image in Every Epoch?
- UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
- Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
- PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
- UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
- Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
- Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
- Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
- Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
- Data-Centric Meta-Learning for Robust Few-Shot Generalization
- OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
- Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
- TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
- How Much 3D Do Video Foundation Models Encode?
- Act2See: Emergent Active Visual Perception for Video Reasoning
- Transition Matching Distillation for Fast Video Generation
- Correspondence-Attention Alignment for Multi-View Diffusion Models
- Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
- WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
- Recovering Physically Plausible Human-Object Interactions from Monocular Videos
- Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
- SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
- AnyPcc: Compressing Any Point Cloud with a Single Universal Model
- Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
- A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
- JRM: Joint Reconstruction Model for Multiple Objects without Alignment
- VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
- Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
- Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images
- LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
- Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction
- ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
- DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
- Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
- Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
- CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
- Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
- Event-based Motion Deblurring with Unpaired Data
- Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
- ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
- Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
- StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
- Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
- Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval
- Refaçade: Editing Object with Given Reference Texture
- Agile Deliberation: Concept Deliberation for Subjective Visual Classification
- Advancing Image Classification with Discrete Diffusion Classification Modeling
- Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning
- S^2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
- Learning Multi-View Spatial Reasoning from Cross-View Relations
- SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
- VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
- Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
- VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
- Reinforcing Video Reasoning Segmentation to Think Before It Segments
- FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
- PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
- Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
- Consistent Instance Field for Dynamic Scene Understanding
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
- SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
- Multi-Scale Local Speculative Decoding for Image Generation
- Global Information Thresholding for Sufficient and Necessary Circuits
- View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
- OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
- Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
- GenMatter: Perceiving Physical Objects with Generative Matter Models
- A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs
- MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
- The Midas Touch for Metric Depth
- SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
- HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
- Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
- Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
- DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
- No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
- Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
- Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
- FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
- DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
- CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
- Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation
- Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
- Learning to Act Robustly with View-Invariant Latent Actions
- Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
- VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
- Source Models Leak What They Shouldn’t: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
- Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
- RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection
- RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
- Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding
- A Training-Free Style-Personalization via SVD-Based Feature Decomposition
- Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
- Extend3D: Town-Scale 3D Generation
- Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
- Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
- Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
- LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
- DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
- Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
- Vision-Speech Models: Teaching Speech Models to Converse about Images
- Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
- Bidirectional Query-Driven Generation of Parametric CAD Sketch
- Are Image-to-Video Models Good Zero-Shot Image Editors?
- PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
- fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
- TANGO: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
- EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
- Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
- MAMMA: Markerless Accurate Multi-person Motion Acquisition
- Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
- SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
- ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
- Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
- BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
- Dynamic Momentum Recalibration in Online Gradient Learning
- FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement
- Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
- Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
- What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F1
- NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
- POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse
- Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
- Geometry-Guided 3D Visual Token Pruning for Video-Language Models
- ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
- HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
- Cross-Hand Latent Representation for Vision-Language-Action Models
- Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
- What Matters in Practical Learned Image Compression
- Reinforcing Structured Chain-of-Thought for Video Understanding
- BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
- CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization
- G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
- Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
- Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
- CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
- Ego: Embedding-Guided Personalization of Vision-Language Models
- Representing 3D Faces with Learnable B-Spline Volumes
- MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior
- APPO: Attention-guided Perception Policy Optimization for Video Reasoning
- Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
- CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation
- AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network
- Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
- Rectifying Latent Space for Generative Single-Image Reflection Removal
- FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
- BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
- ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
- What Are You Doing? A Closer Look at Controllable Human Video Generation
- From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
- Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
- CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
- GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
- Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
- Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
- Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
- Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence
- UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs
- DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
- Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
- PGA: Prior-free Generative Attack for Practical No-box Scenario
- UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
- HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
- OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
- MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
- UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
- PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and VLM-Guided Optimization
- Visual Grounding for Object Questions
- Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
- Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification
- ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
- Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
- AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
- MeshRipple: Structured Autoregressive Generation of Artist-Meshes
- Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
- Paparazzo: Active Mapping of Moving 3D Objects
- PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
- OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing
- LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
- SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
- CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
- NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
- COT-FM: Cluster-wise Optimal Transport Flow Matching
- Scene-Centric Unsupervised Video Panoptic Segmentation
- Vinedresser3D: Towards Agentic Text-guided 3D Editing
- TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
- Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
- GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
- PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
- PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
- PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
- Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
- An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
- Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
- AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
- First Frame Is the Place to Go for Video Content Customization
- Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
- Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
- OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
- DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
- An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
- Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos
- InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
- MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
- Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
- Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
- STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
- Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
- The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
- Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
- MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
- Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching
- A³: Towards Advertising Aesthetic Assessment
- PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
- RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection
- META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
- Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
- UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
- STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
- R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment
- Radiance Meshes for Volumetric Reconstruction
- TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models
- Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
- MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
- When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
- E^2-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
- MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
- SPDMark: Selective Parameter Displacement for Robust Video Watermarking
- Variational Graph-based Normal Integration
- Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
- UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
- Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
- Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
- SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
- DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
- Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
- EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
- Edit-aware RAW reconstruction
- HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation
- E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
- Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
- OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
- COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification
- MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
- Bridging Domains through Subspace-Aware Model Merging
- Globally Optimal Pose from Orthographic Silhouettes
- SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
- OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
- SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
- SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
- Towards Sparse Video Understanding and Reasoning
- Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
- Yume1.5: A Text-Controlled Interactive World Generation Model
- Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
- MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
- Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
- SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
- Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
- Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
- GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
- Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
- Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
- PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
- Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
- OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
- Prompt-Free Universal Region Proposal Network
- SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
- MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
- Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
- Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
- Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
- TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
- Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
- SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
- SAM 3D: 3Dfy Anything in Images
- Learning What Helps: Task-Aligned Context Selection for Vision Tasks
- E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation
- Translating Signals to Languages for sEMG-Based Activity Recognition
- Learnability-Driven Submodular Optimization for Active Roadside 3D Detection
- Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
- GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
- ID-Sim: An Identity-Focused Similarity Metric
- SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
- Lipschitz Optimization for Formal Verification of Homographies
- UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
- DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
- OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
- Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
- CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
- Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
- SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
- ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
- Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
- BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
- CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
- Z-Order Transformer for Feed-Forward Gaussian Splatting
- Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
- CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
- ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
- The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection
- GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy
- Disco-GS: Gaussian Splatting in Dynamic Color Lighting
- ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
- LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
- DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
- Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
- Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
- Residual Primitive Fitting of 3D Shapes with SuperFrusta
- NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
- Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
- Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
- LVLM-Aided Alignment of Task-Specific Vision Models
- WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks
- TruckDrive: Long-Range Autonomous Highway Driving Dataset
- SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
- PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
- CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
- Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
- D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
- StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
- R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
- MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
- Frequency-domain Manipulation for Face Obfuscation
- Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
- HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification
- D-Prism: Differentiable Primitives for Structured Dynamic Modeling
- VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
- FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
- HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
- WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering
- PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
- UniCorrn: Unified Correspondence Transformer Across 2D and 3D
- Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
- Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
- GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
- Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
- Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
- HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
- UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
- STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
- Gyro-based Deep Video Deblurring
- InterRVOS: Interaction-Aware Referring Video Object Segmentation
- TrackMAE: Video Representation Learning via Track Mask and Predict
- Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
- Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
- VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
- TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
- Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
- Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
- Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
- Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
- ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
- Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
- All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
- HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
- CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
- OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation
- GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
- 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
- ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
- MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
- SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
- A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
- PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
- MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
- HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
- RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
- Affostruction: 3D Affordance Grounding with Generative Reconstruction
- Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
- CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
- Contact-Aware Neural Dynamics
- The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
- Occluded Human Body Capture with Frequency Domain Denoising Prior
- Affine Perspective-Three-Point Problem
- From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
- RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
- Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
- PhyGaP: Physically-Grounded Gaussians with Polarization Cues
- LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
- VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
- Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
- Breaking Multimodal LLM Safety via Video-Driven Prompting
- EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
- PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
- ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
- Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
- DRM: Diffusion-based Reward Model With Step-wise Guidance
- Rethinking Occlusion Modeling for UAV Tracking
- Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment
- Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
- TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation
- Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
- STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
- ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
- Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
- GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
- Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
- Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
- MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
- Scaling Dense Event-Stream Pretraining from Visual Foundation Models
- S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
- Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
- Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
- RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
- Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
- Illumination-Consistent Human-Scene Reconstruction from Monocular Video
- DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
- Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
- Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
- Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
- Building a Precise Video Language with Human–AI Oversight
- Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
- TopoSlide: Topologically-Informed Histopathology Whole Slide Image Representation Learning
- AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
- Continual Distillation of Teachers from Different Domains
- GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
- VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
- Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
- Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
- Make it SING: Analyzing Semantic Invariants in Classifiers
- Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
- Anti-Degradation Lifelong Multi-View Clustering
- Common Inpainted Objects In-N-Out of Context
- SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
- Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
- SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
- EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
- Splatent: Splatting Diffusion Latents for Novel View Synthesis
- Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
- COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
- MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
- Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
- Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
- LAMP: Language-Assisted Motion Planning for Controllable Video Generation
- Lynx: Towards High-Fidelity Personalized Video Generation
- Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
- DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
- VL-RouterBench: A Benchmark for Vision–Language Model Routing
- Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking
- DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
- Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
- HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
- Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
- Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
- Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
- Residual Diffusion Bridge Model for Image Restoration
- Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
- Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
- Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
- Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
- Optical Diffraction-based Convolution for Semiconductor Lithography
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
- Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
- PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
- Towards Generalized Multimodal Homography Estimation
- Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
- RAID: Retrieval-Augmented Anomaly Detection
- A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
- SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
- FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
- PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection
- SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
- TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
- No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
- A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation
- Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
- Geo2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
- Learning to Solve PDEs on Neural Shape Representations
- Linear Image Generation by Synthesizing Exposure Brackets
- Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
- Same or Not? Enhancing Visual Perception in Vision-Language Models
- NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
- Any4D: Unified Feed-Forward Metric 4D Reconstruction
- SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
- Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
- Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
- GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
- Low-Resolution Editing is All You Need for High-Resolution Editing
- From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
- HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
- HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
- Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
- TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
- Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach
- LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
- Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
- Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
- S^2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
- PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
- FrankenMotion: Part-level Human Motion Generation and Composition
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
- Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
- FE2E: From Editor to Dense Geometry Estimator
- Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
- MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
- L^2DGS: Low-Light Dynamic Gaussian Splatting
- Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
- Lighting in Motion: Spatiotemporal HDR Lighting Estimation
- An Empirical Study on How Video-LLMs Answer Video Questions
- Beyond Caption-Based Queries in Video Moment Retrieval
- Aligning Text, Images and 3D Structure Token-by-Token
- Robust Spiking Neural Networks by Temporal Mutual Information
- Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
- HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
- A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors
- ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
- Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
- Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
- From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
- QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification
- AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
- M^3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
- MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
- Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
- VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
- Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
- Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
- Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
- Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution
- LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
- Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
- When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
- Learning complete and explainable visual representations from itemized text supervision
- GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
- DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
- Grid Distillation: Compositional Image Distillation via Structured Generative Grids
- MV-TAP: Tracking Any Point in Multi-View Videos
- VENI: Variational Encoder for Natural Illumination
- CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
- MAD: Motion Appearance Decoupling for efficient Driving World Models
- VOSR: A Vision-Only Generative Model for Image Super-Resolution
- RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
- Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
- Improving Adversarial Transferability with Local Perturbation Augmentation
- EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
- PersonaVLM: Long-Term Personalized Multimodal LLMs
- Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
- FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
- Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
- Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
- RewardFlow: Generate Images by Optimizing What You Reward
- Foundry: Distilling 3D Foundation Models for the Edge
- AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
- gQIR: Generative Quanta Image Reconstruction
- VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
- SMVRT: Implicit Human 3D Modeling Using Sparse Multi-View Volumetric Reconstruction with Transformer Fusion
- SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
- Endless World: Real-Time 3D-Aware Long Video Generation
- Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning
- Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
- CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
- InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
- Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective
- Coded-E2LF: Coded Aperture Light Field Imaging from Events
- QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
- Bridging Facial Understanding and Animation via Language Models
- Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All
- Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
- OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
- UIKA: Fast Universal Head Avatar from Pose-Free Images
- Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
- Obstruction Reasoning for Robotic Grasping
- CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation
- Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
- Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs
- Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
- Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
- Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification
- StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
- UniSER: A Foundation Model for Unified Soft Effects Removal
- Scene Reconstruction as Mapping Priors for 3D Detection
- SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
- Volumetric Functional Maps
- Spike-driven Discrete Aggregation for Event-based Object Detection
- Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
- Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
- Condensed Test-Time Adaptation of VLMs for Action Recognition
- First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
- Eulerian Gaussian Splatting using Hashed Probability Pyramids
- TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
- TokenLight: Precise Lighting Control in Images using Attribute Tokens
- UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
- Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
- InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
- FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
- Best Segmentation Buddies for Image-Shape Correspondence
- EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
- Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
- TextFM: Robust Semi-dense Feature Matching with Language Guidance
- FMPose3D: monocular 3D pose estimation via flow matching
- Few-for-Many Personalized Federated Learning
- Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
- Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling
- Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
- Distilling Balanced Knowledge from a Biased Teacher
- Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
- Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
- InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
- A More Word-like Image Tokenization for MLLMs
- TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
- EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
- RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
- LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
- History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
- RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
- SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
- Semantic Scale Space: A Framework for Controllable Image Abstraction
- VQ-VA World: Towards High-Quality Visual Question-Visual Answering
- Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
- ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy
- LoST: Level of Semantics Tokenization for 3D Shapes
- MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
- OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
- GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
- From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
- GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
- TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
- PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
- Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
- Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
- Visual Personalization Turing Test
- AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
- BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
- Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
- Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
- Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
- Weight Space Representation Learning via Neural Field Adaptation
- Parallelised Differentiable Straightest Geodesics for 3D Meshes
- Dynamic Exposure Burst Image Restoration
- Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
- Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
- ExpPortrait: Expressive Portrait Generation via Personalized Representation
- Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
- Image-Guided Geometric Stylization of 3D Meshes
- Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
- PersonaLive! Expressive Portrait Image Animation for Live Streaming
- Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
- CLIP-like Model as a Foundational Density Ratio Estimator
- Boosting Reasoning in Large Multimodal Models via Activation Replay
- SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
- Mario: Multimodal Graph Reasoning with Large Language Models
- Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
- UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
- EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
- Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
- FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
- Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
- AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
- AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
- PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
- Designing to Forget: Deep Semi-parametric Models for Unlearning
- Event6D: Event-based Novel Object 6D Pose Tracking
- Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
- Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
- Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
- SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
- Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
- WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing
- Gaussian Mapping for Evolving Scenes
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
- Composing Concepts from Images and Videos via Concept-prompt Binding
- Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
- Suppressing Non-Semantic Noise in Masked Image Modeling Representations
- Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
- Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds
- Inference-time Physics Alignment of Video Generative Models with Latent World Models
- Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
- Human Interaction-Aware 3D Reconstruction from a Single Image
- Delta Rectified Flow Sampling for Text-to-Image Editing
- DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
- EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
- Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
- StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
- Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
- CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
- AE2VID: Event-based Video Reconstruction via Aperture Modulation
- 3D-LATTE: Latent Space 3D Editing from Textual Instructions
- V-DPM: 4D Video Reconstruction with Dynamic Point Maps
- MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
- FloVerse: Floor Plan-Guided Multi-Modal Navigation
- Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
- Next-Scale Autoregressive Models for Text-to-Motion Generation
- Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
- Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
- Self-Corrected Image Generation with Explainable Latent Rewards
- Grounded Latents for Entity-Centric 4D Scene Generation
- CARD: Correlation Aware Restoration with Diffusion
- PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
- JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
- Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
- Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
- Recurrent Video Masked Autoencoders
- HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter
- Differentially Private 2D Human Pose Estimation
- PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
- Draft and Refine with Visual Experts
- RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
- Domain-Skewed Federated Learning with Feature Decoupling and Calibration
- Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
- Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
- Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
- Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression
- URScenes: A Multi-scenario Dataset for Unstructured Road Environments
- Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
- Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
- Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data
- DeDelayed: Deleting Remote Inference Delay via On-Device Correction
- Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
- Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
- HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph
- Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
- WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
- Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
- Synthesizing Visual Concepts as Vision-Language Programs
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
- DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
- When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
- CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
- SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
- SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
- WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
- Modeling the Visual Ambiguity of Human Sketches
- Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
- LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
- Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
- ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
- InternVideo-Next: Towards World-Understanding Video Models
- Voxify3D: Pixel Art Meets Volumetric Rendering
- PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose
- OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
- PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
- Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
- Explaining Object Detectors via Collective Contribution of Pixels
- ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
- SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images
- Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
- Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
- Foundation Encoders Are All You Need for Preference-Aware Personalization
- Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
- OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe
- Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
- FPSBench: A Benchmark for Video Understanding at High Frame Rates
- GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
- AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
- 2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
- SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
- HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
- Few-shot Acoustic Synthesis with Multimodal Flow Matching
- SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
- Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
- HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
- V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
- Point Cloud as a Foreign Language for Multi-modal Large Language Model
- OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
- Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
- DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
- CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
- A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking
- Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
- Precise Object and Effect Removal with Adaptive Target-Aware Attention
- Fine-Grained Multi Image Object Hallucination Benchmark
- PAVAS: Physics-Aware Video-to-Audio Synthesis
- AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
- From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
- PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
- Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
- CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
- CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding
- 3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors
- RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
- Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
- YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection
- ChordEdit: One-Step Low-Energy Transport for Image Editing
- Lafite: A Generative Latent Field for 3D Native Texturing
- WPT: World-to-Policy Transfer via Online World Model Distillation
- AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
- Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
- Block-based Learned Image Compression without Blocking Artifacts
- Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
- MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
- Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
- BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
- Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
- VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
- M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
- RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
- FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
- DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
- MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
- Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
- Self-Consistency for LLM-Based Motion Trajectory Generation and Verification
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
- Tunable Soft Equivariance with Guarantees
- Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
- Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
- CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
- CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
- Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
- Generative Video Motion Editing with 3D Point Tracks
- ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
- When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
- VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment
- Kaleidoscopic Scintillation Event Imaging
- Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
- Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
- Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
- ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
- REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
- Towards Multimodal Domain Generalization with Few Labels
- DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
- Coverage Optimization for Camera View Selection
- CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
- Adaptive Confidence Regularization for Multimodal Failure Detection
- SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
- MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
- SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
- Image Generation from Contextually-Contradictory Prompts
- CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
- A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
- KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
- Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
- Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
- Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
- ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
- PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
- Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis
- VGGT-Ω
- EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
- 3D-IDE: 3D Implicit Depth Emergent
- FEAT: Fashion Editing and Try-On from Any Design
- LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
- Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
- Enhancing Out-of-Distribution Detection with Extended Logit Normalization
- Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis
- Fully Decentralized Certified Unlearning
- The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
- FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
- HoneyBee: Data Recipes for Vision-Language Reasoners
- Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
- Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
- Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
- MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
- AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
- Deep Feature Deformation Weights
- Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
- FUN REC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
- HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
- Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
- GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
- InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
- FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
- Exemplar-Free Continual Learning for State Space Models
- INSID3: Training-Free In-Context Segmentation with DINOv3
- FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
- Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
- GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
- DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
- Defending Unauthorized Model Merging via Dual-Stage Weight Protection
- TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
- Cycle-Consistent Tuning for Layered Image Decomposition
- Stronger Normalization-Free Transformers
- Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition
- OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
- EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
- On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
- Phrase-grounded APO for Improving Chest X-ray Report Generation
- Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
- DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation
- TIGER: A Unified Framework for Time, Images and Geo-location Retrieval
- Solvability of the Viewing Graph Under the Affine Camera Model
- PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
- DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
- H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
- HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
- EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
- From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
- UniDAC: Universal Metric Depth Estimation for Any Camera
- Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
- MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
- Modeling Cross-vision Synergy for Unified Large Vision Model
- GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
- Unified Number-Free Text-to-Motion Generation Via Flow Matching
- PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
- PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
- Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
- Geometric-Photometric Event-based 3D Gaussian Ray Tracing
- PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
- Learning 3D Shape Fidelity Metric from Real-world Distortions
- Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
- Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
- SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
- Soft Modality-Guided Expert Specialization in MoE-VLMs
- SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
- StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
- TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
- 4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
- Text-Driven 3D Hand Motion Generation from Sign Language Data
- Dynamic Token Reweighting for Robust Vision-Language Models
- Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
- Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
- Relightful Video Portrait Harmonization
- HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
- SVAgent: Storyline-guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
- Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions
- Unique Lives, Shared World: Learning from Single-Life Videos
- Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
- REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
- When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
- GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
- Sparse–View Localization via Online Neural 3D Regression
- Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
- FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
- Dynamic Visual SLAM using a General 3D Prior
- ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
- Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
- Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
- 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
- PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
- FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping
- ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
- PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
- Multi-speaker Attention Alignment for Multimodal Social Interaction
- VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
- 240FPS Stereo Vision from Monocular Mixed Spikes
- Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation
- Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
- ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
- SG-LoRA: Semantic-guided LoRA Parameters Generation
- Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains
- How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
- SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
- CHEEM: Continual Learning by Reuse, New, Adapt and Skip - A Hierarchical Exploration-Exploitation Approach
- Captain Safari: A World Engine with Pose-Aligned 3D Memory
- Global Structure-from-Motion Meets Feedforward Reconstruction
- Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
- OccAny: Generalized Unconstrained Urban 3D Occupancy
- DiffBMP: Differentiable Rendering with Bitmap Primitives
- When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
- Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
- MARCO: Navigating the Unseen Space of Semantic Correspondence
- Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
- CaptionQA: Is Your Caption as Useful as the Image Itself?
- OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
- Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty
- PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion
- AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
- Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
- LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
- ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
- Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
- MuM: Multi-View Masked Image Modeling for 3D Vision
- TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis
- SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
- Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
- BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
- Feed-forward Gaussian Registration for Head Avatar Creation and Editing
- Multimodal Distribution Matching for Vision-Language Dataset Distillation
- Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
- Coupling Liquid Time-Constant Encoders with Modern Hopfield Memory
- BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
- Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models
- Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
- Self-Diffusion Driven Blind Imaging
- SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
- AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
- VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
- MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
- MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
- Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
- TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
- ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
- SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
- Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
- Thinking in 360°: Humanoid Visual Search in the Wild
- Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
- SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
- PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
- A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
- TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
- NEAF: Natural Image Editing with Attention Fusion for Generalizable Test-time Optimization in Text-Guided Image Editing
- Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
- Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
- FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
- Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs
- Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
- Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
- GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
- AirSim360: A Panoramic Simulation Platform within Drone View
- AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
- Task-Driven Implicit Representations for Automated Design of LiDAR Systems
- Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
- LumiX: Structured and Coherent Text-to-Intrinsic Generation
- MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
- CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
- HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
- TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
- Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
- GS-ASM: 2DGS-Supervised Active Stereo Matching
- MoVie: Broaden Your Views with Human Motion for Action Detection
- AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking
- CI-VID: A Coherent Interleaved Text-Video Dataset
- Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
- Generative Diffusion Priors for 3D Mapping of the Dark Universe
- Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
- EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
- Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
- Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
- ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
- Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
- DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
- Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
- ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
- MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
- PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
- SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
- MUFASA: A Multi-Layer Framework for Slot Attention
- R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
- PAI-Bench: A Comprehensive Benchmark For Physical AI
- Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
- MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
- Perceptual 3D Simulation With Physical World Modeling
- Random Wins All: Rethinking Grouping Strategies for Vision Tokens
- ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
- LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
- Particulate: Feed-Forward 3D Object Articulation
- DuoGen: Towards Autonomous Interleaved Multimodal Generation
- B^3-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
- M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
- ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
- ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
- RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
- Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
- Linear Fundamental Matrix Estimation from 7 or 5 Points
- See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
- Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion
- Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
- EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
- Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
- Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
- MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
- FARMER: Flow AutoRegressive Transformer over Pixels
- VecGlypher: Unified Vector Glyph Generation with Language Models
- IGen: Scalable Data Generation for Robot Learning from Open-World Images
- RefAV: Towards Planning-Centric Scenario Mining
- RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution
- DC-Merge: Improving Model Merging with Directional Consistency
- YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
- PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
- Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition
- TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
- Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
- PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
- GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection
- One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
- Beyond the Static World: Continual Category Discovery under Visual Drift
- Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
- IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
- ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
- A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
- SoccerMaster: A Vision Foundation Model for Soccer Understanding
- Portable Active Learning for Object Detection
- CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
- LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
- Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
- Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
- V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
- Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
- PE3R: Perception-Efficient 3D Reconstruction
- Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
- MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
- Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
- Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
- TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
- NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
- High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
- Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
- Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
- Text-guided Feature Disentanglement for Cross-modal Gait Recognition
- Linking Perception, Confidence and Accuracy in MLLMs
- Verifying Neural Network Robustness with Dual Perturbations
- Retrieving Counterfactuals Improves Visual In-Context Learning
- RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
- UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
- VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
- Streamlined Knowledge Distillation
- UniVBench: Towards Unified Evaluation for Video Foundation Models
- Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
- SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
- PowerCLIP: Powerset Alignment for Contrastive Pre-Training
- EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
- Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics
- WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification
- BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
- SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
- NitroGen: An Open Foundation Model for Generalist Gaming Agents
- RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
- FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling
- Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
- Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
- Mixture of Prototypes for Test-time Adaptive Segmentation
- DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
- Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
- PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
- REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
- Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
- Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
- RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
- EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
- Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
- Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
- No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
- TSTM: Temporal Segmentation for Task-relevant Mask in Visual Reinforcement Learning Generalization
- Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
- Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations
- Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
- Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
- Temporal Inversion for Learning Interval Change in Chest X-Rays
- GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
- Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
- Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
- Dual Ascent Diffusion for Inverse Problems
- PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
- Zoo3D: Zero-Shot 3D Object Detection at Scene Level
- FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
- VABench: A Comprehensive Benchmark for Audio-Video Generation
- Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
- FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
- SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
- Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
- Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
- Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
- ReLaGS: Relational Language Gaussian Splatting
- OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
- WorldGen: From Text to Traversable and Interactive 3D Worlds
- DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
- Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
- Generalizable Video Quality Assessment via Weak-to-Strong Learning
- Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning
- InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
- MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
- Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
- OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
- InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
- Rethinking Token Reduction for Large Vision-Language Models
- Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
- DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
- RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
- CREward: A Type-Specific Creativity Reward Model
- PECCVAI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
- GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
- Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
- CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction
- Enhancing Spatial Understanding in Image Generation via Reward Modeling
- ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
- WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
- Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
- Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework
- GazeShift: Unsupervised Gaze Estimation and Dataset for VR
- Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
- TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
- Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
- AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
- Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
- Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
- DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
- GeoSANE: Learning Geospatial Representations from Models, Not Data
- AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
- VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
- OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
- VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
- MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
- Stake the Points: Structure-Faithful Instance Unlearning
- Structural Graph Probing of Vision–Language Models
- Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
- HTTM: Head-wise Temporal Token Merging for Faster VGGT
- Pixel Motion Diffusion is What We Need for Robot Control
- Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
- Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
- FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
- Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
- CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
- Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
- G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
- Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
- SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
- MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
- Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
- EasyV2V: A High-quality Instruction-based Video Editing Framework
- F^2HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
- Global-Aware Edge Prioritization for Pose Graph Initialization
- GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
- Sampling-Aware Quantization for Diffusion Models
- IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
- Efficient Equivariant Transformer for Self-Driving Agent Modeling
- SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
- Hist2Style: Histogram-Guided Stylization with Bilateral Grids
- MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
- Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
- Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
- MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
- Reward Sharpness-Aware Fine-Tuning for Diffusion Models
- Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
- Anomaly-Related Residual Fields for Cross-domain Anomaly Detection
- Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs
- Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
- MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis
- A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
- IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
- Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
- Region-Aware Instance Consistency Learning for Micro-Expression Recognition
- SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
- PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
- Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
- TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
- LA-Pose: Latent Action Pretraining Meets Pose Estimation
- VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
- DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
- PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects
- Inter-Photon-Limited Videography
- SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
- Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
- Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
- Time Blindness: Why Video-Language Models Can’t See What Humans Can?
- BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
- High-Quality and Efficient Turbulence Mitigation with Events
- HybridDriveVLA: Vision-Language-Action Model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
- Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
- More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
- GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
- Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent
- UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
- VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
- 2D-LFM: Lifting Foundation Model without 3D Supervision
- Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
- Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
- Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
- QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
- BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
- BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
- Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
- BiGain: Unified Token Compression for Joint Generation and Classification
- Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
- From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
- IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
- DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
- Compressed-Domain-Aware Online Video Super-Resolution
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
- FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
- Understanding Task Transfer in Vision-Language Models
- DSO: Direct Steering Optimization for Bias Mitigation
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
- SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
- Goldilocks Test Sets for Face Verification
- TESO: Online Tracking of Essential Matrix by Stochastic Optimization
- FG-Portrait: 3D Flow Guided Editable Portrait Animation
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
- How to Take a Memorable Picture? Empowering Users with Actionable Feedback
- TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
- PHAC: Promptable Human Amodal Completion
- CADC: Content Adaptive Diffusion-Based Generative Image Compression
- Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
- Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
- Visual Autoregressive Modeling via Next Focus Prediction
- VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
- TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
- IPR-1: Interactive Physical Reasoner
- Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
- Diffusion Mental Averages
- Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
- Agentic Retoucher for Text-To-Image Generation
- Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
- FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning
- Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
- GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
- BAMI: Training-Free Bias Mitigation in GUI Grounding
- When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs
- Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
- Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
- PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
- AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
- Personalized Federated Training of Diffusion Models with Privacy Guarantees
- Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
- FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
- VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
- TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
- OS-Fed: One Snapshot Is All You Need
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
- ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
- Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
- Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
- HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
- SonoWorld: From One Image to a 3D Audio-Visual Scene
- InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
- Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
- High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling
- Driving on Registers
- LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
- 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
- Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
- KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
- EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
- Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
- WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
- GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
- Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
- NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
- HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
- Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
- RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
- Scale Space Diffusion
- Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
- From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
- Dark3R: Learning Structure from Motion in the Dark
- TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
- GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
- Order Matters: 3D Shape Generation from Sequential VR Sketches
- RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
- InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
- Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
- No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
- cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
- UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
- SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
- Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
- SineProject: Machine Unlearning for Stable Vision-Language Alignment
- Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
- CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
- RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
- PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
- ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments
- Is the Modality Gap a Bug or a Feature? A Robustness Perspective
- One Algorithm to Align Them All
- Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
- GenTract: Generative Global Tractography
- OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
- VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
- Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models
- ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
- SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
- Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
- Rethinking Glyph Spatial Information in Font Generation
- ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
- Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
- High-Fidelity Mobile Avatars with Pruned Local Blendshapes
- Match-and-Fuse: Consistent Generation from Unstructured Image Sets
- ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
- KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems
- Video Panels for Long Video Understanding
- MPL: Match-guided Prototype Learning for Few-shot Action Recognition
- Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
- UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL
- Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
- QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
- Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification
- EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
- DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
- ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
- Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
- SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
- SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
- HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
- Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters
- Free-Grained Hierarchical Visual Recognition
- Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
- SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
- Vista4D: Video Reshooting with 4D Point Clouds
- Dexterous World Models
- WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
- Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
- MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
- Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
- Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
- ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
- ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer
- RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
- MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
- OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
- WildPose: A Unified Framework for Robust Pose Estimation in the Wild
- Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
- UniLight: A Unified Representation for Lighting
- Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
- SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
- Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
- Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
- ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
- Latent Implicit Visual Reasoning
- MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
- Decision Boundary-aware Generation for Long-tailed Learning
- C^2FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
- CamDirector: Towards Long-Term Coherent Video Trajectory Editing
- Beyond the Ground Truth: Enhanced Supervision for Image Restoration
- Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
- MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
- Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
- FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
- Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
- DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
- Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
- Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
- LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
- MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
- Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels
- Sparse Spectral LoRA: Routed Experts for Medical VLMs
- StreamDiT: Real-Time Streaming Text-to-Video Generation
- Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
- LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
- Mechanisms of Object Localization in Vision–Language Models
- D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
- ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
- CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
- Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
- Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
- KV-Tracker: Real-Time Pose Tracking with Transformers
- TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
- MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking
- Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
- OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
- EXOTIC: External Vision-driven Incomplete Multi-view Classification
- RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
- SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
- Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
- CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
- VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
- 4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
- GM-R^2: Generative Matching Learning for Unsupervised Geometric Representation and Registration
- TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
- Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
- PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
- Scene Grounding in the Wild
- MM-ACT: Learn from Multimodal Parallel Generation to Act
- Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
- Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
- AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
- MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation
- OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
- ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
- Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport
- Parallel Rigidity Matters for Bundle Adjustment
- Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
- MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
- Content-Aware Dynamic Patchification for Efficient Video Diffusion
- FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
- FabricGen: Microstructure-Aware Woven Fabric Generation
- Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights
- Progressive Multi-cue Alignment for Unaligned RGBT Tracking
- ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
- Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration
- PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
- FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle
- Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
- Mirai: Autoregressive Visual Generation Needs Foresight
- Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
- SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
- Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
- Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
- Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
- IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution
- SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
- SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
- VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
- Guiding Token-Sparse Diffusion Models
- Mirror Illusion Art
- LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
- BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
- A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
- Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
- Adapting Lightweight Image-based Counting Models for Video Crowd Counting
- Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
- CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
- Explaining CLIP Zero-shot Predictions Through Concepts
- Evidential Neural Radiance Fields
- Bridging the Perception Gap in Image Super-Resolution Evaluation
- MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
- Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation
- CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
- See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
- Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
- Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
- LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
- DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
- Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
- SVBench: Evaluation of Video Generation Models on Social Reasoning
- Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
- Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
- mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
- ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
- Describe Anything Anywhere At Any Moment
- TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
- InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
- Minimal Constraint Relaxation for Multiview Autocalibration
- LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
- VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
- DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation
- Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
- See Through the Noise: Improving Domain Generalization in Gaze Estimation
- Finding Distributed Object-Centric Properties in Self-Supervised Transformers
- The Universal Normal Embedding
- When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks
- NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
- Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
- Adapting In-context Generation for Enhanced Composed Image Retrieval
- NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
- Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
- Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
- PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
- Latent Diffusion Inversion Requires Understanding the Latent Space
- VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
- GGPT: Geometry-Grounded Point Transformer
- PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
- Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
- Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
- MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
- Align Images Before You Generate
- X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
- INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion
- ReasonX: MLLM-Guided Intrinsic Image Decomposition
- Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
- Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
- A Bit is All You Need! Efficient Video Capture via Single Bit Imaging
- OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
- DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
- LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection
- SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
- Efficient and Training-Free Single-Image Diffusion Models
- Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
- Spot The Ball: A Benchmark for Visual Social Inference
- SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
- Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
- 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
- RFDM: Residual Flow Diffusion Models for Video Editing
- Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
- Landscape-Awareness for Geometric View Diffusion Model
- ReBaPL: Repulsive Bayesian Prompt Learning
- FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
- MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
- Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
- Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation
- CountGD++: Generalized Prompting for Open-World Counting
- PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
- TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking
- Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
- Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
- ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
- ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
- The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
- BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models
- Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency
- Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
- VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment
- Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
- AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
- SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
- Generative Modeling of Weights: Generalization or Memorization?
- Human Geometry Distribution for 3D Animation Generation
- MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
- HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
- Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
- DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks
- Uni-Hema: Unified Model for Digital Hematopathology
- EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
- Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images
- Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
- Resolving the Identity Crisis in Text-to-Image Generation
- Personalized Image Descriptions from Attention Sequences
- CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
- Progressive Supernet Training for Efficient Visual Autoregressive Modeling
- Image-to-Point Cloud Feature Back-Projection for Multimodal Training of 3D Semantic Segmentation
- Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
- Towards Calibrating Prompt Tuning of Vision-Language Models
- DROID-SLAM in the Wild
- DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities
- SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
- DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
- EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
- EDGS: Eliminating Densification for Efficient Convergence of 3DGS
- RankOOD - Class Ranking-based Out-of-Distribution Detection
- AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
- Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
- SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
- FILTR: Extracting Topological Features from Pretrained 3D Models
- GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
- SIR: Structured Image Representations for Explainable Robot Learning
- Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
- Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
- RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
- Generative Neural Video Compression via Video Diffusion Prior
- TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
- MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
- StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
- EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
- Measuring the (Un)Faithfulness of Concept-Based Explanations
- Scaling Parallel Sequence Models to Vision Foundation Models
- GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
- RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
- MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
- LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
- Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
- MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
- From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
- Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
- FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
- Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
- Hyper-PCN: Hypergraph-Based Point Cloud Completion via High-Order Correlation Modeling
- Computer Vision with a Superpixelation Camera
- Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
- PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
- Learnability-Guided Diffusion for Dataset Distillation
- EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
- SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
- Spatiotemporal Pyramid Flow Matching for Climate Emulation
- TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
- SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
- Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
- Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment
- MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
- Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
- NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
- LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
- DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
- Latent Chain-of-Thought World Modeling for End-to-End Driving
- GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
- BluRef: Unsupervised Image Deblurring with Dense-Matching References
- From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
- PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
- When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
- CoT-Edit: Let CoT Guide Instruction Video Editing
- IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
- OctoNav: Towards Generalist Embodied Navigation
- Semantic Alignment for Pose-Invariant Identity Preserving Diffusion
- Exposing and Evaluating Hallucinations for GUI Grounding
- Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting
- Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
- Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
- MusicInfuser: Making Video Diffusion Listen and Dance
- Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
- DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
- ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model
- AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
- Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
- TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
- DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
- GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection
- RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
- Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
- Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
- TempoControl: Temporal Attention Guidance for Text-to-Video Models
- Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
- Mapping Networks
- Learning Long-term Motion Embeddings for Efficient Kinematics Generation
- HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
- TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
- Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
- Lenses: Toward Polysemous Vision–Language Understanding
- D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
- Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
- MIBURI: Towards Expressive Interactive Gesture Synthesis
- IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion
- OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
- TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
- IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
- SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
- VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
- TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
- AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
- Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
- Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
- Interactive Episodic Memory with User Feedback
- Post-training Feature Pruning for Fundus Images Classification
- Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
- MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations
- Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
- Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
- RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
- Learning Forgery-Aware Lip Representations Without Forgery Priors
- Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning
- Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
- Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
- Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
- Seeing Conversations: Communication Context Identification in Egocentric Video
- KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
- Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
- Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
- DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
- Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
- REACH: Explicit Recovery Behavior for Diffusion Policies
- SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
- Multi-view Pyramid Transformer: Look Coarser to See Broader
- D2T2 - Multimodal Automated Planning for Brachytherapy
- SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior-Guided Multimodal LLMs
- Learning Convex Decomposition via Feature Fields
- PhysVid: Physics Aware Local Conditioning for Generative Video Models
- ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
- Visual Diffusion Models are Geometric Solvers
- VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
- ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
- SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
- Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
- DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
- ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
- RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
- Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification
- EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
- SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
- Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
- Guiding Diffusion Models with Semantically Degraded Conditions
- IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
- Improving Sparse Autoencoder with Dynamic Attention
- HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
- Emergent Extreme-View Geometry in 3D Foundation Models
- FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
- PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
- 2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
- PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
- Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
- Unified Camera Positional Encoding for Controlled Video Generation
- Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
- CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
- Unified Vector Floorplan Generation via Markup Representation
- LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
- Drainage: A Unifying Framework for Addressing Class Uncertainty
- MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation
- ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
- CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
- BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
- Decoupling Vision and Language: Codebook Anchored Visual Adaptation
- DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
- AnthroTAP: Learning Point Tracking with Real-World Motion
- GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
- PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
- PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
- RenderFlow: Single-Step Neural Rendering via Flow Matching
- SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
- Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
- Smoothing the Score Function to Enhance Generalization in Diffusion Models
- CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
- PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
- NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
- Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
- VideoCoF: Unified Video Editing with Temporal Reasoner
- R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
- Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection
- Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
- Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
- CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
- StreamReady: Learning What to Answer and When in Long Streaming Videos
- Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
- MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
- Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
- Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
- ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
- Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
- Robust Promptable Video Object Segmentation
- See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
- Controllable Federated Prompt Learning at Test Time
- MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
- Differentiable Laplacian Matrix Guided Superpixel Segmentation
- Progressive Mask Distillation for Self-supervised Video Representation
- Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection
- CoWTracker: Tracking by Warping instead of Correlation
- Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
- BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
- Collaborative Multi-Mode Pruning for Vision-Language Models
- Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
- From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature
- PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
- Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
- AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
- TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
- CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
- Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
- SkillSight: Efficient First-Person Skill Assessment with Gaze
- VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
- FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
- FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
- HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
- Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
- Language-guided Frequency Modulation for Large Vision-Language Models
- PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
- Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
- MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
- DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
- Learning to Infer Parameterized Representations of Plants from 3D Scans
- Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
- Specificity-aware reinforcement learning for fine-grained open-world classification
- It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
- Causal Motion Diffusion Models for Autoregressive Motion Generation
- MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
- Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again
- The Invisible Gorilla Effect in Out-of-distribution Detection
- Nonlinear Color Transfer via Learnable Bezier Flows
- CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
- Language Models Can Explain Visual Features via Steering
- Image-based Outlier Synthesis With Training Data
- Cinematic Audio Source Separation Using Visual Cues
- Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
- Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
- Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
- SimScale: Learning to Drive via Real-World Simulation at Scale
- Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
- Towards Intrinsic-Aware Monocular 3D Object Detection
- DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
- TopoCL: Topological Contrastive Learning for Medical Imaging
- TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
- Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
- Geometrically-Constrained Agent for Spatial Reasoning
- MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
- TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
- GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
- ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
- MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
- Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
- Learning Straight Flows: Variational Flow Matching for Efficient Generation
- HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
- Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
- HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
- FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
- FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
- Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
- Forecasting 3D Scanpaths in Egocentric Video
- MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
- When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
- OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
- LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
- PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
- Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
- EvoID: Reinforced Evolution for Identity-Preserving Video Generation
- Photo-Guided Tooth Segmentation on 3D Oral Scan Model
- Interpretable Debiasing of Vision-Language Models for Social Fairness
- Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
- ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
- Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation
- LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
- MacTok: Robust Continuous Tokenization for Image Generation
- Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
- Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
- Hyperbolic Gramian Volumes for Multimodal Alignment
- Learning by Analogy: A Causal Framework for Compositional Generalization
- OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
- UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
- More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
- End-to-End Language-Action Model for Humanoid Whole Body Control
- Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
- Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning
- DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
- Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
- SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
- StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
- LumiMotion: Improving Gaussian Relighting with Scene Dynamics
- Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
- InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection
- Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding
- DIMOS: Disentangling Instance-level Moving Object Segmentation
- SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection
- Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
- SpiderCam: Low-Power Snapshot Depth from Differential Defocus
- MERIT: Multi-domain Efficient RAW Image Translation
- Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization
- FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection
- Deformation-based In-Context Learning for Point Cloud Understanding
- Learning 3D Reconstruction with Priors in Test Time
- Revisiting Model Stitching In the Foundation Model Era
- Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
- iLRM: An Iterative Large 3D Reconstruction Model
- When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
- M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
- FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
- The Drift Kernel: Why Diffusion Models Change Even When Told Not To
- SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
- UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection
- Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
- MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
- ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
- UniChange: Unifying Change Detection with Multimodal Large Language Model
- PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
- VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
- DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
- LoL: Longer than Longer, Scaling Video Generation to Hour
- RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
- TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
- Benchmarking Endoscopic Surgical Image Restoration and Beyond
- CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation
- WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
- tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
- Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach
- DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
- Ego-Grounding for Personalized Question-Answering in Egocentric Videos
- Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
- VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
- A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
- FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution
- 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
- Making the Classification Explanation Faithful to the Confidence Score
- Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
- PARSE: Part-Aware Relational Spatial Modeling
- BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment
- Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering
- Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
- StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
- A Difference-in-Difference Approach to Detecting AI-Generated Images
- Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
- Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
- Seeing Motion Through Polarity for Event-based Action Recognition
- GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer
- Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
- Hyperbolic Busemann Neural Networks
- Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting