CVPR
2026
Papers
CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Seeing Conversations: Communication Context Identification in Egocentric Video
LensWalk: Agentic Video Understanding by Planning How You See in Videos
MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
V-DPM: 4D Video Reconstruction with Dynamic Point Maps
CoWTracker: Tracking by Warping instead of Correlation
Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution
Edit-aware RAW reconstruction
PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping
RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion
ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
Mobile-VTON: High-Fidelity On-Device Virtual Try-On
X-band Radar Non-Line-of-Sight Imaging
FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting
Inter-Photon-Limited Videography
ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery
Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
Plug-and-Play Incomplete Multi-View Clustering via Janus-Faced Affinity Learning with Topology Harmonization
Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model
Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
UniLight: A Unified Representation for Lighting
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
Drainage: A Unifying Framework for Addressing Class Uncertainty
From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
HybridDriveVLA: Vision-Language-Action Model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
FUN REC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
3D-LATTE: Latent Space 3D Editing from Textual Instructions
Block-based Learned Image Compression without Blocking Artifacts
Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data
Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
Scaling Parallel Sequence Models to Vision Foundation Models
InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
PhotoFramer: Multi-modal Image Composition Instruction
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution
SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation
Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
SplitFlux: Learning to Decouple Content and Style from a Single Image
PosterReward: Unlocking Accurate Evaluation for High-Quality Graphic Design Generation
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
Live Interactive Training for Video Segmentation
Multimodal Semantic Bias Mitigation for Diverse Text-To-3D Generation
eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting
VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
Bridging Domains through Subspace-Aware Model Merging
Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement
MMGait: Towards Multi-Modal Gait Recognition
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction
TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
Affine Perspective-Three-Point Problem
Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
Scene Reconstruction as Mapping Priors for 3D Detection
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
Source Models Leak What They Shouldn’t: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
Suppressing Non-Semantic Noise in Masked Image Modeling Representations
Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection
Globscope: Toward a Global View of the Loss Landscape
Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
Token Warping Helps MLLMs Look from Nearby Viewpoints
Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
MatLat: Material Latent Space for PBR Texture Generation
DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning
Self-Attention Driven Tensor Representation for High-Order Data Recovery
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
MuM: Multi-View Masked Image Modeling for 3D Vision
Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
MCHDoc: A Comprehensive Benchmark for Reading Multi-Carrier Chinese Historical Documents
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
Bridging Facial Understanding and Animation via Language Models
DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning
Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars
Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack
PAVAS: Physics-Aware Video-to-Audio Synthesis
Homaloidal parametrization for detecting critical two-view configurations
Hist2Style: Histogram-Guided Stylization with Bilateral Grids
HiFi-BRep: High-Fidelity Latent Representation for Robust B-Rep Generation
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
Hierarchical Process Reward Models are Symbolic Vision Learners
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration
ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing
WorldGen: From Text to Traversable and Interactive 3D Worlds
MAMMA: Markerless Accurate Multi-person Motion Acquisition
Breaking the Continuum: Discrete Distribution Learning for Structural MRI Reconstruction
FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution
CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
Grounded 3D-Aware Spatial Vision-Language Modeling
ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again
SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images
Hyperbolic Prototype Learning with Uncertainty-Aware Consistency for Continual Test-Time Segmentation
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models
MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration
From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery
P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
Compositional Transformation Reasoning for Composed Video Retrieval
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
Subspace Alignment for CLIP-based Continual Learning via Canonical Correlation Analysis
Illuminating Visual Identity in Universal Multimodal Embeddings
VENI: Variational Encoder for Natural Illumination
SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Event Stream Filtering via Probability Flux Estimation
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
MAD: Motion Appearance Decoupling for efficient Driving World Models
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
ORBIT: Benchmarking SfM in the Wild with 360° Video
Correspondence-Attention Alignment for Multi-View Diffusion Models
Causal Motion Diffusion Models for Autoregressive Motion Generation
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Inferring Compositional 4D Scenes without Ever Seeing One
Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
Rethinking Dataset Distillation: Hard Truths about Soft Labels
RADAR: VQ-VAE Decoder of VAR is a Good Student for Restoring Against Degradation by Acceleration
ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
LitePT: Lighter Yet Stronger Point Transformer
Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution
A Mixed Diet Makes DINO An Omnivorous Vision Encoder
Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics
AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network
Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation
OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery
PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
Streamlined Open-Vocabulary Human-Object Interaction Detection
Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
RDF-MIG: A Robust Diffusion Framework for Masked Image Generation to Augment Semantic Segmentation and Change Detection
Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model
ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models
From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
Bias at the End of the Score
Inference-time Physics Alignment of Video Generative Models with Latent World Models
Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
Designing to Forget: Deep Semi-parametric Models for Unlearning
RefTon: Reference person shot assist virtual Try-on
Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
Vista4D: Video Reshooting with 4D Point Clouds
Towards Cross-Modal Preservation, Consistency and Alignment for Privacy-Preserving Visible-Infrared Person Re-Identification
SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors
CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Image Diffusion Preview with Consistency Solver
Voxify3D: Pixel Art Meets Volumetric Rendering
Human Interaction-Aware 3D Reconstruction from a Single Image
The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA
Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis
Modeling the Brain’s Grammar: ROI-Guided fMRI Pretraining for Transferable and Interpretable Vision Decoding
DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
Decoupled and Reusable Adaptation for Efficient Cross-Modal Transfer
High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling
mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing
Diffusion Probe: Generated Image Result Prediction Using CNN Probes
StreamingTOM: Streaming Token Compression for Efficient Video Understanding
LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization
PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP
SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
What Are You Doing? A Closer Look at Controllable Human Video Generation
Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing
V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception
SAM 3D Body: Robust Full-Body Human Mesh Recovery
Time Blindness: Why Video-Language Models Can’t See What Humans Can?
AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance
MoVie: Broaden Your Views with Human Motion for Action Detection
UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Scaling Zero-Shot Reference-to-Video Generation
Latent Implicit Visual Reasoning
CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
Dexterous World Models
DuoGen: Towards Autonomous Interleaved Multimodal Generation
Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
Hunting Normality from Query Sample via Residual Learning for Generalist Anomaly Detection
ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
Emergent Extreme-View Geometry in 3D Foundation Models
Long-Tail Internet Photo Reconstruction
OctoT2I: A Self-Evolving Agentic Text-to-Image Router
PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
OntoAug: Rethinking Generative Data Augmentation via Ontology Guidance
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
Controllable Federated Prompt Learning at Test Time
Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition
Residual Primitive Fitting of 3D Shapes with SuperFrusta
Electromagnetic Inverse Scattering from a Single Transmitter
Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
Structural Graph Probing of Vision–Language Models
Delta Rectified Flow Sampling for Text-to-Image Editing
Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Forensic-Friendly Image Manipulation via Controllable Latent Diffusion
VecGlypher: Unified Vector Glyph Generation with Language Models
REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization
TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery
GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation
WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
One Algorithm to Align Them All
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Towards Multimodal Domain Generalization with Few Labels
Adaptive Confidence Regularization for Multimodal Failure Detection
Personalized Image Descriptions from Attention Sequences
Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising
HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression
Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation
Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
StyleDoctor: Towards Specialist Reward Model for Style-centric Generation Tasks
Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
GazeShift: Unsupervised Gaze Estimation and Dataset for VR
Choreographing a World of Dynamic Objects
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
RewardFlow: Generate Images by Optimizing What You Reward
ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
ViHOI: Human-Object Interaction Synthesis with Visual Priors
BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence
GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
NTK-Guided Implicit Neural Teaching
Trust-calibrated Collaborative Learning for Long-Tailed Visual Recognition
Specificity-aware reinforcement learning for fine-grained open-world classification
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Exploring Spatial Intelligence from a Generative Perspective
How to Take a Memorable Picture? Empowering Users with Actionable Feedback
COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs
ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation
DRM: Diffusion-based Reward Model With Step-wise Guidance
CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding
GGPT: Geometry-Grounded Point Transformer
Simpleposter: A Simple Baseline For Product Poster Generation
SpotEdit: Selective Region Editing in Diffusion Transformers
Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring
SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
GROW: Watermark Generation with Progressive Guidance for Diffusion Models
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
Archon: A Unified Multimodal Model for Holistic Digital Human Generation
rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
RecTok: Reconstruction Distillation along Rectified Flow
Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
Fast Spatial Tracking with Visual Geometry Transformer
CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
A Faster Path to Continual Learning
Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks
Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
LangField4D: Learning Identity-Adaptive and Spatio-Temporal Continuous 4D Language Fields for Dynamic Scenes
Enhancing Out-of-Distribution Detection with Extended Logit Normalization
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
VAST: Video Ability-Stratified Taxonomy for Data-Efficient Video Reasoning
OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion
MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
Linear Fundamental Matrix Estimation from 7 or 5 Points
INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
Language-driven Fine-grained Retrieval
LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation
InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast
Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis
HoneyBee: Data Recipes for Vision-Language Reasoners
Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights
EgoSound: Benchmarking Sound Understanding in Egocentric Videos
PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
Solvability of the Viewing Graph Under the Affine Camera Model
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
Event6D: Event-based Novel Object 6D Pose Tracking
An Efficient Token Compression Framework for Visual Object Tracking
MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
Beyond the Ground Truth: Enhanced Supervision for Image Restoration
Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
ResCa: Residual Caching for Diffusion Transformers Acceleration
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Make it SING: Analyzing Semantic Invariants in Classifiers
Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image
MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
SonoWorld: From One Image to a 3D Audio-Visual Scene
MeshSplatting: Differentiable Rendering with Opaque Meshes
LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
Understanding and Enforcing Weight Disentanglement in Task Arithmetic
Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation
Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
Learning Multi-View Spatial Reasoning from Cross-View Relations
Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting
Semantic Audio-Visual Navigation in Continuous Environments
Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification
EDGS: Eliminating Densification for Efficient Convergence of 3DGS
Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration
SVBench: Evaluation of Video Generation Models on Social Reasoning
ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
VT-Intrinsic: Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair
TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
LA-Pose: Latent Action Pretraining Meets Pose Estimation
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Reviving ConvNeXt for Efficient Convolutional Diffusion Models
Nonlinear Color Transfer via Learnable Bezier Flows
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
TANGO: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems
GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
Transition Matching Distillation for Fast Video Generation
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction
Streaming Video Instruction Tuning
Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
Eulerian Gaussian Splatting using Hashed Probability Pyramids
Learning a Unified Latent Action Space from Videos with Action-centric Cycle Consistency
Diffusion Mental Averages
SkyReels-Text: Fine-Grained Font-Controllable Text Editing for Poster Design
Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
Occluded Human Body Capture with Frequency Domain Denoising Prior
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy
Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos
A Polarized Reflection and Material Dataset of Real World Objects
TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection
MedFG-VQA: Low-Frequency Memory and Graph Attention for Lightweight Medical VQA
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer
Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework
Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation
TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection
Masked Region Transformer for Layered Image Generation and Editing at Scale
VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
Ultra-Fast Neural Video Compression
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
MPL: Match-guided Prototype Learning for Few-shot Action Recognition
ModularAgent: A Task-Aware Modular Framework for Joint Optimization of Multimodal Large Language Models and World Models
Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux
Weight Space Representation Learning via Neural Field Adaptation
SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection
ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation
Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos
LNEM: Lunar Neural Elevation Model
GUI-SAGE: Enhancing GUI Automation with Self-Explanatory Learning
DiffBMP: Differentiable Rendering with Bitmap Primitives
FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction
SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
B^3-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
Multi-speaker Attention Alignment for Multimodal Social Interaction
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
TruckDrive: Long-Range Autonomous Highway Driving Dataset
Gyro-based Deep Video Deblurring
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks
WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation
OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation
GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
TrajTok: Learning Trajectory Tokens Enhances Video Understanding
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
AToken: A Unified Tokenizer for Vision
SO-Bench: A Structural Output Evaluation of Multimodal LLM
SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning
Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data
Image-based Outlier Synthesis With Training Data
Spot The Ball: A Benchmark for Visual Social Inference
Frequency-domain Manipulation for Face Obfuscation
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection
CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy
DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks
MatE: Material Extraction from Single-Image via Geometric Prior
TempoControl: Temporal Attention Guidance for Text-to-Video Models
DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
Endless World: Real-Time 3D-Aware Long Video Generation
Lenses: Toward Polysemous Vision–Language Understanding
Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
GraspALL: Adaptive Structural Compensation from Illumination Variation for Robotic Garment Grasping in Any Low-Light Conditions
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
ViT^3: Unlocking Test-Time Training in Vision
Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
Advancing Image Classification with Discrete Diffusion Classification Modeling
Does YOLO Really Need to See Every Training Image in Every Epoch?
Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks
NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices
Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion
Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging
Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack
AVGGT: Rethinking Global Attention for Accelerating VGGT
ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation
Complet4R: Geometric Complete 4D Reconstruction
Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models
Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning
MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
ConsistCompose: Unified Multimodal Layout Control for Image Composition
A Training-Free Style-Personalization via SVD-Based Feature Decomposition
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
EmoStyle: Emotion-Driven Image Stylization
Text-Image Conditioned 3D Generation
IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework
Reasoning Diffusion for Unpaired Test Time Out-of-distribution Text-Image to Video Generation
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
MTA: Multimodal Task Alignment for BEV Perception and Captioning
β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models
EvoGraph-R1: Self-Evolving Multimodal Knowledge Hypergraphs for Agentic Retrieval
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation
Label-Free Cross-Task LoRA Merging with Null-Space Compression
Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation
GeCo: Geometry-Consistent Regularization for Domain Generalized Semantic Segmentation
Event-based Motion Deblurring with Unpaired Data
Event-based Visual Deformation Measurement
Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
SpikeTrack: High-performance and Energy-efficient Event-Based Object Tracking with Spiking Neural Network
Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching
Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
ORV: 4D Occupancy-centric Robot Video Generation
Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models
CompBench: Benchmarking Complex Instruction-guided Image Editing
Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories
Learning Personalized Photographic Style from Pairwise User Preferences
CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing
Efficient Weighted Sampling via Score-based Generative Models
Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
Retrieve-to-Restore: Efficient All-in-One Image Restoration with a Retrieval-Based Degradation Bank
MRI Contrast Enhancement Kinetics World Model
ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
Rethinking Knowledge Transfer in Image Quality Assessment: A Perceptual Preference Structure Alignment Perspective
ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models
White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation
Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy
LF-BVN: Blind-View Network for Self-Supervised Light Field Denoising
Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment
Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation
Diffusion-Based Native Adversarial Synthesis for Enhanced Medical Segmentation Generalization
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Towards Human-Imperceptible Backdoor Attacks on Text-to-Image Diffusion Models
TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
DualMirage: Hunting Stealthy Multimodal LLM Agents via CAPTCHAs with Contour and Adversarial Illusions
Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
A Unified Perspective on Adversarial Membership Manipulation in Vision Models
Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering
Bootstrapping Multi-view Learning for Test-time Noisy Correspondence
FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Models
Vision-Speech Models: Teaching Speech Models to Converse about Images
Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data
Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping
RefAV: Towards Planning-Centric Scenario Mining
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO
POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
Scaling Multi-Identity Consistency for Image Customization via Multi-to-Multi Matching Paradigm
NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
Benchmarking Single-Factor Physical Video-to-Audio Generation
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Refaçade: Editing Object with Given Reference Texture
Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
Not All Birds Look The Same: Identity-Preserving Generation For Birds
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
Clothe and Pose
FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Bidirectional Normalizing Flow: From Data to Noise and Back
ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
Are Image-to-Video Models Good Zero-Shot Image Editors?
FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness
Thermal Diffusion Matters: Infrared Spatial-Temporal Video Super-Resolution through Heat Conduction Priors
TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance
SignPR: A Progressive Vector-Quantized Diffusion Framework for Sign Language Production
LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
Decoupled Generative Modeling for Human-Object Interaction Synthesis
GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models
MatMart: Material Reconstruction of 3D Objects via Diffusion
Region-Adaptive Sampling for Diffusion Transformers
Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding
Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction
PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
ARC Is a Vision Problem!
Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions
S^2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
RetFormer: Multimodal Retrieval for Enhancing Image Recognition
DREAM: Document Recognition with Explicit Adaptive Memory
POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
RiskProp: Collision-Anchored Self-Supervised Risk Propagation For Early Accident Anticipation
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning
TRCoRSurg: Temporal-Relational Co-Reasoning for Surgical Video Triplet Recognition
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
One-Shot Flow, Any-Time Frame: A Bidirectional Warping Framework for Event-Based Video Frame Interpolation
PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition
Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
PhaseWin Search Framework Enable Efficient Object-Level Interpretation
Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
C-LaV: Conditional Latent Velocity Field Denoising for Weather-Robust LiDAR Place Recognition
Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
AnyPcc: Compressing Any Point Cloud with a Single Universal Model
CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
Neural Distribution Prior for LiDAR Out-of-Distribution Detection
ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
Dynamic Logits Adjustment and Exploration for Test-Time Adaptation in Vision Language Models
CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment
Bidirectional Query-Driven Generation of Parametric CAD Sketch
Repurposing 3D Generative Model for Autoregressive Layout Generation
CAD-Refiner: A Unified Framework for CAD Generation and Iterative Editing
A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images
Global Information Thresholding for Sufficient and Necessary Circuits
PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal
Decoupling Defense Strategies for Robust Image Watermarking
DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models
Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport
FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation
AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding
HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
Coordinate Denoising for Non-Equilibrium Molecular Representation Learning
Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding
Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving
Think Before You Drive: World Model-Inspired Multimodal Grounding
DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
Consistent Instance Field for Dynamic Scene Understanding
Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation
Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
Is Parameter Isolation Better for Prompt-Based Continual Learning?
Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning
Elastic Weight Consolidation Done Right for Continual Learning
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
InfinityHuman: Towards Long-Term Audio-Driven Human Animation
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars
PhysHead: Simulation-Ready Gaussian Head Avatars
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction
FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures
Reallocating Attention Across Layers to Reduce Multimodal Hallucination
Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination
Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation
EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization
Physical Simulator In-the-Loop Video Generation
Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification
Mind the Gap: Transferring Labels to Align Object Detection Datasets
SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection
X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection
FVBench: Benchmarking Deepfake Video Detection Capability of Large Multimodal Models
AKCMamba-YOLO: Selective State Space Models For Real-Time Object Detection
When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse
Your One-Stop Solution for AI-Generated Video Detection
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
Reflection Separation from a Single Image via Joint Latent Diffusion
MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
VMonarch: Efficient Video Diffusion Transformers with Structured Attention
DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers
Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting
MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More
A Causal Marriage between VLM and IRM from Understanding to Reasoning
Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
Agile Deliberation: Concept Deliberation for Subjective Visual Classification
Tea-Adapter: Teacher Adapter for Efficient Conditional Generation
Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction
RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting
Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors
Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting
Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression
A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain
SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Beyond Depth: Evaluating the Width-centric Reasoning Capability of MLLMs
GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension
CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
Perceptual 3D Simulation With Physical World Modeling
Multi-Scale Local Speculative Decoding for Image Generation
Beyond Single Solution: Multi-Hypothesis Deep Unfolding Network for Image Compressive Sensing
Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
Discovering Adaptive Task Dependencies for Efficient Multi-Task Representation Compression
OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data
Perceptual Neural Video Compression with Color Separation and Rank Chain
Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
Watch and Learn: Learning to Use Computers from Online Videos
OneThinker: All-in-one Reasoning Model for Image and Video
Incentivizing Versatile Video Reasoning in MLLMs via Data-Efficient Reinforcement Learning
Act2See: Emergent Active Visual Perception for Video Reasoning
ViLoMem: Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
ReMoT: Reinforcement Learning with Motion Contrast Triplets
Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
Semantic-Guided Global-Local Collaborative Prompt Learning for Few-Shot Class Incremental Learning
Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
Bridging the Modality Gap in Compositional Zero-Shot Learning via Sparse Alignment and Unimodal Memory Bank
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models
BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
Lens Component Deletion based on Differentiable Ray Tracing
3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion
UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
Polarization State Tracing for Reflection Removal and Color-Consistent Reconstruction
GFRRN: Explore the Gaps in Single Image Reflection Removal
Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation
Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference
Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images
HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
SO(3)-Equivariant ViT-Adapter for Data-Efficient Zero-Shot Sim-to-Real Indoor Panoramic Depth Estimation
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
XPaintNet: An eXtreme Lightweight Framework for Stereoscopic Conversion without Inpainting Network
MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation
The Midas Touch for Metric Depth
WonderZoom: Multi-Scale 3D World Generation
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation
Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation
PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
The Drift Kernel: Why Diffusion Models Change Even When Told Not To
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
Hierarchical Action Learning for Weakly-Supervised Action Segmentation
Gamba: Mamba-based graph convolutional network with dynamic graph topology learning for action recognition
Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors
RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning
NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
MangoBench: A Benchmark for Multi-Agent Goal-Conditioned Offline Reinforcement Learning
MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents
Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection
Unlearning without Forgetting: Securely Removing Targeted Concepts from Large-Scale Vision-Language Open-Vocabulary Detectors
UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection
PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation
High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy
GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization
MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision
Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency
Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
SARMAE: Masked Autoencoder for SAR Representation Learning
LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning
All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference
A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs
Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting
RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
Chain of World: World Model Thinking in Latent Motion
Scalable Feature Matching via State Space Modeling and Sparse Correlation
Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
GeoDexGrasp: Geometry-aware Generation for Data-efficient and Physics-plausible Dexterous Grasping
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning
Language-Grounded Decoupled Action Representation for Robotic Manipulation
Learning to Act Robustly with View-Invariant Latent Actions
SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking
FlowFM: Advancing Dark Optical Flow Estimation with Flow Matching
Drift-Resilient Temporal Priors for Visual Tracking
No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
Momentum Memory for Knowledge Distillation in Computational Pathology
Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding
X-WIN: Building Chest Radiograph World Model via Predictive Sensing
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model
TIM: Temporal Decoupling with Iterative Mutual-Refinement Model for Longitudinal Radiology Report Generation
Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic
CMR-RD: Long-Tailed Adaptive VLM for Explainable CMR Diagnosis
Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis
FBTA: Enabling Single-GPU End-to-End Gigapixel WSI Classification with Feature Bridging and Translation Alignment
Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances
Egocentric Visibility-Aware Human Pose Estimation
Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
Gaussian-Mixture Latent Flow for Stochastic 3D Human Motion Prediction
Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding
Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection
PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
PhyGaP: Physically-Grounded Gaussians with Polarization Cues
Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
Selfi: Self-improving Reconstruction Engine via 3D Geometric Feature Alignment
Z-Order Transformer for Feed-Forward Gaussian Splatting
4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement
Affostruction: 3D Affordance Grounding with Generative Reconstruction
Unified Primitive Proxies for Structured Shape Completion
ART: Articulated Reconstruction Transformer
S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer
D-Prism: Differentiable Primitives for Structured Dynamic Modeling
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
LaS-Comp: Zero-shot 3D Completion with Latent–Spatial Consistency
Pano360: Perspective to Panoramic Vision with Geometric Consistency
EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
OneHOI: Unifying Human-Object Interaction Generation and Editing
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation
TV2TV: A Unified Framework for Interleaved Language and Video Generation
Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
PureCC: Pure Learning for Text-to-Image Concept Customization
Yume1.5: A Text-Controlled Interactive World Generation Model
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and VLM-Guided Optimization
FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits
LVLM-Aided Alignment of Task-Specific Vision Models
PG-VTON: Single-Pass Training-Free Virtual Try-On via Patch-Guided Reference Alignment
Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment
Scaling Spatial Intelligence with Multimodal Foundation Models
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
Bridging Domain Expertise and Generalization for Performance Estimation
DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection
Scaling Dense Event-Stream Pretraining from Visual Foundation Models
Time-Specialized Event-Image Alignment for Blur-to-Video Decomposition
Unsupervised 3D Motion Estimation Using Event Camera
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
AstraNav-Memory: Contexts Compression for Long Memory
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance
Radiance Meshes for Volumetric Reconstruction
Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis
ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
Splatent: Splatting Diffusion Latents for Novel View Synthesis
Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons
Residual Diffusion Bridge Model for Image Restoration
MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration
Rectifying Latent Space for Generative Single-Image Reflection Removal
Towards Generalized Multimodal Homography Estimation
HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior
FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model
SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection
Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
CROWn: A Unified Framework for Anti-Aliased Downsampling and Phase-Calibrated Fusion in 3D Medical Segmentation
Rethinking Box Supervision: Bias-Free Weakly Supervised Medical Segmentation
Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
Breaking Multimodal LLM Safety via Video-Driven Prompting
When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
UniDef: Universal Defense Against Unauthorized Image Manipulation
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals
Rethinking Cross-Modal Anchor Alignment for Mitigating Error Accumulation
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning
Anti-Degradation Lifelong Multi-View Clustering
The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis
Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
TouchDream: 3D Object Completion through Imagined Touch
MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis
LogCD: Local-to-global Consistency Distillation for Few-step Image Generation
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Parallel Jacobi Decoding for Fast Autoregressive Image Generation
CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
EchoVDiff: Cardiac-Cycle Echocardiography Video Generation from Arbitrary Frame
Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
SURF: Signature-Retained Fast Video Generation
Lynx: Towards High-Fidelity Personalized Video Generation
VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
First Frame Is the Place to Go for Video Content Customization
Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs
MultiAnimate: Pose-Guided Image Animation Made Extensible
Translating Signals to Languages for sEMG-Based Activity Recognition
Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment
MotionHiFlow: Text-to-Motion via Hierarchical Flow Matching
LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
GVIS: Generative Vector Image Steganography
MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
GeoRK2: Geometry-Guided Runge–Kutta Integration for Diffusion Transformer Acceleration
A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
A³: Towards Advertising Aesthetic Assessment
Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
VL-RouterBench: A Benchmark for Vision–Language Model Routing
G²VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning
Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
Beyond Global Similarity: Multi-Conditional Retrieval for Fine-Grained Cross-Modal Understanding
Mitigating The Distribution Shift of Diffusion-based Dataset Distillation
What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F1
Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence
Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance
SPDMark: Selective Parameter Displacement for Robust Video Watermarking
Memory Matters: Boosting Training-Free Zero-Shot Temporal Action Localization with a Learnable Lookup Table
TVHighlights: LLM-Guided Human-Free Collaborative Training for Video Highlight Detection in Movies and TV Dramas
FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding
Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts
Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
UniCorrn: Unified Correspondence Transformer Across 2D and 3D
Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection
Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation
PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR
Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
Role-SynthCLIP: A Role-Play Driven Diverse Synthetic Data Approach
PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
MapRoute: Precise-Concept Erasing Mappers via Semantic Routing
Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning
FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning
FedSST: Rethinking Fair Federated Graph Learning under Structural Shift
GDFA: Geometry-Driven Federated Unlearning with Directional Task Vector Alignment
FedARA: Resource-adaptive Low-rank Personalized Federated Learning via Anchor-driven Representation Alignment on Heterogeneous Edge Devices
InterRVOS: Interaction-Aware Referring Video Object Segmentation
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
MeToM: Metadata-Guided Token Merging for Efficient Video LLMs
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
Neural Collapse in Test-Time Adaptation
Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction
Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
CrackSSM: Reviving SSMs for Crack Segmentation via Dynamic Scanning
BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation
Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification
Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
Dual-Estimator: Decoupling Global and Local Semantic Shift for Drift Compensation in Class-Incremental Learning
Continual Distillation of Teachers from Different Domains
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning
HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering
TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation
STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
Globally Optimal Pose from Orthographic Silhouettes
Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos
WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling
Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification
Assignment-Driven Hash Learning in a Hyper-Semantic Space for On-the-Fly Category Discovery
DyFCLT: Dynamic Frequency-Decoupled Cross-Modal Learning Transformer for Multimodal Tiny Object Detection
EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
Building a Precise Video Language with Human–AI Oversight
CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection
Towards Sparse Video Understanding and Reasoning
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
Towards High-resolution and Disentangled Reference-based Sketch Colorization
Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers
Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
Guiding a Diffusion Transformer with the Internal Dynamics of Itself
CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception
Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
Learning What Helps: Task-Aligned Context Selection for Vision Tasks
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
NeuroRule: Bridging Vision and Logic with Differentiable Rule Induction
Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter
JoPPO: Hierarchical Photography Assessment via Contrastive Joint Conditional Probabilistic Reinforcement Learning
AeroAgent: A Vision–Physics–Decision Framework for Aerodynamic Vehicle Design
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting
iSplat: Iterative Learning for Fine-Grained Gaussian Splatting
Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
MAPo: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction
FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario
Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering
HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation
PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting
4C4D: 4 Camera 4D Gaussian Splatting
SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
Disco-GS: Gaussian Splatting in Dynamic Color Lighting
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Stable and Efficient Single-Rollout RL for Multimodal Reasoning
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Monet: Reasoning in Latent Visual Space Beyond Image and Language
OneSparse: A Unified Framework for Sparse Activation Layers in Vision Models
What Matters in Practical Learned Image Compression
BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder
LazyVAR: Accelerating Visual Autoregressive Models via Scale-wise Token Pruning and Parallel Group Decoding
Spk2VidNet: A Hierarchical Recurrent Architecture for High-Fidelity Video Reconstruction from Long Spike-Camera Streams
Adaptive Learned Image Compression with Graph Neural Networks
SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning
WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
APPO: Attention-guided Perception Policy Optimization for Video Reasoning
EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling
Bridging Human Evaluation to Infrared and Visible Image Fusion
Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion
From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation
FlowComposer: Composable Flows for Compositional Zero-Shot Learning
ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
CamPI: Physical Adversarial Examples through Camera Power Signal Injection
UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
GSNR: Graph Smooth Null-Space Representation for Inverse Problems
αMatte4K & µMatting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting
Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields
Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective
GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
Any Resolution Any Geometry: From Multi-View To Multi-Patch
Paparazzo: Active Mapping of Moving 3D Objects
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
Variational Graph-based Normal Integration
Vinedresser3D: Towards Agentic Text-guided 3D Editing
MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
MeshRipple: Structured Autoregressive Generation of Artist-Meshes
FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References
Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning
Dynamic Momentum Recalibration in Online Gradient Learning
MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
E²-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
Spectral Conformal Risk Control: Distribution-Free Tail Guarantees via Bayesian Quadrature
Edge-RecViT: Efficient Vision Transformer via Semantic-Refined Dynamic Recursion
ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
HiconAgent: History Context-aware Policy Optimization for GUI Agents
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection
Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
Common Inpainted Objects In-N-Out of Context
Prompt-Free Universal Region Proposal Network
Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network
Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains
Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
CDICS: Delving Into Fine-Grained Attribute for In-Context Segmentation via Compositional Prompts and Phased Decoupling
Discriminative Perception via Anchored Description for Reasoning Segmentation
SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images
Best Segmentation Buddies for Image-Shape Correspondence
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation
Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
Rotation Invariant and Symmetry Aware Pixel Difference Network for Remote Sensing Object Detection
F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation
RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction
PGA: Prior-free Generative Attack for Practical No-box Scenario
Lipschitz Optimization for Formal Verification of Homographies
Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
Mitigating Error Amplification in Fast Adversarial Training
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs
Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
Cross-Hand Latent Representation for Vision-Language-Action Models
Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts
GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
Rethinking Occlusion Modeling for UAV Tracking
Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking
Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions
TrackMAE: Video Representation Learning via Track Mask and Predict
Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking
Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
Multimodal Causality-Driven Representation Learning for Generalizable Medical Image Segmentation
TopoSlide: Topologically-Informed Histopathology Whole Slide Image Representation Learning
Beyond the Static-World: Lifelong Learning for All-in-One Medical Image Restoration
Hyperbolic Relational Prompts for Intersectional Fairness in Medical VLMs
RNED: Rotary Number Encoding and Decoding for Quantitative Medical VLM Analysis
MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation
KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization
Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning
ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections
VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery
EMMA: Extracting Multiple physical parameters from Multimodal Data
ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition
DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs
E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation
Bézier Degradation Modeling for LiDAR-based Human Motion Capture
Illumination-Consistent Human-Scene Reconstruction from Monocular Video
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution
Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection
Enabling Supervised Learning of Generative Signatures for Generalized Synthetic Image Detection
DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective
Omni-AD: A Large-scale and Versatile Benchmark for Industrial Anomaly Detection
Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection
LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation
Guiding a Diffusion Model by Swapping Its Tokens
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Streaming Diffusion Model for Fast Infrared and Visible Video Fusion
CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling
GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
S²AM3D: Scale-controllable Part Segmentation of 3D Point Clouds
AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows
ChordEdit: One-Step Low-Energy Transport for Image Editing
Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface
Native and Compact Structured Latents for 3D Generation
MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
MARCO: Navigating the Unseen Space of Semantic Correspondence
Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
SMVRT: Implicit Human 3D Modeling Using Sparse Multi-View Volumetric Reconstruction with Transformer Fusion
LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving
Any4D: Unified Feed-Forward Metric 4D Reconstruction
Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
Parallelised Differentiable Straightest Geodesics for 3D Meshes
Geometry-Aligned and Anomaly-Aware Reconstruction for 3D Anomaly Detection
DVGT: Driving Visual Geometry Transformer
Hyper-PCN: Hypergraph-Based Point Cloud Completion via High-Order Correlation Modeling
MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
Foundation Encoders Are All You Need for Preference-Aware Personalization
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
ThinkGen: Generalized Thinking for Visual Generation
CoLoGen: Progressive Learning of Concept–Localization Duality for Unified Image Generation
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
Visual Personalization Turing Test
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
Hierarchical Enhancement of Semantic Priors for Disentangled Text-Driven Motion Generation
PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
Gravitation-Driven Semantic Alignment for Text Video Retrieval
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
M^3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
PersonaVLM: Long-Term Personalized Multimodal LLMs
MR-RAG: Multimodal Relevance-Aware Retrieval-Augmented Generation for Medical Visual Question Answering
Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective
CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning
EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
AE2VID: Event-based Video Reconstruction via Aperture Modulation
From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation
Spike-driven Discrete Aggregation for Event-based Object Detection
x^2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space
FloVerse: Floor Plan-Guided Multi-Modal Navigation
TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
Rethinking Visual Rearrangement from A Diffusion Perspective
APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation
Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation
InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
When Robots Should Say “I Don’t Know”: Benchmarking Abstention in Embodied Question Answering
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
Towards Training-free Scene Text Editing
VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
BiProLoRA: Bilevel Prompt LoRA for Real Scene Recovery
Degradation-Consistent Test-Time Adaptation for All-in-One Image Restoration
CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation
2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal
Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration
Dynamic Exposure Burst Image Restoration
FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction
Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration
PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
Hidden Dangers of Compositional Generation: Diagnosing Semantic Safety Failures in Text-to-Image Models
JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
Few-shot Acoustic Synthesis with Multimodal Flow Matching
Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
Conflict-Aware Adaptive Cross-Reconstruction for Multimodal Sentiment Analysis
EduDiag: A Benchmark for Educational Diagnostic Reasoning with Error Tracing and Correction on Large Multimodal Models
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
Cross-Modal Guided Visual Synthesis for Data-Efficient Multimodal Depression Recognition
AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On
UniSER: A Foundation Model for Unified Soft Effects Removal
EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
Plenoptic Video Generation
PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Linear Image Generation by Synthesizing Exposure Brackets
Low-Resolution Editing is All You Need for High-Resolution Editing
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
MoCha: End-to-End Video Character Replacement without Structural Guidance
Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution
DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment
Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
Next-Scale Autoregressive Models for Text-to-Motion Generation
Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds
RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
Progressive Guessing to Fixed Point: Rethinking Human Motion Prediction with Deep Equilibrium Models
ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion
Prototype-Guided Concept Erasure in Diffusion Models
CARD: Correlation Aware Restoration with Diffusion
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
Towards Policy-Adaptive Image Guardrail: Benchmark and Method
TextFM: Robust Semi-dense Feature Matching with Language Guidance
What’s Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation
SpatialTree: How Spatial Intelligence Branches Out in MLLMs
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Language-Free Generative Editing from One Visual Example
ReMatch: Boosting Representation through Matching for Multimodal Retrieval
RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach
Modeling the Visual Ambiguity of Human Sketches
SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
WeaveTime: Streaming from Earlier Frames into Emergent Memory in VideoLLMs
Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
InternVideo-Next: Towards World-Understanding Video Models
Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency
A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Explaining Object Detectors via Collective Contribution of Pixels
H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
Evaluating Generative Models via One-Dimensional Code Distributions
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds
Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment
MS^2Gait: A Multi-Scale Spatio-Temporal Fusion Network for LiDAR-based Gait Recognition
Foundry: Distilling 3D Foundation Models for the Edge
Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
FedMPT: Federated Multi-Label Prompt Tuning of Vision-Language Models
SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs
Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Animator-Centric Skeleton Generation on Objects with Fine-Grained Details
Synthesizing Visual Concepts as Vision-Language Programs
Semantic Scale Space: A Framework for Controllable Image Abstraction
DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment
Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models
Domain-Skewed Federated Learning with Feature Decoupling and Calibration
Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
Few-for-Many Personalized Federated Learning
ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning
Domain Sensitive Federated Learning with Fisher-Informed Pruning
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos
CVA: Context-aware Video-text Alignment for Video Temporal Grounding
OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Towards Unified Human Perception and Machine Understanding: Token Flow Guided Compression Framework
Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
One Layer’s Trash is Another Layer’s Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs
Tunable Soft Equivariance with Guarantees
Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score
Cluster-aware Anchor Learning for Multi-View Clustering
Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning
Recurrent Video Masked Autoencoders
Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning
Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
Spatial Retrieval Augmented Autonomous Driving
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction
URScenes: A Multi-scenario Dataset for Unstructured Road Environments
MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving
SAMosaic3D: Modular Scene Assembly for Real-Time 3D Segment Anything
Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation
MARSS: Radar Semantic Segmentation via Modular Attention and State Space Models
MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
PACT: Phase-Like Transition Constraints in Adapter-Based Continual Learning of Vision-Language Models
Re-evaluating Continual VQA: Toward Fair and Robust Evaluation for Multimodal Continual Learning
Distilling Balanced Knowledge from a Biased Teacher
Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
D^3FER: Dual Channel and Dual Branch Network for Robust Facial Expression Recognition under Dual Challenges
ExpPortrait: Expressive Portrait Generation via Personalized Representation
PersonaLive! Expressive Portrait Image Animation for Live Streaming
ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations
Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
UIKA: Fast Universal Head Avatar from Pose-Free Images
FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation
CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models
Generative Video Motion Editing with 3D Point Tracks
BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
Stereo World Model: Camera-Guided Stereo Video Generation
CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency
SpatialDiff: 3D-Aware Object Movement via Implicit Spatial Modeling
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
Zoo3D: Zero-Shot 3D Object Detection at Scene Level
YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection
WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing
Object-Generalized Re-Identification: A Step Towards Universal Instance Perception
When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph
Beyond Caption-Based Queries in Video Moment Retrieval
VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos
An Empirical Study on How Video-LLMs Answer Video Questions
UniComp: Rethinking Video Compression Through Informational Uniqueness
NaTex: Seamless Texture Generation as Latent Color Diffusion
All-in-One Slider for Attribute Manipulation in Diffusion Models
From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution
Beyond Appearance: Camouflaged Object Detection via Geometric Structure
CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception
Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
Draft and Refine with Visual Experts
R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
μVLM: A Vision Language Model for μNPUs
Gaussian Mapping for Evolving Scenes
AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance
PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors
Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting
AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting
Robust3DGSW: Toward Robust Watermarking for Quantization-Aware 3D Gaussian Splatting
ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
L^2DGS: Low-Light Dynamic Gaussian Splatting
Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection
DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images
HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token
Mario: Multimodal Graph Reasoning with Large Language Models
Boosting Reasoning in Large Multimodal Models via Activation Replay
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding
OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe
SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
Precise Object and Effect Removal with Adaptive Target-Aware Attention
Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
FreqSIC: Frequency-aware Stereo Image Compression with Bi-directional Checkerboard Context Model
SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization
Fusion of Depth and Semantics for Probabilistic Floorplan Localization
A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors
Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning
DPGF-Net: Dual-Prior Guided Fusion Network for Joint Assessment of Perceptual Quality and Semantic Consistency in AI-Generated Images
RegionFuse: Region-Adaptive Pixel Distribution Learning for Infrared and Visible Image Fusion
Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared
VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion
TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection
DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification
Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
Balanced Dataset Distillation via Modeling Multiple Visual Pattern Distribution
Grid Distillation: Compositional Image Distillation via Structured Generative Grids
Dataset Distillation by Influence Matching
StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
Seeing Through Blur: Tackling Defocus in Spike-Based Imaging
Distilling Quasi-Conformal Mapping: A Generalizable and Efficient Solution for Wide-Angle Correction
LightRR: A Lightweight Network for Single Image Reflection Removal
HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter
Coded-E2LF: Coded Aperture Light Field Imaging from Events
Kaleidoscopic Scintillation Event Imaging
gQIR: Generative Quanta Image Reconstruction
Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
FE2E: From Editor to Dense Geometry Estimator
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
NI-Tex: Non-isometric Image-based Garment Texture Generation
UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
LoST: Level of Semantics Tokenization for 3D Shapes
Lafite: A Generative Latent Field for 3D Native Texturing
Image-Guided Geometric Stylization of 3D Meshes
LATTICE: Democratize High-Fidelity 3D Generation at Scale
Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement
MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation
Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
Fine-Grained GRPO for Precise Preference Alignment in Flow Models
Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
Self-Corrected Image Generation with Explainable Latent Rewards
Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
Reading Your Actions: Learning Generalizable Action Representations via Pre-training AEMG
MA-Bench: Towards Fine-grained Micro-Action Understanding
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
Protect to Adapt: Subspace-Constrained Adaptation with Ranked Negative Prompt Feedback for Few-Shot Action Recognition
SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition
InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search
S^2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
Learning to Solve PDEs on Neural Shape Representations
Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis
Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
Annotation-Efficient Coreset Selection for Context-dependent Segmentation
ALLNet: Multi-task Dense Prediction for Degraded Images
Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
Volumetric Functional Maps
GenMask: Adapting DiT for Segmentation via Direct Mask Generation
Frequency-Aware Affinity for Weakly Supervised Semantic Segmentation
Learning and Aligning Click-Aware Shape Prior for Interactive Amodal Instance Segmentation
Beyond Reassembly: Fractured Object Recovery with Missing Parts
RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
Parallel Rigidity Matters for Bundle Adjustment
GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective
STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting
GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding
Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
Improving Adversarial Transferability with Local Perturbation Augmentation
Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs
Stealing Split Learning Bottom Models by Recovering Embedding Geometry
No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks
Where, What, Why: Toward Explainable 3D-GS Watermarking
Robust Spiking Neural Networks by Temporal Mutual Information
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Localizing, Structuring, and Rendering: Bridging 3D and 2D Vision-Language-Action Models for Robotic Manipulation
Humanoid Generative Pre-Training for Zero-Shot Motion Tracking
EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
CUBic: Coordinated Unified Bimanual Perception and Control Framework
RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model
GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
UETrack: A Unified and Efficient Framework for Single Object Tracking
Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel
Learning to Track Instance from Single Nature Language Description
Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing
Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
From Infusion to Assimilation Distillation for Medical Image Segmentation
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding
Turning Pre-Trained Vision Transformers into End-to-End Histopathology Whole Slide Image Models for Survival Prediction
A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation
KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis
Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging
Learning complete and explainable visual representations from itemized text supervision
EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose
SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement
LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance
Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
SAGA: Source Attribution of Generative AI Videos
VMD-FACT: A New Video Dataset and MLLM-based method for Detecting Realistic AI-Generated Video Misinformation
ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection
Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer
CHAL: Causal-guided Hierarchical Anomaly-aware Learning for Moving Infrared Small Target Detection
RAID: Retrieval-Augmented Anomaly Detection
ADSeeker: A Knowledge-Grounded Reasoning Framework for Industry Anomaly Detection and Reasoning
QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception
Grounded Latents for Entity-Centric 4D Scene Generation
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
SoccerMaster: A Vision Foundation Model for Soccer Understanding
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation
AceTone: Bridging Words and Colors for Conditional Image Grading
R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference
Sparse–View Localization via Online Neural 3D Regression
Dynamic Visual SLAM using a General 3D Prior
Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling
Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations
Global Structure-from-Motion Meets Feedforward Reconstruction
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
CREward: A Type-Specific Creativity Reward Model
LumiX: Structured and Coherent Text-to-Intrinsic Generation
OmniGen2: Towards Instruction-Aligned Multimodal Generation
LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
FlowFixer: Towards Detail-Preserving Subject-Driven Generation
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
FEAT: Fashion Editing and Try-On from Any Design
PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
MoBind: Motion Binding for Fine-Grained IMU–Video Pose Alignment
Tackling Model Bias via Game-theoretic Multi-agent Collaboration Framework for Hateful Meme Classification
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
SG-LoRA: Semantic-guided LoRA Parameters Generation
AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
Reframing Long-Tailed Learning via Loss Landscape Geometry
DC-Merge: Improving Model Merging with Directional Consistency
NEC-Diff: Noise-Robust Event-RAW Complementary Diffusion for Seeing Motion in Extreme Darkness
Towards Persistence: Learning Topological Constraints for Event-based Small Object Detection
Geometric-Photometric Event-based 3D Gaussian Ray Tracing
EventDrive: Event Cameras for Vision-Language Driving Intelligence
MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent
Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics
Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation
FLARE: A Failure-Aware Framework for Autonomous Correction and Recovery in Visual-Language Robotic Manipulation
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
General Process Reward Modeling for Robotic Reinforcement Learning
DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation
Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation
Thinking in 360°: Humanoid Visual Search in the Wild
MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
Cycle-Consistent Tuning for Layered Image Decomposition
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
NEAF: Natural Image Editing with Attention Fusion for Generalizable Test-time Optimization in Text-Guided Image Editing
Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
OMoBlur: An Object Motion Blur Dataset and Benchmark for Real-World Local Motion Deblurring
Hybrid Agents for Image Restoration
Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation
Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration
PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors
Disentanglement-wise Image Dehazing through Cross-Domain Manifold Consensus
Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning
Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning
PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation
Test-Time Attention Purification for Backdoored Large Vision Language Models
Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
R^2TUA: Reconstruction-residual Based Targeted and Untargeted Attack Against Text-Image Person Re-Identification
When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models
Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction
YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
Modeling Cross-vision Synergy for Unified Large Vision Model
Beyond Missing Modalities: Hypergraph Conditioned Diffusion for Uncertainty-Aware Multimodal Emotion Recognition
Rosetta Stone For Unified MLLMs: A Unified Tokenizer to Decipher Understanding and Generation
MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
AMusE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification
EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer’s Disease
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
Cross-View Distillation and Adaptive Masking for Incomplete Multi-View Multi-Label Classification
Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation
Multimodal Distribution Matching for Vision-Language Dataset Distillation
M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
PoseAnything: General Pose-guided Video Generation with Part-aware Temporal Coherence
FastHybrid: Accelerating Hybrid Autoregressive Image Generation with Lookahead and Guided Decoding
AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
Cross-Subject EEG-to-Video Reconstruction and Beyond
Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
VABench: A Comprehensive Benchmark for Audio-Video Generation
Relightful Video Portrait Harmonization
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
DVAR: Dynamic Visual Autoregressive Modeling for Image Super-Resolution
Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution
EMR-Diff: Edge-aware Multimodal Residual Diffusion Model for Hyperspectral Image Super-resolution
One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution
Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion
Pressure2Motion: Hierarchical Human Motion Reconstruction from Ground Pressure with Text Guidance
From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching
MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
W2W: Language-Model-Based Trajectory Prediction with Reinforcement Learning
Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
Unified Number-Free Text-to-Motion Generation Via Flow Matching
Generative Diffusion Priors for 3D Mapping of the Dark Universe
FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation
Spatial-Spectral Residuals Informed Diffusion Neural Operator for Pan-sharpening
PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion
Pixel Motion Diffusion is What We Need for Robot Control
ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
UNICBench: UNIfied Counting Benchmark for MLLM
VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Socratic-Geo: Synthetic Data Generation and Cross-Modal Geometric Reasoning via Multi-Agent Interaction
HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
ReLaGS: Relational Language Gaussian Splatting
3D-IDE: 3D Implicit Depth Emergent
Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval
EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
Camouflage-aware Image-Text Retrieval via Expert Collaboration
TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
TIGER: A Unified Framework for Time, Images and Geo-location Retrieval
Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions
Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
SVAgent: Storyline-guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Relational Visual Similarity
PointCNN++: Performant Convolution on Native Points
Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching
PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
GEM: Generating LiDAR World Model via Deformable Mamba
Hybrid Robust Collaborative Perception with LiDAR-4D Radar Fusion under Adverse Weather Conditions
Task-Driven Implicit Representations for Automated Design of LiDAR Systems
Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Soft Modality-Guided Expert Specialization in MoE-VLMs
CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
AutoRegressive Generation with B-rep Holistic Token Sequence Representation
NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code
Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
End-to-End Hyper-Relational Information Extraction for Engineering Diagrams via Dynamically Tokenized Relation Transformer
Dynamic Token Reweighting for Robust Vision-Language Models
COPYLENS: Towards Copyrighted Characters Infringement Detection via Copyright-Aware Prompt Learning
Closed-Form Concept Erasure via Double Projections
Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning
Stake the Points: Structure-Faithful Instance Unlearning
Federated Active Learning Under Extreme Non-IID and Global Class Imbalance
FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients
Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations
Fully Decentralized Certified Unlearning
Towards Streaming Referring Video Segmentation via Large Language Model
OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation
UniCompress: Token Compression for Unified Vision–Language Understanding and Generation
SCoRe: Salience-Coverage Reduction for Vision Token Pruning in Vision-Language Models
VLM-PTQ: Efficient Post-Training Quantization for Large Vision-Language Models
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
Rethinking Token Reduction for Large Vision-Language Models
Prototype-based Causal Intervention for Multi-Label Image Classification
FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization
Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty
Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
Revisiting Learning with Noisy Labels: Active Forgetting and Noise Suppression
PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data
Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving
MARIS: Marine Open-Vocabulary Instance Segmentation
XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening
Training-Free Open-Vocabulary Camouflaged Object Segmentation via Fine-Grained Object Binding and Adaptive Hybrid Prompt
M⁴-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection
ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP
Mixture of Prototypes for Test-time Adaptive Segmentation
Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
Decouple Your Discovery and Memory in Continual Generalized Category Discovery
Beyond the Static World: Continual Category Discovery under Visual Drift
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data
BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
Unifying Language-Action Understanding and Generation for Autonomous Driving
CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
Feed-forward Gaussian Registration for Head Avatar Creation and Editing
Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
Same Attention, Different Truths: Put Logit-Lens over Visual Attention to Detect and Mitigate LVLM Object Hallucination
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
Lyapunov Probes for Hallucination Detection in Large Foundation Models
Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation
CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision
RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes
RigMo: Unifying Rig and Motion Learning for Generative Animation
WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification
Text-guided Feature Disentanglement for Cross-modal Gait Recognition
PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
Portable Active Learning for Object Detection
Efficiency Follows Global-Local Decoupling
VRCLIP: Multimodal Canonical Correlation Alignment for CLIP-Driven Vision-Radio Person Re-Identification
Expert-Teacher-Student Collaborative Learning for Domain Adaptive Object Detection
CI-VID: A Coherent Interleaved Text-Video Dataset
GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
UniVBench: Towards Unified Evaluation for Video Foundation Models
NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers
TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression
FARMER: Flow AutoRegressive Transformer over Pixels
Efficient and High-Fidelity Omni Modality Retrieval
FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
3D-Object Perception Transformer (3PT)
Spe-BEVHead: Rethinking the Detection Head Design for Bird’s-Eye-View Object Detection
UI-Lens: Assessing General MLLMs’ Potential to Automate UI Display Quality Assurance
Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
Linking Perception, Confidence and Accuracy in MLLMs
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward
Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes
SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
FastGS: Training 3D Gaussian Splatting in 100 Seconds
BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction
FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
UVU: Improving Multimodal Understanding via Vision-Language Unified Autoregressive Paradigm
OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
ApET: Approximation-Error Guided Token Compression for Efficient VLMs
Vision Transformers Need More Than Registers
Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
HTTM: Head-wise Temporal Token Merging for Faster VGGT
Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching
MOGeo: Beyond One-to-One Cross-View Object Geo-localization
AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
Asking like Socrates: Socrates helps VLMs understand remote sensing images
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Let VLMs Grade Their Own Thoughts: A Self-Quantification Approach to Reasoning-Aware Reward Modeling
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
VideoSSR: Video Self-Supervised Reinforcement Learning
Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion
MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework
NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Universal-to-Specific: Dynamic Knowledge-Guided Multiple Instance Learning for Few-Shot Whole Slide Image Classification
Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
Streamlined Knowledge Distillation
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation
240FPS Stereo Vision from Monocular Mixed Spikes
D^2-FOSA: Dual-Diffusion Guided EEG-to-Image Reconstruction with Frequency-Oriented Semantic Alignment
Self-Diffusion Driven Blind Imaging
MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification
Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels
Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis
Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics
TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis
OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Seeing Depth Through Frequency and Motion: A Progressive Training Paradigm for Monocular Depth Estimation
GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation
PE3R: Perception-Efficient 3D Reconstruction
GS-ASM: 2DGS-Supervised Active Stereo Matching
Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery
Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
AirSim360: A Panoramic Simulation Platform within Drone View
Radar-Guided Polynomial Fitting for Metric Depth Estimation
SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
EI-Part: Explode for Completion and Implode for Refinement
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning
FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion
TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation
SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience
Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers
PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
Enhancing Spatial Understanding in Image Generation via Reward Modeling
Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization
Learning Latent Proxies for Controllable Single-Image Relighting
LAOF: Robust Latent Action Learning with Optical Flow Constraints
DarkAct: A RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition
Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Steering Where to Diffuse: Generative Modeling of Phenotypic Response Simulation with Steered Diffusion Bridge
Deep Feature Deformation Weights
Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
Coupling Liquid Time‑Constant Encoders with Modern Hopfield Memory
Stronger Normalization-Free Transformers
Convolutional Neural Networks Driven by Content Similarity
MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence On Mobile Devices
OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Beyond Weak Supervision: MLLMs-Guided Graded Knowledge Distillation for Unsupervised Camouflaged Object Detection
Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection
TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models
Parameterized Prompt for Incremental Object Detection
SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names
PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
Partial Weakly-Supervised Oriented Object Detection
Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation
Towards Robust Multi-Modal Semantic Segmentation with Teacher-Student Framework and Hybrid Prototype Distillation
REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
From Softmax to Dirichlet: Evidential Learning for Semi-supervised Semantic Segmentation
Particulate: Feed-Forward 3D Object Articulation
HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement
Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing
Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective
GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization
GeoSANE: Learning Geospatial Representations from Models, Not Data
Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition
SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning
DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents
Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
Defending Unauthorized Model Merging via Dual-Stage Weight Protection
AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
LacTokGen: Latent Consistency Tokenizer for 1024-pixel Image Generation by 256 Tokens
Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
TSTM: Temporal Segmentation for Task-relevant Mask in Visual Reinforcement Learning Generalization
GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion
PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence
Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains
IGen: Scalable Data Generation for Robot Learning from Open-World Images
TGTrack: Temporal Generative Learning for Unified Single Object Tracking
GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry
Generalizable Structure-Aware Keypoint Correspondence for Category-Unified 3D Single Object Tracking
RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation
Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition
GMT: Effective Global Framework for Multi-Target Multi-Camera Tracking
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
Dual-Level Hypergraph Generation for Addressing Feature Scarcity in Whole-Slide Image Classification
Temporal Inversion for Learning Interval Change in Chest X-Rays
JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction
PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
Anatomical Domain Shifts: Test-time Heterogeneous Adaptation for 3D Human Pose Prediction
Learning 3D Shape Fidelity Metric from Real-world Distortions
BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition
FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling
VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
Bringing Your Portrait to 3D Presence
FLOW: Feature-Level Optimal Warping for Generalized Remote Physiological Measurement
One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Multi-Prototype Compactness and Boundary-Aware Synthesis for Unsupervised Anomaly Detection
PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection
Learning Spatial-Temporal Consistency for 3D Semantic Scene Completion
OccAny: Generalized Unconstrained Urban 3D Occupancy
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
Global-Aware Edge Prioritization for Pose Graph Initialization
Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics
SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
U^2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation
Confusion-Aware Spectral Regularizer for Long-Tailed Recognition
Learning Latent Concepts for Detecting Out-of-Distribution Objects
Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
Understanding Task Transfer in Vision-Language Models
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
TESO: Online Tracking of Essential Matrix by Stochastic Optimization
ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications
OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling
Scaling View Synthesis Transformers
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis
KV-Tracker: Real-Time Pose Tracking with Transformers
VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation
Agentic Retoucher for Text-To-Image Generation
SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper
Adapting In-context Generation for Enhanced Composed Image Retrieval
Transition Models: Rethinking the Generative Learning Objective
Rethinking Glyph Spatial Information in Font Generation
StreamDiT: Real-Time Streaming Text-to-Video Generation
ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
Aligning Multi-Character Narrative Image Generation with Multi-Aspect Human Preferences
FoleyDirector: Directing Temporal Controllable Video-to-Audio Generation via Fine-Grained Temporal Scripts
DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation
DreamOmni2: Multimodal Instruction-based Generation and Editing
AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models
PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation
PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
Boosting Visual Reprogramming for CLIP with Dual Granularity Alignment
Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL
Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning
Unified Personalized Understanding, Generating and Editing
Decision Boundary-aware Generation for Long-tailed Learning
Towards Stable Federated Continual Test-Time Adaptation in Wild World
ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning
Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation
High-Quality and Efficient Turbulence Mitigation with Events
Tracking through Severe Occlusion via Event-Derived Transient Cues
FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
Event-Based Motion Deblurring Using Task-Oriented 3D Gaussian Event Representations
From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
Extending Embodied Question Answering from Perception to Decision
MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis
Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories
Rethinking Intermediate Representation for VLM-based Robot Manipulation
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
Harmonic Canvas: Inversion-Free Editing for Visually-Guided Music Style Transfer
SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
NeAR: Coupled Neural Asset–Renderer Stack
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction
Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters
Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising
UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Bilevel Layer-Positioning LoRA for Real Image Dehazing
SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation
GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
Simple-ViLMedSAM: Simple Text Prompts Meet Vision-Language Models for Medical Image Segmentation
NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models
Jailbreaking Vision-Language Models via Dissonance-Guided Suffix Optimization and Image–Phrase Injection
BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation
Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
Transform to Transfer: Boosting Adversarial Attack Transferability on Vision-Language Pre-training Models
Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast
Reliable Clustering Number Estimation for Contrastive Multi-View Clustering
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
EXOTIC: External Vision-driven Incomplete Multi-view Classification
OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration
Information-Theoretic Decomposition for Multimodal Interaction Learning
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
Visual Autoregressive Modeling via Next Focus Prediction
Semantic Context Matters: Improving Conditioning for Autoregressive Models
TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
Improved Mean Flows: On the Challenges of Fastforward Generative Models
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
Match-and-Fuse: Consistent Generation from Unstructured Image Sets
Mixture of Style Experts for Diverse Image Stylization
Mirai: Autoregressive Visual Generation Needs Foresight
Align Images Before You Generate
Bridging the Perception Gap in Image Super-Resolution Evaluation
Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance
IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data
Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects
CLEP: Contrastive Language-Pose Pretraining
OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
Beyond Mimicry: Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations
PHAC: Promptable Human Amodal Completion
CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
Outlier-Robust Diffusion Solvers for Inverse Problems
Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
ReasonX: MLLM-Guided Intrinsic Image Decomposition
Diff-SemiER: Transparency-Aware Adaptive Fusion Diffusion Model with Generative Prior for Semi-Transparent Eyeglasses Removal
KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems
Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models
Taming Generative Diffusion Model for Task-Oriented Infrared Imaging
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought
GeoWorld: Geometric World Models
Benchmarking PhD-Level Coding in 3D Geometric Computer Vision
MonoVLM: Monocular 3D Visual Grounding with Vision Language Models
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
SPREAD: Spatial-Physical REasoning via geometry Aware Diffusion
ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
Progressive Cross-Modal Causal Intervention for Long-Term Action Recognition
EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding
Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
Gaze Target Estimation Anywhere with Concepts
Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation
See Through the Noise: Improving Domain Generalization in Gaze Estimation
Mechanisms of Object Localization in Vision–Language Models
From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching
Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
GM-R^2: Generative Matching Learning for Unsupervised Geometric Representation and Registration
4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation
MORE-STEM: Long-Short MemOry REcall and Spatio-TEmporal Consistency Model for Query-Driven 3D/4D Point Cloud Segmentation
Low-Rank Test-Time Training for Pre-Trained Point Cloud Models
STAR: Test-Time Adaptation Can Enhance Universal Prompt Learning for Vision-Language Models
Exploring Visual Pretraining for Learning Language Intelligence
VL-Eraser: Vacuum Distillation for Machine Unlearning in Vision-Language Models
DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
Mirror Illusion Art
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
Towards Human-Like Robot Handwriting via Contour-Aware Generation
MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
Rank-Guided Pseudo-Bias Learning for Robust Black-Box Adaptation
WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
SineProject: Machine Unlearning for Stable Vision-Language Alignment
HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning
OS-Fed: One Snapshot Is All You Need
FedAlign: Differentially Private Distribution Alignment for Non-IID Federated Learning
Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
Personalized Federated Training of Diffusion Models with Privacy Guarantees
FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning
Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
Small Object, Great Challenge: A Benchmark for Small Object Visual Grounding
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
Hybrid Token Compression for Vision-Language Models
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
Hi-Lo Prune: Look at What You'll Lose before Pruning with Hierarchical Token Selection
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
Heterogeneous Decentralized Diffusion Models
GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency
TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
The Universal Normal Embedding
Bypassing the Transport Plan: Dynamic Reweighting for Out-of-Distribution Detection with Optimal Transport
Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
Debiased Sample Selection for Learning with Noisy Labels
Driving on Registers
Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
EE-RL: Vision Language Guided Reinforcement Learning with Explorer and Expert model for End-to-End Autonomous Driving
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
Efficient Equivariant Transformer for Self-Driving Agent Modeling
Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
Saliency-Driven Token Merging for Vision Transformers
RISE: Single Static Radar-based Indoor Scene Understanding
Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
DGS: Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation for Class Incremental Learning
Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification
Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning
Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging
Few-Shot Hybrid Incremental Learning: Continually Learning under Data Scarcity and Task Uncertainty
High-Fidelity Mobile Avatars with Pruned Local Blendshapes
PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
Bridging Privacy and Provenance: Traceable Virtual Identity Generation
Dynamic Label Noise Suppression with Optimal Teacher Pool for Facial Expression Recognition
MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation
DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation
NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation
MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
ZINA: Multimodal Fine-grained Hallucination Detection and Editing
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding
Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models
ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
SounDiT: Geo-Contextual Soundscape-to-Landscape Generation
CamDirector: Towards Long-Term Coherent Video Trajectory Editing
Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding
Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
Target-Aware Invertible Encoder with Reconstruction Guidance for Infrared Small Target Detection
BDNet: Bio-Inspired Dual-Backbone Small Object Detection Network
ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer
RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition
URICA: A Uniformity Region Affine Identifier Capture Algorithm for Arbitrary Region Retrieval in Pathology Images
Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
Video-CoE: Reinforcing Video Event Prediction via Chain of Events
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
CADC: Content Adaptive Diffusion-Based Generative Image Compression
FG-Portrait: 3D Flow Guided Editable Portrait Animation
IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation
SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training
InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection
Exploring 6D Object Pose Estimation with Deformation
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
Improving Vision-language Models with Perception-centric Process Reward Models
PhysInOne: Visual Physics Learning and Reasoning in One Suite
AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety–Critical Cloud Forecasts
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting
Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing
Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
Scene Grounding in the Wild
Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
Revisiting 3D Reconstruction Kernels as Low-Pass Filters
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
IPR-1: Interactive Physical Reasoner
VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
CodePercept: Code-Grounded Visual STEM Perception for MLLMs
TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective
Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing
Grounded Chain-of-Thought for Multimodal Large Language Models
SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
Compressed-Domain-Aware Online Video Super-Resolution
Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
Task-Aware Image Signal Processor for Advanced Visual Perception
Enhancing Video Vision Language Model with Hippocampal Sensing
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent
ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Think, Then Verify: A Hypothesis–Verification Multi-Agent Framework for Long Video Understanding
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Multi-Modal Image Fusion via Intervention-Stable Feature Learning
ReCoFuse: Ultra-Robust Image Fusion via Restorative Multi-Modal Diffusion Reciprocal Coupling
Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios
DF^2-VB: Dual-level Fuzzy Fusion with View-specific Boosting for Multi-view Multi-label Classification
UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation
Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs
Graph Attention Prototypical Network for Robust Few-Shot Classification
EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation
Flow Map Distillation Without Data
A Bit is All You Need! Efficient Video Capture via Single Bit Imaging
From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing
Physics-Guided Multistep Deformation Reversal for Ancient Bamboo Slip Restoration
cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
SGDE: Self-supervised Geometry Degradation Estimation Framework for Coded Aperture Compressive Spectral Imaging
Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
RoSAMDepth: Robust Self-supervised Depth Estimation Leveraging Segment Anything Model
AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
Dark3R: Learning Structure from Motion in the Dark
What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
Photo3D: Advancing Photorealistic 3D Generation through Structure‑Aligned Detail Enhancement
Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
Order Matters: 3D Shape Generation from Sequential VR Sketches
Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models
FabricGen: Microstructure-Aware Woven Fabric Generation
Leveraging Verifier-Based Reinforcement Learning in Image Editing
PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling
VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
C^2FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
Unified Customized Generation by Disentangled Reward Modeling
Region-Aware Instance Consistency Learning for Micro-Expression Recognition
LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
Temporal Representation Enhancement (TRE): Learning to Forget Dominant Patterns for Enhanced Temporal Spiking Features
Unlocking Pre-trained Weights: Parameter Inheritance for Zero-Shot Initialization
Progressive Neural Architecture Generation
A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling
When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks
Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
Sparsely Timing the Change: A Spiking Temporal Framework for Remote Sensing Interpretation
ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments
BAMI: Training-Free Bias Mitigation in GUI Grounding
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection
Geometry-driven OOD Detectors Are Class-Incremental Learners
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
Prompt-Free Unknown Label Generation for Open World Detection in Remote Sensing
Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection
DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
Leveraging Class Distributions in CLIP for Weakly Supervised Semantic Segmentation
CompetitorFormer: Mitigating Query Conflicts for 3D Instance Segmentation via Competitive Strategy
D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping
Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
Fast Reasoning Segmentation for Images and Videos
FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle
TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
Regulating Rather than Constraining: Adaptive Guidance for Complex Spectral Reconstruction in Pansharpening
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks
Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation
Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
Logit-Margin Repulsion for Backdoor Defense
Towards Stealthy and Effective Backdoor Attacks on Lane Detection: A Naturalistic Data Poisoning Approach
Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
Fractal Camouflage: A Bio-Inspired Approach for Multi-Scale Adversarial Attacks in the Infrared Domain
EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment
Describe Anything Anywhere At Any Moment
StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
HQC-NBV: A Hybrid Quantum-Classical View Planning Approach
MM-ACT: Learn from Multimodal Parallel Generation to Act
Motus: A Unified Latent Action World Model
SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields
Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning
Progressive Multi-cue Alignment for Unaligned RGBT Tracking
Real-Time Neural Video Compression with Unified Intra and Inter Coding
Adapting Lightweight Image-based Counting Models for Video Crowd Counting
Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis
MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration
GenTract: Generative Global Tractography
Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
IEBGL: An Interpretability-Enhanced Brain Graph Learning Framework with LLM-Instructed Topology and Literature-Augmented Semantics
F^2-Assist: Multi-Phase Fetal Growth Forecast and Report Generation from Ultrasound Examination
OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization
IncreFA: Breaking the Static Wall of Generative Model Attribution
AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
Detect Any AI-Counterfeited Text Image
DeepfakeImpact: A Two-Stage Benchmark with Real-World Impact in Deepfake Detection
Enhancing the Security of Visual Speaker Authentication Based on Dynamic Lip-Print Analysis
Editprint: General Digital Image Forensics via Editing Fingerprint with Self-Augmentation Training
Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification
Goldilocks Test Sets for Face Verification
DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection
Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
Defect Cue-Preserved Structural Feature Refinement for Few-Shot Anomaly Detection
Anomaly-Related Residual Fields for Cross-domain Anomaly Detection
From Attraction to Equilibrium: Physics-Inspired Semantic Gravitons for Zero-Shot Anomaly Detection
No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
Test-Time 3D Occupancy Prediction
Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
RegionRoute: Regional Style Transfer with Diffusion Model
Low-Rank Residual Diffusion Models
TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
Guiding Token-Sparse Diffusion Models
Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
High-Fidelity Virtual Try-On beyond Paired Data Scarcity via Diffusion-based Cycle-Consistent Learning
CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think
Scale Space Diffusion
Making Training-Free Diffusion Segmentors Scale with the Generative Power
Few-Step Diffusion Sampling Through Instance-Aware Discretizations
SpeeDiff: Scalable Pixel-Anchored End-to-End Latent Diffusion Model
Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
MotionV2V: Editing Motion in a Video
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
DreamStyle: A Unified Framework for Video Stylization
Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering
TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration
Cross-modal Representation Learning for Diffusion-generated Image Detection
Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models
CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis
DiP: Taming Diffusion Models in Pixel Space
RAPID: Reusing Attention Sparsity with Inter-step Adaptation for Efficient Video Diffusion
FILTR: Extracting Topological Features from Pretrained 3D Models
Learning Convex Decomposition via Feature Fields
Learning Eigenstructures of Unstructured Data Manifolds
Mapping Networks
SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging
Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation
FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
SimScale: Learning to Drive via Real-World Simulation at Scale
Texvent: Asynchronous Event Data Simulation via Text Prompt
WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
BuildingGPT: Auto-Regressive Building Wireframe Reconstruction Model with Reinforcement Learning
Free-Grained Hierarchical Visual Recognition
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
SEA-Flow3D: Simplified, Efficient, and Accurate Scene Flow via Spatial Vector Sampling and Multi-scale Refinement
OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
DROID-SLAM in the Wild
HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
Learning by Analogy: A Causal Framework for Compositional Generalization
ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
Gloria: Consistent Character Video Generation via Content Anchors
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
M4V: Multimodal Mamba for Efficient Text-to-Video Generation
Property-Informed Diffusion-Based Text-to-Microstructure Generation
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning
TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
CoV-Align: Efficient Fine-grained Cross-Modal Alignment with Cohesive Visual Semantics Priority
TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment
PhyCritic: Multimodal Critic Models for Physical AI
Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
Decoupling Vision and Language: Codebook Anchored Visual Adaptation
MemFlow: A Lightweight Forward Memorizing Framework for Quick Domain Adaptive Feature Mapping
Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning
Vision-Language Model Guided Source-Free Domain Adaptation via Optimal Transport
TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
ARES: Unifying Asymmetric RGB-Event Stereo for Probabilistic Scene Flow Estimation
MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras
Moving Border Ownership for Event-based Motion Segmentation
TTAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
Seeing Motion Through Polarity for Event-based Action Recognition
Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning
Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
Experience Transfer for Multimodal LLM Agents in Minecraft Game
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robotics
RealAppiance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals
ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
MERIT: Multi-domain Efficient RAW Image Translation
Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment
WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
iLRM: An Iterative Large 3D Reconstruction Model
MVInverse: Feed-forward Multiview Inverse Rendering in Seconds
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
Multi-view Pyramid Transformer: Look Coarser to See Broader
CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling
RL‑ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
Benchmarking Endoscopic Surgical Image Restoration and Beyond
SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control
Bi-Bridge: Bidirectional Diffusion Bridges for Low-Light Image Enhancement
UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration
MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency
Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
Photo-Guided Tooth Segmentation on 3D Oral Scan Model
Post-training Feature Pruning for Fundus Images Classification
Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
SafeLogo: Turning Your Logos into Jailbreak Shields via Micro-Regional Adversarial Training
Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Hierarchically Robust Zero-shot Vision-language Models
Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration
Chain-of-Thought Guided Multi-Modal Object Re-Identification
When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Parameter-Efficient Adaptation for MLLMs via Implicit Modality Decomposition
Hyperbolic Gramian Volumes for Multimodal Alignment
Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion
CoRiM: Conflict-driven Risk Minimization for Dynamic Multimodal Fusion
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
CICA: Coupling Confidence-Aware Pretraining with Confidence-Informed Attention for Robust Multimodal Sentiment Analysis
SAMTok: Representing Any Mask with Two Words
Multi-Metric Representation Learning Strategy Based on Clustering for Fine-Grained Multimodal Sentiment Analysis
Cinematic Audio Source Separation Using Visual Cues
MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
Progressive Supernet Training for Efficient Visual Autoregressive Modeling
CoT-Edit: Let CoT Guide Instruction Video Editing
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Dual-Granularity Memory for Efficient Video Generation
Unified Camera Positional Encoding for Controlled Video Generation
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
Towards Robust Sequential Decomposition for Complex Image Editing
Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
LoL: Longer than Longer, Scaling Video Generation to Hour
FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens
DNF-SR: Dual-Input and Negative-Aware Feature Fine-Tuning for Real-World Image Super-Resolution
IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
Edge-Focused Super-Resolution for Omnidirectional Images with Spherical Geometric Augmentation
Disentangled Textual Priors for Diffusion-based Image Super-Resolution
Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features
DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer
FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization
Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions
Human Geometry Distribution for 3D Animation Generation
A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation
End-to-End Language-Action Model for Humanoid Whole Body Control
Toward Early Quality Assessment of Text-to-Image Diffusion Models
CoD: A Diffusion Foundation Model for Image Compression
Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
Language-Guided One-Step Diffusion Model for Nighttime Flare Removal
SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
Landscape-Awareness for Geometric View Diffusion Model
Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
Geometrically-Constrained Agent for Spatial Reasoning
PARSE: Part-Aware Relational Spatial Modeling
R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
DiT-Distill: Open-Set Fine-Grained Retrieval via Generative Curriculum Knowledge
Rethinking BCE Loss for Multi-Label Image Recognition with Fine-Tuning
CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
Interactive Episodic Memory with User Feedback
Seeing without Pixels: Perception from Camera Trajectories
StreamRAG: Enhancing Real-Time Video Understanding with Retrieval Augmentation
SkillSight: Efficient First-Person Skill Assessment with Gaze
BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Making the Classification Explanation Faithful to the Confidence Score
Intrinsic Concept Extraction Based on Compositional Interpretability
Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
Deformation-based In-Context Learning for Point Cloud Understanding
FMPose3D: monocular 3D pose estimation via flow matching
ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
ESAM++: Efficient Online 3D Perception on the Edge
DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs
Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment
Adaptive 3D Perception for Small Aerial Targets Under Sparse Sampling via Reinforcement Learning
StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding
Towards Calibrating Prompt Tuning of Vision-Language Models
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Language-guided Frequency Modulation for Large Vision-Language Models
TANGO: Text-Anchored Guided Optimization for Robust Fine-tuning Vision-Language Models under Label Noise
DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities
BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep
NeuROK: Generative 4D Neural Object Kinematics
BrickNet: Graph-Backed Generative Brick Assembly
Unified Vector Floorplan Generation via Markup Representation
CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
The Invisible Gorilla Effect in Out-of-distribution Detection
Interpretable Debiasing of Vision-Language Models for Social Fairness
Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness
Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting
Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization
Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning
FedMOP: Achieving Enhanced Privacy and Performance in Federated Learning via Momentum Orthogonal Projection
HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
Single-Round Scalable Analytic Federated Learning
FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
Spatial Matters: Position-Guided 3D Referring Expression Segmentation
Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
Imbalanced View Contribution Evaluation and Refinement for Deep Incomplete Multi-View Clustering
Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering
SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
Residual Connections Harm Generative Representation Learning
Neural Mixture Density Processes
Large-scale Robust Enhanced Ensemble Clustering via Outlier Decoupling
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training
Latent Chain-of-Thought World Modeling for End-to-End Driving
Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving
DIMOS: Disentangling Instance-level Moving Object Segmentation
EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
Robust Promptable Video Object Segmentation
Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
BEV-CAR: Enhancing Monocular Bird’s Eye View Segmentation with Context-Aware Rasterization
Exploring the Underwater World Segmentation without Extra Training
Cross-Architecture Adaptation: Cloud-Edge Continual Test-Time Adaptation with Dynamic Sampling and Heterogeneous Distillation
Towards Dynamic Modality Alignment in Multimodal Continual Learning
ϕ-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
Incremental Object Detection via Future-Aware Decoupled Cross-Head Distillation
Smart Replay: Adaptive Scheduling of Memory Rehearsal for Computational Resource-Aware Incremental Learning
ReBaPL: Repulsive Bayesian Prompt Learning
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
OctoNav: Towards Generalist Embodied Navigation
WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
SMAP: Semantic Route Planning with Map-Grounded Multimodal Alignment
IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
Motion-Aware Animatable Gaussian Avatars Deblurring
ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Cross-Modal Attention Calibration for LVLM Hallucination Mitigation
Exposing and Evaluating Hallucinations for GUI Grounding
Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
AniMimic: Imitating 3D Animation from Video Priors
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories
Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance
SANER: Switchable Adapter with Non-parametric Enhanced Routing for Person De-Reidentification
BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification
Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification
Diversity over Uniformity: Rethinking Representation in Generated Image Detection
Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection
FSLoRA: Harmonizing Detection and Re-Identification via Freq-Spatial Low-Rank Adapter for One-Stage Person Search
EEGiT: Teaching Vision Transformers to Understand the EEG signal
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification
UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
StreamReady: Learning What to Answer and When in Long Streaming Videos
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
Self-Critical Distillation Network for Video-based Commonsense Captioning
Ego-Grounding for Personalized Question-Answering in Egocentric Videos
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
EarlyTom: Early Token Compression Completes Fast Video Understanding
VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
RenderFlow: Single-Step Neural Rendering via Flow Matching
ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers
ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
H^2A^2: Homogeneity-Aware and Heterogeneity-Aware Feature Perception for Unified Indoor 3D Object Detection
Towards Intrinsic-Aware Monocular 3D Object Detection
SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration
HDR-VLM: HDR-Domain Adaptation of VLMs and Preference-Aligned Quality Assessment for HDR Video Color Grading
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting
TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM
Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
3D Gaussian Splatting from Unposed Spike Stream
SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method
ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction
MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps
DialogueVPR: Towards Conversational Visual Place Recognition
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
Grounding Everything in Tokens for Multimodal Large Language Models
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory
ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering
Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning
VKG-QA: Visual Knowledge Graph-based Question Answer for Large Multimodal Models
Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment
Generative Video Compression with One-Dimensional Latent Representation
Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Learned Image Compression via Sparse Attention and Adaptive Frequency
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression
GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
Towards Visual Query Localization in the 3D World
OVOD-Agent: A Markov–Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding
Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion
Multi-modal Frequency Decomposition Network for Semantic Scene Completion
BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration
Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning
LRHDR: Learning Representation-enhanced HDR Video Reconstruction
Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation
AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection
SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection
Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment
Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Progressive Mask Distillation for Self-supervised Video Representation
HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
Computational Speckle Pattern Interferometry
DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
Computer Vision with a Superpixelation Camera
Multi-Scale Gradient-Guided Unrolling Architecture with Adaptive Mamba for Compressive Sensing
Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning
Bulk RNA-seq Guided Multi-modal Detection of Anomalous Regions in Human Cancer via Spatial Transcriptomics
ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
EvoID: Reinforced Evolution for Identity-Preserving Video Generation
PhyCo: Learning Controllable Physical Priors for Generative Motion
Unified Multimodal Models as Auto-Encoders
Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
Generative Modeling of Weights: Generalization or Memorization?
Vision-Oriented Lightweight Neural Architecture Search with Budget-Adaptive Evaluation
Improving Sparse Autoencoder with Dynamic Attention
Stepwise Credit Assignment for GRPO on Flow-Matching Models
FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
Hyperbolic Busemann Neural Networks
FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
Image-to-Point Cloud Feature Back-Projection for Multimodal Training of 3D Semantic Segmentation
NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation
MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation
MatchMask: Mask-Centric Generative Data Augmentation for Label-Scarce Semantic Segmentation
Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery
UniChange: Unifying Change Detection with Multimodal Large Language Model
See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions
MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
Fourier Angle Alignment for Oriented Object Detection in Remote Sensing
Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
FeatureFool: Zero-Query Fooling of Video Models via Feature Map
AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers
Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation
Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning
CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer
ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model
BiPreManip: Learning Affordance-Based Bimanual Pre-Manipulation through Anticipatory Collaboration
Learning Surgical Robotic Manipulation with 3D Spatial Priors
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
RaUF: Learning the Spatial Uncertainty Field of Radar
SIR: Structured Image Representations for Explainable Robot Learning
Instance-level Visual Active Tracking with Occlusion-Aware Planning
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Tracking by Predicting 3-D Gaussians Over Time
Toward Low-Cost yet Effective Temporal Learning for UAV Tracking
Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Beyond Explicit Language: Plug-and-Play Visual-to-Linguistic Modeling Toward General Object Tracking
From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature
LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction
Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation
Decoding 3D Perception via BrainSSD: Synergistic Fusion of EEG Representations from Static and Dynamic Visual Streams
Duala: Dual-Level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding
OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control
FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
RAM: Recover Any 3D Human Motion in-the-Wild
From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding
Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion
Learning Forgery-Aware Lip Representations Without Forgery Priors
Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
Unleashing Vision-Language Semantics for Deepfake Video Detection
RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
Zero-shot Detection of AI-Generated Image via RAW-RGB Alignment
Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes
Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection
FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection
RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View
GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
RebRL: Reinforcing Discrete Visual Diffusion Models with Rebalanced Timestep Credits
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
Towards Fine-Grained Attribution: Instance-Aware Preference Optimization for Aligning Diffusion Models
SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior-Guided Multimodal LLMs
UniPercept: A Unified Diffusion Model for Generalizable Visual Perception
Visual Diffusion Models are Geometric Solvers
You Only Erase Once: Erasing Anything without Bringing Unexpected Content
Smoothing the Score Function to Enhance Generalization in Diffusion Models
NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
Interpretable Prompts made Edit-Friendly: Token-to-Token Similarity Reduction in dLLMs for Edit-Friendly Hard Prompt Inversion
Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation
EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition
Hierarchical Codec Diffusion for Video-to-Speech Generation
Semantic Alignment for Pose-Invariant Identity Preserving Diffusion
Causality in Video Diffusers is Separable from Denoising
2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
MacTok: Robust Continuous Tokenization for Image Generation
Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training
Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
Elucidating the SNR-t Bias of Diffusion Probabilistic Models
RFDM: Residual Flow Diffusion Models for Video Editing
FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Text-Driven 3D Hand Motion Generation from Sign Language Data
Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models
Guiding Diffusion Models with Semantically Degraded Conditions
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
Accelerating Autoregressive Video Diffusion via History-Guided Cache and Residual Correction
MusicInfuser: Making Video Diffusion Listen and Dance
Affordance-First Decomposition for Continual Learning in Video–Language Understanding
TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
ArtLLM: Generating Articulated Assets via 3D LLM
A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
Refracting Reality: Generating Images with Realistic Transparent Objects
PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
Geometric Neural Distance Fields for Learning Human Motion Priors
Differentiable Laplacian Matrix Guided Superpixel Segmentation
UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
Image Generation from Contextually-Contradictory Prompts
Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach
FrankenMotion: Part-level Human Motion Generation and Composition
VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding
Unique Lives, Shared World: Learning from Single-Life Videos
Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
Envisioning the Future, One Step at a Time
Geo2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
Unified Latent Space for Understanding and Generation via Semantic Auto-encoder
CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks
BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
VGGT-Ω
TokenLight: Precise Lighting Control in Images using Attribute Tokens
Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global–Local Feature Fusion
3D Space as a Scratchpad for Editable Text-to-Image Generation
When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
Global Underwater Geolocation from Time-Lapse Polarization Imagery
SAM 3D: 3Dfy Anything in Images
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
Building Robust Vision Encoders for Cross-Dataset Evaluation in Immunofluorescent Microscopy
Lighting in Motion: Spatiotemporal HDR Lighting Estimation
Understanding Counting Mechanisms in Large Language and Vision-Language Models
Visual Grounding for Object Questions
Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization
PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
EasyV2V: A High-quality Instruction-based Video Editing Framework
Resolving the Identity Crisis in Text-to-Image Generation
Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
Evidential Neural Radiance Fields
CaptionQA: Is Your Caption as Useful as the Image Itself?
Is the Modality Gap a Bug or a Feature? A Robustness Perspective
Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID
CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization
Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
CHEEM: Continual Learning by Reuse, New, Adapt and Skip - A Hierarchical Exploration-Exploitation Approach
Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
Joint Learning of General and Diverse Patterns with Mixture of Memory Experts for Weakly-Supervised Video Anomaly Detection
WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
Efficient and Training-Free Single-Image Diffusion Models
Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching
TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
Circular-DPO: Aligning Multi-Stage 3D Generative Models via Preference Feedback Loop
SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Phrase-grounded APO for Improving Chest X-ray Report Generation
Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
Learning 3D Reconstruction with Priors in Test Time
DSO: Direct Steering Optimization for Bias Mitigation
LumiMotion: Improving Gaussian Relighting with Scene Dynamics
AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision
SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning
Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking
SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
AnthroTAP: Learning Point Tracking with Real-World Motion
CLEX: Complementary Label Exchange Learning for Noisy Facial Expression Recognition
SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
Sparse Spectral LoRA: Routed Experts for Medical VLMs
Beyond Rule-Based Agents: Active Markov Games for Realistic Multi-Agent Interaction in Autonomous Driving
LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain
Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
GDRO: Group-level Reward Post-training Suitable for Diffusion Models
MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Learnability-Driven Submodular Optimization for Active Roadside 3D Detection
When to Think and When to Look: Uncertainty-Guided Lookback
Latent Diffusion Inversion Requires Understanding the Latent Space
LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
LoPrune: Efficient Data Pruning for LoRA-Based Fine-Tuning of Vision Transformer
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
MV-TAP: Tracking Any Point in Multi-View Videos
Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
Velox: Learning Representations of 4D Geometry and Appearance
D2T2 - Multimodal Automated Planning for Brachytherapy
PAI-Bench: A Comprehensive Benchmark For Physical AI
MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition
ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
Reward Sharpness-Aware Fine-Tuning for Diffusion Models
DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
DeDelayed: Deleting Remote Inference Delay via On-Device Correction
RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space
Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
BluRef: Unsupervised Image Deblurring with Dense-Matching References
MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
Point Cloud as a Foreign Language for Multi-modal Large Language Model
Spatiotemporal Pyramid Flow Matching for Climate Emulation
Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
Revisiting Model Stitching In the Foundation Model Era
Contact-Aware Neural Dynamics
UniVerse: Empower Unified Generation with Reasoning and Knowledge
VisPlay: Self-Evolving Vision-Language Models
EventGait: Towards Robust Gait Recognition with Event Streams
Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
Collaborative Multi-Mode Pruning for Vision-Language Models
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
UniDAC: Universal Metric Depth Estimation for Any Camera
SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
COT-FM: Cluster-wise Optimal Transport Flow Matching
ID-Sim: An Identity-Focused Similarity Metric
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
2D-LFM: Lifting Foundation Model without 3D Supervision
Aligning Text, Images and 3D Structure Token-by-Token
Learning Straight Flows: Variational Flow Matching for Efficient Generation
Captain Safari: A World Engine with Pose-Aligned 3D Memory
DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
MOMO: Mars Orbital MOdel Foundation Model for Mars Orbital Applications
Forecasting 3D Scanpaths in Egocentric Video
BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking
Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
GenMatter: Perceiving Physical Objects with Generative Matter Models
Retrieving Counterfactuals Improves Visual In-Context Learning
Measuring the (Un)Faithfulness of Concept-Based Explanations
PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
A More Word-like Image Tokenization for MLLMs
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Masked Representation Modeling for Domain-Adaptive Segmentation
RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection
Understanding, Accelerating, and Improving MeanFlow Training
Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining
Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models
LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
FedCART: Tackling Long-Tailed Distributions in Federated Adversarial Training via Classifier Refinement
LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery
History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
Finding Distributed Object-Centric Properties in Self-Supervised Transformers
Minimal Constraint Relaxation for Multiview Autocalibration
HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
FPSBench: A Benchmark for Video Understanding at High Frame Rates
Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
How Much 3D Do Video Foundation Models Encode?
Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation
ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs
JRM: Joint Reconstruction Model for Multiple Objects without Alignment
QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Self-Consistency for LLM-Based Motion Trajectory Generation and Verification
MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
WPT: World-to-Policy Transfer via Online World Model Distillation
Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model
Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
Reinforcing Structured Chain-of-Thought for Video Understanding
STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
RunawayEvil: Jailbreaking the Image-to-Video Generative Models
REACH: Explicit Recovery Behavior for Diffusion Policies
RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
Dual Ascent Diffusion for Inverse Problems
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
RankOOD - Class Ranking-based Out-of-Distribution Detection
Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System
Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset
MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
CLP: A Real-World Dataset of Contaminated Lens Protectors for Robust Semantic Segmentation
OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
PECCVAI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
Concept-Aware Batch Sampling Improves Language-Image Pretraining
DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
Learnability-Guided Diffusion for Dataset Distillation
D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation
Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
Harnessing the Power of Foundation Models for Accurate Material Classification
Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
Same or Not? Enhancing Visual Perception in Vision-Language Models
VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation
Reinforcing Video Reasoning Segmentation to Think Before It Segments
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
A Difference-in-Difference Approach to Detecting AI-Generated Images
FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation
Differentially Private 2D Human Pose Estimation
Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
Fine-Grained Multi Image Object Hallucination Benchmark
VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
Scalable Trajectory Generation for Whole-Body Mobile Manipulation
NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction
SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
Active Intelligence in Video Avatars via Closed-loop World Modeling
Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
Verifying Neural Network Robustness with Dual Perturbations
Generalizable Video Quality Assessment via Weak-to-Strong Learning
QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification
Ego: Embedding-Guided Personalization of Vision-Language Models
Learning to Infer Parameterized Representations of Plants from 3D Scans
F^2HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Uni-Hema: Unified Model for Digital Hematopathology
MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
Distribution-Aligned Multimodal Fusion for Robust Object Detection
Obstruction Reasoning for Robotic Grasping
Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation
Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
VISTA: A Test-Time Self-Improving Video Generation Agent
Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
Representing 3D Faces with Learnable B-Spline Volumes
Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
Language Models Can Explain Visual Features via Steering
Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
Data-Centric Meta-Learning for Robust Few-Shot Generalization
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
RINO: Rotation-Invariant Non-Rigid Correspondences
Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Physical Object Understanding with a Physically Controllable World Model
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
Reconstructing CLIP for Open-Vocabulary Dense Perception
CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation
Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning
PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
SyncDreamer: Controllable and Expressive Avatar Generation Beyond the Talking Head
Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
PhysVid: Physics Aware Local Conditioning for Generative Video Models
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection
DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
Test-time Sparsity for Extreme Fast Action Diffusion
BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
ROSE: Rotate Your Large Language Model to See
Beyond Soft Label: Dataset Distillation via Orthogonal Gradient Matching
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
Exemplar-Free Continual Learning for State Space Models
DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
MUFASA: A Multi-Layer Framework for Slot Attention
Prompt Yourself: Awakening Textual Semantics in 1D Visual Tokenizers
M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
Composing Concepts from Images and Videos via Concept-prompt Binding
Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Video Panels for Long Video Understanding
NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
AIMDepth: Asymmetric Image-Event Mamba for Monocular Depth Estimation
NIL: No-data Imitation Learning
Recovering Physically Plausible Human-Object Interactions from Monocular Videos
SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
What Is It Like to Be a Noise? An Entropy-based Gaussian Noise Regularization for Diffusion Models
PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising
Functional Mean Flow in Hilbert Space
Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
CLIP-like Model as a Foundational Density Ratio Estimator
Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport
AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
Group Editing: Edit Multiple Images in One Go
UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
VOSR: A Vision-Only Generative Model for Image Super-Resolution
Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
pH-Strips for Selective Forgetting: A Blunt but Fast Diagnostic Baseline for Machine Unlearning
PlannerRFT: Reinforcing Diffusion Planners through Closed-Loop and Sample-Efficient Fine-Tuning
R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment
CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
MIBURI: Towards Expressive Interactive Gesture Synthesis
OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
FINER: MLLMs Hallucinate under Fine-grained Negative Queries
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
Content-Aware Dynamic Patchification for Efficient Video Diffusion
EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
Explaining CLIP Zero-shot Predictions Through Concepts
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
LAM: Language Articulated Object Modelers
Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection
SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy
BiGain: Unified Token Compression for Joint Generation and Classification
Extend3D: Town-Scale 3D Generation
PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
Dejavu: Towards Experience Feedback Learning for Embodied Intelligence
Generative Neural Video Compression via Video Diffusion Prior
FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
Coverage Optimization for Camera View Selection
DepthFocus: Controllable Depth Estimation for See-Through Scenes
Sampling-Aware Quantization for Diffusion Models
INSID3: Training-Free In-Context Segmentation with DINOv3
Scene-Centric Unsupervised Video Panoptic Segmentation
Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions
Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement
Condensed Test-Time Adaptation of VLMs for Action Recognition
VideoCoF: Unified Video Editing with Temporal Reasoner
Optical Diffraction-based Convolution for Semiconductor Lithography
Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
CountGD++: Generalized Prompting for Open-World Counting
Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning
OSMO: Open-vocabulary Self-eMOtion Tracking
GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning
Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection
ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos
ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
Open the Motion Door: Atomic Motion Decomposition and Recomposition for Open-Vocabulary Motion Generation
LiveGesture: Streamable Co-Speech Gesture Generation Model
InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization
Flowception: Temporally Expansive Flow Matching for Video Generation
Black-Box Domain Adaptation for Object Detection with Retention-Driven Knowledge Compression
CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
SpiderCam: Low-Power Snapshot Depth from Differential Defocus
SceneTok: A Compressed, Diffusable Token Space for 3D Scenes
Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
AURA: Multi-modal Shared Autonomy for Urban Navigation
BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models
Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching
STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Back to Basics: Let Denoising Generative Models Denoise
DiffuView: Multi-View Diffusion Pretraining for 3D Aware Robotic Manipulation
TopoCL: Topological Contrastive Learning for Medical Imaging
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
The Missing Point in Vision Transformers for Universal Image Segmentation
Direction-aware 3D Large Multimodal Models
Exploring Conditions for Diffusion Models in Robotic Control
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Training-free Motion Factorization for Compositional Video Generation
Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception
Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance
Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising
Vocabulary Scaling Law: Tuning Open-vocabulary Predictors for Their Openness
SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images
FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration
Spectral Mixture-of-Experts for Continual Learning
Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation
Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
SelfHVD: Self-Supervised Handheld Video Deblurring
RAVEN: Erasing Invisible Watermarks via Novel View Synthesis
Adaptive Capacity Autoregressive Visual Tracking
RAG-TP: A General Framework for Vehicle Trajectory Prediction via Retrieval-Augmented Generation
A Polynomial Chaos Framework for Causal Discovery in Nonlinear Uncertain Systems
HiDRA: Hierarchical Degradation Representation and Adaptation with Generative Priors for Enhancing Infrared Vision
FlashIn: Fast and Accurate Image Inversion for Real-time Image Editing
Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Diffusion Transformers
PixelDiT: Pixel Diffusion Transformers for Image Generation
RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation
HandWorld: Hand-Centric Unified Video Action Generation
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos
MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
ShreddingNet: Coarse-to-Fine Restoration for Multi-Source Shredded Manuscripts
CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning
Learning to Learn Weight Generation via Local Consistency Diffusion
CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction
LightMover: Generative Light Movement with Color and Intensity Controls
ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
Agentic Video Summarization via Self-Reflecting Multimodal Understanding
CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics
Stabilizing Streaming Video Geometry via Dynamic Feature Normalization
Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models
Linking Modality Isolation in Heterogeneous Collaborative Perception
DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
Dynamic Important Example Mining for Reinforcement Finetuning
PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models
Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
Spatia: Video Generation with Updatable Spatial Memory
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
P-Flow: Prompting Visual Effects Generation
Learning to Select Visual Tools from Experience
Zero-Shot Depth Completion with Vision-Language Model
Batch Loss Score for Dynamic Data Pruning
L3DR: 3D-aware LiDAR Diffusion and Rectification
Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
MVP: Multiple View Prediction Improves GUI Grounding
E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
Detect Anything via Next Point Prediction
SuP: Sub-cloud Driven Point Cloud Registration
DynamicsBoost: Dynamic Plausible Video Generation via Annotation-Free Continuation Preference Optimization
PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
Self-Evaluation Unlocks Any-Step Text-to-Image Generation
Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild
4DP-QA: Scalable QA for 4D Perception in Vision Language Models
DDT: Decoupled Diffusion Transformer
3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning
SeD-UD: An Influence-Driven and Hierarchically-Decoupled Information Bottleneck for Multimodal Intent Recognition
SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
RAAS: LLM Agentic System Architecture Search with GRPO
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection
Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network
What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human–Scene Interaction
MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
Dynamics-Aware Preference Optimization for Vision-Language Models
STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution
EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy
SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping
Learning Differentiable Hierarchies in 3D Gaussian Splatting
LRDUN: A Low-Rank Deep Unfolding Network for Efficient Spectral Compressive Imaging
Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
Model Merging in the Essential Subspace
Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models
iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
Generative Point Tracking and Forecasting
Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer
Region-Wise Correspondence Prediction between Manga Line Art Images
Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis
MeanFlow Transformers with Representation Autoencoders
S2D: Selective Spectral Decay for Quantization-Friendly Conditioning of Neural Activations
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
Unlocking Token Rewards via Training-Free Reward Attribution
VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
Frequency-Aware Flow Matching for High-Quality Image Generation
EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
Tri-Modal Fusion Transformers for UAV-based Object Detection
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs
CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
EgoX: Egocentric Video Generation from a Single Exocentric Video
CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
VideoMaMa: Mask-Guided Video Matting via Generative Prior
PhysGaia: A Physics-aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
Structural Action Transformer for 3D Dexterous Manipulation
Talking Together: Synthesizing Co-Located 3D Conversations from Audio
PositionIC: Unified Position and Identity Consistency for Image Customization
Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning
Pixel2Phys: Distilling Governing Laws from Visual Dynamics
Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
MotionMaster: Generalizable Text-Driven Motion Generation and Editing
Flow Matching for Multimodal Distributions
PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning
HandX: Scaling Bimanual Motion and Interaction Generation
EgoAVU: Egocentric Audio-Visual Understanding
Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement
Spherical Leech Quantization for Visual Tokenization and Generation
Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs
Detecting Unknown Objects via Energy-based Separation for Open World Object Detection
From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs
In Pursuit of Pixel Supervision for Visual Pre-training
Lite Any Stereo: Efficient Zero-Shot Stereo Matching
HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration
GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes
DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease
Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay