Schedule
Timezone: America/Los_Angeles
SUN 18 JUN
7:30 a.m.
Breakfast
7:50 a.m.
Workshop:
Synthetic Data for Autonomous Systems (SDAS)
(ends 12:30 PM)
Workshop:
The Second Workshop on Structural and Compositional Learning on 3D Data
(ends 12:30 PM)
Workshop:
The 2nd International Workshop on Transformers for Vision
(ends 5:30 PM)
8 a.m.
Workshop:
The Fourth Workshop on Fair, Data-efficient, and Trusted Computer Vision
(ends 4:30 PM)
Workshop:
OmniLabel: Infinite label spaces for semantic understanding via natural language
(ends 12:00 PM)
Workshop:
Second Workshop of Mobile Intelligent Photography and Imaging
(ends 12:00 PM)
Workshop:
8th New Trends in Image Restoration and Enhancement Workshop and Challenges
(ends 7:00 PM)
Workshop:
1st Workshop on Multimodal Content Moderation
(ends 6:00 PM)
Workshop:
Third Workshop on Ethical Considerations in Creative Applications of Computer Vision - EC3V
(ends 12:10 PM)
Workshop:
CVPR 2023 - 10th Workshop on Medical Computer Vision (MCV)
(ends 4:00 PM)
Workshop:
The 3rd Workshop of Adversarial Machine Learning on Computer Vision: Art of Robustness
(ends 4:00 PM)
Workshop:
The Second Workshop on 3D Vision and Robotics
(ends 6:00 PM)
Workshop:
New Frontiers for Zero-Shot Image Captioning Evaluation
(ends 12:00 PM)
8:20 a.m.
Workshop:
DL-UIA: Deep Learning in Ultrasound Image Analysis
(ends 12:00 PM)
8:25 a.m.
Workshop:
GAZE 2023: The 5th International Workshop on Gaze Estimation and Prediction in the Wild
(ends 12:00 PM)
8:30 a.m.
Workshop:
LatinX in Computer Vision Research Workshop
(ends 6:00 PM)
Workshop:
19th CVPR Workshop on Perception Beyond the Visible Spectrum (PBVS 2023)
(ends 5:30 PM)
Workshop:
Generative Models for Computer Vision
(ends 5:15 PM)
Workshop:
XRNeRF: Advances in NeRF for the Metaverse
(ends 12:30 PM)
Workshop:
VAND: Visual Anomaly and Novelty Detection
(ends 12:30 PM)
Workshop:
Catch UAVs that Want to Watch You: Detection and Tracking of Unmanned Aerial Vehicle (UAV) in the Wild and the 3rd Anti-UAV Workshop & Challenge
(ends 12:10 PM)
Workshop:
2nd Monocular Depth Estimation Challenge
(ends 12:00 PM)
Workshop:
4th Workshop on Continual Learning in Computer Vision (CLVision)
(ends 5:30 PM)
Workshop:
4th International Workshop on Large Scale Holistic Video Understanding
(ends 12:00 PM)
Tutorial:
Recent advances in anomaly detection
(ends 5:15 PM)
Tutorial:
A Comprehensive Tour and Recent Advancements toward Real-world Visual Geo-Localization
(ends 5:30 PM)
Tutorial:
Efficient Neural Networks: From Algorithm Design to Practical Mobile Deployment
(ends 12:00 PM)
Tutorial:
Recent Advances in Visual Domain Adaptation and Generalization
(ends 11:45 AM)
Tutorial:
Trustworthy AI in the Era of Foundation Models
(ends 11:45 AM)
Tutorial:
ML Systems for Large Models and Federated Learning
(ends 11:45 AM)
Tutorial:
Vision Transformer: More is different
(ends 11:45 AM)
8:45 a.m.
Workshop:
Topological, Algebraic, and Geometric Pattern Recognition with Applications Workshop Proposal
(ends 5:30 PM)
Workshop:
7th Workshop on Media Forensics
(ends 5:30 PM)
Workshop:
FGVC10: 10th Workshop on Fine-grained Visual Categorization
(ends 4:45 PM)
8:55 a.m.
Workshop:
3rd International Workshop and Challenge on Long-form Video Understanding and Generation
(ends 12:35 PM)
9 a.m.
Workshop:
Workshop on End-to-end Autonomous Driving
(ends 6:00 PM)
Workshop:
Visual Perception via Learning in an Open World
(ends 5:00 PM)
Workshop:
Computer Vision for Mixed Reality
(ends 12:30 PM)
Workshop:
CVPR 2023 Biometrics Workshop
(ends 5:30 PM)
Workshop:
12th IEEE International Workshop on Computational Cameras and Displays (CCD)
(ends 5:00 PM)
Workshop:
EarthVision: Large Scale Computer Vision for Remote Sensing Imagery
(ends 5:45 PM)
Workshop:
3rd Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings
(ends 6:00 PM)
Workshop:
Fourth Workshop on Neural Architecture Search, Third lightweight NAS challenge
(ends 6:00 PM)
Workshop:
2nd Workshop on Tracking and Its Many Guises: Tracking Any Object in Open-World
(ends 5:00 PM)
Workshop:
The 4th CVPR Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics
(ends 5:30 PM)
Tutorial:
Denoising Diffusion Models: A Generative Learning Big Bang
(ends 12:30 PM)
Tutorial:
Boosting Computer Vision Research with OpenMMLab and OpenDataLab
(ends 12:00 PM)
Tutorial:
All Things ViTs: Understanding and Interpreting Attention in Vision
(ends 12:00 PM)
9:15 a.m.
Workshop:
Workshop on Autonomous Driving (WAD)
(ends 6:15 PM)
Workshop:
6th Multi-modal Learning and Applications Workshop (MULA)
(ends 6:00 PM)
Workshop:
New Frontiers in Visual Language Reasoning: Compositionality, Prompts and Causality
(ends 5:30 PM)
9:20 a.m.
Workshop:
CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling (Workshop)
(ends 5:45 PM)
9:30 a.m.
Tutorial:
Skull Restoration, Facial Reconstruction and Expression
(ends 11:45 AM)
10 a.m.
Break
11:45 a.m.
Lunch
12:30 p.m.
Workshop:
End-to-End Autonomous Driving: Perception, Prediction, Planning and Simulation
(ends 7:05 PM)
12:45 p.m.
Workshop:
1st Workshop on Compositional 3D Vision & 3DCoMPaT Challenge
(ends 5:45 PM)
Workshop:
6th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues
(ends 6:30 PM)
1 p.m.
Workshop:
High-fidelity Neural Actors
(ends 6:00 PM)
Workshop:
AVA: Accessibility, Vision, and Autonomy Meet
(ends 5:30 PM)
Workshop:
4D Hand Object Interaction: Geometric Understanding and Applications in Dexterous Manipulation
(ends 6:00 PM)
Workshop:
QCVML: Quantum Computer Vision and Machine Learning Workshop
(ends 6:00 PM)
Workshop:
Computer Vision for Fashion, Art, and Design
(ends 5:30 PM)
Workshop:
1st workshop on Capturing, Interpreting & Visualizing Indoor Living Spaces
(ends 6:00 PM)
Workshop:
The Fifth Workshop on Precognition: Seeing through the Future
(ends 5:00 PM)
1:30 p.m.
Workshop:
2nd Workshop and Challenge on Vision Datasets Understanding
(ends 5:30 PM)
Workshop:
The Fourth Workshop on Face and Gesture Analysis for Health Informatics (FGAHI)
(ends 6:00 PM)
Workshop:
The 3rd Workshop on Light Fields for Computer Vision LFNAT: New Applications and Trends in Light Fields
(ends 6:15 PM)
Workshop:
Pixel-level Video Understanding in the Wild Challenge
(ends 5:30 PM)
Tutorial:
Contactless Healthcare using Cameras and Wireless Sensors
(ends 5:00 PM)
Tutorial:
Large-scale Deep Learning Optimization Techniques
(ends 5:00 PM)
3 p.m.
Break
MON 19 JUN
7:30 a.m.
Breakfast
8 a.m.
Workshop:
Vision-Centric Autonomous Driving (VCAD)
(ends 5:00 PM)
Workshop:
4th International Workshop on Event-based Vision
(ends 6:00 PM)
Workshop:
5th Workshop and Competition on Affective Behavior Analysis in-the-wild
(ends 12:30 PM)
Workshop:
3DMV: Learning 3D with Multi-View Supervision
(ends 12:15 PM)
Workshop:
The Sixth International Workshop on Computer Vision for Physiological Measurement (CVPM)
(ends 12:00 PM)
Workshop:
8th Workshop on Computer Vision for Microscopy Image Analysis
(ends 6:00 PM)
Workshop:
Multi-Agent Behavior: Properties, Computation and Emergence
(ends 12:30 PM)
Workshop:
Image Matching: Local Features and Beyond
(ends 12:30 PM)
Workshop:
4th Agriculture-Vision Workshop: Challenges & Opportunities for Computer Vision in Agriculture
(ends 12:00 PM)
Workshop:
3rd Mobile AI Workshop and Challenges
(ends 6:00 PM)
Workshop:
7th AI City Challenge Workshop
(ends 5:30 PM)
8:15 a.m.
Workshop:
VizWiz Grand Challenge: Describing Images and Videos Taken by Blind People
(ends 12:00 PM)
8:20 a.m.
Workshop:
First Rhobin Challenge - Reconstruction of human-object interaction
(ends 12:30 PM)
8:30 a.m.
Workshop:
Women in Computer Vision Workshop
(ends 1:00 PM)
Workshop:
The 6th Workshop and Prize Challenge Bridging the Gap between Computational Photography and Visual Recognition (UG2+) in conjunction with IEEE CVPR 2023
(ends 5:00 PM)
Workshop:
Workshop on Vision-based InduStrial InspectiON (VISION)
(ends 6:00 PM)
Workshop:
EMBEDDED VISION WORKSHOP 2023
(ends 5:00 PM)
Workshop:
2nd Workshop on Federated Learning for Computer Vision
(ends 5:30 PM)
Workshop:
Joint 3rd Ego4D and 11th EPIC Workshop on Egocentric Vision
(ends 6:45 PM)
Workshop:
O-DRUM: Workshop on Open-Domain Reasoning Under Multi-Modal Settings
(ends 5:00 PM)
Tutorial:
Large-Scale Visual Localization
(ends 12:15 PM)
Tutorial:
Object localization for free: Going beyond self-supervised learning
(ends 12:00 PM)
Tutorial:
Recent Advances in Vision Foundation Models
(ends 12:30 PM)
8:45 a.m.
Workshop:
Secure and Safe Autonomous Driving Workshop and Challenge (SSAD)
(ends 5:00 PM)
8:50 a.m.
Workshop:
The 6th Efficient Deep Learning for Computer Vision
(ends 6:35 PM)
9 a.m.
Workshop:
The 2nd Explainable AI for Computer Vision (XAI4CV) Workshop
(ends 5:30 PM)
Workshop:
Safe Artificial Intelligence for All Domains
(ends 5:00 PM)
Workshop:
The Fifth Workshop on Deep Learning for Geometric Computing
(ends 5:00 PM)
Workshop:
AI for Content Creation
(ends 6:15 PM)
Workshop:
Visual Copy Detection Workshop
(ends 12:00 PM)
Workshop:
Computer Vision in the Wild
(ends 5:30 PM)
Workshop:
Sight and Sound
(ends 6:00 PM)
Workshop:
L3D-IVU: 2nd Workshop on Learning with Limited Labelled Data for Image and Video Understanding
(ends 5:10 PM)
Workshop:
9th IEEE International Workshop on Computer Vision in Sports (CVsports)
(ends 5:30 PM)
Workshop:
Workshop on Foundation Models: 1st Foundation Model Challenge
(ends 12:30 PM)
Workshop:
Visual Pre-training for Robotics
(ends 5:30 PM)
Workshop:
4th Embodied AI Workshop
(ends 5:30 PM)
Workshop:
Vision for All Seasons: Adverse Weather and Lighting Conditions
(ends 5:30 PM)
Tutorial:
Polarization-based Computer Vision
(ends 12:00 PM)
Tutorial:
Multi-Objective Optimization for Deep Learning
(ends 12:00 PM)
Tutorial:
All you need to know about self-driving
(ends 5:00 PM)
Tutorial:
Few-shot Learning from Meta-Learning, Statistical Understanding to Applications
(ends 12:30 PM)
Tutorial:
Prompting in Vision
(ends 12:00 PM)
Tutorial:
Reverse Engineering of Deception (RED): Foundations and Applications
(ends 12:00 PM)
Tutorial:
Rolling Shutter Camera: Modeling, Optimization, Learning, and Hardware
(ends 12:00 PM)
Tutorial:
Hyperbolic Deep Learning in Computer Vision
(ends 12:00 PM)
Tutorial:
Deep Learning Theory for Computer Vision
(ends 12:00 PM)
Tutorial:
Automatic 3D modeling of indoor structures from panoramic imagery
(ends 12:30 PM)
Tutorial:
Optics for Better AI: Capturing and Synthesizing Realistic Data for Low-light Enhancement
(ends 12:00 PM)
Tutorial:
Knowledge-Driven Vision-Language Encoding
(ends 12:30 PM)
10 a.m.
Break
11:45 a.m.
Lunch
12:45 p.m.
Workshop:
Scholars and Big Models — How Can Academics Adapt?
(ends 6:05 PM)
1 p.m.
Workshop:
The 4th Workshop on Omnidirectional Computer Vision
(ends 6:00 PM)
Workshop:
Photogrammetric Computer Vision
(ends 6:00 PM)
Workshop:
2nd Challenge on Machine Visual Common Sense: Perception, Prediction, Planning
(ends 6:00 PM)
Workshop:
RetailVision - Revolutionizing the World of Retail
(ends 6:00 PM)
Workshop:
2nd Workshop on Multimodal Learning for Earth and Environment (MultiEarth)
(ends 5:00 PM)
1:30 p.m.
Workshop:
5th ScanNet Indoor Scene Understanding Challenge
(ends 5:30 PM)
Workshop:
The 4th Face Anti-spoofing Workshop and Challenge
(ends 5:30 PM)
Workshop:
DynaVis: The 4th International Workshop on Dynamic Scene Reconstruction
(ends 5:20 PM)
Tutorial:
Neural Search in Action
(ends 4:30 PM)
Tutorial:
Physics-based rendering and its applications in computational photography and imaging
(ends 5:00 PM)
Tutorial:
Exploring Synthetic data as an Enterprise Capability for Training and Validating CV Systems
(ends 4:30 PM)
Tutorial:
Full-Stack, GPU-based Acceleration of Deep Learning
(ends 5:00 PM)
Tutorial:
Hands-on Egocentric Research with Project Aria from Meta
(ends 5:00 PM)
3 p.m.
Break
TUE 20 JUN
7:30 a.m.
Breakfast
8:30 a.m.
Opening Ceremony:
Opening Ceremony
(ends 9:00 AM)
9 a.m.
Keynote:
Revisiting Old Ideas With Modern Hardware
Rodney Brooks
(ends 10:00 AM)
10 a.m.
Break
10:30 a.m.
Poster Session TUE-AM
[10:30-12:00]
Megahertz Light Steering Without Moving Parts
Robust Dynamic Radiance Fields
DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields
VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization
AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training
SeaThru-NeRF: Neural Radiance Fields in Scattering Media
Exact-NeRF: An Exploration of a Precise Volumetric Parameterization for Neural Radiance Fields
Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos
PlenVDB: Memory Efficient VDB-Based Radiance Fields for Fast Training and Rendering
Local Implicit Ray Function for Generalizable Radiance Field Representation
SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes
Frequency-Modulated Point Cloud Rendering With Easy Editing
HexPlane: A Fast Representation for Dynamic Scenes
Differentiable Shadow Mapping for Efficient Inverse Graphics
Hybrid Neural Rendering for Large-Scale Scenes With Motion Blur
TensoIR: Tensorial Inverse Rendering
ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision
Realistic Saliency Guided Image Enhancement
LightPainter: Interactive Portrait Relighting With Freehand Scribble
A Unified Spatial-Angular Structured Light for Single-View Acquisition of Shape and Reflectance
Learning Visibility Field for Detailed 3D Human Reconstruction and Relighting
Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses
NeUDF: Leaning Neural Unsigned Distance Fields With Volume Rendering
NeAT: Learning Neural Implicit Surfaces With Arbitrary Topologies From Multi-View Images
ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction
Controllable Mesh Generation Through Sparse Latent Point Diffusion Models
Power Bundle Adjustment for Large-Scale 3D Reconstruction
Neural Pixel Composition for 3D-4D View Synthesis From Multi-Views
Magic3D: High-Resolution Text-to-3D Content Creation
3D Video Loops From Asynchronous Input
High-Fidelity 3D GAN Inversion by Pseudo-Multi-View Optimization
Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field
3D GAN Inversion With Facial Symmetry Prior
StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping
FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction
Robust Model-Based Face Reconstruction Through Weakly-Supervised Outlier Segmentation
Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild
A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction From In-the-Wild Images
BlendFields: Few-Shot Example-Driven Facial Modeling
Implicit Neural Head Synthesis via Controllable Local Deformation Fields
DPE: Disentanglement of Pose and Expression for General Video Portrait Editing
GANHead: Towards Generative Animatable Neural Head Avatars
EDGE: Editable Dance Generation From Music
Blowing in the Wind: CycleNet for Human Cinemagraphs From Still Images
Generating Holistic 3D Human Motion From Speech
Avatars Grow Legs: Generating Smooth Human Motion From Sparse Tracking Inputs With Diffusion Model
Learning Anchor Transformations for 3D Garment Animation
CloSET: Modeling Clothed Humans on Continuous Surface With Explicit Template Decomposition
ECON: Explicit Clothed Humans Optimized via Normal Integration
PersonNeRF: Personalized Reconstruction From Photo Collections
3D Human Mesh Estimation From Virtual Markers
Overcoming the Trade-Off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction
Recovering 3D Hand Mesh Sequence From a Single Blurry Image: A New Dataset and Temporal Unfolding
MeMaHand: Exploiting Mesh-Mano Interaction for Single Image Two-Hand Reconstruction
PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation
CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis
Instant-NVR: Instant Neural Volumetric Rendering for Human-Object Interactions From Monocular RGBD Stream
BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects
Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes
Omnimatte3D: Associating Objects and Their Effects in Unconstrained Monocular Video
On the Benefits of 3D Pose and Tracking for Human Action Recognition
Towards Stable Human Pose Estimation via Cross-View Fusion and Foot Stabilization
Human Pose As Compositional Tokens
PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation
SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments
Semi-Supervised 2D Human Pose Estimation Driven by Position Inconsistency Pseudo Label Correction Module
Human Pose Estimation in Extremely Low-Light Conditions
Flexible-Cm GAN: Towards Precise 3D Dose Prediction in Radiotherapy
DualRefine: Self-Supervised Depth and Pose Estimation Through Iterative Epipolar Sampling and Refinement Toward Equilibrium
A Rotation-Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization
Semidefinite Relaxations for Robust Multiview Triangulation
A Probabilistic Attention Model With Occlusion-Aware Texture Regression for 3D Hand Reconstruction From a Single RGB Image
Instant Multi-View Head Capture Through Learnable Registration
On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks
Learning 3D Scene Priors With 2D Supervision
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation
OpenScene: 3D Scene Understanding With Open Vocabularies
Multi-View Azimuth Stereo via Tangent Space Consistency
Progressive Transformation Learning for Leveraging Virtual Images in Training
Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries
NeRF-Supervised Deep Stereo
Semantic Scene Completion With Cleaner Self
PanelNet: Understanding 360 Indoor Environment via Panel Representation
Implicit View-Time Interpolation of Stereo Videos Using Multi-Plane Disparities and Non-Uniform Coordinates
Depth Estimation From Indoor Panoramas With Neural Scene Representation
NeuralPCI: Spatio-Temporal Neural Field for 3D Point Cloud Multi-Frame Non-Linear Interpolation
RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo
NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization
MACARONS: Mapping and Coverage Anticipation With RGB Online Self-Supervision
vMAP: Vectorised Object Mapping for Neural Field SLAM
Seeing a Rose in Five Thousand Ways
Propagate and Calibrate: Real-Time Passive Non-Line-of-Sight Tracking
Seeing With Sound: Long-range Acoustic Beamforming for Multimodal Scene Understanding
Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection
BEV-LaneDet: An Efficient 3D Lane Detection Based on Virtual Camera via Key-Points
AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers
Benchmarking Robustness of 3D Object Detection to Common Corruptions
Gaussian Label Distribution Learning for Spherical Image Object Detection
Deep Depth Estimation From Thermal Image
LidarGait: Benchmarking 3D Gait Recognition With Point Clouds
Generalized UAV Object Detection via Frequency Domain Disentanglement
Learning Compact Representations for LiDAR Completion and Generation
CXTrack: Improving 3D Point Cloud Tracking With Contextual Information
Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline
LinK: Linear Kernel for LiDAR-Based 3D Perception
Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting
Curricular Object Manipulation in LiDAR-Based Object Detection
Delivering Arbitrary-Modal Semantic Segmentation
Robust Outlier Rejection for 3D Registration With Variational Bayes
3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels
Self-Supervised Pre-Training With Masked Shape Prediction for 3D Scene Understanding
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding
Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation
FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer
PointCMP: Contrastive Mask Prediction for Self-Supervised Learning on Point Cloud Videos
E2PN: Efficient SE(3)-Equivariant Point Network
Poly-PC: A Polyhedral Network for Multiple Point Cloud Tasks at Once
Improving Graph Representation for Point Cloud Segmentation via Attentive Filtering
BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration
TopDiG: Class-Agnostic Topological Directional Graph Extraction From Remote Sensing Images
Recognizing Rigid Patterns of Unlabeled Point Clouds by Complete and Continuous Isometry Invariants With No False Negatives and No False Positives
Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation
CCuantuMM: Cycle-Consistent Quantum-Hybrid Matching of Multiple Shapes
Enhancing Deformable Local Features by Jointly Learning To Detect and Describe Keypoints
Understanding and Improving Features Learned in Deep Functional Maps
High-Frequency Stereo Matching Network
Rethinking Optical Flow From Geometric Matching Consistent Perspective
Efficient Robust Principal Component Analysis via Block Krylov Iteration and CUR Decomposition
VectorFloorSeg: Two-Stream Graph Attention Network for Vectorized Roughcast Floorplan Segmentation
TBP-Former: Learning Temporal Bird’s-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving
Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving
UniSim: A Neural Closed-Loop Sensor Simulator
FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-Tail Trajectory Prediction
EqMotion: Equivariant Multi-Agent Motion Prediction With Invariant Interaction Reasoning
Lookahead Diffusion Probabilistic Models for Refining Mean Estimation
Neural Volumetric Memory for Visual Locomotion Control
Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
DrapeNet: Garment Generation and Self-Supervised Draping
Tracking Multiple Deformable Objects in Egocentric Videos
Good Is Bad: Causality Inspired Cloth-Debiasing for Cloth-Changing Person Re-Identification
Micron-BERT: BERT-Based Facial Micro-Expression Recognition
MARLIN: Masked Autoencoder for Facial Video Representation LearnINg
StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator
RealImpact: A Dataset of Impact Sound Fields for Real Objects
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Progressive Spatio-Temporal Alignment for Efficient Event-Based Motion Estimation
Event-Based Shape From Polarization
Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution
BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation
A Unified Pyramid Recurrent Network for Video Frame Interpolation
Event-Based Blurry Frame Interpolation Under Blind Exposure
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation
POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery
Adaptive Patch Deformation for Textureless-Resilient Multi-View Stereo
On the Difficulty of Unpaired Infrared-to-Visible Video Translation: Fine-Grained Content-Rich Patches Transfer
Thermal Spread Functions (TSF): Physics-Guided Material Classification
Better “CMOS” Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution
Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement
CutMIB: Boosting Light Field Super-Resolution via Multi-View Image Blending
sRGB Real Noise Synthesizing With Neighboring Correlation-Aware Noise Model
Masked Image Training for Generalizable Deep Image Denoising
DR2: Diffusion-Based Robust Degradation Remover for Blind Face Restoration
Learning Distortion Invariant Representation for Image Restoration From a Causality Perspective
Perception-Oriented Single Image Super-Resolution Using Optimal Objective Estimation
Catch Missing Details: Image Reconstruction With Frequency Augmented Variational Autoencoder
MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos
CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network With Large Input
Initialization Noise in Image Gradients and Saliency Maps
Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution
Deep Arbitrary-Scale Image Super-Resolution via Scale-Equivariance Pursuit
CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution
Multiplicative Fourier Level of Detail
Document Image Shadow Removal Guided by Color-Aware Background
StyleRes: Transforming the Residuals for Real Image Editing With StyleGAN
TopNet: Transformer-Based Object Placement Network for Image Compositing
VecFontSDF: Learning To Reconstruct and Synthesize High-Quality Vector Fonts via Signed Distance Functions
CF-Font: Content Fusion for Few-Shot Font Generation
SIEDOB: Semantic Image Editing by Disentangling Object and Background
MaskSketch: Unpaired Structure-Guided Masked Image Generation
Text2Scene: Text-Driven Indoor Scene Stylization With Part-Aware Details
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
Multi-Concept Customization of Text-to-Image Diffusion
Unifying Layout Generation With a Decoupled Diffusion Model
BBDM: Image-to-Image Translation With Brownian Bridge Diffusion Models
Towards Practical Plug-and-Play Diffusion Models
Post-Training Quantization on Diffusion Models
DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation
Mask-Guided Matting in the Wild
Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation
Compression-Aware Video Super-Resolution
Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models
DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos
Polynomial Implicit Neural Representations for Large Diverse Datasets
Learning Decorrelated Representations Efficiently Using Fast Fourier Transform
SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution
Slide-Transformer: Hierarchical Vision Transformer With Local Self-Attention
Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
Top-Down Visual Attention From Analysis by Synthesis
Probing Neural Representations of Scene Perception in a Hippocampally Dependent Task Using Artificial Neural Networks
Masked Image Modeling With Local Multi-Scale Reconstruction
Siamese Image Modeling for Self-Supervised Vision Representation Learning
MAGE: MAsked Generative Encoder To Unify Representation Learning and Image Synthesis
Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification
DistilPose: Tokenized Pose Regression With Heatmap Distillation
Graph Transformer GANs for Graph-Constrained House Generation
Automatic High Resolution Wire Segmentation and Removal
Tree Instance Segmentation With Temporal Contour Graph
Dual-Path Adaptation From Image to Video Transformers
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning
Masked Motion Encoding for Self-Supervised Video Representation Learning
Boosting Video Object Segmentation via Space-Time Correspondence Learning
Two-Shot Video Object Segmentation
Look Before You Match: Instance Understanding Matters in Video Object Segmentation
Spatial-Then-Temporal Self-Supervised Learning for Video Correspondence
Few-Shot Referring Relationships in Videos
Vision Transformers Are Parameter-Efficient Audio-Visual Learners
Egocentric Video Task Translation
QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation
Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
How Can Objects Help Action Recognition?
Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition
Decomposed Cross-Modal Distillation for RGB-Based Temporal Action Detection
ASPnet: Action Segmentation With Shared-Private Representation of Multiple Data Sources
Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization
LOGO: A Long-Form Video Dataset for Group Action Quality Assessment
Use Your Head: Improving Long-Tail Video Recognition
Conditional Generation of Audio From Video via Foley Analogies
Weakly Supervised Video Representation Learning With Unaligned Text for Sequential Videos
You Can Ground Earlier Than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
Connecting Vision and Language With Video Localized Narratives
Video-Text As Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Make-a-Story: Visual Memory Conditioned Consistent Story Generation
Test of Time: Instilling Video-Language Models With a Sense of Time
How You Feelin’? Learning Emotions and Mental States in Movie Scenes
Continuous Sign Language Recognition With Correlation Network
DIP: Dual Incongruity Perceiving Network for Sarcasm Detection
Gloss Attention for Gloss-Free Sign Language Translation
Object-Goal Visual Navigation via Effective Exploration of Relations Among Historical Navigation States
Behavioral Analysis of Vision-and-Language Navigation Agents
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
Where Is My Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization
Efficient Multimodal Fusion via Interactive Prompting
NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations
Dynamic Inference With Grounding Based Vision and Language Models
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Teaching Structured Vision & Language Concepts to Vision & Language Models
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Learning From Unique Perspectives: User-Aware Saliency Modeling
CRAFT: Concept Recursive Activation FacTorization for Explainability
Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Sketch2Saliency: Learning To Detect Salient Objects From Human Drawings
PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification
Photo Pre-Training, but for Sketch
CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Multi-Modal Representation Learning With Text-Driven Soft Masks
Texts as Images in Prompt Tuning for Multi-Label Image Recognition
Reproducible Scaling Laws for Contrastive Language-Image Learning
Multilateral Semantic Relations Modeling for Image Text Retrieval
SmallCap: Lightweight Image Captioning Prompted With Retrieval Augmentation
Probing Sentiment-Oriented Pre-Training Inspired by Human Sentiment Perception Mechanism
Prefix Conditioning Unifies Language and Label Supervision
Crossing the Gap: Domain Generalization for Image Captioning
A Bag-of-Prototypes Representation for Dataset-Level Applications
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model
D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers
Learning To Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space
Relational Context Learning for Human-Object Interaction Detection
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision
Side Adapter Network for Open-Vocabulary Semantic Segmentation
Open-Vocabulary Panoptic Segmentation With Text-to-Image Diffusion Models
IFSeg: Image-Free Semantic Segmentation via Vision-Language Model
PartManip: Learning Cross-Category Generalizable Part Manipulation Policy From Point Cloud Observations
OneFormer: One Transformer To Rule Universal Image Segmentation
Delving Into Shape-Aware Zero-Shot Semantic Segmentation
CoMFormer: Continual Learning in Semantic and Panoptic Segmentation
Learning To Segment Every Referring Object Point by Point
Unsupervised Continual Semantic Adaptation Through Neural Rendering
Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation
Transformer Scale Gate for Semantic Segmentation
Style Projected Clustering for Domain Generalized Semantic Segmentation
Rethinking Few-Shot Medical Segmentation: A Vector Quantization View
Continual Semantic Segmentation With Automatic Memory Sample Selection
Token Contrast for Weakly-Supervised Semantic Segmentation
Multi-Granularity Archaeological Dating of Chinese Bronze Dings Based on a Knowledge-Guided Relation Graph
Hunting Sparsity: Density-Guided Contrastive Learning for Semi-Supervised Semantic Segmentation
Cut and Learn for Unsupervised Object Detection and Instance Segmentation
Extracting Class Activation Maps From Non-Discriminative Features As Well
BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation
Hierarchical Fine-Grained Image Forgery Detection and Localization
Towards Professional Level Crowd Annotation of Expert Domain Data
Unsupervised Object Localization: Observing the Background To Discover Objects
Semi-Supervised Learning Made Simple With Self-Supervised Clustering
Unbalanced Optimal Transport: A Unified Framework for Object Detection
DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection
CLIP the Gap: A Single Domain Generalization Approach for Object Detection
Unknown Sniffer for Object Detection: Don’t Turn a Blind Eye to Unknown Objects
Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection
Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection
AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection
Where Is My Spot? Few-Shot Image Generation via Latent Subspace Optimization
Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection
MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition
Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation
RepMode: Learning to Re-Parameterize Diverse Experts for Subcellular Structure Prediction
Topology-Guided Multi-Class Cell Context Generation for Digital Pathology
Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets
Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning
Learning Expressive Prompting With Residuals for Vision Transformers
Decoupling MaxLogit for Out-of-Distribution Detection
Exploring Structured Semantic Prior for Multi Label Recognition With Incomplete Labels
Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification
DivClust: Controlling Diversity in Deep Clustering
Deep Semi-Supervised Metric Learning With Mixed Label Propagation
Leveraging Inter-Rater Agreement for Classification in the Presence of Noisy Labels
Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery
Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery
Towards Realistic Long-Tailed Semi-Supervised Learning: Consistency Is All You Need
PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery
Probabilistic Knowledge Distillation of Face Ensembles
Class-Conditional Sharpness-Aware Minimization for Deep Long-Tailed Recognition
Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization
Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection
MOT: Masked Optimal Transport for Partial Domain Adaptation
TOPLight: Lightweight Neural Networks With Task-Oriented Pretraining for Visible-Infrared Recognition
OSAN: A One-Stage Alignment Network To Unify Multimodal Alignment and Unsupervised Domain Adaptation
Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective
ARO-Net: Learning Implicit Fields From Anchored Radial Observations
A Probabilistic Framework for Lifelong Test-Time Adaptation
Distribution Shift Inversion for Out-of-Distribution Prediction
Learning Joint Latent Space EBM Prior Model for Multi-Layer Generator
A Data-Based Perspective on Transfer Learning
A Meta-Learning Approach to Predicting Performance and Data Requirements
Guided Recommendation for Model Fine-Tuning
EMT-NAS: Transferring Architectural Knowledge Between Tasks From Different Datasets
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning
Batch Model Consolidation: A Multi-Task Model Consolidation Framework
SmartAssign: Learning a Smart Knowledge Assignment Strategy for Deraining and Desnowing
TinyMIM: An Empirical Study of Distilling MIM Pre-Trained Models
Computationally Budgeted Continual Learning: What Does Matter?
GradMA: A Gradient-Memory-Based Accelerated Federated Learning With Alleviated Catastrophic Forgetting
Rethinking Gradient Projection Continual Learning: Stability / Plasticity Feature Space Decoupling
Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation
Generalizing Dataset Distillation via Deep Generative Prior
Minimizing the Accumulated Trajectory Error To Improve Dataset Distillation
Slimmable Dataset Condensation
Sharpness-Aware Gradient Matching for Domain Generalization
Dynamic Neural Network for Multi-Task Learning Searching Across Diverse Network Topologies
SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries
VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution
Efficient On-Device Training via Gradient Filtering
Are Data-Driven Explanations Robust Against Out-of-Distribution Data?
BiasAdv: Bias-Adversarial Augmentation for Model Debiasing
Q-DETR: An Efficient Low-Bit Quantized Detection Transformer
NIPQ: Noise Proxy-Based Integrated Pseudo-Quantization
CUDA: Convolution-Based Unlearnable Datasets
KD-DLGAN: Data Limited Image Generation via Knowledge Distillation
Spider GAN: Leveraging Friendly Neighbors To Accelerate GAN Training
Efficient Verification of Neural Networks Against LVM-Based Specifications
Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining
DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection
OmniAL: A Unified CNN Framework for Unsupervised Anomaly Localization
Federated Incremental Semantic Segmentation
Re-Thinking Federated Active Learning Based on Inter-Class Diversity
Federated Domain Generalization With Generalization Adjustment
On the Effectiveness of Partial Variance Reduction in Federated Learning With Heterogeneous Data
The Resource Problem of Using Linear Layer Leakage Attack in Federated Learning
Unlearnable Clusters: Towards Label-Agnostic Unlearnable Examples
Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization
Backdoor Defense via Adaptively Splitting Poisoned Dataset
How to Backdoor Diffusion Models?
TrojViT: Trojan Insertion in Vision Transformers
TrojDiff: Trojan Attacks on Diffusion Models With Diverse Targets
Ensemble-Based Blackbox Attacks on Dense Prediction
Efficient Loss Function by Minimizing the Detrimental Effect of Floating-Point Errors on Gradient-Based Attacks
The Best Defense Is a Good Offense: Adversarial Augmentation Against Adversarial Attacks
Adversarial Robustness via Random Projection Filters
Jedi: Entropy-Based Localization and Removal of Adversarial Patches
Exploring the Relationship Between Architectural Design and Adversarially Robust Generalization
Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions
Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition
AltFreezing for More General Video Face Forgery Detection
(ends 12:00 PM)
12:30 p.m.
Lunch
2 p.m.
Panel:
History and Future of Artificial Intelligence and Computer Vision
(ends 3:00 PM)
3 p.m.
Award:
Award Candidates TUE
(ends 4:00 PM)
4 p.m.
Break
4:30 p.m.
Poster Session TUE-PM
[4:30-6:00]
Passive Micron-Scale Time-of-Flight With Sunlight Interferometry
F2-NeRF: Fast Neural Radiance Field Training With Free Camera Trajectories
NoPe-NeRF: Optimising Neural Radiance Field With No Pose Prior
BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields
DiffusioNeRF: Regularizing Neural Radiance Fields With Denoising Diffusion Models
SPARF: Neural Radiance Fields From Sparse and Noisy Poses
Interactive Segmentation of Radiance Fields
Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields
Compressing Volumetric Radiance Fields to 1 MB
Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis
Ref-NPR: Reference-Based Non-Photorealistic Radiance Fields for Controllable Scene Stylization
Representing Volumetric Videos As Dynamic MLP Maps
Fast Monocular Scene Reconstruction With Global-Sparse Local-Dense Grids
DynIBaR: Neural Dynamic Image-Based Rendering
Plateau-Reduced Differentiable Path Tracing
NeFII: Inverse Rendering for Reflectance Decomposition With Near-Field Indirect Illumination
WildLight: In-the-Wild Inverse Rendering With a Flashlight
Relightable Neural Human Assets From Multi-View Gradient Illuminations
DiffRF: Rendering-Guided 3D Radiance Field Diffusion
Analyzing Physical Impacts Using Transient Surface Wave Imaging
Neural Kaleidoscopic Space Sculpting
Towards Unbiased Volume Rendering of Neural Implicit Surfaces With Geometry Priors
Neural Kernel Surface Reconstruction
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency
Shape, Pose, and Appearance From a Single Image via Bootstrapped Radiance Field Inversion
DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis
Heat Diffusion Based Multi-Scale and Geometric Structure-Aware Transformer for Mesh Segmentation
Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis From Monocular Image
3D-Aware Conditional Image Synthesis
VIVE3D: Viewpoint-Independent Video Editing Using 3D-Aware GANs
SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation
Generating Part-Aware Editable 3D Shapes Without 3D Supervision
NeuralLift-360: Lifting an In-the-Wild 2D Photo to a 3D Object With 360° Views
Implicit Identity Driven Deepfake Face Swapping Detection
Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields
Improving Fairness in Facial Albedo Estimation via Visual-Textual Cues
High-Fidelity 3D Face Generation From Natural Language Descriptions
DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment
High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors
3DAvatarGAN: Bridging Domains for Personalized Editable Avatars
RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion
Instant Volumetric Head Avatars
Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement
3D Cinemagraphy From a Single Image
TryOnDiffusion: A Tale of Two UNets
Diverse 3D Hand Gesture Prediction From Body Dynamics by Bilateral Hand Disentanglement
Normal-Guided Garment UV Prediction for Human Re-Texturing
REC-MV: REconstructing 3D Dynamic Cloth From Monocular Videos
SeSDF: Self-Evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction
Unsupervised Volumetric Animation
Handy: Towards a High Fidelity 3D Hand Shape and Appearance Model
Fantastic Breaks: A Dataset of Paired 3D Scans of Real-World Broken Objects and Their Complete Counterparts
Distilling Neural Fields for Real-Time Articulated Shape Reconstruction
GANmouflage: 3D Object Nondetection With Texture Fields
3D Human Pose Estimation via Intuitive Physics
Object Pop-Up: Can We Infer 3D Objects and Their Poses From Human Interactions Alone?
UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy
Constrained Evolutionary Diffusion Filter for Monocular Endoscope Tracking
Visibility Aware Human-Object Interaction Tracking From Single RGB Camera
Transformer-Based Unified Recognition of Two Hands Manipulating Objects
HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation
3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention
GFPose: Learning 3D Human Pose Prior With Gradient Fields
JRDB-Pose: A Large-Scale Dataset for Multi-Person Pose Estimation and Tracking
Analyzing and Diagnosing Pose Estimation With Attributions
Shape-Constraint Recurrent Flow for 6D Object Pose Estimation
TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation
Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery From Sparse Image Ensemble
Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution
Revisiting the P3P Problem
Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories
MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices
EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation Without Scene Supervision
GINA-3D: Learning To Generate Implicit Neural Assets in the Wild
Habitat-Matterport 3D Semantics Dataset
BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image
Panoptic Compositional Feature Field for Editable Scene Rendering With Network-Inferred Labels via Metric Learning
A Light Touch Approach to Teaching Transformers Multi-View Geometry
Learning To Render Novel Views From Wide-Baseline Stereo Pairs
Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo
EventNeRF: Neural Radiance Fields From a Single Colour Event Camera
LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles
Generating Aligned Pseudo-Supervision From Non-Aligned Data for Image Restoration in Under-Display Camera
Spatio-Focal Bidirectional Disparity Estimation From a Dual-Pixel Image
Trap Attention: Monocular Depth Estimation With Manual Traps
Accelerated Coordinate Encoding: Learning to Relocalize in Minutes Using RGB and Poses
Energy-Efficient Adaptive 3D Sensing
Incremental 3D Semantic Scene Graph Prediction From RGB Sequences
Consistent Direct Time-of-Flight Video Depth Super-Resolution
Learning To Zoom and Unzoom
FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection
3D Video Object Detection With Learnable Object-Centric Global Optimization
UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird’s-Eye View
ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data
Deep Dive Into Gradients: Better Optimization for 3D Object Detection With Gradient-Corrected IoU Supervision
SlowLiDAR: Increasing the Latency of LiDAR-Based Detection Using Adversarial Examples
Normalizing Flow Based Feature Synthesis for Outlier-Aware Object Detection
OcTr: Octree-Based Transformer for 3D Object Detection
HypLiLoc: Towards Effective LiDAR Pose Regression With Hyperbolic Fusion
LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation
MSF: Motion-Guided Sequential Fusion for Efficient 3D Object Detection From Point Cloud Sequences
SFD2: Semantic-Guided Feature Detection and Description
Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving
Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast for Autonomous Driving
RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving
Spatiotemporal Self-Supervised Learning for Point Clouds in the Wild
Change-Aware Sampling and Contrastive Learning for Satellite Images
Self-Supervised 3D Scene Flow Estimation Guided by Superpoints
SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection
CP3: Channel Pruning Plug-In for Point-Based Networks
Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis
Hyperspherical Embedding for Point Cloud Completion
Attention-Based Point Cloud Edge Sampling
Starting From Non-Parametric Networks for 3D Point Cloud Analysis
Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent With Learned Distance Functions
SE-ORNet: Self-Ensembling Orientation-Aware Network for Unsupervised Point Cloud Shape Correspondence
Robust 3D Shape Classification via Non-Local Graph Attention Network
Rotation-Invariant Transformer for Point Cloud Matching
Deep Graph-Based Spatial Consistency for Robust Non-Rigid Point Cloud Registration
Efficient RGB-T Tracking via Cross-Modality Distillation
Finding Geometric Models by Clustering in the Consensus Space
Adaptive Assignment for Geometry Aware Local Feature Matching
Masked Representation Learning for Domain Generalized Stereo Matching
Learning Optical Expansion From Scale Matching
AnyFlow: Arbitrary Scale Optical Flow With Implicit Neural Representation
HouseDiffusion: Vector Floorplan Generation via a Diffusion Model With Discrete and Continuous Denoising
Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving
V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting
ViP3D: End-to-End Visual Trajectory Prediction via 3D Agent Queries
IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction
Leapfrog Diffusion Model for Stochastic Trajectory Prediction
DeFeeNet: Consecutive 3D Human Motion Prediction With Deviation Feedback
Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation
ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection
Feature Shrinkage Pyramid for Camouflaged Object Detection With Transformers
OVTrack: Open-Vocabulary Multiple Object Tracking
GaitGCI: Generative Counterfactual Intervention for Gait Recognition
Multi-Label Compound Expression Recognition: C-EXPR Database & Network
Blemish-Aware and Progressive Face Retouching With Limited Paired Data
High-Fidelity and Freely Controllable Talking Head Video Generation
3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition
UDE: A Unified Driving Engine for Human Motion Generation
Data-Driven Feature Tracking for Event Cameras
MoStGAN-V: Video Generation With Temporal Motion Styles
Two-Stage Co-Segmentation Network Based on Discriminative Representation for Recovering Human Mesh From Videos
Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction
Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation
Deep Stereo Video Inpainting
Burstormer: Burst Image Restoration and Enhancement Transformer
Blur Interpolation Transformer for Real-World Motion From Blur
HDR Imaging With Spatially Varying Signal-to-Noise Ratios
Light Source Separation and Intrinsic Image Decomposition Under AC Illumination
Physics-Guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography
Neumann Network With Recursive Kernels for Single Image Defocus Deblurring
UMat: Uncertainty-Aware Single Image High Resolution Material Capture
SMAE: Few-Shot Learning for HDR Deghosting With Saturation-Aware Masked Autoencoders
Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing
Patch-Craft Self-Supervised Training for Correlated Image Denoising
Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising
All-in-One Image Restoration for Unknown Degradations Using Adaptive Discriminative Filters for Specific Degradations
Ingredient-Oriented Multi-Degradation Learning for Image Restoration
CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability
Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild
Toward Accurate Post-Training Quantization for Image Super Resolution
Learning Steerable Function for Efficient Image Resampling
ABCD: Arbitrary Bitwise Coefficient for De-Quantization
Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring
Learning a Sparse Transformer Network for Effective Image Deraining
CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion
PCT-Net: Full Resolution Image Harmonization Using Pixel-Wise Color Transformations
Semi-Supervised Parametric Real-World Image Harmonization
Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution
QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity
Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model
Person Image Synthesis via Denoising Diffusion Model
Disentangling Writer and Character Styles for Handwriting Generation
NoisyTwins: Class-Consistent and Diverse Image Generation Through StyleGANs
High-Fidelity Guided Image Synthesis With Latent Diffusion Models
Imagic: Text-Based Real Image Editing With Diffusion Models
PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout
SINE: SINgle Image Editing With Text-to-Image Diffusion Models
NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
Parallel Diffusion Models of Operator and Image for Blind Inverse Problems
Unite and Conquer: Plug & Play Multi-Modal Synthesis Using Diffusion Models
Collaborative Diffusion for Multi-Modal Face Generation and Editing
Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding
NVTC: Nonlinear Vector Transform Coding
Motion Information Propagation for Neural Video Compression
A Dynamic Multi-Scale Voxel Flow Network for Video Prediction
Towards Scalable Neural Representation for Diverse Videos
DINER: Disorder-Invariant Implicit Neural Representation
SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy
DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network
Optimization-Inspired Cross-Attention Transformer for Compressive Sensing
Neighborhood Attention Transformer
Making Vision Transformers Efficient From a Token Sparsification View
Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors
Neuralizer: General Neuroimage Analysis Without Re-Training
Learning Partial Correlation Based Deep Visual Representation for Image Classification
Understanding Masked Image Modeling via Learning Occlusion Invariant Feature
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
Adaptive Graph Convolutional Subspace Clustering
Deep Learning of Partial Graph Matching via Differentiable Top-K
DynamicDet: A Unified Dynamic Architecture for Object Detection
IS-GGT: Iterative Scene Graph Generation With Generative Transformers
Fast Contextual Scene Graph Generation With Unbiased Context Augmentation
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning
MED-VT: Multiscale Encoder-Decoder Video Transformer With Application To Object Segmentation
MOVES: Manipulated Objects in Video Enable Segmentation
InstMove: Instance Motion for Object-Centric Video Segmentation
ZBS: Zero-Shot Background Subtraction via Instance-Level Background Modeling and Foreground Selection
Feature Aggregated Queries for Transformer-Based Video Object Detectors
Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation
Selective Structured State-Spaces for Long-Form Video Understanding
Relational Space-Time Query in Long-Form Videos
Novel-View Acoustic Synthesis
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
CASP-Net: Rethinking Video Saliency Prediction From an Audio-Visual Consistency Perceptual Perspective
Decompose More and Aggregate Better: Two Closer Looks at Frequency Representation Learning for Human Motion Prediction
TempSAL – Uncovering Temporal Information for Deep Saliency Prediction
Prompt-Guided Zero-Shot Anomaly Action Recognition Using Pretrained Deep Skeleton Features
MMG-Ego4D: Multimodal Generalization in Egocentric Action Recognition
Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition
Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation
Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks
Iterative Proposal Refinement for Weakly-Supervised Video Grounding
Movies2Scenes: Using Movie Metadata To Learn Scene Representation
Fine-Tuned CLIP Models Are Efficient Video Learners
Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring
VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval
ProTéGé: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding
Learning Video Representations From Large Language Models
All in One: Exploring Unified Video-Language Pre-Training
High-Fidelity Generalized Emotional Talking Face Generation With Multi-Modal Emotion Space Learning
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition With Pre-Trained Vision-Language Models
Decoupled Multimodal Distilling for Emotion Recognition
Affection: Learning Affective Explanations for Real-World Visual Data
An Actor-Centric Causality Graph for Asynchronous Temporal Inference in Group Activity
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision
3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification
Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding
NaQ: Leveraging Narrations As Queries To Supervise Episodic Memory
EC2: Emergent Communication for Embodied Control
Abstract Visual Reasoning: An Algebraic Approach for Solving Raven’s Progressive Matrices
Logical Implications for Visual Question Answering Consistency
Divide and Conquer: Answering Questions With Object Factorization and Compositional Reasoning
The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training
Visual-Language Prompt Tuning With Knowledge-Guided Context Optimization
Probabilistic Prompt Learning for Dense Prediction
Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning
Affordance Grounding From Demonstration Video To Target Image
Leverage Interactive Affinity for Affordance Learning
DeAR: Debiasing Vision-Language Models With Additive Residuals
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
Hyperbolic Contrastive Learning for Visual Representations Beyond Objects
Picture That Sketch: Photorealistic Image Generation From Abstract Sketches
GeneCIS: A Benchmark for General Conditional Image Similarity
Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR
Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching Between Parts and Words
DeltaEdit: Exploring Text-Free Training for Text-Driven Image Manipulation
Detecting and Grounding Multi-Modal Media Manipulation
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding
Cross-Domain Image Captioning With Discriminative Finetuning
EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Turning a CLIP Model Into a Scene Text Detector
ScanDMM: A Deep Markov Model of Scanpath Prediction for 360° Images
CrOC: Cross-View Online Clustering for Dense Visual Representation Learning
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP
CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching
Open-Vocabulary Attribute Detection
Learning To Detect and Segment for Open Vocabulary Object Detection
Open-Vocabulary Semantic Segmentation With Mask-Adapted CLIP
A Simple Framework for Text-Supervised Semantic Segmentation
GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts
GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction
Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss
Generative Semantic Segmentation
MISC210K: A Large-Scale Dataset for Multi-Instance Semantic Correspondence
MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation
PACO: Parts and Attributes of Common Objects
PartDistillation: Learning Parts From Instance Segmentation
ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation
Reliability in Semantic Segmentation: Are We on the Right Track?
Rethinking the Correlation in Few-Shot Segmentation: A Buoys View
SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation
Endpoints Weight Fusion for Class Incremental Semantic Segmentation
Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class
Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation With Implicit Neural Representations
Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation
Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection
Texture-Guided Saliency Distilling for Unsupervised Salient Object Detection
An Erudite Fine-Grained Visual Classification Model
Dynamic Graph Learning With Content-Guided Spatial-Frequency Relation Reasoning for Deepfake Detection
ScaleDet: A Scalable Multi-Dataset Object Detector
Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference
Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation
Dynamic Coarse-To-Fine Learning for Oriented Tiny Object Detection
Dense Distinct Query for End-to-End Object Detection
Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection
One-to-Few Label Assignment for End-to-End Dense Detection
Test Time Adaptation With Regularized Loss for Weakly Supervised Salient Object Detection
MixTeacher: Mining Promising Labels With Mixed Scale Teacher for Semi-Supervised Object Detection
Exploring Incompatible Knowledge Transfer in Few-Shot Image Generation
Exploring Intra-Class Variation Factors With Learnable Cluster Prompts for Semi-Supervised Image Synthesis
A Soma Segmentation Benchmark in Full Adult Fly Brain
SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation
Label-Free Liver Tumor Segmentation
Interactive and Explainable Region-Guided Radiology Report Generation
A Loopback Network for Explainable Microvascular Invasion Classification
Task-Specific Fine-Tuning via Variational Information Bottleneck for Weakly-Supervised Pathology Whole Slide Image Classification
YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors
Two-Way Multi-Label Loss
Teaching Matters: Investigating the Role of Supervision in Vision Transformers
Detection of Out-of-Distribution Samples Using Binary Neuron Activation Patterns
Label Information Bottleneck for Label Enhancement
Glocal Energy-Based Learning for Few-Shot Open-Set Recognition
Noisy Correspondence Learning With Meta Similarity Correction
Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-Shot Learning With Hyperspherical Embeddings
Coreset Sampling From Open-Set for Fine-Grained Self-Supervised Learning
Boosting Semi-Supervised Learning by Exploiting All Unlabeled Data
Trade-Off Between Robustness and Accuracy of Vision Transformers
Exploring and Utilizing Pattern Imbalance
Dynamic Conceptional Contrastive Learning for Generalized Category Discovery
Towards Better Decision Forests: Forest Alternating Optimization
Learning Debiased Representations via Conditional Attribute Interpolation
On the Pitfall of Mixup for Uncertainty Calibration
Class Relationship Embedded Learning for Source-Free Unsupervised Domain Adaptation
FeatureBooster: Boosting Feature Descriptors With a Lightweight Neural Network
Guiding Pseudo-Labels With Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation
Divide and Adapt: Active Domain Adaptation via Customized Learning
Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning
Deep Factorized Metric Learning
Meta-Causal Learning for Single Domain Generalization
Meta Omnium: A Benchmark for General-Purpose Learning-To-Learn
Robust Mean Teacher for Continual and Gradual Test-Time Adaptation
NAR-Former: Neural Architecture Representation Learning Towards Holistic Attributes Prediction
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning
Architecture, Dataset and Model-Scale Agnostic Data-Free Meta-Learning
GKEAL: Gaussian Kernel Embedded Analytic Learning for Few-Shot Class Incremental Task
Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing With Non-Learnable Primitives
Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary
Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning
Learning To Retain While Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation
A Unified Knowledge Distillation Framework for Deep Directed Graphical Models
Coaching a Teachable Student
Adaptive Plasticity Improvement for Continual Learning
Improving Generalization of Meta-Learning With Inverted Regularization at Inner-Level
Trainable Projected Gradient Method for Robust Fine-Tuning
Imitation Learning As State Matching via Differentiable Physics
Improved Distribution Matching for Dataset Condensation
A General Regret Bound of Preconditioned Gradient Method for DNN Training
From Node Interaction To Hop Interaction: New Effective and Scalable Graph Learning Paradigm
Constructing Deep Spiking Neural Networks From Artificial Neural Networks With Knowledge Distillation
Rate Gradient Approximation Attack Threats Deep Spiking Neural Networks
MobileOne: An Improved One Millisecond Mobile Backbone
Understanding Masked Autoencoders via Hierarchical Latent Variable Models
Training Debiased Subnetworks With Contrastive Weight Pruning
One-Shot Model for Mixed-Precision Quantization
Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective
Adaptive Data-Free Quantization
Learning To Generate Image Embeddings With User-Level Differential Privacy
Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences Between Pretrained Generative Models
HandsOff: Labeled Dataset Generation With No Additional Human Annotations
Attribute-Preserving Face Dataset Anonymization via Latent Code Optimization
Fake It Till You Make It: Learning Transferable Representations From Synthetic ImageNet Clones
Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection
Multimodal Industrial Anomaly Detection via Hybrid Fusion
FedSeg: Class-Heterogeneous Federated Learning for Semantic Segmentation
Decentralized Learning With Multi-Headed Distillation
Learning Federated Visual Prompt in Null Space for MRI Reconstruction
Federated Learning With Data-Agnostic Distribution Fusion
CaPriDe Learning: Confidential and Private Decentralized Learning Based on Encryption-Friendly Distillation Loss
RiDDLE: Reversible and Diversified De-Identification With Latent Encryptor
Multi-View Adversarial Discriminator: Mine the Non-Causal Factors for Object Detection in Unseen Domains
Single Image Backdoor Inversion via Robust Smoothed Classifiers
Effective Ambiguity Attack Against Passport-Based DNN Intellectual Property Protection Schemes Through Fully Connected Layer Substitution
Color Backdoor: A Robust Poisoning Attack in Color Space
Adversarially Robust Neural Architecture Search for Graph Neural Networks
Minimizing Maximum Model Discrepancy for Transferable Black-Box Targeted Attacks
StyLess: Boosting the Transferability of Adversarial Examples
Improving the Transferability of Adversarial Samples by Path-Augmented Method
Feature Separation and Recalibration for Adversarial Robustness
CFA: Class-Wise Calibrated Fair Adversarial Training
Revisiting Residual Networks for Adversarial Robustness
Privacy-Preserving Adversarial Facial Features
Edge-Aware Regional Message Passing Controller for Image Forgery Localization
(ends 6:00 PM)
7 p.m.
Social:
Diversity and Inclusion Social
(ends 9:00 PM)
Social:
CV Entrepreneurship – Founders, Freelancers & Friends
(ends 9:00 PM)
Social:
How to Negotiate Industry Offers in AI
(ends 9:00 PM)
Social:
Black in AI Social
(ends 9:00 PM)
Social:
AMA with Senior Faculty and Industry Leaders
(ends 9:00 PM)
WED 21 JUN
7:30 a.m.
Breakfast
8:30 a.m.
Award:
Award Ceremony
(ends 9:00 AM)
9 a.m.
Keynote:
An AI Odyssey: the Dark Matter of Intelligence
Yejin Choi
(ends 10:00 AM)
10 a.m.
Break
10:30 a.m.
Poster Session WED-AM
[10:30-12:00]
Swept-Angle Synthetic Wavelength Interferometry
RefSR-NeRF: Towards High Fidelity and Super Resolution View Synthesis
FreeNeRF: Improving Few-Shot Neural Rendering With Free Frequency Regularization
Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields
Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation From 2D Supervision
NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects
Grid-Guided Neural Radiance Fields for Large Urban Scenes
Learning Neural Duplex Radiance Fields for Real-Time View Synthesis
EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points
Real-Time Neural Light Field on Mobile Devices
StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields
Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields
Pointersect: Neural Rendering With Cloud-Ray Intersection
Neural Fields Meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes
DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering
MAIR: Multi-View Attention Inverse Rendering With 3D Spatially-Varying Lighting Estimation
Weakly-Supervised Single-View Image Relighting
Controllable Light Diffusion for Portraits
RGBD2: Generative Scene Synthesis via Incremental View Inpainting Using RGBD Diffusion Models
Neural Lens Modeling
RealFusion: 360° Reconstruction of Any Object From a Single Image
Neuralangelo: High-Fidelity Neural Surface Reconstruction
PermutoSDF: Fast Multi-View Reconstruction With Implicit Surfaces Using Permutohedral Lattices
NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction
NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction From Multi-View Images
NeuralField-LDM: Scene Generation With Hierarchical Latent Diffusion Models
SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene
Painting 3D Nature in 2D: View Synthesis of Natural Scenes From a Single Semantic Mask
Quantitative Manipulation of Custom Attributes on 3D-Aware Image Synthesis
NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation
PREIM3D: 3D Consistent Precise Image Attribute Editing From a Single Image
Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly
DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion
Fine-Grained Face Swapping via Regional GAN Inversion
Logical Consistency and Greater Descriptive Power for Facial Hair Attribute Learning
Learning a 3D Morphable Face Reflectance Model From Low-Cost Data
StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer
FaceLit: Neural 3D Relightable Faces
FitMe: Deep Photorealistic 3D Morphable Model Avatars
NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and Animation
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
High-Fidelity Clothed Avatar Reconstruction From a Single Image
Music-Driven Group Choreography
Hand Avatar: Free-Pose Hand Animation and Rendering From Monocular Video
Biomechanics-Guided Facial Action Unit Detection Through Force Modeling
Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters
Invertible Neural Skinning
BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion
DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction
Complete 3D Human Reconstruction From a Single Incomplete Image
Learning Neural Volumetric Representations of Dynamic Humans in Minutes
Marching-Primitives: Shape Abstraction From Signed Distance Function
Learning Analytical Posterior Probability for Human Mesh Recovery
MagicPony: Learning Articulated 3D Animals in the Wild
Visual-Tactile Sensing for In-Hand Object Reconstruction
Command-Driven Articulated Object Understanding and Manipulation
Target-Referenced Reactive Grasping for Dynamic Objects
NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions
A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation From a Single RGB Image
TRACE: 5D Temporal Regression of Avatars With Dynamic Cameras in 3D Environments
BITE: Beyond Priors for Improved Three-D Dog Pose Estimation
PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation
Global-to-Local Modeling for Video-Based 3D Human Pose and Shape Estimation
TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers
GFIE: A Dataset and Baseline for Gaze-Following From 2D to 3D in Indoor Environments
Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation From Image Sequence
Rigidity-Aware Detection for 6D Object Pose Estimation
Crowd3D: Towards Hundreds of People Reconstruction From a Single Image
Object Pose Estimation With Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation
expOSE: Accurate Initialization-Free Projective Factorization Using Exponential Regularization
Neural Voting Field for Camera-Space 3D Hand Pose Estimation
Two-View Geometry Scoring Without Correspondences
Four-View Geometry With Unknown Radial Distortion
BKinD-3D: Self-Supervised 3D Keypoint Discovery From Multi-View Videos
BAAM: Monocular 3D Pose and Shape Reconstruction With Bi-Contextual Attention Module and Attention-Guided Modeling
Multi-Object Manipulation via Object-Centric Neural Scattering Functions
Neural Part Priors: Learning To Optimize Part-Based Object Completion in RGB-D Scans
Panoptic Lifting for 3D Scene Understanding With Neural Fields
Virtual Occlusions Through Implicit Depth
Multiview Compressive Coding for 3D Reconstruction
Behind the Scenes: Density Fields for Single View Reconstruction
VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion
Renderable Neural Radiance Map for Visual Navigation
Learning To Detect Mirrors From Videos via Dual Correspondences
Temporally Consistent Online Depth Estimation Using Point-Based Fusion
Zero-Shot Dual-Lens Super-Resolution
Fully Self-Supervised Depth Estimation From Defocus Clue
MVImgNet: A Large-Scale Dataset of Multi-View Images
Revisiting the Stack-Based Inverse Tone Mapping
Combining Implicit-Explicit View Correlation for Light Field Semantic Segmentation
3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud
Role of Transients in Two-Bounce Non-Line-of-Sight Imaging
3D Concept Learning and Reasoning From Multi-View Images
Viewpoint Equivariance for Multi-View 3D Object Detection
Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
BEV@DC: Bird’s-Eye View Assisted Training for Depth Completion
Collaboration Helps Camera Overtake LiDAR in 3D Detection
Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection
Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration
Depth Estimation From Camera Image and mmWave Radar Point Cloud
SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization
ConQueR: Query Contrast Voxel-DETR for 3D Object Detection
DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization
Towards Unsupervised Object Detection From LiDAR Point Clouds
MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences
Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision
Instant Domain Augmentation for LiDAR Semantic Segmentation
Less Is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation
MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds
3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds
Novel Class Discovery for 3D Point Cloud Semantic Segmentation
GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds
Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning
Open-Set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework
ACL-SPC: Adaptive Closed-Loop System for Self-Supervised Point Cloud Completion
Fast Point Cloud Generation With Straight Flows
PointVector: A Vector Representation in Point Cloud Analysis
ProxyFormer: Proxy Alignment Assisted Point Cloud Completion With Missing Part Sensitive Transformer
FAC: 3D Representation Learning via Foreground Aware Feature Contrast
Rethinking the Approximation Error in 3D Surface Fitting for Point Cloud Normal Estimation
PointCert: Point Cloud Classification With Deterministic Certified Robustness Guarantees
Robust Multiview Point Cloud Registration With Reliable Pose Graph Initialization and History Reweighting
Visual Prompt Multi-Modal Tracking
Progressive Neighbor Consistency Mining for Correspondence Pruning
Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training
Unsupervised Visible-Infrared Person Re-Identification via Progressive Graph Matching and Alternate Learning
Domain Generalized Stereo Matching via Hierarchical Visual Transformation
Unsupervised Cumulative Domain Adaptation for Foggy Scene Optical Flow
PVO: Panoptic Visual Odometry
BAEFormer: Bi-Directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation
Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark
Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving
MixSim: A Hierarchical Framework for Mixed Reality Traffic Simulation
Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction
MotionDiffuser: Controllable Multi-Agent Motion Prediction Using Diffusion
Learning Human-to-Robot Handovers From Point Clouds
Phone2Proc: Bringing Robust Robots Into Our Chaotic World
GazeNeRF: 3D-Aware Gaze Redirection With Neural Radiance Fields
Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking
Autoregressive Visual Tracking
OpenGait: Revisiting Gait Recognition Towards Better Practicality
Pose-Disentangled Contrastive Learning for Self-Supervised Facial Representation
Identity-Preserving Talking Face Generation With Landmark and Appearance Priors
DF-Platter: Multi-Face Heterogeneous Deepfake Dataset
Physics-Driven Diffusion Models for Impact Sound Synthesis From Videos
MoFusion: A Framework for Denoising-Diffusion-Based Motion Synthesis
Adaptive Global Decay Process for Event Cameras
Frame-Event Alignment and Fusion Network for High Frame Rate Tracking
Exploring Discontinuity for Video Frame Interpolation
AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation
Frame Interpolation Transformer and Uncertainty Guidance
A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift
Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer
HyperCUT: Video Sequence From a Single Blurry Image Using Unsupervised Ordering
Indescribable Multi-Modal Spatial Evaluator
Structured Kernel Estimation for Photon-Limited Deconvolution
Polarized Color Image Denoising
Uncertainty-Aware Unsupervised Image Deblurring With Deep Residual Prior
Low-Light Image Enhancement via Structure Modeling and Guidance
Learning Sample Relationship for Exposure Correction
Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising
Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification
Generative Diffusion Prior for Unified Image Restoration and Enhancement
Ground-Truth Free Meta-Learning for Deep Compressive Sampling
Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation
An Image Quality Assessment Dataset for Portraits
Bitstream-Corrupted JPEG Images Are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration
Image Super-Resolution Using T-Tetromino Pixels
CUF: Continuous Upsampling Filters
OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution
Implicit Diffusion Models for Continuous Super-Resolution
Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection
VILA: Learning Image Aesthetics From User Comments With Vision-Language Pretraining
Image Cropping With Spatial-Aware Feature and Rank Consistency
B-Spline Texture Coefficients Estimator for Screen Content Image Super-Resolution
Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint
Learning Dynamic Style Kernels for Artistic Style Transfer
SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers
Learning Generative Structure Prior for Blind Text Image Super-Resolution
Unsupervised Domain Adaption With Pixel-Level Discriminator for Image-Aware Layout Generation
Scaling Up GANs for Text-to-Image Synthesis
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model With Knowledge-Enhanced Mixture-of-Denoising-Experts
Inversion-Based Style Transfer With Diffusion Models
Shifted Diffusion for Text-to-Image Generation
LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
Unpaired Image-to-Image Translation With Shortest Path Regularization
DiffCollage: Parallel Generation of Large Content With Diffusion Models
Wavelet Diffusion Models Are Fast and Scalable Image Generators
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Adaptive Human Matting for Dynamic Videos
LVQAC: Lattice Vector Quantization Coupled With Spatially Adaptive Companding for Efficient Learned Image Compression
Hierarchical B-Frame Video Coding Using Two-Layer CANF Without Motion Coding
Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting
HNeRV: A Hybrid Neural Representation for Videos
Regularize Implicit Neural Representation by Itself
SMPConv: Self-Moving Point Representations for Continuous Convolution
Long Range Pooling for 3D Large-Scale Scene Understanding
Progressive Random Convolutions for Single Domain Generalization
BiFormer: Vision Transformer With Bi-Level Routing Attention
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
BioNet: A Biologically-Inspired Network for Face Recognition
Dual-Bridging With Adversarial Noise Generation for Domain Adaptive rPPG Estimation
On Data Scaling in Masked Image Modeling
Hard Patches Mining for Masked Image Modeling
Evolved Part Masking for Self-Supervised Learning
BASiS: Batch Aligned Spectral Embedding Space
OmniMAE: Single Model Masked Pretraining on Images and Videos
ViTs for SITS: Vision Transformers for Satellite Image Time Series
Probabilistic Debiasing of Scene Graphs
Blind Video Deflickering by Neural Filtering With a Flawed Atlas
SCOTCH and SODA: A Transformer Video Shadow Detection Framework
MAGVIT: Masked Generative Video Transformer
Improving Robustness of Semantic Segmentation to Motion-Blur Using Class-Centric Augmentation
MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation
Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
Frame Flexible Network
System-Status-Aware Adaptive Network for Online Streaming Video Understanding
MDQE: Mining Discriminative Query Embeddings To Segment Occluded Instances on Challenging Videos
Spatio-Temporal Pixel-Level Contrastive Learning-Based Source-Free Domain Adaptation for Video Semantic Segmentation
Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
Chat2Map: Efficient Scene Mapping From Multi-Ego Conversations
Audio-Visual Grouping Network for Sound Localization From Mixtures
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Fine-Grained Audible Video Description
Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition
Learning Discriminative Representations for Skeleton Based Action Recognition
Therbligs in Action: Video Understanding Through Motion Primitives
Search-Map-Search: A Frame Selection Paradigm for Action Recognition
Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
Boosting Weakly-Supervised Temporal Action Localization With Text Information
Perception and Semantic Aware Regularization for Sequential Confidence Calibration
NewsNet: A Novel Dataset for Hierarchical Temporal Segmentation
Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation
Leveraging Temporal Context in Low Representational Power Regimes
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Procedure-Aware Pretraining for Instructional Video Understanding
VindLU: A Recipe for Effective Video-and-Language Pretraining
Modular Memorability: Tiered Representations for Video Memorability Prediction
Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation
Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Layout-Based Causal Inference for Object Navigation
Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
A New Path: Scaling Vision-and-Language Navigation With Synthetic Instructions and Imitation Learning
A-Cap: Anticipation Captioning With Commonsense Knowledge
Are Deep Neural Networks SMARTer Than Second Graders?
Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning
Language Adaptive Weight Generation for Multi-Task Visual Grounding
From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models
Diversity-Aware Meta Visual Prompting
Hierarchical Prompt Learning for Multi-Task Learning
Task Residual for Tuning Vision-Language Models
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
Overlooked Factors in Concept-Based Explanations: Dataset Choice, Concept Learnability, and Human Capability
Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space
GIVL: Improving Geographical Inclusivity of Vision-Language Models With Pre-Training Methods
Learning Bottleneck Concepts in Image Classification
SceneTrilogy: On Human Scene-Sketch and Its Complementarity With Photo and Text
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
CLIPPO: Image-and-Language Understanding From Pixels Only
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
Non-Contrastive Learning Meets Language-Image Pre-Training
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Learning Attribute and Class-Specific Representation Duet for Fine-Grained Fashion Analysis
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce
Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning
Asymmetric Feature Fusion for Image Retrieval
Improving Zero-Shot Generalization and Robustness of Multi-Modal Models
Hint-Aug: Drawing Hints From Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning
Visual DNA: Representing and Comparing Images Using Distributions of Neuron Activations
End-to-End 3D Dense Captioning With Vote2Cap-DETR
Improving Table Structure Recognition With Visual-Alignment Sequential Coordinate Modeling
Region-Aware Pretraining for Open-Vocabulary Object Detection With Vision Transformers
Mobile User Interface Element Detection via Adaptively Prompt Tuning
Learning To Generate Text-Grounded Mask for Open-World Semantic Segmentation From Only Image-Text Pairs
ZegCLIP: Towards Adapting CLIP for Zero-Shot Semantic Segmentation
Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection
Learning Conditional Attributes for Compositional Zero-Shot Learning
CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation
StructVPR: Distill Structural Knowledge With Weighting Samples for Visual Place Recognition
UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration
Primitive Generation and Semantic-Related Alignment for Universal Zero-Shot Segmentation
Inferring and Leveraging Parts From Object Shape for Improving Semantic Image Synthesis
Compositor: Bottom-Up Clustering and Compositing for Robust Part and Object Segmentation
A Strong Baseline for Generalized Few-Shot Semantic Segmentation
DynaMask: Dynamic Mask Selection for Instance Segmentation
Focus on Details: Online Multi-Object Tracking With Diverse Fine-Grained Representation
Dynamic Focus-Aware Positional Queries for Semantic Segmentation
Beyond mAP: Towards Better Evaluation of Instance Segmentation
Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation
Weakly Supervised Semantic Segmentation via Adversarial Learning of Classifier and Reconstructor
SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation
Augmentation Matters: A Simple-Yet-Effective Approach to Semi-Supervised Semantic Segmentation
The Devil Is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation
Class-Incremental Exemplar Compression for Class-Incremental Learning
Full or Weak Annotations? An Adaptive Strategy for Budget-Constrained Annotation Campaigns
Learning Common Rationale To Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
Self-Supervised AutoFlow
DETR With Additional Global Aggregation for Cross-Domain Weakly Supervised Object Detection
Detecting Everything in the Open World: Towards Universal Object Detection
PROB: Probabilistic Objectness for Open World Object Detection
Annealing-Based Label-Transfer Learning for Open World Object Detection
Learning Transformation-Predictive Representations for Detection and Description of Local Features
Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection
2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection
Zero-Shot Generative Model Adaptation via Image-Specific Prompt Learning
AutoLabel: CLIP-Based Framework for Open-Set Video Domain Adaptation
Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation
Directional Connectivity-Based Segmentation of Medical Images
Ambiguous Medical Image Segmentation Using Diffusion Models
Sparse Multi-Modal Graph Transformer With Shared-Context Processing for Representation Learning of Giga-Pixel Images
METransformer: Radiology Report Generation by Transformer With Multiple Learnable Expert Tokens
Towards Trustable Skin Cancer Diagnosis via Rewriting Model’s Decision
Rethinking Out-of-Distribution (OOD) Detection: Masked Image Modeling Is All You Need
MetaViewer: Towards a Unified Multi-View Representation
Deep Incomplete Multi-View Clustering With Cross-View Partial Sample and Prototype Alignment
RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval
Mind the Label Shift of Augmentation-Based Graph OOD Generalization
Zero-Shot Model Diagnosis
ProtoCon: Pseudo-Label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-Supervised Learning
Fine-Grained Classification With Noisy Labels
Twin Contrastive Learning With Noisy Labels
RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases
Generative Bias for Robust Visual Question Answering
On-the-Fly Category Discovery
Co-Training 2^L Submodels for Visual Recognition
Neural Dependencies Emerging From Learning Massive Categories
MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation
Towards Better Stability and Adaptability: Improve Online Self-Training for Model Adaptation in Semantic Segmentation
DARE-GRAM: Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices
Equiangular Basis Vectors
Enhanced Multimodal Representation Learning With Cross-Modal KD
Decompose, Adjust, Compose: Effective Normalization by Playing With Frequency for Domain Generalization
Back to the Source: Diffusion-Driven Adaptation To Test-Time Corruption
Deep Frequency Filtering for Domain Generalization
Generalizable Implicit Neural Representations via Instance Pattern Composers
Train-Once-for-All Personalization
Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners
Few-Shot Class-Incremental Learning via Class-Aware Bilateral Distillation
Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning
Dense Network Expansion for Class Incremental Learning
Class Attention Transfer Based Knowledge Distillation
Dealing With Cross-Task Class Discrimination in Online Continual Learning
Real-Time Evaluation in Online Continual Learning: A New Hope
DisWOT: Student Architecture Search for Distillation WithOut Training
CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning
EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization
Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning
PA&DA: Jointly Sampling Path and Data for Consistent NAS
Accelerating Dataset Distillation via Model Augmentation
Multi-Agent Automated Machine Learning
Transformer-Based Learned Optimization
Solving Relaxations of MAP-MRF Problems: Combinatorial In-Face Frank-Wolfe Directions
HOTNAS: Hierarchical Optimal Transport for Neural Architecture Search
Disentangled Representation Learning for Unsupervised Neural Quantization
FFCV: Accelerating Training by Removing Data Bottlenecks
Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks
FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits
Gradient-Based Uncertainty Attribution for Explainable Bayesian Deep Learning
How To Prevent the Continuous Damage of Noises To Model Training?
Genie: Show Me the Data for Quantization
OpenMix: Exploring Outlier Samples for Misclassification Detection
Data-Free Sketch-Based Image Retrieval
GLeaD: Improving GANs With a Generator-Leading Task
Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection
Adversarial Normalization: I Can Visualize Everything (ICE)
Semi-Supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination
Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning
Diversity-Measurable Anomaly Detection
Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-World
How To Prevent the Poor Performance Clients for Personalized Federated Learning?
DynaFed: Tackling Client Data Heterogeneity With Global Dynamics
Elastic Aggregation for Federated Optimization
Breaching FedMD: Image Recovery via Paired-Logits Inversion Attack
Learning To Measure the Point Cloud Reconstruction Loss in a Representation Space
Backdoor Cleansing With Unlabeled Data
Backdoor Defense via Deconfounded Representation Learning
Defending Against Patch-Based Backdoor Attacks on Self-Supervised Learning
Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger
CAP: Robust Point Cloud Classification via Semantic and Structural Modeling
Evading DeepFake Detectors via Adversarial Statistical Consistency
Enhancing the Self-Universality for Transferable Targeted Attacks
Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation
Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression
Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts
Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks
Physically Adversarial Infrared Patches With Learnable Shapes and Locations
MaLP: Manipulation Localization Using a Proactive Scheme
(ends 12:00 PM)
12:30 p.m.
Lunch
2 p.m.
Panel:
Vision, Language, and Creativity
(ends 3:00 PM)
3 p.m.
Meeting:
PAMI TC Meeting
(ends 4:00 PM)
4 p.m.
Break
4:30 p.m.
Poster Session WED-PM
[4:30-6:00]
Polarimetric iToF: Measuring High-Fidelity Depth Through Scattering Media
NeRFLix: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-Viewpoint MiXer
SUDS: Scalable Urban Dynamic Scenes
DP-NeRF: Deblurred Neural Radiance Field With Physical Scene Priors
DyLiN: Making Light Field Networks Dynamic
Multi-Space Neural Radiance Fields
NeRFLight: Fast and Light Neural Radiance Fields Using a Shared Feature Grid
Cross-Guided Optimization of Radiance Fields With Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis
NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds
DINER: Depth-Aware Image-Based NEural Radiance Fields
Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer
Efficient Map Sparsification Based on 2D and 3D Discretized Grids
K-Planes: Explicit Radiance Fields in Space, Time, and Appearance
I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs
Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes
Inverse Rendering of Translucent Objects Using Physical and Neural Renderers
Accidental Light Probes
Humans As Light Bulbs: 3D Human Reconstruction From Thermal Reflection
HumanGen: Generating Human Radiance Fields With Explicit Priors
Seeing Through the Glass: Neural 3D Reconstruction of Object Inside a Transparent Container
3D Shape Reconstruction of Semi-Transparent Worms
Dionysus: Recovering Scene Structures by Dividing Into Semantic Pieces
SparseFusion: Distilling View-Conditioned Diffusion for 3D Reconstruction
PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces
RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation
Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation
Infinite Photorealistic Worlds Using Procedural Generation
Diffusion-SDF: Text-To-Shape via Voxelized Diffusion
3D-Aware Multi-Class Image-to-Image Translation With NeRFs
Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures
Local 3D Editing via 3D Distillation of CLIP Knowledge
ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations
CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing
3D-Aware Face Swapping
DCFace: Synthetic Face Generation With Dual Condition Diffusion Model
HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling
DiffusionRig: Learning Personalized Priors for Facial Appearance Editing
3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data
Parametric Implicit Face Representation for Audio-Driven Facial Reenactment
MEGANE: Morphable Eyeglass and Avatar Network
CodeTalker: Speech-Driven 3D Facial Animation With Discrete Motion Prior
Reconstructing Signing Avatars From Video Using Linguistic Priors
HARP: Personalized Hand Reconstruction From a Monocular RGB Video
OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis
RaBit: Parametric Modeling of 3D Biped Cartoon Characters With a Topological-Consistent Dataset
Transfer4D: A Framework for Frugal Motion Capture and Deformation Transfer
CLOTH4D: A Dataset for Clothed Human Reconstruction
Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition
High-Fidelity 3D Human Digitization From Single 2K Resolution Images
Sampling Is Matter: Point-Guided 3D Human Mesh Reconstruction
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction
Human Body Shape Completion With Implicit Shape and Flow Learning
ShapeClipper: Scalable 3D Shape Learning From Single-View Images via Geometric and CLIP-Based Consistency
PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction
NIKI: Neural Inverse Kinematics With Invertible Neural Networks for 3D Human Pose and Shape Estimation
ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation
ACR: Attention Collaboration-Based Regressor for Arbitrary Two-Hand Reconstruction
MIME: Human-Aware 3D Scene Generation
CIMI4D: A Large Multimodal Climbing Motion Dataset Under Human-Scene Interactions
Harmonious Feature Learning for Interactive Hand-Object Pose Estimation
AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation
A Characteristic Function-Based Method for Bottom-Up Human Pose Estimation
Unified Pose Sequence Modeling
Scene-Aware Egocentric 3D Human Pose Estimation
DiffPose: Toward More Reliable 3D Pose Estimation
MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding
Learning 3D-Aware Image Synthesis With Unknown Pose Distribution
Pose Synchronization Under Multiple Pair-Wise Relative Poses
ObjectMatch: Robust Registration Using Canonical Object Correspondences
Learning Articulated Shape With Keypoint Pseudo-Labels From Web Images
Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares
Efficient Second-Order Plane Adjustment
Learning a Depth Covariance Function
Privacy-Preserving Representations Are Not Enough: Recovering Scene Content From Camera Poses
Objaverse: A Universe of Annotated 3D Objects
Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild
HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes With Iterative Intertwined Regularization
Visual Localization Using Imperfect 3D Models From the Internet
PRISE: Demystifying Deep Lucas-Kanade With Strongly Star-Convex Constraints for Multimodel Image Alignment
Scalable, Detailed and Mask-Free Universal Photometric Stereo
Enhanced Stable View Synthesis
End-to-End Vectorized HD-Map Construction With Piecewise Bezier Curve
DynamicStereo: Consistent Dynamic Depth From Stereo Videos
Shakes on a Plane: Unsupervised Depth Estimation From Unstabilized Photography
Gated Stereo: Joint Depth Estimation From Gated and Wide-Baseline Active Stereo Cues
K3DN: Disparity-Aware Kernel Estimation for Dual-Pixel Defocus Deblurring
HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions
OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer
Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM
Few-Shot Non-Line-of-Sight Imaging With Signal-Surface Collaborative Regularization
NLOST: Non-Line-of-Sight Imaging With Transformer
Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals
Towards Domain Generalization for Multi-View 3D Object Detection in Bird-Eye-View
X3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection
Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection
Learned Two-Plane Perspective Prior Based Image Resampling for Efficient Object Detection
Resource-Efficient RGBD Aerial Tracking
Toward RAW Object Detection: A New Benchmark and a New Model
Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection
LiDAR-in-the-Loop Hyperparameter Optimization
Learning and Aggregating Lane Graphs for Urban Automated Driving
Center Focusing Network for Real-Time LiDAR Panoptic Segmentation
Adaptive Sparse Convolutional Networks With Global Context Enhancement for Faster Object Detection on Drone Images
MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training
ALSO: Automotive Lidar Self-Supervision by Occupancy Estimation
Unsupervised Intrinsic Image Decomposition With LiDAR Intensity
PVT-SSD: Single-Stage 3D Object Detector With Point-Voxel Transformer
LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs
WeatherStream: Light Transport Automation of Single Image Deweathering
Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors
DSVT: Dynamic Sparse Voxel Transformer With Rotated Sets
IterativePFN: True Iterative Point Cloud Filtering
itKD: Interchange Transfer-Based Knowledge Distillation for 3D Object Detection
ISBNet: A 3D Point Cloud Instance Segmentation Network With Instance-Aware Sampling and Box-Aware Dynamic Convolution
Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion
GeoMAE: Masked Geometric Target Prediction for Self-Supervised Point Cloud Pre-Training
AnchorFormer: Point Cloud Completion From Discriminative Nodes
SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds
NerVE: Neural Volumetric Edges for Parametric Curve Extraction From Point Cloud
Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration
Local Connectivity-Based Density Estimation for Face Clustering
Bridging Search Region Interaction With Template for RGB-T Tracking
Quantum Multi-Model Fitting
Generalizable Local Feature Pre-Training for Deformable Shape Analysis
Similarity Metric Learning for RGB-Infrared Group Re-Identification
Unsupervised Deep Asymmetric Stereo Matching With Spatially-Adaptive Self-Similarity
Sliced Optimal Partial Transport
DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling
Bayesian Posterior Approximation With Stochastic Ensembles
V2V4Real: A Real-World Large-Scale Dataset for Vehicle-to-Vehicle Cooperative Perception
ReasonNet: End-to-End Driving With Temporal and Global Reasoning
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
FJMP: Factorized Joint Multi-Agent Motion Prediction Over Learned Directed Acyclic Interaction Graphs
Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion
Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-per-Second
Affordances From Human Videos As a Versatile Representation for Robotics
Indiscernible Object Counting in Underwater Scenes
Tracking Through Containers and Occluders in the Wild
Simple Cues Lead to a Strong Multi-Object Tracker
An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions
SelfME: Self-Supervised Motion Learning for Micro-Expression Recognition
LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook
Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video
Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry
MoDi: Unconditional Motion Synthesis From Diverse Data
Recurrent Vision Transformers for Object Detection With Event Cameras
Continuous Intermediate Token Learning With Implicit Motion Manifold for Keyframe Based Motion Interpolation
EvShutter: Transforming Events for Unconstrained Rolling Shutter Correction
Multi Domain Learning for Motion Magnification
Learning Event Guided High Dynamic Range Video Reconstruction
Joint Video Multi-Frame Interpolation and Deblurring Under Unknown Exposure Time
FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER
MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding From Object Detection
Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset
Visibility Constrained Wide-Band Illumination Spectrum Design for Seeing-in-the-Dark
Self-Supervised Blind Motion Deblurring With Deep Expectation Maximization
Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising
Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments
Zero-Shot Noise2Noise: Efficient Image Denoising Without Any Data
Real-Time Controllable Denoising for Image and Video
Probability-Based Global Cross-Modal Upsampling for Pansharpening
ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal
Visual Recognition-Driven Image Restoration for Multiple Degradation With Intrinsic Semantics Recovery
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective
Human Guided Ground-Truth Generation for Realistic Image Super-Resolution
Real-Time 6K Image Rescaling With Rate-Distortion Optimization
Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution
Ultrahigh Resolution Image/Video Matting With Spatio-Temporal Sparsity
Comprehensive and Delicate: An Efficient Transformer for Image Restoration
PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification
PyramidFlow: High-Resolution Defect Contrastive Localization Using Pyramid Normalizing Flow
Neural Fourier Filter Bank
Restoration of Hand-Drawn Architectural Drawings Using Latent Space Mapping With Degradation Generator
Neural Preset for Color Style Transfer
NÜWA-LIP: Language-Guided Image Inpainting With Defect-Free VQGAN
DualVector: Unsupervised Vector Font Synthesis With Dual-Part Representation
DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
Fix the Noise: Disentangling Source Feature for Controllable Domain Translation
Conditional Text Image Generation With Diffusion Models
ReCo: Region-Controlled Text-to-Image Generation
Freestyle Layout-to-Image Synthesis
Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
Towards Flexible Multi-Modal Document Models
On Distillation of Guided Diffusion Models
Dimensionality-Varying Diffusion Process
Shape-Aware Text-Driven Layered Video Editing
Rethinking Image Super Resolution From Long-Tailed Distribution Learning Perspective
End-to-End Video Matting With Trimap Propagation
Context-Based Trit-Plane Coding for Progressive Image Compression
Complexity-Guided Slimmable Decoder for Efficient Deep Video Compression
Efficient Hierarchical Entropy Model for Learned Point Cloud Compression
NIRVANA: Neural Implicit Representations of Videos With Adaptive Networks and Autoregressive Patch-Wise Modeling
Learned Image Compression With Mixed Transformer-CNN Architectures
Memory-Friendly Scalable Super-Resolution via Rewinding Lottery Ticket Hypothesis
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions
EfficientViT: Memory Efficient Vision Transformer With Cascaded Group Attention
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer
High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity
Non-Contrastive Unsupervised Learning of Physiological Signals From Video
Revealing the Dark Secrets of Masked Image Modeling
Improving Visual Representation Learning Through Perceptual Understanding
FlexiViT: One Model for All Patch Sizes
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders
SimpSON: Simplifying Photo Cleanup With Single-Click Distracting Object Segmentation Network
Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention
Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections
VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking
DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks
SeqTrack: Sequence to Sequence Learning for Visual Object Tracking
Bootstrapping Objectness From Videos by Relaxed Common Fate and Visual Grouping
Video Event Restoration Based on Keyframes for Video Anomaly Detection
Streaming Video Model
LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection
A Generalized Framework for Video Instance Segmentation
Referring Multi-Object Tracking
Source-Free Video Domain Adaptation With Spatial-Temporal-Historical Consistency Learning
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Egocentric Auditory Attention Localization in Conversations
iQuery: Instruments As Queries for Audio-Visual Sound Separation
Learning To Dub Movies via Hierarchical Prosody Models
A Large-Scale Robustness Analysis of Video Action Recognition Models
The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction
STMixer: A One-Stage Sparse Action Detector
Generating Human Motion From Textual Descriptions With Discrete Representations
Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization
Distilling Vision-Language Pre-Training To Collaborate With Weakly-Supervised Temporal Action Localization
Simultaneously Short- and Long-Term Temporal Modeling for Semi-Supervised Video Semantic Segmentation
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering
Language-Guided Music Recommendation for Video via Prompt Analogies
Text-Visual Prompting for Efficient 2D Temporal Video Grounding
CelebV-Text: A Large-Scale Facial Text-Video Dataset
CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset
Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations
PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval
Clover: Towards a Unified Video-Language Alignment and Fusion Model
Align and Attend: Multimodal Summarization With Dual Contrastive Losses
Learning Situation Hyper-Graphs for Video Question Answering
Natural Language-Assisted Sign Language Recognition
SkyEye: Self-Supervised Bird’s-Eye-View Semantic Mapping Using Monocular Frontal View Images
Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation
Iterative Vision-and-Language Navigation
EXCALIBUR: Encouraging and Evaluating Embodied Exploration
Multimodal Prompting With Missing Modalities for Visual Recognition
Visual Programming: Compositional Visual Reasoning Without Training
Super-CLEVR: A Virtual Benchmark To Diagnose Domain Robustness in Visual Reasoning
Prompting Large Language Models With Answer Heuristics for Knowledge-Based Visual Question Answering
À-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Learning To Exploit Temporal Structure for Biomedical Vision–Language Processing
FashionSAP: Symbols and Attributes Prompt for Fine-Grained Fashion Vision-Language Pre-Training
Advancing Visual Grounding With Scene Knowledge: Benchmark and Method
Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks
OCTET: Object-Aware Counterfactual Explanations
Local-Guided Global: Paired Similarity Representation for Visual Reinforcement Learning
What Can Human Sketches Do for Object Detection?
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Correlational Image Modeling for Self-Supervised Visual Pre-Training
Generalized Decoding for Pixel, Image, and Language
Towards Modality-Agnostic Person Re-Identification With Descriptive Query
M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis
Learning Customized Visual Models With Retrieval-Augmented Knowledge
Learning Semantic Relationship Among Instances for Image-Text Matching
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification
ImageBind: One Embedding Space To Bind Them All
Model-Agnostic Gender Debiased Image Captioning
Boundary-Aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
Towards Unified Scene Text Spotting Based on Sequence Generation
CapDet: Unifying Dense Captioning and Open-World Detection Pretraining
CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data
Aligning Bag of Regions for Open-Vocabulary Object Detection
Visual Recognition by Request
Category Query Learning for Human-Object Interaction Classification
Self-Supervised Implicit Glyph Attention for Text Recognition
Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition
CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
Learning Attention As Disentangler for Compositional Zero-Shot Learning
Universal Instance Perception As Object Discovery and Retrieval
Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning
DPF: Learning Dense Prediction Fields With Weak Supervision
Modeling Entities As Semantic Points for Visual Information Extraction in the Wild
GeoNet: Benchmarking Unsupervised Adaptation Across Geographies
SegLoc: Learning Segmentation-Based Representations for Privacy-Preserving Visual Localization
Towards Open-World Segmentation of Parts
Pruning Parameterization With Bi-Level Optimization for Efficient Semantic Segmentation on the Edge
HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation
Exemplar-FreeSOLO: Enhancing Unsupervised Instance Segmentation With Exemplars
Weakly-Supervised Domain Adaptive Semantic Segmentation With Prototypical Contrastive Learning
Spatial-Temporal Concept Based Explanation of 3D ConvNets
Sparsely Annotated Semantic Segmentation With Adaptive Gaussian Mixtures
Fuzzy Positive Learning for Semi-Supervised Semantic Segmentation
STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection
Boosting Low-Data Instance Segmentation by Unsupervised Pre-Training With Saliency Prompt
Decoupled Semantic Prototypes Enable Learning From Diverse Annotation Types for Semi-Weakly Segmentation in Expert-Driven Domains
The Treasure Beneath Multiple Annotations: An Uncertainty-Aware Edge Detector
Knowledge Combination To Learn Rotated Detection Without Rotated Annotation
Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection With Single Point Supervision
SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency
Zero-Shot Object Counting
SOOD: Towards Semi-Supervised Oriented Object Detection
Large-Scale Training Data Search for Object Re-Identification
Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection
Towards Effective Visual Representations for Partial-Label Learning
Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection
Boosting Detection in Crowd Analysis via Underutilized Output Features
Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture
Weakly Supervised Segmentation With Point Annotations for Histopathology Images via Contrast-Based Variational Model
DoNet: Deep De-Overlapping Network for Cytology Instance Segmentation
MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation
Histopathology Whole Slide Image Analysis With Heterogeneous Graph Representation Learning
PEFAT: Boosting Semi-Supervised Medical Image Classification via Pseudo-Loss Estimation and Feature Adversarial Training
Causally-Aware Intraoperative Imputation for Overall Survival Time Prediction
Balanced Energy Regularization Loss for Out-of-Distribution Detection
Block Selection Method for Using Feature Norm in Out-of-Distribution Detection
Highly Confident Local Structure Based Consensus Graph Learning for Incomplete Multi-View Clustering
Siamese DETR
Towards Bridging the Performance Gaps of Joint Energy-Based Models
Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning
Boosting Transductive Few-Shot Fine-Tuning With Margin-Based Uncertainty Weighting and Probability Regularization
CHMATCH: Contrastive Hierarchical Matching and Robust Adaptive Threshold Boosted Semi-Supervised Learning
MarginMatch: Improving Semi-Supervised Learning With Pseudo-Margins
Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate
Learning Imbalanced Data With Vision Transformers
No One Left Behind: Improving the Worst Categories in Long-Tailed Learning
Global and Local Mixture Consistency Cumulative Learning for Long-Tailed Visual Recognitions
Curvature-Balanced Feature Manifold Learning for Long-Tailed Classification
DAA: A Delta Age AdaIN Operation for Age Estimation via Binary Code Transformer
DLBD: A Self-Supervised Direct-Learned Binary Descriptor
Progressive Open Space Expansion for Open-Set Model Attribution
DiGA: Distil To Generalize and Then Adapt for Domain Adaptive Semantic Segmentation
Multi-Modal Learning With Missing Modality via Shared-Specific Feature Modelling
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information
Bi-Level Meta-Learning for Few-Shot Domain Generalization
Train/Test-Time Adaptation With Retrieval
Robust Test-Time Adaptation in Dynamic Scenarios
Domain Expansion of Image Generators
Switchable Representation Learning Framework With Self-Compatibility
A New Benchmark: On the Utility of Synthetic Data With Blender for Bare Supervised Learning and Downstream Domain Adaptation
Adapting Shortcut With Normalizing Flow: An Efficient Tuning Framework for Visual Recognition
Manipulating Transfer Learning for Property Inference
Heterogeneous Continual Learning
Generic-to-Specific Distillation of Masked Autoencoders
Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval
CafeBoost: Causal Feature Boost To Eliminate Task-Induced Bias for Class Incremental Learning
Bilateral Memory Consolidation for Continual Learning
NICO++: Towards Better Benchmarking for Domain Generalization
DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks
Differentiable Architecture Search With Random Features
Class Adaptive Network Calibration
Meta-Learning With a Geometry-Adaptive Preconditioner
DepGraph: Towards Any Structural Pruning
Stitchable Neural Networks
Integral Neural Networks
Regularization of Polynomial Networks for Image Recognition
ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders
Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations
Don’t Lie to Me! Robust and Efficient Explainability With Verified Perturbation Analysis
OT-Filter: An Optimal Transport Filter for Learning With Noisy Labels
Robust Generalization Against Photon-Limited Corruptions via Worst-Case Sharpness Minimization
Learning With Noisy Labels via Self-Supervised Adversarial Noisy Masking
Bit-Shrinking: Limiting Instantaneous Sharpness for Improving Post-Training Quantization
Enhancing Multiple Reliability Measures via Nuisance-Extended Information Bottleneck
AdaptiveMix: Improving GAN Training via Feature Space Shrinkage
Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration
Soft Augmentation for Image Classification
Boosting Verified Training for Robust Image Classifications via Abstraction
A New Dataset Based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories
Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection
Prototypical Residual Networks for Anomaly Detection and Localization
Class Balanced Adaptive Pseudo Labeling for Federated Semi-Supervised Learning
Fair Federated Medical Image Segmentation via Client Contribution Estimation
Rethinking Federated Learning With Domain Shift: A Prototype View
FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning
Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations
STDLens: Model Hijacking-Resilient Federated Learning for Object Detection
Detecting Backdoors in Pre-Trained Encoders
Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency
Can’t Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders
Re-Thinking Model Inversion Attacks Against Deep Neural Networks
Turning Strengths Into Weaknesses: A Certified Robustness Inspired Attack Framework Against Graph Neural Networks
Dynamic Generative Targeted Attacks With Pattern Injection
Transferable Adversarial Attacks on Vision Transformers With Token Gradient Regularization
Adversarial Counterfactual Visual Explanations
TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization
Randomized Adversarial Training via Taylor Expansion
Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization
Evading Forensic Classifiers With Attribute-Conditioned Adversarial Faces
DartBlur: Privacy Preservation With Detection Artifact Suppression
(ends 6:00 PM)
7 p.m.
Reception:
Reception & Musical Performances
(ends 9:00 PM)
THU 22 JUN
7:30 a.m.
Breakfast
9 a.m.
Keynote:
Modeling Atoms to Address Our Climate Crisis
Larry Zitnick
(ends 10:00 AM)
10 a.m.
Break
10:30 a.m.
Poster Session THU-AM
[10:30-12:00]
Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection
JacobiNeRF: NeRF Shaping With Mutual Information Gradients
ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-Real Novel View Synthesis via Contrastive Learning
SCADE: NeRFs from Space Carving With Ambiguity-Aware Depth Estimates
Removing Objects From Neural Radiance Fields
Progressively Optimized Local Radiance Fields for Robust View Synthesis
NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds
ABLE-NeRF: Attention-Based Rendering With Learnable Embeddings for Neural Radiance Field
MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures
pCON: Polarimetric Coordinate Networks for Neural Scene Representations
Balanced Spherical Grid for Egocentric View Synthesis
Complementary Intrinsics From Neural Radiance Fields and CNNs for Outdoor Scene Relighting
HyperReel: High-Fidelity 6-DoF Video With Ray-Conditioned Sampling
UV Volumes for Real-Time Rendering of Editable Free-View Human Performance
Tensor4D: Efficient Neural 4D Decomposition for High-Fidelity Dynamic Reconstruction and Rendering
PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing
Computational Flash Photography Through Intrinsics
RelightableHands: Efficient Neural Relighting of Articulated Hand Models
TMO: Textured Mesh Acquisition of Objects With a Mobile Device by Using Differentiable Rendering
VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction
Multi-View Reconstruction Using Signed Ray Distance Functions (SRDF)
Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
Octree Guided Unoriented Surface Reconstruction
Neural Vector Fields: Implicit Representation by Explicit Learning
DA Wand: Distortion-Aware Selection Using Neural Mesh Parameterization
Diffusion-Based Generation, Optimization, and Planning in 3D Scenes
Patch-Based 3D Natural Scene Generation From a Single Example
Consistent View Synthesis With Pose-Guided Diffusion Models
Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process
High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition
TAPS3D: Text-Guided 3D Textured Shape Generation From Pseudo Supervision
SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations
Interactive Cartoonization With Controllable Perceptual Factors
High-Res Facial Appearance Capture From Polarized Smartphone Images
GlassesGAN: Eyewear Personalization Using Synthetic Appearance Discovery and Targeted Subspace Modeling
Continuous Landmark Detection With 3D Queries
NeuFace: Realistic 3D Neural Face Rendering From Multi-View Images
AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
Learning Personalized High Quality Volumetric Head Avatars From Monocular RGB Videos
OTAvatar: One-Shot Talking Face Avatar With Controllable Tri-Plane Rendering
X-Avatar: Expressive Human Avatars
InstantAvatar: Learning Avatars From Monocular Video in 60 Seconds
JAWS: Just a Wild Shot for Cinematic Transfer in Neural Radiance Fields
MonoHuman: Animatable Human Neural Field From Monocular Video
Structured 3D Features for Reconstructing Controllable Avatars
HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics
Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling
Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing
Reconstructing Animatable Categories From Videos
Deformable Mesh Transformer for 3D Human Mesh Recovery
Hi4D: 4D Instance Segmentation of Close Human Interaction
Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild
Learning Human Mesh Recovery in 3D Scenes
H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction
What You Can Reconstruct From a Shadow
Autonomous Manipulation Learning for Similar Deformable Objects via Only One Demonstration
In-Hand 3D Object Scanning From an RGB Sequence
Putting People in Their Place: Affordance-Aware Human Insertion Into Scenes
Detecting Human-Object Contact in Images
What Happened 3 Seconds Ago? Inferring the Past With Thermal Imaging
Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting
Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video
Ego-Body Pose Estimation via Ego-Head Pose Estimation
ViPLO: Vision Transformer Based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection
HS-Pose: Hybrid Scope Feature Extraction for Category-Level Object Pose Estimation
ScarceNet: Animal Pose Estimation With Scarce Annotations
Cross-Domain 3D Hand Pose Estimation With Dual Modalities
Linking Garment With Person via Semantically Associated Landmarks for Virtual Try-On
Level-S$^2$fM: Structure From Motion on Neural Level Set of Implicit Surfaces
Revisiting Rotation Averaging: Uncertainties and Robust Losses
SliceMatch: Geometry-Guided Aggregation for Cross-View Pose Estimation
Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation
Long-Term Visual Localization With Mobile Sensors
Learning To Predict Scene-Level Implicit 3D From Posed RGBD Data
Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization
The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects
Learning Accurate 3D Shape Based on Stereo Polarimetric Imaging
RUST: Latent Neural Scene Representations From Unposed Imagery
Perspective Fields for Single Image Camera Calibration
VisFusion: Visibility-Aware Online 3D Scene Reconstruction From Videos
DeepLSD: Line Segment Detection and Refinement With Deep Image Gradients
Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation With Cross-Scale Distortion Awareness
Single Image Depth Prediction Made Better: A Multivariate Gaussian Take
Wide-Angle Rectification via Content-Aware Conformal Mapping
All-in-Focus Imaging From Event Focal Stack
Multi-View Stereo Representation Revisit: Region-Aware MVSNet
Semantic Ray: Learning a Generalizable Semantic Field With Cross-Reprojection Attention
OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images
ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields
Non-Line-of-Sight Imaging With Signal Superresolution Network
Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence
Learning Transformations To Reduce the Geometric Shift in Object Detection
Anchor3DLane: Learning To Regress 3D Anchors for Monocular 3D Lane Detection
BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks
Semi-Supervised Stereo-Based 3D Object Detection via Cross-View Consensus
Weakly Supervised Monocular 3D Object Detection Using Multi-View Projection and Direction Consistency
MonoATT: Online Monocular 3D Object Detection With Adaptive Token Transformer
Azimuth Super-Resolution for FMCW Radar in Autonomous Driving
Pix2map: Cross-Modal Retrieval for Inferring Street Maps From Images
LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion
Neural Map Prior for Autonomous Driving
Spherical Transformer for LiDAR-Based 3D Recognition
Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection
PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds
PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation
Single Domain Generalization for LiDAR Semantic Segmentation
Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving
MethaneMapper: Spectral Absorption Aware Hyperspectral Transformer for Methane Detection
GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds
SCoDA: Domain Adaptive Shape Completion for Real Scans
SCPNet: Semantic Scene Completion on Point Cloud
ViewNet: A Novel Projection-Based Backbone With View Pooling for Few-Shot Point Cloud Classification
Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning
Learnable Skeleton-Aware 3D Point Cloud Sampling
Meta Architecture for Point Cloud Analysis
PointListNet: Deep Learning on 3D Point Lists
PEAL: Prior-Embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration
Unsupervised Inference of Signed Distance Functions From Single Sparse Point Clouds Without Learning Priors
Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment
Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching
3D Registration With Maximal Cliques
PanoSwin: A Pano-Style Swin Transformer for Panorama Understanding
DKM: Dense Kernelized Feature Matching for Geometry Estimation
PATS: Patch Area Transportation With Subdivision for Local Feature Matching
Correspondence Transformers With Asymmetric Feature Learning and Matching Flow Super-Resolution
Learning Adaptive Dense Event Stereo From the Image Domain
On the Convergence of IRLS and Its Variants in Outlier-Robust Estimation
You Only Segment Once: Towards Real-Time Panoptic Segmentation
BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision
UniHCP: A Unified Model for Human-Centric Perceptions
Planning-Oriented Autonomous Driving
Query-Centric Trajectory Prediction
Unsupervised Sampling Promoting for Stochastic Human Trajectory Prediction
AdamsFormer for Spatial Action Localization in the Future
PIRLNav: Pretraining With Imitation and RL Finetuning for ObjectNav
NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis
Camouflaged Instance Segmentation via Explicit De-Camouflaging
Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking
MotionTrack: Learning Robust Short-Term and Long-Term Motions for Multi-Object Tracking
Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion
Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition
One-Shot High-Fidelity Talking-Head Synthesis With Deformable Neural Radiance Field
Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis
Event-Guided Person Re-Identification via Sparse-Dense Complementary Learning
Executing Your Commands via Motion Diffusion in Latent Space
MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition
“Seeing” Electric Network Frequency From Events
Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields
Event-Based Frame Interpolation With Ad-Hoc Deblurring
Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior
TransFlow: Transformer As Flow Learner
MP-Former: Mask-Piloted Transformer for Image Segmentation
GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency
Neural Texture Synthesis With Guided Correspondence
Self-Supervised Non-Uniform Kernel Estimation With Flow-Based Motion Prior for Blind Image Deblurring
Decoupling-and-Aggregating for Image Exposure Correction
You Do Not Need Additional Priors or Regularizers in Retinex-Based Low-Light Image Enhancement
DNF: Decouple and Feedback Network for Seeing in the Dark
Contrastive Semi-Supervised Learning for Underwater Image Restoration via Reliable Bank
LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising
Spectral Bayesian Uncertainty for Image Super-Resolution
Deep Random Projector: Accelerated Deep Image Prior
Context-Aware Pretraining for Efficient Blind Image Decomposition
Metadata-Based RAW Reconstruction via Implicit Neural Functions
Raw Image Reconstruction With Learned Compact Metadata
AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration
AutoFocusFormer: Image Segmentation off the Grid
Guided Depth Super-Resolution by Deep Anisotropic Diffusion
Super-Resolution Neural Operator
Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution
GamutMLP: A Lightweight MLP for Color Loss Recovery
Efficient and Explicit Modelling of Image Hierarchies for Image Restoration
LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization
CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer
ObjectStitch: Object Compositing With Diffusion Model
DeepVecFont-v2: Exploiting Transformers To Synthesize Vector Fonts With Higher Quality
Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer
CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language
LayoutDM: Transformer-Based Diffusion Model for Layout Generation
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
SpaText: Spatio-Textual Representation for Controllable Image Generation
Paint by Example: Exemplar-Based Image Editing With Diffusion Models
InstructPix2Pix: Learning To Follow Image Editing Instructions
LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction
Self-Guided Diffusion Models
HOLODIFFUSION: Training a 3D Diffusion Model Using 2D Images
Class-Balancing Diffusion Models
Conditional Image-to-Video Generation With Latent Flow Diffusion Models
Video Probabilistic Diffusion Models in Projected Latent Space
Regularized Vector Quantization for Tokenized Image Synthesis
EfficientSCI: Densely Connected Network With Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging
MMVC: Learned Multi-Mode Video Compression With Block-Based Prediction Mode Selection and Density-Adaptive Entropy Coding
Video Compression With Entropy-Constrained Neural Representations
WIRE: Wavelet Implicit Neural Representations
TINC: Tree-Structured Implicit Neural Compression
CompletionFormer: Depth Completion With Convolutions and Vision Transformers
Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
Global Vision Transformer Pruning With Hessian-Aware Saliency
Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR
PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers
Visual Atoms: Pre-Training Vision Transformers With Sinusoidal Waves
Neuron Structure Modeling for Generalizable Remote Physiological Measurement
Explaining Image Classifiers With Multiscale Directional Image Representation
Integrally Pre-Trained Transformer Pyramid Networks
PartMix: Regularization Strategy To Learn Part Discovery for Visible-Infrared Person Re-Identification
Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions
Focused and Collaborative Feedback Integration for Interactive Image Segmentation
PolyFormer: Referring Image Segmentation As Sequential Polygon Generation
Devil’s on the Edges: Selective Quad Attention for Scene Graph Generation
Panoptic Video Scene Graph Generation
Generalized Relation Modeling for Transformer Tracking
Representation Learning for Visual Object Tracking by Masked Appearance Transfer
Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation
EVAL: Explainable Video Anomaly Localization
MOSO: Decomposing MOtion, Scene and Object for Video Prediction
TarViS: A Unified Approach for Target-Based Video Segmentation
Efficient Movie Scene Detection Using State-Space Transformers
Latency Matters: Real-Time Action Forecasting Transformer
Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning
Watch or Listen: Robust Audio-Visual Speech Recognition With Visual Corruption Modeling and Reliability Scoring
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
SVFormer: Semi-Supervised Video Transformer for Action Recognition
Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
Post-Processing Temporal Action Detection
HaLP: Hallucinating Latent Positives for Skeleton-Based Self-Supervised Learning of Actions
TriDet: Temporal Action Detection With Relative Boundary Modeling
Hybrid Active Learning via Deep Clustering for Video Action Detection
Two-Stream Networks for Weakly-Supervised Temporal Action Localization With Semantic-Aware Mechanisms
Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies
Weakly Supervised Temporal Sentence Grounding With Uncertainty-Guided Self-Training
SViTT: Temporal Learning of Sparse Video-Text Transformers
AutoAD: Movie Description in Context
Text With Knowledge Graph Augmented Transformer for Video Captioning
StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos
Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding
CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval
Learning Emotion Representations From Verbal and Nonverbal Communication
Context De-Confounded Emotion Recognition
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering
LEGO-Net: Learning Regular Rearrangements of Objects in Rooms
LANA: A Language-Capable Navigator for Instruction Following and Generation
Policy Adaptation From Foundation Model Feedback
Token Turing Machines
Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge
Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language
VQACL: A Novel Visual Question Answering Continual Learning Setting
MaPLe: Multi-Modal Prompt Learning
Meta-Personalizing Vision-Language Models To Find Named Instances in Video
Understanding and Improving Visual Prompting: A Label-Mapping Perspective
RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension
Leveraging per Image-Token Consistency for Vision-Language Pre-Training
Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations
Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification
Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning
Learning Visual Representations via Language-Guided Sampling
L-CoIns: Language-Based Colorization With Instance Awareness
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID
Unifying Vision, Text, and Layout for Universal Document Processing
RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training
Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network
Text-Guided Unsupervised Latent Transformation for Multi-Attribute Image Manipulation
Improving Image Recognition by Retrieving From Web-Scale Image-Text Data
Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval
DATE: Domain Adaptive Product Seeker for E-Commerce
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning With Multimodal Models
Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models
DeepSolo: Let Transformer Decoder With Explicit Points Solo for Text Spotting
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
R2Former: Unified Retrieval and Reranking Transformer for Place Recognition
Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator
Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework
Neural Congealing: Aligning Images to a Joint Semantic Atlas
Open Vocabulary Semantic Segmentation With Patch Aligned Contrastive Learning
Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains
Explicit Visual Prompting for Low-Level Structure Segmentations
FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation
Zero-Shot Referring Image Segmentation With Global-Local Context Features
DejaVu: Conditional Regenerative Learning To Enhance Dense Prediction
Meta Compositional Referring Expression Segmentation
Interactive Segmentation As Gaussian Process Classification
Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation
Principles of Forgetting in Domain-Incremental Semantic Segmentation in Adverse Weather Conditions
AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation
PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers
Leveraging Hidden Positives for Unsupervised Semantic Segmentation
Understanding Imbalanced Semantic Segmentation Through Neural Collapse
Balancing Logit Variation for Long-Tailed Semantic Segmentation
Boundary-Enhanced Co-Training for Weakly Supervised Semantic Segmentation
Conflict-Based Cross-View Consistency for Semi-Supervised Semantic Segmentation
Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
DualRel: Semi-Supervised Mitochondria Segmentation From a Prototype Perspective
Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
Co-Salient Object Detection With Uncertainty-Aware Group Exchange-Masking
Supervised Masked Knowledge Distillation for Few-Shot Transformers
Modeling the Distributional Uncertainty for Salient Object Detection Models
Weak-Shot Object Detection Through Mutual Knowledge Transfer
CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection
Adaptive Sparse Pairwise Loss for Object Re-Identification
DETRs With Hybrid Matching
Generating Features With Increased Crop-Related Diversity for Few-Shot Object Detection
ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector
Multiclass Confidence and Localization Calibration for Object Detection
Open-Set Representation Learning Through Combinatorial Embedding
ProD: Prompting-To-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
FFF: Fragment-Guided Flexible Fitting for Building Complete Protein Structures
Pseudo-Label Guided Contrastive Learning for Semi-Supervised Medical Image Segmentation
Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy
KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation
Image Quality-Aware Diagnosis via Meta-Knowledge Co-Embedding
Interventional Bag Multi-Instance Learning on Whole-Slide Pathological Images
Visual Prompt Tuning for Generative Transfer Learning
LINe: Out-of-Distribution Detection by Leveraging Important Neurons
GCFAgg: Global and Cross-View Feature Aggregation for Multi-View Clustering
Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification
BiCro: Noisy Correspondence Rectification for Multi-Modality Data via Bi-Directional Cross-Modal Similarity Consistency
Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning
HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization
MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset
Class Prototypes Based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Learning From Noisy Labels With Decoupled Meta Label Purifier
SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail
Why Is the Winner the Best?
Balanced Product of Calibrated Experts for Long-Tailed Recognition
Transfer Knowledge From Head to Tail: Uncertainty Calibration Under Long-Tailed Distribution
FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding
COT: Unsupervised Domain Adaptation With Clustering and Optimal Transport
MHPL: Minimum Happy Points Learning for Active Source Free Domain Adaptation
Upcycling Models Under Domain and Category Shift
PMR: Prototypical Modal Rebalance for Multimodal Learning
MMANet: Margin-Aware Distillation and Modality-Aware Regularization for Incomplete Multimodal Learning
Feature Alignment and Uniformity for Test Time Adaptation
Revisiting Prototypical Network for Cross Domain Few-Shot Learning
A Whac-a-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others
Independent Component Alignment for Multi-Task Learning
MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer
MELTR: Meta Loss Transformer for Learning To Fine-Tune Video Foundation Models
1% VS 100%: Parameter-Efficient Low Rank Adapter for Dense Predictions
Rebalancing Batch Normalization for Exemplar-Based Class-Incremental Learning
Partial Network Cloning
ERM-KTP: Knowledge-Level Machine Unlearning via Knowledge Transfer
Rethinking Feature-Based Knowledge Distillation for Face Recognition
Regularizing Second-Order Influences for Continual Learning
Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation
Decoupling Learning and Remembering: A Bilevel Memory Framework With Knowledge Projection for Task-Incremental Learning
On the Stability-Plasticity Dilemma of Class-Incremental Learning
Simulated Annealing in Early Layers Leads to Better Generalization
Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning
Tunable Convolutions With Parametric Multi-Loss Optimization
Re-Basin via Implicit Sinkhorn Differentiation
Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization
AstroNet: When Astrocyte Meets Artificial Neural Network
Network Expansion for Practical Training Acceleration
Defining and Quantifying the Emergence of Sparse Concepts in DNNs
Samples With Low Loss Curvature Improve Data Efficiency
Masked Images Are Counterfactual Samples for Robust Fine-Tuning
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers
Practical Network Acceleration With Tiny Sets
TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation
Discriminator-Cooperated Feature Map Distillation for GAN Compression
Private Image Generation With Dual-Purpose Auxiliary Classifier
ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing
Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers
A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation
SimpleNet: A Simple Network for Image Anomaly Detection and Localization
DaFKD: Domain-Aware Federated Knowledge Distillation
Reliable and Interpretable Personalized Federated Learning
Adaptive Channel Sparsity for Federated Learning Under System Heterogeneity
Bias-Eliminating Augmentation Learning for Debiased Federated Learning
Instance-Aware Domain Generalization for Face Anti-Spoofing
Adversarially Masking Synthetic To Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation
Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection
MEDIC: Remove Model Backdoors via Importance Driven Cloning
Progressive Backdoor Erasing via Connecting Backdoor and Adversarial Attacks
Reinforcement Learning-Based Black-Box Model Inversion Attacks
T-SEA: Transfer-Based Self-Ensemble Attack on Object Detection
Proximal Splitting Adversarial Attack for Semantic Segmentation
Towards Transferable Targeted Adversarial Examples
AGAIN: Adversarial Training With Attribution Span Enlargement and Hybrid Feature Fusion
Generalist: Decoupling Natural and Robust Generalization
Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets
Discrete Point-Wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition
RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts
CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search
TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization
(ends 12:00 PM)
12:30 p.m.
Lunch
2 p.m.
Panel:
Scientific Discovery and the Environment
(ends 3:00 PM)
3 p.m.
Award:
Award Candidates THU
(ends 4:00 PM)
4 p.m.
Break
4:30 p.m.
Poster Session THU-PM
[4:30-6:00]
High-Fidelity Event-Radiance Recovery via Transient Event Frequency
RobustNeRF: Ignoring Distractors With Robust Losses
NeRDi: Single-View NeRF Synthesis With Language-Guided Diffusion As General Image Priors
GM-NeRF: Learning Generalizable Model-Based Neural Radiance Fields From Multi-View Images
MixNeRF: Modeling a Ray With Mixture Density for Novel View Synthesis From Sparse Inputs
SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting With Neural Radiance Fields
Masked Wavelet Representation for Compact Neural Radiance Fields
PaletteNeRF: Palette-Based Appearance Editing of Neural Radiance Fields
SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory
Transforming Radiance Field With Lipschitz Network for Photorealistic 3D Scene Stylization
Occlusion-Free Scene Recovery via Neural Radiance Fields
TriVol: Point Cloud Rendering via Triple Volumes
DyNCA: Real-Time Dynamic Texture Synthesis Using Neural Cellular Automata
Neural Scene Chronology
ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects
ORCa: Glossy Objects As Radiance-Field Cameras
Nighttime Smartphone Reflective Flare Removal Using Optical Center Symmetry Prior
SunStage: Portrait Reconstruction and Relighting Using the Sun as a Light Stage
The Differentiable Lens: Compound Lens Search Over Glass Surfaces and Materials for Object Detection
Teleidoscopic Imaging System for Microscale 3D Shape Reconstruction
Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections
NeuralUDF: Learning Unsigned Distance Fields for Multi-View Reconstruction of Surfaces With Arbitrary Topologies
Sphere-Guided Training of Neural Implicit Surfaces
OReX: Object Reconstruction From Planar Cross-Sections Using Neural Fields
Persistent Nature: A Generative Model of Unbounded 3D Worlds
3D Neural Field Generation Using Triplane Diffusion
Diffusion-Based Signed Distance Fields for 3D Shape Generation
Efficient View Synthesis and 3D-Based Multi-Frame Denoising With Multiplane Feature Representations
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
SINE: Semantic-Driven Image-Based NeRF Editing With Prior-Guided Editing Field
3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions
Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion
PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360°
StyleGene: Crossover and Mutation of Region-Level Facial Genes for Kinship Face Synthesis
Parameter Efficient Local Implicit Image Function Network for Face Segmentation
Graphics Capsule: Learning Hierarchical 3D Face Representations From 2D Images
Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars
Learning Neural Parametric Head Models
Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation
Learning Locally Editable Virtual Humans
Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-Time Mobile Telepresence
Ham2Pose: Animating Sign Language Notation Into Pose Sequences
PointAvatar: Deformable Point-Based Head Avatars From Videos
PAniC-3D: Stylized Single-View 3D Reconstruction From Portraits of Anime Characters
HandNeRF: Neural Radiance Fields for Animatable Interacting Hands
VGFlow: Visibility Guided Flow Network for Human Reposing
Clothed Human Performance Capture With a Double-Layer Neural Radiance Fields
POEM: Reconstructing Hand in a Point Embedded Multi-View Stereo
FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans From Sparse Views
Flow Supervision for Deformable NeRF
Building Rearticulable Models for Arbitrary 3D Objects From 4D Point Clouds
Implicit 3D Human Mesh Recovery Using Consistency With Pose and Shape From Unseen-View
One-Stage 3D Whole-Body Mesh Recovery With Component Aware Transformer
Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes
FLEX: Full-Body Grasping Without Full-Body Grasps
DexArt: Benchmarking Generalizable Dexterous Manipulation With Articulated Objects
CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
CIRCLE: Capture in Rich Contextual Environments
Decoupling Human and Camera Motion From Videos in the Wild
GarmentTracking: Category-Level Garment Pose Tracking
Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition From Egocentric RGB Videos
PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation With Progressive Video Transformers
Delving Into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling
3D-POP – An Automated Annotation Approach to Facilitate Markerless 2D-3D Tracking of Freely Moving Birds With Marker-Based Motion Capture
TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation
Markerless Camera-to-Robot Pose Estimation via Self-Supervised Sim-to-Real Transfer
SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation
IMP: Iterative Matching and Pose Estimation With Adaptive Pooling
Self-Supervised Representation Learning for CAD
Few-Shot Geometry-Aware Keypoint Localization
SparsePose: Sparse-View Camera Pose Regression and Refinement
A Large-Scale Homography Benchmark
Learning Geometric-Aware Properties in 2D Representation Using Lightweight CAD Models, or Zero Real 3D Pairs
AutoRecon: Automated 3D Object Discovery and Reconstruction
Multi-Sensor Large-Scale Dataset for Multi-View 3D Reconstruction
NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization
Self-Supervised Super-Plane for Neural 3D Reconstruction
PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes
Single View Scene Scale Estimation Using Scale Field
3D Line Mapping Revisited
Inverting the Imaging Process by Learning an Implicit Camera Model
SfM-TTR: Using Structure From Motion for Test-Time Refinement of Single-View Depth Networks
iDisc: Internal Discretization for Monocular Depth Estimation
DC2: Dual-Camera Defocus Control by Learning To Refocus
A Practical Stereo Depth System for Smart Glasses
GeoMVSNet: Learning Multi-View Stereo With Geometry Perception
DINN360: Deformable Invertible Neural Network for Latitude-Aware 360° Image Rescaling
OmniVidar: Omnidirectional Depth Estimation From Multi-Fisheye Images
Learning To Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes
Modality-Invariant Visual Odometry for Embodied Vision
VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud
CAPE: Camera View Position Embedding for Multi-View 3D Object Detection
AeDet: Azimuth-Invariant Multi-View 3D Object Detection
Object Detection With Self-Supervised Scene Adaptation
Understanding the Robustness of 3D Object Detection With Bird’s-Eye-View Representations in Autonomous Driving
BEVHeight: A Robust Framework for Vision-Based Roadside 3D Object Detection
Uncertainty-Aware Vision-Based Metric Cross-View Geolocalization
OrienterNet: Visual Localization in 2D Public Maps With Neural Matching
MSMDFusion: Fusing LiDAR and Camera at Multiple Scales With Multi-Depth Seeds for 3D Object Detection
Virtual Sparse Convolution for Multimodal 3D Object Detection
Optimal Transport Minimization: Crowd Localization on Density Maps for Semi-Supervised Counting
VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking
GraVoS: Voxel Selection for 3D Point-Cloud Detection
MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving
LaserMix for Semi-Supervised LiDAR Semantic Segmentation
Implicit Surface Contrastive Clustering for LiDAR Point Clouds
Semi-Weakly Supervised Object Kinematic Motion Prediction
PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models
Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions
Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation
Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders
ToThePoint: Efficient Contrastive Learning of 3D Point Clouds via Recycling
PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection
PointConvFormer: Revenge of the Point-Based Convolution
Self-Positioning Point-Based Transformer for Point Cloud Understanding
PointClustering: Unsupervised Point Cloud Pre-Training Using Transformation Invariance in Clustering
Neural Intrinsic Embedding for Non-Rigid Point Cloud Matching
HGNet: Learning Hierarchical Geometry From Points, Edges, and Surfaces
LP-DIF: Learning Local Pattern-Specific Deep Implicit Function for 3D Objects and Scenes
Conjugate Product Graphs for Globally Optimal 2D-3D Shape Matching
UTM: A Unified Multiple Object Tracking Model With Identity-Aware Feature Enhancement
Learning Rotation-Equivariant Features for Visual Correspondence
Adaptive Spot-Guided Transformer for Consistent Local Feature Matching
PMatch: Paired Masked Image Modeling for Dense Geometric Matching
Iterative Geometry Encoding Volume for Stereo Matching
Adaptive Annealing for Robust Geometric Estimation
Tangentially Elongated Gaussian Belief Propagation for Event-Based Incremental Optical Flow Estimation
Robust and Scalable Gaussian Process Regression and Its Applications
BEV-Guided Multi-Modality Fusion for Driving Perception
HumanBench: Towards General Human-Centric Perception With Projector Assisted Pretraining
Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
ProphNet: Efficient Agent-Centric Motion Forecasting With Anchor-Informed Proposals
StarCraftImage: A Dataset for Prototyping Spatial Reasoning Methods for Multi-Agent Environments
Stimulus Verification Is a Universal and Effective Sampler in Multi-Modal Human Trajectory Prediction
PyPose: A Library for Robot Learning With Physics-Based Optimization
Source-Free Adaptive Gaze Estimation by Uncertainty Reduction
Camouflaged Object Detection With Feature Decomposition and Edge Reconstruction
MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors
Clothing-Change Feature Augmentation for Person Re-Identification
Dynamic Aggregated Network for Gait Recognition
Feature Representation Learning With Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition
MetaPortrait: Identity-Preserving Talking Head Generation With Fast Personalized Adaptation
FLAG3D: A 3D Fitness Activity Dataset With Language Instruction
TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning With Structure-Trajectory Prompted Reconstruction for Person Re-Identification
NeMo: Learning 3D Neural Motion Fields From Multiple Video Instances of the Same Action
Unsupervised Space-Time Network for Temporally-Consistent Segmentation of Multiple Motions
Deep Polarization Reconstruction With PDAVIS Events
Range-Nullspace Video Frame Interpolation With Focalized Motion Estimation
Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation
1000 FPS HDR Video With a Spike-RGB Hybrid Camera
Deep Discriminative Spatial and Temporal Network for Efficient Video Deblurring
Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement
A Unified HDR Imaging Method With Pixel and Patch Level
BiasBed – Rigorous Texture Bias Evaluation
Learning a Practical SDR-to-HDRTV Up-Conversion Using New Dataset and Degradation Models
Learning a Deep Color Difference Metric for Photographic Images
Learning a Simple Low-Light Image Enhancer From Paired Low-Light Instances
Residual Degradation Learning Unfolding Framework With Mixing Priors Across Spectral and Spatial for Compressive Spectral Imaging
Toward Stable, Interpretable, and Lightweight Hyperspectral Super-Resolution
RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors
Robust Unsupervised StyleGAN Image Restoration
Quality-Aware Pre-Trained Models for Blind Image Quality Assessment
Learning To Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization
Multi-Realism Image Compression With a Conditional Generator
RGB No More: Minimally-Decoded JPEG Vision Transformers
Kernel Aware Resampler
Spatial-Frequency Mutual Learning for Face Super-Resolution
Activating More Pixels in Image Super-Resolution Transformer
Omni Aggregation Networks for Lightweight Image Super-Resolution
Towards Artistic Image Aesthetics Assessment: A Large-Scale Dataset and a New Method
RWSC-Fusion: Region-Wise Style-Controlled Fusion Network for the Prohibited X-Ray Security Image Synthesis
Efficient Scale-Invariant Generator With Column-Row Entangled Pixel Synthesis
Masked and Adaptive Transformer for Exemplar Based Image Translation
SmartBrush: Text and Shape Guided Object Inpainting With Diffusion Model
Neural Transformation Fields for Arbitrary-Styled Font Generation
Referring Image Matting
Handwritten Text Generation From Visual Archetypes
SceneComposer: Any-Level Semantic Image Synthesis
Affordance Diffusion: Synthesizing Hand-Object Interactions
LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
GLIGEN: Open-Set Grounded Text-to-Image Generation
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
EDICT: Exact Diffusion Inversion via Coupled Transformations
Solving 3D Inverse Problems Using Pre-Trained 2D Diffusion Models
Diffusion Probabilistic Model Made Slim
Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models
Binary Latent Diffusion
Semi-Supervised Video Inpainting With Cycle Consistency Constraints
Towards Accurate Image Coding: Improved Autoregressive Image Generation With Dynamic Vector Quantization
Large-Capacity and Flexible Video Steganography via Invertible Neural Network
Neural Video Compression With Diverse Contexts
Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos
Structured Sparsity Learning for Efficient Video Super-Resolution
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Boost Vision Transformer With GPU-Friendly Sparsity and Quantization
All Are Worth Words: A ViT Backbone for Diffusion Models
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
Vision Transformer With Super Token Sampling
DropKey for Vision Transformer
Seeing Beyond the Brain: Conditional Diffusion Model With Sparse Masked Modeling for Vision Decoding
ResFormer: Scaling ViTs With Multi-Resolution Training
Stare at What You See: Masked Image Modeling Without Reconstruction
Mixed Autoencoder for Self-Supervised Visual Representation Learning
Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification
G-MSM: Unsupervised Multi-Shape Matching With Graph-Based Affinity Priors
Efficient Mask Correction for Click-Based Interactive Image Segmentation
Prototype-Based Embedding Network for Scene Graph Generation
Graph Representation for Order-Aware Visual Transformation
Unbiased Scene Graph Generation in Videos
Recurrence Without Recurrence: Stable Video Landmark Detection With Deep Equilibrium Models
VideoTrack: Learning To Track Objects via Video Transformer
Breaking the “Object” in Video Object Segmentation
Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection
Mask-Free Video Instance Segmentation
Hierarchical Neural Memory Network for Low Latency Event Processing
Unifying Short and Long-Term Tracking With Graph Hierarchies
Towards End-to-End Generative Modeling of Long Videos With Memory-Efficient Bidirectional Transformers
An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling
Egocentric Audio-Visual Object Localization
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR
A Light Weight Model for Active Speaker Detection
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Video Test-Time Adaptation for Action Recognition
Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling
Object Discovery From Motion-Guided Tokens
Open Set Action Recognition via Multi-Label Evidential Learning
PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization
Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels
Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-Based Active Learning
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
Hierarchical Video-Moment Retrieval and Step-Captioning
HierVL: Learning Hierarchical Video-Language Embeddings
Learning Transferable Spatiotemporal Representations From Natural Script Knowledge
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding
Learning Action Changes by Measuring Verb-Adverb Textual Relationships
LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling
DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking
CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition With Variational Alignment
Joint Visual Grounding and Tracking With Natural Language Specification
Accelerating Vision-Language Pretraining With Free Language Modeling
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
Where We Are and What We’re Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes
ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos
MetaCLUE: Towards Comprehensive Visual Metaphors Research
GeoVLN: Learning Geometry-Enhanced Visual Representation With Slot Attention for Vision-and-Language Navigation
Being Comes From Not-Being: Open-Vocabulary Text-to-Motion Generation With Wordless Training
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models
Position-Guided Text Prompt for Vision-Language Pre-Training
Intrinsic Physical Concepts Discovery With Object-Centric Predictive Models
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model
CLAMP: Prompt-Based Contrastive Learning for Connecting Language and Animal Pose
Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models
DegAE: A New Pretraining Paradigm for Low-Level Vision
RILS: Masked Visual Reconstruction in Language Semantic Space
Learning Geometry-Aware Representations by Sketching
SketchXAI: A First Look at Explainability for Human Sketches
MAGVLT: Masked Generative Vision-and-Language Transformer
Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
Semantic-Conditional Diffusion Networks for Image Captioning
REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
Variational Distribution Learning for Unsupervised Text-to-Image Generation
Scaling Language-Image Pre-Training via Masking
LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
Revisiting Self-Similarity: Structural Embedding for Image Retrieval
Improving Cross-Modal Retrieval With Set of Diverse Embeddings
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Few-Shot Learning With Visual Distribution Calibration and Cross-Modal Distribution Alignment
Deep Hashing With Minimal-Distance-Separated Hash Centers
ConZIC: Controllable Zero-Shot Image Captioning by Sampling-Based Polishing
Learning To Name Classes for Vision and Language Models
Data-Efficient Large Scale Place Recognition With Graded Similarity Supervision
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment
HOICLIP: Efficient Knowledge Transfer for HOI Detection With Vision-Language Models
OvarNet: Towards Open-Vocabulary Object Attribute Recognition
NeRF-RPN: A General Framework for Object Detection in NeRFs
Mask-Free OVIS: Open-Vocabulary Instance Segmentation Without Manual Mask Annotations
GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning
Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning
Contrastive Grouping With Transformer for Referring Image Segmentation
Semantic Prompt for Few-Shot Image Recognition
GRES: Generalized Referring Expression Segmentation
Network-Free, Unsupervised Semantic Segmentation With Synthetic Images
Few-Shot Semantic Image Synthesis With Class Affinity Transfer
Ultra-High Resolution Segmentation With Ultra-Rich Context: A Novel Benchmark
Content-Aware Token Sharing for Efficient Semantic Segmentation With Vision Transformers
Hierarchical Dense Correlation Distillation for Few-Shot Segmentation
On Calibrating Semantic Segmentation Models: Analyses and an Algorithm
FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation
Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation
Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation
Long-Tailed Visual Recognition via Self-Heterogeneous Integration With Knowledge Excavation
Instance-Specific and Model-Adaptive Supervision for Semi-Supervised Semantic Segmentation
Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm
IDGI: A Framework To Eliminate Explanation Noise From Integrated Gradients
Weakly Supervised Posture Mining for Fine-Grained Classification
Vision Transformers Are Good Mask Auto-Labelers
Enhanced Training of Query-Based Object Detection via Selective Query Recollection
Box-Level Active Detection
CIGAR: Cross-Modality Graph Reasoning for Domain Adaptive Object Detection
DA-DETR: Domain Adaptive Detection Transformer With Information Fusion
Continual Detection Transformer for Incremental Object Detection
Semi-DETR: Semi-Supervised Object Detection With Detection Transformers
Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection
Harmonious Teacher for Cross-Domain Object Detection
Contrastive Mean Teacher for Domain Adaptive Object Detectors
Out-of-Distributed Semantic Pruning for Robust Semi-Supervised Learning
(ML)²P-Encoder: On Exploration of Channel-Class Correlation for Multi-Label Zero-Shot Learning
MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery
Devil Is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization
SQUID: Deep Feature In-Painting for Unsupervised Anomaly Detection
OCELOT: Overlapped Cell on Tissue Dataset for Histopathology
DeGPR: Deep Guided Posterior Regularization for Multi-Class Cell Detection and Counting
Best of Both Worlds: Multimodal Contrastive Learning With Tabular and Imaging Data
RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories
GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection
Discriminating Known From Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder
Sample-Level Multi-View Graph Clustering
On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering
Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric
Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement
Open-Set Likelihood Maximization for Few-Shot Learning
HyperMatch: Noise-Tolerant Semi-Supervised Learning via Relaxed Contrastive Constraint
Token Boosting for Robust Self-Supervised Visual Transformer Pre-Training
Difficulty-Based Sampling for Debiased Contrastive Representation Learning
Improving Selective Visual Question Answering by Learning From Your Peers
Superclass Learning With Representation Enhancement
DISC: Learning From Noisy Labels via Dynamic Instance-Specific Selection and Correction
FCC: Feature Clusters Compression for Long-Tailed Visual Recognition
Dynamically Instance-Guided Adaptation: A Backward-Free Approach for Test-Time Domain Adaptive Semantic Segmentation
Semi-Supervised Domain Adaptation With Source Label Adaptation
Adjustment and Alignment for Unbiased Open Set Domain Adaptation
C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation
ALOFT: A Lightweight MLP-Like Architecture With Dynamic Low-Frequency Transform for Domain Generalization
Modality-Agnostic Debiasing for Single Domain Generalization
ActMAD: Activation Matching To Align Distributions for Test-Time-Training
TIPI: Test Time Adaptation With Transformation Invariance
Improved Test-Time Adaptation for Domain Generalization
Learning With Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning
NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
PIVOT: Prompting for Video Continual Learning
BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning
DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning
PCR: Proxy-Based Contrastive Replay for Online Class-Incremental Continual Learning
Masked Autoencoders Enable Efficient Knowledge Distillers
Data-Free Knowledge Distillation via Feature Exchange and Activation Region Constraint
Multi-Level Logit Distillation
Preserving Linear Separability in Continual Learning by Backward Feature Projection
Critical Learning Periods for Multisensory Integration in Deep Networks
SLACK: Stable Learning of Augmentations With Cold-Start and KL Regularization
Improving Generalization With Domain Convex Game
Exploring Data Geometry for Continual Learning
FlowGrad: Controlling the Output of Generative ODEs With Gradients
Deep Graph Reprogramming
X-Pruner: eXplainable Pruning for Vision Transformers
Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures
Compacting Binary Neural Networks by Sparse Kernel Selection
Deep Deterministic Uncertainty: A New Simple Baseline
Understanding Deep Generative Models With Generalized Empirical Likelihoods
Fair Scratch Tickets: Finding Fair Sparse Networks Without Weight Training
Hard Sample Matters a Lot in Zero-Shot Quantization
PD-Quant: Post-Training Quantization Based on Prediction Difference Metric
Vector Quantization With Self-Attention for Quality-Independent Representation Learning
Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond
Sequential Training of GANs Against GAN-Classifiers Reveals Correlated “Knowledge Gaps” Present Among Independently Trained GAN Instances
Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision
Towards Universal Fake Image Detectors That Generalize Across Generative Models
Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection
Generating Anomalies for Video Anomaly Detection With Prompt-Based Feature Mapping
Revisiting Reverse Distillation for Anomaly Detection
MetaMix: Towards Corruption-Robust Continual Learning With Temporally Self-Adaptive Data Transformation
ScaleFL: Resource-Adaptive Federated Learning With Heterogeneous Clients
Confidence-Aware Personalized Federated Learning via Variational Expectation Maximization
Make Landscape Flatter in Differentially Private Federated Learning
Rethinking Domain Generalization for Face Anti-Spoofing: Separability and Alignment
StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning
The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection
Architectural Backdoors in Neural Networks
You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?
A Practical Upper Bound for the Worst-Case Attribution Deviations
Sibling-Attack: Rethinking Transferable Adversarial Attacks Against Face Recognition
Angelic Patches for Improving Third-Party Object Detector Performance
Introducing Competition To Boost the Transferability of Targeted Adversarial Examples Through Clean Feature Mixup
Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations
Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation
The Enemy of My Enemy Is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training
Robust Single Image Reflection Removal Against Adversarial Attacks
Physical-World Optical Adversarial Attacks on 3D Face Recognition
AUNet: Learning Relations Between Action Units for Face Forgery Detection
(ends 6:00 PM)