CVPR 2023 Schedule

SUN 18 JUN

7:30 a.m.

Breakfast:

Breakfast

(ends 9:00 AM)

7:50 a.m.

Workshop:

Synthetic Data for Autonomous Systems (SDAS)

(ends 12:30 PM)

Workshop:

The Second Workshop on Structural and Compositional Learning on 3D Data

(ends 12:30 PM)

Workshop:

The 2nd International Workshop on Transformers for Vision

(ends 5:30 PM)

8 a.m.

Workshop:

The Fourth Workshop on Fair, Data-efficient, and Trusted Computer Vision

(ends 4:30 PM)

Workshop:

OmniLabel: Infinite label spaces for semantic understanding via natural language

(ends 12:00 PM)

Workshop:

Second Workshop of Mobile Intelligent Photography and Imaging

(ends 12:00 PM)

Workshop:

8th New Trends in Image Restoration and Enhancement Workshop and Challenges

(ends 7:00 PM)

Workshop:

1st Workshop on Multimodal Content Moderation

(ends 6:00 PM)

Workshop:

Third Workshop on Ethical Considerations in Creative Applications of Computer Vision - EC3V

(ends 12:10 PM)

Workshop:

CVPR 2023 - 10th Workshop on Medical Computer Vision (MCV)

(ends 4:00 PM)

Workshop:

The 3rd Workshop of Adversarial Machine Learning on Computer Vision: Art of Robustness

(ends 4:00 PM)

Workshop:

The Second Workshop on 3D Vision and Robotics

(ends 6:00 PM)

Workshop:

New Frontiers for Zero-Shot Image Captioning Evaluation

(ends 12:00 PM)

8:20 a.m.

Workshop:

DL-UIA: Deep Learning in Ultrasound Image Analysis

(ends 12:00 PM)

8:25 a.m.

Workshop:

GAZE 2023: The 5th International Workshop on Gaze Estimation and Prediction in the Wild

(ends 12:00 PM)

8:30 a.m.

Workshop:

LatinX in Computer Vision Research Workshop

(ends 6:00 PM)

Workshop:

19th CVPR Workshop on Perception Beyond the Visible Spectrum (PBVS 2023)

(ends 5:30 PM)

Workshop:

Generative Models for Computer Vision

(ends 5:15 PM)

Workshop:

XRNeRF: Advances in NeRF for the Metaverse

(ends 12:30 PM)

Workshop:

VAND: Visual Anomaly and Novelty Detection

(ends 12:30 PM)

Workshop:

Catch UAVs that Want to Watch You: Detection and Tracking of Unmanned Aerial Vehicle (UAV) in the Wild and the 3rd Anti-UAV Workshop & Challenge

(ends 12:10 PM)

Workshop:

2nd Monocular Depth Estimation Challenge

(ends 12:00 PM)

Workshop:

4th Workshop on Continual Learning in Computer Vision (CLVision)

(ends 5:30 PM)

Workshop:

4th International Workshop on Large Scale Holistic Video Understanding

(ends 12:00 PM)

Tutorial:

Recent advances in anomaly detection

(ends 5:15 PM)

Tutorial:

A Comprehensive Tour and Recent Advancements toward Real-world Visual Geo-Localization

(ends 5:30 PM)

Tutorial:

Efficient Neural Networks: From Algorithm Design to Practical Mobile Deployment

(ends 12:00 PM)

Tutorial:

Recent Advances in Visual Domain Adaptation and Generalization

(ends 11:45 AM)

Tutorial:

Trustworthy AI in the Era of Foundation Models

(ends 11:45 AM)

Tutorial:

ML Systems for Large Models and Federated Learning

(ends 11:45 AM)

Tutorial:

Vision Transformer: More is different

(ends 11:45 AM)

8:45 a.m.

Workshop:

Topological, Algebraic, and Geometric Pattern Recognition with Applications Workshop Proposal

(ends 5:30 PM)

Workshop:

7th Workshop on Media Forensics

(ends 5:30 PM)

Workshop:

FGVC10: 10th Workshop on Fine-grained Visual Categorization

(ends 4:45 PM)

8:55 a.m.

Workshop:

3rd International Workshop and Challenge on Long-form Video Understanding and Generation

(ends 12:35 PM)

9 a.m.

Workshop:

Workshop on End-to-end Autonomous Driving

(ends 6:00 PM)

Workshop:

Visual Perception via Learning in an Open World

(ends 5:00 PM)

Workshop:

Computer Vision for Mixed Reality

(ends 12:30 PM)

Workshop:

CVPR 2023 Biometrics Workshop

(ends 5:30 PM)

Workshop:

12th IEEE International Workshop on Computational Cameras and Displays (CCD)

(ends 5:00 PM)

Workshop:

EarthVision: Large Scale Computer Vision for Remote Sensing Imagery

(ends 5:45 PM)

Workshop:

3rd Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings

(ends 6:00 PM)

Workshop:

Fourth Workshop on Neural Architecture Search, Third lightweight NAS challenge

(ends 6:00 PM)

Workshop:

2nd Workshop on Tracking and Its Many Guises: Tracking Any Object in Open-World

(ends 5:00 PM)

Workshop:

The 4th CVPR Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics

(ends 5:30 PM)

Tutorial:

Denoising Diffusion Models: A Generative Learning Big Bang

(ends 12:30 PM)

Tutorial:

Boosting Computer Vision Research with OpenMMLab and OpenDataLab

(ends 12:00 PM)

Tutorial:

All Things ViTs: Understanding and Interpreting Attention in Vision

(ends 12:00 PM)

9:15 a.m.

Workshop:

Workshop on Autonomous Driving (WAD)

(ends 6:15 PM)

Workshop:

6th Multi-modal Learning and Applications Workshop (MULA)

(ends 6:00 PM)

Workshop:

New Frontiers in Visual Language Reasoning: Compositionality, Prompts and Causality

(ends 5:30 PM)

9:20 a.m.

Workshop:

CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling (Workshop)

(ends 5:45 PM)

9:30 a.m.

Tutorial:

Skull Restoration, Facial Reconstruction and Expression

(ends 11:45 AM)

10 a.m.

Break:

Break

(ends 10:45 AM)

11:45 a.m.

Break:

Lunch

(ends 1:30 PM)

12:30 p.m.

Workshop:

End-to-End Autonomous Driving: Perception, Prediction, Planning and Simulation

(ends 7:05 PM)

12:45 p.m.

Workshop:

1st Workshop on Compositional 3D Vision & 3DCoMPaT Challenge

(ends 5:45 PM)

Workshop:

6th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues

(ends 6:30 PM)

1 p.m.

Workshop:

High-fidelity Neural Actors

(ends 6:00 PM)

Workshop:

AVA: Accessibility, Vision, and Autonomy Meet

(ends 5:30 PM)

Workshop:

4D Hand Object Interaction: Geometric Understanding and Applications in Dexterous Manipulation

(ends 6:00 PM)

Workshop:

QCVML: Quantum Computer Vision and Machine Learning Workshop

(ends 6:00 PM)

Workshop:

Computer Vision for Fashion, Art, and Design

(ends 5:30 PM)

Workshop:

1st workshop on Capturing, Interpreting & Visualizing Indoor Living Spaces

(ends 6:00 PM)

Workshop:

The Fifth Workshop on Precognition: Seeing through the Future

(ends 5:00 PM)

1:30 p.m.

Workshop:

2nd Workshop and Challenge on Vision Datasets Understanding

(ends 5:30 PM)

Workshop:

The Fourth Workshop on Face and Gesture Analysis for Health Informatics (FGAHI)

(ends 6:00 PM)

Workshop:

The 3rd Workshop on Light Fields for Computer Vision LFNAT: New Applications and Trends in Light Fields

(ends 6:15 PM)

Workshop:

Pixel-level Video Understanding in the Wild Challenge

(ends 5:30 PM)

Tutorial:

Contactless Healthcare using Cameras and Wireless Sensors

(ends 5:00 PM)

Tutorial:

Large-scale Deep Learning Optimization Techniques

(ends 5:00 PM)

3 p.m.

Break:

Break

(ends 3:45 PM)

MON 19 JUN

7:30 a.m.

Breakfast:

Breakfast

(ends 9:00 AM)

8 a.m.

Workshop:

Vision-Centric Autonomous Driving (VCAD)

(ends 5:00 PM)

Workshop:

4th International Workshop on Event-based Vision

(ends 6:00 PM)

Workshop:

5th Workshop and Competition on Affective Behavior Analysis in-the-wild

(ends 12:30 PM)

Workshop:

3DMV: Learning 3D with Multi-View Supervision

(ends 12:15 PM)

Workshop:

The Sixth International Workshop on Computer Vision for Physiological Measurement (CVPM)

(ends 12:00 PM)

Workshop:

8th Workshop on Computer Vision for Microscopy Image Analysis

(ends 6:00 PM)

Workshop:

Multi-Agent Behavior: Properties, Computation and Emergence

(ends 12:30 PM)

Workshop:

Image Matching: Local Features and Beyond

(ends 12:30 PM)

Workshop:

4th Agriculture-Vision Workshop: Challenges & Opportunities for Computer Vision in Agriculture

(ends 12:00 PM)

Workshop:

3rd Mobile AI Workshop and Challenges

(ends 6:00 PM)

Workshop:

7th AI City Challenge Workshop

(ends 5:30 PM)

8:15 a.m.

Workshop:

VizWiz Grand Challenge: Describing Images and Videos Taken by Blind People

(ends 12:00 PM)

8:20 a.m.

Workshop:

First Rhobin Challenge - Reconstruction of human-object interaction

(ends 12:30 PM)

8:30 a.m.

Workshop:

Women in Computer Vision Workshop

(ends 1:00 PM)

Workshop:

The 6th Workshop and Prize Challenge Bridging the Gap between Computational Photography and Visual Recognition (UG2+) in conjunction with IEEE CVPR 2023

(ends 5:00 PM)

Workshop:

Workshop on Vision-based InduStrial InspectiON (VISION)

(ends 6:00 PM)

Workshop:

Embedded Vision Workshop 2023

(ends 5:00 PM)

Workshop:

2nd Workshop on Federated Learning for Computer Vision

(ends 5:30 PM)

Workshop:

Joint 3rd Ego4D and 11th EPIC Workshop on Egocentric Vision

(ends 6:45 PM)

Workshop:

O-DRUM: Workshop on Open-Domain Reasoning Under Multi-Modal Settings

(ends 5:00 PM)

Tutorial:

Large-Scale Visual Localization

(ends 12:15 PM)

Tutorial:

Object localization for free: Going beyond self-supervised learning

(ends 12:00 PM)

Tutorial:

Recent Advances in Vision Foundation Models

(ends 12:30 PM)

8:45 a.m.

Workshop:

Secure and Safe Autonomous Driving Workshop and Challenge (SSAD)

(ends 5:00 PM)

8:50 a.m.

Workshop:

The 6th Efficient Deep Learning for Computer Vision

(ends 6:35 PM)

9 a.m.

Workshop:

The 2nd Explainable AI for Computer Vision (XAI4CV) Workshop

(ends 5:30 PM)

Workshop:

Safe Artificial Intelligence for All Domains

(ends 5:00 PM)

Workshop:

The Fifth Workshop on Deep Learning for Geometric Computing

(ends 5:00 PM)

Workshop:

AI for Content Creation

(ends 6:15 PM)

Workshop:

Visual Copy Detection Workshop

(ends 12:00 PM)

Workshop:

Computer Vision in the Wild

(ends 5:30 PM)

Workshop:

Sight and Sound

(ends 6:00 PM)

Workshop:

L3D-IVU: 2nd Workshop on Learning with Limited Labelled Data for Image and Video Understanding

(ends 5:10 PM)

Workshop:

9th IEEE International Workshop on Computer Vision in Sports (CVsports)

(ends 5:30 PM)

Workshop:

Workshop on Foundation Models: 1st Foundation Model Challenge

(ends 12:30 PM)

Workshop:

Visual Pre-training for Robotics

(ends 5:30 PM)

Workshop:

4th Embodied AI Workshop

(ends 5:30 PM)

Workshop:

Vision for All Seasons: Adverse Weather and Lighting Conditions

(ends 5:30 PM)

Tutorial:

Polarization-based Computer Vision

(ends 12:00 PM)

Tutorial:

Multi-Objective Optimization for Deep Learning

(ends 12:00 PM)

Tutorial:

All you need to know about self-driving

(ends 5:00 PM)

Tutorial:

Few-shot Learning from Meta-Learning, Statistical Understanding to Applications

(ends 12:30 PM)

Tutorial:

Prompting in Vision

(ends 12:00 PM)

Tutorial:

Reverse Engineering of Deception (RED): Foundations and Applications

(ends 12:00 PM)

Tutorial:

Rolling Shutter Camera: Modeling, Optimization, Learning, and Hardware

(ends 12:00 PM)

Tutorial:

Hyperbolic Deep Learning in Computer Vision

(ends 12:00 PM)

Tutorial:

Deep Learning Theory for Computer Vision

(ends 12:00 PM)

Tutorial:

Automatic 3D modeling of indoor structures from panoramic imagery

(ends 12:30 PM)

Tutorial:

Optics for Better AI: Capturing and Synthesizing Realistic Data for Low-light Enhancement

(ends 12:00 PM)

Tutorial:

Knowledge-Driven Vision-Language Encoding

(ends 12:30 PM)

10 a.m.

Break:

Break

(ends 10:45 AM)

11:45 a.m.

Break:

Lunch

(ends 1:30 PM)

12:45 p.m.

Workshop:

Scholars and Big Models — How Can Academics Adapt?

(ends 6:05 PM)

1 p.m.

Workshop:

The 4th Workshop on Omnidirectional Computer Vision

(ends 6:00 PM)

Workshop:

Photogrammetric Computer Vision

(ends 6:00 PM)

Workshop:

2nd Challenge on Machine Visual Common Sense: Perception, Prediction, Planning

(ends 6:00 PM)

Workshop:

RetailVision - Revolutionizing the World of Retail

(ends 6:00 PM)

Workshop:

2nd Workshop on Multimodal Learning for Earth and Environment (MultiEarth)

(ends 5:00 PM)

1:30 p.m.

Workshop:

5th ScanNet Indoor Scene Understanding Challenge

(ends 5:30 PM)

Workshop:

The 4th Face Anti-spoofing Workshop and Challenge

(ends 5:30 PM)

Workshop:

DynaVis: The 4th International Workshop on Dynamic Scene Reconstruction

(ends 5:20 PM)

Tutorial:

Neural Search in Action

(ends 4:30 PM)

Tutorial:

Physics-based rendering and its applications in computational photography and imaging

(ends 5:00 PM)

Tutorial:

Exploring Synthetic data as an Enterprise Capability for Training and Validating CV Systems

(ends 4:30 PM)

Tutorial:

Full-Stack, GPU-based Acceleration of Deep Learning

(ends 5:00 PM)

Tutorial:

Hands-on Egocentric Research with Project Aria from Meta

(ends 5:00 PM)

3 p.m.

Break:

Break

(ends 3:45 PM)

TUE 20 JUN

7:30 a.m.

Breakfast:

Breakfast

(ends 9:00 AM)

8:30 a.m.

Opening Ceremony:

Opening Ceremony

(ends 9:00 AM)

9 a.m.

Keynote:

Revisiting Old Ideas With Modern Hardware

Rodney Brooks

(ends 10:00 AM)

10 a.m.

Break:

Break

(ends 10:30 AM)

10:30 a.m.

Poster Session TUE-AM [10:30-12:00]

Megahertz Light Steering Without Moving Parts

Robust Dynamic Radiance Fields

DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields

VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization

AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training

SeaThru-NeRF: Neural Radiance Fields in Scattering Media

Exact-NeRF: An Exploration of a Precise Volumetric Parameterization for Neural Radiance Fields

Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos

PlenVDB: Memory Efficient VDB-Based Radiance Fields for Fast Training and Rendering

Local Implicit Ray Function for Generalizable Radiance Field Representation

SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes

Frequency-Modulated Point Cloud Rendering With Easy Editing

HexPlane: A Fast Representation for Dynamic Scenes

Differentiable Shadow Mapping for Efficient Inverse Graphics

Hybrid Neural Rendering for Large-Scale Scenes With Motion Blur

TensoIR: Tensorial Inverse Rendering

ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision

Realistic Saliency Guided Image Enhancement

LightPainter: Interactive Portrait Relighting With Freehand Scribble

A Unified Spatial-Angular Structured Light for Single-View Acquisition of Shape and Reflectance

Learning Visibility Field for Detailed 3D Human Reconstruction and Relighting

Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses

NeUDF: Leaning Neural Unsigned Distance Fields With Volume Rendering

NeAT: Learning Neural Implicit Surfaces With Arbitrary Topologies From Multi-View Images

ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction

Controllable Mesh Generation Through Sparse Latent Point Diffusion Models

Power Bundle Adjustment for Large-Scale 3D Reconstruction

Neural Pixel Composition for 3D-4D View Synthesis From Multi-Views

Magic3D: High-Resolution Text-to-3D Content Creation

3D Video Loops From Asynchronous Input

High-Fidelity 3D GAN Inversion by Pseudo-Multi-View Optimization

Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field

3D GAN Inversion With Facial Symmetry Prior

StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping

FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction

Robust Model-Based Face Reconstruction Through Weakly-Supervised Outlier Segmentation

Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild

A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction From In-the-Wild Images

BlendFields: Few-Shot Example-Driven Facial Modeling

Implicit Neural Head Synthesis via Controllable Local Deformation Fields

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

GANHead: Towards Generative Animatable Neural Head Avatars

EDGE: Editable Dance Generation From Music

Blowing in the Wind: CycleNet for Human Cinemagraphs From Still Images

Generating Holistic 3D Human Motion From Speech

Avatars Grow Legs: Generating Smooth Human Motion From Sparse Tracking Inputs With Diffusion Model

Learning Anchor Transformations for 3D Garment Animation

CloSET: Modeling Clothed Humans on Continuous Surface With Explicit Template Decomposition

ECON: Explicit Clothed Humans Optimized via Normal Integration

PersonNeRF: Personalized Reconstruction From Photo Collections

3D Human Mesh Estimation From Virtual Markers

Overcoming the Trade-Off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction

Recovering 3D Hand Mesh Sequence From a Single Blurry Image: A New Dataset and Temporal Unfolding

MeMaHand: Exploiting Mesh-Mano Interaction for Single Image Two-Hand Reconstruction

PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation

CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis

Instant-NVR: Instant Neural Volumetric Rendering for Human-Object Interactions From Monocular RGBD Stream

BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects

Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes

Omnimatte3D: Associating Objects and Their Effects in Unconstrained Monocular Video

On the Benefits of 3D Pose and Tracking for Human Action Recognition

Towards Stable Human Pose Estimation via Cross-View Fusion and Foot Stabilization

Human Pose As Compositional Tokens

PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation

SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments

Semi-Supervised 2D Human Pose Estimation Driven by Position Inconsistency Pseudo Label Correction Module

Human Pose Estimation in Extremely Low-Light Conditions

Flexible-Cm GAN: Towards Precise 3D Dose Prediction in Radiotherapy

DualRefine: Self-Supervised Depth and Pose Estimation Through Iterative Epipolar Sampling and Refinement Toward Equilibrium

A Rotation-Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization

Semidefinite Relaxations for Robust Multiview Triangulation

A Probabilistic Attention Model With Occlusion-Aware Texture Regression for 3D Hand Reconstruction From a Single RGB Image

Instant Multi-View Head Capture Through Learnable Registration

On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks

Learning 3D Scene Priors With 2D Supervision

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

OpenScene: 3D Scene Understanding With Open Vocabularies

Multi-View Azimuth Stereo via Tangent Space Consistency

Progressive Transformation Learning for Leveraging Virtual Images in Training

Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries

NeRF-Supervised Deep Stereo

Semantic Scene Completion With Cleaner Self

PanelNet: Understanding 360 Indoor Environment via Panel Representation

Implicit View-Time Interpolation of Stereo Videos Using Multi-Plane Disparities and Non-Uniform Coordinates

Depth Estimation From Indoor Panoramas With Neural Scene Representation

NeuralPCI: Spatio-Temporal Neural Field for 3D Point Cloud Multi-Frame Non-Linear Interpolation

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo

NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization

MACARONS: Mapping and Coverage Anticipation With RGB Online Self-Supervision

vMAP: Vectorised Object Mapping for Neural Field SLAM

Seeing a Rose in Five Thousand Ways

Propagate and Calibrate: Real-Time Passive Non-Line-of-Sight Tracking

Seeing With Sound: Long-range Acoustic Beamforming for Multimodal Scene Understanding

Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

BEV-LaneDet: An Efficient 3D Lane Detection Based on Virtual Camera via Key-Points

AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers

Benchmarking Robustness of 3D Object Detection to Common Corruptions

Gaussian Label Distribution Learning for Spherical Image Object Detection

Deep Depth Estimation From Thermal Image

LidarGait: Benchmarking 3D Gait Recognition With Point Clouds

Generalized UAV Object Detection via Frequency Domain Disentanglement

Learning Compact Representations for LiDAR Completion and Generation

CXTrack: Improving 3D Point Cloud Tracking With Contextual Information

Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline

LinK: Linear Kernel for LiDAR-Based 3D Perception

Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting

Curricular Object Manipulation in LiDAR-Based Object Detection

Delivering Arbitrary-Modal Semantic Segmentation

Robust Outlier Rejection for 3D Registration With Variational Bayes

3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels

Self-Supervised Pre-Training With Masked Shape Prediction for 3D Scene Understanding

ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer

PointCMP: Contrastive Mask Prediction for Self-Supervised Learning on Point Cloud Videos

E2PN: Efficient SE(3)-Equivariant Point Network

Poly-PC: A Polyhedral Network for Multiple Point Cloud Tasks at Once

Improving Graph Representation for Point Cloud Segmentation via Attentive Filtering

BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration

TopDiG: Class-Agnostic Topological Directional Graph Extraction From Remote Sensing Images

Recognizing Rigid Patterns of Unlabeled Point Clouds by Complete and Continuous Isometry Invariants With No False Negatives and No False Positives

Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation

CCuantuMM: Cycle-Consistent Quantum-Hybrid Matching of Multiple Shapes

Enhancing Deformable Local Features by Jointly Learning To Detect and Describe Keypoints

Understanding and Improving Features Learned in Deep Functional Maps

High-Frequency Stereo Matching Network

Rethinking Optical Flow From Geometric Matching Consistent Perspective

Efficient Robust Principal Component Analysis via Block Krylov Iteration and CUR Decomposition

VectorFloorSeg: Two-Stream Graph Attention Network for Vectorized Roughcast Floorplan Segmentation

TBP-Former: Learning Temporal Bird’s-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving

Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving

UniSim: A Neural Closed-Loop Sensor Simulator

FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-Tail Trajectory Prediction

EqMotion: Equivariant Multi-Agent Motion Prediction With Invariant Interaction Reasoning

Lookahead Diffusion Probabilistic Models for Refining Mean Estimation

Neural Volumetric Memory for Visual Locomotion Control

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

DrapeNet: Garment Generation and Self-Supervised Draping

Tracking Multiple Deformable Objects in Egocentric Videos

Good Is Bad: Causality Inspired Cloth-Debiasing for Cloth-Changing Person Re-Identification

Micron-BERT: BERT-Based Facial Micro-Expression Recognition

MARLIN: Masked Autoencoder for Facial Video Representation LearnINg

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator

RealImpact: A Dataset of Impact Sound Fields for Real Objects

STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

Progressive Spatio-Temporal Alignment for Efficient Event-Based Motion Estimation

Event-Based Shape From Polarization

Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution

BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation

A Unified Pyramid Recurrent Network for Video Frame Interpolation

Event-Based Blurry Frame Interpolation Under Blind Exposure

FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation

POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery

Adaptive Patch Deformation for Textureless-Resilient Multi-View Stereo

On the Difficulty of Unpaired Infrared-to-Visible Video Translation: Fine-Grained Content-Rich Patches Transfer

Thermal Spread Functions (TSF): Physics-Guided Material Classification

Better “CMOS” Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution

Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement

CutMIB: Boosting Light Field Super-Resolution via Multi-View Image Blending

sRGB Real Noise Synthesizing With Neighboring Correlation-Aware Noise Model

Masked Image Training for Generalizable Deep Image Denoising

DR2: Diffusion-Based Robust Degradation Remover for Blind Face Restoration

Learning Distortion Invariant Representation for Image Restoration From a Causality Perspective

Perception-Oriented Single Image Super-Resolution Using Optimal Objective Estimation

Catch Missing Details: Image Reconstruction With Frequency Augmented Variational Autoencoder

MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos

CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network With Large Input

Initialization Noise in Image Gradients and Saliency Maps

Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution

Deep Arbitrary-Scale Image Super-Resolution via Scale-Equivariance Pursuit

CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution

Multiplicative Fourier Level of Detail

Document Image Shadow Removal Guided by Color-Aware Background

StyleRes: Transforming the Residuals for Real Image Editing With StyleGAN

TopNet: Transformer-Based Object Placement Network for Image Compositing

VecFontSDF: Learning To Reconstruct and Synthesize High-Quality Vector Fonts via Signed Distance Functions

CF-Font: Content Fusion for Few-Shot Font Generation

SIEDOB: Semantic Image Editing by Disentangling Object and Background

MaskSketch: Unpaired Structure-Guided Masked Image Generation

Text2Scene: Text-Driven Indoor Scene Stylization With Part-Aware Details

Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models

VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Multi-Concept Customization of Text-to-Image Diffusion

Unifying Layout Generation With a Decoupled Diffusion Model

BBDM: Image-to-Image Translation With Brownian Bridge Diffusion Models

Towards Practical Plug-and-Play Diffusion Models

Post-Training Quantization on Diffusion Models

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

Mask-Guided Matting in the Wild

Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation

Compression-Aware Video Super-Resolution

Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models

DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos

Polynomial Implicit Neural Representations for Large Diverse Datasets

Learning Decorrelated Representations Efficiently Using Fast Fourier Transform

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution

Slide-Transformer: Hierarchical Vision Transformer With Local Self-Attention

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

Top-Down Visual Attention From Analysis by Synthesis

Probing Neural Representations of Scene Perception in a Hippocampally Dependent Task Using Artificial Neural Networks

Masked Image Modeling With Local Multi-Scale Reconstruction

Siamese Image Modeling for Self-Supervised Vision Representation Learning

MAGE: MAsked Generative Encoder To Unify Representation Learning and Image Synthesis

Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification

DistilPose: Tokenized Pose Regression With Heatmap Distillation

Graph Transformer GANs for Graph-Constrained House Generation

Automatic High Resolution Wire Segmentation and Removal

Tree Instance Segmentation With Temporal Contour Graph

Dual-Path Adaptation From Image to Video Transformers

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning

Masked Motion Encoding for Self-Supervised Video Representation Learning

Boosting Video Object Segmentation via Space-Time Correspondence Learning

Two-Shot Video Object Segmentation

Look Before You Match: Instance Understanding Matters in Video Object Segmentation

Spatial-Then-Temporal Self-Supervised Learning for Video Correspondence

Few-Shot Referring Relationships in Videos

Vision Transformers Are Parameter-Efficient Audio-Visual Learners

Egocentric Video Task Translation

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation

Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards

TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition

How Can Objects Help Action Recognition?

Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition

Decomposed Cross-Modal Distillation for RGB-Based Temporal Action Detection

ASPnet: Action Segmentation With Shared-Private Representation of Multiple Data Sources

Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization

LOGO: A Long-Form Video Dataset for Group Action Quality Assessment

Use Your Head: Improving Long-Tail Video Recognition

Conditional Generation of Audio From Video via Foley Analogies

Weakly Supervised Video Representation Learning With Unaligned Text for Sequential Videos

You Can Ground Earlier Than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

Connecting Vision and Language With Video Localized Narratives

Video-Text As Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

Make-a-Story: Visual Memory Conditioned Consistent Story Generation

Test of Time: Instilling Video-Language Models With a Sense of Time

How You Feelin’? Learning Emotions and Mental States in Movie Scenes

Continuous Sign Language Recognition With Correlation Network

DIP: Dual Incongruity Perceiving Network for Sarcasm Detection

Gloss Attention for Gloss-Free Sign Language Translation

Object-Goal Visual Navigation via Effective Exploration of Relations Among Historical Navigation States

Behavioral Analysis of Vision-and-Language Navigation Agents

KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation

Where Is My Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization

Efficient Multimodal Fusion via Interactive Prompting

NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations

Dynamic Inference With Grounding Based Vision and Language Models

Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning

Teaching Structured Vision & Language Concepts to Vision & Language Models

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Learning From Unique Perspectives: User-Aware Saliency Modeling

CRAFT: Concept Recursive Activation FacTorization for Explainability

Doubly Right Object Recognition: A Why Prompt for Visual Rationales

Sketch2Saliency: Learning To Detect Salient Objects From Human Drawings

PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification

Photo Pre-Training, but for Sketch

CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval

Multi-Modal Representation Learning With Text-Driven Soft Masks

Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Reproducible Scaling Laws for Contrastive Language-Image Learning

Multilateral Semantic Relations Modeling for Image Text Retrieval

SmallCap: Lightweight Image Captioning Prompted With Retrieval Augmentation

Probing Sentiment-Oriented Pre-Training Inspired by Human Sentiment Perception Mechanism

Prefix Conditioning Unifies Language and Label Supervision

Crossing the Gap: Domain Generalization for Image Captioning

A Bag-of-Prototypes Representation for Dataset-Level Applications

CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model

D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers

Learning To Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space

Relational Context Learning for Human-Object Interaction Detection

Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

Side Adapter Network for Open-Vocabulary Semantic Segmentation

Open-Vocabulary Panoptic Segmentation With Text-to-Image Diffusion Models

IFSeg: Image-Free Semantic Segmentation via Vision-Language Model

PartManip: Learning Cross-Category Generalizable Part Manipulation Policy From Point Cloud Observations

OneFormer: One Transformer To Rule Universal Image Segmentation

Delving Into Shape-Aware Zero-Shot Semantic Segmentation

CoMFormer: Continual Learning in Semantic and Panoptic Segmentation

Learning To Segment Every Referring Object Point by Point

Unsupervised Continual Semantic Adaptation Through Neural Rendering

Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation

Transformer Scale Gate for Semantic Segmentation

Style Projected Clustering for Domain Generalized Semantic Segmentation

Rethinking Few-Shot Medical Segmentation: A Vector Quantization View

Continual Semantic Segmentation With Automatic Memory Sample Selection

Token Contrast for Weakly-Supervised Semantic Segmentation

Multi-Granularity Archaeological Dating of Chinese Bronze Dings Based on a Knowledge-Guided Relation Graph

Hunting Sparsity: Density-Guided Contrastive Learning for Semi-Supervised Semantic Segmentation

Cut and Learn for Unsupervised Object Detection and Instance Segmentation

Extracting Class Activation Maps From Non-Discriminative Features As Well

BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation

Hierarchical Fine-Grained Image Forgery Detection and Localization

Towards Professional Level Crowd Annotation of Expert Domain Data

Unsupervised Object Localization: Observing the Background To Discover Objects

Semi-Supervised Learning Made Simple With Self-Supervised Clustering

Unbalanced Optimal Transport: A Unified Framework for Object Detection

DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

CLIP the Gap: A Single Domain Generalization Approach for Object Detection

Unknown Sniffer for Object Detection: Don’t Turn a Blind Eye to Unknown Objects

Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection

Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection

AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection

Where Is My Spot? Few-Shot Image Generation via Latent Subspace Optimization

Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection

MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition

Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation

RepMode: Learning to Re-Parameterize Diverse Experts for Subcellular Structure Prediction

Topology-Guided Multi-Class Cell Context Generation for Digital Pathology

Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation

Benchmarking Self-Supervised Learning on Diverse Pathology Datasets

Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning

Learning Expressive Prompting With Residuals for Vision Transformers

Decoupling MaxLogit for Out-of-Distribution Detection

Exploring Structured Semantic Prior for Multi Label Recognition With Incomplete Labels

Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification

DivClust: Controlling Diversity in Deep Clustering

Deep Semi-Supervised Metric Learning With Mixed Label Propagation

Leveraging Inter-Rater Agreement for Classification in the Presence of Noisy Labels

Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery

Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery

Towards Realistic Long-Tailed Semi-Supervised Learning: Consistency Is All You Need

PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery

Probabilistic Knowledge Distillation of Face Ensembles

Class-Conditional Sharpness-Aware Minimization for Deep Long-Tailed Recognition

Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization

Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection

MOT: Masked Optimal Transport for Partial Domain Adaptation

TOPLight: Lightweight Neural Networks With Task-Oriented Pretraining for Visible-Infrared Recognition

OSAN: A One-Stage Alignment Network To Unify Multimodal Alignment and Unsupervised Domain Adaptation

Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective

ARO-Net: Learning Implicit Fields From Anchored Radial Observations

A Probabilistic Framework for Lifelong Test-Time Adaptation

Distribution Shift Inversion for Out-of-Distribution Prediction

Learning Joint Latent Space EBM Prior Model for Multi-Layer Generator

A Data-Based Perspective on Transfer Learning

A Meta-Learning Approach to Predicting Performance and Data Requirements

Guided Recommendation for Model Fine-Tuning

EMT-NAS: Transferring Architectural Knowledge Between Tasks From Different Datasets

AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning

Batch Model Consolidation: A Multi-Task Model Consolidation Framework

SmartAssign: Learning a Smart Knowledge Assignment Strategy for Deraining and Desnowing

TinyMIM: An Empirical Study of Distilling MIM Pre-Trained Models

Computationally Budgeted Continual Learning: What Does Matter?

GradMA: A Gradient-Memory-Based Accelerated Federated Learning With Alleviated Catastrophic Forgetting

Rethinking Gradient Projection Continual Learning: Stability / Plasticity Feature Space Decoupling

Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation

Generalizing Dataset Distillation via Deep Generative Prior

Minimizing the Accumulated Trajectory Error To Improve Dataset Distillation

Slimmable Dataset Condensation

Sharpness-Aware Gradient Matching for Domain Generalization

Dynamic Neural Network for Multi-Task Learning Searching Across Diverse Network Topologies

SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries

VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution

Efficient On-Device Training via Gradient Filtering

Are Data-Driven Explanations Robust Against Out-of-Distribution Data?

BiasAdv: Bias-Adversarial Augmentation for Model Debiasing

Q-DETR: An Efficient Low-Bit Quantized Detection Transformer

NIPQ: Noise Proxy-Based Integrated Pseudo-Quantization

CUDA: Convolution-Based Unlearnable Datasets

KD-DLGAN: Data Limited Image Generation via Knowledge Distillation

Spider GAN: Leveraging Friendly Neighbors To Accelerate GAN Training

Efficient Verification of Neural Networks Against LVM-Based Specifications

Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining

DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection

OmniAL: A Unified CNN Framework for Unsupervised Anomaly Localization

Federated Incremental Semantic Segmentation

Re-Thinking Federated Active Learning Based on Inter-Class Diversity

Federated Domain Generalization With Generalization Adjustment

On the Effectiveness of Partial Variance Reduction in Federated Learning With Heterogeneous Data

The Resource Problem of Using Linear Layer Leakage Attack in Federated Learning

Unlearnable Clusters: Towards Label-Agnostic Unlearnable Examples

Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization

Backdoor Defense via Adaptively Splitting Poisoned Dataset

How to Backdoor Diffusion Models?

TrojViT: Trojan Insertion in Vision Transformers

TrojDiff: Trojan Attacks on Diffusion Models With Diverse Targets

Ensemble-Based Blackbox Attacks on Dense Prediction

Efficient Loss Function by Minimizing the Detrimental Effect of Floating-Point Errors on Gradient-Based Attacks

The Best Defense Is a Good Offense: Adversarial Augmentation Against Adversarial Attacks

Adversarial Robustness via Random Projection Filters

Jedi: Entropy-Based Localization and Removal of Adversarial Patches

Exploring the Relationship Between Architectural Design and Adversarially Robust Generalization

Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions

Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition

AltFreezing for More General Video Face Forgery Detection

(ends 12:00 PM)

12:30 p.m.

Break:

Lunch

(ends 2:00 PM)

2 p.m.

Panel:

History and Future of Artificial Intelligence and Computer Vision

(ends 3:00 PM)

3 p.m.

Award:

Award Candidates TUE

(ends 4:00 PM)

4 p.m.

Break:

Break

(ends 4:30 PM)

4:30 p.m.

Poster Session TUE-PM [4:30-6:00]

Posters 4:30-6:00

Passive Micron-Scale Time-of-Flight With Sunlight Interferometry

F2-NeRF: Fast Neural Radiance Field Training With Free Camera Trajectories

NoPe-NeRF: Optimising Neural Radiance Field With No Pose Prior

BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields

DiffusioNeRF: Regularizing Neural Radiance Fields With Denoising Diffusion Models

SPARF: Neural Radiance Fields From Sparse and Noisy Poses

Interactive Segmentation of Radiance Fields

Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields

Compressing Volumetric Radiance Fields to 1 MB

Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis

Ref-NPR: Reference-Based Non-Photorealistic Radiance Fields for Controllable Scene Stylization

Representing Volumetric Videos As Dynamic MLP Maps

Fast Monocular Scene Reconstruction With Global-Sparse Local-Dense Grids

DynIBaR: Neural Dynamic Image-Based Rendering

Plateau-Reduced Differentiable Path Tracing

NeFII: Inverse Rendering for Reflectance Decomposition With Near-Field Indirect Illumination

WildLight: In-the-Wild Inverse Rendering With a Flashlight

Relightable Neural Human Assets From Multi-View Gradient Illuminations

DiffRF: Rendering-Guided 3D Radiance Field Diffusion

Analyzing Physical Impacts Using Transient Surface Wave Imaging

Neural Kaleidoscopic Space Sculpting

Towards Unbiased Volume Rendering of Neural Implicit Surfaces With Geometry Priors

Neural Kernel Surface Reconstruction

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency

Shape, Pose, and Appearance From a Single Image via Bootstrapped Radiance Field Inversion

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

Heat Diffusion Based Multi-Scale and Geometric Structure-Aware Transformer for Mesh Segmentation

Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis From Monocular Image

3D-Aware Conditional Image Synthesis

VIVE3D: Viewpoint-Independent Video Editing Using 3D-Aware GANs

SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation

Generating Part-Aware Editable 3D Shapes Without 3D Supervision

NeuralLift-360: Lifting an In-the-Wild 2D Photo to a 3D Object With 360° Views

Implicit Identity Driven Deepfake Face Swapping Detection

Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields

Improving Fairness in Facial Albedo Estimation via Visual-Textual Cues

High-Fidelity 3D Face Generation From Natural Language Descriptions

DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment

High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors

3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion

Instant Volumetric Head Avatars

Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement

3D Cinemagraphy From a Single Image

TryOnDiffusion: A Tale of Two UNets

Diverse 3D Hand Gesture Prediction From Body Dynamics by Bilateral Hand Disentanglement

Normal-Guided Garment UV Prediction for Human Re-Texturing

REC-MV: REconstructing 3D Dynamic Cloth From Monocular Videos

SeSDF: Self-Evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction

Unsupervised Volumetric Animation

Handy: Towards a High Fidelity 3D Hand Shape and Appearance Model

Fantastic Breaks: A Dataset of Paired 3D Scans of Real-World Broken Objects and Their Complete Counterparts

Distilling Neural Fields for Real-Time Articulated Shape Reconstruction

GANmouflage: 3D Object Nondetection With Texture Fields

3D Human Pose Estimation via Intuitive Physics

Object Pop-Up: Can We Infer 3D Objects and Their Poses From Human Interactions Alone?

UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy

Constrained Evolutionary Diffusion Filter for Monocular Endoscope Tracking

Visibility Aware Human-Object Interaction Tracking From Single RGB Camera

Transformer-Based Unified Recognition of Two Hands Manipulating Objects

HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation

3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention

GFPose: Learning 3D Human Pose Prior With Gradient Fields

JRDB-Pose: A Large-Scale Dataset for Multi-Person Pose Estimation and Tracking

Analyzing and Diagnosing Pose Estimation With Attributions

Shape-Constraint Recurrent Flow for 6D Object Pose Estimation

TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation

Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery From Sparse Image Ensemble

Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution

Revisiting the P3P Problem

Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories

MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices

EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation Without Scene Supervision

GINA-3D: Learning To Generate Implicit Neural Assets in the Wild

Habitat-Matterport 3D Semantics Dataset

BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image

Panoptic Compositional Feature Field for Editable Scene Rendering With Network-Inferred Labels via Metric Learning

A Light Touch Approach to Teaching Transformers Multi-View Geometry

Learning To Render Novel Views From Wide-Baseline Stereo Pairs

Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo

EventNeRF: Neural Radiance Fields From a Single Colour Event Camera

LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles

Generating Aligned Pseudo-Supervision From Non-Aligned Data for Image Restoration in Under-Display Camera

Spatio-Focal Bidirectional Disparity Estimation From a Dual-Pixel Image

Trap Attention: Monocular Depth Estimation With Manual Traps

Accelerated Coordinate Encoding: Learning to Relocalize in Minutes Using RGB and Poses

Energy-Efficient Adaptive 3D Sensing

Incremental 3D Semantic Scene Graph Prediction From RGB Sequences

Consistent Direct Time-of-Flight Video Depth Super-Resolution

Learning To Zoom and Unzoom

FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection

3D Video Object Detection With Learnable Object-Centric Global Optimization

UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird’s-Eye View

ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data

Deep Dive Into Gradients: Better Optimization for 3D Object Detection With Gradient-Corrected IoU Supervision

SlowLiDAR: Increasing the Latency of LiDAR-Based Detection Using Adversarial Examples

Normalizing Flow Based Feature Synthesis for Outlier-Aware Object Detection

OcTr: Octree-Based Transformer for 3D Object Detection

HypLiLoc: Towards Effective LiDAR Pose Regression With Hyperbolic Fusion

LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation

MSF: Motion-Guided Sequential Fusion for Efficient 3D Object Detection From Point Cloud Sequences

SFD2: Semantic-Guided Feature Detection and Description

Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving

Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast for Autonomous Driving

RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

Spatiotemporal Self-Supervised Learning for Point Clouds in the Wild

Change-Aware Sampling and Contrastive Learning for Satellite Images

Self-Supervised 3D Scene Flow Estimation Guided by Superpoints

SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

CP3: Channel Pruning Plug-In for Point-Based Networks

Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis

Hyperspherical Embedding for Point Cloud Completion

Attention-Based Point Cloud Edge Sampling

Starting From Non-Parametric Networks for 3D Point Cloud Analysis

Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent With Learned Distance Functions

SE-ORNet: Self-Ensembling Orientation-Aware Network for Unsupervised Point Cloud Shape Correspondence

Robust 3D Shape Classification via Non-Local Graph Attention Network

Rotation-Invariant Transformer for Point Cloud Matching

Deep Graph-Based Spatial Consistency for Robust Non-Rigid Point Cloud Registration

Efficient RGB-T Tracking via Cross-Modality Distillation

Finding Geometric Models by Clustering in the Consensus Space

Adaptive Assignment for Geometry Aware Local Feature Matching

Masked Representation Learning for Domain Generalized Stereo Matching

Learning Optical Expansion From Scale Matching

AnyFlow: Arbitrary Scale Optical Flow With Implicit Neural Representation

HouseDiffusion: Vector Floorplan Generation via a Diffusion Model With Discrete and Continuous Denoising

Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving

V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting

ViP3D: End-to-End Visual Trajectory Prediction via 3D Agent Queries

IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction

Leapfrog Diffusion Model for Stochastic Trajectory Prediction

DeFeeNet: Consecutive 3D Human Motion Prediction With Deviation Feedback

Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation

ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection

Feature Shrinkage Pyramid for Camouflaged Object Detection With Transformers

OVTrack: Open-Vocabulary Multiple Object Tracking

GaitGCI: Generative Counterfactual Intervention for Gait Recognition

Multi-Label Compound Expression Recognition: C-EXPR Database & Network

Blemish-Aware and Progressive Face Retouching With Limited Paired Data

High-Fidelity and Freely Controllable Talking Head Video Generation

3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition

UDE: A Unified Driving Engine for Human Motion Generation

Data-Driven Feature Tracking for Event Cameras

MoStGAN-V: Video Generation With Temporal Motion Styles

Two-Stage Co-Segmentation Network Based on Discriminative Representation for Recovering Human Mesh From Videos

Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction

Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation

Deep Stereo Video Inpainting

Burstormer: Burst Image Restoration and Enhancement Transformer

Blur Interpolation Transformer for Real-World Motion From Blur

HDR Imaging With Spatially Varying Signal-to-Noise Ratios

Light Source Separation and Intrinsic Image Decomposition Under AC Illumination

Physics-Guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography

Neumann Network With Recursive Kernels for Single Image Defocus Deblurring

UMat: Uncertainty-Aware Single Image High Resolution Material Capture

SMAE: Few-Shot Learning for HDR Deghosting With Saturation-Aware Masked Autoencoders

Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing

Patch-Craft Self-Supervised Training for Correlated Image Denoising

Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising

All-in-One Image Restoration for Unknown Degradations Using Adaptive Discriminative Filters for Specific Degradations

Ingredient-Oriented Multi-Degradation Learning for Image Restoration

CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability

Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild

Toward Accurate Post-Training Quantization for Image Super Resolution

Learning Steerable Function for Efficient Image Resampling

ABCD: Arbitrary Bitwise Coefficient for De-Quantization

Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring

Learning a Sparse Transformer Network for Effective Image Deraining

CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion

PCT-Net: Full Resolution Image Harmonization Using Pixel-Wise Color Transformations

Semi-Supervised Parametric Real-World Image Harmonization

Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution

QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity

Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model

Person Image Synthesis via Denoising Diffusion Model

Disentangling Writer and Character Styles for Handwriting Generation

NoisyTwins: Class-Consistent and Diverse Image Generation Through StyleGANs

High-Fidelity Guided Image Synthesis With Latent Diffusion Models

Imagic: Text-Based Real Image Editing With Diffusion Models

PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout

SINE: SINgle Image Editing With Text-to-Image Diffusion Models

NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models

Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models

Parallel Diffusion Models of Operator and Image for Blind Inverse Problems

Unite and Conquer: Plug & Play Multi-Modal Synthesis Using Diffusion Models

Collaborative Diffusion for Multi-Modal Face Generation and Editing

Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding

NVTC: Nonlinear Vector Transform Coding

Motion Information Propagation for Neural Video Compression

A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

Towards Scalable Neural Representation for Diverse Videos

DINER: Disorder-Invariant Implicit Neural Representation

SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy

DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network

Optimization-Inspired Cross-Attention Transformer for Compressive Sensing

Neighborhood Attention Transformer

Making Vision Transformers Efficient From a Token Sparsification View

Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors

Neuralizer: General Neuroimage Analysis Without Re-Training

Learning Partial Correlation Based Deep Visual Representation for Image Classification

Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

Adaptive Graph Convolutional Subspace Clustering

Deep Learning of Partial Graph Matching via Differentiable Top-K

DynamicDet: A Unified Dynamic Architecture for Object Detection

IS-GGT: Iterative Scene Graph Generation With Generative Transformers

Fast Contextual Scene Graph Generation With Unbiased Context Augmentation

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning

MED-VT: Multiscale Encoder-Decoder Video Transformer With Application To Object Segmentation

MOVES: Manipulated Objects in Video Enable Segmentation

InstMove: Instance Motion for Object-Centric Video Segmentation

ZBS: Zero-Shot Background Subtraction via Instance-Level Background Modeling and Foreground Selection

Feature Aggregated Queries for Transformer-Based Video Object Detectors

Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation

Selective Structured State-Spaces for Long-Form Video Understanding

Relational Space-Time Query in Long-Form Videos

Novel-View Acoustic Synthesis

Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

CASP-Net: Rethinking Video Saliency Prediction From an Audio-Visual Consistency Perceptual Perspective

Decompose More and Aggregate Better: Two Closer Looks at Frequency Representation Learning for Human Motion Prediction

TempSAL – Uncovering Temporal Information for Deep Saliency Prediction

Prompt-Guided Zero-Shot Anomaly Action Recognition Using Pretrained Deep Skeleton Features

MMG-Ego4D: Multimodal Generalization in Egocentric Action Recognition

Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition

Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation

Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks

Iterative Proposal Refinement for Weakly-Supervised Video Grounding

Movies2Scenes: Using Movie Metadata To Learn Scene Representation

Fine-Tuned CLIP Models Are Efficient Video Learners

Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring

VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval

ProTéGé: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding

Learning Video Representations From Large Language Models

All in One: Exploring Unified Video-Language Pre-Training

High-Fidelity Generalized Emotional Talking Face Generation With Multi-Modal Emotion Space Learning

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition With Pre-Trained Vision-Language Models

Decoupled Multimodal Distilling for Emotion Recognition

Affection: Learning Affective Explanations for Real-World Visual Data

An Actor-Centric Causality Graph for Asynchronous Temporal Inference in Group Activity

VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision

3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

NaQ: Leveraging Narrations As Queries To Supervise Episodic Memory

EC2: Emergent Communication for Embodied Control

Abstract Visual Reasoning: An Algebraic Approach for Solving Raven’s Progressive Matrices

Logical Implications for Visual Question Answering Consistency

Divide and Conquer: Answering Questions With Object Factorization and Compositional Reasoning

The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training

Visual-Language Prompt Tuning With Knowledge-Guided Context Optimization

Probabilistic Prompt Learning for Dense Prediction

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning

Affordance Grounding From Demonstration Video To Target Image

Leverage Interactive Affinity for Affordance Learning

DeAR: Debiasing Vision-Language Models With Additive Residuals

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Hyperbolic Contrastive Learning for Visual Representations Beyond Objects

Picture That Sketch: Photorealistic Image Generation From Abstract Sketches

GeneCIS: A Benchmark for General Conditional Image Similarity

Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR

Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching Between Parts and Words

DeltaEdit: Exploring Text-Free Training for Text-Driven Image Manipulation

Detecting and Grounding Multi-Modal Media Manipulation

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding

Cross-Domain Image Captioning With Discriminative Finetuning

EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata

Uncurated Image-Text Datasets: Shedding Light on Demographic Bias

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Turning a CLIP Model Into a Scene Text Detector

ScanDMM: A Deep Markov Model of Scanpath Prediction for 360° Images

CrOC: Cross-View Online Clustering for Dense Visual Representation Learning

PLA: Language-Driven Open-Vocabulary 3D Scene Understanding

CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching

Open-Vocabulary Attribute Detection

Learning To Detect and Segment for Open Vocabulary Object Detection

Open-Vocabulary Semantic Segmentation With Mask-Adapted CLIP

A Simple Framework for Text-Supervised Semantic Segmentation

GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts

GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction

Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss

Generative Semantic Segmentation

MISC210K: A Large-Scale Dataset for Multi-Instance Semantic Correspondence

MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation

PACO: Parts and Attributes of Common Objects

PartDistillation: Learning Parts From Instance Segmentation

ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation

Reliability in Semantic Segmentation: Are We on the Right Track?

Rethinking the Correlation in Few-Shot Segmentation: A Buoys View

SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation

Endpoints Weight Fusion for Class Incremental Semantic Segmentation

Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class

Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation With Implicit Neural Representations

Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation

Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection

Texture-Guided Saliency Distilling for Unsupervised Salient Object Detection

An Erudite Fine-Grained Visual Classification Model

Dynamic Graph Learning With Content-Guided Spatial-Frequency Relation Reasoning for Deepfake Detection

ScaleDet: A Scalable Multi-Dataset Object Detector

Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference

Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation

Dynamic Coarse-To-Fine Learning for Oriented Tiny Object Detection

Dense Distinct Query for End-to-End Object Detection

Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection

One-to-Few Label Assignment for End-to-End Dense Detection

Test Time Adaptation With Regularized Loss for Weakly Supervised Salient Object Detection

MixTeacher: Mining Promising Labels With Mixed Scale Teacher for Semi-Supervised Object Detection

Exploring Incompatible Knowledge Transfer in Few-Shot Image Generation

Exploring Intra-Class Variation Factors With Learnable Cluster Prompts for Semi-Supervised Image Synthesis

A Soma Segmentation Benchmark in Full Adult Fly Brain

SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation

Label-Free Liver Tumor Segmentation

Interactive and Explainable Region-Guided Radiology Report Generation

A Loopback Network for Explainable Microvascular Invasion Classification

Task-Specific Fine-Tuning via Variational Information Bottleneck for Weakly-Supervised Pathology Whole Slide Image Classification

YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors

Two-Way Multi-Label Loss

Teaching Matters: Investigating the Role of Supervision in Vision Transformers

Detection of Out-of-Distribution Samples Using Binary Neuron Activation Patterns

Label Information Bottleneck for Label Enhancement

Glocal Energy-Based Learning for Few-Shot Open-Set Recognition

Noisy Correspondence Learning With Meta Similarity Correction

Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-Shot Learning With Hyperspherical Embeddings

Coreset Sampling From Open-Set for Fine-Grained Self-Supervised Learning

Boosting Semi-Supervised Learning by Exploiting All Unlabeled Data

Trade-Off Between Robustness and Accuracy of Vision Transformers

Exploring and Utilizing Pattern Imbalance

Dynamic Conceptional Contrastive Learning for Generalized Category Discovery

Towards Better Decision Forests: Forest Alternating Optimization

Learning Debiased Representations via Conditional Attribute Interpolation

On the Pitfall of Mixup for Uncertainty Calibration

Class Relationship Embedded Learning for Source-Free Unsupervised Domain Adaptation

FeatureBooster: Boosting Feature Descriptors With a Lightweight Neural Network

Guiding Pseudo-Labels With Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation

Divide and Adapt: Active Domain Adaptation via Customized Learning

Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning

Deep Factorized Metric Learning

Meta-Causal Learning for Single Domain Generalization

Meta Omnium: A Benchmark for General-Purpose Learning-To-Learn

Robust Mean Teacher for Continual and Gradual Test-Time Adaptation

NAR-Former: Neural Architecture Representation Learning Towards Holistic Attributes Prediction

Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning

Architecture, Dataset and Model-Scale Agnostic Data-Free Meta-Learning

GKEAL: Gaussian Kernel Embedded Analytic Learning for Few-Shot Class Incremental Task

Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing With Non-Learnable Primitives

Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary

Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning

Learning To Retain While Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation

A Unified Knowledge Distillation Framework for Deep Directed Graphical Models

Coaching a Teachable Student

Adaptive Plasticity Improvement for Continual Learning

Improving Generalization of Meta-Learning With Inverted Regularization at Inner-Level

Trainable Projected Gradient Method for Robust Fine-Tuning

Imitation Learning As State Matching via Differentiable Physics

Improved Distribution Matching for Dataset Condensation

A General Regret Bound of Preconditioned Gradient Method for DNN Training

From Node Interaction To Hop Interaction: New Effective and Scalable Graph Learning Paradigm

Constructing Deep Spiking Neural Networks From Artificial Neural Networks With Knowledge Distillation

Rate Gradient Approximation Attack Threats Deep Spiking Neural Networks

MobileOne: An Improved One Millisecond Mobile Backbone

Understanding Masked Autoencoders via Hierarchical Latent Variable Models

Training Debiased Subnetworks With Contrastive Weight Pruning

One-Shot Model for Mixed-Precision Quantization

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Adaptive Data-Free Quantization

Learning To Generate Image Embeddings With User-Level Differential Privacy

Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences Between Pretrained Generative Models

HandsOff: Labeled Dataset Generation With No Additional Human Annotations

Attribute-Preserving Face Dataset Anonymization via Latent Code Optimization

Fake It Till You Make It: Learning Transferable Representations From Synthetic ImageNet Clones

Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection

Multimodal Industrial Anomaly Detection via Hybrid Fusion

FedSeg: Class-Heterogeneous Federated Learning for Semantic Segmentation

Decentralized Learning With Multi-Headed Distillation

Learning Federated Visual Prompt in Null Space for MRI Reconstruction

Federated Learning With Data-Agnostic Distribution Fusion

CaPriDe Learning: Confidential and Private Decentralized Learning Based on Encryption-Friendly Distillation Loss

RiDDLE: Reversible and Diversified De-Identification With Latent Encryptor

Multi-View Adversarial Discriminator: Mine the Non-Causal Factors for Object Detection in Unseen Domains

Single Image Backdoor Inversion via Robust Smoothed Classifiers

Effective Ambiguity Attack Against Passport-Based DNN Intellectual Property Protection Schemes Through Fully Connected Layer Substitution

Color Backdoor: A Robust Poisoning Attack in Color Space

Adversarially Robust Neural Architecture Search for Graph Neural Networks

Minimizing Maximum Model Discrepancy for Transferable Black-Box Targeted Attacks

StyLess: Boosting the Transferability of Adversarial Examples

Improving the Transferability of Adversarial Samples by Path-Augmented Method

Feature Separation and Recalibration for Adversarial Robustness

CFA: Class-Wise Calibrated Fair Adversarial Training

Revisiting Residual Networks for Adversarial Robustness

Privacy-Preserving Adversarial Facial Features

Edge-Aware Regional Message Passing Controller for Image Forgery Localization

(ends 6:00 PM)

7 p.m.

WED 21 JUN

7:30 a.m.

Breakfast:

Breakfast

(ends 9:00 AM)

8:30 a.m.

Award:

Award Ceremony

(ends 9:00 AM)

9 a.m.

Keynote:

An AI Odyssey: the Dark Matter of Intelligence

Yejin Choi

(ends 10:00 AM)

10 a.m.

Break:

Break

(ends 10:30 AM)

10:30 a.m.

Poster Session WED-AM [10:30-12:00]

Posters 10:30-12:00

Swept-Angle Synthetic Wavelength Interferometry

RefSR-NeRF: Towards High Fidelity and Super Resolution View Synthesis

FreeNeRF: Improving Few-Shot Neural Rendering With Free Frequency Regularization

Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields

Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation From 2D Supervision

NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects

Grid-Guided Neural Radiance Fields for Large Urban Scenes

Learning Neural Duplex Radiance Fields for Real-Time View Synthesis

EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points

Real-Time Neural Light Field on Mobile Devices

StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields

Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields

Pointersect: Neural Rendering With Cloud-Ray Intersection

Neural Fields Meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes

DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering

MAIR: Multi-View Attention Inverse Rendering With 3D Spatially-Varying Lighting Estimation

Weakly-Supervised Single-View Image Relighting

Controllable Light Diffusion for Portraits

RGBD2: Generative Scene Synthesis via Incremental View Inpainting Using RGBD Diffusion Models

Neural Lens Modeling

RealFusion: 360° Reconstruction of Any Object From a Single Image

Neuralangelo: High-Fidelity Neural Surface Reconstruction

PermutoSDF: Fast Multi-View Reconstruction With Implicit Surfaces Using Permutohedral Lattices

NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction

NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction From Multi-View Images

NeuralField-LDM: Scene Generation With Hierarchical Latent Diffusion Models

SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene

Painting 3D Nature in 2D: View Synthesis of Natural Scenes From a Single Semantic Mask

Quantitative Manipulation of Custom Attributes on 3D-Aware Image Synthesis

NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation

PREIM3D: 3D Consistent Precise Image Attribute Editing From a Single Image

Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly

DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion

Fine-Grained Face Swapping via Regional GAN Inversion

Logical Consistency and Greater Descriptive Power for Facial Hair Attribute Learning

Learning a 3D Morphable Face Reflectance Model From Low-Cost Data

StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer

FaceLit: Neural 3D Relightable Faces

FitMe: Deep Photorealistic 3D Morphable Model Avatars

NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and Animation

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

High-Fidelity Clothed Avatar Reconstruction From a Single Image

Music-Driven Group Choreography

Hand Avatar: Free-Pose Hand Animation and Rendering From Monocular Video

Biomechanics-Guided Facial Action Unit Detection Through Force Modeling

Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters

Invertible Neural Skinning

BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion

DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction

Complete 3D Human Reconstruction From a Single Incomplete Image

Learning Neural Volumetric Representations of Dynamic Humans in Minutes

Marching-Primitives: Shape Abstraction From Signed Distance Function

Learning Analytical Posterior Probability for Human Mesh Recovery

MagicPony: Learning Articulated 3D Animals in the Wild

Visual-Tactile Sensing for In-Hand Object Reconstruction

Command-Driven Articulated Object Understanding and Manipulation

Target-Referenced Reactive Grasping for Dynamic Objects

NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions

A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation From a Single RGB Image

TRACE: 5D Temporal Regression of Avatars With Dynamic Cameras in 3D Environments

BITE: Beyond Priors for Improved Three-D Dog Pose Estimation

PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation

Global-to-Local Modeling for Video-Based 3D Human Pose and Shape Estimation

TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers

GFIE: A Dataset and Baseline for Gaze-Following From 2D to 3D in Indoor Environments

Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation From Image Sequence

Rigidity-Aware Detection for 6D Object Pose Estimation

Crowd3D: Towards Hundreds of People Reconstruction From a Single Image

Object Pose Estimation With Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation

expOSE: Accurate Initialization-Free Projective Factorization Using Exponential Regularization

Neural Voting Field for Camera-Space 3D Hand Pose Estimation

Two-View Geometry Scoring Without Correspondences

Four-View Geometry With Unknown Radial Distortion

BKinD-3D: Self-Supervised 3D Keypoint Discovery From Multi-View Videos

BAAM: Monocular 3D Pose and Shape Reconstruction With Bi-Contextual Attention Module and Attention-Guided Modeling

Multi-Object Manipulation via Object-Centric Neural Scattering Functions

Neural Part Priors: Learning To Optimize Part-Based Object Completion in RGB-D Scans

Panoptic Lifting for 3D Scene Understanding With Neural Fields

Virtual Occlusions Through Implicit Depth

Multiview Compressive Coding for 3D Reconstruction

Behind the Scenes: Density Fields for Single View Reconstruction

VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion

Renderable Neural Radiance Map for Visual Navigation

Learning To Detect Mirrors From Videos via Dual Correspondences

Temporally Consistent Online Depth Estimation Using Point-Based Fusion

Zero-Shot Dual-Lens Super-Resolution

Fully Self-Supervised Depth Estimation From Defocus Clue

MVImgNet: A Large-Scale Dataset of Multi-View Images

Revisiting the Stack-Based Inverse Tone Mapping

Combining Implicit-Explicit View Correlation for Light Field Semantic Segmentation

3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud

Role of Transients in Two-Bounce Non-Line-of-Sight Imaging

3D Concept Learning and Reasoning From Multi-View Images

Viewpoint Equivariance for Multi-View 3D Object Detection

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

BEV@DC: Bird’s-Eye View Assisted Training for Depth Completion

Collaboration Helps Camera Overtake LiDAR in 3D Detection

Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection

Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration

Depth Estimation From Camera Image and mmWave Radar Point Cloud

SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization

ConQueR: Query Contrast Voxel-DETR for 3D Object Detection

DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization

Towards Unsupervised Object Detection From LiDAR Point Clouds

MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences

Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision

Instant Domain Augmentation for LiDAR Semantic Segmentation

Less Is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds

Novel Class Discovery for 3D Point Cloud Semantic Segmentation

GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds

Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning

Open-Set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework

ACL-SPC: Adaptive Closed-Loop System for Self-Supervised Point Cloud Completion

Fast Point Cloud Generation With Straight Flows

PointVector: A Vector Representation in Point Cloud Analysis

ProxyFormer: Proxy Alignment Assisted Point Cloud Completion With Missing Part Sensitive Transformer

FAC: 3D Representation Learning via Foreground Aware Feature Contrast

Rethinking the Approximation Error in 3D Surface Fitting for Point Cloud Normal Estimation

PointCert: Point Cloud Classification With Deterministic Certified Robustness Guarantees

Robust Multiview Point Cloud Registration With Reliable Pose Graph Initialization and History Reweighting

Visual Prompt Multi-Modal Tracking

Progressive Neighbor Consistency Mining for Correspondence Pruning

Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training

Unsupervised Visible-Infrared Person Re-Identification via Progressive Graph Matching and Alternate Learning

Domain Generalized Stereo Matching via Hierarchical Visual Transformation

Unsupervised Cumulative Domain Adaptation for Foggy Scene Optical Flow

PVO: Panoptic Visual Odometry

BAEFormer: Bi-Directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation

Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark

Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving

MixSim: A Hierarchical Framework for Mixed Reality Traffic Simulation

Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction

MotionDiffuser: Controllable Multi-Agent Motion Prediction Using Diffusion

Learning Human-to-Robot Handovers From Point Clouds

Phone2Proc: Bringing Robust Robots Into Our Chaotic World

GazeNeRF: 3D-Aware Gaze Redirection With Neural Radiance Fields

Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking

Autoregressive Visual Tracking

OpenGait: Revisiting Gait Recognition Towards Better Practicality

Pose-Disentangled Contrastive Learning for Self-Supervised Facial Representation

Identity-Preserving Talking Face Generation With Landmark and Appearance Priors

DF-Platter: Multi-Face Heterogeneous Deepfake Dataset

Physics-Driven Diffusion Models for Impact Sound Synthesis From Videos

Mofusion: A Framework for Denoising-Diffusion-Based Motion Synthesis

Adaptive Global Decay Process for Event Cameras

Frame-Event Alignment and Fusion Network for High Frame Rate Tracking

Exploring Discontinuity for Video Frame Interpolation

AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation

Frame Interpolation Transformer and Uncertainty Guidance

A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift

Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer

HyperCUT: Video Sequence From a Single Blurry Image Using Unsupervised Ordering

Indescribable Multi-Modal Spatial Evaluator

Structured Kernel Estimation for Photon-Limited Deconvolution

Polarized Color Image Denoising

Uncertainty-Aware Unsupervised Image Deblurring With Deep Residual Prior

Low-Light Image Enhancement via Structure Modeling and Guidance

Learning Sample Relationship for Exposure Correction

Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising

Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification

Generative Diffusion Prior for Unified Image Restoration and Enhancement

Ground-Truth Free Meta-Learning for Deep Compressive Sampling

Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation

An Image Quality Assessment Dataset for Portraits

Bitstream-Corrupted JPEG Images Are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration

Image Super-Resolution Using T-Tetromino Pixels

CUF: Continuous Upsampling Filters

OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution

Implicit Diffusion Models for Continuous Super-Resolution

Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection

VILA: Learning Image Aesthetics From User Comments With Vision-Language Pretraining

Image Cropping With Spatial-Aware Feature and Rank Consistency

B-Spline Texture Coefficients Estimator for Screen Content Image Super-Resolution

Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint

Learning Dynamic Style Kernels for Artistic Style Transfer

SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers

Learning Generative Structure Prior for Blind Text Image Super-Resolution

Unsupervised Domain Adaption With Pixel-Level Discriminator for Image-Aware Layout Generation

Scaling Up GANs for Text-to-Image Synthesis

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model With Knowledge-Enhanced Mixture-of-Denoising-Experts

Inversion-Based Style Transfer With Diffusion Models

Shifted Diffusion for Text-to-Image Generation

LayoutDM: Discrete Diffusion Model for Controllable Layout Generation

Unpaired Image-to-Image Translation With Shortest Path Regularization

DiffCollage: Parallel Generation of Large Content With Diffusion Models

Wavelet Diffusion Models Are Fast and Scalable Image Generators

VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Adaptive Human Matting for Dynamic Videos

LVQAC: Lattice Vector Quantization Coupled With Spatially Adaptive Companding for Efficient Learned Image Compression

Hierarchical B-Frame Video Coding Using Two-Layer CANF Without Motion Coding

Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting

HNeRV: A Hybrid Neural Representation for Videos

Regularize Implicit Neural Representation by Itself

SMPConv: Self-Moving Point Representations for Continuous Convolution

Long Range Pooling for 3D Large-Scale Scene Understanding

Progressive Random Convolutions for Single Domain Generalization

BiFormer: Vision Transformer With Bi-Level Routing Attention

Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

BioNet: A Biologically-Inspired Network for Face Recognition

Dual-Bridging With Adversarial Noise Generation for Domain Adaptive rPPG Estimation

On Data Scaling in Masked Image Modeling

Hard Patches Mining for Masked Image Modeling

Evolved Part Masking for Self-Supervised Learning

BASiS: Batch Aligned Spectral Embedding Space

OmniMAE: Single Model Masked Pretraining on Images and Videos

ViTs for SITS: Vision Transformers for Satellite Image Time Series

Probabilistic Debiasing of Scene Graphs

Blind Video Deflickering by Neural Filtering With a Flawed Atlas

SCOTCH and SODA: A Transformer Video Shadow Detection Framework

MAGVIT: Masked Generative Video Transformer

Improving Robustness of Semantic Segmentation to Motion-Blur Using Class-Centric Augmentation

MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation

Self-Supervised Video Forensics by Audio-Visual Anomaly Detection

Frame Flexible Network

System-Status-Aware Adaptive Network for Online Streaming Video Understanding

MDQE: Mining Discriminative Query Embeddings To Segment Occluded Instances on Challenging Videos

Spatio-Temporal Pixel-Level Contrastive Learning-Based Source-Free Domain Adaptation for Video Semantic Segmentation

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Chat2Map: Efficient Scene Mapping From Multi-Ego Conversations

Audio-Visual Grouping Network for Sound Localization From Mixtures

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Fine-Grained Audible Video Description

Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition

Learning Discriminative Representations for Skeleton Based Action Recognition

Therbligs in Action: Video Understanding Through Motion Primitives

Search-Map-Search: A Frame Selection Paradigm for Action Recognition

Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization

Boosting Weakly-Supervised Temporal Action Localization With Text Information

Perception and Semantic Aware Regularization for Sequential Confidence Calibration

NewsNet: A Novel Dataset for Hierarchical Temporal Segmentation

Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation

Leveraging Temporal Context in Low Representational Power Regimes

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Procedure-Aware Pretraining for Instructional Video Understanding

VindLU: A Recipe for Effective Video-and-Language Pretraining

Modular Memorability: Tiered Representations for Video Memorability Prediction

Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation

Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

Layout-Based Causal Inference for Object Navigation

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

A New Path: Scaling Vision-and-Language Navigation With Synthetic Instructions and Imitation Learning

A-Cap: Anticipation Captioning With Commonsense Knowledge

Are Deep Neural Networks SMARTer Than Second Graders?

Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning

Language Adaptive Weight Generation for Multi-Task Visual Grounding

From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models

Diversity-Aware Meta Visual Prompting

Hierarchical Prompt Learning for Multi-Task Learning

Task Residual for Tuning Vision-Language Models

CREPE: Can Vision-Language Foundation Models Reason Compositionally?

LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding

Overlooked Factors in Concept-Based Explanations: Dataset Choice, Concept Learnability, and Human Capability

Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space

GIVL: Improving Geographical Inclusivity of Vision-Language Models With Pre-Training Methods

Learning Bottleneck Concepts in Image Classification

SceneTrilogy: On Human Scene-Sketch and Its Complementarity With Photo and Text

Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

CLIPPO: Image-and-Language Understanding From Pixels Only

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

Non-Contrastive Learning Meets Language-Image Pre-Training

HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning

Learning Attribute and Class-Specific Representation Duet for Fine-Grained Fashion Analysis

Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce

Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning

Asymmetric Feature Fusion for Image Retrieval

Improving Zero-Shot Generalization and Robustness of Multi-Modal Models

Hint-Aug: Drawing Hints From Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning

Visual DNA: Representing and Comparing Images Using Distributions of Neuron Activations

End-to-End 3D Dense Captioning With Vote2Cap-DETR

Improving Table Structure Recognition With Visual-Alignment Sequential Coordinate Modeling

Region-Aware Pretraining for Open-Vocabulary Object Detection With Vision Transformers

Mobile User Interface Element Detection via Adaptively Prompt Tuning

Learning To Generate Text-Grounded Mask for Open-World Semantic Segmentation From Only Image-Text Pairs

ZegCLIP: Towards Adapting CLIP for Zero-Shot Semantic Segmentation

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection

Learning Conditional Attributes for Compositional Zero-Shot Learning

CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation

StructVPR: Distill Structural Knowledge With Weighting Samples for Visual Place Recognition

UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration

Primitive Generation and Semantic-Related Alignment for Universal Zero-Shot Segmentation

Inferring and Leveraging Parts From Object Shape for Improving Semantic Image Synthesis

Compositor: Bottom-Up Clustering and Compositing for Robust Part and Object Segmentation

A Strong Baseline for Generalized Few-Shot Semantic Segmentation

DynaMask: Dynamic Mask Selection for Instance Segmentation

Focus on Details: Online Multi-Object Tracking With Diverse Fine-Grained Representation

Dynamic Focus-Aware Positional Queries for Semantic Segmentation

Beyond mAP: Towards Better Evaluation of Instance Segmentation

Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation

Weakly Supervised Semantic Segmentation via Adversarial Learning of Classifier and Reconstructor

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

Augmentation Matters: A Simple-Yet-Effective Approach to Semi-Supervised Semantic Segmentation

The Devil Is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation

Class-Incremental Exemplar Compression for Class-Incremental Learning

Full or Weak Annotations? An Adaptive Strategy for Budget-Constrained Annotation Campaigns

Learning Common Rationale To Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding

Self-Supervised AutoFlow

DETR With Additional Global Aggregation for Cross-Domain Weakly Supervised Object Detection

Detecting Everything in the Open World: Towards Universal Object Detection

PROB: Probabilistic Objectness for Open World Object Detection

Annealing-Based Label-Transfer Learning for Open World Object Detection

Learning Transformation-Predictive Representations for Detection and Description of Local Features

Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection

2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection

Zero-Shot Generative Model Adaptation via Image-Specific Prompt Learning

AutoLabel: CLIP-Based Framework for Open-Set Video Domain Adaptation

Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation

Directional Connectivity-Based Segmentation of Medical Images

Ambiguous Medical Image Segmentation Using Diffusion Models

Sparse Multi-Modal Graph Transformer With Shared-Context Processing for Representation Learning of Giga-Pixel Images

METransformer: Radiology Report Generation by Transformer With Multiple Learnable Expert Tokens

Towards Trustable Skin Cancer Diagnosis via Rewriting Model’s Decision

Rethinking Out-of-Distribution (OOD) Detection: Masked Image Modeling Is All You Need

MetaViewer: Towards a Unified Multi-View Representation

Deep Incomplete Multi-View Clustering With Cross-View Partial Sample and Prototype Alignment

RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval

Mind the Label Shift of Augmentation-Based Graph OOD Generalization

Zero-Shot Model Diagnosis

ProtoCon: Pseudo-Label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-Supervised Learning

Fine-Grained Classification With Noisy Labels

Twin Contrastive Learning With Noisy Labels

RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases

Generative Bias for Robust Visual Question Answering

On-the-Fly Category Discovery

Co-Training 2L Submodels for Visual Recognition

Neural Dependencies Emerging From Learning Massive Categories

MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation

Towards Better Stability and Adaptability: Improve Online Self-Training for Model Adaptation in Semantic Segmentation

DARE-GRAM: Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices

Equiangular Basis Vectors

Enhanced Multimodal Representation Learning With Cross-Modal KD

Decompose, Adjust, Compose: Effective Normalization by Playing With Frequency for Domain Generalization

Back to the Source: Diffusion-Driven Adaptation To Test-Time Corruption

Deep Frequency Filtering for Domain Generalization

Generalizable Implicit Neural Representations via Instance Pattern Composers

Train-Once-for-All Personalization

Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners

Few-Shot Class-Incremental Learning via Class-Aware Bilateral Distillation

Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning

Dense Network Expansion for Class Incremental Learning

Class Attention Transfer Based Knowledge Distillation

Dealing With Cross-Task Class Discrimination in Online Continual Learning

Real-Time Evaluation in Online Continual Learning: A New Hope

DisWOT: Student Architecture Search for Distillation WithOut Training

CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning

EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization

Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning

PA&DA: Jointly Sampling Path and Data for Consistent NAS

Accelerating Dataset Distillation via Model Augmentation

Multi-Agent Automated Machine Learning

Transformer-Based Learned Optimization

Solving Relaxations of MAP-MRF Problems: Combinatorial In-Face Frank-Wolfe Directions

HOTNAS: Hierarchical Optimal Transport for Neural Architecture Search

Disentangled Representation Learning for Unsupervised Neural Quantization

FFCV: Accelerating Training by Removing Data Bottlenecks

Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks

FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits

Gradient-Based Uncertainty Attribution for Explainable Bayesian Deep Learning

How To Prevent the Continuous Damage of Noises To Model Training?

Genie: Show Me the Data for Quantization

OpenMix: Exploring Outlier Samples for Misclassification Detection

Data-Free Sketch-Based Image Retrieval

GLeaD: Improving GANs With a Generator-Leading Task

Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection

Adversarial Normalization: I Can Visualize Everything (ICE)

Semi-Supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination

Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning

Diversity-Measurable Anomaly Detection

Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-World

How To Prevent the Poor Performance Clients for Personalized Federated Learning?

DynaFed: Tackling Client Data Heterogeneity With Global Dynamics

Elastic Aggregation for Federated Optimization

Breaching FedMD: Image Recovery via Paired-Logits Inversion Attack

Learning To Measure the Point Cloud Reconstruction Loss in a Representation Space

Backdoor Cleansing With Unlabeled Data

Backdoor Defense via Deconfounded Representation Learning

Defending Against Patch-Based Backdoor Attacks on Self-Supervised Learning

Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger

CAP: Robust Point Cloud Classification via Semantic and Structural Modeling

Evading DeepFake Detectors via Adversarial Statistical Consistency

Enhancing the Self-Universality for Transferable Targeted Attacks

Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation

Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression

Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts

Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks

Physically Adversarial Infrared Patches With Learnable Shapes and Locations

MaLP: Manipulation Localization Using a Proactive Scheme

(ends 12:00 PM)

12:30 p.m.

Break:

Lunch

(ends 2:00 PM)

2 p.m.

Panel:

Vision, Language, and Creativity

(ends 3:00 PM)

3 p.m.

Meeting:

PAMI TC Meeting

(ends 4:00 PM)

4 p.m.

Break:

Break

(ends 4:30 PM)

4:30 p.m.

Poster Session WED-PM [4:30-6:00]

Posters 4:30-6:00

Polarimetric iToF: Measuring High-Fidelity Depth Through Scattering Media

NeRFLix: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-Viewpoint MiXer

SUDS: Scalable Urban Dynamic Scenes

DP-NeRF: Deblurred Neural Radiance Field With Physical Scene Priors

DyLiN: Making Light Field Networks Dynamic

Multi-Space Neural Radiance Fields

NeRFLight: Fast and Light Neural Radiance Fields Using a Shared Feature Grid

Cross-Guided Optimization of Radiance Fields With Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis

NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds

DINER: Depth-Aware Image-Based NEural Radiance Fields

Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer

Efficient Map Sparsification Based on 2D and 3D Discretized Grids

K-Planes: Explicit Radiance Fields in Space, Time, and Appearance

I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs

Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes

Inverse Rendering of Translucent Objects Using Physical and Neural Renderers

Accidental Light Probes

Humans As Light Bulbs: 3D Human Reconstruction From Thermal Reflection

HumanGen: Generating Human Radiance Fields With Explicit Priors

Seeing Through the Glass: Neural 3D Reconstruction of Object Inside a Transparent Container

3D Shape Reconstruction of Semi-Transparent Worms

Dionysus: Recovering Scene Structures by Dividing Into Semantic Pieces

SparseFusion: Distilling View-Conditioned Diffusion for 3D Reconstruction

PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces

RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation

Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation

Infinite Photorealistic Worlds Using Procedural Generation

Diffusion-SDF: Text-To-Shape via Voxelized Diffusion

3D-Aware Multi-Class Image-to-Image Translation With NeRFs

Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures

Local 3D Editing via 3D Distillation of CLIP Knowledge

ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations

CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing

3D-Aware Face Swapping

DCFace: Synthetic Face Generation With Dual Condition Diffusion Model

HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling

DiffusionRig: Learning Personalized Priors for Facial Appearance Editing

3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data

Parametric Implicit Face Representation for Audio-Driven Facial Reenactment

MEGANE: Morphable Eyeglass and Avatar Network

CodeTalker: Speech-Driven 3D Facial Animation With Discrete Motion Prior

Reconstructing Signing Avatars From Video Using Linguistic Priors

HARP: Personalized Hand Reconstruction From a Monocular RGB Video

OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis

RaBit: Parametric Modeling of 3D Biped Cartoon Characters With a Topological-Consistent Dataset

Transfer4D: A Framework for Frugal Motion Capture and Deformation Transfer

CLOTH4D: A Dataset for Clothed Human Reconstruction

Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition

High-Fidelity 3D Human Digitization From Single 2K Resolution Images

Sampling Is Matter: Point-Guided 3D Human Mesh Reconstruction

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

Human Body Shape Completion With Implicit Shape and Flow Learning

ShapeClipper: Scalable 3D Shape Learning From Single-View Images via Geometric and CLIP-Based Consistency

PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction

NIKI: Neural Inverse Kinematics With Invertible Neural Networks for 3D Human Pose and Shape Estimation

ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation

ACR: Attention Collaboration-Based Regressor for Arbitrary Two-Hand Reconstruction

MIME: Human-Aware 3D Scene Generation

CIMI4D: A Large Multimodal Climbing Motion Dataset Under Human-Scene Interactions

Harmonious Feature Learning for Interactive Hand-Object Pose Estimation

AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation

A Characteristic Function-Based Method for Bottom-Up Human Pose Estimation

Unified Pose Sequence Modeling

Scene-Aware Egocentric 3D Human Pose Estimation

DiffPose: Toward More Reliable 3D Pose Estimation

MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding

Learning 3D-Aware Image Synthesis With Unknown Pose Distribution

Pose Synchronization Under Multiple Pair-Wise Relative Poses

ObjectMatch: Robust Registration Using Canonical Object Correspondences

Learning Articulated Shape With Keypoint Pseudo-Labels From Web Images

Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares

Efficient Second-Order Plane Adjustment

Learning a Depth Covariance Function

Privacy-Preserving Representations Are Not Enough: Recovering Scene Content From Camera Poses

Objaverse: A Universe of Annotated 3D Objects

Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild

HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes With Iterative Intertwined Regularization

Visual Localization Using Imperfect 3D Models From the Internet

PRISE: Demystifying Deep Lucas-Kanade With Strongly Star-Convex Constraints for Multimodel Image Alignment

Scalable, Detailed and Mask-Free Universal Photometric Stereo

Enhanced Stable View Synthesis

End-to-End Vectorized HD-Map Construction With Piecewise Bezier Curve

DynamicStereo: Consistent Dynamic Depth From Stereo Videos

Shakes on a Plane: Unsupervised Depth Estimation From Unstabilized Photography

Gated Stereo: Joint Depth Estimation From Gated and Wide-Baseline Active Stereo Cues

K3DN: Disparity-Aware Kernel Estimation for Dual-Pixel Defocus Deblurring

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions

OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer

Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM

Few-Shot Non-Line-of-Sight Imaging With Signal-Surface Collaborative Regularization

NLOST: Non-Line-of-Sight Imaging With Transformer

Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals

Towards Domain Generalization for Multi-View 3D Object Detection in Bird-Eye-View

X3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection

Learned Two-Plane Perspective Prior Based Image Resampling for Efficient Object Detection

Resource-Efficient RGBD Aerial Tracking

Toward RAW Object Detection: A New Benchmark and a New Model

Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection

LiDAR-in-the-Loop Hyperparameter Optimization

Learning and Aggregating Lane Graphs for Urban Automated Driving

Center Focusing Network for Real-Time LiDAR Panoptic Segmentation

Adaptive Sparse Convolutional Networks With Global Context Enhancement for Faster Object Detection on Drone Images

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

ALSO: Automotive Lidar Self-Supervision by Occupancy Estimation

Unsupervised Intrinsic Image Decomposition With LiDAR Intensity

PVT-SSD: Single-Stage 3D Object Detector With Point-Voxel Transformer

LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs

WeatherStream: Light Transport Automation of Single Image Deweathering

Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors

DSVT: Dynamic Sparse Voxel Transformer With Rotated Sets

IterativePFN: True Iterative Point Cloud Filtering

itKD: Interchange Transfer-Based Knowledge Distillation for 3D Object Detection

ISBNet: A 3D Point Cloud Instance Segmentation Network With Instance-Aware Sampling and Box-Aware Dynamic Convolution

Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion

GeoMAE: Masked Geometric Target Prediction for Self-Supervised Point Cloud Pre-Training

AnchorFormer: Point Cloud Completion From Discriminative Nodes

SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds

NerVE: Neural Volumetric Edges for Parametric Curve Extraction From Point Cloud

Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration

Local Connectivity-Based Density Estimation for Face Clustering

Bridging Search Region Interaction With Template for RGB-T Tracking

Quantum Multi-Model Fitting

Generalizable Local Feature Pre-Training for Deformable Shape Analysis

Similarity Metric Learning for RGB-Infrared Group Re-Identification

Unsupervised Deep Asymmetric Stereo Matching With Spatially-Adaptive Self-Similarity

Sliced Optimal Partial Transport

DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling

Bayesian Posterior Approximation With Stochastic Ensembles

V2V4Real: A Real-World Large-Scale Dataset for Vehicle-to-Vehicle Cooperative Perception

ReasonNet: End-to-End Driving With Temporal and Global Reasoning

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction

FJMP: Factorized Joint Multi-Agent Motion Prediction Over Learned Directed Acyclic Interaction Graphs

Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion

Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-per-Second

Affordances From Human Videos as a Versatile Representation for Robotics

Indiscernible Object Counting in Underwater Scenes

Tracking Through Containers and Occluders in the Wild

Simple Cues Lead to a Strong Multi-Object Tracker

An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions

SelfME: Self-Supervised Motion Learning for Micro-Expression Recognition

LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook

Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video

Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry

MoDi: Unconditional Motion Synthesis From Diverse Data

Recurrent Vision Transformers for Object Detection With Event Cameras

Continuous Intermediate Token Learning With Implicit Motion Manifold for Keyframe Based Motion Interpolation

EvShutter: Transforming Events for Unconstrained Rolling Shutter Correction

Multi Domain Learning for Motion Magnification

Learning Event Guided High Dynamic Range Video Reconstruction

Joint Video Multi-Frame Interpolation and Deblurring Under Unknown Exposure Time

FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER

MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding From Object Detection

Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset

Visibility Constrained Wide-Band Illumination Spectrum Design for Seeing-in-the-Dark

Self-Supervised Blind Motion Deblurring With Deep Expectation Maximization

Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising

Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments

Zero-Shot Noise2Noise: Efficient Image Denoising Without Any Data

Real-Time Controllable Denoising for Image and Video

Probability-Based Global Cross-Modal Upsampling for Pansharpening

ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal

Visual Recognition-Driven Image Restoration for Multiple Degradation With Intrinsic Semantics Recovery

Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective

Human Guided Ground-Truth Generation for Realistic Image Super-Resolution

Real-Time 6K Image Rescaling With Rate-Distortion Optimization

Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution

Ultrahigh Resolution Image/Video Matting With Spatio-Temporal Sparsity

Comprehensive and Delicate: An Efficient Transformer for Image Restoration

PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification

PyramidFlow: High-Resolution Defect Contrastive Localization Using Pyramid Normalizing Flow

Neural Fourier Filter Bank

Restoration of Hand-Drawn Architectural Drawings Using Latent Space Mapping With Degradation Generator

Neural Preset for Color Style Transfer

NÜWA-LIP: Language-Guided Image Inpainting With Defect-Free VQGAN

DualVector: Unsupervised Vector Font Synthesis With Dual-Part Representation

DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

Fix the Noise: Disentangling Source Feature for Controllable Domain Translation

Conditional Text Image Generation With Diffusion Models

ReCo: Region-Controlled Text-to-Image Generation

Freestyle Layout-to-Image Synthesis

Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style

Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

Towards Flexible Multi-Modal Document Models

On Distillation of Guided Diffusion Models

Dimensionality-Varying Diffusion Process

Shape-Aware Text-Driven Layered Video Editing

Rethinking Image Super Resolution From Long-Tailed Distribution Learning Perspective

End-to-End Video Matting With Trimap Propagation

Context-Based Trit-Plane Coding for Progressive Image Compression

Complexity-Guided Slimmable Decoder for Efficient Deep Video Compression

Efficient Hierarchical Entropy Model for Learned Point Cloud Compression

NIRVANA: Neural Implicit Representations of Videos With Adaptive Networks and Autoregressive Patch-Wise Modeling

Learned Image Compression With Mixed Transformer-CNN Architectures

Memory-Friendly Scalable Super-Resolution via Rewinding Lottery Ticket Hypothesis

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

EfficientViT: Memory Efficient Vision Transformer With Cascaded Group Attention

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity

Non-Contrastive Unsupervised Learning of Physiological Signals From Video

Revealing the Dark Secrets of Masked Image Modeling

Improving Visual Representation Learning Through Perceptual Understanding

FlexiViT: One Model for All Patch Sizes

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders

SimpSON: Simplifying Photo Cleanup With Single-Click Distracting Object Segmentation Network

Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention

Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections

VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking

DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks

SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

Bootstrapping Objectness From Videos by Relaxed Common Fate and Visual Grouping

Video Event Restoration Based on Keyframes for Video Anomaly Detection

Streaming Video Model

LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection

A Generalized Framework for Video Instance Segmentation

Referring Multi-Object Tracking

Source-Free Video Domain Adaptation With Spatial-Temporal-Historical Consistency Learning

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Egocentric Auditory Attention Localization in Conversations

iQuery: Instruments As Queries for Audio-Visual Sound Separation

Learning To Dub Movies via Hierarchical Prosody Models

A Large-Scale Robustness Analysis of Video Action Recognition Models

The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction

STMixer: A One-Stage Sparse Action Detector

Generating Human Motion From Textual Descriptions With Discrete Representations

Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization

Distilling Vision-Language Pre-Training To Collaborate With Weakly-Supervised Temporal Action Localization

Simultaneously Short- and Long-Term Temporal Modeling for Semi-Supervised Video Semantic Segmentation

MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering

Language-Guided Music Recommendation for Video via Prompt Analogies

Text-Visual Prompting for Efficient 2D Temporal Video Grounding

CelebV-Text: A Large-Scale Facial Text-Video Dataset

CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset

Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations

PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval

Clover: Towards a Unified Video-Language Alignment and Fusion Model

Align and Attend: Multimodal Summarization With Dual Contrastive Losses

Learning Situation Hyper-Graphs for Video Question Answering

Natural Language-Assisted Sign Language Recognition

SkyEye: Self-Supervised Bird’s-Eye-View Semantic Mapping Using Monocular Frontal View Images

Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation

Iterative Vision-and-Language Navigation

EXCALIBUR: Encouraging and Evaluating Embodied Exploration

Multimodal Prompting With Missing Modalities for Visual Recognition

Visual Programming: Compositional Visual Reasoning Without Training

Super-CLEVR: A Virtual Benchmark To Diagnose Domain Robustness in Visual Reasoning

Prompting Large Language Models With Answer Heuristics for Knowledge-Based Visual Question Answering

À-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting

ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Learning To Exploit Temporal Structure for Biomedical Vision–Language Processing

FashionSAP: Symbols and Attributes Prompt for Fine-Grained Fashion Vision-Language Pre-Training

Advancing Visual Grounding With Scene Knowledge: Benchmark and Method

Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks

OCTET: Object-Aware Counterfactual Explanations

Local-Guided Global: Paired Similarity Representation for Visual Reinforcement Learning

What Can Human Sketches Do for Object Detection?

Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens

Correlational Image Modeling for Self-Supervised Visual Pre-Training

Generalized Decoding for Pixel, Image, and Language

Towards Modality-Agnostic Person Re-Identification With Descriptive Query

M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis

Learning Customized Visual Models With Retrieval-Augmented Knowledge

Learning Semantic Relationship Among Instances for Image-Text Matching

I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

ImageBind: One Embedding Space To Bind Them All

Model-Agnostic Gender Debiased Image Captioning

Boundary-Aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners

Towards Unified Scene Text Spotting Based on Sequence Generation

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data

Aligning Bag of Regions for Open-Vocabulary Object Detection

Visual Recognition by Request

Category Query Learning for Human-Object Interaction Classification

Self-Supervised Implicit Glyph Attention for Text Recognition

Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition

CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

Learning Attention As Disentangler for Compositional Zero-Shot Learning

Universal Instance Perception As Object Discovery and Retrieval

Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning

DPF: Learning Dense Prediction Fields With Weak Supervision

Modeling Entities As Semantic Points for Visual Information Extraction in the Wild

GeoNet: Benchmarking Unsupervised Adaptation Across Geographies

SegLoc: Learning Segmentation-Based Representations for Privacy-Preserving Visual Localization

Towards Open-World Segmentation of Parts

Pruning Parameterization With Bi-Level Optimization for Efficient Semantic Segmentation on the Edge

HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation

Exemplar-FreeSOLO: Enhancing Unsupervised Instance Segmentation With Exemplars

Weakly-Supervised Domain Adaptive Semantic Segmentation With Prototypical Contrastive Learning

Spatial-Temporal Concept Based Explanation of 3D ConvNets

Sparsely Annotated Semantic Segmentation With Adaptive Gaussian Mixtures

Fuzzy Positive Learning for Semi-Supervised Semantic Segmentation

STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection

Boosting Low-Data Instance Segmentation by Unsupervised Pre-Training With Saliency Prompt

Decoupled Semantic Prototypes Enable Learning From Diverse Annotation Types for Semi-Weakly Segmentation in Expert-Driven Domains

The Treasure Beneath Multiple Annotations: An Uncertainty-Aware Edge Detector

Knowledge Combination To Learn Rotated Detection Without Rotated Annotation

Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection With Single Point Supervision

SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency

Zero-Shot Object Counting

SOOD: Towards Semi-Supervised Oriented Object Detection

Large-Scale Training Data Search for Object Re-Identification

Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection

Towards Effective Visual Representations for Partial-Label Learning

Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection

Boosting Detection in Crowd Analysis via Underutilized Output Features

Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture

Weakly Supervised Segmentation With Point Annotations for Histopathology Images via Contrast-Based Variational Model

DoNet: Deep De-Overlapping Network for Cytology Instance Segmentation

MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation

Histopathology Whole Slide Image Analysis With Heterogeneous Graph Representation Learning

PEFAT: Boosting Semi-Supervised Medical Image Classification via Pseudo-Loss Estimation and Feature Adversarial Training

Causally-Aware Intraoperative Imputation for Overall Survival Time Prediction

Balanced Energy Regularization Loss for Out-of-Distribution Detection

Block Selection Method for Using Feature Norm in Out-of-Distribution Detection

Highly Confident Local Structure Based Consensus Graph Learning for Incomplete Multi-View Clustering

Siamese DETR

Towards Bridging the Performance Gaps of Joint Energy-Based Models

Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning

Boosting Transductive Few-Shot Fine-Tuning With Margin-Based Uncertainty Weighting and Probability Regularization

CHMATCH: Contrastive Hierarchical Matching and Robust Adaptive Threshold Boosted Semi-Supervised Learning

MarginMatch: Improving Semi-Supervised Learning With Pseudo-Margins

Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate

Learning Imbalanced Data With Vision Transformers

No One Left Behind: Improving the Worst Categories in Long-Tailed Learning

Global and Local Mixture Consistency Cumulative Learning for Long-Tailed Visual Recognitions

Curvature-Balanced Feature Manifold Learning for Long-Tailed Classification

DAA: A Delta Age AdaIN Operation for Age Estimation via Binary Code Transformer

DLBD: A Self-Supervised Direct-Learned Binary Descriptor

Progressive Open Space Expansion for Open-Set Model Attribution

DiGA: Distil To Generalize and Then Adapt for Domain Adaptive Semantic Segmentation

Multi-Modal Learning With Missing Modality via Shared-Specific Feature Modelling

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information

Bi-Level Meta-Learning for Few-Shot Domain Generalization

Train/Test-Time Adaptation With Retrieval

Robust Test-Time Adaptation in Dynamic Scenarios

Domain Expansion of Image Generators

Switchable Representation Learning Framework With Self-Compatibility

A New Benchmark: On the Utility of Synthetic Data With Blender for Bare Supervised Learning and Downstream Domain Adaptation

Adapting Shortcut With Normalizing Flow: An Efficient Tuning Framework for Visual Recognition

Manipulating Transfer Learning for Property Inference

Heterogeneous Continual Learning

Generic-to-Specific Distillation of Masked Autoencoders

Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval

CafeBoost: Causal Feature Boost To Eliminate Task-Induced Bias for Class Incremental Learning

Bilateral Memory Consolidation for Continual Learning

NICO++: Towards Better Benchmarking for Domain Generalization

DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks

Differentiable Architecture Search With Random Features

Class Adaptive Network Calibration

Meta-Learning With a Geometry-Adaptive Preconditioner

DepGraph: Towards Any Structural Pruning

Stitchable Neural Networks

Integral Neural Networks

Regularization of Polynomial Networks for Image Recognition

ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders

Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations

Don’t Lie to Me! Robust and Efficient Explainability With Verified Perturbation Analysis

OT-Filter: An Optimal Transport Filter for Learning With Noisy Labels

Robust Generalization Against Photon-Limited Corruptions via Worst-Case Sharpness Minimization

Learning With Noisy Labels via Self-Supervised Adversarial Noisy Masking

Bit-Shrinking: Limiting Instantaneous Sharpness for Improving Post-Training Quantization

Enhancing Multiple Reliability Measures via Nuisance-Extended Information Bottleneck

AdaptiveMix: Improving GAN Training via Feature Space Shrinkage

Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration

Soft Augmentation for Image Classification

Boosting Verified Training for Robust Image Classifications via Abstraction

A New Dataset Based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories

Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection

Prototypical Residual Networks for Anomaly Detection and Localization

Class Balanced Adaptive Pseudo Labeling for Federated Semi-Supervised Learning

Fair Federated Medical Image Segmentation via Client Contribution Estimation

Rethinking Federated Learning With Domain Shift: A Prototype View

FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning

Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations

STDLens: Model Hijacking-Resilient Federated Learning for Object Detection

Detecting Backdoors in Pre-Trained Encoders

Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency

Can’t Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders

Re-Thinking Model Inversion Attacks Against Deep Neural Networks

Turning Strengths Into Weaknesses: A Certified Robustness Inspired Attack Framework Against Graph Neural Networks

Dynamic Generative Targeted Attacks With Pattern Injection

Transferable Adversarial Attacks on Vision Transformers With Token Gradient Regularization

Adversarial Counterfactual Visual Explanations

TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization

Randomized Adversarial Training via Taylor Expansion

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

Evading Forensic Classifiers With Attribute-Conditioned Adversarial Faces

DartBlur: Privacy Preservation With Detection Artifact Suppression

(ends 6:00 PM)

7 p.m.

Reception:

Reception & Musical Performances

(ends 9:00 PM)

THU 22 JUN

7:30 a.m.

Breakfast:

Breakfast

(ends 9:00 AM)

9 a.m.

Keynote:

Modeling Atoms to Address Our Climate Crisis

Larry Zitnick

(ends 10:00 AM)

10 a.m.

Break:

Break

(ends 10:30 AM)

10:30 a.m.

Poster Session THU-AM [10:30-12:00]

Posters 10:30-12:00

Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection

JacobiNeRF: NeRF Shaping With Mutual Information Gradients

ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-Real Novel View Synthesis via Contrastive Learning

SCADE: NeRFs from Space Carving With Ambiguity-Aware Depth Estimates

Removing Objects From Neural Radiance Fields

Progressively Optimized Local Radiance Fields for Robust View Synthesis

NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds

ABLE-NeRF: Attention-Based Rendering With Learnable Embeddings for Neural Radiance Field

MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures

pCON: Polarimetric Coordinate Networks for Neural Scene Representations

Balanced Spherical Grid for Egocentric View Synthesis

Complementary Intrinsics From Neural Radiance Fields and CNNs for Outdoor Scene Relighting

HyperReel: High-Fidelity 6-DoF Video With Ray-Conditioned Sampling

UV Volumes for Real-Time Rendering of Editable Free-View Human Performance

Tensor4D: Efficient Neural 4D Decomposition for High-Fidelity Dynamic Reconstruction and Rendering

PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing

Computational Flash Photography Through Intrinsics

RelightableHands: Efficient Neural Relighting of Articulated Hand Models

TMO: Textured Mesh Acquisition of Objects With a Mobile Device by Using Differentiable Rendering

VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction

Multi-View Reconstruction Using Signed Ray Distance Functions (SRDF)

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction

Octree Guided Unoriented Surface Reconstruction

Neural Vector Fields: Implicit Representation by Explicit Learning

DA Wand: Distortion-Aware Selection Using Neural Mesh Parameterization

Diffusion-Based Generation, Optimization, and Planning in 3D Scenes

Patch-Based 3D Natural Scene Generation From a Single Example

Consistent View Synthesis With Pose-Guided Diffusion Models

Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process

High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition

TAPS3D: Text-Guided 3D Textured Shape Generation From Pseudo Supervision

SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations

Interactive Cartoonization With Controllable Perceptual Factors

High-Res Facial Appearance Capture From Polarized Smartphone Images

GlassesGAN: Eyewear Personalization Using Synthetic Appearance Discovery and Targeted Subspace Modeling

Continuous Landmark Detection With 3D Queries

NeuFace: Realistic 3D Neural Face Rendering From Multi-View Images

AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction

Learning Personalized High Quality Volumetric Head Avatars From Monocular RGB Videos

OTAvatar: One-Shot Talking Face Avatar With Controllable Tri-Plane Rendering

X-Avatar: Expressive Human Avatars

InstantAvatar: Learning Avatars From Monocular Video in 60 Seconds

JAWS: Just a Wild Shot for Cinematic Transfer in Neural Radiance Fields

MonoHuman: Animatable Human Neural Field From Monocular Video

Structured 3D Features for Reconstructing Controllable Avatars

HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics

Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling

Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing

Reconstructing Animatable Categories From Videos

Deformable Mesh Transformer for 3D Human Mesh Recovery

Hi4D: 4D Instance Segmentation of Close Human Interaction

Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild

Learning Human Mesh Recovery in 3D Scenes

H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction

What You Can Reconstruct From a Shadow

Autonomous Manipulation Learning for Similar Deformable Objects via Only One Demonstration

In-Hand 3D Object Scanning From an RGB Sequence

Putting People in Their Place: Affordance-Aware Human Insertion Into Scenes

Detecting Human-Object Contact in Images

What Happened 3 Seconds Ago? Inferring the Past With Thermal Imaging

Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting

Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

Ego-Body Pose Estimation via Ego-Head Pose Estimation

ViPLO: Vision Transformer Based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection

HS-Pose: Hybrid Scope Feature Extraction for Category-Level Object Pose Estimation

ScarceNet: Animal Pose Estimation With Scarce Annotations

Cross-Domain 3D Hand Pose Estimation With Dual Modalities

Linking Garment With Person via Semantically Associated Landmarks for Virtual Try-On

Level-S$^2$fM: Structure From Motion on Neural Level Set of Implicit Surfaces

Revisiting Rotation Averaging: Uncertainties and Robust Losses

SliceMatch: Geometry-Guided Aggregation for Cross-View Pose Estimation

Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation

Long-Term Visual Localization With Mobile Sensors

Learning To Predict Scene-Level Implicit 3D From Posed RGBD Data

Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization

The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects

Learning Accurate 3D Shape Based on Stereo Polarimetric Imaging

RUST: Latent Neural Scene Representations From Unposed Imagery

Perspective Fields for Single Image Camera Calibration

VisFusion: Visibility-Aware Online 3D Scene Reconstruction From Videos

DeepLSD: Line Segment Detection and Refinement With Deep Image Gradients

Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation With Cross-Scale Distortion Awareness

Single Image Depth Prediction Made Better: A Multivariate Gaussian Take

Wide-Angle Rectification via Content-Aware Conformal Mapping

All-in-Focus Imaging From Event Focal Stack

Multi-View Stereo Representation Revisit: Region-Aware MVSNet

Semantic Ray: Learning a Generalizable Semantic Field With Cross-Reprojection Attention

OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images

ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields

Non-Line-of-Sight Imaging With Signal Superresolution Network

Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence

Learning Transformations To Reduce the Geometric Shift in Object Detection

Anchor3DLane: Learning To Regress 3D Anchors for Monocular 3D Lane Detection

BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks

Semi-Supervised Stereo-Based 3D Object Detection via Cross-View Consensus

Weakly Supervised Monocular 3D Object Detection Using Multi-View Projection and Direction Consistency

MonoATT: Online Monocular 3D Object Detection With Adaptive Token Transformer

Azimuth Super-Resolution for FMCW Radar in Autonomous Driving

Pix2map: Cross-Modal Retrieval for Inferring Street Maps From Images

LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion

Neural Map Prior for Autonomous Driving

Spherical Transformer for LiDAR-Based 3D Recognition

Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection

PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds

PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation

Single Domain Generalization for LiDAR Semantic Segmentation

Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving

MethaneMapper: Spectral Absorption Aware Hyperspectral Transformer for Methane Detection

GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds

SCoDA: Domain Adaptive Shape Completion for Real Scans

SCPNet: Semantic Scene Completion on Point Cloud

ViewNet: A Novel Projection-Based Backbone With View Pooling for Few-Shot Point Cloud Classification

Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning

Learnable Skeleton-Aware 3D Point Cloud Sampling

Meta Architecture for Point Cloud Analysis

PointListNet: Deep Learning on 3D Point Lists

PEAL: Prior-Embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration

Unsupervised Inference of Signed Distance Functions From Single Sparse Point Clouds Without Learning Priors

Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment

Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching

3D Registration With Maximal Cliques

PanoSwin: A Pano-Style Swin Transformer for Panorama Understanding

DKM: Dense Kernelized Feature Matching for Geometry Estimation

PATS: Patch Area Transportation With Subdivision for Local Feature Matching

Correspondence Transformers With Asymmetric Feature Learning and Matching Flow Super-Resolution

Learning Adaptive Dense Event Stereo From the Image Domain

On the Convergence of IRLS and Its Variants in Outlier-Robust Estimation

You Only Segment Once: Towards Real-Time Panoptic Segmentation

BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision

UniHCP: A Unified Model for Human-Centric Perceptions

Planning-Oriented Autonomous Driving

Query-Centric Trajectory Prediction

Unsupervised Sampling Promoting for Stochastic Human Trajectory Prediction

AdamsFormer for Spatial Action Localization in the Future

PIRLNav: Pretraining With Imitation and RL Finetuning for ObjectNav

NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis

Camouflaged Instance Segmentation via Explicit De-Camouflaging

Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking

MotionTrack: Learning Robust Short-Term and Long-Term Motions for Multi-Object Tracking

Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion

Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition

One-Shot High-Fidelity Talking-Head Synthesis With Deformable Neural Radiance Field

Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Event-Guided Person Re-Identification via Sparse-Dense Complementary Learning

Executing Your Commands via Motion Diffusion in Latent Space

MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition

“Seeing” Electric Network Frequency From Events

Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields

Event-Based Frame Interpolation With Ad-Hoc Deblurring

Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior

TransFlow: Transformer As Flow Learner

MP-Former: Mask-Piloted Transformer for Image Segmentation

GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency

Neural Texture Synthesis With Guided Correspondence

Self-Supervised Non-Uniform Kernel Estimation With Flow-Based Motion Prior for Blind Image Deblurring

Decoupling-and-Aggregating for Image Exposure Correction

You Do Not Need Additional Priors or Regularizers in Retinex-Based Low-Light Image Enhancement

DNF: Decouple and Feedback Network for Seeing in the Dark

Contrastive Semi-Supervised Learning for Underwater Image Restoration via Reliable Bank

LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising

Spectral Bayesian Uncertainty for Image Super-Resolution

Deep Random Projector: Accelerated Deep Image Prior

Context-Aware Pretraining for Efficient Blind Image Decomposition

Metadata-Based RAW Reconstruction via Implicit Neural Functions

Raw Image Reconstruction With Learned Compact Metadata

AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration

AutoFocusFormer: Image Segmentation off the Grid

Guided Depth Super-Resolution by Deep Anisotropic Diffusion

Super-Resolution Neural Operator

Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution

GamutMLP: A Lightweight MLP for Color Loss Recovery

Efficient and Explicit Modelling of Image Hierarchies for Image Restoration

LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization

CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer

ObjectStitch: Object Compositing With Diffusion Model

DeepVecFont-v2: Exploiting Transformers To Synthesize Vector Fonts With Higher Quality

Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer

CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language

LayoutDM: Transformer-Based Diffusion Model for Layout Generation

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

SpaText: Spatio-Textual Representation for Controllable Image Generation

Paint by Example: Exemplar-Based Image Editing With Diffusion Models

InstructPix2Pix: Learning To Follow Image Editing Instructions

LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction

Self-Guided Diffusion Models

HOLODIFFUSION: Training a 3D Diffusion Model Using 2D Images

Class-Balancing Diffusion Models

Conditional Image-to-Video Generation With Latent Flow Diffusion Models

Video Probabilistic Diffusion Models in Projected Latent Space

Regularized Vector Quantization for Tokenized Image Synthesis

EfficientSCI: Densely Connected Network With Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging

MMVC: Learned Multi-Mode Video Compression With Block-Based Prediction Mode Selection and Density-Adaptive Entropy Coding

Video Compression With Entropy-Constrained Neural Representations

WIRE: Wavelet Implicit Neural Representations

TINC: Tree-Structured Implicit Neural Compression

CompletionFormer: Depth Completion With Convolutions and Vision Transformers

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Global Vision Transformer Pruning With Hessian-Aware Saliency

Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR

PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers

Visual Atoms: Pre-Training Vision Transformers With Sinusoidal Waves

Neuron Structure Modeling for Generalizable Remote Physiological Measurement

Explaining Image Classifiers With Multiscale Directional Image Representation

Integrally Pre-Trained Transformer Pyramid Networks

PartMix: Regularization Strategy To Learn Part Discovery for Visible-Infrared Person Re-Identification

Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions

Focused and Collaborative Feedback Integration for Interactive Image Segmentation

PolyFormer: Referring Image Segmentation As Sequential Polygon Generation

Devil’s on the Edges: Selective Quad Attention for Scene Graph Generation

Panoptic Video Scene Graph Generation

Generalized Relation Modeling for Transformer Tracking

Representation Learning for Visual Object Tracking by Masked Appearance Transfer

Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation

EVAL: Explainable Video Anomaly Localization

MOSO: Decomposing MOtion, Scene and Object for Video Prediction

TarViS: A Unified Approach for Target-Based Video Segmentation

Efficient Movie Scene Detection Using State-Space Transformers

Latency Matters: Real-Time Action Forecasting Transformer

Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning

Watch or Listen: Robust Audio-Visual Speech Recognition With Visual Corruption Modeling and Reliability Scoring

ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

SVFormer: Semi-Supervised Video Transformer for Action Recognition

Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception

Post-Processing Temporal Action Detection

HaLP: Hallucinating Latent Positives for Skeleton-Based Self-Supervised Learning of Actions

TriDet: Temporal Action Detection With Relative Boundary Modeling

Hybrid Active Learning via Deep Clustering for Video Action Detection

Two-Stream Networks for Weakly-Supervised Temporal Action Localization With Semantic-Aware Mechanisms

Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network

Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies

Weakly Supervised Temporal Sentence Grounding With Uncertainty-Guided Self-Training

SViTT: Temporal Learning of Sparse Video-Text Transformers

AutoAD: Movie Description in Context

Text With Knowledge Graph Augmented Transformer for Video Captioning

StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos

Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval

Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding

CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval

Learning Emotion Representations From Verbal and Nonverbal Communication

Context De-Confounded Emotion Recognition

CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning

Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering

LEGO-Net: Learning Regular Rearrangements of Objects in Rooms

LANA: A Language-Capable Navigator for Instruction Following and Generation

Policy Adaptation From Foundation Model Feedback

Token Turing Machines

Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge

Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

VQACL: A Novel Visual Question Answering Continual Learning Setting

MaPLe: Multi-Modal Prompt Learning

Meta-Personalizing Vision-Language Models To Find Named Instances in Video

Understanding and Improving Visual Prompting: A Label-Mapping Perspective

RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension

Leveraging per Image-Token Consistency for Vision-Language Pre-Training

Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations

Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks

Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification

Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning

Learning Visual Representations via Language-Guided Sampling

L-CoIns: Language-Based Colorization With Instance Awareness

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID

Unifying Vision, Text, and Layout for Universal Document Processing

RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training

Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network

Text-Guided Unsupervised Latent Transformation for Multi-Attribute Image Manipulation

Improving Image Recognition by Retrieving From Web-Scale Image-Text Data

Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval

DATE: Domain Adaptive Product Seeker for E-Commerce

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning With Multimodal Models

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models

DeepSolo: Let Transformer Decoder With Explicit Points Solo for Text Spotting

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

R2Former: Unified Retrieval and Reranking Transformer for Place Recognition

Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework

Neural Congealing: Aligning Images to a Joint Semantic Atlas

Open Vocabulary Semantic Segmentation With Patch Aligned Contrastive Learning

Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains

Explicit Visual Prompting for Low-Level Structure Segmentations

FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Zero-Shot Referring Image Segmentation With Global-Local Context Features

DejaVu: Conditional Regenerative Learning To Enhance Dense Prediction

Meta Compositional Referring Expression Segmentation

Interactive Segmentation As Gaussian Process Classification

Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation

Principles of Forgetting in Domain-Incremental Semantic Segmentation in Adverse Weather Conditions

AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation

PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers

Leveraging Hidden Positives for Unsupervised Semantic Segmentation

Understanding Imbalanced Semantic Segmentation Through Neural Collapse

Balancing Logit Variation for Long-Tailed Semantic Segmentation

Boundary-Enhanced Co-Training for Weakly Supervised Semantic Segmentation

Conflict-Based Cross-View Consistency for Semi-Supervised Semantic Segmentation

Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

DualRel: Semi-Supervised Mitochondria Segmentation From a Prototype Perspective

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

Co-Salient Object Detection With Uncertainty-Aware Group Exchange-Masking

Supervised Masked Knowledge Distillation for Few-Shot Transformers

Modeling the Distributional Uncertainty for Salient Object Detection Models

Weak-Shot Object Detection Through Mutual Knowledge Transfer

CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection

Adaptive Sparse Pairwise Loss for Object Re-Identification

DETRs With Hybrid Matching

Generating Features With Increased Crop-Related Diversity for Few-Shot Object Detection

ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector

Multiclass Confidence and Localization Calibration for Object Detection

Open-Set Representation Learning Through Combinatorial Embedding

ProD: Prompting-To-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification

Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images

FFF: Fragment-Guided Flexible Fitting for Building Complete Protein Structures

Pseudo-Label Guided Contrastive Learning for Semi-Supervised Medical Image Segmentation

Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy

KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation

Image Quality-Aware Diagnosis via Meta-Knowledge Co-Embedding

Interventional Bag Multi-Instance Learning on Whole-Slide Pathological Images

Visual Prompt Tuning for Generative Transfer Learning

LINe: Out-of-Distribution Detection by Leveraging Important Neurons

GCFAgg: Global and Cross-View Feature Aggregation for Multi-View Clustering

Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification

BiCro: Noisy Correspondence Rectification for Multi-Modality Data via Bi-Directional Cross-Modal Similarity Consistency

Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning

HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization

MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset

Class Prototypes Based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos

Learning From Noisy Labels With Decoupled Meta Label Purifier

SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail

Why Is the Winner the Best?

Balanced Product of Calibrated Experts for Long-Tailed Recognition

Transfer Knowledge From Head to Tail: Uncertainty Calibration Under Long-Tailed Distribution

FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding

COT: Unsupervised Domain Adaptation With Clustering and Optimal Transport

MHPL: Minimum Happy Points Learning for Active Source Free Domain Adaptation

Upcycling Models Under Domain and Category Shift

PMR: Prototypical Modal Rebalance for Multimodal Learning

MMANet: Margin-Aware Distillation and Modality-Aware Regularization for Incomplete Multimodal Learning

Feature Alignment and Uniformity for Test Time Adaptation

Revisiting Prototypical Network for Cross Domain Few-Shot Learning

A Whac-a-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others

Independent Component Alignment for Multi-Task Learning

MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer

MELTR: Meta Loss Transformer for Learning To Fine-Tune Video Foundation Models

1% VS 100%: Parameter-Efficient Low Rank Adapter for Dense Predictions

Rebalancing Batch Normalization for Exemplar-Based Class-Incremental Learning

Partial Network Cloning

ERM-KTP: Knowledge-Level Machine Unlearning via Knowledge Transfer

Rethinking Feature-Based Knowledge Distillation for Face Recognition

Regularizing Second-Order Influences for Continual Learning

Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation

Decoupling Learning and Remembering: A Bilevel Memory Framework With Knowledge Projection for Task-Incremental Learning

On the Stability-Plasticity Dilemma of Class-Incremental Learning

Simulated Annealing in Early Layers Leads to Better Generalization

Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning

Tunable Convolutions With Parametric Multi-Loss Optimization

Re-Basin via Implicit Sinkhorn Differentiation

Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

AstroNet: When Astrocyte Meets Artificial Neural Network

Network Expansion for Practical Training Acceleration

Defining and Quantifying the Emergence of Sparse Concepts in DNNs

Samples With Low Loss Curvature Improve Data Efficiency

Masked Images Are Counterfactual Samples for Robust Fine-Tuning

Bias Mimicking: A Simple Sampling Approach for Bias Mitigation

NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers

Practical Network Acceleration With Tiny Sets

TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation

Discriminator-Cooperated Feature Map Distillation for GAN Compression

Private Image Generation With Dual-Purpose Auxiliary Classifier

ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing

Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers

A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation

SimpleNet: A Simple Network for Image Anomaly Detection and Localization

DaFKD: Domain-Aware Federated Knowledge Distillation

Reliable and Interpretable Personalized Federated Learning

Adaptive Channel Sparsity for Federated Learning Under System Heterogeneity

Bias-Eliminating Augmentation Learning for Debiased Federated Learning

Instance-Aware Domain Generalization for Face Anti-Spoofing

Adversarially Masking Synthetic To Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation

Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection

MEDIC: Remove Model Backdoors via Importance Driven Cloning

Progressive Backdoor Erasing via Connecting Backdoor and Adversarial Attacks

Reinforcement Learning-Based Black-Box Model Inversion Attacks

T-SEA: Transfer-Based Self-Ensemble Attack on Object Detection

Proximal Splitting Adversarial Attack for Semantic Segmentation

Towards Transferable Targeted Adversarial Examples

AGAIN: Adversarial Training With Attribution Span Enlargement and Hybrid Feature Fusion

Generalist: Decoupling Natural and Robust Generalization

Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets

Discrete Point-Wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition

RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts

CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search

TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization

(ends 12:00 PM)

12:30 p.m.

Break:

Lunch

(ends 2:00 PM)

2 p.m.

Panel:

Scientific Discovery and the Environment

(ends 3:00 PM)

3 p.m.

Award:

Award Candidates THU

(ends 4:00 PM)

4 p.m.

Break:

Break

(ends 4:30 PM)

4:30 p.m.

Poster Session THU-PM [4:30-6:00]

Posters 4:30-6:00

High-Fidelity Event-Radiance Recovery via Transient Event Frequency

RobustNeRF: Ignoring Distractors With Robust Losses

NeRDi: Single-View NeRF Synthesis With Language-Guided Diffusion As General Image Priors

GM-NeRF: Learning Generalizable Model-Based Neural Radiance Fields From Multi-View Images

MixNeRF: Modeling a Ray With Mixture Density for Novel View Synthesis From Sparse Inputs

SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting With Neural Radiance Fields

Masked Wavelet Representation for Compact Neural Radiance Fields

PaletteNeRF: Palette-Based Appearance Editing of Neural Radiance Fields

SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory

Transforming Radiance Field With Lipschitz Network for Photorealistic 3D Scene Stylization

Occlusion-Free Scene Recovery via Neural Radiance Fields

TriVol: Point Cloud Rendering via Triple Volumes

DyNCA: Real-Time Dynamic Texture Synthesis Using Neural Cellular Automata

Neural Scene Chronology

ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects

ORCa: Glossy Objects As Radiance-Field Cameras

Nighttime Smartphone Reflective Flare Removal Using Optical Center Symmetry Prior

SunStage: Portrait Reconstruction and Relighting Using the Sun as a Light Stage

The Differentiable Lens: Compound Lens Search Over Glass Surfaces and Materials for Object Detection

Teleidoscopic Imaging System for Microscale 3D Shape Reconstruction

Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections

NeuralUDF: Learning Unsigned Distance Fields for Multi-View Reconstruction of Surfaces With Arbitrary Topologies

Sphere-Guided Training of Neural Implicit Surfaces

OReX: Object Reconstruction From Planar Cross-Sections Using Neural Fields

Persistent Nature: A Generative Model of Unbounded 3D Worlds

3D Neural Field Generation Using Triplane Diffusion

Diffusion-Based Signed Distance Fields for 3D Shape Generation

Efficient View Synthesis and 3D-Based Multi-Frame Denoising With Multiplane Feature Representations

Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models

SINE: Semantic-Driven Image-Based NeRF Editing With Prior-Guided Editing Field

3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions

Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion

PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360°

StyleGene: Crossover and Mutation of Region-Level Facial Genes for Kinship Face Synthesis

Parameter Efficient Local Implicit Image Function Network for Face Segmentation

Graphics Capsule: Learning Hierarchical 3D Face Representations From 2D Images

Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars

Learning Neural Parametric Head Models

Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation

Learning Locally Editable Virtual Humans

Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-Time Mobile Telepresence

Ham2Pose: Animating Sign Language Notation Into Pose Sequences

PointAvatar: Deformable Point-Based Head Avatars From Videos

PAniC-3D: Stylized Single-View 3D Reconstruction From Portraits of Anime Characters

HandNeRF: Neural Radiance Fields for Animatable Interacting Hands

VGFlow: Visibility Guided Flow Network for Human Reposing

Clothed Human Performance Capture With a Double-Layer Neural Radiance Fields

POEM: Reconstructing Hand in a Point Embedded Multi-View Stereo

FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans From Sparse Views

Flow Supervision for Deformable NeRF

Building Rearticulable Models for Arbitrary 3D Objects From 4D Point Clouds

Implicit 3D Human Mesh Recovery Using Consistency With Pose and Shape From Unseen-View

One-Stage 3D Whole-Body Mesh Recovery With Component Aware Transformer

Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes

FLEX: Full-Body Grasping Without Full-Body Grasps

DexArt: Benchmarking Generalizable Dexterous Manipulation With Articulated Objects

CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

CIRCLE: Capture in Rich Contextual Environments

Decoupling Human and Camera Motion From Videos in the Wild

GarmentTracking: Category-Level Garment Pose Tracking

Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition From Egocentric RGB Videos

PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation With Progressive Video Transformers

Delving Into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling

3D-POP – An Automated Annotation Approach to Facilitate Markerless 2D-3D Tracking of Freely Moving Birds With Marker-Based Motion Capture

TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation

Markerless Camera-to-Robot Pose Estimation via Self-Supervised Sim-to-Real Transfer

SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation

IMP: Iterative Matching and Pose Estimation With Adaptive Pooling

Self-Supervised Representation Learning for CAD

Few-Shot Geometry-Aware Keypoint Localization

SparsePose: Sparse-View Camera Pose Regression and Refinement

A Large-Scale Homography Benchmark

Learning Geometric-Aware Properties in 2D Representation Using Lightweight CAD Models, or Zero Real 3D Pairs

AutoRecon: Automated 3D Object Discovery and Reconstruction

Multi-Sensor Large-Scale Dataset for Multi-View 3D Reconstruction

NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization

Self-Supervised Super-Plane for Neural 3D Reconstruction

PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes

Single View Scene Scale Estimation Using Scale Field

3D Line Mapping Revisited

Inverting the Imaging Process by Learning an Implicit Camera Model

SfM-TTR: Using Structure From Motion for Test-Time Refinement of Single-View Depth Networks

iDisc: Internal Discretization for Monocular Depth Estimation

DC2: Dual-Camera Defocus Control by Learning To Refocus

A Practical Stereo Depth System for Smart Glasses

GeoMVSNet: Learning Multi-View Stereo With Geometry Perception

DINN360: Deformable Invertible Neural Network for Latitude-Aware 360° Image Rescaling

OmniVidar: Omnidirectional Depth Estimation From Multi-Fisheye Images

Learning To Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes

Modality-Invariant Visual Odometry for Embodied Vision

VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

AeDet: Azimuth-Invariant Multi-View 3D Object Detection

Object Detection With Self-Supervised Scene Adaptation

Understanding the Robustness of 3D Object Detection With Bird’s-Eye-View Representations in Autonomous Driving

BEVHeight: A Robust Framework for Vision-Based Roadside 3D Object Detection

Uncertainty-Aware Vision-Based Metric Cross-View Geolocalization

OrienterNet: Visual Localization in 2D Public Maps With Neural Matching

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales With Multi-Depth Seeds for 3D Object Detection

Virtual Sparse Convolution for Multimodal 3D Object Detection

Optimal Transport Minimization: Crowd Localization on Density Maps for Semi-Supervised Counting

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

GraVoS: Voxel Selection for 3D Point-Cloud Detection

MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving

LaserMix for Semi-Supervised LiDAR Semantic Segmentation

Implicit Surface Contrastive Clustering for LiDAR Point Clouds

Semi-Weakly Supervised Object Kinematic Motion Prediction

PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions

Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation

Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders

ToThePoint: Efficient Contrastive Learning of 3D Point Clouds via Recycling

PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection

PointConvFormer: Revenge of the Point-Based Convolution

Self-Positioning Point-Based Transformer for Point Cloud Understanding

PointClustering: Unsupervised Point Cloud Pre-Training Using Transformation Invariance in Clustering

Neural Intrinsic Embedding for Non-Rigid Point Cloud Matching

HGNet: Learning Hierarchical Geometry From Points, Edges, and Surfaces

LP-DIF: Learning Local Pattern-Specific Deep Implicit Function for 3D Objects and Scenes

Conjugate Product Graphs for Globally Optimal 2D-3D Shape Matching

UTM: A Unified Multiple Object Tracking Model With Identity-Aware Feature Enhancement

Learning Rotation-Equivariant Features for Visual Correspondence

Adaptive Spot-Guided Transformer for Consistent Local Feature Matching

PMatch: Paired Masked Image Modeling for Dense Geometric Matching

Iterative Geometry Encoding Volume for Stereo Matching

Adaptive Annealing for Robust Geometric Estimation

Tangentially Elongated Gaussian Belief Propagation for Event-Based Incremental Optical Flow Estimation

Robust and Scalable Gaussian Process Regression and Its Applications

BEV-Guided Multi-Modality Fusion for Driving Perception

HumanBench: Towards General Human-Centric Perception With Projector Assisted Pretraining

Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

ProphNet: Efficient Agent-Centric Motion Forecasting With Anchor-Informed Proposals

StarCraftImage: A Dataset for Prototyping Spatial Reasoning Methods for Multi-Agent Environments

Stimulus Verification Is a Universal and Effective Sampler in Multi-Modal Human Trajectory Prediction

PyPose: A Library for Robot Learning With Physics-Based Optimization

Source-Free Adaptive Gaze Estimation by Uncertainty Reduction

Camouflaged Object Detection With Feature Decomposition and Edge Reconstruction

MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors

Clothing-Change Feature Augmentation for Person Re-Identification

Dynamic Aggregated Network for Gait Recognition

Feature Representation Learning With Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition

MetaPortrait: Identity-Preserving Talking Head Generation With Fast Personalized Adaptation

FLAG3D: A 3D Fitness Activity Dataset With Language Instruction

TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning With Structure-Trajectory Prompted Reconstruction for Person Re-Identification

NeMo: Learning 3D Neural Motion Fields From Multiple Video Instances of the Same Action

Unsupervised Space-Time Network for Temporally-Consistent Segmentation of Multiple Motions

Deep Polarization Reconstruction With PDAVIS Events

Range-Nullspace Video Frame Interpolation With Focalized Motion Estimation

Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation

1000 FPS HDR Video With a Spike-RGB Hybrid Camera

Deep Discriminative Spatial and Temporal Network for Efficient Video Deblurring

Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement

A Unified HDR Imaging Method With Pixel and Patch Level

BiasBed – Rigorous Texture Bias Evaluation

Learning a Practical SDR-to-HDRTV Up-Conversion Using New Dataset and Degradation Models

Learning a Deep Color Difference Metric for Photographic Images

Learning a Simple Low-Light Image Enhancer From Paired Low-Light Instances

Residual Degradation Learning Unfolding Framework With Mixing Priors Across Spectral and Spatial for Compressive Spectral Imaging

Toward Stable, Interpretable, and Lightweight Hyperspectral Super-Resolution

RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors

Robust Unsupervised StyleGAN Image Restoration

Quality-Aware Pre-Trained Models for Blind Image Quality Assessment

Learning To Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization

Multi-Realism Image Compression With a Conditional Generator

RGB No More: Minimally-Decoded JPEG Vision Transformers

Kernel Aware Resampler

Spatial-Frequency Mutual Learning for Face Super-Resolution

Activating More Pixels in Image Super-Resolution Transformer

Omni Aggregation Networks for Lightweight Image Super-Resolution

Towards Artistic Image Aesthetics Assessment: A Large-Scale Dataset and a New Method

RWSC-Fusion: Region-Wise Style-Controlled Fusion Network for the Prohibited X-Ray Security Image Synthesis

Efficient Scale-Invariant Generator With Column-Row Entangled Pixel Synthesis

Masked and Adaptive Transformer for Exemplar Based Image Translation

SmartBrush: Text and Shape Guided Object Inpainting With Diffusion Model

Neural Transformation Fields for Arbitrary-Styled Font Generation

Referring Image Matting

Handwritten Text Generation From Visual Archetypes

SceneComposer: Any-Level Semantic Image Synthesis

Affordance Diffusion: Synthesizing Hand-Object Interactions

LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

GLIGEN: Open-Set Grounded Text-to-Image Generation

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

EDICT: Exact Diffusion Inversion via Coupled Transformations

Solving 3D Inverse Problems Using Pre-Trained 2D Diffusion Models

Diffusion Probabilistic Model Made Slim

Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models

Binary Latent Diffusion

Semi-Supervised Video Inpainting With Cycle Consistency Constraints

Towards Accurate Image Coding: Improved Autoregressive Image Generation With Dynamic Vector Quantization

Large-Capacity and Flexible Video Steganography via Invertible Neural Network

Neural Video Compression With Diverse Contexts

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Structured Sparsity Learning for Efficient Video Super-Resolution

DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

Boost Vision Transformer With GPU-Friendly Sparsity and Quantization

All Are Worth Words: A ViT Backbone for Diffusion Models

Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers

Vision Transformer With Super Token Sampling

DropKey for Vision Transformer

Seeing Beyond the Brain: Conditional Diffusion Model With Sparse Masked Modeling for Vision Decoding

ResFormer: Scaling ViTs With Multi-Resolution Training

Stare at What You See: Masked Image Modeling Without Reconstruction

Mixed Autoencoder for Self-Supervised Visual Representation Learning

Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification

G-MSM: Unsupervised Multi-Shape Matching With Graph-Based Affinity Priors

Efficient Mask Correction for Click-Based Interactive Image Segmentation

Prototype-Based Embedding Network for Scene Graph Generation

Graph Representation for Order-Aware Visual Transformation

Unbiased Scene Graph Generation in Videos

Recurrence Without Recurrence: Stable Video Landmark Detection With Deep Equilibrium Models

VideoTrack: Learning To Track Objects via Video Transformer

Breaking the “Object” in Video Object Segmentation

Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection

Mask-Free Video Instance Segmentation

Hierarchical Neural Memory Network for Low Latency Event Processing

Unifying Short and Long-Term Tracking With Graph Hierarchies

Towards End-to-End Generative Modeling of Long Videos With Memory-Efficient Bidirectional Transformers

An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling

Egocentric Audio-Visual Object Localization

AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR

A Light Weight Model for Active Speaker Detection

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Video Test-Time Adaptation for Action Recognition

Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling

Object Discovery From Motion-Guided Tokens

Open Set Action Recognition via Multi-Label Evidential Learning

PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization

Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels

Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-Based Active Learning

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Hierarchical Video-Moment Retrieval and Step-Captioning

HierVL: Learning Hierarchical Video-Language Embeddings

Learning Transferable Spatiotemporal Representations From Natural Script Knowledge

WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding

Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding

Learning Action Changes by Measuring Verb-Adverb Textual Relationships

LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling

DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking

CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition With Variational Alignment

Joint Visual Grounding and Tracking With Natural Language Specification

Accelerating Vision-Language Pretraining With Free Language Modeling

CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

Where We Are and What We’re Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes

ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos

MetaCLUE: Towards Comprehensive Visual Metaphors Research

GeoVLN: Learning Geometry-Enhanced Visual Representation With Slot Attention for Vision-and-Language Navigation

Being Comes From Not-Being: Open-Vocabulary Text-to-Motion Generation With Wordless Training

LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models

Position-Guided Text Prompt for Vision-Language Pre-Training

Intrinsic Physical Concepts Discovery With Object-Centric Predictive Models

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model

CLAMP: Prompt-Based Contrastive Learning for Connecting Language and Animal Pose

Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models

DegAE: A New Pretraining Paradigm for Low-Level Vision

RILS: Masked Visual Reconstruction in Language Semantic Space

Learning Geometry-Aware Representations by Sketching

SketchXAI: A First Look at Explainability for Human Sketches

MAGVLT: Masked Generative Vision-and-Language Transformer

Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

Semantic-Conditional Diffusion Networks for Image Captioning

REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory

Variational Distribution Learning for Unsupervised Text-to-Image Generation

Scaling Language-Image Pre-Training via Masking

LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data

Revisiting Self-Similarity: Structural Embedding for Image Retrieval

Improving Cross-Modal Retrieval With Set of Diverse Embeddings

Masked Autoencoding Does Not Help Natural Language Supervision at Scale

Few-Shot Learning With Visual Distribution Calibration and Cross-Modal Distribution Alignment

Deep Hashing With Minimal-Distance-Separated Hash Centers

ConZIC: Controllable Zero-Shot Image Captioning by Sampling-Based Polishing

Learning To Name Classes for Vision and Language Models

Data-Efficient Large Scale Place Recognition With Graded Similarity Supervision

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment

HOICLIP: Efficient Knowledge Transfer for HOI Detection With Vision-Language Models

OvarNet: Towards Open-Vocabulary Object Attribute Recognition

NeRF-RPN: A General Framework for Object Detection in NeRFs

Mask-Free OVIS: Open-Vocabulary Instance Segmentation Without Manual Mask Annotations

GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning

Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning

Contrastive Grouping With Transformer for Referring Image Segmentation

Semantic Prompt for Few-Shot Image Recognition

GRES: Generalized Referring Expression Segmentation

Network-Free, Unsupervised Semantic Segmentation With Synthetic Images

Few-Shot Semantic Image Synthesis With Class Affinity Transfer

Ultra-High Resolution Segmentation With Ultra-Rich Context: A Novel Benchmark

Content-Aware Token Sharing for Efficient Semantic Segmentation With Vision Transformers

Hierarchical Dense Correlation Distillation for Few-Shot Segmentation

On Calibrating Semantic Segmentation Models: Analyses and an Algorithm

FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation

Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation

Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation

Long-Tailed Visual Recognition via Self-Heterogeneous Integration With Knowledge Excavation

Instance-Specific and Model-Adaptive Supervision for Semi-Supervised Semantic Segmentation

Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm

IDGI: A Framework To Eliminate Explanation Noise From Integrated Gradients

Weakly Supervised Posture Mining for Fine-Grained Classification

Vision Transformers Are Good Mask Auto-Labelers

Enhanced Training of Query-Based Object Detection via Selective Query Recollection

Box-Level Active Detection

CIGAR: Cross-Modality Graph Reasoning for Domain Adaptive Object Detection

DA-DETR: Domain Adaptive Detection Transformer With Information Fusion

Continual Detection Transformer for Incremental Object Detection

Semi-DETR: Semi-Supervised Object Detection With Detection Transformers

Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection

Harmonious Teacher for Cross-Domain Object Detection

Contrastive Mean Teacher for Domain Adaptive Object Detectors

Out-of-Distributed Semantic Pruning for Robust Semi-Supervised Learning

(ML)²P-Encoder: On Exploration of Channel-Class Correlation for Multi-Label Zero-Shot Learning

MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery

Devil Is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization

SQUID: Deep Feature In-Painting for Unsupervised Anomaly Detection

OCELOT: Overlapped Cell on Tissue Dataset for Histopathology

DeGPR: Deep Guided Posterior Regularization for Multi-Class Cell Detection and Counting

Best of Both Worlds: Multimodal Contrastive Learning With Tabular and Imaging Data

RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories

GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection

Discriminating Known From Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder

Sample-Level Multi-View Graph Clustering

On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering

Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric

Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement

Open-Set Likelihood Maximization for Few-Shot Learning

HyperMatch: Noise-Tolerant Semi-Supervised Learning via Relaxed Contrastive Constraint

Token Boosting for Robust Self-Supervised Visual Transformer Pre-Training

Difficulty-Based Sampling for Debiased Contrastive Representation Learning

Improving Selective Visual Question Answering by Learning From Your Peers

Superclass Learning With Representation Enhancement

DISC: Learning From Noisy Labels via Dynamic Instance-Specific Selection and Correction

FCC: Feature Clusters Compression for Long-Tailed Visual Recognition

Dynamically Instance-Guided Adaptation: A Backward-Free Approach for Test-Time Domain Adaptive Semantic Segmentation

Semi-Supervised Domain Adaptation With Source Label Adaptation

Adjustment and Alignment for Unbiased Open Set Domain Adaptation

C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation

ALOFT: A Lightweight MLP-Like Architecture With Dynamic Low-Frequency Transform for Domain Generalization

Modality-Agnostic Debiasing for Single Domain Generalization

ActMAD: Activation Matching To Align Distributions for Test-Time-Training

TIPI: Test Time Adaptation With Transformation Invariance

Improved Test-Time Adaptation for Domain Generalization

Learning With Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning

NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering

PIVOT: Prompting for Video Continual Learning

BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning

DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning

PCR: Proxy-Based Contrastive Replay for Online Class-Incremental Continual Learning

Masked Autoencoders Enable Efficient Knowledge Distillers

Data-Free Knowledge Distillation via Feature Exchange and Activation Region Constraint

Multi-Level Logit Distillation

Preserving Linear Separability in Continual Learning by Backward Feature Projection

Critical Learning Periods for Multisensory Integration in Deep Networks

SLACK: Stable Learning of Augmentations With Cold-Start and KL Regularization

Improving Generalization With Domain Convex Game

Exploring Data Geometry for Continual Learning

FlowGrad: Controlling the Output of Generative ODEs With Gradients

Deep Graph Reprogramming

X-Pruner: eXplainable Pruning for Vision Transformers

Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures

Compacting Binary Neural Networks by Sparse Kernel Selection

Deep Deterministic Uncertainty: A New Simple Baseline

Understanding Deep Generative Models With Generalized Empirical Likelihoods

Fair Scratch Tickets: Finding Fair Sparse Networks Without Weight Training

Hard Sample Matters a Lot in Zero-Shot Quantization

PD-Quant: Post-Training Quantization Based on Prediction Difference Metric

Vector Quantization With Self-Attention for Quality-Independent Representation Learning

Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond

Sequential Training of GANs Against GAN-Classifiers Reveals Correlated “Knowledge Gaps” Present Among Independently Trained GAN Instances

Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision

Towards Universal Fake Image Detectors That Generalize Across Generative Models

Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection

Generating Anomalies for Video Anomaly Detection With Prompt-Based Feature Mapping

Revisiting Reverse Distillation for Anomaly Detection

MetaMix: Towards Corruption-Robust Continual Learning With Temporally Self-Adaptive Data Transformation

ScaleFL: Resource-Adaptive Federated Learning With Heterogeneous Clients

Confidence-Aware Personalized Federated Learning via Variational Expectation Maximization

Make Landscape Flatter in Differentially Private Federated Learning

Rethinking Domain Generalization for Face Anti-Spoofing: Separability and Alignment

StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning

The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection

Architectural Backdoors in Neural Networks

You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?

A Practical Upper Bound for the Worst-Case Attribution Deviations

Sibling-Attack: Rethinking Transferable Adversarial Attacks Against Face Recognition

Angelic Patches for Improving Third-Party Object Detector Performance

Introducing Competition To Boost the Transferability of Targeted Adversarial Examples Through Clean Feature Mixup

Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations

Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation

The Enemy of My Enemy Is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training

Robust Single Image Reflection Removal Against Adversarial Attacks

Physical-World Optical Adversarial Attacks on 3D Face Recognition

AUNet: Learning Relations Between Action Units for Face Forgery Detection

(ends 6:00 PM)