CVPR 2024 Events with Videos
Art Programs
Expo Track Keynotes
Keynotes
Posters
- Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features
- GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors
- A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint
- Score-Guided Diffusion for 3D Human Recovery
- Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers
- Sharingan: A Transformer Architecture for Multi-Person Gaze Following
- SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes
- Real-Time Exposure Correction via Collaborative Transformations and Adaptive Sampling
- GSVA: Generalized Segmentation via Multimodal Large Language Models
- From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration
- GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh
- FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video
- Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence
- Automatic Controllable Colorization via Imagination
- Mosaic-SDF for 3D Generative Models
- IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing
- Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation
- Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance
- ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention
- TexOct: Generating Textures of 3D Models with Octree-based Diffusion
- AAMDM: Accelerated Auto-regressive Motion Diffusion Model
- MatFuse: Controllable Material Generation with Diffusion Models
- Video2Game: Real-time Interactive Realistic and Browser-Compatible Environment from a Single Video
- 3D Human Pose Perception from Egocentric Stereo Videos
- Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations
- Real-Time Neural BRDF with Spherically Distributed Primitives
- ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning
- UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures
- SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement
- BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
- From Feature to Gaze: A Generalizable Replacement of Linear Layer for Gaze Estimation
- CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement
- Differentiable Micro-Mesh Construction
- One-Shot Open Affordance Learning with Foundation Models
- DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion
- HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion
- AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond
- Estimating Extreme 3D Image Rotations using Cascaded Attention
- Self-Supervised Facial Representation Learning with Facial Region Awareness
- Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach
- GraCo: Granularity-Controllable Interactive Segmentation
- Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling
- CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation
- FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring
- Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
- Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation
- Self-Supervised Dual Contouring
- Artist-Friendly Relightable and Animatable Neural Heads
- VINECS: Video-based Neural Character Skinning
- Segment Every Out-of-Distribution Object
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution
- A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation
- CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
- Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing
- XFeat: Accelerated Features for Lightweight Image Matching
- From Activation to Initialization: Scaling Insights for Optimizing Neural Fields
- Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction
- I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions
- MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion
- Putting the Object Back into Video Object Segmentation
- Masked and Shuffled Blind Spot Denoising for Real-World Images
- Modular Blind Video Quality Assessment
- ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering
- WANDR: Intention-guided Human Motion Generation
- DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling
- Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
- Enhancing Video Super-Resolution via Implicit Resampling-based Alignment
- Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories
- Boosting Image Restoration via Priors from Pre-trained Models
- Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
- AUEditNet: Dual-Branch Facial Action Unit Intensity Manipulation with Implicit Disentanglement
- SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation
- Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching
- Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
- HOIAnimator: Generating Text-prompt Human-object Animations using Novel Perceptive Diffusion Models
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
- Bidirectional Autoregressive Diffusion Model for Dance Generation
- Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes
- Misalignment-Robust Frequency Distribution Loss for Image Transformation
- Unsupervised Salient Instance Detection
- Relightable Gaussian Codec Avatars
- CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
- Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
- Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding
- MMM: Generative Masked Motion Model
- TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion
- FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions
- Quantifying Task Priority for Multi-Task Optimization
- Video-Based Human Pose Regression via Decoupled Space-Time Aggregation
- A Unified Framework for Human-centric Point Cloud Video Understanding
- Unsupervised Gaze Representation Learning from Multi-view Face Images
- Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model
- Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes
- PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF
- DiffusionLight: Light Probes for Free by Painting a Chrome Ball
- SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion
- DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling
- Joint2Human: High-Quality 3D Human Generation via Compact Spherical Embedding of 3D Joints
- GALA: Generating Animatable Layered Assets from a Single Scan
- On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
- Programmable Motion Generation for Open-Set Motion Control Tasks
- NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs
- Capturing Closely Interacted Two-Person Motions with Reaction Priors
- Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds
- RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method
- Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
- TexVocab: Texture Vocabulary-conditioned Human Avatars
- Guided Slot Attention for Unsupervised Video Object Segmentation
- SEAS: ShapE-Aligned Supervision for Person Re-Identification
- Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation
- UniHuman: A Unified Model For Editing Human Images in the Wild
- SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
- SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting
- Towards Variable and Coordinated Holistic Co-Speech Motion Generation
- DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation
- GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
- MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading
- AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution
- SVDTree: Semantic Voxel Diffusion for Single Image Tree Reconstruction
- OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers
- Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles
- KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation
- Optimizing Diffusion Noise Can Serve As Universal Motion Priors
- Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation
- OHTA: One-shot Hand Avatar via Data-driven Implicit Priors
- PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation
- SfmCAD: Unsupervised CAD Reconstruction by Learning Sketch-based Feature Modeling Operations
- Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras
- Learning to Control Camera Exposure via Reinforcement Learning
- Activity-Biometrics: Person Identification from Daily Activities
- Objects as Volumes: A Stochastic Geometry View of Opaque Solids
- MultiPhys: Multi-Person Physics-aware 3D Motion Estimation
- BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics
- Makeup Prior Models for 3D Facial Makeup Estimation and Applications
- OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation
- Locally Adaptive Neural 3D Morphable Models
- General Object Foundation Model for Images and Videos at Scale
- RecDiffusion: Rectangling for Image Stitching with Diffusion Models
- HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
- LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging
- TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation
- Relightable and Animatable Neural Avatar from Sparse-View Video
- Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives
- PoNQ: a Neural QEM-based Mesh Representation
- A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark
- Anatomically Constrained Implicit Face Models
- HashPoint: Accelerated Point Searching and Sampling for Neural Rendering
- RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control
- DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer
- Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption
- Neural Super-Resolution for Real-time Rendering with Radiance Demodulation
- Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors
- AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents
- No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation
- Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
- Segment Any Event Streams via Weighted Adaptation of Pivotal Tokens
- Fun with Flags: Robust Principal Directions via Flag Manifolds
- Learned Scanpaths Aid Blind Panoramic Video Quality Assessment
- From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation
- Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss
- Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis
- EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation
- Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
- As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors
- QUADify: Extracting Meshes with Pixel-level Details and Materials from Images
- RobustSAM: Segment Anything Robustly on Degraded Images
- Towards a Perceptual Evaluation Framework for Lighting Estimation
- USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation
- Digital Life Project: Autonomous 3D Characters with Social Intelligence
- Semantics-aware Motion Retargeting with Vision-Language Models
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
- Collaborating Foundation Models for Domain Generalized Semantic Segmentation
- 3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow
- Functional Diffusion
- NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation
- HIT: Estimating Internal Human Implicit Tissues from the Body Surface
- HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations
- Cross-view and Cross-pose Completion for 3D Human Understanding
- Image Sculpting: Precise Object Editing with 3D Geometry Control
- Deep Equilibrium Diffusion Restoration with Parallel Sampling
- Unbiased Estimator for Distorted Conics in Camera Calibration
- Rethinking Interactive Image Segmentation with Low Latency High Quality and Diverse Prompts
- Garment Recovery with Shape and Deformation Priors
- PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics
- Authentic Hand Avatar from a Phone Scan via Universal Hand Model
- HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation
- PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
- MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints
- LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition
- ZERO-IG: Zero-Shot Illumination-Guided Joint Denoising and Adaptive Enhancement for Low-Light Images
- Loose Inertial Poser: Motion Capture with IMU-attached Loose-Wear Jacket
- M&M VTO: Multi-Garment Virtual Try-On and Editing
- Rethinking Few-shot 3D Point Cloud Semantic Segmentation
- Gradient Alignment for Cross-Domain Face Anti-Spoofing
- RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation
- PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild
- ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring
- Prompt-Driven Referring Image Segmentation with Instance Contrasting
- SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis
- XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
- Unlocking the Potential of Pre-trained Vision Transformers for Few-Shot Semantic Segmentation through Relationship Descriptors
- Memory-Scalable and Simplified Functional Map Learning
- Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation
- Generating Human Motion in 3D Scenes from Text Descriptions
- CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective
- MoST: Motion Style Transformer Between Diverse Action Contents
- Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring
- Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image
- Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains
- Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision
- ChatPose: Chatting about 3D Human Pose
- URHand: Universal Relightable Hands
- Forecasting of 3D Whole-body Human Poses with Grasping Objects
- CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs
- InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
- HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models
- Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling
- Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
- DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans
- BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition
- OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion
- EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
- Scaling Up Dynamic Human-Scene Interaction Modeling
- Infer from What You Have Seen Before: Temporally-dependent Classifier for Semi-supervised Video Segmentation
- LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example
- DIOD: Self-Distillation Meets Object Discovery
- TexTile: A Differentiable Metric for Texture Tileability
- REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning
- Neural Sign Actors: A Diffusion Model for 3D Sign Language Production from Text
- Exploiting Style Latent Flows for Generalizing Deepfake Video Detection
- HOIST-Former: Hand-held Objects Identification Segmentation and Tracking in the Wild
- HumMUSS: Human Motion Understanding using State Space Models
- Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects
- EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams
- OmniMotionGPT: Animal Motion Generation with Limited Data
- Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi
- MeshPose: Unifying DensePose and 3D Body Mesh Reconstruction
- DPHMs: Diffusion Parametric Head Models for Depth-based Tracking
- Stratified Avatar Generation from Sparse Observations
- GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
- HUGS: Human Gaussian Splats
- Differentiable Point-based Inverse Rendering
- Hierarchical Histogram Threshold Segmentation – Auto-terminating High-detail Oversegmentation
- A Unified and Interpretable Emotion Representation and Expression Generation
- Human Gaussian Splatting: Real-time Rendering of Animatable Avatars
- Open-World Semantic Segmentation Including Class Similarity
- SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
- Self-Calibrating Vicinal Risk Minimisation for Model Calibration
- Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
- UniVS: Unified and Universal Video Segmentation with Prompts as Queries
- Semantic-aware SAM for Point-Prompted Instance Segmentation
- Deciphering ‘What’ and ‘Where’ Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations
- AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation
- One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning
- Breathing Life Into Sketches Using Text-to-Video Priors
- ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning
- BigGait: Learning Gait Representation You Want by Large Vision Models
- KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation
- Human Motion Prediction Under Unexpected Perturbation
- VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams
- SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation
- Robust Image Denoising through Adversarial Frequency Mixup
- Weakly Supervised Point Cloud Semantic Segmentation via Artificial Oracle
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
- FaceLift: Semi-supervised 3D Facial Landmark Localization
- MoMask: Generative Masked Modeling of 3D Human Motions
- MFP: Making Full Use of Probability Maps for Interactive Image Segmentation
- Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
- PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos
- 3D Facial Expressions through Analysis-by-Neural-Synthesis
- Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining
- Test-Time Domain Generalization for Face Anti-Spoofing
- LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment
- Multimodal Sense-Informed Forecasting of 3D Human Motions
- Residual Denoising Diffusion Models
- Finsler-Laplace-Beltrami Operators with Application to Shape Analysis
- HOI-M^3: Capture Multiple Humans and Objects Interaction within Contextual Environment
- SD2Event: Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras
- High-Quality Facial Geometry and Appearance Capture at Home
- Diffuse Attend and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
- KeyPoint Relative Position Encoding for Face Recognition
- MANUS: Markerless Grasp Capture using Articulated 3D Gaussians
- When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation
- Open Vocabulary Semantic Scene Sketch Understanding
- PFStorer: Personalized Face Restoration and Super-Resolution
- G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis
- Eclipse: Disambiguating Illumination and Materials using Unintended Shadows
- HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
- PolarMatte: Fully Computational Ground-Truth-Quality Alpha Matte Extraction for Images and Video using Polarized Screen Matting
- Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation
- Design2Cloth: 3D Cloth Generation from 2D Masks
- Perception-Oriented Video Frame Interpolation via Asymmetric Blending
- Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
- BodyMAP - Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed
- 3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation
- NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors
- Monocular Identity-Conditioned Facial Reflectance Reconstruction
- 4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations
- Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning
- GenesisTex: Adapting Image Denoising Diffusion to Texture Space
- Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting
- Flexible Biometrics Recognition: Bridging the Multimodality Gap through Attention Alignment and Prompt Tuning
- Geometry Transfer for Stylizing Radiance Fields
- Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification
- Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
- PAIR Diffusion: A Comprehensive Multimodal Object-Level Image Editor
- CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution
- Learning Continuous 3D Words for Text-to-Image Generation
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation
- You Only Need Less Attention at Each Stage in Vision Transformers
- Face2Diffusion for Fast and Editable Face Personalization
- CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model
- Customization Assistant for Text-to-Image Generation
- Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
- Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability
- TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models
- NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging
- Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
- CosmicMan: A Text-to-Image Foundation Model for Humans
- Pose Adapted Shape Learning for Large-Pose Face Reenactment
- VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
- Style Aligned Image Generation via Shared Attention
- Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement
- Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
- HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting
- Generating Illustrated Instructions
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
- SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering
- It's All About Your Sketch: Democratising Sketch Control in Diffusion Models
- Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data
- Cache Me if You Can: Accelerating Diffusion Models through Block Caching
- Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
- VAREN: Very Accurate and Realistic Equine Network
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
- A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network
- GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs
- Intrinsic Image Diffusion for Indoor Single-view Material Estimation
- REACTO: Reconstructing Articulated Objects from a Single Video
- Single Mesh Diffusion Models with Field Latents for Texture Generation
- Layout-Agnostic Scene Text Image Synthesis with Diffusion Models
- Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models
- On Exact Inversion of DPM-Solvers
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
- DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior
- InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
- Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
- DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
- InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
- Wired Perspectives: Multi-View Wire Art Embraces Generative AI
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
- Continuous Pose for Monocular Cameras in Neural Implicit Representation
- Accurate Training Data for Occupancy Map Prediction in Automated Driving Using Evidence Theory
- DaReNeRF: Direction-aware Representation for Dynamic Scenes
- Balancing Act: Distribution-Guided Debiasing in Diffusion Models
- IReNe: Instant Recoloring of Neural Radiance Fields
- StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation
- Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models
- DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
- Making Vision Transformers Truly Shift-Equivariant
- Personalized Residuals for Concept-Driven Text-to-Image Generation
- Grid Diffusion Models for Text-to-Video Generation
- Named Entity Driven Zero-Shot Image Manipulation
- Total Selfie: Generating Full-Body Selfies
- Condition-Aware Neural Network for Controlled Image Generation
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
- Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
- AdaShift: Learning Discriminative Self-Gated Neural Feature Activation With an Adaptive Shift Factor
- MR-VNet: Media Restoration using Volterra Networks
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
- 3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis
- SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
- LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering
- WonderJourney: Going from Anywhere to Everywhere
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
- State Space Models for Event Cameras
- MotionEditor: Editing Video Motion via Content-Aware Diffusion
- TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
- Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
- Gaussian Shell Maps for Efficient 3D Human Generation
- Revisiting Sampson Approximations for Geometric Estimation Problems
- Video-P2P: Video Editing with Cross-attention Control
- NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
- GenN2N: Generative NeRF2NeRF Translation
- A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
- Diversity-aware Channel Pruning for StyleGAN Compression
- Don’t Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion
- ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D Image
- Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization
- 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
- Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
- In Search of a Data Transformation That Accelerates Neural Field Training
- Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models
- 2S-UDF: A Novel Two-stage UDF Learning Method for Robust Non-watertight Model Reconstruction from Multi-view Images
- SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model
- ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image
- ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
- Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
- Orthogonal Adaptation for Modular Customization of Diffusion Models
- MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation
- Exploiting Diffusion Prior for Generalizable Dense Prediction
- SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation
- Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching
- Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing
- RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction
- Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
- Identifying Important Group of Pixels using Interactions
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples
- Learning Degradation-unaware Representation with Prior-based Latent Transformations for Blind Face Restoration
- HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
- Neural Lineage
- CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-driven Video Editing
- MaskPLAN: Masked Generative Layout Planning from Partial Input
- Learning Structure-from-Motion with Graph Attention Networks
- SuperPrimitive: Scene Reconstruction at a Primitive Level
- SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream
- In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing
- Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution
- Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts
- Prompt Augmentation for Self-supervised Text-guided Image Manipulation
- Boosting Diffusion Models with Moving Average Sampling in Frequency Domain
- SIGNeRF: Scene Integrated Generation for Neural Radiance Fields
- Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
- Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth
- Observation-Guided Diffusion Probabilistic Models
- FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features
- VidToMe: Video Token Merging for Zero-Shot Video Editing
- Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation
- FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer
- Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
- Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis
- AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
- Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields
- Mitigating Motion Blur in Neural Radiance Fields with Events and Frames
- SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
- Efficient Detection of Long Consistent Cycles and its Application to Distributed Synchronization
- Relightful Harmonization: Lighting-aware Portrait Background Replacement
- Friendly Sharpness-Aware Minimization
- LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
- Generative Unlearning for Any Identity
- Robust Self-calibration of Focal Lengths from the Fundamental Matrix
- WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
- Unmixing Before Fusion: A Generalized Paradigm for Multi-Source-based Hyperspectral Image Synthesis
- Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis
- Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
- AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings
- LightIt: Illumination Modeling and Control for Diffusion Models
- DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars
- Unsupervised Template-assisted Point Cloud Shape Correspondence Network
- FedUV: Uniformity and Variance for Heterogeneous Federated Learning
- InceptionNeXt: When Inception Meets ConvNeXt
- InstructVideo: Instructing Video Diffusion Models with Human Feedback
- HEAL-SWIN: A Vision Transformer On The Sphere
- TetraSphere: A Neural Descriptor for O(3)-Invariant Point Cloud Analysis
- Grounded Text-to-Image Synthesis with Attention Refocusing
- StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
- One-Shot Structure-Aware Stylized Image Synthesis
- NC-TTT: A Noise Contrastive Approach for Test-Time Training
- Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion
- CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization
- VideoBooth: Diffusion-based Video Generation with Image Prompts
- Towards 3D Vision with Low-Cost Single-Photon Cameras
- Self-correcting LLM-controlled Diffusion Models
- AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search
- Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models
- Steerers: A Framework for Rotation Equivariant Keypoint Descriptors
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
- Relation Rectification in Diffusion Model
- Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation
- Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting
- LAENeRF: Local Appearance Editing for Neural Radiance Fields
- MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
- TextCraftor: Your Text Encoder Can be Image Quality Controller
- Generalizable Novel-View Synthesis using a Stereo Camera
- Correcting Diffusion Generation through Resampling
- YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection
- LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis
- ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion
- TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing
- Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection
- Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training
- Leveraging Camera Triplets for Efficient and Accurate Structure-from-Motion
- Seeing the World through Your Eyes
- CCEdit: Creative and Controllable Video Editing via Diffusion Models
- AnyDoor: Zero-shot Object-level Image Customization
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
- TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video
- Fitting Flats to Flats
- Data-Free Quantization via Pseudo-label Filtering
- Scaling Laws of Synthetic Images for Model Training ... for Now
- PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns
- Universal Robustness via Median Randomized Smoothing for Real-World Super-Resolution
- GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos
- Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion
- Text-Driven Image Editing via Learnable Regions
- PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
- Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
- KPConvX: Modernizing Kernel Point Convolution with Kernel Attention
- SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer
- Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
- Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning
- Training-Free Pretrained Model Merging
- 3D Multi-frame Fusion for Video Stabilization
- GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image
- Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
- Building Optimal Neural Architectures using Interpretable Knowledge
- Learned Representation-Guided Diffusion Models for Large-Image Generation
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
- DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video
- Generating Non-Stationary Textures using Self-Rectification
- VecFusion: Vector Font Generation with Diffusion
- Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
- Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
- ControlRoom3D: Room Generation using Semantic Proxy Rooms
- Deformable One-shot Face Stylization via DINO Semantic Guidance
- Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance
- MedBN: Robust Test-Time Adaptation against Malicious Test Samples
- DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
- DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction
- FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
- Towards Accurate and Robust Architectures via Neural Architecture Search
- Denoising Point Clouds in Latent Space via Graph Convolution and Invertible Neural Network
- Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network
- SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
- Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
- Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
- Low-Latency Neural Stereo Streaming
- 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
- AVID: Any-Length Video Inpainting with Diffusion Model
- Exact Fusion via Feature Distribution Matching for Few-shot Image Generation
- DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
- Time-, Memory- and Parameter-Efficient Visual Adaptation
- Permutation Equivariance of Transformers and Its Applications
- Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices
- PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
- Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- Clockwork Diffusion: Efficient Generation With Model-Step Distillation
- Emu Edit: Precise Image Editing via Recognition and Generation Tasks
- FreeU: Free Lunch in Diffusion U-Net
- AnyScene: Customized Image Synthesis with Composited Foreground
- PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
- Mean-Shift Feature Transformer
- Video Interpolation with Diffusion Models
- Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences
- LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
- SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering
- GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
- Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
- StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN
- Towards Memorization-Free Diffusion Models
- HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
- Vlogger: Make Your Dream A Vlog
- Readout Guidance: Learning Control from Diffusion Features
- Doubly Abductive Counterfactual Inference for Text-based Image Editing
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
- Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
- VS: Reconstructing Clothed 3D Human from Single Image via Vertex Shift
- En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data
- Diversified and Personalized Multi-rater Medical Image Segmentation
- FairCLIP: Harnessing Fairness in Vision-Language Learning
- Honeybee: Locality-enhanced Projector for Multimodal LLM
- DiG-IN: Diffusion Guidance for Investigating Networks - Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations
- Brush2Prompt: Contextual Prompt Generator for Object Inpainting
- Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation
- Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation
- SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers
- Beyond Seen Primitive Concepts and Attribute-Object Compositional Learning
- MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections
- Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences
- Towards Learning a Generalist Model for Embodied Navigation
- Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- CORES: Convolutional Response-based Score for Out-of-distribution Detection
- Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition
- Do Vision and Language Encoders Represent the World Similarly?
- Discontinuity-preserving Normal Integration with Auxiliary Edges
- MICap: A Unified Model for Identity-Aware Movie Descriptions
- The STVchrono Dataset: Towards Continuous Change Recognition in Time
- An Edit Friendly DDPM Noise Space: Inversion and Manipulations
- Depth Prompting for Sensor-Agnostic Depth Estimation
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
- Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World
- HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions
- Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
- InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
- The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement
- Would Deep Generative Models Amplify Bias in Future Models?
- Efficient Test-Time Adaptation of Vision-Language Models
- Retrieval-Augmented Egocentric Video Captioning
- DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions
- Understanding Video Transformers via Universal Concept Discovery
- LaneCPP: Continuous 3D Lane Detection using Physical Priors
- Bayesian Differentiable Physics for Cloth Digitalization
- WorDepth: Variational Language Prior for Monocular Depth Estimation
- SEED-Bench: Benchmarking Multimodal Large Language Models
- Blind Image Quality Assessment Based on Geometric Order Learning
- 3DToonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images
- Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding
- MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
- Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
- IDGuard: Robust General Identity-centric POI Proactive Defense Against Face Editing Abuse
- Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
- AHIVE: Anatomy-aware Hierarchical Vision Encoding for Interactive Radiology Report Retrieval
- Incremental Residual Concept Bottleneck Models
- Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention
- VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- FairRAG: Fair Human Generation via Fair Retrieval Augmentation
- DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
- DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
- Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis
- Situational Awareness Matters in 3D Vision Language Reasoning
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- In-distribution Public Data Synthesis with Diffusion Models for Differentially Private Image Classification
- SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
- From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior
- Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation
- DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors
- LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation
- EvDiG: Event-guided Direct and Global Components Separation
- Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction
- Dual-View Visual Contextualization for Web Navigation
- EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
- Sparse Views Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo
- CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation
- SAOR: Single-View Articulated Object Reconstruction
- HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
- Learning Triangular Distribution in Visual World
- Free3D: Consistent Novel View Synthesis without 3D Representation
- WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects Under Occlusion
- NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows
- NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
- Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation
- Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining
- Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
- VTimeLLM: Empower LLM to Grasp Video Moments
- Language-only Training of Zero-shot Composed Image Retrieval
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning
- GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding
- PointInfinity: Resolution-Invariant Point Diffusion Models
- Validating Privacy-Preserving Face Recognition under a Minimum Assumption
- PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation
- Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration
- LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
- Compositional Chain-of-Thought Prompting for Large Multimodal Models
- Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
- Explaining CLIP's Performance Disparities on Data from Blind/Low Vision Users
- MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models
- What Sketch Explainability Really Means for Downstream Tasks?
- PH-Net: Semi-Supervised Breast Lesion Segmentation via Patch-wise Hardness
- Multi-Modal Hallucination Control by Visual Information Grounding
- See, Say, and Segment: Teaching LMMs to Overcome False Premises
- MuseChat: A Conversational Music Recommendation System for Videos
- Model Inversion Robustness: Can Transfer Learning Help?
- Instance-aware Contrastive Learning for Occluded Human Mesh Reconstruction
- RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
- Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models
- Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline
- HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation
- Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
- MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes
- Long-Tailed Anomaly Detection with Learnable Class Names
- Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training
- EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation
- Incremental Nuclei Segmentation from Histopathological Images via Future-class Awareness and Compatibility-inspired Distillation
- IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM
- Global and Local Prompts Cooperation via Optimal Transport for Federated Learning
- Learning Large-Factor EM Image Super-Resolution with Generative Priors
- ZeroShape: Regression-based Zero-shot Shape Reconstruction
- Structure-Aware Sparse-View X-ray 3D Reconstruction
- ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting
- Differentiable Display Photometric Stereo
- RegionGPT: Towards Region Understanding Vision Language Model
- Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models
- Previously on ... From Recaps to Story Summarization
- PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
- Question Aware Vision Transformer for Multimodal Reasoning
- ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
- Towards Efficient Replay in Federated Incremental Learning
- Hearing Anything Anywhere
- Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation
- SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects
- Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
- Viewpoint-Aware Visual Grounding in 3D Scenes
- MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
- MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
- FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders
- SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image
- SketchINR: A First Look into Sketches as Implicit Neural Representations
- XFibrosis: Explicit Vessel-Fiber Modeling for Fibrosis Staging from Liver Pathology Images
- Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds
- CNC-Net: Self-Supervised Learning for CNC Machining Operations
- SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling
- Discovering and Mitigating Visual Biases through Keyword Explanation
- MonoNPHM: Dynamic Head Reconstruction from Monocular Videos
- Prompting Vision Foundation Models for Pathology Image Analysis
- Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair
- HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High- and Low-Frequency Information of Parametric Models
- Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration
- Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
- Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI
- Instance-level Expert Knowledge and Aggregate Discriminative Attention for Radiology Report Generation
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
- Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction
- Koala: Key Frame-Conditioned Long Video-LLM
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
- Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation
- Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo
- 3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surfaces
- Referring Image Editing: Object-level Image Editing via Referring Expressions
- Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation
- Learning Occupancy for Monocular 3D Object Detection
- Visual Objectification in Films: Towards a New AI Task for Video Interpretation
- A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning
- ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations
- Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer
- InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields
- Text-Image Alignment for Diffusion-Based Perception
- The Manga Whisperer: Automatically Generating Transcriptions for Comics
- ProMark: Proactive Diffusion Watermarking for Causal Attribution
- Navigate Beyond Shortcuts: Debiased Learning Through the Lens of Neural Collapse
- ViewFusion: Towards Multi-View Consistency via Interpolated Denoising
- Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions
- 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation
- GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects
- WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification
- A Theory of Joint Light and Heat Transport for Lambertian Scenes
- Rethinking Inductive Biases for Surface Normal Estimation
- FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
- Tyche: Stochastic In-Context Learning for Medical Image Segmentation
- WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts
- VMINer: Versatile Multi-view Inverse Rendering with Near- and Far-field Light Sources
- SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology
- Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption
- G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images
- Label-Efficient Group Robustness via Out-of-Distribution Concept Curation
- CrowdDiff: Multi-hypothesis Crowd Density Estimation using Diffusion Models
- SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
- Building Vision-Language Models on Solid Foundations with Masked Distillation
- A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning
- R-Cyclic Diffuser: Reductive and Cyclic Latent Diffusion for 3D Clothed Human Digitalization
- Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
- Towards Language-Driven Video Inpainting via Multimodal Large Language Models
- VOODOO 3D: Volumetric Portrait Disentanglement For One-Shot 3D Head Reenactment
- BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image
- Transcriptomics-guided Slide Representation Learning in Computational Pathology
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
- BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
- Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering
- WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights
- Privacy-Preserving Optics for Enhancing Protection in Face De-Identification
- Holistic Autonomous Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Models
- Plug-and-Play Diffusion Distillation
- Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology
- On the Faithfulness of Vision Transformer Explanations
- Unleashing Network Potentials for Semantic Scene Completion
- Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning
- Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts
- CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering
- Fair-VPT: Fair Visual Prompt Tuning for Image Classification
- Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
- CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment
- UniDepth: Universal Monocular Metric Depth Estimation
- Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction
- Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation
- Improved Visual Grounding through Self-Consistent Explanations
- Language Models as Black-Box Optimizers for Vision-Language Models
- Communication-Efficient Federated Learning with Accelerated Client Gradient
- EarthLoc: Astronaut Photography Localization by Indexing Earth from Space
- Revisiting Counterfactual Problems in Referring Expression Comprehension
- AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing
- SignGraph: A Sign Sequence is Worth Graphs of Nodes
- Posterior Distillation Sampling
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
- Tumor Micro-environment Interactions Guided Graph Learning for Survival Analysis of Human Cancers from Whole-slide Pathological Images
- HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields
- Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization
- CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data
- Unsupervised 3D Structure Inference from Category-Specific Image Collections
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering
- An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning
- OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
- MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
- Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models
- Tune-An-Ellipse: CLIP Has Potential to Find What You Want
- Cross-Dimension Affinity Distillation for 3D EM Neuron Segmentation
- Weakly Supervised Monocular 3D Detection with a Single-View Image
- Uncertainty Visualization via Low-Dimensional Posterior Projections
- Learning the 3D Fauna of the Web
- CAD: Photorealistic 3D Generation via Adversarial Distillation
- ScanFormer: Referring Expression Comprehension by Iteratively Scanning
- Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images
- Towards Better Vision-Inspired Vision-Language Models
- Neural Underwater Scene Representation
- EventPS: Real-Time Photometric Stereo Using an Event Camera
- Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation
- Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models
- 3D-LFM: Lifting Foundation Model
- Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion
- Pixel-Aligned Language Model
- MonoCD: Monocular 3D Object Detection with Complementary Depths
- SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
- Slice3D: Multi-Slice Occlusion-Revealing Single View 3D Reconstruction
- Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes
- CityDreamer: Compositional Generative Model of Unbounded 3D Cities
- Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
- EscherNet: A Generative Model for Scalable View Synthesis
- E-GPS: Explainable Geometry Problem Solving via Top-Down Solver and Bottom-Up Generator
- SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes
- Learning Object State Changes in Videos: An Open-World Perspective
- RepViT: Revisiting Mobile CNN From ViT Perspective
- Preserving Fairness Generalization in Deepfake Detection
- Learning Group Activity Features Through Person Attribute Prediction
- LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking
- ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
- Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences
- Harnessing Large Language Models for Training-free Video Anomaly Detection
- PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos
- Step Differences in Instructional Video
- Prompt-Enhanced Multiple Instance Learning for Weakly Supervised Video Anomaly Detection
- Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection
- Selective Interpretable and Motion Consistent Privacy Attribute Obfuscation for Action Recognition
- SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency
- Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
- DiffLoc: Diffusion Model for Outdoor LiDAR Localization
- VLP: Vision Language Planning for Autonomous Driving
- MoST: Multi-Modality Scene Tokenization for Motion Prediction
- Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation
- TULIP: Transformer for Upsampling of LiDAR Point Clouds
- UniMODE: Unified Monocular 3D Object Detection
- Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval
- Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations
- UnO: Unsupervised Occupancy Fields for Perception and Forecasting
- ProTeCt: Prompt Tuning for Taxonomic Open Set Classification
- CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation
- Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous and Instruction-guided Driving
- Action Detection via an Image Diffusion Process
- EASE-DETR: Easing the Competition among Object Queries
- LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection
- Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition
- Video ReCap: Recursive Captioning of Hour-Long Videos
- SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
- Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
- Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
- Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval
- BEVSpread: Spread Voxel Pooling for Bird’s-Eye-View Representation in Vision-based Roadside 3D Object Detection
- GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds
- TransNeXt: Robust Foveal Visual Perception for Vision Transformers
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
- Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge
- Visual Point Cloud Forecasting enables Scalable Autonomous Driving
- PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI
- PTQ4SAM: Post-Training Quantization for Segment Anything
- Matching Anything by Segmenting Anything
- Feedback-Guided Autonomous Driving
- TIM: A Time Interval Machine for Audio-Visual Action Recognition
- Learning Vision from Models Rivals Learning Vision from Data
- MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
- Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
- Depth-Aware Concealed Crop Detection in Dense Agricultural Scenes
- Improving Distant 3D Object Detection Using 2D Box Supervision
- vid-TLDR: Training Free Token Merging for Light-weight Video Transformer
- VicTR: Video-conditioned Text Representations for Activity Recognition
- GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction
- Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
- Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach
- Scaled Decoupled Distillation
- CAT: Exploiting Inter-Class Dynamics for Domain Adaptive Object Detection
- Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning
- ICP-Flow: LiDAR Scene Flow Estimation with ICP
- Implicit Motion Function
- Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection
- Joint-Task Regularization for Partially Labeled Multi-Task Learning
- TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation
- From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding
- Hyperbolic Learning with Synthetic Captions for Open-World Detection
- Context-Aware Integration of Language and Visual References for Natural Language Tracking
- Hyperspherical Classification with Dynamic Label-to-Prototype Assignment
- CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow
- GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation Demonstration and Imitation
- OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation
- RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation
- Active Object Detection with Knowledge Aggregation and Distillation from Large Models
- On the Estimation of Image-matching Uncertainty in Visual Place Recognition
- View From Above: Orthogonal-View aware Cross-view Localization
- Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection
- Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch
- CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection
- Language-driven Grasp Detection
- Dual Prototype Attention for Unsupervised Video Object Segmentation
- Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving
- Learning Correlation Structures for Vision Transformers
- Bézier Everywhere All at Once: Learning Drivable Lanes as Bézier Graphs
- YOLO-World: Real-Time Open-Vocabulary Object Detection
- IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
- End-to-End Spatio-Temporal Action Localisation with Video Transformers
- Weak-to-Strong 3D Object Detection with X-Ray Distillation
- Dense Vision Transformer Compression with Few Samples
- Low-power Continuous Remote Behavioral Localization with Event Cameras
- Learning to Navigate Efficiently and Precisely in Real Environments
- UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model
- ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More
- PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition
- LLMs are Good Action Recognizers
- Optimal Transport Aggregation for Visual Place Recognition
- Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
- SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World
- Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations
- Generalized Predictive Model for Autonomous Driving
- Multi-View Attentive Contextualization for Multi-View 3D Object Detection
- LLMs are Good Sign Language Translators
- Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
- SeMoLi: What Moves Together Belongs Together
- Exploring Orthogonality in Open World Object Detection
- Towards Realistic Scene Generation with LiDAR Diffusion Models
- Learning Transferable Negative Prompts for Out-of-Distribution Detection
- Video Harmonization with Triplet Spatio-Temporal Variation Patterns
- CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation
- Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping
- Frozen Feature Augmentation for Few-Shot Image Classification
- Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
- You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
- Extreme Point Supervised Instance Segmentation
- Towards High-fidelity Artistic Image Vectorization via Texture-Encapsulated Shape Parameterization
- Holodeck: Language Guided Generation of 3D Embodied AI Environments
- Continual Learning for Motion Prediction Model via Meta-Representation Learning and Optimal Memory Buffer Retention Strategy
- FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models
- Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
- Rapid Motor Adaptation for Robotic Manipulator Arms
- Commonsense Prototype for Outdoor Unsupervised 3D Object Detection
- GRAM: Global Reasoning for Multi-Page VQA
- MemFlow: Optical Flow Estimation and Prediction with Memory
- Dense Optical Tracking: Connecting the Dots
- DualAD: Disentangling the Dynamic and Static World for End-to-End Driving
- RoHM: Robust Human Motion Reconstruction via Diffusion
- EventDance: Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition
- 3D Feature Tracking via Event Camera
- CAGE: Controllable Articulation GEneration
- PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness
- Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion
- Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability
- Supervised Anomaly Detection for Complex Industrial Images
- Model Adaptation for Time Constrained Embodied Control
- UniPAD: A Universal Pre-training Paradigm for Autonomous Driving
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
- ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
- CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images
- Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households
- Riemannian Multinomial Logistics Regression for SPD Neural Networks
- MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction
- LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels
- LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation
- NeuRAD: Neural Rendering for Autonomous Driving
- Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers
- Task-Conditioned Adaptation of Visual Features in Multi-Task Policy Learning
- Rethinking Boundary Discontinuity Problem for Oriented Object Detection
- RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features
- A Generative Approach for Wikipedia-Scale Visual Entity Recognition
- Point Segment and Count: A Generalized Framework for Object Counting
- Versatile Navigation Under Partial Observability via Value-guided Diffusion Policy
- NetTrack: Tracking Highly Dynamic Objects with a Net
- RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection
- Logit Standardization in Knowledge Distillation
- Towards Generalizable Multi-Object Tracking
- From Coarse to Fine-Grained Open-Set Recognition
- Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis
- PELA: Learning Parameter-Efficient Models with Low-Rank Approximation
- Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions
- Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation
- Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
- PEM: Prototype-based Efficient MaskFormer for Image Segmentation
- CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition
- MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
- Hybrid Proposal Refiner: Revisiting DETR Series from the Faster R-CNN Perspective
- OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning
- Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
- Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception
- An N-Point Linear Solver for Line and Motion Estimation with Event Cameras
- Retrieval-Augmented Open-Vocabulary Object Detection
- A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives
- Instance-Aware Group Quantization for Vision Transformers
- SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields
- CLIP-KD: An Empirical Study of CLIP Model Distillation
- StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation
- HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
- Gradient Reweighting: Towards Imbalanced Class-Incremental Learning
- Region-Based Representations Revisited
- Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification
- UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather
- Gaussian Splatting SLAM
- Novel Class Discovery for Ultra-Fine-Grained Visual Categorization
- Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning
- SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection
- SnAG: Scalable and Accurate Video Grounding
- Learning for Transductive Threshold Calibration in Open-World Recognition
- SFOD: Spiking Fusion Object Detector
- Test-Time Zero-Shot Temporal Action Localization
- Contrastive Learning for DeepFake Classification and Localization via Multi-Label Ranking
- ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
- VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection
- On Train-Test Class Overlap and Detection for Image Retrieval
- Uncovering What Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
- Looking 3D: Anomaly Detection with 2D-3D Alignment
- Effective Video Mirror Detection with Inconsistent Motion Cues
- Holistic Features are almost Sufficient for Text-to-Video Retrieval
- Adaptive Softassign via Hadamard-Equipped Sinkhorn
- 3D LiDAR Mapping in Dynamic Environments using a 4D Implicit Neural Representation
- PREGO: Online Mistake Detection in PRocedural EGOcentric Videos
- MTLoRA: Low-Rank Adaptation Approach for Efficient Multi-Task Learning
- Seeing the Unseen: Visual Common Sense for Semantic Placement
- Delving into the Trajectory Long-tail Distribution for Muti-object Tracking
- D3still: Decoupled Differential Distillation for Asymmetric Image Retrieval
- LiDAR-based Person Re-identification
- Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes
- Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
- FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models
- DETRs Beat YOLOs on Real-time Object Detection
- A Category Agnostic Model for Visual Rearrangment
- Pixel-level Semantic Correspondence through Layout-aware Representation Learning and Multi-scale Matching Integration
- Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation
- CSTA: CNN-based Spatiotemporal Attention for Video Summarization
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation
- Referring Expression Counting
- Driving Everywhere with Large Language Model Policy Adaptation
- Higher-order Relational Reasoning for Pedestrian Trajectory Prediction
- TransLoc4D: Transformer-based 4D Radar Place Recognition
- Detours for Navigating Instructional Videos
- How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?
- CrossKD: Cross-Head Knowledge Distillation for Object Detection
- Efficient Meshflow and Optical Flow Estimation from Event Cameras
- Producing and Leveraging Online Map Uncertainty in Trajectory Prediction
- OmniViD: A Generative Framework for Universal Video Understanding
- Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching
- Streaming Dense Video Captioning
- Resource-Efficient Transformer Pruning for Finetuning of Large Models
- MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection
- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
- Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
- PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
- Sparse Global Matching for Video Frame Interpolation with Large Motion
- SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model
- Generating Handwritten Mathematical Expressions From Symbol Graphs: An End-to-End Pipeline
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation
- Open-Vocabulary Object 6D Pose Estimation
- Depth-aware Test-Time Training for Zero-shot Video Object Segmentation
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
- Compositional Video Understanding with Spatiotemporal Structure-based Transformers
- OmniParser: A Unified Framework for Text Spotting Key Information Extraction and Table Recognition
- What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
- SIRA: Scalable Inter-frame Relation and Association for Radar Perception
- SPIN: Simultaneous Perception Interaction and Navigation
- Single-Model and Any-Modality for Video Object Tracking
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
- Object Recognition as Next Token Prediction
- LoCoNet: Long-Short Context Network for Active Speaker Detection
- Hyperbolic Anomaly Detection
- Enhancing the Power of OOD Detection via Sample-Aware Model Selection
- MaxQ: Multi-Axis Query for N:M Sparsity Network
- EgoGen: An Egocentric Synthetic Data Generator
- PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar
- A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion
- Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-stage Action Localization
- Towards Robust 3D Object Detection with LiDAR and 4D Radar Fusion in Various Weather Conditions
- Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID
- Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions
- Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
- PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks
- GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation
- CaDeT: a Causal Disentanglement Approach for Robust Trajectory Prediction in Autonomous Driving
- Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models
- Atom-Level Optical Chemical Structure Recognition with Limited Supervision
- Neural Visibility Field for Uncertainty-Driven Active Mapping
- eTraM: Event-based Traffic Monitoring Dataset
- MuRF: Multi-Baseline Radiance Fields
- GARField: Group Anything with Radiance Fields
- FAR: Flexible Accurate and Robust 6DoF Relative Camera Pose Estimation
- Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer
- pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
- Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
- Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement
- Towards Generalizing to Unseen Domains with Few Labels
- LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes
- ReCoRe: Regularized Contrastive Representation Learning of World Model
- NEAT: Distilling 3D Wireframes from Neural Attraction Fields
- Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange
- Brain Decodes Deep Nets
- OpenStreetView-5M: The Many Roads to Global Visual Geolocation
- Robust Depth Enhancement via Polarization Prompt Fusion Tuning
- ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
- LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes
- GS-IR: 3D Gaussian Splatting for Inverse Rendering
- LoS: Local Structure-Guided Stereo Matching
- Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency
- L2B: Learning to Bootstrap Robust Models for Combating Label Noise
- ColorPCR: Color Point Cloud Registration with Multi-Stage Geometric-Color Fusion
- A Simple Recipe for Language-guided Domain Generalized Segmentation
- Robust Synthetic-to-Real Transfer for Stereo Matching
- Federated Online Adaptation for Deep Stereo
- RoMa: Robust Dense Feature Matching
- HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
- Mip-Splatting: Alias-free 3D Gaussian Splatting
- IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images
- Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation
- MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation
- Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
- HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios
- CoGS: Controllable Gaussian Splatting
- GLACE: Global Local Accelerated Coordinate Encoding
- TUMTraf V2X Cooperative Perception Dataset
- Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
- Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset
- Instance Tracking in 3D Scenes from Egocentric Videos
- TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding
- Pre-training Vision Models with Mandelbulb Variations
- 360+x: A Panoptic Multi-modal Scene Understanding Dataset
- Text-Enhanced Data-free Approach for Federated Class-Incremental Learning
- Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling
- 360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
- FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization
- RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception
- NARUTO: Neural Active Reconstruction from Uncertain Target Observations
- Learning to Rank Patches for Unbiased Image Redundancy Reduction
- Edge-Aware 3D Instance Segmentation Network with Intelligent Semantic Prior
- NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation
- GenZI: Zero-Shot 3D Human-Scene Interaction Generation
- Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata
- Convolutional Prompting meets Language Models for Continual Learning
- Text-to-3D using Gaussian Splatting
- Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding
- Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning
- From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation
- GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding
- NeISF: Neural Incident Stokes Field for Geometry and Material Estimation
- XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold
- Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras
- What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs
- Deep Generative Model based Rate-Distortion for Image Downscaling Assessment
- LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry
- GauHuman: Articulated Gaussian Splatting from Monocular Human Videos
- MSU-4S - The Michigan State University Four Seasons Dataset
- Disentangled Prompt Representation for Domain Generalization
- A2XP: Towards Private Domain Generalization
- MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures
- Learning from Synthetic Human Group Activities
- Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields
- Source-Free Domain Adaptation with Frozen Multimodal Foundation Model
- Low-Resource Vision Challenges for Foundation Models
- Distributionally Generative Augmentation for Fair Facial Attribute Classification
- EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
- Insights from the Use of Previously Unseen Neural Architecture Search Datasets
- Rich Human Feedback for Text-to-Image Generation
- Grounding and Enhancing Grid-based Models for Neural Fields
- Aligning Logits Generatively for Principled Black-Box Knowledge Distillation
- Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM
- The Devil is in the Fine-Grained Details: Evaluating Open-Vocabulary Object Detectors for Fine-Grained Understanding
- HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation
- SpecNeRF: Gaussian Directional Encoding for Specular Reflections
- DYSON: Dynamic Feature Space Self-Organization for Online Task-Free Class Incremental Learning
- Efficient Solution of Point-Line Absolute Pose
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
- Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps
- UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes
- CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoor Object Detection from Multi-view Images
- BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
- CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning
- SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild
- Adapters Strike Back
- SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments
- Real-World Mobile Image Denoising Dataset with Efficient Baselines
- NeRF Director: Revisiting View Selection in Neural Volume Rendering
- Map-Relative Pose Regression for Visual Re-Localization
- Unified Language-driven Zero-shot Domain Adaptation
- Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction
- NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation
- ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models
- Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds
- CausalPC: Improving the Robustness of Point Cloud Classification by Causal Effect Identification
- Open-Set Domain Adaptation for Semantic Segmentation
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
- A Bayesian Approach to OOD Robustness in Image Classification
- GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
- Gaussian Shadow Casting for Neural Characters
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
- Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes
- DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes
- Object Dynamics Modeling with Hierarchical Point Cloud-based Representations
- ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks
- Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform
- Perceptual Assessment and Optimization of HDR Image Rendering
- Label Propagation for Zero-shot Classification with Vision-Language Models
- Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination
- Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
- Loopy-SLAM: Dense Neural SLAM with Loop Closures
- S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes
- Adaptive Slot Attention: Object Discovery with Dynamic Slot Number
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos
- Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning
- Spectrum AUC Difference (SAUCD): Human-aligned 3D Shape Evaluation
- Test-Time Linear Out-of-Distribution Detection
- Improving Generalized Zero-Shot Learning by Exploring the Diverse Semantics from External Class Names
- TULIP: Multi-camera 3D Precision Assessment of Parkinson’s Disease
- DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields
- Absolute Pose from One or Two Scaled and Oriented Features
- Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance
- EFHQ: Multi-purpose ExtremePose-Face-HQ dataset
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
- Enhancing Visual Continual Learning with Language-Guided Supervision
- Multiway Point Cloud Mosaicking with Diffusion and Global Optimization
- Non-Rigid Structure-from-Motion: Temporally-Smooth Procrustean Alignment and Spatially-Variant Deformation Modeling
- Instantaneous Perception of Moving Objects in 3D
- Universal Novelty Detection Through Adaptive Contrastive Learning
- Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling
- EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
- DUSt3R: Geometric 3D Vision Made Easy
- Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking
- MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection
- 4K4D: Real-Time 4D View Synthesis at 4K Resolution
- Can Biases in ImageNet Models Explain Generalization?
- Three Pillars Improving Vision Foundation Model Distillation for Lidar
- Multi-Level Neural Scene Graphs for Dynamic Urban Environments
- JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups
- Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
- A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals
- Backpropagation-free Network for 3D Test-time Adaptation
- A Noisy Elephant in the Room: Is Your Out-of-Distribution Detector Robust to Label Noise?
- A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
- StraightPCF: Straight Point Cloud Filtering
- Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes
- ESCAPE: Encoding Super-keypoints for Category-Agnostic Pose Estimation
- DAVE - A Detect-and-Verify Paradigm for Low-Shot Counting
- Fully Geometric Panoramic Localization
- Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach
- D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection
- Improving Graph Contrastive Learning via Adaptive Positive Sampling
- CG-HOI: Contact-Guided 3D Human-Object Interaction Generation
- NeRFiller: Completing Scenes via Generative 3D Inpainting
- CORE-MPI: Consistency Object Removal with Embedding MultiPlane Image
- GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
- Learning to Produce Semi-dense Correspondences for Visual Localization
- Neural Refinement for Absolute Pose Regression with Feature Synthesis
- VGGSfM: Visual Geometry Grounded Deep Structure From Motion
- Differentiable Neural Surface Refinement for Modeling Transparent Objects
- NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis
- PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling
- GART: Gaussian Articulated Template Models
- Improving Depth Completion via Depth Feature Upsampling
- MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors
- Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation
- GEARS: Local Geometry-aware Hand-object Interaction Synthesis
- GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
- PAPR in Motion: Seamless Point-level 3D Scene Interpolation
- MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video
- OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees
- MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers
- Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation
- MLP Can Be A Good Transformer Learner
- Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning
- MVCPS-NeuS: Multi-view Constrained Photometric Stereo for Neural Surface Reconstruction
- NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning
- 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
- Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It
- SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field
- TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations
- Learning Equi-angular Representations for Online Continual Learning
- DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes
- Deep Imbalanced Regression via Hierarchical Classification Adjustment
- Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-Training via Differentiable Rendering of Line Segments
- Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation
- Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields
- HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces
- InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning
- LangSplat: 3D Language Gaussian Splatting
- TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model
- Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering
- MatSynth: A Modern PBR Materials Dataset
- Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields
- DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
- GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering
- Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery
- Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration
- CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers
- Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation
- Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers
- DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF
- SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors
- Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling
- GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
- FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations
- MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark
- How to Train Neural Field Representations: A Comprehensive Study and Benchmark
- Compact 3D Gaussian Representation for Radiance Field
- GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding
- Systematic Comparison of Semi-supervised and Self-supervised Learning for Medical Image Classification
- Improving Plasticity in Online Continual Learning via Collaborative Learning
- OneFormer3D: One Transformer for Unified Point Cloud Segmentation
- OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
- COLMAP-Free 3D Gaussian Splatting
- Hierarchical Correlation Clustering and Tree Preserving Embedding
- Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias
- DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
- 3DInAction: Understanding Human Actions in 3D Point Clouds
- How Far Can We Compress Instant-NGP-Based NeRF?
- CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning
- Traffic Scene Parsing through the TSP6K Dataset
- Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces
- Dynamic LiDAR Re-simulation using Compositional Neural Fields
- DeMatch: Deep Decomposition of Motion Field for Two-View Correspondence Learning
- A Versatile Framework for Continual Test-Time Domain Adaptation: Balancing Discriminability and Generalizability
- PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion
- WinSyn: A High Resolution Testbed for Synthetic Data
- 3D Neural Edge Reconstruction
- SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM
- GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection
- Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains
- NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation
- What How and When Should Object Detectors Update in Continually Changing Test Domains?
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
- FISBe: A Real-World Benchmark Dataset for Instance Segmentation of Long-Range Thin Filamentous Structures
- Classes Are Not Equal: An Empirical Study on Image Recognition Fairness
- Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation
- FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions
- AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis
- Unveiling the Unknown: Unleashing the Power of Unknown to Known in Open-Set Source-Free Domain Adaptation
- Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis
- Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening
- Overload: Latency Attacks on Object Detection for Edge Devices
- UFC-Net: Unrolling Fixed-point Continuous Network for Deep Compressive Sensing
- PAD: Patch-Agnostic Defense against Adversarial Patch Attacks
- Data-Efficient Multimodal Fusion on a Single GPU
- Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
- View-Category Interactive Sharing Transformer for Incomplete Multi-View Multi-Label Learning
- PerceptionGPT: Effectively Fusing Visual Perception into LLM
- Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation
- Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection
- Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement
- Scalable 3D Registration via Truncated Entry-wise Absolute Residuals
- Audio-Visual Segmentation via Unlabeled Frame Exploitation
- Distilling Semantic Priors from SAM to Efficient Image Restoration Models
- Amodal Ground Truth and Completion in the Wild
- Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities
- SonicVisionLM: Playing Sound with Vision Language Models
- DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models
- Semantics Distortion and Style Matter: Towards Source-free UDA for Panoramic Segmentation
- AV-RIR: Audio-Visual Room Impulse Response Estimation
- Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution
- ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation
- ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models
- Cyclic Learning for Binaural Audio Generation and Localization
- Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing
- DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning
- Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
- Data Poisoning based Backdoor Attacks to Contrastive Learning
- Multi-Task Dense Prediction via Mixture of Low-Rank Experts
- DART: Implicit Doppler Tomography for Radar Novel View Synthesis
- Relational Matching for Weakly Semi-Supervised Oriented Object Detection
- Dispersed Structured Light for Hyperspectral 3D Imaging
- Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
- No More Ambiguity in 360° Room Layout via Bi-Layout Estimation
- Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance
- From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers
- TurboSL: Dense Accurate and Fast 3D by Neural Inverse Structured Light
- Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation
- MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
- SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing
- Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation
- Beyond Average: Individualized Visual Scanpath Prediction
- Language-guided Image Reflection Separation
- EGTR: Extracting Graph from Transformer for Scene Graph Generation
- In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging
- Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now
- AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring
- DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly
- FedMef: Towards Memory-efficient Federated Dynamic Pruning
- Diff-BGM: A Diffusion Model for Video Background Music Generation
- KVQ: Kwai Video Quality Assessment for Short-form Videos
- LAN: Learning to Adapt Noise for Image Denoising
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution
- S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data
- SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection
- 1-Lipschitz Layers Compared: Memory Speed and Certifiable Robustness
- SinSR: Diffusion-Based Image Super-Resolution in a Single Step
- Latency Correction for Event-guided Deblurring and Frame Interpolation
- OneLLM: One Framework to Align All Modalities with Language
- Analyzing and Improving the Training Dynamics of Diffusion Models
- Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
- Single View Refractive Index Tomography with Neural Fields
- Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
- Deep Video Inverse Tone Mapping Based on Temporal Clues
- Neural Spline Fields for Burst Image Fusion and Layer Separation
- Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
- Latent Modulated Function for Computational Optimal Continuous Image Representation
- PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor
- Improving Generalization via Meta-Learning on Hard Samples
- Boosting Flow-based Generative Super-Resolution Models via Learned Prior
- Task-Driven Wavelets using Constrained Empirical Risk Minimization
- SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder
- Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
- LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning
- Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment
- Unsupervised Deep Unrolling Networks for Phase Unwrapping
- Seeing Motion at Nighttime with an Event Camera
- Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement
- Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation
- Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans
- Osprey: Pixel Understanding with Visual Instruction Tuning
- SeD: Semantic-Aware Discriminator for Image Super-Resolution
- APISR: Anime Production Inspired Real-World Anime Super-Resolution
- Learning to Remove Wrinkled Transparent Film with Polarized Prior
- DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning
- AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
- SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation
- MoDE: CLIP Data Experts via Clustering
- G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping
- Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
- Generative Multi-modal Models are Good Class Incremental Learners
- Open-Vocabulary Semantic Segmentation with Image Embedding Balancing
- WaveMo: Learning Wavefront Modulations to See Through Scattering
- Federated Generalized Category Discovery
- SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
- Generative Image Dynamics
- Look-Up Table Compression for Efficient Image Restoration
- Language-driven All-in-one Adverse Weather Removal
- Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection
- Projecting Trackable Thermal Patterns for Dynamic Computer Vision
- Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks Methods and Applications
- Multi-modal Learning for Geospatial Vegetation Forecasting
- Boosting Adversarial Training via Fisher-Rao Norm-based Regularization
- Partial-to-Partial Shape Matching with Geometric Consistency
- Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness
- Text-guided Explorable Image Super-resolution
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models
- SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
- Close Imitation of Expert Retouching for Black-and-White Photography
- MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation
- Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training
- FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
- GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation
- Snapshot Lidar: Fourier Embedding of Amplitude and Phase for Single-Image Depth Reconstruction
- Looking Similar Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
- UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
- SHAP-EDITOR: Instruction-Guided Latent 3D Editing in Seconds
- Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models
- Efficient Model Stealing Defense with Noise Transition Matrix
- CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning
- Equivariant Plug-and-Play Image Reconstruction
- Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow
- Prompt Learning via Meta-Regularization
- Instance-based Max-margin for Practical Few-shot Recognition
- Accept the Modality Gap: An Exploration in the Hyperbolic Space
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
- Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
- Navigating Beyond Dropout: An Intriguing Solution towards Generalizable Image Super Resolution
- SlowFormer: Adversarial Attack on Compute and Energy Consumption of Efficient Vision Transformers
- AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
- Regressor-Segmenter Mutual Prompt Learning for Crowd Counting
- Discriminability-Driven Channel Selection for Out-of-Distribution Detection
- OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
- AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation
- SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks
- MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model
- Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships
- CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation
- Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers
- HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
- SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
- Active Prompt Learning in Vision Language Models
- Discovering Syntactic Interaction Clues for Human-Object Interaction Detection
- DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model
- Understanding and Improving Source-free Domain Adaptation from a Theoretical Perspective
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
- Physical Property Understanding from Language-Embedded Feature Fields
- MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation
- Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models
- Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball
- CFAT: Unleashing Triangular Windows for Image Super-resolution
- NAPGuard: Towards Detecting Naturalistic Adversarial Patches
- A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction
- Revisiting Adversarial Training Under Long-Tailed Distributions
- Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
- LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
- Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding
- On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
- Are Conventional SNNs Really Efficient? A Perspective from Network Quantization
- Coherent Temporal Synthesis for Incremental Action Segmentation
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification
- Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance
- DAP: A Dynamic Adversarial Patch for Evading Person Detectors
- Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
- Adversarially Robust Few-shot Learning via Parameter Co-distillation of Similarity and Class Concept Learners
- ParamISP: Learned Forward and Inverse ISPs using Camera Parameters
- PanoContext-Former: Panoramic Total Scene Understanding with a Transformer
- T-VSL: Text-Guided Visual Sound Source Localization in Mixtures
- Parameter Efficient Self-Supervised Geospatial Domain Adaptation
- Dual-Scale Transformer for Large-Scale Single-Pixel Imaging
- DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models
- Gradient-based Parameter Selection for Efficient Fine-Tuning
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
- Initialization Matters for Adversarial Transfer Learning
- Time-Efficient Light-Field Acquisition Using Coded Aperture and Events
- Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning
- Efficient Hyperparameter Optimization with Adaptive Fidelity Identification
- Event-based Visible and Infrared Fusion via Multi-task Collaboration
- HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes
- Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments
- Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping
- EVS-assisted Joint Deblurring Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling
- Dynamic Prompt Optimizing for Text-to-Image Generation
- Towards Calibrated Multi-label Deep Neural Networks
- Empowering Resampling Operation for Ultra-High-Definition Image Enhancement with Model-Aware Guidance
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
- Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
- LEAD: Exploring Logit Space Evolution for Model Selection
- Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
- PixelRNN: In-pixel Recurrent Neural Networks for End-to-end–optimized Perception with Neural Sensors
- NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models
- Fine-Grained Bipartite Concept Factorization for Clustering
- Generalized Event Cameras
- Single Domain Generalization for Crowd Counting
- Learning with Structural Labels for Learning with Noisy Labels
- Revisiting Adversarial Training at Scale
- Learning Inclusion Matching for Animation Paint Bucket Colorization
- VILA: On Pre-training for Visual Language Models
- Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI
- Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
- SPECAT: SPatial-spEctral Cumulative-Attention Transformer for High-Resolution Hyperspectral Image Reconstruction
- CPP-Net: Embracing Multi-Scale Feature Fusion into Deep Unfolding CP-PPA Network for Compressive Sensing
- Online Task-Free Continual Generative and Discriminative Learning via Dynamic Cluster Memory
- Alchemist: Parametric Control of Material Properties with Diffusion Models
- CurveCloudNet: Processing Point Clouds with 1D Structure
- Transfer CLIP for Generalizable Image Denoising
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
- An Aggregation-Free Federated Learning for Tackling Data Heterogeneity
- MRFS: Mutually Reinforcing Image Fusion and Segmentation
- MuGE: Multiple Granularity Edge Detection
- OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning
- OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning
- Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing
- Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification
- Learned Lossless Image Compression based on Bit Plane Slicing
- Long-Tail Class Incremental Learning via Independent Sub-prototype Construction
- NB-GTR: Narrow-Band Guided Turbulence Removal
- SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
- Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization
- LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network
- ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
- MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning
- X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
- Active Domain Adaptation with False Negative Prediction for Object Detection
- OMG-Seg: Is One Model Good Enough For All Segmentation?
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
- Towards Backward-Compatible Continual Learning of Image Compression
- Towards Robust Learning to Optimize with Theoretical Guarantees
- Coherence As Texture – Passive Textureless 3D Reconstruction by Self-interference
- Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning
- Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation
- MonoHair: High-Fidelity Hair Modeling from a Monocular Video
- Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing
- Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring
- Unsupervised Blind Image Deblurring Based on Self-Enhancement
- Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples
- Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack
- MAFA: Managing False Negatives for Vision-Language Pre-training
- Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings
- Draw Step by Step: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion
- Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging
- Disentangled Pre-training for Human-Object Interaction Detection
- Simple Semantic-Aided Few-Shot Learning
- CAMixerSR: Only Details Need More "Attention"
Receptions
Remarks
Tutorials
- Deep Stereo Matching in the Twenties
- Disentanglement and Compositionality in Computer Vision
- Machine Unlearning in Computer Vision: Foundations and Applications
- SCENIC: An Open-Source Probabilistic Programming System for Data Generation and Safety in AI-Based Autonomy
- Recent Advances in Vision Foundation Models
- Object-centric Representations in Computer Vision
- Robustness at Inference: Towards Explainability, Uncertainty, and Intervenability
- Efficient Homotopy Continuation for Solving Polynomial Systems in Computer Vision Applications
- Geospatial Computer Vision and Machine Learning for Large-Scale Earth Observation Data
- Edge AI in Action: Practical Approaches to Developing and Deploying Optimized Models
- Edge-Optimized Deep Learning: Harnessing Generative AI and Computer Vision with Open-Source Libraries
- 3D/4D Generation and Modeling with Generative Priors
- Contactless AI Healthcare using Cameras and Wireless Sensors
- Computational Design of Diverse Morphologies and Sensors for Vision and Robotics
- Learning Deep Low-dimensional Models from High-Dimensional Data: From Theory to Practice
- All You Need To Know About Point Cloud Understanding
- All You Need to Know about Self-Driving
- Towards Building AGI in Autonomy and Robotics
- End-to-End Autonomy: A New Era of Self-Driving
- From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond
- Full-Stack, GPU-based Acceleration of Deep Learning
- Diffusion-based Video Generative Models
- Unifying Graph Neural Networks across Spatial and Spectral Domains
Workshops
- Computer Vision for Mixed Reality
- Domain adaptation, Explainability and Fairness in AI for Medical Image Analysis (DEF-AI-MIA)
- Efficient Large Vision Models
- 8th AI City Challenge
- Multimodal Algorithmic Reasoning Workshop
- SyntaGen: Harnessing Generative Models for Synthetic Visual Datasets
- The 5th Face Anti-Spoofing Workshop
- The 7th Workshop and Challenge Bridging the Gap between Computational Photography and Visual Recognition (UG2+)
- 4th Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling
- The Fifth Workshop on Fair, Data-efficient, and Trusted Computer Vision
- 2nd Workshop on Multimodal Content Moderation
- MetaFood Workshop (MTF)
- AI for 3D Generation
- 2nd Workshop on Scene Graphs and Graph Representation Learning
- ViLMa – Visual Localization and Mapping
- 1st Workshop on Dataset Distillation for Computer Vision
- VAND 2.0: Visual Anomaly and Novelty Detection
- Workshop on Computer Vision for Fashion, Art, and Design
- AI for Content Creation (AI4CC)
- New Challenges in 3D Human Understanding
- First Joint Egocentric Vision (EgoVis) Workshop
- First Workshop on Efficient and On-Device Generation (EDGE)
- 2nd Workshop on Foundation Models
- The 4th Workshop of Adversarial Machine Learning on Computer Vision: Robustness of Foundation Models
- 1st Workshop on Urban Scene Modeling: Where Vision Meets Photogrammetry and Graphics
- Second Workshop for Learning 3D with Multi-View Supervision
- AI4Space 2024
- 7th Workshop on Autonomous Driving (WAD)
- Foundation Models for Autonomous Systems
- Image Matching: Local Features and Beyond
- 2nd Workshop on Embodied "Humans": Symbiotic Intelligence between Virtual Humans and Humanoid Robots
- Data Curation and Augmentation in Enhancing Medical Imaging Applications
- GenAI Media Generation Challenge for Computer Vision Workshop
- The Seventh International Workshop on Computer Vision for Physiological Measurement (CVPM)
- Workshop on Virtual Try-On
- Workshop on Graphic Design Understanding and Generation (GDUG)
- Fifth Workshop on Neural Architecture Search
- Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture
- VizWiz Grand Challenge: Describing Images and Videos Taken by Blind People
- Computer Vision for Materials Science Workshop
- The Future of Generative Visual Art
- Women in Computer Vision
- LatinX in Computer Vision Research Workshop
- The 5th Omnidirectional Computer Vision Workshop
- Third Workshop of Mobile Intelligent Photography & Imaging
- The 3rd Explainable AI for Computer Vision (XAI4CV) Workshop
- Workshop on Responsible Data
- GAZE 2024: The 6th International Workshop on Gaze Estimation and Prediction in the Wild
- RetailVision - Field Overview and Amazon Deep Dive
- 10th IEEE International Workshop on Computer Vision in Sports (CVsports)
- Equivariant Vision: From Theory to Practice
- 5th Workshop on Continual Learning in Computer Vision (CLVISION)
- 2nd Workshop on Compositional 3D Vision
- Visual Perception via Learning in an Open World
- 7th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues
- Data-Driven Autonomous Driving Simulation (DDASD)
- Synthetic Data for Computer Vision
- Workshop on Human Motion Generation
- 9th Workshop on Computer Vision for Microscopy Image Analysis
- 2nd Workshop on Generative Models for Computer Vision
- ReGenAI: First Workshop on Responsible Generative AI
- The 5th Annual Embodied AI Workshop
- Towards 3D Foundation Models: Progress and Prospects
- 7th MUltimodal Learning and Applications
- Vision and Language for Autonomous Driving and Robotics (VLADR)
- 4th Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings
- The Sixth Workshop on Deep Learning for Geometric Computing (DLGC 2024)
- The First Workshop on the Evaluation of Generative Foundation Models
- 20th Workshop on Perception Beyond the Visible Spectrum
- Embedded Vision Workshop
- 5th Workshop on Robot Visual Perception in Human Crowded Environments
- 6th Workshop and Competition on Affective Behavior Analysis in-the-wild
- (3rd) Monocular Depth Estimation Challenge
- Learning from Procedural Videos and Language: What is Next?