CVPR 2026 Tutorials

Tutorial

The Principles of Diffusion Models: Real-Time Continuous & Discrete Diffusion

Chieh-Hsin Lai · Subham Sahoo · Dongjun Kim · Yang Song · Yuki Mitsufuji · Stefano Ermon

Wed AM 301/302

In recent years, diffusion models have become a central paradigm in computer vision, powering advances in image synthesis, editing, and video generation. However, existing tutorials are often fragmented, focusing either on specific applications or isolated methodological perspectives without a unifying framework. This tutorial aims to present a principle-driven view of diffusion models by distilling their foundations into a small set of core ideas that unify variational, score-based, and flow-based approaches. It further emphasizes emerging directions in real-time generation through flow-map models, which enable fast and interactive visual applications. In addition, the tutorial extends this framework to discrete and tokenized diffusion models, highlighting their role in bridging continuous vision generation with multimodal and structured representations.

Tutorial

Tom Builds, Tom Breaks: Hands-On Attacks and Defenses for Vision-Language Systems

Pavan Reddy

Wed AM Mile High 2B

Vision-language models are increasingly deployed in real-world systems where images can directly influence decisions and actions, creating new security risks beyond traditional text-based attacks. This tutorial provides a hands-on introduction to attacks and defenses for vision-language systems, using a practical, end-to-end workflow that mirrors real deployment scenarios. It covers a range of vulnerabilities, including visual jailbreaks, preprocessing-induced attacks, adversarial perturbations, backdoored models, and data poisoning, along with corresponding mitigation strategies. Through interactive examples and reproducible notebooks, the tutorial emphasizes how these threats manifest in practice and how to build robust, auditable systems for multimodal AI.

Tutorial

Towards Safe Multi-Modal Learning: Evolving Threats and Safety Solutions

Xi Li · Manling Li · Muchao Ye

Wed AM Mile High 3C

Multi-modal learning has enabled powerful systems that combine text, images, audio, and video for perception, reasoning, and decision-making. At the same time, it has introduced safety challenges that differ fundamentally from those in traditional uni-modal learning. This tutorial presents a structured overview of the evolving safety landscape in multi-modal AI, focusing on emerging threat models and corresponding defense strategies. It examines key risks such as compromised modality integration, modality misalignment, and fused cross-modal vulnerabilities, and reviews recent work on adversarial attacks, jailbreaks, hallucinations, and safety solutions for more reliable multi-modal systems.

Tutorial

Edge AI in Action: Mastering On-Device Inference

Fabricio Batista Narcizo · Elizabete Munzlinger · Sai Narsi Reddy Donthi Reddy · Shan Ahmed Shaffi

Wed AM 702

Edge AI enables real-time, low-latency inference directly on devices, but achieving high performance and efficiency requires specialized optimization and deployment techniques tailored to heterogeneous hardware. This tutorial provides a hands-on guide to on-device inference, focusing on end-to-end workflows for optimizing and deploying deep learning models on leading edge platforms such as Qualcomm Snapdragon and NVIDIA Jetson. It covers key techniques including model compression, quantization, and hardware-aware optimization, along with practical tools such as SNPE and TensorRT. Through comparative analysis and real-world case studies, the tutorial highlights best practices for achieving efficient, low-latency performance in applications ranging from computer vision to multimodal AI.

Tutorial

Accelerated Diffusion Models: From Theory to Interactive World Models

Julius Berner · Weili Nie · Arash Vahdat

Wed AM 201

Diffusion models have become a cornerstone of modern generative modeling, but their practical deployment in interactive applications is often limited by slow and computationally expensive sampling. This tutorial focuses on recent advances in accelerating diffusion models, providing a comprehensive overview of methods that enable fast and efficient generation. It covers general acceleration techniques, training-based approaches such as distillation into few-step samplers, and practical strategies for scaling to image and video generation. The tutorial further highlights how these advances enable emerging applications such as interactive world models and real-time generative systems, and provides hands-on guidance through the FastGen library.

Tutorial

Building GenAI based Simulation Environment for End-to-End Autonomous Driving

Henry Liu · Howie Sun · Jun Gao · Shuo Feng · Xintao Yan · Jiawei Wang

Wed PM 201

Generative AI is transforming simulation for autonomous driving, enabling data-driven and closed-loop environments that better capture the complexity of real-world scenarios. This tutorial presents an end-to-end framework for building generative simulation pipelines tailored to modern learning-based driving systems. It covers key components including world modeling and city-scale digital twins, generative synthesis of rare and safety-critical scenarios, and realistic sensor and video simulation using both graphics and neural approaches. The tutorial further discusses system-level evaluation and integration with autonomous driving stacks, providing practical guidance and open-source tools for developing scalable and reliable simulation environments.

Tutorial

Principled Interpretability in Vision Models: From Mechanistic Understanding to Interpretable Models by Design

Tsui-Wei (Lily) Weng · Tuomas Oikarinen

Wed PM Mile High 3C

As deep learning systems are increasingly deployed in high-stakes applications, understanding their internal behavior is essential for ensuring trust, safety, and reliability. However, the field of interpretability remains fragmented, spanning diverse methods without a unified framework or standardized evaluation. This tutorial aims to provide a comprehensive overview of interpretability in vision models, bridging post-hoc mechanistic analysis with approaches that design inherently interpretable models. It reviews techniques for analyzing neural networks at multiple levels—from individual neurons to circuits—alongside recent advances in evaluating the faithfulness of explanations. In addition, the tutorial covers emerging methods for learning interpretable models by design, such as concept-based approaches, and highlights practical applications in debugging, model editing, and safety auditing.

Tutorial

From Perception to Simulation: The Emergence of World Models in Multi-modal Reasoning

Yujun Cai · Jianfei Cai · Yiwei Wang · Ming-Hsuan Yang

Wed PM 301/302

World models are emerging as a new paradigm in computer vision and multimodal learning, enabling systems to move beyond perception toward reasoning, simulation, and decision-making. This tutorial explores how world models have evolved from predictive frameworks into engines for multi-modal reasoning, capable of simulating environments, supporting counterfactual thinking, and enabling planning. It examines key approaches for learning world dynamics from visual data, including both discrete tokenization and diffusion-based methods, and highlights their role in modeling physical and causal structure. The tutorial further covers how these models support reasoning through simulation, as well as their applications in embodied agents and robotics, while discussing key challenges such as grounding, scalability, and causal understanding.

Tutorial

Monte Carlo physical simulation

Rohan Sawhney · Bailey Miller · Ioannis Gkioulekas · Keenan Crane

Wed PM 702

Partial differential equations (PDEs) play a central role in physics-based modeling across vision, graphics, and robotics, but conventional grid-based solvers often struggle with scalability and complex geometry. This tutorial introduces grid-free Monte Carlo methods for solving PDEs, focusing on algorithms such as walk on spheres and walk on stars that eliminate the need for spatial discretization. It presents the theoretical foundations of these methods alongside practical techniques for efficient sampling, variance reduction, and differentiable simulation. The tutorial also highlights applications in vision and robotics, including inverse problems and physics-based learning, and provides hands-on guidance for implementing Monte Carlo PDE solvers in real-world systems.

Tutorial

3D Human Mesh Modeling and Recovery from RGB and LiDAR

Romain Bregier · Istvan Sarandi · Salma Galaaoui · Fabien Baradel · Nermin Samet · David Picard

Wed PM Mile High 2B

Understanding human pose and shape through parametric body models is a key enabler of applications from AR/VR and sports analysis to human-robot interaction. This tutorial provides an in-depth overview of parametric body models and their role in Human Mesh Recovery. We cover fundamental principles and recent developments, guiding practitioners through major models (e.g., SMPL, Anny, MHR, SOMA) and their trade-offs. We then present state-of-the-art Human Mesh Recovery methods, with a focus on challenging in-the-wild settings across different input modalities, including single- and multi-view RGB, video, depth, and LiDAR.

Tutorial

The Full Stack of Physical AI: Simulation, Foundation Models, and Edge Deployment for Next-Generation Robotics Applications

Raymond Lo · Johnny Nunez · Chitoku Yato · Spencer Huang · Mitesh Patel

Thu AM Mile High 2B

Physical AI systems, including robotics and autonomous platforms, require tightly integrated pipelines spanning data collection, model training, and real-time deployment. This tutorial presents a full-stack perspective on building such systems, covering simulation-based data generation, foundation models for robot control, and deployment on edge hardware. It introduces practical workflows using modern tools for human-in-the-loop data collection, multimodal robot foundation models, and hardware-aware optimization for low-latency inference. The tutorial further highlights challenges in scaling and deploying physical AI systems, providing attendees with actionable guidance and open-source resources for end-to-end robotics development.

Tutorial

Recent Advances in AI for Medical Imaging: Progress, Challenges, and Future Directions

Jiaqi Wang · Peirong Liu · Can Zhao

Thu AM 201

Artificial intelligence has driven significant advances in medical imaging, improving tasks such as image reconstruction, diagnosis, and clinical decision support across modalities including MRI, CT, X-ray, and pathology. This tutorial provides an up-to-date overview of key paradigms in the field, including physics-informed learning, medical foundation models, and collaborative approaches such as federated and multi-agent systems. It examines major challenges such as generalization, interpretability, data heterogeneity, and privacy constraints, while highlighting emerging solutions and open research directions. The tutorial aims to offer a comprehensive perspective on the development and deployment of reliable, clinically relevant AI systems for medical imaging.

Tutorial

Computer Vision at Scale: Multi-Camera Tracking, Calibration, and Event Detection for Checkout-Free Retail

Hareesh Kolluru · Motilal Agarwal · Tanmay Bangalore

Thu AM 203

Large-scale multi-camera systems are central to real-world computer vision applications, yet their design is shaped as much by infrastructure constraints as by algorithmic advances. This tutorial presents a unified perspective on multi-camera vision through the lens of checkout-free retail, focusing on three core components: automatic camera calibration, real-time multi-object tracking, and structured event detection. It examines how challenges such as asynchrony, partial observability, hardware failures, and edge deployment constraints influence system design and performance. The tutorial further highlights generalizable principles for building reliable, scalable vision systems, bridging the gap between academic methods and real-world deployment.

Tutorial

Extending Computer Vision to Hidden Objects: A Tutorial on Millimeter-Wave Imaging and Reconstruction of Occluded Scenes

Mingmin Zhao · Laura Dodds

Thu AM 702

Millimeter-wave (mmWave) sensing is emerging as a new modality for computer vision, enabling perception of objects and scenes that are occluded or invisible to traditional cameras. This tutorial introduces the fundamentals of mmWave imaging, highlighting how its physical properties enable through-occlusion sensing and all-weather perception. It covers both classical signal-processing approaches and recent learning-based methods for 3D reconstruction, segmentation, and scene understanding. The tutorial further provides practical guidance on datasets, tools, and open challenges, offering a comprehensive entry point for researchers interested in extending vision systems beyond visible light.

Tutorial

Analytic understanding of diffusion models

Artem Lukoianov · Chenyang Yuan · Christopher Scarvelis · Mason Kamb

Thu Full Day Mile High 3C

Diffusion models achieve state-of-the-art performance in generative modeling, yet their theoretical foundations and generalization behavior remain poorly understood. This tutorial focuses on the analytical understanding of diffusion models, addressing the apparent paradox between closed-form optimal denoisers and the empirical success of deep diffusion networks. It introduces recent theoretical advances that explain how mechanisms such as score smoothing, training dynamics, neural network inductive biases, and data structure contribute to generalization. By combining mathematical insights with hands-on experiments, the tutorial provides a principled framework for understanding the inner workings of diffusion models and for interpreting recent developments in the field.

Tutorial

All You Need To Know About Self-Driving

Raquel Urtasun · Abbas Sadat · Sivabalan Manivasagam · Jingkang Wang · Ioan Andrei Barsan

Thu Full Day 301/302

Autonomous driving has evolved into a complex, full-stack problem that integrates perception, prediction, planning, simulation, and safety within a unified system. This tutorial provides a comprehensive overview of modern self-driving pipelines, covering both traditional modular approaches and emerging end-to-end paradigms. It reviews key components including sensor systems, multi-modal perception, motion forecasting, planning and control, large-scale simulation, and data-centric development. The tutorial further highlights recent advances such as foundation models, generative simulation, and real-world deployment at scale, offering a unified perspective on current challenges and future directions in self-driving systems.

Tutorial

The Road to Convergence: Evolution of Unified Multimodal Models

Jindong Wang · Hao Chen · Jiakui Hu · Zhaolong Su · Sharon Li

Thu PM 201

Unified multimodal models are emerging as a new paradigm that integrates understanding and generation across modalities within a single foundation model. This tutorial provides a comprehensive overview of these models, addressing the currently fragmented landscape of architectures, representations, and training strategies. It introduces a unified perspective on key design choices, including modeling paradigms, multimodal tokenization, and alignment methods, while reviewing benchmarks and real-world applications. The tutorial further highlights open challenges such as scalable representation learning and unified world modeling, offering a structured roadmap for future research in multimodal AI.

Tutorial

Foundations and Frontiers of Watermarking: Algorithms, Multimodal Extensions, Benchmarks, and Authenticity Frameworks

Vishal Asnani · Shruti Agarwal · Benedetta Tondi · Pierre Fernandez · Furong Huang

Thu PM Mile High 2B

Watermarking has re-emerged as a critical component of trustworthy AI, driven by the rapid growth of generative models and the need for content attribution and authenticity. This tutorial provides a unified overview of watermarking, spanning classical signal-processing foundations and modern deep-learning–based approaches across images, video, audio, and multimodal data. It examines key challenges such as robustness, capacity, and adversarial resilience, along with recent benchmarking efforts and evaluation frameworks. The tutorial further connects these methods to real-world deployment through applications in content provenance, media forensics, and emerging standards such as C2PA, offering a comprehensive perspective on building reliable and transparent media systems.

Tutorial

From Perception to Action: Building Efficient and Deployable Robot Intelligence Pipelines with Open-Source Edge AI Toolkits

Samet Akcay · Zhuo Wu · Michael Paulitsch · Ashutosh Kumar · Tao Xiong · Adrian Boguszewski · Sameer Sheorey · Benjamin Ummenhofer

Thu PM 702

Robotic manipulation has become a key application of embodied AI, but many research pipelines remain difficult to reproduce and deploy in real-world systems. This tutorial presents an end-to-end, open-source workflow for building efficient robot intelligence pipelines, covering data collection, visuomotor policy training, simulation, and deployment on edge hardware. It introduces practical techniques such as teleoperated data acquisition, diffusion- and transformer-based policies, and neural object cloning for simulation-ready assets. The tutorial further emphasizes model optimization and real-time deployment, culminating in a live demonstration of a complete perception-to-action pipeline on an affordable robotic platform.
