


Tutorials
Tutorial
Mike Zheng Shou · Jay Zhangjie Wu · Deepti Ghadiyaram
Abstract

The introduction of diffusion models has had a profound impact on video creation, democratizing a wide range of applications, sparking startups, and leading to innovative products. This tutorial offers an in-depth exploration of diffusion-based video generative models, a field that stands at the forefront of creativity. We expect a wide range of attendees. For students, researchers, and practitioners eager to enter and contribute to this domain, we will help them acquire the necessary knowledge, understand the challenges, and choose a promising research direction. Our tutorial is also open to video creators and enthusiasts, helping them harness the power of video diffusion models to craft visually stunning and innovative videos.

Tutorial
Maying Shen · Danny Yin · Jason Clemons · Pavlo Molchanov · Jan Kautz · Jose M. Alvarez
Abstract

This tutorial focuses on describing techniques that allow deep learning practitioners to accelerate the training and inference of large deep networks while also reducing memory requirements across a spectrum of off-the-shelf hardware, for important applications such as autonomous driving and large language models. Topics include, but are not limited to:
- Deep learning specialized hardware overview. We review the architecture of the most used deep learning acceleration hardware, including the main computational processors and memory modules. We will also cover aspects of algorithmic intensity and an overview of theoretical aspects of computing.
- Best practices for acceleration. We provide an overview of best practices for designing efficient neural networks, including channel number selection, compute-heavy operations, and reduction operations, among others.
- Existing tools for model acceleration. In this part we will focus on existing tools to accelerate a trained neural network on GPU devices. We will particularly discuss operation folding, TensorRT, ONNX graph optimization, and sparsity.
- Foundation models. Here we will focus on best practices for training and deploying foundation models efficiently.
- Research overview of recent techniques. In the last part, we will focus on recent advanced techniques for post-training model optimization, including pruning, quantization, model distillation, and NAS, among others.
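As a small, hedged illustration of the deployment-tooling portion of this pipeline, the sketch below exports a pretrained PyTorch model to ONNX, a common first step before graph optimization or TensorRT-style conversion; the model choice, file name, and input shape are illustrative assumptions, not materials from the tutorial.

```python
# Minimal sketch (assumed model/paths): exporting a PyTorch model to ONNX as a
# starting point for graph optimization or accelerator-specific deployment.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # any trained model would do
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # example input shape (assumption)
torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",                        # output path is illustrative
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},   # allow variable batch size at runtime
)
```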

Tutorial
Xin Jin · Wenjun Zeng · Tao Yang · Yue Song · Nicu Sebe · Xingyi Yang · Xinchao Wang · Shuicheng Yan
Abstract

This tutorial aims to explore the concepts of disentanglement and compositionality in the field of computer vision. These concepts play a crucial role in enabling machines to understand and interpret visual information with more sophistication and human-like reasoning. Participants will delve into advanced techniques and models that allow for the disentanglement of visual factors in images and the compositionality of these factors to produce more meaningful representations. All in all, disentanglement and compositionality are believed to be among the possible paths for AI to understand the world fundamentally and eventually achieve Artificial General Intelligence (AGI). Specifically, we will cover the following topics: (1) Session #1: Disentangled Representation Learning (DRL); (2) Session #2: Compositionality of Computer Vision in the Era of Large Models; (3) Session #3: Disentanglement and Composition for AIGC; (4) Session #4: Disentanglement and Composition for AGI.

Tutorial
Long Chen · Oleg Sinavski · Fergal Cotter · Vassia Simaiaki · Elahe Arani · Gianluca Corrado · Nikhil Mohan · Jamie Shotton
Abstract

We propose a comprehensive half-day tutorial focused on End-to-End Autonomous Driving (E2EAD), reflecting the significant shift in focus towards this approach within both industry and academia. Traditional modular approaches in autonomous driving, while effective in specific contexts, often struggle with scalability, long-tail scenarios, and compounding errors from different modules, thereby paving the way for the end-to-end paradigm. This tutorial aims to dissect the complexities and nuances of end-to-end autonomy, covering theoretical foundations, practical implementations and validations, and future directions of this evolving technology.

Tutorial
Samet Akcay · Paula Ramos Giraldo · Ria Cheruvu · Alexander Kozlov · Zhen Zhao · Zhuo Wu · Raymond Lo · Yury Gorbachev
Abstract

This tutorial addresses the challenge of navigating the increasingly complex deep learning (DL) landscape, characterized by many frameworks with specialized functionalities. It aims to equip researchers and practitioners with the necessary skills to develop efficient and accessible DL models for diverse applications. The tutorial encompasses critical aspects of the DL pipeline, including robust data management, diverse training methodologies, optimization strategies, and efficient deployment techniques. Emphasis is placed on the utility of open-source libraries, such as the OpenVINO toolkit, OpenVINO Training eXtensions (OTX), and the Neural Network Compression Framework (NNCF), in streamlining the DL development process. Through hands-on experiences with OpenVINO, OTX, and NNCF, participants will gain proficiency in managing data effectively, utilizing various training methods, and implementing optimizations across the AI lifecycle, including computer vision pipelines and Generative AI (Gen AI). Furthermore, the tutorial dives into the concept of fine-tuning generative AI models, specifically Stable Diffusion (SD) with LoRA adaptors, for edge computing environments. This section highlights the advantages of customized models in reducing latency and enhancing efficiency. Ultimately, this comprehensive tutorial provides a valuable learning experience, equipping participants with the knowledge and skills necessary to navigate the complexities of modern DL and achieve success in their respective fields.
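As a small, hedged illustration of the kind of deployment workflow the tutorial covers, the sketch below loads an already-converted OpenVINO IR model and runs inference on CPU; the file name and input shape are placeholders, not materials from the tutorial.

```python
# Minimal sketch (assumed file names/shapes): running an OpenVINO IR model on CPU.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")                 # IR produced beforehand by the conversion tools
compiled = core.compile_model(model, device_name="CPU")

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input
result = compiled([x])                               # dict-like mapping of output tensors
logits = result[compiled.output(0)]
print(logits.shape)
```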

Tutorial
Xiaoyang Wu · Hengshuang Zhao · Fuxin Li · Zhijian Liu
Abstract

The point cloud is a data structure that is quite prevalent in 3D vision, playing an important role in areas like 3D perception, 3D generation, autonomous driving, embodied AI, etc. However, there has not been a comprehensive resource that covers the state-of-the-art approaches and engineering details in point cloud processing. This tutorial aims to provide a comprehensive understanding of point cloud processing and analysis. Participants will delve into various aspects of point cloud data, exploring fundamental layers, network engineering considerations, applications, properties, and the pathway to developing a 3D foundation model. Through a combination of lectures, hands-on demonstrations, and discussions, attendees will gain insights into the latest developments in the field and learn how to make informed choices when working with point cloud data.
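As one concrete example of the kind of fundamental point cloud operation the tutorial touches on, the hedged sketch below implements naive farthest point sampling in NumPy; it is an illustrative baseline, not code from the tutorial.

```python
# Minimal sketch: naive farthest point sampling (FPS), a common building block in
# point cloud networks for selecting well-spread subsets of points.
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """points: (N, 3) array; returns indices of k sampled points."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=np.int64)
    dist = np.full(n, np.inf)
    chosen[0] = 0                                   # start from an arbitrary point
    for i in range(1, k):
        # update each point's distance to the nearest already-chosen point
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(np.argmax(dist))            # pick the farthest remaining point
    return chosen

pts = np.random.rand(1024, 3)
idx = farthest_point_sampling(pts, 64)
print(pts[idx].shape)  # (64, 3)
```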

Tutorial
Naoki Wake · Zane Durante · Ran Gong · Jae Sung Park · Bidipta Sarkar · Rohan Taori · Yusuke Noda · Yejin Choi · Demetri Terzopoulos · Katsushi Ikeuchi · Hoi Vo · Li Fei-Fei · Jianfeng Gao · Qiuyuan Huang
Abstract

The recent advancement of large foundation models has driven the evolution of multimodal AI systems that excel in understanding visual and contextual information. In particular, systems that perceive multimodal inputs and produce meaningful actions, namely Multimodal Agent AI (MAA) systems, are emerging as a key technology in various applications, including multimodal understanding, gaming, healthcare, robotics and embodied AI, though not limited to these fields. This tutorial will feature talks and panels with leading researchers from these fields, aiming to engage participants by offering a collection of valuable educational content that spans a broad spectrum of disciplines that we expect to be significantly impacted by MAA systems. The tutorial website is available at: https://multimodalagentai.github.io/.

Tutorial
Orhun Aydin · Philipe Ambrozio Dias · Dalton Lunga
Abstract

The 5Vs of big data (volume, value, variety, velocity, and veracity) pose immense opportunities and challenges for implementing local and planet-wide solutions from Earth observation (EO) data. EO data, residing at the center of various multidisciplinary problems, is primarily obtained through satellite imagery, aerial photography, and UAV-based platforms. Understanding EO data unlocks this immense data source for addressing planet-scale problems with computer vision and machine learning techniques for geospatial analysis. This tutorial introduces current EO data sources, problems, and image-based analysis techniques. The most recent advances in data, models, and the open-source analysis ecosystem related to computer vision and deep learning for EO data will be introduced. The tutorial will expose the audience to cutting-edge geospatial foundation models applied to both archived and live satellite data for environmental and climate monitoring.

Tutorial
Edward Kim · Sanjit Seshia · Daniel Fremont · Jinkyu Kim · Kimin Lee · Hazem Torfah · Necmiye Ozay · Parasara Sridhar Duggirala · Marcell Vazquez-Chanlatte
Abstract

Today's autonomous systems rely heavily on the use of machine learning components trained on large amounts of data. Even so, it is expensive to collect relevant data and test these systems in the real world in a manner that captures typical data distributions and also covers edge cases. Therefore, simulators are widely adopted in the robotics and computer vision community to train, test, and debug autonomous and semi-autonomous systems. However, working directly with simulators can be too low-level and problem-specific. To support the design lifecycle of autonomous/semi-autonomous systems, one needs to raise the level of abstraction above individual simulators and provide a formal framework for world modeling. Such a world model can help reason about the safety of a system and facilitate data generation and sim-to-real validation, as well as help to interpret, validate, share, or re-use training and test scenarios across the community. The objective of this tutorial is to introduce SCENIC, an open-source, domain-specific probabilistic programming language for world modeling that addresses the above needs. SCENIC is designed to model and generate interactive (or reactive), multi-agent scenarios in a manner portable to any simulator. In SCENIC, users can precisely model a stochastic environment in which an autonomous/semi-autonomous system …
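To make the idea of probabilistic world modeling concrete, the sketch below (plain Python, not SCENIC syntax) samples a simple two-vehicle scenario from declared distributions and rejects samples that violate a constraint; this is only the conceptual essence of what a SCENIC program describes declaratively and portably across simulators, and all parameter names and ranges are illustrative assumptions.

```python
# Conceptual sketch only (plain Python, NOT SCENIC syntax): sampling a simple
# two-vehicle scenario from declared distributions, rejecting samples that violate
# a spatial constraint. A SCENIC program expresses this kind of stochastic
# environment declaratively and renders it portable across simulators.
import random

def sample_scenario():
    ego_speed = random.uniform(5.0, 15.0)             # m/s (illustrative range)
    lead_gap = random.uniform(3.0, 40.0)              # initial gap to the lead car, m
    lead_speed = ego_speed + random.gauss(0.0, 2.0)   # correlated with ego speed
    weather = random.choice(["clear", "rain", "fog"])
    return {"ego_speed": ego_speed, "lead_gap": lead_gap,
            "lead_speed": lead_speed, "weather": weather}

def satisfies_constraints(s):
    # e.g. require a minimum safe initial gap for this test suite
    return s["lead_gap"] >= 5.0

scenarios = []
while len(scenarios) < 100:
    s = sample_scenario()
    if satisfies_constraints(s):    # rejection sampling over the declared space
        scenarios.append(s)
print(scenarios[0])
```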

Tutorial
Mohit Prabhushankar · Ghassan AlRegib
Abstract

Neural-network-driven applications like ChatGPT suffer from hallucinations, where they confidently provide inaccurate information. A fundamental reason for this inaccuracy is the lack of robust measures applied to the underlying neural network predictions. In this tutorial, we identify and expound on three human-centric robustness measures, namely explainability, uncertainty, and intervenability, with which every decision made by a neural network must be equipped and evaluated. The explainability and uncertainty research fields are accompanied by a large body of literature that analyzes decisions. Intervenability, on the other hand, has gained recent prominence due to its inclusion in the GDPR regulations and a surge in prompting-based neural network architectures. In this tutorial, we connect all three fields using inference-based reliability assessment techniques to motivate robust machine learning models.
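As a small, hedged illustration of one of the simplest uncertainty measures that can be attached to a network's predictions, the sketch below computes the predictive entropy of softmax outputs; it is a generic baseline, not a method specific to the tutorial.

```python
# Minimal sketch: predictive entropy of softmax outputs as a per-sample
# uncertainty score (a generic baseline, not the tutorial's specific method).
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(logits: np.ndarray) -> np.ndarray:
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)   # higher = more uncertain

logits = np.array([[4.0, 0.1, 0.2],    # confident prediction
                   [1.0, 0.9, 1.1]])   # ambiguous prediction
print(predictive_entropy(logits))       # the second entry should be larger
```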

Tutorial
Raquel Urtasun · Sergio Casas · Abbas Sadat · Sivabalan Manivasagam · Ioan Andrei Bârsan
Abstract

We have presented prior versions of this course at CVPR 2020, 2021, and 2023 as full-day tutorials. While in 2020 and 2021 the sessions were online due to the pandemic, they nevertheless garnered significant attendance and were well received. Last year, at CVPR 2023 in Vancouver, we hosted the first in-person version of the tutorial, which attracted well over 150 on-site attendees, plus another 100+ online via Zoom. This resulted in multiple insightful conversations during our Q&A sessions. We released each session's recording on Waabi's YouTube channel: https://www.youtube.com/channel/UCNJWh6U6vfTz77SJCeWZloQ. Much like we did for the 2023 edition, we plan to continue revamping the content of the tutorial by incorporating the key advances in self-driving research from the last year, and further improving the quality of our materials. We note that this course will NOT focus on the advances of a single lab or company, but will provide an in-depth perspective of both advances in academia and industry, as well as the remaining problems to solve. We believe the presenters complement their theoretical knowledge with practical expertise covering all aspects related to self-driving. They all have extensive experience in both industry and academia, including three of the leading companies in self-driving: Uber ATG, Bosch, and Waabi. …

Tutorial
Qing Qu · Zhihui Zhu · Yuqian Zhang · Yi Ma · Sam Buchanan · Beidi Chen · Mojan Javaheripi · Liyue Shen · Zhangyang Wang
Abstract

Over the past decade, the advent of machine learning and large-scale computing has immeasurably changed the ways we process, interpret, and predict with data in imaging and computer vision. The "traditional" approach to algorithm design, based around parametric models for specific structures of signals and measurements (say, sparse and low-rank models) and the associated optimization toolkit, is now significantly enriched with data-driven, learning-based techniques, where large-scale networks are pre-trained and then adapted to a variety of specific tasks. Nevertheless, the successes of both modern data-driven and classic model-based paradigms rely crucially on correctly identifying the low-dimensional structures present in real-world data, to the extent that we see the roles of learning and compression in data processing algorithms (whether explicit or implicit, as with deep networks) as inextricably linked. As such, this timely tutorial uniquely bridges low-dimensional models with deep learning in imaging and vision. It will show how (i) these low-dimensional models and principles provide a valuable lens for formulating problems and understanding the behavior of modern deep models in imaging and computer vision, and (ii) ideas from low-dimensional models can provide valuable guidance for designing new parameter-efficient, robust, …
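As a small, hedged illustration of the low-dimensional models the tutorial builds on, the sketch below computes the best rank-k approximation of a matrix via truncated SVD (the classic Eckart-Young solution); the data are synthetic and the example is illustrative only.

```python
# Minimal sketch: best rank-k approximation of a data matrix via truncated SVD,
# the classic example of an explicit low-dimensional (low-rank) model.
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: rank-5 signal plus small noise
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
A += 0.01 * rng.standard_normal(A.shape)

def low_rank_approx(M: np.ndarray, k: int) -> np.ndarray:
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

A5 = low_rank_approx(A, 5)
rel_err = np.linalg.norm(A - A5) / np.linalg.norm(A)
print(f"relative error of rank-5 approximation: {rel_err:.4f}")  # should be small
```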

Tutorial
Yanwei Fu · Francesco Locatello · Tianjun Xiao · Tong He · Ke Fan
Abstract

This tutorial discusses the evolution of object-centric representation in computer vision and deep learning. Initially inspired by decomposing visual scenes into surfaces and objects, recent developments focus on learning causal variables from high-dimensional observations like images or videos. The tutorial covers the objectives of object-centric learning (OCL), its development, and its connections with other machine learning fields, emphasizing object-centric approaches, especially in unsupervised segmentation. Advances in encoders, decoders, and self-supervised learning objectives are explored, with a focus on real-world applications and challenges. The tutorial also introduces open-source tools and showcases breakthroughs in video-based object-centric learning. The broader scope of object-centric representation is discussed, including applications in tasks like amodal video segmentation, Visual Language Navigation (VLN), Facial Appearance Editing (FAE), robotic grasping, and the PaLM-E model's impact across various domains.

Tutorial
Sijia Liu · Yang Liu · Nathalie Baracaldo · Eleni Triantafillou
Abstract

This tutorial seeks to provide a comprehensive understanding of emerging machine unlearning (MU) techniques. These techniques are meticulously crafted to assess the precise impact of specific data points, classes, or concepts on model performance and to efficiently and effectively eliminate their potentially detrimental influence within a pre-trained model, all in response to users' unlearning requests. To the best of our knowledge, no well-established tutorial currently offers a systematic and in-depth introduction to MU for vision tasks. Through this tutorial, we aim to unveil MU's critical foundational research aspects, spanning the entire algorithm-evaluation-application spectrum. In addition, our tutorial endeavors to capture the broad interest of students, researchers, and practitioners working on trustworthy computer vision (CV), AI privacy and security (SP), and generative models, catering to individuals at all levels of expertise. A notable component of our tutorial is the thoughtfully designed Demo Expo, which will provide participants with a hands-on, detailed tutorial on implementing existing MU techniques for vision tasks, including image classification and image generation. Four accomplished speakers with extensive academic and industry experience will lead the proposed tutorial. They bring diverse research expertise, encompassing CV, machine learning, and SP. Furthermore, the speakers ensure gender diversity, contributing to a well-rounded …
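To ground the idea of approximate unlearning, the hedged sketch below shows one simple baseline discussed in the MU literature: continuing training with gradient descent on the data to retain while taking gradient ascent steps on the data to forget. The model, loss weighting, and loaders are placeholders, and this is not a specific method from the tutorial.

```python
# Conceptual sketch of a simple approximate-unlearning baseline (not a specific
# method from the tutorial): gradient descent on data to retain, gradient ascent
# on data to forget. `model`, `retain_loader`, `forget_loader` are placeholders.
import torch
import torch.nn.functional as F

def unlearn_epoch(model, retain_loader, forget_loader, lr=1e-4, ascent_weight=0.5):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for (xr, yr), (xf, yf) in zip(retain_loader, forget_loader):
        opt.zero_grad()
        loss_retain = F.cross_entropy(model(xr), yr)   # keep performance on retained data
        loss_forget = F.cross_entropy(model(xf), yf)   # push the model away from forgotten data
        loss = loss_retain - ascent_weight * loss_forget
        loss.backward()
        opt.step()
    return model
```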

Tutorial
Li Chen · Andreas Geiger · Huijie Wang · Jiajie Xu
Abstract

Conventional designs for autonomous systems usually follow the modular pattern, where each module is responsible for its own task and is developed separately. In recent years, end-to-end methods have become a new paradigm for autonomous system design. Compared to modular pipelines, end-to-end systems benefit from joint feature optimization. We believe such a tutorial is necessary for both the machine learning and computer vision communities. This tutorial offers a brand-new perspective for discussing broad areas of end-to-end framework design for autonomous systems from a system-level standpoint.

Tutorial
Fabricio Narcizo · Elizabete Munzlinger · Anuj Dutt · Shan Shaffi · Sai Narsi Reddy Donthi Reddy
Abstract

Edge AI is the application of artificial intelligence on edge devices, such as smartphones, cameras, and sensors, that can perform AI tasks autonomously without relying on a connection to the cloud or a central server. Edge AI brings benefits such as higher speed, lower latency, greater privacy, and lower power consumption, but it also poses many challenges and opportunities for model development and deployment, such as size reduction, compression, quantization, and distillation. Moreover, edge AI involves integration and communication between edge devices and the cloud or other devices, creating a hybrid and distributed architecture. This tutorial will present practical approaches to developing and deploying optimized models for edge AI, covering theoretical and technical aspects as well as case studies. We will focus on computer vision and deep learning models, which are widely used and relevant for edge AI applications. We will also demonstrate the use of tools and frameworks, such as TensorFlow, PyTorch, ONNX, OpenVINO, Google Mediapipe, Qualcomm SNPE, and others, to facilitate the edge AI process. Furthermore, we will introduce multi-modal AI examples, such as head pose estimation, body segmentation, hand gesture recognition, and sound localization models, which combine different inputs and outputs, such as images, videos, and sounds, to create more interactive and immersive edge AI …
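As a brief, hedged illustration of one of the model-optimization techniques named above, the sketch below applies post-training dynamic quantization to a small PyTorch model; the architecture is a placeholder and the exact toolchain used in the tutorial may differ.

```python
# Minimal sketch: post-training dynamic quantization of a small PyTorch model,
# one common size/latency optimization for edge deployment. The model here is a
# placeholder; the tutorial's actual toolchain may differ.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # quantize Linear weights to int8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, smaller weights / faster CPU matmuls
```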

Tutorial
Hao Fei · Yuan Yao · Ao Zhang · Haotian Liu · Fuxiao Liu · Zhuosheng Zhang · Shuicheng Yan
Abstract

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. As a multidisciplinary research field, multimodal large language models (MLLMs) have recently garnered growing interest in both academia and industry, showing an unprecedented trend to achieve human-level AI via MLLMs. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on three key areas: MLLM architecture design, instructional learning, and multimodal reasoning of MLLMs. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.
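To make the architecture-design discussion concrete, the hedged sketch below shows the common "vision encoder + projection + LLM" connector pattern in skeletal PyTorch: image features are projected into the language model's embedding space and prepended to the text tokens. All modules and dimensions are placeholders, not a specific MLLM from the tutorial.

```python
# Skeletal sketch of a common MLLM connector pattern (placeholders throughout):
# vision features -> linear projection -> prepend visual tokens to text embeddings
# -> feed the combined sequence to the language model.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024, vocab=32000):
        super().__init__()
        self.projector = nn.Linear(vision_dim, llm_dim)      # the "connector"
        self.token_embed = nn.Embedding(vocab, llm_dim)       # stand-in for the LLM's embeddings
        self.llm = nn.TransformerEncoder(                     # stand-in for a decoder-only LLM
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, vision_feats, text_ids):
        vis_tokens = self.projector(vision_feats)             # (B, Nv, llm_dim)
        txt_tokens = self.token_embed(text_ids)               # (B, Nt, llm_dim)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)      # multimodal sequence
        return self.lm_head(self.llm(seq))

model = ToyMultimodalLM()
out = model(torch.randn(2, 16, 768), torch.randint(0, 32000, (2, 8)))
print(out.shape)  # (2, 24, 32000)
```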

Tutorial
Zhiqian Chen · Lei Zhang · Liang Zhao
Abstract

Over recent years, Graph Neural Networks (GNNs) have garnered significant attention. However, the proliferation of diverse GNN models, underpinned by various theoretical approaches, complicates the process of model selection, as they are not readily comprehensible within a uniform framework. Specifically, early GNNs were implemented using spectral theory, while others were developed based on spatial theory. This divergence between spectral and spatial methodologies renders direct comparisons challenging. Moreover, the multitude of models within each domain further complicates the evaluation of their respective strengths and weaknesses. In this half-day tutorial, we examine the state of the art in GNNs and introduce a comprehensive framework that bridges the spatial and spectral domains, elucidating their complex interrelationships. This emphasis on a comprehensive framework enhances our understanding of GNN operations. The tutorial's objective is to explore the interplay between key paradigms, such as spatial- and spectral-based methods, through a synthesis of spectral graph theory and approximation theory. We provide an in-depth analysis of the latest research developments in GNNs in this tutorial, including discussions on emerging issues like over-smoothing. A range of well-established GNN models will be utilized to illustrate the universality of our proposed framework.
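As a small, hedged illustration of the spatial/spectral connection discussed above, the sketch below implements one GCN-style propagation step, a spatial message-passing rule that can equally be read as a first-order spectral filter on the normalized graph Laplacian; the graph and feature values are synthetic.

```python
# Minimal sketch: one GCN-style propagation step, H' = relu(D^-1/2 (A+I) D^-1/2 H W).
# Spatially this is neighbor averaging; spectrally it is a first-order filter on
# the normalized graph Laplacian. The graph and features below are synthetic.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # toy adjacency matrix
H = np.random.rand(4, 8)                    # node features
W = np.random.rand(8, 16)                   # weight matrix (random stand-in)

A_hat = A + np.eye(4)                       # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

H_next = np.maximum(A_norm @ H @ W, 0.0)    # ReLU nonlinearity
print(H_next.shape)                          # (4, 16)
```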

Tutorial
Wenjin Wang · Daniel Mcduff · Xuyu Wang
Abstract

Understanding people and extracting health-related metrics is an emerging research topic in computer vision that has grown rapidly in recent years. Without the need for any physical contact with the human body, cameras have been used to measure vital signs remotely (e.g., heart rate, heart rate variability, respiration rate, blood oxygen saturation, pulse transit time, body temperature, etc.) from an image sequence of the skin or body, which leads to contactless, continuous, and comfortable health monitoring. The use of cameras also enables the measurement of human behaviors/activities and high-level visual semantic/contextual information leveraging computer vision and machine learning techniques, such as facial expression analysis for pain/discomfort/delirium detection, emotion recognition for depression measurement, body motion for sleep staging or bed exit/fall detection, activity recognition for patient actigraphy, etc. Understanding the environment around people is also a unique advantage of cameras compared to contact bio-sensors (e.g., wearables), which facilitates better understanding of both humans and scenes for health monitoring. In addition to camera-based approaches, Radio Frequency (RF) based methods for health monitoring have also been proposed, using Radar, WiFi, RFID, and acoustic signals. Radar-based methods mainly use Doppler/UWB/FMCW radar for health monitoring. They can obtain high monitoring accuracy for different …
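As a small, hedged illustration of the camera-based vital-sign idea, the sketch below estimates a pulse rate from a synthetic sequence of per-frame mean skin-pixel green values using a simple spectral peak; real remote-PPG pipelines add face tracking, skin segmentation, detrending, and more robust signal processing.

```python
# Minimal sketch of the remote-PPG principle: the mean green value of skin pixels
# fluctuates with the cardiac pulse, so its dominant frequency estimates heart rate.
# Here the "video" is a synthetic 72-bpm signal rather than real camera frames.
import numpy as np

fps = 30.0
t = np.arange(0, 20, 1.0 / fps)                          # 20 seconds of frames
green_means = 0.5 + 0.01 * np.sin(2 * np.pi * 1.2 * t)   # 1.2 Hz = 72 bpm pulse
green_means += 0.005 * np.random.randn(t.size)           # sensor noise

signal = green_means - green_means.mean()                # remove DC component
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fps)

band = (freqs >= 0.7) & (freqs <= 4.0)                   # plausible heart-rate band
hr_hz = freqs[band][np.argmax(spectrum[band])]
print(f"estimated heart rate: {hr_hz * 60:.1f} bpm")     # ~72 bpm
```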

Tutorial
Matteo Poggi
Abstract

Obtaining dense and accurate depth measurements from images is of paramount importance for many 3D computer vision applications. It has been the subject of research for decades, with several techniques being developed relying either on dedicated sensors (such as LiDARs, ToFs, etc.) or standard imaging cameras. Among the many alternatives, estimating depth from stereo images has always represented the preferred solution to obtain the best balance in terms of efficiency, ease of deployment, cost, and accuracy. This process is traditionally known as stereo matching and consists of finding the correspondences between pixels in two rectified images, i.e., images for which the same 3D point is projected onto pixels lying on the same horizontal scanline. Once the match between pixels is found, their disparity in terms of horizontal coordinates is used to obtain depth through triangulation. For decades, stereo matching has been approached by developing hand-crafted algorithms, focused on measuring the visual similarity between local patterns in the two images and propagating this information globally. Since 2015, deep learning has led to a paradigm shift in this field, driving the community to the design of end-to-end deep networks capable of matching pixels. The results of this revolution brought stereo matching to a whole …
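As a brief, hedged illustration of the classical pipeline described above, the sketch below computes disparities for one scanline with a naive sum-of-absolute-differences search and converts them to depth via triangulation; the camera parameters and images are synthetic placeholders.

```python
# Minimal sketch of classical stereo matching on a single scanline: for each pixel
# in the left image, search along the same row of the right image for the best
# sum-of-absolute-differences (SAD) match, then triangulate depth = f * B / disparity.
# Images and camera parameters below are synthetic placeholders.
import numpy as np

focal_px, baseline_m = 700.0, 0.12          # focal length (pixels), baseline (meters)
max_disp, win = 32, 3                       # disparity search range, half window size

rng = np.random.default_rng(0)
left = rng.random((1, 256))                 # one rectified scanline per image
right = np.roll(left, -10, axis=1)          # ground-truth disparity of 10 pixels

disparities = np.zeros(256)
for x in range(max_disp + win, 256 - win):
    patch = left[0, x - win:x + win + 1]
    costs = [np.abs(patch - right[0, x - d - win:x - d + win + 1]).sum()
             for d in range(max_disp)]
    disparities[x] = int(np.argmin(costs))

valid = disparities > 0
depth = np.zeros_like(disparities)
depth[valid] = focal_px * baseline_m / disparities[valid]      # triangulation
print(np.median(disparities[valid]), np.median(depth[valid]))  # ~10 px, ~8.4 m
```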

Tutorial
Zhengyuan Yang · Linjie Li · Zhe Gan · Chunyuan Li · Jianwei Yang
Abstract

We present our CVPR tutorial on Recent Advances in Vision Foundation Models, a topic that has recently attracted rapidly growing attention from the computer vision community. Our tutorial will cover the most advanced directions in designing and training vision foundation models, including the state-of-the-art approaches and principles in (i) learning vision foundation models for multimodal understanding and generation, (ii) benchmarking and evaluating vision foundation models, and (iii) agents and other advanced systems based on vision foundation models.

Tutorial
Benjamin Kimia · Timothy Duff · Ricardo Fabbri · Hongyi Fan
Abstract

Minimal problems and their solvers play an important role in RANSAC-based approaches to several estimation problems in vision. Minimal solvers solve systems of equations, depending on data, which obey a "conservation of number" principle: for sufficiently generic data, the number of solutions over the complex numbers is constant. Homotopy continuation (HC) methods exploit not just this conservation principle, but also the smooth dependence of solutions on problem data. The classical solution of polynomial systems using Gröbner bases, resultants, elimination templates, etc. has been largely successful on smaller problems, but these methods are not able to tackle larger polynomial systems with a larger number of solutions. While HC methods can solve these problems, they have been notoriously slow. Recent research by the presenters and other researchers has enabled efficient HC solvers with the ability to produce real-time solutions. The main objective of this tutorial is to make this technology more accessible to the computer vision community. Specifically, after an overview of how such methods can be useful for solving problems in vision (e.g., absolute/relative pose, triangulation), we will describe some of the basic theoretical apparatus underlying HC solvers, including both local and global "probability-1" aspects. On the practical side, we will describe recent …
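To make the path-tracking idea concrete, the hedged sketch below runs homotopy continuation on a univariate polynomial: the known roots of a start system are tracked to the roots of a target system with an Euler predictor and Newton corrector, using the random-constant "gamma trick" that underlies probability-1 arguments. Real HC solvers for vision handle multivariate systems with adaptive steps; the polynomials here are arbitrary toy choices.

```python
# Minimal sketch of homotopy continuation for a univariate polynomial: track the
# known roots of a start system g(x) = x^3 - 1 to the roots of a target system
# f(x) = x^3 - 2x + 1 along H(x, t) = (1 - t) * gamma * g(x) + t * f(x), using an
# Euler predictor and Newton corrector. The complex constant gamma (the "gamma
# trick") keeps the paths nonsingular with probability one.
import numpy as np

f  = lambda x: x**3 - 2*x + 1
df = lambda x: 3*x**2 - 2
g  = lambda x: x**3 - 1
dg = lambda x: 3*x**2

gamma = np.exp(0.7j)                              # fixed "generic" complex constant
H  = lambda x, t: (1 - t) * gamma * g(x) + t * f(x)
Hx = lambda x, t: (1 - t) * gamma * dg(x) + t * df(x)   # partial derivative in x
Ht = lambda x, t: f(x) - gamma * g(x)                   # partial derivative in t

start_roots = [np.exp(2j * np.pi * k / 3) for k in range(3)]   # roots of g
tracked = []
for x in start_roots:
    t, steps = 0.0, 200
    dt = 1.0 / steps
    for _ in range(steps):
        x = x - dt * Ht(x, t) / Hx(x, t)          # Euler predictor along the path
        t += dt
        for _ in range(3):                        # Newton corrector back onto H = 0
            x = x - H(x, t) / Hx(x, t)
    tracked.append(x)

print(np.round(tracked, 6))   # should approximate the roots 1, 0.618..., -1.618...
```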

Tutorial
Amir Zamir
Abstract

Nature exhibits a wide variety of morphological adaptations, which are hypothesized to have "developed" through billions of years of evolution. In vision, such adaptations are essentially instantiations of the so-called Ecological Theory, which posits a strong connection between the specifics of vision and the environment surrounding the agent, its goals, and its body. How can a robotics/vision researcher achieve the similar goal of designing a robot tailored for a given task and environment? While, classically, this goal is achieved by humans designing the morphology intuitively, in this tutorial we discuss computational approaches to robot design that promise to achieve the goal of adaptive design automatically, effectively, and efficiently. The field of computational robot design includes computer-aided workflows that look to co-design robots over geometry, topology, actuation, sensing, material, sensorimotor control, proprioception, and task planning and higher-level reasoning. Recent effort has been put into developing methods and workflows for computationally designing robots that approach the capability and diversity of Animalia; but while much of robotics is anchored in the sense-think-act cycle, most efforts in computational design have abstracted away the design of sensing and perception. In this tutorial, we will especially focus on building connections to perception, namely vision and multimodal sensing, atop the core existing computational foundations. The tutorial discusses rigid, soft, …

Tutorial
Hsin-Ying Lee · Peiye Zhuang · Chaoyang Wang
Abstract

In the upcoming metaverse era, where physical and digital realities seamlessly blend, the ability to capture, represent, and analyze the three-dimensional structures of objects and scenes is a fundamental requirement. The evolution of 3D and 4D generation technologies has significantly transformed the landscape of applications like gaming, augmented reality (AR), and virtual reality (VR). These technologies introduce a new dimension of immersion and interactivity, enhancing the user experience in unprecedented ways. 3D modeling provides a crucial bridge between the physical and digital realms, enabling realistic simulations, augmented reality experiences, and immersive gaming. Moreover, adding the fourth dimension, time, allows us to capture dynamic changes over time, making it possible to create lifelike animations, track objects, and understand complex spatiotemporal relationships. As we venture into this new era, the integration of 3D and 4D modeling will be pivotal in reshaping our interaction with the digital world, opening up opportunities in areas such as entertainment, education, and more. Previously, 3D generation primarily relied on exploiting the intricacies of 3D data itself, with algorithms designed to directly manipulate and generate three-dimensional representations from point clouds, voxel grids, meshes, implicit functions, etc. These techniques have evolved in tandem with advancements in 2D generation, employing variational …
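As a small, hedged illustration of one of the 3D representations listed above, the sketch below defines a sphere as an implicit signed distance function, evaluates it on a voxel grid, and extracts near-surface points; the shape, radius, and resolution are arbitrary examples.

```python
# Minimal sketch: a sphere represented as an implicit signed distance function (SDF),
# sampled on a voxel grid; points where the SDF is near zero approximate the surface.
# The shape, radius, and grid resolution are arbitrary illustrative choices.
import numpy as np

def sphere_sdf(p: np.ndarray, radius: float = 0.5) -> np.ndarray:
    """Signed distance from points p (..., 3) to a sphere centered at the origin."""
    return np.linalg.norm(p, axis=-1) - radius

# sample the SDF on a 64^3 grid over [-1, 1]^3
lin = np.linspace(-1.0, 1.0, 64)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)
values = sphere_sdf(grid)

surface_pts = grid[np.abs(values) < 0.02]   # crude surface extraction
print(values.shape, surface_pts.shape)      # (64, 64, 64) and roughly a spherical shell
```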