Tutorials
Tutorial
Ronghang Zhu · Xiang Yu · Sheng Li

[ West 215 - 216 ]

Tutorial
Guansong Pang · Joey Tianyi Zhou · Radu Tudor Ionescu · Yu Tian · Kihyuk Sohn

[ East 18 ]

The tutorial will present a comprehensive review of recent advances in (deep) anomaly detection (AD) on image and video data. Three major AD paradigms will be discussed: unsupervised/self-supervised approaches (anomaly-free training data), semi-supervised approaches (few-shot training anomaly examples are available), and weakly-supervised approaches (video-level labels are available for frame-level detection). We will also touch on anomaly segmentation tasks, focusing on autonomous driving settings. The tutorial will end with a panel discussion on AD challenges and opportunities.

Tutorial
Dacheng Tao · Qiming Zhang · Yufei Xu · Jing Zhang

[ Virtual ]

Tutorial
Qirong Ho · Samuel Horvath · Hongyi Wang

[ East 5 ]

This tutorial will teach attendees how to overcome performance, cost, privacy and robustness challenges when using distributed and federated software systems for learning and deploying Computer Vision and ML applications across various hardware settings (networked machines, GPUs, embedded, mobile systems). The audience will learn about theory, implementation and practice of these topics: state-of-the-art approaches and system architectures, forms of distributed parallelism, pitfalls in the measurement of parallel application performance, parallel ML compilers, computation-communication-memory efficiency in federated learning (FL), trustworthy FL, tackling device heterogeneity in FL, and on-device FL systems.

Tutorial
Jian Ren · Sergey Tulyakov · Ju Hu

[ West 212 ]

This tutorial will introduce effective methodologies for re-designing algorithms for efficient content understanding, image generation, and neural rendering. Most importantly, we show how the algorithms can be efficiently deployed on mobile devices, eventually achieving real-time interaction between users and mobile devices.

Tutorial
Rakesh “Teddy” Kumar · Chen Chen · Mubarak Shah · Han‐Pang Chiu · Sijie Zhu

[ East 6 ]

Precise geo-location of a ground image within a large-scale environment is crucial to many applications, including autonomous vehicles, robotics, wide-area augmented reality, and image search. Localizing the ground image by matching it to an aerial/overhead geo-referenced database has gained noticeable momentum in recent years, due to significant growth in the availability of public aerial/overhead data with multiple modalities (such as aerial images from Google/Bing maps, USGS 2D and 3D data, aerial LiDAR data, and satellite 3D data). Matching a ground image to aerial/overhead data, whose acquisition is simpler and faster, also opens more opportunities for industrial and consumer applications. However, cross-view and cross-modal visual geo-localization comes with additional technical challenges due to dramatic changes in appearance between the ground image and the aerial/overhead database, which capture the same scene at different times, from different viewpoints, and/or with different sensor modalities. This tutorial will provide a comprehensive review of the research problem of visual geo-localization, including same-view/cross-time, cross-view, and cross-modal settings, to both new and experienced researchers. It also provides opportunities for researchers in visual geo-localization and other related fields to connect.

Tutorial
Pin-Yu Chen · Chaowei Xiao

[ East 14 ]

While machine learning (ML) models have achieved great success in many perception applications, concerns have arisen about their potential security, robustness, privacy, and transparency issues when applied to real-world applications. Irresponsibly applying a foundation model to mission-critical and human-centric domains can lead to serious misuse, inequity issues, negative economic and environmental impacts, and/or legal and ethical concerns. For example, ML models are often regarded as “black boxes” and can produce unreliable, unpredictable, and unexplainable outcomes, especially under domain shifts or maliciously crafted attacks, challenging the reliability of safety-critical applications; similarly, Stable Diffusion may generate NSFW and privacy-violating content.

The goals of this tutorial are to:

  • Provide a holistic and complementary overview of trustworthiness issues, including security, robustness, privacy, and societal issues, offering a fresh perspective on their impacts and the associated responsibilities, and introduce potential solutions.

  • Promote awareness of the misuse and potential risks in existing AI techniques and, more importantly, to motivate rethinking of trustworthiness in research.

  • Present case studies from computer vision-based applications.

This tutorial will provide sufficient background for participants to understand the motivation, research progress, known issues, and ongoing challenges in trustworthy perception systems, in addition to pointers to open-source …

Tutorial
Jiaming Song · Chenlin Meng · Arash Vahdat

[ West 202 - 204 ]

Diffusion models have been widely adopted in various computer vision applications and are becoming a dominant class of generative models. In 2022 alone, diffusion models were applied to many large-scale text-to-image foundation models, such as DALL-E 2, Imagen, Stable Diffusion and eDiff-I. These developments have also driven novel computer vision applications, such as solving inverse problems, semantic image editing, few-shot textual inversion, prompt-to-prompt editing, and lifting 2D models for 3D generation. This popularity is also reflected in the diffusion models tutorial at CVPR 2022, which accumulated nearly 60,000 views on YouTube over 8 months. The primary goal of the CVPR 2023 tutorial on diffusion models is to make diffusion models more accessible to a wider computer vision audience and to introduce recent developments in diffusion models. We will present successful practices for training and sampling from diffusion models and discuss novel applications that diffusion models enable in the computer vision domain. These discussions will also lean heavily on research developments released in 2022 and 2023. We hope that this year’s tutorial on diffusion models will attract more computer vision practitioners interested in this topic to make further progress in this exciting area.

Tutorial
Hila Chefer · Sayak Paul

[ West 211 ]

The attention mechanism has revolutionized deep learning research across many disciplines starting from NLP and expanding to vision, speech, and more. Different from other mechanisms, the elegant and general attention mechanism is easily adaptable and eliminates modality-specific inductive biases. As attention becomes increasingly popular, it is crucial to develop tools to allow researchers to understand and explain the inner workings of the mechanism to facilitate better and more responsible use of it. This tutorial focuses on understanding and interpreting attention in the vision and the multi-modal setting. We present state-of-the-art research on representation probing, interpretability, and attention-based semantic guidance, alongside hands-on demos to facilitate interactivity. Additionally, we discuss open questions arising from recent works and future research directions.

Tutorial
Kai Chen · Conghui He · Yanhong Zeng · Songyang Zhang · Wenwei Zhang

[ East 12 ]

This tutorial will introduce two open platforms that can significantly accelerate research in computer vision: OpenMMLab and OpenDataLab.

OpenMMLab is an open-source algorithm platform for computer vision. It aims to provide a solid benchmark and promote reproducibility for academic research. We have released more than 30 high-quality projects and toolboxes in various research areas such as image classification, object detection, semantic segmentation, and action recognition. OpenMMLab has made public more than 300 algorithms and 2,400 checkpoints. Over the past years, OpenMMLab has gained popularity in both academia and industry. It has received over 78,000 stars on GitHub and involves more than 1,700 contributors in the community.

OpenDataLab, which was initially released in March 2022, is an open data platform for artificial intelligence that includes a large number of datasets for computer vision.

Tutorial
Xin Li · Lan Xu · Yu Ding

[ Virtual ]

This tutorial focuses on the challenges of reconstructing a 3D model of a human face and then generating facial expressions. It comprises three parts, covering facial reconstruction from skeletal remains, 4D dynamic facial performance, and audio-driven talking face generation. First, face modeling is a fundamental technique with broad applications in animation, vision, games, and VR. Facial geometries are fundamentally governed by their underlying skull and tissue structures. This session covers the forensic task of facial reconstruction from skeletal remains, in which we will discuss how to restore fragmented skulls, model anthropological features, and reconstruct human faces upon skulls. Then, we will detail how to capture 4D facial performance, which is the foundation for face modeling and rendering. We will consider the hardware designs for cameras, sensors, and lighting, and the steps to obtain dynamic facial geometry along with physically-based textures (pore-level diffuse albedo, specular intensity, normals, etc.). We will discuss the two complementary workhorses, multi-view stereo and photometric stereo, and their combination with neural rendering advances and medical imaging. Finally, talking face generation will be discussed, including 3D animation parameters and 2D photo-realistic video, as well as their applications. It aims to create a talking video of a speaker …

Tutorial
James Demmel · Yang You

[ West 208 - 209 ]

Large Transformer models have performed promisingly on a wide spectrum of AI and CV applications. These positive performances have thus stimulated a recent surge of extremely large models. However, training these models generally requires more computation and training time. This has generated interest in both academia and industry in scaling up deep learning (DL) using distributed training on high-performance computing (HPC) resources like TPU and GPU clusters.

However, continuously adding more devices will not scale training as intended, since training at a large scale requires overcoming both algorithmic and systems-related challenges. This limitation prevents DL and CV researchers from exploring more advanced model architectures.

Many existing works investigate and develop optimization techniques that overcome these problems and accelerate large model training at larger scale. We categorize these works as improving either model accuracy or model efficiency. One method to maintain or improve model accuracy in a large-scale setting, while still maintaining computing efficiency, is to design algorithms that require less communication and memory. Notably, these are not mutually exclusive goals and can be optimized together to further accelerate training. This tutorial helps enable CV members to quickly master optimizations for large-scale DL training and successfully train …

Tutorial
Wenjin Wang · Xuyu Wang · Jun Luo

[ East 10 ]

Extracting health-related metrics is an emerging computer vision research topic that has grown rapidly in recent years. Without needing physical contact, cameras have been used to measure vital signs remotely (e.g., heart and respiration rates, blood oxygen saturation, body temperature) from images/video of the skin or body. This leads to contactless, continuous and comfortable health monitoring. Cameras can also leverage computer vision and machine learning techniques to measure human behaviors/activities and high-level visual semantic/contextual information, facilitating better understanding of people and scenes for health monitoring and providing a unique advantage over contact bio-sensors. RF (Radar, WiFi, RFID) and acoustic-based methods for health monitoring have also been proposed. The rapid development of computer vision and RF sensing also gives rise to new multi-modal learning techniques that expand the sensing capability by combining two modalities, while minimizing the need for human labels. The hybrid approach may further improve monitoring performance, for example by using camera images as a beacon to guide human activity learning for the RF signals. Contactless monitoring will bring a rich set of compelling healthcare applications that directly improve upon contact-based monitoring solutions and improve people’s care experience and quality of life, such as in care …

Tutorial
Linjie Li · Zhe Gan · Chunyuan Li · Jianwei Yang

[ East 16 ]

Visual understanding at different levels of granularity has been a longstanding problem in the computer vision community. The tasks span from image-level tasks (e.g., image classification, image-text retrieval, image captioning, and visual question answering), region-level localization tasks (e.g., object detection and phrase grounding), to pixel-level grouping tasks (e.g., image instance/semantic/panoptic segmentation). Until recently, most of these tasks have been separately tackled with specialized model designs, preventing the synergy of tasks across different granularities from being exploited.

In light of the versatility of transformers and inspired by large-scale vision-language pre-training, the computer vision community is now witnessing a growing interest in building general-purpose vision systems, also called vision foundation models, that can learn from and be applied to various downstream tasks, ranging from image-level and region-level to pixel-level vision tasks.

In this tutorial, we will cover the most recent approaches and principles at the frontier of learning and applying vision foundation models, including (1) learning vision foundation models from natural language supervision, with applications to open-vocabulary image classification and retrieval, object detection, segmentation, and multimodal understanding; (2) learning vision foundation models via masked image modeling, with its extensions to multimodal pre-training; and (3) vision foundation model architecture design with transformer and …

Tutorial
Oriane Simeoni · Weidi Xie · Thomas Kipf · Patrick Pérez

[ East 11 ]

Object localization in images is a key problem in a wide range of application domains that are embedded in critical settings such as self-driving vehicles or healthcare. However, most efficient solutions for object localization follow the standard object detection and semantic segmentation frameworks, meaning that they require large amounts of annotated data for training. Various heuristics and tools can now assist human annotators, but manual annotation remains a heavy and expensive process. Moreover, perception models based on annotations enter a cycle of dependence on additional annotations for every new object class to detect or new external condition to cover, e.g., indoor/outdoor scenes, different times of day, or different weather. Such models struggle to deal with our open, complex world, which is continuously evolving. Recent works have shown exciting prospects of avoiding annotations altogether by (1) leveraging self-supervised features, (2) building self-supervised object-centric objectives and (3) combining different modalities. In this context, we propose a half-day tutorial in which we will provide in-depth coverage of different angles on performing and building upon object localization with no human supervision.

Tutorial
Torsten Sattler · Yannis Avrithis · Eric Brachmann · Zuzana Kukelova · Marc Pollefeys · Sudipta Sinha · Giorgos Tolias

[ East 2 ]

The tutorial covers the task of visual localization, i.e., the problem of estimating the position and orientation from which a given image was taken. The tutorial’s scope includes cases with different spatial/geographical extent (small indoor/outdoor scenes, city-level, and world-level) and localization under changing conditions. In the coarse localization regime, the task is typically handled via retrieval approaches, which are covered in the first part of the tutorial. A typical use case is the following: given a database of geo-tagged images, the goal is to determine the place depicted in a new query image. Traditionally, this problem is solved by transferring the geo-tag of the most similar database image to the query. The major focus of this part is on the visual representation models used for retrieval, where we include both classical feature-based and recent deep learning-based approaches. The second and third parts of the tutorial cover methods for precise localization with feature-based and deep learning approaches, respectively. A typical use case for these algorithms is to estimate the full 6 Degree-of-Freedom (6DOF) pose of a query image, i.e., the position and orientation from which the image was taken, for applications such as robotics, autonomous vehicles (self-driving cars), Augmented / Mixed / …
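
The retrieval-based coarse localization described above (transferring the geo-tag of the most similar database image to the query) can be sketched as follows. This is a minimal illustration, not the presenters' method: the descriptors are made-up 3-dimensional vectors, cosine similarity is an assumed choice of similarity measure, and the function names `cosine_similarity` and `localize_by_retrieval` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def localize_by_retrieval(query_descriptor, database):
    """Return the geo-tag of the database image most similar to the query.

    `database` is a list of (descriptor, geo_tag) pairs, where geo_tag is
    a (latitude, longitude) tuple.
    """
    _, best_tag = max(
        database,
        key=lambda entry: cosine_similarity(query_descriptor, entry[0]),
    )
    return best_tag


# Toy database: hypothetical descriptors paired with (lat, lon) geo-tags.
database = [
    ([0.9, 0.1, 0.0], (48.8584, 2.2945)),
    ([0.1, 0.9, 0.1], (40.6892, -74.0445)),
    ([0.0, 0.2, 0.9], (35.6586, 139.7454)),
]

query = [0.8, 0.2, 0.1]  # hypothetical descriptor of the query image
estimated_location = localize_by_retrieval(query, database)
print(estimated_location)
```

In practice the descriptors would come from a learned or hand-crafted image representation, and the exhaustive scan would be replaced by an approximate nearest-neighbor index for large databases.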

Tutorial
Manling Li · Xudong Lin · Jie Lei

[ East 8 ]

Does knowledge still have value in the current era of large-scale pretraining? In this tutorial, we will comprehensively review existing paradigms for multimedia knowledge discovery and encoding, and focus on their contributions to vision-language pretraining. We categorize the knowledge into internal self-knowledge and external knowledge. Internal knowledge is extracted from text and vision modalities, such as structured entities, relations, events, and event procedures. We will focus on the structural aspects of the knowledge and address two key challenges regarding the acquisition of knowledge and encoding of structure across multiple modalities. External knowledge can be obtained from knowledge bases or language models, and we will exemplify their use to assist in commonsense understanding of vision modalities, with a focus on the temporal and cognitive aspects. The objective of this tutorial is to introduce participants to recent trends and emerging challenges in knowledge-driven vision-language research, as well as learning resources and tools for participants to obtain ready-to-use models, prompting thorough discussions regarding the impact of structured knowledge on text and vision learning.

Tutorial
Yanwei Fu · Da Li · Yu-Xiong Wang · Timothy Hospedales

[ East 5 ]

There is a growing trend of research in few-shot learning (FSL), which involves adapting learned knowledge to learn new concepts from only a few training examples. This tutorial comprises several talks, including an overview of few-shot learning by Dr. Da Li and a discussion of seminal and state-of-the-art meta-learning methods for FSL by Prof. Timothy Hospedales. The tutorial will cover both gradient-based and amortised meta-learners, as well as some theory for meta-learning. Dr. Yanwei Fu will introduce recent FSL techniques that use statistical methods, such as exploiting the support of unlabeled instances for few-shot visual recognition and causal inference for few-shot learning. Dr. Yu-Xiong Wang will also discuss various applications of FSL in fields beyond computer vision, such as natural language processing, reinforcement learning, and robotics.

Tutorial
Pascal Mettes · Max van Spengler · Yunhui Guo · Stella X. Yu

[ West 116 - 117 ]

Learning in computer vision is all about deep networks, and such networks operate on Euclidean manifolds by default. While Euclidean space is an intuitive and practical choice, foundational work on non-visual data has shown that when information is hierarchical in nature, hyperbolic space is superior, as it allows for an embedding without distortion. A core reason is that Euclidean distances scale linearly as a function of their norm, while hyperbolic distances grow exponentially, just like hierarchies grow exponentially with depth. This initial finding has resulted in rapid developments in hyperbolic geometry for deep learning.

Hyperbolic deep learning is booming in computer vision, with new theoretical and empirical advances with every new conference. But what is hyperbolic geometry exactly? What is its potential for computer vision? And how can we perform hyperbolic deep learning in practice? This tutorial will cover all such questions. We will dive into the geometry itself, how to design networks in hyperbolic space, and we show how current literature profits from learning in this space. The aim is to provide technical depth while addressing a broad audience of computer vision researchers and enthusiasts.
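
The contrast between Euclidean and hyperbolic distances mentioned in this abstract can be illustrated numerically. The sketch below assumes the Poincaré ball model of hyperbolic space (the abstract does not say which model the tutorial uses) and compares the distance from the origin to points approaching the boundary of the unit ball. The function names are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Standard Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball model (curvature -1).

    Both points must lie strictly inside the unit ball.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    argument = 1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(argument)


origin = (0.0, 0.0)
for radius in (0.5, 0.9, 0.99, 0.999):
    point = (radius, 0.0)
    print(
        f"norm={radius}: "
        f"euclidean={euclidean_distance(origin, point):.3f}, "
        f"hyperbolic={poincare_distance(origin, point):.3f}"
    )
```

The Euclidean distance stays below 1 while the hyperbolic distance keeps growing as the point nears the boundary, which is what leaves room for the many leaves of a deep hierarchy.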

Tutorial
Raquel Urtasun · Sergio Casas · Abbas Sadat · Sivabalan Manivasagam · Paul Spriesterbach · Ioan Barsan Barsan

[ West 302 - 305 ]

A full-day tutorial covering all aspects of autonomous driving. This tutorial will provide the necessary background for understanding the different tasks and associated challenges, the different sensors and data sources one can use and how to exploit them, as well as how to formulate the relevant algorithmic problems such that efficient learning and inference is possible. We will first introduce the self-driving problem setting and a broad range of existing solutions, both top-down from a high-level perspective, as well as bottom-up from technological and algorithmic points of view. We will then extrapolate from the state of the art and discuss where the challenges and open problems are, and where we need to head to provide a scalable, safe and affordable self-driving solution for the future.

Tutorial
Kaiyang Zhou · Ziwei Liu · Phillip Isola · Hyojin Bahng · Ludwig Schmidt · Sarah Pratt · Denny Zhou

[ West 223 - 224 ]

Originating from natural language processing, the new paradigm of prompting has recently swept through the computer vision community, bringing disruptive changes to various computer vision applications, such as image recognition and image generation. In comparison to the traditional fixed-once-learned architecture, like a linear classifier trained to recognize a specific set of categories, prompting offers greater flexibility and more opportunities for novel applications. It allows the model to perform new tasks, such as recognizing new categories, by tuning textual instructions or modifying a small number of parameters in the model's input space while keeping the majority of the pre-trained parameters untouched. This paradigm significantly pushes conversational human-AI interaction to unprecedented levels. Within a short period of time, the effectiveness of prompting has been demonstrated in a wide range of problem domains, including image classification, object detection, image generation and editing, video analytics, and robot control. In this tutorial, our aim is to provide a comprehensive background on prompting by building connections between research in computer vision and natural language processing. We will also review the latest advances in using prompting to tackle computer vision problems.

Tutorial
Giovanni Pintore · Marco Agus · Enrico Gobbetti

[ East 15 ]

Creating high-level structured 3D models of real-world indoor scenes from captured data and exploiting them are fundamental tasks with important applications in many fields. In this context, 360 capture and processing is very appealing, since panoramic imaging provides the quickest and most complete per-image coverage and is supported by a wide variety of professional and consumer capture devices. Research on inferring 3D indoor models from 360 images has been thriving in recent years, and has led to a variety of very effective solutions. Given the complexity and variability of interior environments, and the need to cope with noisy and incomplete captured data, many open research problems still remain. In this tutorial, we provide an up-to-date integrative view of the field. After introducing a characterization of input sources, we define the structure of output models, the priors exploited to bridge the gap between imperfect input and desired output, and the main characteristics of geometry reasoning and data-driven approaches. We then identify and discuss the main subproblems in structured reconstruction, and review and analyze state-of-the-art solutions for floor plan segmentation, bounding surfaces reconstruction, object detection and reconstruction, integrated model computation, and visual representation generation. We finally point out relevant research issues and …

Tutorial
Grigorios Chrysos · Fanghui Liu · Volkan Cevher

[ West 211 ]

What is the interplay of width/depth and how does initialization affect the robustness to adversarial attacks? What is a principled heuristic for selecting good architectures in Neural Architecture Search (NAS)? What is the role of Fourier features in implicit neural representations (INRs)? In this tutorial, we aim to build a bridge between the empirical performance of neural networks and deep learning theory. In particular, we want to make the recent deep learning (DL) theory developments accessible to vision researchers, and motivate vision researchers to design new architectures and algorithms for practical tasks. In the first part of the tutorial, we will discuss popular notions in DL theory, such as lazy training and Neural Tangent Kernel (NTK), or bilevel optimization for adversarial attacks and NAS. Then, we will exhibit how such tools can be critical in understanding the inductive bias of networks.

Tutorial
Jinwei Ye · Seung-Hwan Baek · Achuta Kadambi · Huaijin Chen

[ East 19 - 20 ]

Polarization is a fundamental property of light and describes the direction in which the electric field of light oscillates. Polarization, as an intrinsic property of light, provides an extra dimension of information for probing the physical world. Although polarization is often overlooked, it allows for efficient geometry and material analysis beyond conventional color images. With the snapshot quad-Bayer polarization camera being commercialized, there has been growing interest in using polarization cues to solve a wide range of computer vision problems. Recent advances have demonstrated advantages of using polarization imaging for geometry and material understanding.

In this tutorial, we will cover a comprehensive range of topics in polarization imaging, from the fundamental physical principles to its applications in various computer vision problems. We will specifically focus on recent advances in using polarization imaging to solve the problems of reflectance modeling, 3D reconstruction, and transparent object segmentation. Finally, we will showcase applications of polarization imaging in industry settings.

Tutorial
Yinqiang Zheng · Yunhao Zou · Haiyang Jiang · Ying Fu

[ West 114 - 115 ]

This half-day tutorial will cover the latest advances in the broad theme of Optics for Better AI, with a specific focus on how to capture and synthesize realistic data for training low-light enhancement deep models. In this tutorial, we will first present the overall pipeline and effects of using realistic data, including (i) Low-light Image Enhancement using Synthesized Data; (ii) Low-light Video Enhancement using Captured Data. Then, we show detailed instructions on noise calibration and construction of optical imaging systems, including (iii) How to Calibrate the Noise Model of a Specific Camera; (iv) How to Construct a Co-axial Imaging System.

Tutorial
Vishnu Naresh Boddeti · Zhichao Lu · Qingfu Zhang · Kalyanmoy Deb

[ West 113 ]

Real-world applications of deep learning often have to contend with objectives beyond predictive performance, i.e., more than one equally important and competing objective or criterion. Examples include cost functions pertaining to invariance (e.g., to photometric or geometric variations), semantic independence (e.g., to age or race for face recognition systems), privacy (e.g., mitigating leakage of sensitive information), algorithmic fairness (e.g., demographic parity), generalization across multiple domains, computational complexity (FLOPs, compactness), to name a few. In such applications, achieving a single solution that simultaneously optimizes all objectives is no longer feasible; instead, finding a set of solutions that are representative in describing the trade-off among objectives becomes the goal. Multiple approaches have been developed for such problems, including simple scalarization and population-based methods. This tutorial aims to provide a comprehensive introduction to fundamentals, recent advances, and applications of multi-objective optimization (MOO), followed by hands-on coding examples. Some emerging applications of MOO include (1) hardware-aware neural architecture search; (2) multi-task learning as multi-objective optimization; (3) representation learning for privacy and fairness. We will also summarize potential research directions intersecting MOO and ML/CV research.
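
The two ideas named in this abstract, a set of trade-off solutions and simple scalarization, can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the candidate models and their (error, GFLOPs) values are invented, both objectives are minimized, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


def weighted_sum_best(points, weights):
    """Simple scalarization: pick the point minimizing a weighted sum."""
    return min(points, key=lambda p: sum(w * x for w, x in zip(weights, p)))


# Hypothetical candidate models as (error rate, compute cost in GFLOPs).
candidates = [(0.10, 9.0), (0.12, 4.0), (0.20, 1.0), (0.25, 5.0), (0.15, 4.5)]

front = pareto_front(candidates)
print("Pareto front:", front)

# Different weightings select different trade-offs among the candidates.
print("accuracy only:", weighted_sum_best(candidates, (1.0, 0.0)))
print("compute only: ", weighted_sum_best(candidates, (0.0, 1.0)))
print("balanced:     ", weighted_sum_best(candidates, (10.0, 0.1)))
```

Each weighting returns a single solution, whereas the Pareto front describes the whole trade-off. Weighted sums can also miss points on non-convex parts of a front, which is one motivation for population-based methods.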

Tutorial
Sijia Liu · Xiaoming Liu · Xue Lin

[ East 7 ]

This tutorial will deliver a well-rounded understanding of the emerging field of reverse engineering of deception (RED) techniques, a cutting-edge topic in adversarial machine learning (ML) for reliable computer vision (CV). Past studies have extensively explored the generation, detection, and defense of machine-centric deception (e.g., adversarial attacks that deceive ML models) and human-centric deception (e.g., GAN-created images that mislead human decision-making) in CV. However, RED introduces a new adversarial learning paradigm that automatically uncovers and catalogs attack "fingerprints" found in both machine and human-centric attacks. The RED problem addressed in the tutorial is: Can we reverse-engineer the adversary's knowledge and attack toolchains beyond conventional adversarial detection/defense techniques? To this end, this tutorial will cover the following key aspects: (1) Review RED's definition and formulation, addressing basics and preliminaries. (2) Discuss the challenges and significance of RED, highlighting its connections and differences with conventional adversarial detection/defense techniques in ML. (3) Explore RED for machine-centric adversaries, reviewing recent RED developments on top of a variety of adversarial attacks. (4) Examine RED for human-centric adversaries, reviewing RED methods for the detection and model parsing of GAN-generated fake images. (5) Demonstrate and showcase RED applications in CV.

Tutorial
Yuchao Dai · Yinqiang Zheng · Bin Fan · Zhihang Zhong · Zhixiang Wan

[ East 17 ]

This half-day tutorial will cover the latest advances in rolling shutter (RS) imaging from three aspects, i.e., motion modeling and optimization-based solutions, deep learning-based solutions, and joint hardware and deep learning-based solutions. Specifically, we will first systematically present geometric motion models (like discrete, continuous, and special motions) and optimization-based approaches. Then, we will introduce deep learning-based RS image processing methods, such as RS image correction and RS temporal super-resolution, with new results and benchmarks that have recently appeared. Finally, we will elaborate on the combination of hardware features of RS cameras (e.g., dual RS cameras and global reset feature) and deep learning to boost the correction of RS geometric distortions.

Tutorial
Edward Miller · Pierre Moulon · Prince Gupta · Rawal Khirodkar · Richard Newcombe · Sach Lakhavani · Zhaoyang Lv

[ East 12 ]

Project Aria is a research device from Meta, which is worn like a regular pair of glasses, and enables researchers to study the future of always-on egocentric perception. In this tutorial, we will introduce two exciting new datasets from Project Aria: Aria Digital Twin, a real-world dataset with hyper-accurate digital counterpart; and Aria Synthetic Environments, a procedurally-generated synthetic Aria dataset for large-scale ML research. Each dataset will be presented with corresponding challenges, which we believe will be powerful catalysts for research. In addition to introducing new datasets and research challenges, we will also provide a hands-on demonstration of newly open-sourced tools for working with Project Aria, and demonstrate how the Project Aria ecosystem can be used to accelerate open research into egocentric perception tasks such as visual and non-visual localization and mapping, static and dynamic object detection and spatialization, human pose and eye-gaze estimation, and building geometry estimation.

Tutorial
Yusuke Matsui · Martin Aumuller · Han Xiao

[ West 113 ]

Neural search, a technique for efficiently searching for similar items in a deep embedding space, is the most fundamental technique for handling large multimodal collections. With the advent of powerful technologies such as foundation models and prompt engineering, efficient neural search is becoming increasingly important. For example, multimodal encoders such as CLIP allow us to convert various problems into simple embedding-and-search. Another example is feeding information into LLMs, for which vector search engines are currently a promising direction. Despite this attention, it is not obvious how to design a search algorithm for a given dataset. In this tutorial, we will focus on "million-scale search", "billion-scale search", and "query language" to show how to tackle real-world search problems.
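As a rough illustration of the embedding-and-search idea mentioned above, the following is a minimal sketch of exhaustive (brute-force) cosine-similarity search in plain Python. It is not taken from the tutorial: the function names `normalize` and `search` and the toy 3-dimensional vectors are assumptions made for illustration, and real embeddings would come from an encoder such as CLIP. Million- and billion-scale collections would typically need approximate indexes rather than this linear scan.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def search(query, database, k=3):
    """Return the k (index, similarity) pairs most similar to the query.

    Exhaustive scan, O(N * d) per query: only practical for small collections.
    """
    q = normalize(query)
    scored = []
    for idx, item in enumerate(database):
        v = normalize(item)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" with made-up values (hypothetical, not from a real encoder).
database = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
results = search([1.0, 0.2, 0.0], database, k=2)
for idx, score in results:
    print(idx, round(score, 4))
```

Here the query points mostly along the first axis with a small second component, so the second database vector ranks just ahead of the first.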

Tutorial
Nathan Kundtz · Matt Robinson · Dan Hedges

[ East 18 ]

With the rise of edge computing, the increase in remote sensing information, and the ubiquitous adoption of computer vision systems throughout retail and manufacturing markets, organizations are increasingly relying on the accuracy and reliability of Artificial Intelligence and Machine Learning systems trained to analyze and extract information from data captured using physical sensors and sensor platforms. Real datasets often fail to capture rare events or assets and are often inaccurately labeled, and collecting real sensor data can raise cost, privacy, security, and safety issues.

Synthetic data offers the opportunity to design and label datasets for specific algorithmic training needs. Synthetic imagery designed to emulate ground-based video systems or remotely sensed satellite imagery, for example, can be generated to show real-world locations populated with objects that are hard to find or that don’t yet exist. Accurately labeled, simulated datasets can be created to fit a wide range of potential real-world scenarios in which AI/ML systems will be deployed, thereby enabling teams to train and test these systems before they are deployed in production environments.

This tutorial will include an introduction to creating, using, and iterating on synthetic data using the open Rendered.ai synthetic data platform. We will also feature a demonstration using …

Tutorial
Ioannis Gkioulekas · Adithya Pediredla

[ East 8 ]

Tutorial
Maying Shen · Hongxu Yin · Jason Clemons · Pavlo Molchanov · Jose M. Alvarez · Jan Kautz

[ East 11 ]

This tutorial describes techniques that allow deep learning practitioners to accelerate the training and inference of large deep networks while also reducing memory requirements across a spectrum of off-the-shelf hardware, for important applications such as autonomous driving and large language models. Topics include, but are not limited to: 1) Overview of specialized deep learning hardware. We review the architecture of the most widely used deep learning acceleration hardware, including the main computational processors and memory modules. 2) How deep learning is performed on this hardware. We cover aspects of algorithmic intensity and give an overview of theoretical aspects of computing. Attendees will learn how to estimate processing time and latency by looking only at hardware specs and the network architecture. 3) Best practices for acceleration. We provide an overview of best practices for designing efficient neural networks, including channel number selection, compute-heavy operations, and reduction operations, among others. 4) Existing tools for model acceleration. In this part, we will focus on existing tools for accelerating a trained neural network on GPU devices. In particular, we will discuss operation folding, TensorRT, ONNX graph optimization, and sparsity. 5) Research overview of recent techniques. In the last part, we will focus on recent advanced techniques …
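To give a flavor of estimating processing time from hardware specs and the network architecture alone (topic 2 above), here is a minimal sketch based on a simple roofline-style bound. This is an assumed approach chosen for illustration, not necessarily the method taught in the tutorial. The helper names `conv2d_cost` and `roofline_time`, the layer shape, and the hardware figures (100 TFLOP/s peak compute, 1 TB/s memory bandwidth) are all hypothetical, and the traffic count assumes stride 1, "same" padding, batch size 1, and that each tensor is read or written exactly once.

```python
def conv2d_cost(h_out, w_out, c_in, c_out, kernel, bytes_per_value=2):
    """Return (flops, bytes_moved) for one 2D convolution at batch size 1.

    Assumes stride 1 and 'same' padding, so input and output share the
    spatial size, and that weights, inputs, and outputs each cross the
    memory bus exactly once (an idealized lower bound on traffic).
    """
    macs = h_out * w_out * c_out * c_in * kernel * kernel
    flops = 2 * macs  # one multiply and one add per multiply-accumulate
    weights = c_out * c_in * kernel * kernel
    inputs = h_out * w_out * c_in
    outputs = h_out * w_out * c_out
    bytes_moved = (weights + inputs + outputs) * bytes_per_value
    return flops, bytes_moved


def roofline_time(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower-bound the runtime as the slower of compute and memory transfer."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    bound = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bound


# Hypothetical accelerator specs (illustrative values, not a real device).
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

# Hypothetical layer: 3x3 convolution, 256 -> 256 channels, 56x56 output, FP16.
flops, bytes_moved = conv2d_cost(56, 56, 256, 256, kernel=3)
seconds, bound = roofline_time(flops, bytes_moved, PEAK_FLOPS, PEAK_BANDWIDTH)
print(f"{flops / 1e9:.2f} GFLOP, {bytes_moved / 1e6:.2f} MB, "
      f"at least {seconds * 1e6:.1f} us ({bound}-bound)")
```

Under these assumptions the layer is limited by compute rather than memory bandwidth. Measured latency is normally higher than this bound, since kernels rarely reach peak throughput.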