Workshop: Synthetic Data for Autonomous Systems (SDAS) Sun 18 Jun 07:50 a.m.
The 2nd International Workshop on Transformers for Vision Sun 18 Jun 07:50 a.m.
The Second Workshop on Structural and Compositional Learning on 3D Data Sun 18 Jun 07:50 a.m.
CVPR 2023 - 10th Workshop on Medical Computer Vision (MCV) Sun 18 Jun 08:00 a.m.
The CVPR MCV workshop provides a unique forum for researchers and developers in academia, industry and healthcare to present, discuss and learn about cutting-edge advances in machine learning and computer vision for medical image analysis and computer-assisted interventions. The workshop offers a venue for potential new collaborative efforts, encouraging more dataset and information exchanges for important clinical applications.
The ultimate goal of the MCV workshop is to bring together stakeholders interested in leveraging medical imaging data, machine learning and computer vision algorithms to build the next generation of tools and products to advance image-based healthcare. It is time to deliver!
The program features invited talks from leading researchers in academia and industry, as well as clinicians. There will be no paper submissions at this year's workshop.
The Second Workshop on 3D Vision and Robotics Sun 18 Jun 08:00 a.m.
Workshop: OmniLabel: Infinite label spaces for semantic understanding via natural language Sun 18 Jun 08:00 a.m.
The goal of this workshop is to foster research on the next generation of visual perception systems that reason over label spaces that go beyond a list of simple category names. Modern applications of computer vision require systems that understand a full spectrum of labels, from plain category names (“person” or “cat”), over modifying descriptions using attributes, actions, functions or relations (“women with yellow handbag”, “parked cars”, or “edible item”), to specific referring descriptions (“the man in the white hat walking next to the fire hydrant”). Natural language is a promising direction not only to enable such complex label spaces, but also to train such models from multiple datasets with different, and potentially conflicting, label spaces. Besides an excellent list of invited speakers from both academia and industry, the workshop will present the results of the OmniLabel challenge, which we held with our newly collected benchmark dataset that subsumes generic object detection, open-vocabulary detection, and referring expression comprehension into one unified and challenging task.
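For illustration only, here is a minimal sketch (not the OmniLabel baseline) of the scoring step that unifies these tasks once labels are free-form text: embed candidate image regions and label strings into a shared space and rank regions per query. All tensors, dimensions, and names are placeholders.

```python
# Sketch of free-form-label grounding: score region proposals against
# arbitrary text queries by cosine similarity in a joint embedding space.
import torch
import torch.nn.functional as F

def score_regions(region_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """region_feats: (R, D) proposal embeddings; text_feats: (Q, D) query embeddings.
    Returns an (R, Q) matrix of cosine-similarity scores."""
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return r @ t.T

regions = torch.randn(100, 512)           # e.g. RoI features from any detector
queries = torch.randn(3, 512)             # e.g. "parked cars", "edible item", ...
scores = score_regions(regions, queries)  # best box per query: scores.argmax(0)
```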
1st Workshop on Multimodal Content Moderation Sun 18 Jun 08:00 a.m.
Content moderation (CM) is a rapidly growing need in today’s industry, with a high societal impact, where automated CM systems can discover discrimination, violent acts, hate/toxicity, and much more, on a variety of signals (visual, text/OCR, speech, audio, language, generated content, etc.). Leaving or providing unsafe content on social platforms and devices can cause a variety of harmful consequences, including brand damage to institutions and public figures, erosion of trust in science and government, marginalization of minorities, geo-political conflicts, suicidal thoughts and more. Besides user-generated content, content generated by powerful AI models such as DALL-E and GPT present additional challenges to CM systems.
With the prevalence of multimedia social networking and online gaming, the problem of sensitive content detection and moderation is by nature multimodal. The Hateful Memes dataset [1] highlights the multimodal nature of content moderation: for example, an image of a skunk and the sentence “you smell good” are benign/neutral separately, but can be hateful when interpreted together. Another aspect is the complementary nature of multimodal analysis, where there may be ambiguity in interpreting individual modalities separately. Moreover, content moderation is contextual and culturally multifaceted; for example, different cultures have different conventions about gestures. This requires CM approaches to be not only multimodal, but also context-aware and culturally sensitive.
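To make the point concrete, here is a minimal late-fusion sketch in PyTorch, assuming image and text features have already been extracted by any pretrained encoders. It is a generic baseline, not a method endorsed by the workshop; the joint head can flag pairs that are benign in isolation but harmful together.

```python
# Minimal multimodal moderation baseline: concatenate unimodal features and
# classify jointly, so image-text interactions can be modeled.
import torch
import torch.nn as nn

class FusionModerator(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        # Concatenation is the simplest fusion; the joint layers capture
        # cross-modal interactions that unimodal heads cannot.
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

model = FusionModerator()
logits = model(torch.randn(4, 512), torch.randn(4, 512))  # (4, 2)
```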
Despite the urgency and complexity of the content moderation problem, it has not been an area of focus in the research community. By having a workshop at CVPR, we hope to bring attention to this important research and application area, build and grow the community of interested researchers, and generate new discussion and momentum for positive social impact. Through invited talks, panels, and paper submissions, this workshop will build a forum to discuss ongoing efforts in industry and academia, share best practices, and engage the community in working towards socially responsible solutions for these problems.
With organizers across industry and academia, speakers who are experts across relevant disciplines investigating technical and policy challenges, we are confident that the Workshop on Multimodal Content Moderation (MMCM) will complement the main conference by strengthening and nurturing the community for interdisciplinary cross-organization knowledge sharing to push the envelope of what is possible, and improve the quality and safety of multimodal sensitive content detection and moderation solutions that will benefit the society at large.
8th New Trends in Image Restoration and Enhancement Workshop and Challenges Sun 18 Jun 08:00 a.m.
The 3rd Workshop of Adversarial Machine Learning on Computer Vision: Art of Robustness Sun 18 Jun 08:00 a.m.
Third Workshop on Ethical Considerations in Creative Applications of Computer Vision - EC3V Sun 18 Jun 08:00 a.m.
Computer vision technologies like generative image models are rapidly being integrated into creative domains to, for example, aid in artistic content retrieval and curation, generate synthetic media, or enable new forms of artistic methods and creations. However, creative AI technologies bring with them a host of ethical concerns, ranging from representational harms associated with culturally sensitive matter to impacts on artistic practices and copyright and ownership concerns. In particular, it is unclear what kinds of performance failures and biases these models exhibit when deployed in cross-cultural and non-western settings.
We encourage retrospective discussions, position papers examining the cross-cultural and social impacts of creative applications of computer vision, ethical considerations in this domain including but not limited to artwork attributions, inequity in cultural performance, cultural appropriation, environmental impacts of generative arts, biases embedded in generative arts, dynamics of art marketplaces/platforms, and policy perspectives on creative AI.
Our aim is to create a platform for interdisciplinary discussions on these issues among computer vision researchers, socio-technical researchers, policy makers, social scientists, artists, and other cultural stakeholders. This year our Generative Art Demo will invite artists to use computer vision technologies to create art pieces that center questions and topics of cultural significance and create space for collective reflections on the role of AI art especially within non-western communities.
Second Workshop of Mobile Intelligent Photography and Imaging Sun 18 Jun 08:00 a.m.
Workshop: New Frontiers for Zero-Shot Image Captioning Evaluation Sun 18 Jun 08:00 a.m.
The purpose of this workshop is to challenge the computer vision community to develop robust image captioning models that advance the state-of-the-art both in terms of accuracy and fairness (i.e. mitigating societal biases). Both of these issues must be addressed fully before image captioning technology can be reliably deployed in a large-scale setting.
The workshop will focus on testing the true limits of image captioning models under the zero-shot image captioning setting. It aims to challenge the models by providing a large-scale evaluation dataset that includes a larger variety of visual concepts from many domains (including new concepts such as COVID-19) as well as various image types (photographs, illustrations, graphics). To accomplish this task, models need to broadly understand language-vision relations and also learn how to combine language components to describe new concepts in images. A challenge on zero-shot image captioning will be held before the workshop, and its results will be shared at the workshop. Because results are reported only on this evaluation-only dataset, the submitted models are challenged to understand new concepts and unseen environments.
Throughout the workshop and challenge, we will cover a broad range of topics on understanding language and images together, so that machines can communicate with humans about what they see in natural language. We therefore plan to invite researchers to give talks on various topics at the intersection of language and vision.
The Fourth Workshop on Fair, Data-efficient, and Trusted Computer Vision Sun 18 Jun 08:00 a.m.
The Workshop on Fair, Data-Efficient, and Trusted Computer Vision will address critical issues in enhancing user trust in AI and computer vision systems, namely: (i) fairness, (ii) data-efficient learning, and key aspects of trust, including (iii) explainability, (iv) robust mitigation of adversarial attacks, and (v) privacy and security in model building, with proper credit assignment to data sources and transparency in lineage.
Workshop: DL-UIA: Deep Learning in Ultrasound Image Analysis Sun 18 Jun 08:20 a.m.
GAZE 2023: The 5th International Workshop on Gaze Estimation and Prediction in the Wild Sun 18 Jun 08:25 a.m.
The 5th International Workshop on Gaze Estimation and Prediction in the Wild (GAZE 2023) at CVPR 2023 aims to encourage and highlight novel strategies for eye gaze estimation and prediction with a focus on robustness and accuracy in extended parameter spaces, both spatially and temporally. This is expected to be achieved by applying novel neural network architectures, incorporating anatomical insights and constraints, introducing new and challenging datasets, and exploiting multi-modal training. Specifically, the workshop topics include (but are not limited to):
- Reformulating eye detection, gaze estimation, and gaze prediction pipelines with deep networks.
- Applying geometric and anatomical constraints into the training of (sparse or dense) deep networks.
- Leveraging additional cues such as contexts from face region and head pose information.
- Developing adversarial methods to deal with conditions where current methods fail (illumination, appearance, etc.).
- Exploring attention mechanisms to predict the point of regard.
- Designing new accurate measures to account for rapid eye gaze movement.
- Novel methods for temporal gaze estimation and prediction including Bayesian methods.
- Integrating differentiable components into 3D gaze estimation frameworks.
- Robust estimation from different data modalities such as RGB, depth, head pose, and eye region landmarks.
- Generic gaze estimation method for handling extreme head poses and gaze directions.
- Temporal information usage for eye tracking to provide consistent gaze estimation on the screen.
- Personalization of gaze estimators with few-shot learning.
- Semi-/weakly-/un-/self-supervised learning methods, domain adaptation methods, and other novel methods towards improved representation learning from eye/face region images or gaze target region images.
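For concreteness, below is a small numpy sketch of two primitives that recur across these topics (conventions differ between datasets; this assumes one common camera-frame convention): converting a predicted (pitch, yaw) pair to a 3D gaze vector, and the angular error in degrees used to compare methods.

```python
# Gaze estimation primitives: pitch/yaw -> unit gaze vector, plus the
# angular-error metric most gaze papers report.
import numpy as np

def pitchyaw_to_vector(pitch, yaw):
    """Camera-frame unit gaze vector from pitch/yaw in radians (one common convention)."""
    return np.array([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ])

def angular_error_deg(g_pred, g_true):
    cos = np.dot(g_pred, g_true) / (np.linalg.norm(g_pred) * np.linalg.norm(g_true))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

g_pred = pitchyaw_to_vector(0.10, 0.20)
g_true = pitchyaw_to_vector(0.12, 0.18)
print(angular_error_deg(g_pred, g_true))  # error in degrees
```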
Workshop: 2nd Monocular Depth Estimation Challenge Sun 18 Jun 08:30 a.m.
Monocular depth estimation (MDE) is an important low-level vision task, with application in fields such as augmented reality, robotics and autonomous vehicles. Recently, there has been an increased interest in self-supervised systems capable of predicting the 3D scene structure without requiring ground-truth LiDAR training data. Automotive data has accelerated the development of these systems, thanks to the vast quantities of data, the ubiquity of stereo camera rigs and the mostly-static world. However, the evaluation process has also remained focused on only the automotive domain and has been largely unchanged since its inception, relying on simple metrics and sparse LiDAR data.
This workshop seeks to answer the following questions:
1. How well do networks generalize beyond their training distribution relative to humans?
2. What metrics provide the most insight into the model’s performance? What is the relative weight of simple cues, e.g. height in the image, in networks and humans?
3. How do the predictions made by the models differ from how humans perceive depth? Are the failure modes the same?
The workshop will therefore consist of two parts: invited keynote talks discussing current developments in MDE and a challenge organized around a novel benchmarking procedure using the SYNS dataset.
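For reference, the "simple metrics" mentioned above are typically the standard depth metrics introduced by Eigen et al.; a minimal numpy sketch, assuming predicted and ground-truth depth maps with a validity mask:

```python
# Standard monocular depth evaluation metrics: absolute relative error,
# RMSE, and the delta < 1.25 accuracy threshold.
import numpy as np

def depth_metrics(pred, gt, mask):
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)       # fraction of pixels within 25% of gt
    return {"AbsRel": abs_rel, "RMSE": rmse, "delta<1.25": delta1}

gt = np.random.uniform(1, 80, (240, 320))            # stand-in ground truth
pred = gt * np.random.uniform(0.9, 1.1, gt.shape)    # stand-in prediction
print(depth_metrics(pred, gt, gt > 0))
```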
19th CVPR Workshop on Perception Beyond the Visible Spectrum (PBVS 2023) Sun 18 Jun 08:30 a.m.
Tutorial: Ronghang Zhu · Xiang Yu · Sheng Li
Recent Advances in Visual Domain Adaptation and Generalization
Workshop: VAND: Visual Anomaly and Novelty Detection Sun 18 Jun 08:30 a.m.
4th International Workshop on Large Scale Holistic Video Understanding Sun 18 Jun 08:30 a.m.
Tutorial: Rakesh “Teddy” Kumar · Chen Chen · Mubarak Shah · Han‐Pang Chiu · Sijie Zhu
A Comprehensive Tour and Recent Advancements toward Real-world Visual Geo-Localization
Precise geo-location of a ground image within a large-scale environment is crucial to many applications, including autonomous vehicles, robotics, wide-area augmented reality and image search. Localizing the ground image by matching it to an aerial/overhead geo-referenced database has gained noticeable momentum in recent years, due to significant growth in the availability of public aerial/overhead data with multiple modalities (such as aerial images from Google/Bing maps, USGS 2D and 3D data, aerial LiDAR data, satellite 3D data, etc.). Matching a ground image to aerial/overhead data, whose acquisition is simpler and faster, also opens more opportunities for industrial and consumer applications. However, cross-view and cross-modal visual geo-localization comes with additional technical challenges due to dramatic changes in appearance between the ground image and the aerial/overhead database, which capture the same scene differently in time, viewpoint, and/or sensor modality. This tutorial will provide a comprehensive review of the research problem of visual geo-localization, covering same-view/cross-time, cross-view, and cross-modal settings for both new and experienced researchers. It also provides connection opportunities for researchers in visual geo-localization and other related fields.
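As an illustration of the basic formulation (a generic sketch, not the tutorial's material): cross-view geo-localization is commonly cast as retrieval, embedding the ground-level query and all geo-tagged aerial tiles with learned encoders and ranking tiles by similarity.

```python
# Cross-view geo-localization as nearest-neighbor retrieval over a database
# of aerial-tile embeddings. Features here are random stand-ins.
import numpy as np

def retrieve(ground_feat, aerial_feats, top_k=5):
    g = ground_feat / np.linalg.norm(ground_feat)
    a = aerial_feats / np.linalg.norm(aerial_feats, axis=1, keepdims=True)
    sims = a @ g                        # cosine similarity to every tile
    return np.argsort(-sims)[:top_k]    # indices of best-matching tiles

aerial_db = np.random.randn(10000, 256)  # one embedding per geo-tagged map tile
query = np.random.randn(256)             # embedding of the ground image
print(retrieve(query, aerial_db))        # candidate geo-locations
```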
Workshop: Generative Models for Computer Vision Sun 18 Jun 08:30 a.m.
Tutorial: Guansong Pang · Joey Tianyi Zhou · Radu Tudor Ionescu · Yu Tian · Kihyuk Sohn
Recent advances in anomaly detection
The tutorial will present a comprehensive review of recent advances in (deep) anomaly detection on image and video data. Three major AD paradigms will be discussed: unsupervised/self-supervised approaches (anomaly-free training data), semi-supervised approaches (a few training anomaly examples are available), and weakly-supervised approaches (video-level labels are available for frame-level detection). Additionally, we will touch on anomaly segmentation tasks, focusing on autonomous driving settings. The tutorial will end with a panel discussion on AD challenges and opportunities.
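As a concrete illustration of the unsupervised paradigm (a generic sketch, not material from the tutorial): train an autoencoder on anomaly-free data only, then score test samples by reconstruction error, since anomalies tend to reconstruct poorly. Architecture and sizes are purely illustrative.

```python
# Reconstruction-based anomaly detection: fit an autoencoder on normal data,
# use per-sample reconstruction error as the anomaly score.
import torch
import torch.nn as nn

ae = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),   # encoder (bottleneck forces compression)
    nn.Linear(64, 784),              # decoder
)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

normal_data = torch.rand(256, 784)   # anomaly-free training set
for _ in range(100):
    recon = ae(normal_data)
    loss = ((recon - normal_data) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def anomaly_score(x):                # higher = more anomalous
    with torch.no_grad():
        return ((ae(x) - x) ** 2).mean(dim=1)
```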
Tutorial: Qirong Ho · Samuel Horvath · Hongyi Wang
ML Systems for Large Models and Federated Learning
This tutorial will teach attendees how to overcome performance, cost, privacy and robustness challenges when using distributed and federated software systems for learning and deploying Computer Vision and ML applications across various hardware settings (networked machines, GPUs, embedded, mobile systems). The audience will learn about theory, implementation and practice of these topics: state-of-the-art approaches and system architectures, forms of distributed parallelism, pitfalls in the measurement of parallel application performance, parallel ML compilers, computation-communication-memory efficiency in federated learning (FL), trustworthy FL, tackling device heterogeneity in FL, and on-device FL systems.
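For background, much FL-systems work builds on the FedAvg baseline (McMahan et al.): clients train locally on their own data and the server averages their weights, weighted by data size. A minimal PyTorch sketch, with model, data, and sizes purely illustrative:

```python
# FedAvg in miniature: local SGD on each client, then a weighted average of
# the clients' parameters on the server.
import copy
import torch
import torch.nn as nn

def local_update(model, data, target, lr=0.1, steps=5):
    m = copy.deepcopy(model)                      # client's local copy
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(m(data), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return m.state_dict(), len(data)

def fedavg(updates):
    """updates: list of (state_dict, n_samples). Returns the averaged state_dict."""
    total = sum(n for _, n in updates)
    return {k: sum(sd[k] * (n / total) for sd, n in updates)
            for k in updates[0][0]}

server = nn.Linear(10, 2)
updates = [local_update(server, torch.randn(32, 10), torch.randint(0, 2, (32,)))
           for _ in range(4)]              # 4 simulated clients
server.load_state_dict(fedavg(updates))    # one communication round
```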
Workshop: XRNeRF: Advances in NeRF for the Metaverse Sun 18 Jun 08:30 a.m.
4th Workshop on Continual Learning in Computer Vision (CLVision) Sun 18 Jun 08:30 a.m.
Incorporating new knowledge in existing models to adapt to novel problems is a fundamental challenge of computer vision. Humans and animals continuously assimilate new experiences to survive in new environments and to improve in situations already encountered in the past. Moreover, while current computer vision models are typically trained on independent and identically distributed (i.i.d.) data, biological systems incrementally learn from non-stationary data distributions. This ability to learn from continuous streams of data, without interfering with previously acquired knowledge and while exhibiting positive transfer, is called Continual Learning. The CVPR Workshop on “Continual Learning in Computer Vision” (CLVision) aims to gather researchers and engineers from academia and industry to discuss the latest advances in Continual Learning. In this workshop, there are regular paper presentations, invited speakers, and technical benchmark challenges to present the current state of the art, as well as the limitations and future directions for Continual Learning, arguably one of the most challenging milestones of AI.
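As one concrete baseline from this literature (a sketch, not the workshop's benchmark): experience replay keeps a small reservoir-sampled memory of past examples and mixes it into each new task's batches, so earlier knowledge is rehearsed rather than overwritten.

```python
# Reservoir-sampling replay buffer, a standard continual-learning baseline:
# every example seen so far is retained with equal probability.
import random

class ReservoirBuffer:
    def __init__(self, capacity=200):
        self.capacity, self.seen, self.data = capacity, 0, []

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:                                   # keep each seen example with
            i = random.randrange(self.seen)     # probability capacity / seen
            if i < self.capacity:
                self.data[i] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

buf = ReservoirBuffer()
for x in range(1000):        # stream of (non-i.i.d.) training examples
    buf.add(x)
replay = buf.sample(32)      # mixed into the current task's mini-batch
```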
Tutorial: Jian Ren · Sergey Tulyakov · Ju Hu
Efficient Neural Networks: From Algorithm Design to Practical Mobile Deployment
This tutorial will introduce effective methodologies for re-designing algorithms for efficient content understanding, image generation, and neural rendering. Most importantly, we show how the algorithms can be efficiently deployed on mobile devices, eventually achieving real-time interaction between users and mobile devices.
Tutorial: Pin-Yu Chen · Chaowei Xiao
Trustworthy AI in the Era of Foundation Models
While machine learning (ML) models have achieved great success in many perception applications, concerns have arisen about their potential security, robustness, privacy, and transparency issues when applied to real-world applications. Irresponsibly applying a foundation model to mission-critical and human-centric domains can lead to serious misuse, inequity issues, negative economic and environmental impacts, and/or legal and ethical concerns. For example, ML models are often regarded as “black boxes” and can produce unreliable, unpredictable, and unexplainable outcomes, especially under domain shifts or maliciously crafted attacks, challenging the reliability of safety-critical applications; Stable Diffusion may generate NSFW content and privacy-violating content.
The goals of this tutorial are to:
Provide a holistic and complementary overview of trustworthiness issues, including security, robustness, privacy, and societal issues, to allow a fresh perspective and some reflection on the induced impacts and responsibilities, as well as to introduce potential solutions.
Promote awareness of the misuse and potential risks in existing AI techniques and, more importantly, to motivate rethinking of trustworthiness in research.
Present case studies from computer vision-based applications.
This tutorial will provide sufficient background for participants to understand the motivation, research progress, known issues, and ongoing challenges in trustworthy perception systems, in addition to pointers to open-source libraries and surveys.
LatinX in Computer Vision Research Workshop Sun 18 Jun 08:30 a.m.
Tutorial: Dacheng Tao · Qiming Zhang · Yufei Xu · Jing Zhang
Vision Transformer: More is different
Catch UAVs that Want to Watch You: Detection and Tracking of Unmanned Aerial Vehicle (UAV) in the Wild and the 3rd Anti-UAV Workshop & Challenge Sun 18 Jun 08:30 a.m.
7th Workshop on Media Forensics Sun 18 Jun 08:45 a.m.
Topological, Algebraic, and Geometric Pattern Recognition with Applications Workshop Sun 18 Jun 08:45 a.m.
FGVC10: 10th Workshop on Fine-grained Visual Categorization Sun 18 Jun 08:45 a.m.
Fine-grained categorization, the precise differentiation between similar plant or animal species, diseases of the retina, architectural styles, etc., is an extremely challenging problem, pushing the limits of both human and machine ability. In these domains expert knowledge is typically required, and the question that must be addressed is how we can develop systems that can efficiently discriminate between large numbers of highly similar visual concepts. The 10th Workshop on Fine-Grained Visual Categorization (FGVC10) explores topics related to supervised learning, self-supervised learning, semi-supervised learning, matching, localization, domain adaptation, transfer learning, few-shot learning, machine teaching, multimodal learning (e.g., audio and video), 3D vision, crowd-sourcing, image captioning and generation, out-of-distribution detection, open-set recognition, human-in-the-loop learning, etc., all through the lens of fine-grained understanding. Topics relevant to FGVC10 are restricted neither to vision nor to categorization. FGVC10 consists of invited talks from world-renowned computer vision experts and domain experts (e.g., art), poster sessions, challenges, and peer-reviewed extended abstracts. To mark FGVC's 10th anniversary, we have confirmed five panellists for a discussion of the history and future of FGVC. We aim to stimulate debate and to expose the wider computer vision community to new challenging problems which have the potential for large societal impact but do not traditionally receive a significant amount of exposure at other CVPR workshops.
3rd International Workshop and Challenge on Long-form Video Understanding and Generation Sun 18 Jun 08:55 a.m.
12th IEEE International Workshop on Computational Cameras and Displays (CCD) Sun 18 Jun 09:00 a.m.
Workshop: EarthVision: Large Scale Computer Vision for Remote Sensing Imagery Sun 18 Jun 09:00 a.m.
Workshop on End-to-end Autonomous Driving Sun 18 Jun 09:00 a.m.
End-to-end autonomous driving, a relatively new paradigm (compared to the modular design) with great potential, has already attracted attention from both academia and industry. This workshop offers a brand-new perspective for discussing broad areas of end-to-end framework design for autonomous driving at the system level. Central to the program is a series of invited talks and four new challenges in the self-driving domain. Each challenge combines new perspectives on multiple components in perception and planning compared to conventional pipelines.
The 4th CVPR Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics Sun 18 Jun 09:00 a.m.
Workshop: Computer Vision for Mixed Reality Sun 18 Jun 09:00 a.m.
2nd Workshop on Tracking and Its Many Guises: Tracking Any Object in Open-World Sun 18 Jun 09:00 a.m.
Fourth Workshop on Neural Architecture Search, Third lightweight NAS challenge Sun 18 Jun 09:00 a.m.
Tutorial: Hila Chefer · Sayak Paul
All Things ViTs: Understanding and Interpreting Attention in Vision
The attention mechanism has revolutionized deep learning research across many disciplines starting from NLP and expanding to vision, speech, and more. Different from other mechanisms, the elegant and general attention mechanism is easily adaptable and eliminates modality-specific inductive biases. As attention becomes increasingly popular, it is crucial to develop tools to allow researchers to understand and explain the inner workings of the mechanism to facilitate better and more responsible use of it. This tutorial focuses on understanding and interpreting attention in the vision and the multi-modal setting. We present state-of-the-art research on representation probing, interpretability, and attention-based semantic guidance, alongside hands-on demos to facilitate interactivity. Additionally, we discuss open questions arising from recent works and future research directions.
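One representative probing tool in this space is attention rollout (Abnar & Zuidema, 2020), which propagates attention through the layers while accounting for residual connections. A minimal numpy sketch, with random attention maps standing in for a real ViT's:

```python
# Attention rollout: accumulate per-layer attention matrices, adding the
# identity to model residual connections, to attribute outputs to inputs.
import numpy as np

def attention_rollout(attentions):
    """attentions: list of (tokens, tokens) head-averaged attention matrices,
    one per layer. Returns the accumulated token-to-token attribution."""
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for A in attentions:
        A_res = 0.5 * (A + np.eye(n))               # residual connection
        A_res = A_res / A_res.sum(-1, keepdims=True)
        rollout = A_res @ rollout
    return rollout

# 12 layers of random row-stochastic attention over 197 tokens (CLS + 196 patches)
layers = [np.random.dirichlet(np.ones(197), size=197) for _ in range(12)]
cls_attr = attention_rollout(layers)[0, 1:]          # CLS -> patch attributions
```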
Tutorial: Jiaming Song · Chenlin Meng · Arash Vahdat
Denoising Diffusion Models: A Generative Learning Big Bang
Diffusion models have been widely adopted in various computer vision applications and are becoming a dominating class of generative models. In the year 2022 alone, diffusion models have been applied to many large-scale text-to-image foundation models, such as DALL-E 2, Imagen, Stable Diffusion and eDiff-I. These developments have also driven novel computer vision applications, such as solving inverse problems, semantic image editing, few-shot textual inversion, prompt-to-prompt editing, and lifting 2D models for 3D generation. This popularity is also reflected in the diffusion models tutorial in CVPR 2022, which has accumulated nearly 60,000 views on YouTube over 8 months. The primary goal of the CVPR 2023 tutorial on diffusion models is to make diffusion models more accessible to a wider computer vision audience and introduce recent developments in diffusion models. We will present successful practices on training and sampling from diffusion models and discuss novel applications that are enabled by diffusion models in the computer vision domain. These discussions will also heavily lean on recent research developments that are released in 2022 and 2023. We hope that this year’s tutorial on diffusion models will attract more computer vision practitioners interested in this topic to make further progress in this exciting area.
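For readers new to the topic, the core of a DDPM fits in a few lines: the closed-form forward noising x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and the epsilon-prediction training loss. A toy PyTorch sketch (the linear "model" is a stand-in for a real denoising network):

```python
# DDPM training step in miniature: sample a timestep, noise the data with the
# closed-form forward process, and regress the network onto the added noise.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)

model = nn.Linear(784 + 1, 784)                # toy eps-predictor (x_t and t in, eps out)

def ddpm_loss(x0):
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps        # forward noising
    inp = torch.cat([x_t, t.unsqueeze(-1).float() / T], dim=-1)
    return ((model(inp) - eps) ** 2).mean()             # predict the added noise

loss = ddpm_loss(torch.randn(16, 784))
loss.backward()
```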
Workshop: Visual Perception via Learning in an Open World Sun 18 Jun 09:00 a.m.
CVPR 2023 Biometrics Workshop Sun 18 Jun 09:00 a.m.
Tutorial: Kai Chen · Conghui He · Yanhong Zeng · Songyang Zhang · Wenwei Zhang
Boosting Computer Vision Research with OpenMMLab and OpenDataLab
This tutorial will introduce two open platforms which can significantly accelerate research in computer vision: OpenMMLab and OpenDataLab.
OpenMMLab is an open-source algorithm platform for computer vision. It aims to provide a solid benchmark and promote reproducibility for academic research. We have released more than 30 high-quality projects and toolboxes in various research areas such as image classification, object detection, semantic segmentation, action recognition, etc. OpenMMLab has made public more than 300 algorithms and 2,400 checkpoints. Over the past years, OpenMMLab has gained popularity in both academia and industry. It receives over 78,000 stars on GitHub and involves more than 1,700 contributors in the community.
OpenDataLab, initially released in March 2022, is an open data platform for artificial intelligence that notably includes a large number of datasets for computer vision.
3rd Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings Sun 18 Jun 09:00 a.m.
Workshop on Autonomous Driving (WAD) Sun 18 Jun 09:15 a.m.
The CVPR 2023 Workshop on Autonomous Driving (WAD) aims to gather researchers and engineers from academia and industry to discuss the latest advances in perception for autonomous driving. In this full-day workshop, we will host speakers as well as technical benchmark challenges to present the current state of the art, limitations and future directions in the field - arguably one of the most promising applications of computer vision and artificial intelligence. The previous chapters of this workshop attracted hundreds of researchers to attend. This year, multiple industry sponsors are also joining our organizing efforts to push it to a new level.
6th Multi-modal Learning and Applications Workshop (MULA) Sun 18 Jun 09:15 a.m.
The exploitation of the power of big data in the last few years has led to a big step forward in many applications of computer vision. However, most of the tasks tackled so far involve the visual modality only, mainly due to the unbalanced number of labelled samples available across modalities (e.g., there are many huge labelled datasets for images but not as many for audio or IMU-based classification), resulting in a huge gap in performance when algorithms are trained separately.
Recently, a few works have started to exploit the synchronization of multimodal streams (e.g., audio/video, RGB/depth, RGB/Lidar, visual/text, text/audio) to transfer semantic information from one modality to another, reaching surprising results. Interesting applications are also proposed in a self-supervised fashion, where multiple modalities learn correspondences without the need for manual labelling, resulting in a more powerful set of features compared to those learned by processing the two modalities separately. Other works have also shown that particular training paradigms allow neural networks to perform well when one of the modalities is missing due to sensor failure or unfavorable environmental conditions. These topics have gained significant interest in the computer vision community in recent years.
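A minimal sketch of such a self-supervised correspondence objective (generic, not from any specific paper above): a symmetric InfoNCE loss that pulls together embeddings of synchronized pairs, e.g. a video clip and its audio track, and pushes apart mismatched pairs, with no manual labels.

```python
# Cross-modal contrastive learning: synchronized pairs sit on the diagonal of
# the similarity matrix and serve as free supervision.
import torch
import torch.nn.functional as F

def cross_modal_nce(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: (N, D) embeddings of N synchronized pairs from two modalities."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.T / temperature          # (N, N) similarity matrix
    targets = torch.arange(len(a))          # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = cross_modal_nce(torch.randn(64, 256), torch.randn(64, 256))
```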
The fusion of information from multiple sensors is a topic of major interest in industry as well; the exponential growth of companies working on automotive, drone vision, surveillance or robotics is just one indicator. Many companies are trying to automate processes by using a large variety of control signals from different sources. The aim of this workshop is to generate momentum around this topic of growing interest, and to encourage interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities, serving as a forum for research groups from academia and industry.
We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, synthetic, etc. Position papers with feasibility studies and cross-modality issues with highly applicative flair are also encouraged. Multimodal data analysis is a very important bridge among vision, multimedia, remote sensing, and robotics, therefore we expect a positive response from these communities.
Workshop: New Frontiers in Visual Language Reasoning: Compositionality, Prompts and Causality Sun 18 Jun 09:15 a.m.
Recent years have seen the stunning powers of Visual Language Pre-training (VLP) models. Although VLPs have revolutionized some fundamental principles of visual language reasoning (VLR), several remaining problems prevent them from “thinking” like a human being: how to reason about the world by breaking it into parts (compositionality), how to generalize to novel concepts given a glimpse of in-context demonstrations (prompts), and how to debias visual language reasoning by imagining what would have happened in counterfactual scenarios (causality).
The workshop provides the opportunity to gather researchers from different fields to review the technology trends of these three lines of work, to better endow VLPs with these reasoning abilities. Our workshop also includes two multi-modal reasoning challenges on cross-modal math-word calculation and proving problems. The challenges are practical and closely tied to these issues, shedding further light on the new frontiers of visual language reasoning.
CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling (Workshop) Sun 18 Jun 09:20 a.m.
Many biological organisms have evolved to exhibit diverse quintessential behaviors via physical and social interactions with their surroundings, and understanding these behaviors is a fundamental goal of multiple disciplines including neuroscience, biology, medicine, behavioral science, and sociology. For example, ethogramming characterizes behavioral states and their transitions, which further provides a scientific basis for understanding innate human behaviors, e.g., decision-making, attention, and group behaviors. These analyses require objective, repeatable, and scalable measurements of animal behaviors that are not possible with existing methodologies that rely on manual encoding by animal experts and specialists. Recently, computer vision has been making a groundbreaking impact by providing a new tool that enables computational measurement of these behaviors.
The workshop offers invited talks, orals, and poster sessions by leading scientists in the field from computer vision, neuroscience, and biology. Our webpage lists the full schedule, accepted papers, and posters.
Tutorial: Xin Li · Lan Xu · Yu Ding
Skull Restoration, Facial Reconstruction and Expression
This tutorial focuses on the challenges of reconstructing a 3D model of a human face and generating facial expressions. It comprises three parts, covering facial reconstruction from skeletal remains, 4D dynamic facial performance capture, and audio-driven talking face generation. First, face modeling is a fundamental technique with broad applications in animation, vision, games, and VR. Facial geometries are fundamentally governed by their underlying skull and tissue structures. This session covers the forensic task of facial reconstruction from skeletal remains, in which we will discuss how to restore fragmented skulls, model anthropological features, and reconstruct human faces upon skulls. Then, we will detail how to capture 4D facial performance, which is the foundation for face modeling and rendering. We will consider the hardware designs for cameras, sensors, and lighting, and the steps to obtain dynamic facial geometry along with physically-based textures (pore-level diffuse albedo, specular intensity, normals, etc.). We will discuss the two complementary workhorses, multi-view stereo and photometric stereo, and their combination with advances in neural rendering and medical imaging. Finally, talking face generation will be discussed, including 3D animation parameters and 2D photo-realistic video, as well as their applications. It aims to create a talking video of a speaker with authentic facial expressions from a simultaneous speech input. The face identity may come from a predefined 3D virtual character, a single image, or a few minutes of footage of a specific speaker.
Workshop: End-to-End Autonomous Driving: Perception, Prediction, Planning and Simulation Sun 18 Jun 12:30 p.m.
1st Workshop on Compositional 3D Vision & 3DCoMPaT Challenge Sun 18 Jun 12:45 p.m.
6th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues Sun 18 Jun 12:45 p.m.
Workshop: AVA: Accessibility, Vision, and Autonomy Meet Sun 18 Jun 01:00 p.m.
The goal of this workshop is to gather researchers, students, and advocates who work at the intersection of accessibility, computer vision, and autonomous and intelligent systems. In particular, we plan to use the workshop to identify challenges and pursue solutions for the current lack of shared and principled development tools for vision-based accessibility systems. For instance, there is a general lack of vision-based benchmarks and methods relevant to accessibility (e.g., people using mobility aids are currently mostly absent from large-scale datasets in pedestrian detection). Towards building a community of accessibility-oriented research in computer vision conferences, we also introduce a large-scale fine-grained computer vision challenge. The challenge involves visual recognition tasks relevant to individuals with disabilities. We aim to use the challenge to uncover research opportunities and spark the interest of computer vision and AI researchers working on more robust and broadly usable visual reasoning models in the future. An interdisciplinary panel of speakers will further provide an opportunity for fostering a mutual discussion between accessibility, computer vision, and robotics researchers and practitioners.
Workshop: High-fidelity Neural Actors Sun 18 Jun 01:00 p.m.
Workshop: 4D Hand Object Interaction: Geometric Understanding and Applications in Dexterous Manipulation Sun 18 Jun 01:00 p.m.
The Fifth Workshop on Precognition: Seeing through the Future Sun 18 Jun 01:00 p.m.
Vision-based detection and recognition studies have recently achieved highly accurate performance and have bridged the gap between research and real-world applications. Beyond these well-explored detection and recognition capabilities of modern algorithms, vision-based forecasting will likely be one of the next big research topics in the field of computer vision. Vision-based prediction is one of the critical capabilities of humans, and the potential success of automatic vision-based forecasting will empower and unlock human-like capabilities in machines and robots.
One important application is in autonomous driving technologies, where a vision-based understanding of a traffic scene and prediction of the movement of traffic actors is a critical piece of the autonomous puzzle. Various sensors such as cameras and lidar are used as the "eyes" of a vehicle, and advanced vision-based algorithms are required to allow safe and effective driving. Another area where vision-based prediction is used is the medical domain, allowing deep understanding and prediction of future medical conditions of patients. However, despite its potential and relevance for real-world applications, visual forecasting or precognition has not been the focus of new theoretical studies and practical applications as much as detection and recognition problems.
Through the organization of this workshop, we aim to facilitate further discussion and interest within the research community regarding this nascent topic. This workshop will discuss recent approaches and research trends not only in anticipating human behavior from videos but also in precognition for multiple other visual applications, such as medical imaging, healthcare, human face aging prediction, early event prediction, autonomous driving forecasting, etc.
QCVML: Quantum Computer Vision and Machine Learning Workshop Sun 18 Jun 01:00 p.m.
1st workshop on Capturing, Interpreting & Visualizing Indoor Living Spaces Sun 18 Jun 01:00 p.m.
With recent advances in AR/VR, a wider range of applications has been emerging, such as virtual touring, Building Information Modeling (BIM), e.g., floorplan generation, and holistic 3D understanding. Such applications have attracted a lot of interest from both academia and industry and motivated significant investment in the form of dataset collection, research, publications and products. A few recent examples of such datasets are the Zillow Indoor Dataset (ZInD), Apple's ARKit Scenes dataset and Facebook's Habitat-Matterport dataset. The size and unique types of annotations provided by each of these datasets offer a huge opportunity for CV/ML researchers to focus on different aspects of scene and environment understanding beyond what was possible before.
Motivated by the recent release of datasets such as the Zillow Indoor Dataset (ZInD), Apple's ARKit Scenes dataset and Facebook's Habitat-Matterport dataset, in this workshop we would like to bring industry and academia together and encourage both to focus on specific underexplored aspects of environment understanding. We encourage researchers to go beyond "scene understanding" and explore "environment understanding" with a focus on understanding structure through tasks such as 2D/3D room layout estimation, understanding the relation of "rooms" for floorplan generation, localization of media within rooms and floorplans, and localization of objects within rooms and floorplans. Image, geometric, and semantic information can also be used to reimagine the appearance of home interiors in a photorealistic manner.
Workshop: Computer Vision for Fashion, Art, and Design Sun 18 Jun 01:00 p.m.
Creative domains make up a big part of modern society, having a strong influence on the economy and cultural life. Much effort within creative domains, such as fashion, art and design, centers on the creation, consumption, manipulation and analytics of visual content. In recent years, there has been an explosion of research applying machine learning and computer vision algorithms to various aspects of the creative domains. For four years in a row, the CVFAD workshop series has captured important trends and new ideas in this area. At CVPR 2023, we will continue to bring together artists, designers, and computer vision researchers and engineers, and keep growing the workshop as a space for conversations and idea exchanges at the intersection of computer vision and creative applications.
The 3rd Workshop on Light Fields for Computer Vision LFNAT: New Applications and Trends in Light Fields Sun 18 Jun 01:30 p.m.
4D light fields can capture both the intensity and the direction of light rays, recording 3D geometry in a convenient and efficient manner. In the past few years, various areas of research have tried to use light fields to obtain superior performance by exploiting this internal structure information. Light fields have been widely used with remarkable results in applications such as depth estimation and super-resolution, while attempts in other applications such as object detection and semantic segmentation are still at a preliminary stage, due to the lack of corresponding datasets and the incompatibility between redundant context information and limited memory. Meanwhile, as more and more novel and powerful technologies like Neural Radiance Fields and Multiplane Images are introduced into computer vision, there will be plenty of opportunities and challenges in incorporating them with light fields. To this end, this workshop focuses on two brand-new topics. The first is to introduce light fields into more application areas, break through the bottleneck between rich structural information and limited memory, and achieve stable performance. The second is to explore how to introduce emerging technologies from other research fields into light fields to create new technological effects and drive competition. Besides, this workshop also hosts competitions on light field semantic segmentation and depth estimation to invite more researchers to the field.
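For background, the classic shift-and-add refocusing operation illustrates how the 4D structure L(u, v, s, t) is exploited: each sub-aperture view is shifted in proportion to its angular offset and the views are averaged, focusing the image at a chosen depth plane. A numpy sketch with a random stand-in light field:

```python
# Shift-and-add refocusing over a 4D light field (angular dims U, V; spatial
# dims H, W). The shift slope selects the synthetic focal plane.
import numpy as np

def refocus(lf, slope):
    """lf: (U, V, H, W) grayscale light field; slope: pixel shift per angular step."""
    U, V, H, W = lf.shape
    out = np.zeros((H, W))
    for u in range(U):
        for v in range(V):
            du = int(round(slope * (u - U // 2)))
            dv = int(round(slope * (v - V // 2)))
            out += np.roll(lf[u, v], shift=(du, dv), axis=(0, 1))
    return out / (U * V)

lf = np.random.rand(9, 9, 128, 128)   # stand-in for a captured light field
img = refocus(lf, slope=1.0)          # refocused 2D image
```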
Tutorial: James Demmel · Yang You
Large-scale Deep Learning Optimization Techniques
Large Transformer models have performed promisingly on a wide spectrum of AI and CV applications. These results have stimulated a recent surge of extremely large models. However, training these models generally requires more computation and training time. This has generated interest in both academia and industry in scaling up deep learning (DL) using distributed training on high-performance computing (HPC) resources like TPU and GPU clusters.
However, continuously adding more devices will not scale training as intended, since training at a large scale requires overcoming both algorithmic and systems-related challenges. This limitation prevents DL and CV researchers from exploring more advanced model architectures.
Many existing works investigate and develop optimization techniques that overcome these problems and accelerate large-model training at larger scale. We categorize these works as improving either model accuracy or model efficiency. One method to maintain or improve model accuracy in a large-scale setting, while still maintaining computing efficiency, is to design algorithms that require less communication and memory. Notably, these are not mutually exclusive goals but can be optimized together to further accelerate training. This tutorial helps CV community members quickly master optimizations for large-scale DL training and successfully train large models at scale with different optimization techniques in a distributed environment.
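As a simple concrete instance from this family of techniques (an illustration, not the tutorial's contribution): gradient accumulation simulates a large global batch on limited device memory by summing gradients over several micro-batches before each optimizer step.

```python
# Gradient accumulation: loss is scaled by the number of micro-batches so the
# accumulated gradient equals that of one large batch.
import torch
import torch.nn as nn

model = nn.Linear(512, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8                                  # global batch = 8 micro-batches

stream = [(torch.randn(16, 512), torch.randint(0, 10, (16,)))] * 64  # toy data
for step, (x, y) in enumerate(stream):
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        opt.step()                               # one update per global batch
        opt.zero_grad()
```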
Workshop: Pixel-level Video Understanding in the Wild Challenge Sun 18 Jun 01:30 p.m.
Pixel-level scene understanding is one of the fundamental problems in computer vision, which aims at recognizing the object class, mask and semantics of each pixel in a given image. Since the real world is dynamic rather than static, learning to perform video semantic/panoptic segmentation is more reasonable and practical for realistic applications. To advance the semantic/panoptic segmentation task from images to videos, we present two large-scale datasets (VSPW and VIPSeg) and a competition in this workshop, aiming at the challenging yet practical task of Pixel-level Video Understanding in the Wild (PVUW).
The Fourth Workshop on Face and Gesture Analysis for Health Informatics (FGAHI) Sun 18 Jun 01:30 p.m.
Tutorial: Wenjin Wang · Xuyu Wang · Jun Luo
Contactless Healthcare using Cameras and Wireless Sensors
Extracting health-related metrics is an emerging computer vision research topic that has grown rapidly in recent years. Without needing physical contact, cameras have been used to measure vital signs remotely (e.g. heart and respiration rates, blood oxygen saturation, body temperature, etc.) from images/video of the skin or body, leading to contactless, continuous and comfortable health monitoring. Cameras can also leverage computer vision and machine learning techniques to measure human behaviors/activities and high-level visual semantic/contextual information, facilitating better understanding of people and scenes for health monitoring and providing a unique advantage compared to contact bio-sensors. RF (radar, WiFi, RFID) and acoustic methods for health monitoring have also been proposed. The rapid development of computer vision and RF sensing also gives rise to new multi-modal learning techniques that expand sensing capability by combining two modalities, while minimizing the need for human labels. The hybrid approach may further improve monitoring performance, for example by using camera images as beacons to guide human-activity learning from RF signals. Contactless monitoring will enable a rich set of compelling healthcare applications that directly improve upon contact-based monitoring solutions and improve people's care experience and quality of life, such as in hospital care units, sleep/senior centers, assisted-living homes, telemedicine and e-health, fitness and sports, and driver monitoring in automotive settings.
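As an illustration of the camera-based vital-sign pipeline (a deliberately simplified sketch; practical rPPG methods such as CHROM or POS are more robust): average the green channel over a skin region per frame, restrict the spectrum to plausible heart-rate frequencies, and take the dominant FFT peak.

```python
# Toy remote photoplethysmography (rPPG): dominant frequency of the skin-ROI
# green-channel trace within the 42-240 bpm band gives the heart rate.
import numpy as np

def heart_rate_bpm(green_trace, fps=30.0):
    """green_trace: mean green-channel value of a skin ROI, one entry per frame."""
    x = green_trace - np.mean(green_trace)           # remove DC component
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= 0.7) & (freqs <= 4.0)           # 42-240 bpm
    return 60.0 * freqs[band][np.argmax(power[band])]

t = np.arange(0, 20, 1 / 30.0)                       # 20 s of 30 fps "video"
trace = 0.02 * np.sin(2 * np.pi * 1.2 * t) + np.random.normal(0, 0.01, t.size)
print(heart_rate_bpm(trace))                         # ~72 bpm for this signal
```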