State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concepts.
Recent works show that learning from large-scale image-text data is a promising approach to building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks. For example, CLIP, ALIGN and Florence for image classification, ViLD, RegionCLIP, GLIP and OWL-ViT for object detection, GroupViT, OpenSeg, MaskCLIP, X-Decoder, Segment Anything (SAM) and SEEM for segmentation, Multimodal GPT-4, LLaVA and MiniGPT4 for langauge-and-image instruction-following chat assistants. These vision models with language or interactive interface are naturally open-vocabulary recogntion models, showing superior zero-shot and few-shot adaption performance on various real-world scenarios.
We host this "Computer Vision in the Wild (CVinW)" workshop, aiming to gather academic and industry communities to work on CV and MM problems in real-world scenarios, focusing on the challenge of open-set/domain visual recognition at different granularities and efficient task-level transfer. To measure the progress of CVinW, we develop new benchmarks for image classification, object detection and segmentation to measure the task-level transfer ablity of various models/methods over diverse real-world datasets, in terms of both prediction accuracy and adaption efficiency. This workshop is a continuation of our ECCV 2022 CVinW Workshop. For those who are new to this topic, please check out the CVinW Reading List.
In this year, our workshop will host two new challenges:
- Segmentation in the Wild (SGinW): Open-set instance/semantic/panoptic segmentation on dozens of semgnetaion datasets in the realistic scenarios.
- Roboflow 100 for Object Detection in the Wild: An augmented version of our ODinW by increasing the datasets to hundreds to cover more diverse application domains.