New Frontiers for Zero-Shot Image Captioning Evaluation

Taehoon Kim · Kyoung Mu Lee · Seung Hwan Kim · Alessandra Sala · Bohyung Han · Mark Marsden · Sihaeng Lee · Pyunghwan Ahn · Sangyun Kim

West 116

Keywords: Vision+language

Sun 18 Jun, 8 a.m. PDT

The purpose of this workshop is to challenge the computer vision community to develop robust image captioning models that advance the state of the art in both accuracy and fairness (i.e., mitigating societal biases). Both issues must be fully addressed before image captioning technology can be reliably deployed at scale.

The workshop will focus on testing the true limits of image captioning models in the zero-shot setting. It aims to challenge the models with a large-scale evaluation dataset that includes a larger variety of visual concepts from many domains (including new concepts such as COVID-19) as well as various image types (photographs, illustrations, graphics). To accomplish this task, the models need to understand language-vision relations broadly and also learn how to combine language components to describe a new visual concept. Before the workshop, a challenge on zero-shot image captioning will be held, and the results will be shared at the workshop. Because results are provided only on the limited evaluation dataset, the submitted models will be challenged to understand new concepts and unseen environments.

Throughout the workshop and challenge, we will cover a broad range of topics on jointly understanding language and images, so that machines can communicate with humans in natural language about what they see. To that end, we plan to invite researchers to give talks on various topics at the intersection of language and vision.
