Insights into Top Paper Nominee, “What Can Human Sketches Do for Object Detection?”

A Q&A with the Authors

Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Aneeshan Sain, Subhadeep Koley, Tao Xiang, Yi-Zhe Song

Paper Presentation: Thursday, 22 June, 3:30 p.m. PDT, East Exhibit Halls A-B

In the CVPR 2023 paper, “What Can Human Sketches Do for Object Detection?,” the authors propose a framework that detects objects based on what you sketch. The following Q&A explores how it works and what's next.

CVPR: Will you please share a little more about your work and results? How is it different than the standard approaches to date?

Sketches are highly expressive in that they inherently capture subjective and fine-grained visual cues. However, the exploration of such innate properties of human sketch has been limited to that of image retrieval.

Hence, we want to reimagine the importance of sketch for the fundamental vision task of object detection. The resulting sketch-enabled framework opens up new avenues for object detection that were not possible in prior literature.

Our proposed framework can detect objects based on what you sketch: (i) Detect “that” zebra (i.e., one that is eating the grass) in a herd of zebras. We call this instance-aware object detection. (ii) Detect only the “head” of “that” zebra. We call this part-aware object detection.

To put this in context, imagine trying to achieve instance-aware or part-aware object detection using an alternative modality such as text. Describing a specific animal in a herd is cumbersome (e.g., you would have to describe “that giraffe with its legs at 90 degrees, standing in the middle and drinking water”).

In addition to opening new avenues in object detection (e.g., instance-aware and part-aware), we show that human sketches can also reimagine the way we train our object detectors. Object detectors were traditionally trained in a supervised setup (with bounding box annotations and class labels). This high annotation cost was later reduced with weakly supervised setups (only class labels, no bounding boxes).

Despite the cheaper annotation cost, weakly supervised setups still require us to collect scene-level images, annotate class labels, and train a data-hungry neural network. Hence, we propose Extremely Weakly Supervised Object Detection (EWSOD), which removes the need to collect scene-level images (with good diversity of objects and backgrounds) and works using only a few object-level sketch/photo pairs. To achieve EWSOD we propose “Tiling.” Note that Tiling is essentially a reinterpretation of CutMix (used for data augmentation). The key difference is its motivation: Tiling is a data synthesis tool for EWSOD, whereas CutMix is a data augmentation tool for robustness.
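As a rough illustration of the Tiling idea described above, the following Python sketch pastes object-level images into a grid to synthesise a scene-level image together with the bounding boxes that fall out for free. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `tile_objects`, the fixed square grid, the equal tile size, and the use of nested lists as images are all hypothetical choices made for brevity.

```python
import random


def tile_objects(objects, grid=2, tile=4, seed=0):
    """Paste object-level images into a grid x grid canvas (hypothetical helper).

    objects: list of (label, image) pairs, where each image is assumed to be
             a tile x tile list of pixel rows.
    Returns (canvas, boxes); each box is (label, x0, y0, x1, y1).
    """
    rng = random.Random(seed)
    # Sample one object per grid cell (with replacement).
    chosen = [rng.choice(objects) for _ in range(grid * grid)]
    size = grid * tile
    canvas = [[0] * size for _ in range(size)]
    boxes = []
    for idx, (label, image) in enumerate(chosen):
        row, col = divmod(idx, grid)
        y0, x0 = row * tile, col * tile
        # Copy the object image into its cell of the canvas.
        for dy in range(tile):
            canvas[y0 + dy][x0:x0 + tile] = list(image[dy])
        # The cell location doubles as the bounding-box annotation.
        boxes.append((label, x0, y0, x0 + tile, y0 + tile))
    return canvas, boxes


objects = [
    ("zebra", [[1] * 4 for _ in range(4)]),
    ("giraffe", [[2] * 4 for _ in range(4)]),
]
canvas, boxes = tile_objects(objects)
```

The point of the sketch is that no scene-level image or manual box annotation is needed: the synthetic scene and its boxes are both produced from object-level data alone.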

Finally, and most importantly, we achieve all the aforementioned objectives without reinventing the wheel. Instead of devising a sketch-enabled object detection model from the ground up, we show that an intuitive synergy between foundation models (e.g., CLIP) and off-the-shelf sketch-based image retrieval (SBIR) models can already, rather elegantly, solve the problem: CLIP provides model generalisation, and SBIR bridges the sketch-to-photo gap.
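One way to picture how a retrieval model could drive detection is to score candidate image regions against a query sketch in a shared embedding space and keep the best matches. The Python sketch below illustrates only that scoring step. It assumes the embeddings have already been produced by some sketch/photo encoder (not shown), and the function names, the cosine-similarity score, and the threshold value are hypothetical; the interview does not specify these details.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def detect_by_sketch(sketch_emb, regions, threshold=0.5):
    """Rank candidate regions by similarity to a sketch embedding.

    regions: list of (box, embedding) pairs; the embeddings are assumed to
             live in the same space as sketch_emb.
    Returns a list of (score, box) with score >= threshold, best first.
    """
    scored = [(cosine(sketch_emb, emb), box) for box, emb in regions]
    kept = [s for s in scored if s[0] >= threshold]
    return sorted(kept, key=lambda s: s[0], reverse=True)


regions = [
    ((0, 0, 10, 10), [1.0, 0.0]),
    ((5, 5, 20, 20), [0.9, 0.1]),
    ((30, 30, 40, 40), [0.0, 1.0]),
]
result = detect_by_sketch([1.0, 0.0], regions)
```

Here the third region is orthogonal to the query and is dropped, while the other two are returned in order of similarity.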

CVPR: How did your model outperform other options? What was the key factor in these results?

We perform quantitative evaluation of our model for category-level and instance-aware object detection.

Despite being trained in the challenging EWSOD setup, our proposed method for category-level object detection outperforms the best supervised setups (SOD) by 14.7 (on the VOC 2007 dataset) and 10.9 (on the MS-COCO dataset), and the best weakly supervised setups (WSOD) by 16.4 (on VOC 2007) and 11.0 (on MS-COCO).

A similar pattern is observed for fine-grained (or instance-aware) object detection. Our proposed method outperforms the best supervised setups (SOD) by 4.6 (in AP @ 0.3), 5.0 (in AP @ 0.5), and 4.6 (in AP @ 0.7). Similarly, we outperform weakly supervised setups (WSOD) by 4.7 (in AP @ 0.3), 5.2 (in AP @ 0.5), and 5.8 (in AP @ 0.7).
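For readers unfamiliar with the notation, the Python sketch below assumes that the thresholds in “AP @ 0.3/0.5/0.7” refer to intersection-over-union (IoU) between a predicted and a ground-truth box, as is conventional in object detection. The interview does not spell this out, so treat it as an assumption; the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(pred, truth, thresholds=(0.3, 0.5, 0.7)):
    """Report whether a prediction counts as correct at each IoU threshold."""
    overlap = iou(pred, truth)
    return {t: overlap >= t for t in thresholds}


# Two 10x10 boxes offset by 5 pixels overlap by one third.
hits = matches((0, 0, 10, 10), (5, 0, 15, 10))
```

In this example the prediction counts as a hit at 0.3 but not at 0.5 or 0.7, which is why scores at stricter thresholds are harder to improve.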

The key factor in these results is the extreme generalisation potential of training object detection using a strong CLIP-based SBIR.

In particular, our proposed CLIP-based SBIR method (adapted via prompt learning) outperforms the best category-level SBIR competitor on mAP by 3.2 for 100% train data, 3.7 for 70% train data, and 4.8 for 50% train data. For cross-category fine-grained sketch-based image retrieval, it outperforms the best competitors on Acc.@1 by 5.1 (100% train data), 4.7 (70% train data), and 5.5 (50% train data).
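The two retrieval metrics named above can be sketched in a few lines of Python. This is a generic illustration of mAP and Acc.@k using their standard definitions, not the authors' evaluation code. Each query is represented by a ranked list of 0/1 relevance flags, and the average precision here assumes every relevant item appears somewhere in that ranked list.

```python
def average_precision(relevance):
    """Average precision for one query, given ranked 0/1 relevance flags."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision."""
    return sum(average_precision(q) for q in queries) / len(queries)


def acc_at_k(queries, k=1):
    """Fraction of queries with at least one relevant item in the top k."""
    return sum(1 for q in queries if any(q[:k])) / len(queries)


queries = [[1, 0], [0, 1]]
map_score = mean_average_precision(queries)
top1 = acc_at_k(queries, k=1)
```

Category-level retrieval is usually judged with mAP because many gallery photos are relevant to a query, whereas fine-grained retrieval has a single correct photo per sketch, which makes Acc.@1 the natural measure.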

CVPR: So, what’s next? What do you see as the future of your research?

I would love future research to further explore the potential of sketches beyond image retrieval. Sketches are human-centered and have subjectivity and rich semantics built in, which should help progress computer vision as a discipline.

Integrating this expressivity with computer vision can result in a class of future algorithms that are more human-centered, have no learning curve, and are intuitive to everyone, be it an AI researcher, a person familiar with computers, a person who has never seen a computer, an infant who has yet to learn how to speak or write, or even non-humans (e.g., chimpanzees can recognise a sketch).

At SketchX, our ultimate vision is to understand how seeing can be explained by drawing. In other words, we ask how a better understanding of human sketch data can be translated into insights on how human visual systems operate, and in turn how such insights can benefit computer vision and cognitive science at large.

Towards this goal, we work on multiple interesting problems like how sketches can be used as a weak label to detect salient objects present in an image, or how sketch forms an ideal basis providing a natural interface to study explainability and many more. For more interesting work (such as human-drawn sketch to photo generation, adaptation of CLIP for sketch-based image retrieval, information disentanglement and bottleneck for sketch-photo-text synergy) check our group SketchX at

CVPR: What more would you like to add?

I would like to thank the CVPR committee for recognising the potential of our work, not because of what we did in this particular paper, but because of the research direction we hope will gain traction. Additionally, Ma and Baba: your kid got some of the attention he has been crying for since age 7.

Annually, CVPR recognizes top research in the field through its prestigious “Best Paper Awards.” This year, from more than 9,000 paper submissions, the CVPR 2023 Paper Awards Committee selected 12 candidates for the coveted honor of Best Paper. Join us for the Award Session on Wednesday, 21 June at 8:30 a.m. to find out which nominees take home the distinction of “Best Paper” at CVPR 2023.