Insights into Top Paper Nominee, “Visual Programming: Compositional visual reasoning without training”

A Q&A with the Authors

Tanmay Gupta, Aniruddha Kembhavi

Paper Presentation: Thursday, 22 June, 3:20 p.m. PDT, East Exhibit Halls A-B

For the authors of the CVPR 2023 paper “Visual Programming: Compositional visual reasoning without training,” their work is opening new doors, particularly as Visual Programming may become the preferred way of building do-all AI systems. In the following Q&A, the authors explain how their work is supporting industry evolution.

CVPR: Will you please share a little more about your work and results? How is it different from the standard approaches to date?
Visual Programming is a new way of approaching complex multi-step visual tasks, enabled by the code-generation capabilities of large language models (LLMs). The traditional approach to solving such tasks in the deep learning era requires building massive multitask models, which is not only computationally expensive and technically challenging, but also rigid and difficult to extend to new tasks.

Visual Programming simplifies the process of building flexible vision systems by using LLMs to compose programmatic solutions for a task described in natural language, using existing neural models, image processing libraries, and other programs as building blocks. VisProg is our specific implementation of this framework, which uses the in-context learning ability of GPT-3 to generate Python programs. Each step in the program invokes a VisProg module for visual understanding, image manipulation, knowledge retrieval, or arithmetic and logical operations.
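
To make the idea of a program whose steps each invoke a module more concrete, here is a minimal sketch of a step-by-step interpreter. It is an illustration under stated assumptions, not the authors' implementation: the module names (Loc, Count, Greater), the program text, the toy "image" dictionary, and the run_program helper are all hypothetical stand-ins. In a real system the modules would wrap pretrained models or image processing routines, and the program would come from the LLM rather than being hard-coded.

```python
import ast

# Hypothetical stand-ins for modules. The "image" here is just a dict
# mapping a label to a list of boxes, so no real vision model is involved.
def loc(image, label):
    """Pretend detector: return the boxes stored under this label."""
    return image.get(label, [])

def count(box):
    """Count the boxes returned by a previous step."""
    return len(box)

def greater(a, b):
    """Simple logical module comparing two earlier results."""
    return a > b

MODULES = {"Loc": loc, "Count": count, "Greater": greater}

# Assumed program text: one module call per line, keyword arguments only.
PROGRAM = """
BOX0=Loc(image=IMAGE,label='cat')
BOX1=Loc(image=IMAGE,label='dog')
N0=Count(box=BOX0)
N1=Count(box=BOX1)
ANSWER=Greater(a=N0,b=N1)
"""

def run_program(program, inputs, modules):
    """Execute the program line by line, recording every intermediate result."""
    state = dict(inputs)
    trace = []
    for raw_line in program.strip().splitlines():
        line = raw_line.strip()
        stmt = ast.parse(line).body[0]      # a single assignment statement
        target = stmt.targets[0].id         # variable that receives the output
        call = stmt.value                   # the module call on the right side
        kwargs = {}
        for kw in call.keywords:
            if isinstance(kw.value, ast.Name):
                # Argument refers to an earlier step's output or a program input.
                kwargs[kw.arg] = state[kw.value.id]
            else:
                # Argument is a literal such as 'cat'.
                kwargs[kw.arg] = ast.literal_eval(kw.value)
        state[target] = modules[call.func.id](**kwargs)
        trace.append((line, state[target]))
    return state, trace

TOY_IMAGE = {"cat": [(0, 0, 10, 10), (5, 5, 20, 20)], "dog": [(1, 1, 4, 4)]}
state, trace = run_program(PROGRAM, {"IMAGE": TOY_IMAGE}, MODULES)

for step, result in trace:
    print(f"{step}  ->  {result!r}")
```

Because the interpreter stores the output of every step, the list of (step, result) pairs can be inspected after the run, which is one simple way to think about how a program-based approach exposes its intermediate reasoning.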

Tasks that typically require tens of thousands or even millions of training samples are easily solved by VisProg with fewer than 20 in-context examples of instruction-program pairs.
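
As a rough illustration of what in-context instruction-program pairs could look like when assembled into a prompt, the sketch below joins a few example pairs with a new instruction and leaves the final program blank for the language model to complete. The example instructions, the program text, the "Instruction:"/"Program:" layout, and the build_prompt helper are assumptions made for this sketch; the actual prompt format used by VisProg is not described in this interview, and no language model is called here.

```python
# Hypothetical instruction-program pairs; a real prompt would hold
# up to roughly 20 such examples for the task at hand.
EXAMPLES = [
    (
        "How many cats are in the image?",
        "BOX0=Loc(image=IMAGE,label='cat')\nANSWER=Count(box=BOX0)",
    ),
    (
        "Are there more cats than dogs?",
        "BOX0=Loc(image=IMAGE,label='cat')\n"
        "BOX1=Loc(image=IMAGE,label='dog')\n"
        "N0=Count(box=BOX0)\n"
        "N1=Count(box=BOX1)\n"
        "ANSWER=Greater(a=N0,b=N1)",
    ),
]

def build_prompt(examples, instruction):
    """Concatenate example pairs, then the new instruction with an empty program."""
    parts = [
        f"Instruction: {ex_instruction}\nProgram:\n{ex_program}"
        for ex_instruction, ex_program in examples
    ]
    # The model is expected to continue the text after the final "Program:".
    parts.append(f"Instruction: {instruction}\nProgram:")
    return "\n\n".join(parts)

PROMPT = build_prompt(EXAMPLES, "Is there a dog in the image?")
print(PROMPT)
```

The text a model returns after the final "Program:" line would then be treated as the program to execute, so adding support for a new task amounts to writing a handful of new example pairs rather than collecting training data.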

CVPR: How did your model outperform other options? What was the key factor in these results?
Visual Programming is designed for multi-step and compositional visual tasks that can leverage existing pretrained models and programs. Some of the tasks that our system VisProg handles with ease, like tagging the characters on your favorite TV show or seamlessly replacing multiple objects or regions in an image, continue to be difficult, if not impossible, for existing end-to-end models. Furthermore, even for tasks like Visual Question Answering (VQA) and Natural Language Visual Reasoning (NLVR), where end-to-end models perform well, VisProg has the advantage of producing a highly interpretable execution trace that serves as a visual rationale. This allows users to inspect the program logic and intermediate results and intervene if necessary.

CVPR: So, what’s next? What do you see as the future of your research?
With the global AI community making their state-of-the-art models readily available on platforms like Hugging Face, approaches like Visual Programming are likely to become the preferred way of building do-all AI systems. Meanwhile, there are many new and exciting directions to explore, such as iterative program generation and execution, automatic error correction while generating programs, learning from human feedback, and investigating more effective and scalable ways of generating programs for a large number of tasks.

CVPR: What more would you like to add?
Visual Programming is a natural evolution of neuro-symbolic methods in computer vision for the LLM era. Since its arXiv release in November 2022, we have already seen tremendous excitement for this new paradigm from the community, and we are eager to see more magical AI capabilities unlocked by this approach in the coming months.

Annually, CVPR recognizes top research in the field through its prestigious “Best Paper Awards.” This year, from more than 9,000 paper submissions, the CVPR 2023 Paper Awards Committee selected 12 candidates for the coveted honor of Best Paper. Join us for the Award Session on Wednesday, 21 June at 8:30 a.m. to find out which nominees take home the distinction of “Best Paper” at CVPR 2023.