

Poster

When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach

Tao Ma · Bing Bai · Haozhe Lin · Heyuan Wang · Yu Wang · Lin Luo · Lu Fang

Arch 4A-E Poster #247
Fri 21 Jun 10:30 a.m. PDT — noon PDT

Abstract:

Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes containing a few objects. Nevertheless, recent advances in imaging technology have enabled the acquisition of gigapixel-level images, providing high-resolution details of large-scale scenes containing numerous objects. To bridge this gap between imaging and computer vision benchmarks and make grounding more practically valuable, we introduce a novel dataset, named GigaGrounding, designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks, demonstrating that GigaGrounding presents unique challenges such as large-scale scene understanding, gigapixel-level resolution, significant variations in object scale, and "multi-hop expressions". Furthermore, we introduce a simple yet effective grounding approach, which employs a "glance-to-zoom-in" paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset and our code will be made publicly available upon paper acceptance.
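The abstract does not spell out the method, but the "glance-to-zoom-in" paradigm suggests a coarse-to-fine pipeline: ground the expression on a downsampled view of the whole scene, then re-ground on a full-resolution crop of the coarse region. The Python sketch below is a minimal illustration under that assumption only; `glance_to_zoom_in`, `ground_fn`, `glance_size`, and `zoom_pad` are hypothetical names, and `ground_fn` stands in for any single-image grounding model, not the authors' architecture.

```python
from PIL import Image

# Allow decoding very large (gigapixel) images; by default PIL guards
# against images above ~0.18 gigapixels (DecompressionBombWarning/Error).
Image.MAX_IMAGE_PIXELS = None


def glance_to_zoom_in(image_path, expression, ground_fn,
                      glance_size=1024, zoom_pad=0.25):
    """Hypothetical two-stage grounding sketch (not the paper's code).

    ground_fn(image, expression) is a stand-in for any grounding model;
    it returns a (left, top, right, bottom) box in the coordinates of
    the image it is given.
    """
    full = Image.open(image_path)
    W, H = full.size

    # --- Glance: ground on a heavily downsampled view of the whole scene.
    scale = glance_size / max(W, H)
    glance = full.resize((max(1, int(W * scale)), max(1, int(H * scale))))
    gl, gt, gr, gb = ground_fn(glance, expression)

    # Map the coarse box back to full-resolution coordinates and pad it,
    # since the glance-stage box is only approximate at gigapixel scale.
    l, t, r, b = gl / scale, gt / scale, gr / scale, gb / scale
    pw, ph = (r - l) * zoom_pad, (b - t) * zoom_pad
    crop_box = (max(0, int(l - pw)), max(0, int(t - ph)),
                min(W, int(r + pw)), min(H, int(b + ph)))

    # --- Zoom-in: re-ground on the high-resolution crop for a precise box.
    crop = full.crop(crop_box)
    cl, ct, cr, cb = ground_fn(crop, expression)

    # Return the refined box in full-image coordinates.
    ox, oy = crop_box[0], crop_box[1]
    return (ox + cl, oy + ct, ox + cr, oy + cb)
```

A practical appeal of such a coarse-to-fine scheme is that the grounding model only ever sees a small glance image and one crop, so memory stays bounded regardless of the full image's resolution.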
