Paper in Workshop: 8th Multimodal Learning and Applications Workshop

TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents

Kunal Singh · Shreyas Singh · Mukund Khanna


Abstract:

Recent advancements in Large Vision Language Models (LVLMs) have led to the emergence of LVLM-based Graphical User Interface (GUI) agents developed under various paradigms. Training-based approaches, such as CogAgent and SeeClick, suffer from poor cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, utilize Set-of-Marks (SoM) for action grounding; however, obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Additionally, existing methods often specialize in singular GUI tasks rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL seamlessly integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work synergistically to provide multi-granular, spatially and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL’s superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
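To make the described flow concrete, below is a minimal, hypothetical Python sketch of how a two-level HSP hierarchy (regions containing elements) and SEED-style location-conditioned descriptions could serve both action grounding and GUI referring from one shared representation. Every name, data structure, and prompt here (Element, Region, seed_enrich, ground_action, refer, the stub lvlm_describe, and the keyword-overlap selection) is an illustrative assumption, not the paper's published interface; a real system would call a generalist LVLM such as GPT-4V where the stub appears.

```python
# Hypothetical sketch, assuming a two-level HSP output and a SEED-style
# enrichment step. Not the authors' code; interfaces are invented here.
from dataclasses import dataclass, field
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screen pixels

@dataclass
class Element:
    bbox: BBox
    raw_label: str            # e.g. OCR text or icon class from the parser
    description: str = ""     # filled in by the SEED step below

@dataclass
class Region:
    bbox: BBox
    elements: List[Element] = field(default_factory=list)

def lvlm_describe(bbox: BBox, region_bbox: BBox, raw_label: str) -> str:
    """Stand-in for a generalist LVLM call. A real SEED module would send
    the screenshot plus this spatial context to the model."""
    return f"'{raw_label}' element at {bbox}, inside region {region_bbox}"

def seed_enrich(regions: List[Region]) -> None:
    """SEED (assumed behavior): give every element a spatially and
    semantically grounded description using its enclosing region."""
    for region in regions:
        for el in region.elements:
            el.description = lvlm_describe(el.bbox, region.bbox, el.raw_label)

def ground_action(instruction: str, regions: List[Region]) -> Element:
    """Action grounding: pick the element whose enriched description best
    matches the instruction (trivial keyword overlap as a stand-in for
    LVLM-based selection)."""
    words = set(instruction.lower().split())
    candidates = [el for r in regions for el in r.elements]
    return max(candidates,
               key=lambda el: len(words & set(el.description.lower().split())))

def refer(point: Tuple[int, int], regions: List[Region]) -> str:
    """GUI referring: describe whichever element contains the queried point."""
    x, y = point
    for r in regions:
        for el in r.elements:
            x1, y1, x2, y2 = el.bbox
            if x1 <= x <= x2 and y1 <= y <= y2:
                return el.description
    return "no element at this location"

# Toy hierarchy standing in for HSP output on a browser screenshot.
regions = [
    Region(bbox=(0, 0, 1280, 80), elements=[
        Element(bbox=(1100, 20, 1260, 60), raw_label="sign in button"),
        Element(bbox=(400, 20, 900, 60), raw_label="search bar"),
    ]),
]
seed_enrich(regions)
print(ground_action("click the sign in button", regions).bbox)  # grounding
print(refer((500, 40), regions))                                # referring
```

The point of the sketch is the shared structure: because both tasks read the same enriched hierarchy, grounding (instruction to element) and referring (location to description) fall out of one parse rather than two task-specific pipelines.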
