Poster

PAVE: Patching and Adapting Video Large Language Models

Zhuoming Liu ⋅ Yiquan Li ⋅ Khoi D Nguyen ⋅ Yiwu Zhong ⋅ Yin Li

2025 Poster


Abstract

We present PAVE, a framework for adapting pre-trained video large language models to downstream tasks that feature supplementary temporal signals, such as audio, camera pose, or high-frame-rate videos. PAVE adapts these models through "patching": it introduces a small number of additional parameters and operations without modifying the base model architecture or pre-trained weights. We demonstrate that PAVE effectively adapts video LLMs to tasks including audio-visual understanding and 3D reasoning, surpassing state-of-the-art task-specific models while using less than 1% additional parameters and FLOPs. Furthermore, when applied to high-frame-rate videos, PAVE enhances video understanding, improving the performance of strong base models. Our analysis also shows that the framework generalizes well across different video LLMs.
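The idea of patching a frozen model can be illustrated with a toy sketch. This is not the PAVE architecture, which the abstract does not specify; it is a minimal, hypothetical example in plain Python, assuming the patch adds a residual computed from a side signal to the video tokens before they reach an untouched base model. The names `base_model`, `Patch`, `patched_model`, and `extra_param_fraction` are all invented for illustration.

```python
def base_model(video_tokens, base_weights):
    # Stand-in for a frozen pre-trained model: a fixed linear scoring of tokens.
    # The base weights are only read here, never updated.
    return sum(w * x for w, x in zip(base_weights, video_tokens))


class Patch:
    """Hypothetical lightweight patch that fuses a side signal into video tokens."""

    def __init__(self, dim):
        # Zero-initialized, so the patched model starts out identical to the base.
        self.weights = [0.0] * dim

    def __call__(self, video_tokens, side_signal):
        # Residual update: each token gets a scaled contribution from the side signal.
        return [x + w * s for x, w, s in zip(video_tokens, self.weights, side_signal)]


def patched_model(video_tokens, side_signal, base_weights, patch):
    # The base model is called unchanged; only its input is augmented by the patch.
    fused = patch(video_tokens, side_signal)
    return base_model(fused, base_weights)


def extra_param_fraction(n_base_params, n_patch_params):
    # Fraction of parameters added by the patch, relative to the base model.
    return n_patch_params / n_base_params


if __name__ == "__main__":
    base_weights = [0.5, -1.0, 2.0]   # toy "pre-trained" weights (made up)
    video = [1.0, 2.0, 3.0]           # toy video tokens (made up)
    audio = [4.0, 5.0, 6.0]           # toy side signal (made up)
    patch = Patch(dim=3)
    print(patched_model(video, audio, base_weights, patch))
    print(extra_param_fraction(1_000_000, 5_000))
```

In this sketch only `patch.weights` would be trained, so the base model's architecture and weights stay as they were, and the overhead of the patch can be checked against a budget with `extra_param_fraction`.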
