Poster
PAVE: Patching and Adapting Video Large Language Models
Zhuoming Liu · Yiquan Li · Khoi D Nguyen · Yiwu Zhong · Yin Li
We present PAVE, a framework for adapting pre-trained video large language models to downstream tasks featuring temporal supplementary signals, such as audio, camera pose, or high-frame-rate video. PAVE adapts these models through "patching": it introduces a small number of additional parameters and operations without modifying the base model's architecture or pre-trained weights. We demonstrate that PAVE effectively adapts video LLMs for tasks including audio-visual understanding and 3D reasoning, surpassing state-of-the-art task-specific models while adding less than 1% in parameters and FLOPs. Furthermore, when applied to high-frame-rate videos, PAVE enhances video understanding, improving the performance of strong base models. Our analysis also shows that the framework generalizes well across different video LLMs.
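The idea of "patching" a frozen model can be illustrated with a toy sketch. This is not the PAVE implementation: the names `FrozenBase`, `Patch`, `patched_forward`, and `overhead`, the scalar "tokens", the zero-initialized gate, and the parameter counts are all assumptions made for illustration. It only shows the general pattern the abstract describes: the base model stays untouched, a small add-on module mixes a side-channel signal into its features, and the added parameters are a small fraction of the total.

```python
# Illustrative sketch only: hypothetical names and numbers, not the PAVE code.
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenBase:
    """Stand-in for a pre-trained video LLM; frozen, so it cannot be modified."""
    num_params: int

    def encode(self, frames):
        # Placeholder "visual tokens": one scalar per frame (mean pixel value).
        return [sum(f) / len(f) for f in frames]


@dataclass
class Patch:
    """Small add-on module that mixes a side-channel signal into the tokens."""
    gate: float = 0.0    # assumed zero-init: patched model starts equal to the base
    weight: float = 1.0
    num_params: int = 2  # gate + weight

    def __call__(self, tokens, side_signal):
        # Residual update: tokens pass through unchanged when gate == 0.
        return [t + self.gate * self.weight * s
                for t, s in zip(tokens, side_signal)]


def patched_forward(base, patch, frames, side_signal):
    """Run the frozen base, then apply the patch to its output tokens."""
    return patch(base.encode(frames), side_signal)


def overhead(base, patch):
    """Fraction of extra parameters introduced by the patch."""
    return patch.num_params / base.num_params
```

In this toy setup only the fields of `Patch` would be trained; a real patch would be a learned module (for example, some form of attention over the supplementary signal) rather than a single scalar gate.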