Zhuoming Liu1, Yiquan Li1, Khoi Duc Nguyen1, Yiwu Zhong2, Yin Li1
1University of Wisconsin-Madison 2The Chinese University of Hong Kong
This code repo holds the implementation of PAVE, a framework that adapts pre-trained video large language models (Video-LLMs) to downstream tasks with side-channel signals, such as audio, depth information, exo-centric videos, and high-frame-rate videos. Our paper has been accepted to CVPR 2025, and an arXiv version of our paper is available here.
PAVE adapts Video-LLMs through patching --- adding a small "patch" of additional parameters and operations to the Video-LLM without altering its existing architecture or vast pre-trained weights. Specifically, PAVE leverages cross-attention that operates between tokens derived from key video frames (as queries) and tokens from side-channel signals (as keys and values). This operation aligns the visual and side-channel signals along the time axis, fuses the signals from both sources, and then updates the visual tokens fed to the LLM. In doing so, PAVE supports supplementary input signals while introducing only a small number of parameters and operations with negligible computing cost, enabling effective adaptation to various downstream tasks.
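To make the idea concrete, below is a minimal PyTorch sketch of the patching operation. The module name `PAVEPatch`, the dimensions, and the zero-initialized gate are illustrative assumptions rather than the exact implementation in this repo:

```python
import torch
import torch.nn as nn

class PAVEPatch(nn.Module):
    """Illustrative sketch: visual tokens from key frames attend (as queries)
    to side-channel tokens (as keys/values); the fused result updates the
    visual tokens fed to the LLM via a gated residual connection."""

    def __init__(self, d_model: int = 1024, d_side: int = 1024, n_heads: int = 8):
        super().__init__()
        # project side-channel features (audio/depth/exo-video/dense frames)
        # into the visual token space
        self.side_proj = nn.Linear(d_side, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # zero-init gate: the patched model starts as an identity mapping
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual_tokens, side_tokens):
        # visual_tokens: (B, T_v, d_model) tokens from key video frames
        # side_tokens:   (B, T_s, d_side) tokens from the side-channel signal
        kv = self.side_proj(side_tokens)
        fused, _ = self.cross_attn(query=visual_tokens, key=kv, value=kv)
        # residual update keeps the pre-trained visual pathway intact
        return visual_tokens + torch.tanh(self.gate) * fused

# toy usage: 32 visual tokens fused with 64 side-channel tokens
patch = PAVEPatch()
out = patch(torch.randn(2, 32, 1024), torch.randn(2, 64, 1024))
print(out.shape)  # torch.Size([2, 32, 1024]), same shape as the LLM's visual input
```

Because the update is residual, only the patch parameters need training (optionally alongside a LoRA on the LLM, as some checkpoint names below suggest).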
Without bells and whistles, PAVE achieves state-of-the-art performance on audio-visual QA, 3D QA, and multi-view video understanding, adapting a Video-LLM to new tasks at the cost of adding less than 1% of the FLOPs and parameters.
- Audio-visual QA: With audio as the side-channel signal, PAVE outperforms the SOTA audio-visual model by 44 points, 2%, and 7% on AVSD, AVQA, and the visual split of Music-AVQA, respectively.
- 3D QA: With camera poses and scene depth treated as side-channel signals, PAVE surpasses the previous best 3D MLLM by 2-4% on SQA3D and ScanQA.
- Multi-view video understanding: With the exo-centric video as the side-channel signal, PAVE outperforms the baseline by a clear margin on the Ego-Exo4D demonstrator proficiency estimation benchmark.
- Enhanced video understanding: With densely sampled video frames as side-channel signals, PAVE improves LLaVA-OneVision by 1-5% on VideoMME, MLVU, and key sub-tasks of MVBench.
We set up the environment on a Linux machine:
- Clone the repository and navigate to the PAVE folder
git clone https://github.com/dragonlzm/PAVE.git
cd PAVE
- Install Packages
conda create -n pave python=3.10 -y
conda activate pave
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install flash-attn==2.7.3 --no-build-isolation --no-cache-dir
pip install peft==0.10.0
pip install rotary-embedding-torch
# You may need to install the following libraries for evaluation
pip install mmengine pycocotools pycocoevalcap pytablewriter hf_transfer tenacity sqlitedict evaluate sacrebleu loguru
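After installation, a quick import check (a suggested snippet, not part of the repo) can verify the key dependencies:

```python
# suggested sanity check: confirm the key packages import and a GPU is visible
import torch, flash_attn, peft
print(torch.__version__, "| flash-attn", flash_attn.__version__, "| CUDA:", torch.cuda.is_available())
```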
We include the PAVE weights for the different tasks in this section. You can find all the weights here.
Dataset | Base Model | Schedule | Checkpoint | AVSD (CIDEr) | AVQA (Acc.) | Music-AVQA (Audio Acc.) | Music-AVQA (Visual Acc.) | Music-AVQA (Audio-Visual Acc.) | Music-AVQA (Overall Acc.) |
---|---|---|---|---|---|---|---|---|---|
AVSD | LLaVA-OneVision-0.5B | 1e | pave_avsd_imagebind | 134.5 | - | - | - | - | - |
AVSD | LLaVA-OneVision-7B | 1e | pave_avsd_7B_imagebind | 152.9 | - | - | - | - | - |
AVSD | LLaVA-OneVision-7B | 2e with densely sampled frames | pave_avsd_7B_imagebind_dense | 160.0 | - | - | - | - | - |
AVQA | LLaVA-OneVision-0.5B | 2e | pave_avqa_imagebind | - | 90.4 | - | - | - | - |
AVQA | LLaVA-OneVision-7B | 2e | pave_avqa_7B_imagebind | - | 93.8 | - | - | - | - |
Music-AVQA | LLaVA-OneVision-0.5B | 2e | pave_music_avqa_imagebind | - | - | 77.3 | 89.8 | 74.1 | 78.8 |
Music-AVQA | LLaVA-OneVision-7B | 2e | pave_music_avqa_7B_imagebind | - | - | 79.7 | 93.0 | 78.0 | 82.3 |
Dataset | Base Model | Schedule | Checkpoint | ScanQA (CIDEr) | ScanQA (BLEU-4) | ScanQA (METEOR) | ScanQA (ROUGE-L) | ScanQA (EM@1) | SQA3D (EM@1) |
---|---|---|---|---|---|---|---|---|---|
ScanQA | LLaVA-OneVision-0.5B | 1e | pave_scanqa | 84.2 | 13.1 | 17.0 | 42.1 | 23.1 (40.0) | - |
ScanQA | LLaVA-OneVision-7B | 1e | pave_scanqa_7B | 103.4 | 16.0 | 19.9 | 49.0 | 29.1 (48.5) | - |
SQA3D | LLaVA-OneVision-0.5B | 2e | pave_sqa3d | - | - | - | - | - | 51.1 (52.8) |
SQA3D | LLaVA-OneVision-7B | 2e | pave_sqa3d_7B | - | - | - | - | - | 59.0 (61.4) |
Base Model | Schedule | Checkpoint | Ego-Exo4D demonstrator proficiency (Acc.) |
---|---|---|---|
LLaVA-OneVision-0.5B | 2e | pave_v5_1_2_egoexo_lora | 32.4 |
LLaVA-OneVision-7B | 2e | pave_v5_1_3_egoexo_lora_7B | 44.2 |
Base Model | Schedule | Checkpoint | VideoMME (Short) | VideoMME (Medium) | VideoMME (Avg) | MVBench | MLVU |
---|---|---|---|---|---|---|---|
LLaVA-OneVision-0.5B | 1e | pave_v5_1_2_lora | 57.8 | 42.7 | 37.4 | 46.0 | 46.6 |
LLaVA-OneVision-7B | 1e | pave_v5_1_3_lora_7B | 71.1 | 59.4 | 59.9 | 58.0 | 67.0 |
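The checkpoints can also be fetched programmatically with `huggingface_hub`. The repo id below is a placeholder; substitute the actual id from the weights link above:

```python
from huggingface_hub import snapshot_download

# NOTE: "<org>/pave_avsd_7B_imagebind" is a placeholder repo id;
# use the actual id from the weights link in this section.
ckpt_dir = snapshot_download(repo_id="<org>/pave_avsd_7B_imagebind")
print("checkpoint downloaded to:", ckpt_dir)
```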
We include instructions for preparing the different datasets used in PAVE training and evaluation in this section. You can find all the annotations and pre-extracted features here, and the videos needed for training or evaluation here.
Please refer to AVSD_Prepare for more information.
Please refer to AVQA_Prepare for more information.
Please refer to Music-AVQA_Prepare for more information.
Please refer to ScanQA_Prepare for more information.
Please refer to SQA3D_Prepare for more information.
Please refer to Ego-Exo4d-dp_Prepare for more information.
Please refer to LLaVA-Video_Prepare for more information.
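Several of the audio checkpoints above are ImageBind-based. As a rough sketch of how audio side-channel features could be extracted with the official ImageBind package (the audio path is a placeholder and any temporal chunking is omitted; follow the prepare docs above for the actual pipeline):

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

# NOTE: "sample.wav" is a placeholder input
inputs = {ModalityType.AUDIO: data.load_and_transform_audio_data(["sample.wav"], device)}
with torch.no_grad():
    audio_feat = model(inputs)[ModalityType.AUDIO]  # one 1024-d embedding per clip
print(audio_feat.shape)
```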
We provide a demo of PAVE here; you can also find the sample data here.
A Gradio demo is coming soon.
We provide sample training scripts in this section.
Please refer to AVSD_Train for more information.
Please refer to AVQA_Train for more information.
Please refer to Music-AVQA_Train for more information.
Please refer to ScanQA_Train for more information.
Please refer to SQA3D_Train for more information.
Please refer to Ego-Exo4d-dp_Train for more information.
Please refer to Enhanced_video_Train for more information.
Please refer to Multiple_side_channels_Train for more information.
We provide sample evaluation scripts in this section.
Please refer to AVSD_Eval for more information.
Please refer to AVQA_Eval for more information.
Please refer to Music-AVQA_Eval for more information.
Please refer to ScanQA_Eval for more information.
Please refer to SQA3D_Eval for more information.
Please refer to Ego-Exo4d-dp_Eval for more information.
Please refer to Enhanced_video_Eval for more information.
Please refer to Multiple_side_channels_Eval for more information.
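As a small illustration of the captioning metrics in the model zoo tables (e.g., the CIDEr reported for AVSD), `pycocoevalcap` (installed above) can score predictions against references. The toy data below is made up, and papers typically report CIDEr scaled by 100:

```python
from pycocoevalcap.cider.cider import Cider

# toy example: map each question id to a list of reference/predicted answers
refs = {"q1": ["a man is playing a guitar on stage"],
        "q2": ["a dog runs across the yard"]}
hyps = {"q1": ["a man plays a guitar on the stage"],
        "q2": ["a dog is running in the yard"]}
score, per_sample = Cider().compute_score(refs, hyps)
print(f"corpus CIDEr: {score:.3f}")
```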
Zhuoming Liu ([email protected])
If you think our work is helpful or interesting, please consider citing our paper. Thanks!
@misc{liu2025pavepatchingadaptingvideo,
title={PAVE: Patching and Adapting Video Large Language Models},
author={Zhuoming Liu and Yiquan Li and Khoi Duc Nguyen and Yiwu Zhong and Yin Li},
year={2025},
eprint={2503.19794},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.19794},
}
- LLaVA-OneVision: the base model that PAVE is built upon. Our base models LLaVA-OneVision-7B and LLaVA-OneVision-0.5B have amazing video understanding capabilities!
For academic use only. For commercial use, please contact the authors.