Zhuoming Liu1, Yiquan Li1, Khoi Duc Nguyen1, Yiwu Zhong2, Yin Li1
1University of Wisconsin-Madison 2The Chinese University of Hong Kong
This code repo holds the implementation of PAVE, a framework that adapts pre-trained video large language models (Video-LLMs) to downstream tasks with side-channel signals, such as audio, depth information, exo-centric videos, and high-frame-rate videos. Our paper has been accepted to CVPR 2025, and an arXiv version of our paper is available here.
PAVE adapts Video-LLMs through patching --- adding a small "patch" of additional parameters and operations to the Video-LLM without altering its existing architecture or vast pre-trained weights. Specifically, PAVE leverages cross-attention that operates between tokens derived from key video frames (as queries) and tokens from side-channel signals (as keys and values). This operation aligns the visual and side-channel signals along the time axis, fuses the signals from both sources, and then updates the visual tokens fed to the LLM. In doing so, PAVE supports supplementary input signals while introducing only a small number of parameters and operations with negligible computing cost, enabling effective adaptation to various downstream tasks.
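To make the idea concrete, below is a minimal PyTorch sketch of the patching operation. The module name `PAVEPatch`, the dimensions, and the zero-initialized gate are illustrative assumptions rather than the exact implementation in this repo:

```python
import torch
import torch.nn as nn

class PAVEPatch(nn.Module):
    """Illustrative sketch: visual tokens from key frames attend (as queries)
    to side-channel tokens (as keys/values); the fused result updates the
    visual tokens fed to the LLM via a gated residual connection."""

    def __init__(self, d_model: int = 1024, d_side: int = 1024, n_heads: int = 8):
        super().__init__()
        # project side-channel features (audio/depth/exo-video/dense frames)
        # into the visual token space
        self.side_proj = nn.Linear(d_side, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # zero-init gate: the patched model starts as an identity mapping
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual_tokens, side_tokens):
        # visual_tokens: (B, T_v, d_model) tokens from key video frames
        # side_tokens:   (B, T_s, d_side) tokens from the side-channel signal
        kv = self.side_proj(side_tokens)
        fused, _ = self.cross_attn(query=visual_tokens, key=kv, value=kv)
        # residual update keeps the pre-trained visual pathway intact
        return visual_tokens + torch.tanh(self.gate) * fused

# toy usage: 32 visual tokens fused with 64 side-channel tokens
patch = PAVEPatch()
out = patch(torch.randn(2, 32, 1024), torch.randn(2, 64, 1024))
print(out.shape)  # torch.Size([2, 32, 1024]), same shape as the LLM's visual input
```

Because the update is residual, only the patch parameters need training (optionally alongside a LoRA on the LLM, as some checkpoint names below suggest).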
Without bells and whistles, PAVE achieves state-of-the-art performance on audio-visual QA, 3D QA, and multi-view video understanding, adapting a Video-LLM to new tasks at the cost of adding less than 1% of the FLOPs and parameters.
- Audio-visual QA: With audio as the side-channel signal, PAVE outperforms the SOTA audio-visual model by 44 points, 2%, and 7% on AVSD, AVQA, and the visual split of Music-AVQA, respectively.
- 3D QA: With camera poses and scene depth treated as side-channel signals, PAVE surpasses the previous best 3D MLLM by 2-4% on SQA3D and ScanQA.
- Multi-view video understanding: With the exo-centric video as the side-channel signal, PAVE outperforms the baseline by a clear margin on the Ego-Exo4D demonstrator proficiency estimation benchmark.
- Enhanced video understanding: With densely sampled video frames as side-channel signals, PAVE improves LLaVA-OneVision by 1-5% on VideoMME, MLVU, and key sub-tasks of MVBench.
We set up the environment on a Linux machine:
- Clone the repository and navigate to the PAVE folder
git clone https://github.com/dragonlzm/PAVE.git
cd PAVE
- Install Packages
conda create -n pave python=3.10 -y
conda activate pave
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install flash-attn==2.7.3 --no-build-isolation --no-cache-dir
pip install peft==0.10.0
pip install rotary-embedding-torch
# You may need to install the following libraries for evaluation
pip install mmengine pycocotools pycocoevalcap pytablewriter hf_transfer tenacity sqlitedict evaluate sacrebleu loguru
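After installation, a quick import check (a suggested snippet, not part of the repo) can verify the key dependencies:

```python
# suggested sanity check: confirm the key packages import and a GPU is visible
import torch, flash_attn, peft
print(torch.__version__, "| flash-attn", flash_attn.__version__, "| CUDA:", torch.cuda.is_available())
```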
We include the PAVE weights for the different tasks in this section. You can find all the weights here.
Dataset | Base Model | Schedule | Checkpoint | AVSD (CIDEr) | AVQA (Acc.) | Music-AVQA (Audio Acc.) | Music-AVQA (Visual Acc.) | Music-AVQA (Audio-Visual Acc.) | Music-AVQA (Overall Acc.) |
---|---|---|---|---|---|---|---|---|---|
AVSD | LLaVA-OneVision-0.5B | 1e | pave_avsd_imagebind | 134.5 | - | - | - | - | - |
AVSD | LLaVA-OneVision-7B | 1e | pave_avsd_7B_imagebind | 152.9 | - | - | - | - | - |
AVSD | LLaVA-OneVision-7B | 2e with densely sampled frames | pave_avsd_7B_imagebind_dense | 160.0 | - | - | - | - | - |
AVQA | LLaVA-OneVision-0.5B | 2e | pave_avqa_imagebind | - | 90.4 | - | - | - | - |
AVQA | LLaVA-OneVision-7B | 2e | pave_avqa_7B_imagebind | - | 93.8 | - | - | - | - |
Music-AVQA | LLaVA-OneVision-0.5B | 2e | pave_music_avqa_imagebind | - | - | 77.3 | 89.8 | 74.1 | 78.8 |
Music-AVQA | LLaVA-OneVision-7B | 2e | pave_music_avqa_7B_imagebind | - | - | 79.7 | 93.0 | 78.0 | 82.3 |
Dataset | Base Model | Schedule | Checkpoint | ScanQA (CIDEr) | ScanQA (BLEU-4) | ScanQA (METEOR) | ScanQA (ROUGE-L) | ScanQA (EM@1) | SQA3D (EM@1) |
---|---|---|---|---|---|---|---|---|---|
ScanQA | LLaVA-OneVision-0.5B | 1e | pave_scanqa | 84.2 | 13.1 | 17.0 | 42.1 | 23.1 (40.0) | - |
ScanQA | LLaVA-OneVision-7B | 1e | pave_scanqa_7B | 103.4 | 16.0 | 19.9 | 49.0 | 29.1 (48.5) | - |
SQA3D | LLaVA-OneVision-0.5B | 2e | pave_sqa3d | - | - | - | - | - | 51.1 (52.8) |
SQA3D | LLaVA-OneVision-7B | 2e | pave_sqa3d_7B | - | - | - | - | - | 59.0 (61.4) |
Base Model | Schedule | Checkpoint | Ego-Exo4D demonstrator proficiency (Acc.) |
---|---|---|---|
LLaVA-OneVision-0.5B | 2e | pave_v5_1_2_egoexo_lora | 32.4 |
LLaVA-OneVision-7B | 2e | pave_v5_1_3_egoexo_lora_7B | 44.2 |
Base Model | Schedule | Checkpoint | VideoMME (Short) | VideoMME (Medium) | VideoMME (Avg) | MVBench | MLVU |
---|---|---|---|---|---|---|---|
LLaVA-OneVision-0.5B | 1e | pave_v5_1_2_lora | 57.8 | 42.7 | 37.4 | 46.0 | 46.6 |
LLaVA-OneVision-7B | 1e | pave_v5_1_3_lora_7B | 71.1 | 59.4 | 59.9 | 58.0 | 67.0 |
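The checkpoints can also be fetched programmatically with `huggingface_hub`. The repo id below is a placeholder; substitute the actual id from the weights link above:

```python
from huggingface_hub import snapshot_download

# NOTE: "<org>/pave_avsd_7B_imagebind" is a placeholder repo id;
# use the actual id from the weights link in this section.
ckpt_dir = snapshot_download(repo_id="<org>/pave_avsd_7B_imagebind")
print("checkpoint downloaded to:", ckpt_dir)
```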
We include instructions for preparing the different datasets used in PAVE training and evaluation in this section. You can find all the annotations and pre-extracted features here, and the videos needed for training or evaluation here.
Please refer to AVSD_Prepare for more information.
Please refer to AVQA_Prepare for more information.
Please refer to Music-AVQA_Prepare for more information.
Please refer to ScanQA_Prepare for more information.
Please refer to SQA3D_Prepare for more information.
Please refer to Ego-Exo4d-dp_Prepare for more information.
Please refer to LLaVA-Video_Prepare for more information.
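Several of the audio checkpoints above are ImageBind-based. As a rough sketch of how audio side-channel features could be extracted with the official ImageBind package (the audio path is a placeholder and any temporal chunking is omitted; follow the prepare docs above for the actual pipeline):

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

# NOTE: "sample.wav" is a placeholder input
inputs = {ModalityType.AUDIO: data.load_and_transform_audio_data(["sample.wav"], device)}
with torch.no_grad():
    audio_feat = model(inputs)[ModalityType.AUDIO]  # one 1024-d embedding per clip
print(audio_feat.shape)
```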
We provide a demo of PAVE here; you can also find the sample data here.
A Gradio demo is coming soon.
We provide sample training scripts in this section.
Please refer to AVSD_Train for more information.
Please refer to AVQA_Train for more information.
Please refer to Music-AVQA_Train for more information.
Please refer to ScanQA_Train for more information.
Please refer to SQA3D_Train for more information.
Please refer to Ego-Exo4d-dp_Train for more information.
Please refer to Enhanced_video_Train for more information.
Please refer to Multiple_side_channels_Train for more information.
We provide sample evaluation scripts in this section.
Please refer to AVSD_Eval for more information.
Please refer to AVQA_Eval for more information.
Please refer to Music-AVQA_Eval for more information.
Please refer to ScanQA_Eval for more information.
Please refer to SQA3D_Eval for more information.
Please refer to Ego-Exo4d-dp_Eval for more information.
Please refer to Enhanced_video_Eval for more information.
Please refer to Multiple_side_channels_Eval for more information.
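As a small illustration of the captioning metrics in the model zoo tables (e.g., the CIDEr reported for AVSD), `pycocoevalcap` (installed above) can score predictions against references. The toy data below is made up, and papers typically report CIDEr scaled by 100:

```python
from pycocoevalcap.cider.cider import Cider

# toy example: map each question id to a list of reference/predicted answers
refs = {"q1": ["a man is playing a guitar on stage"],
        "q2": ["a dog runs across the yard"]}
hyps = {"q1": ["a man plays a guitar on the stage"],
        "q2": ["a dog is running in the yard"]}
score, per_sample = Cider().compute_score(refs, hyps)
print(f"corpus CIDEr: {score:.3f}")
```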
Zhuoming Liu ([email protected])
If you think our work is helpful or interesting, please consider citing our paper. Thanks!
@misc{liu2025pavepatchingadaptingvideo,
title={PAVE: Patching and Adapting Video Large Language Models},
author={Zhuoming Liu and Yiquan Li and Khoi Duc Nguyen and Yiwu Zhong and Yin Li},
year={2025},
eprint={2503.19794},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.19794},
}
- LLaVA-OneVision: the base model that PAVE is built upon. Our base models LLaVA-OneVision-7B and LLaVA-OneVision-0.5B have amazing video understanding capabilities!
For academic use only. For commercial use, please contact the authors.