
🛠️ PAVE: Patching and Adapting Video Large Language Models

Zhuoming Liu1, Yiquan Li1, Khoi Duc Nguyen1, Yiwu Zhong2, Yin Li1

1University of Wisconsin-Madison 2The Chinese University of Hong Kong

Introduction

This code repo holds the implementation of PAVE, a framework that adapts pre-trained video large language models (Video-LLMs) to downstream tasks using side-channel signals, such as audio, depth information, exo-centric video, and high-frame-rate video. Our paper has been accepted to CVPR 2025, and an arXiv version is available here.

PAVE adapts Video-LLMs through patching --- adding a small "patch" of additional parameters and operations to the Video-LLM without altering its existing architecture or vast pre-trained weights. Specifically, PAVE leverages cross-attention that operates between tokens derived from key video frames (as queries) and tokens from the side-channel signals (as keys and values). This operation aligns the visual and side-channel signals along the time axis, fuses the signals from both sources, and then updates the visual tokens fed to the LLM. In doing so, PAVE allows for the input of supplementary signals while introducing only a small number of parameters and operations with negligible computing cost, enabling effective adaptation to various downstream tasks.
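The fusion step above can be sketched as a toy single-head cross-attention in NumPy. This is a simplified illustration of the idea, not the repo's actual implementation (which uses learned, multi-head attention modules inside the Video-LLM); all names and shapes here are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pave_patch(frame_tokens, side_tokens, Wq, Wk, Wv):
    """Toy single-head cross-attention patch (illustrative only).

    frame_tokens: (T, d) visual tokens from key video frames (queries)
    side_tokens:  (S, d) tokens from the side-channel signal (keys/values)
    Returns updated visual tokens of the same shape, via a residual add.
    """
    q = frame_tokens @ Wq                                    # (T, d)
    k = side_tokens @ Wk                                     # (S, d)
    v = side_tokens @ Wv                                     # (S, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (T, S), aligns along time
    fused = attn @ v                                         # (T, d)
    return frame_tokens + fused                              # residual update of visual tokens

# tiny smoke test with random weights
rng = np.random.default_rng(0)
d, T, S = 8, 4, 6
frames = rng.normal(size=(T, d))
side = rng.normal(size=(S, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = pave_patch(frames, side, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because the fused side-channel signal is added residually, the base model's visual tokens are only updated, not replaced, which is consistent with keeping the pre-trained architecture and weights intact.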

Without bells and whistles, PAVE achieves state-of-the-art performance in audio-visual QA, 3D QA, and multi-view video understanding, adapting Video-LLMs to new tasks at the cost of adding less than 1% of FLOPs and parameters.

  • Audio-visual QA: With audio as the side-channel signal, PAVE outperforms the SOTA audio-visual model by 44 points on AVSD, 2% on AVQA, and 7% on the visual split of Music-AVQA.

  • 3D QA: With camera poses and scene depth as side-channel signals, PAVE surpasses the previous best 3D MLLM by 2-4% on SQA3D and ScanQA.

  • Multi-view Video Understanding: With exo-centric video as the side-channel signal, PAVE outperforms the baseline by a clear margin on the Ego-Exo4D demonstrator proficiency estimation benchmark.

  • Enhanced Video Understanding: With densely sampled video frames as the side-channel signal, PAVE improves LLaVA-OneVision by 1-5% on VideoMME, MLVU, and key sub-tasks of MVBench.


Install

We set up this environment on a Linux machine:

  1. Clone the repository and navigate to the PAVE folder
git clone https://github.com/dragonlzm/PAVE.git
cd PAVE
  2. Install packages
conda create -n pave python=3.10 -y
conda activate pave
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

pip install flash-attn==2.7.3 --no-build-isolation --no-cache-dir
pip install peft==0.10.0
pip install rotary-embedding-torch

# You may need to install the following libraries for evaluation
pip install mmengine pycocotools pycocoevalcap pytablewriter hf_transfer tenacity sqlitedict evaluate sacrebleu loguru

PAVE Weights

We include all the PAVE weights for different tasks in this section. You can find all the weights here.

1. Audio-Visual

| Dataset | Base Model | Schedule | Checkpoint | AVSD (CIDEr) | AVQA (Acc.) | Music-AVQA (Audio Acc.) | Music-AVQA (Visual Acc.) | Music-AVQA (Audio-Visual Acc.) | Music-AVQA (Overall Acc.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AVSD | LLaVA-OneVision-0.5B | 1e | pave_avsd_imagebind | 134.5 | - | - | - | - | - |
| AVSD | LLaVA-OneVision-7B | 1e | pave_avsd_7B_imagebind | 152.9 | - | - | - | - | - |
| AVSD | LLaVA-OneVision-7B | 2e with densely sampled frames | pave_avsd_7B_imagebind_dense | 160.0 | - | - | - | - | - |
| AVQA | LLaVA-OneVision-0.5B | 2e | pave_avqa_imagebind | - | 90.4 | - | - | - | - |
| AVQA | LLaVA-OneVision-7B | 2e | pave_avqa_7B_imagebind | - | 93.8 | - | - | - | - |
| Music-AVQA | LLaVA-OneVision-0.5B | 2e | pave_music_avqa_imagebind | - | - | 77.3 | 89.8 | 74.1 | 78.8 |
| Music-AVQA | LLaVA-OneVision-7B | 2e | pave_music_avqa_7B_imagebind | - | - | 79.7 | 93.0 | 78.0 | 82.3 |

2. 3D-QA

| Dataset | Base Model | Schedule | Checkpoint | ScanQA (C) | ScanQA (B-4) | ScanQA (M) | ScanQA (R) | ScanQA (EM@1) | SQA3D (EM@1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ScanQA | LLaVA-OneVision-0.5B | 1e | pave_scanqa | 84.2 | 13.1 | 17.0 | 42.1 | 23.1 (40.0) | - |
| ScanQA | LLaVA-OneVision-7B | 1e | pave_scanqa_7B | 103.4 | 16.0 | 19.9 | 49.0 | 29.1 (48.5) | - |
| SQA3D | LLaVA-OneVision-0.5B | 2e | pave_sqa3d | - | - | - | - | - | 51.1 (52.8) |
| SQA3D | LLaVA-OneVision-7B | 2e | pave_sqa3d_7B | - | - | - | - | - | 59.0 (61.4) |

3. Multi-View Video Understanding

| Base Model | Schedule | Checkpoint | Ego-Exo4D Demonstrator Proficiency (Acc.) |
| --- | --- | --- | --- |
| LLaVA-OneVision-0.5B | 2e | pave_v5_1_2_egoexo_lora | 32.4 |
| LLaVA-OneVision-7B | 2e | pave_v5_1_3_egoexo_lora_7B | 44.2 |

4. Enhancing Video QA

| Base Model | Schedule | Checkpoint | VideoMME (Short) | VideoMME (Medium) | VideoMME (Avg) | MVBench | MLVU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-OneVision-0.5B | 1e | pave_v5_1_2_lora | 57.8 | 42.7 | 37.4 | 46.0 | 46.6 |
| LLaVA-OneVision-7B | 1e | pave_v5_1_3_lora_7B | 71.1 | 59.4 | 59.9 | 58.0 | 67.0 |

Datasets

We include instructions for preparing the datasets used in PAVE training and evaluation in this section. You can find all the annotations and pre-extracted features here, and the videos needed for training or evaluation here.

1. AVSD

Please refer to AVSD_Prepare for more information.

2. AVQA

Please refer to AVQA_Prepare for more information.

3. Music-AVQA

Please refer to Music-AVQA_Prepare for more information.

4. ScanQA

Please refer to ScanQA_Prepare for more information.

5. SQA3D

Please refer to SQA3D_Prepare for more information.

6. Ego-Exo4D demonstrator proficiency

Please refer to Ego-Exo4d-dp_Prepare for more information.

7. LLaVA-Video

Please refer to LLaVA-Video_Prepare for more information.

Demo

We provide a demo of PAVE here; you can also find the sample data here.

A Gradio demo is coming soon.

Train

We provide the sample training scripts in this section.

1. AVSD

Please refer to AVSD_Train for more information.

2. AVQA

Please refer to AVQA_Train for more information.

3. Music-AVQA

Please refer to Music-AVQA_Train for more information.

4. ScanQA

Please refer to ScanQA_Train for more information.

5. SQA3D

Please refer to SQA3D_Train for more information.

6. Ego-Exo4D demonstrator proficiency

Please refer to Ego-Exo4d-dp_Train for more information.

7. Enhanced Video

Please refer to Enhanced_video_Train for more information.

8. Multiple Side-Channels

Please refer to Multiple_side_channels_Train for more information.

Evaluation

We provide the sample evaluation scripts in this section.

1. AVSD

Please refer to AVSD_Eval for more information.

2. AVQA

Please refer to AVQA_Eval for more information.

3. Music-AVQA

Please refer to Music-AVQA_Eval for more information.

4. ScanQA

Please refer to ScanQA_Eval for more information.

5. SQA3D

Please refer to SQA3D_Eval for more information.

6. Ego-Exo4D demonstrator proficiency

Please refer to Ego-Exo4d-dp_Eval for more information.

7. Enhanced Video

Please refer to Enhanced_video_Eval for more information.

8. Multiple Side-Channels

Please refer to Multiple_side_channels_Eval for more information.

Contact

Zhuoming Liu ([email protected])

References

If you find our work helpful or interesting, please consider citing our paper. Thanks!

@misc{liu2025pavepatchingadaptingvideo,
      title={PAVE: Patching and Adapting Video Large Language Models}, 
      author={Zhuoming Liu and Yiquan Li and Khoi Duc Nguyen and Yiwu Zhong and Yin Li},
      year={2025},
      eprint={2503.19794},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.19794}, 
}

Acknowledgement

  • LLaVA-OneVision: the base model that PAVE is built upon; our base models LLaVA-OneVision-7B and LLaVA-OneVision-0.5B have amazing video understanding capabilities!

License

For academic use only. For commercial use, please contact the authors.
