FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models

[arXiv] [project page]

FrameFusion reduces the number of tokens in Large Vision-Language Models (LVLMs) by combining similarity-based merging with importance-based pruning. It achieves a 70% vision token reduction, 3.4–4.4× LLM speedups, and 1.6–1.9× end-to-end speedups with minimal performance impact.

Feel free to star the repo or cite the paper if you find it interesting.

@misc{fu2024framefusion,
      title={FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models}, 
      author={Tianyu Fu and Tengxuan Liu and Qinghao Han and Guohao Dai and Shengen Yan and Huazhong Yang and Xuefei Ning and Yu Wang},
      year={2024},
      eprint={2501.01986},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.01986}, 
}

News

  • [2025/04] Added support for the NVILA model family

Environment Setup

General

Create a new environment:

conda create -n framefusion python=3.10
conda activate framefusion

Install the dependencies:

pip install -r requirements.txt

Install FrameFusion:

pip install -e .
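
To verify the installation, a quick sanity check (our suggestion, not part of the official setup) is to import the entry point that the rest of this README uses:

# confirms the editable install exposes the FrameFusion interface
from framefusion.interface import apply_framefusion
print("FrameFusion installed:", callable(apply_framefusion))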

Working with Other Models

Important: The NVILA and Llava-Video packages conflict with each other. FrameFusion supports both models, but install only one of the two packages in a given environment.

Option 1: Llava-Video

To install Llava-Video LVLM dependencies:

  1. Clone the LLaVA-NeXT repository:
    git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
    cd LLaVA-NeXT
  2. Install via:
    pip install -e .

Option 2: NVILA

To install NVILA dependencies:

  1. Clone the VILA repository:
    git clone https://github.com/NVlabs/VILA.git
    cd VILA
  2. Run the environment setup script to install dependencies into the current conda environment:
    ./environment_setup.sh
  3. Install via:
    pip install -e .

How to

Run an example

We provide an example in script/playground/example_llava.py that runs inference on a video with the LLaVA-Video-7B model, with or without FrameFusion.

python script/playground/example_llava.py

Apply FrameFusion

You can apply FrameFusion in your own code to any supported Hugging Face model with a few lines of code. Here is an example:

from llava.model.builder import load_pretrained_model
from framefusion.interface import apply_framefusion

# set attn_implementation to be sdpa
tokenizer, model, image_processor, max_length = load_pretrained_model("lmms-lab/LLaVA-Video-7B-Qwen2", None, "llava_qwen", torch_dtype="bfloat16", attn_implementation='sdpa', device_map="auto")

# apply FrameFusion
apply_framefusion(model, cost=0.3, similarity_lower_bound=0.6, ratio_lower_bound=0.1)

# use the model as usual
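
After patching, the model is used exactly as before. The sketch below is hypothetical: it assumes that input_ids and a preprocessed video tensor have already been built with the usual LLaVA-NeXT preprocessing, as done in script/playground/example_llava.py, and the generate keyword arguments follow that example.

import torch

# Hypothetical sketch: `input_ids` and `video` are assumed to be prepared with
# LLaVA-NeXT preprocessing (see script/playground/example_llava.py).
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        images=[video],          # video frames as an image tensor in a list
        modalities=["video"],    # LLaVA-NeXT treats the input as a video
        do_sample=False,
        max_new_tokens=128,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())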

Adapt to new models

Understand Code Structure

  • framefusion/: The main package for FrameFusion.
    • models/: The adapter for different models.
    • main.py: The main implementation of FrameFusion.
    • interface.py: The interface for applying FrameFusion.
  • script/: Scripts for running experiments.
    • evaluate/: Scripts for evaluating model performance.
    • playground/: Scripts for running miscellaneous experiments.
  • example/: Example input videos.

Modify the code

  1. Add a new model adapter in framefusion/models/; it applies FrameFusion after the attention module (see the sketch after this list).

    Three model functions are required: llm_forward, decoder_forward, and attention_forward. These forward functions are easily adapted from the corresponding modeling_<MODEL>.py functions in Hugging Face Transformers; all modifications are marked with ### comments. For the LLM, see framefusion/models/qwen2/modeling_qwen2.py as an example.

  2. Register the model in framefusion/interface.py so that FrameFusion is applied to the correct model class.

  3. Add a new example in script/playground/ showing how to apply FrameFusion to the model.
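
For orientation, the skeleton below sketches what a new adapter file might look like. Only the three function names come from the steps above; the file path and the placeholder comments are illustrative, not the actual implementation.

# framefusion/models/<new_model>/modeling_<new_model>.py  (hypothetical layout)
# Each function starts as a copy of the corresponding forward in the Hugging Face
# modeling_<MODEL>.py; FrameFusion-specific changes are marked with ### comments.

def llm_forward(self, input_ids=None, attention_mask=None, **kwargs):
    ### FrameFusion: track vision-token positions and pass the token-reduction
    ### state down to the decoder layers
    ...

def decoder_forward(self, hidden_states, **kwargs):
    ### FrameFusion: after the attention module, merge similar vision tokens
    ### and prune unimportant ones
    ...

def attention_forward(self, hidden_states, **kwargs):
    ### FrameFusion: expose the attention outputs needed to score token importance
    ...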

Happy to help

If you have any questions about applying FrameFusion to a new model, please feel free to open an issue. We are happy to help and to extend the adapters to more models.

Supported Model List

  • MiniCPM-V

  • Llava-Video

  • NVILA
