GitHub - SliMM-X/CoMP-MM: Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Yitong Chen¹,²*, Lingchen Meng¹*, Wujian Peng¹,², Zuxuan Wu¹,²†, Yu-Gang Jiang¹

¹ Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
² Shanghai Innovation Institute

* Equal contributions; † Corresponding author.

[`Paper`] [`Website`] [`Model`]

Installation

Clone this repository and navigate to the CoMP-MM folder:

git clone https://github.com/SliMM-X/CoMP-MM.git
cd CoMP-MM

Install Package

conda create -n comp-slimm python=3.10 -y
conda activate comp-slimm
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# additional packages for training cases
pip install -e ".[train]"

# install flash-attn directly
pip install flash-attn --no-build-isolation

# or build it from source
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install
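
After the steps above, it can help to confirm that the packages are visible to the active environment before running the examples. The following is a minimal sketch, not part of the repository: `missing_packages` is a hypothetical helper, and the module names `torch`, `slimm`, and `flash_attn` are assumed from the install commands and the import statements in this README.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from this README (hypothetical grouping).
REQUIRED = ["torch", "slimm"]
OPTIONAL = ["flash_attn"]

for name in missing_packages(REQUIRED):
    print(f"required package not found: {name}")
for name in missing_packages(OPTIONAL):
    print(f"optional package not found: {name}")
```

The check only looks the modules up without importing them, so it does not verify that a CUDA build of flash-attn actually loads.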

Quick Start With HuggingFace

Example code for CoMP-VFMs:

import torch
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.vision_encoder import CoMPSiglipVisionModel, CoMPDinov2Model
from PIL import Image
import requests
from io import BytesIO

model_path = "SliMM-X/CoMP-SigLIP-So400M"
# model_path = "SliMM-X/CoMP-DINOv2-Large"

model = CoMPSiglipVisionModel.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda", w_merger=False
).to(torch.bfloat16)

# model = CoMPDinov2Model.from_pretrained(
#     model_path, torch_dtype="auto", device_map="cuda", w_merger=False
# ).to(torch.bfloat16)

processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

# Download the example image and fail early on HTTP errors
urldata = requests.get("https://slimm-x.github.io/comp/figs/teaser.png", timeout=30)
urldata.raise_for_status()
temp_img = BytesIO(urldata.content)
image_input = Image.open(temp_img)

inputs = processor(
    images=image_input,
    return_tensors="pt",
)

inputs = inputs.to("cuda")
# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    output_feat = model(inputs.pixel_values.to(torch.bfloat16), inputs.image_grid_thw)
print(output_feat)
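
The model call above takes `inputs.image_grid_thw` together with the pixel values. Assuming the processor follows the Qwen2-VL convention (an assumption; this README only states that usage is similar to Qwen2-VL), each row is a `(t, h, w)` grid of patches and the number of patch tokens is their product. The sketch below uses a hypothetical helper, `expected_num_tokens`, that is not part of `slimm`, to show the arithmetic.

```python
from typing import Sequence


def expected_num_tokens(grid_thw: Sequence[int], merge_size: int = 1) -> int:
    """Count patch tokens for one image from its (t, h, w) patch grid.

    Assumes the Qwen2-VL convention. merge_size=1 means no spatial merging;
    merge_size=2 groups each 2x2 block of patches into a single token.
    """
    t, h, w = grid_thw
    if h % merge_size or w % merge_size:
        raise ValueError("h and w must be divisible by merge_size")
    return t * (h // merge_size) * (w // merge_size)


# Toy grid: 1 temporal step, 32 x 48 patches
print(expected_num_tokens((1, 32, 48)))
print(expected_num_tokens((1, 32, 48), merge_size=2))
```

If the assumption holds, `expected_num_tokens(inputs.image_grid_thw[0].tolist())` can be compared against the token dimension of the returned features as a quick sanity check.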

Example code for CoMP-MM:

# The usage closely follows Qwen2-VL
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.slimm import SliMMForConditionalGeneration
from slimm.model.utils_vl import process_vision_info

model_path = "SliMM-X/CoMP-MM-1B"

model = SliMMForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda"
)
processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://slimm-x.github.io/comp/figs/teaser.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
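
The trimming step in the example is needed because `generate` returns each prompt followed by the newly generated tokens, so the prompt prefix has to be cut off before decoding. A toy illustration with plain Python lists follows; the token IDs are made up and `trim_prompt` is a hypothetical helper that mirrors the list comprehension above.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each generated row starts with its own prompt
prompts = [[101, 7, 8], [101, 9, 0]]
generated = [[101, 7, 8, 42, 43], [101, 9, 0, 55, 56]]

print(trim_prompt(prompts, generated))  # [[42, 43], [55, 56]]
```

With real batches, the prompts are padded to a common length by the processor, so the same slice works for every row.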

Train

We provide two scripts to reproduce our models: (1) CoMP-MM-1B w/ SigLIP:

bash scripts/comp/comp_1b_siglip.sh

and (2) CoMP-MM-1B w/ DINOv2:

bash scripts/comp/comp_1b_dinov2.sh

For data preparation, please refer to scripts/comp/README.md.

Evaluation

We provide a script for evaluating multimodal understanding, based on the copy of lmms-eval included in this repository. First, install lmms-eval:

cd lmms-eval
pip install -e .
cd ..

Then run:

bash scripts/comp/eval.sh

🔗 Citation

If you find our work helpful, please consider citing our paper 📎 and starring our repo 🌟:

@article{comp2025,
      title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models}, 
      author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
      year={2025},
      journal={arXiv preprint arXiv:2503.18931}, 
}

Acknowledgement

Our work is built upon SliMM, Qwen2-VL, LLaVA and LLaVA-NeXT.

Feel free to contribute and reach out if you have any questions!
