This repo includes the code for "Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder" (ACL 2025).
First, download the What'sUp dataset and the MMVP/MMVP-VLM datasets; our evaluation code is also based on their released code.
For CLIP-ViT-L/14-336px: Check Patch-Aligned-Contrastive-Learning/eval_clip.py and the scripts in Patch-Aligned-Contrastive-Learning/eval.sh.
For LLaVA-1.5-7B, Phi-3-V-3.8B, LLaMA-3-V-8B: Check Patch-Aligned-Contrastive-Learning/eval_vqa_score.py and the scripts in Patch-Aligned-Contrastive-Learning/eval.sh.
For original LLM2CLIP: Use open_clip/src/eval_llm2clip.sh.
The additional questions for the converted MMVP/MMVP-VLM are provided in additional_questions. Please place them in the folders containing your MMVP and MMVP-VLM data, respectively.
The code is based on t2v_metrics; we add new datasets and new models for our experiments. Please follow their instructions for environment installation, and then use t2v_metrics/eval.sh.
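For reference, here is a minimal scoring sketch assuming the upstream t2v_metrics interface (the image path and caption are placeholders; our exact evaluation commands are in t2v_metrics/eval.sh):

```python
import t2v_metrics

# VQAScore with a CLIP-FlanT5 backbone, as provided by the upstream t2v_metrics package.
score_fn = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Higher scores indicate better image-text alignment.
scores = score_fn(images=['path/to/image.png'],
                  texts=['the mug is to the left of the laptop'])
print(scores)
```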
We use OpenCLIP to finetune the pretrained CLIP, SigLIP, and EVA-CLIP models on converted LLaVA-1.5 data. Please download the data following LLaVA-1.5's instructions, install the environment required by OpenCLIP, and then check our open_clip folder for (1) the dataset setting and (2) the NegCLIP-style loss for training with left/right negatives (sketched below). You can use train-clip.sh as an example of finetuning CLIP; remove the --lock-image option if you want to finetune with an unfrozen vision encoder.
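As a rough illustration of the NegCLIP-style objective (a sketch, not the exact code in our open_clip folder; the function name and `neg_text_features` are ours), the captions with "left"/"right" swapped are encoded as hard-negative texts and appended as extra candidates in the image-to-text direction:

```python
import torch
import torch.nn.functional as F

def negclip_style_loss(image_features, text_features, neg_text_features, logit_scale):
    """Contrastive loss with hard-negative captions (e.g., 'left' and 'right' swapped).

    image_features:     (B, D) normalized image embeddings
    text_features:      (B, D) normalized embeddings of the original captions
    neg_text_features:  (B, D) normalized embeddings of the swapped-caption negatives
    """
    # Image-to-text logits: in-batch texts plus the hard negatives as extra candidates.
    all_texts = torch.cat([text_features, neg_text_features], dim=0)       # (2B, D)
    logits_per_image = logit_scale * image_features @ all_texts.t()        # (B, 2B)

    # Text-to-image logits use only the original captions.
    logits_per_text = logit_scale * text_features @ image_features.t()     # (B, B)

    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)
    return (loss_i2t + loss_t2i) / 2
```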
For the ablation on LLaVA-1.5, the only change we made to the official code is adding a 'cls' option in llava/model/multimodal_encoder/clip_encoder.py (plus the corresponding scripts and parameters):
```python
def feature_select(self, image_forward_outs):
    image_features = image_forward_outs.hidden_states[self.select_layer]
    if self.select_feature == 'patch':
        image_features = image_features[:, 1:]
    elif self.select_feature == 'cls_patch':
        image_features = image_features
    elif self.select_feature == 'cls':  # Our added option: keep only the CLS token
        image_features = image_features[:, 0:1]
    else:
        raise ValueError(f'Unexpected select feature: {self.select_feature}')
    return image_features
```
Then we train the model with the 'cls' option for both stages (Pre-training for Feature Alignment + Fine-tuning End-to-End). The checkpoints can be downloaded at https://huggingface.co/lst627/llava-v1.5-7b-lora-merged and https://huggingface.co/lst627/llava-v1.5-7b-lora-cls-merged. Note that LLaVA-1.5 uses the penultimate layer of the CLIP vision encoder, not the last layer.
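To make the layer and token choice concrete, here is a hedged sketch using Hugging Face transformers directly (not LLaVA's vision-tower wrapper); LLaVA-1.5 achieves the same effect through feature_select above with select_layer = -2 and the 'cls' option:

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs, output_hidden_states=True)

# Penultimate layer (select_layer = -2), CLS token only (the 'cls' option above).
penultimate = outputs.hidden_states[-2]   # (1, 577, 1024) for ViT-L/14-336px
cls_feature = penultimate[:, 0:1]         # (1, 1, 1024)
```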
For PACL (which uses patch tokens for the image) and SPARC (which uses patch tokens for the image and multiple text tokens), our code is based on an implementation of PACL; see Patch-Aligned-Contrastive-Learning for all the relevant code. When replacing the text encoder with a stronger LLM-based text encoder, we precompute all the text embeddings to accelerate training; a minimal sketch of this caching step follows below.
The checkpoints can be downloaded here and the LLM embeddings are here.
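The sketch below illustrates the precomputation idea under our own assumptions (the encoder, tokenizer, pooling choice, and file layout are placeholders, not our exact pipeline): every caption is encoded once by the LLM-based text encoder and the embeddings are cached to disk, so training only loads the cached tensors.

```python
import torch

@torch.no_grad()
def cache_text_embeddings(captions, text_encoder, tokenizer, out_path,
                          batch_size=256, device="cuda"):
    """Encode all captions once and save the embeddings, so training skips the text encoder."""
    text_encoder.eval().to(device)
    embeddings = []
    for start in range(0, len(captions), batch_size):
        batch = captions[start:start + batch_size]
        tokens = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
        # Pooling depends on the encoder; here we take the first token's hidden state.
        feats = text_encoder(**tokens).last_hidden_state[:, 0]
        embeddings.append(torch.nn.functional.normalize(feats, dim=-1).cpu())
    torch.save({"captions": captions, "embeddings": torch.cat(embeddings)}, out_path)
```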
Our code is based on a previous commit of VLM2Vec. Please refer to their repository for setting up the environment.
Our LLaVA-1.5-7B-VLM2Vec-LoRA checkpoint can be downloaded here and can be evaluated using VLM2Vec/eval.sh. If you would like to reproduce the training process, please refer to VLM2Vec/scripts/llava_1.5/run_train.sh.
If you find our code, data, or the paper useful, please cite the paper:
```bibtex
@inproceedings{li2025exploring,
  title={Exploring How Generative {MLLMs} Perceive More Than {CLIP} with the Same Vision Encoder},
  author={Li, Siting and Koh, Pang Wei and Du, Simon Shaolei},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={10101--10119},
  year={2025}
}
```

