This repo includes the code for "Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder" (ACL 2025).
First, download the What'sUp dataset and the MMVP/MMVP-VLM datasets; our evaluation code is also based on their released code.
For CLIP-ViT-L/14-336px: Check Patch-Aligned-Contrastive-Learning/eval_clip.py and the scripts in Patch-Aligned-Contrastive-Learning/eval.sh.
For LLaVA-1.5-7B, Phi-3-V-3.8B, LLaMA-3-V-8B: Check Patch-Aligned-Contrastive-Learning/eval_vqa_score.py and the scripts in Patch-Aligned-Contrastive-Learning/eval.sh.
For original LLM2CLIP: Use open_clip/src/eval_llm2clip.sh.
The additional questions for the converted MMVP/MMVP-VLM are provided in additional_questions. Please place them in the folders containing your MMVP and MMVP-VLM data, respectively.
The code is based on t2v_metrics; we add new datasets and new models for our experiments. Please follow their instructions for environment installation, and then use t2v_metrics/eval.sh.
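For reference, here is a minimal scoring sketch assuming the upstream t2v_metrics interface (the image path and caption are placeholders; our exact evaluation commands are in t2v_metrics/eval.sh):

```python
import t2v_metrics

# VQAScore with a CLIP-FlanT5 backbone, as provided by the upstream t2v_metrics package.
score_fn = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Higher scores indicate better image-text alignment.
scores = score_fn(images=['path/to/image.png'],
                  texts=['the mug is to the left of the laptop'])
print(scores)
```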
We use OpenCLIP to finetune the pretrained CLIP, SigLIP, and EVA-CLIP models on converted LLaVA-1.5 data. Please download the data following LLaVA-1.5's instructions, install the environment required by OpenCLIP, and then check our open_clip folder for (1) the dataset setting and (2) the NegCLIP-style loss for training with left/right negatives (sketched below). You can use train-clip.sh as an example of finetuning CLIP; remove the --lock-image option if you want to finetune with an unfrozen vision encoder.
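As a rough illustration of the NegCLIP-style objective (a sketch, not the exact code in our open_clip folder; the function name and `neg_text_features` are ours), the captions with "left"/"right" swapped are encoded as hard-negative texts and appended as extra candidates in the image-to-text direction:

```python
import torch
import torch.nn.functional as F

def negclip_style_loss(image_features, text_features, neg_text_features, logit_scale):
    """Contrastive loss with hard-negative captions (e.g., 'left' and 'right' swapped).

    image_features:     (B, D) normalized image embeddings
    text_features:      (B, D) normalized embeddings of the original captions
    neg_text_features:  (B, D) normalized embeddings of the swapped-caption negatives
    """
    # Image-to-text logits: in-batch texts plus the hard negatives as extra candidates.
    all_texts = torch.cat([text_features, neg_text_features], dim=0)       # (2B, D)
    logits_per_image = logit_scale * image_features @ all_texts.t()        # (B, 2B)

    # Text-to-image logits use only the original captions.
    logits_per_text = logit_scale * text_features @ image_features.t()     # (B, B)

    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)
    return (loss_i2t + loss_t2i) / 2
```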
For the ablation on LLaVA-1.5, the only change we made to the official code is adding a 'cls' option in llava/model/multimodal_encoder/clip_encoder.py (plus the corresponding scripts and parameters):
```python
def feature_select(self, image_forward_outs):
    image_features = image_forward_outs.hidden_states[self.select_layer]
    if self.select_feature == 'patch':
        image_features = image_features[:, 1:]
    elif self.select_feature == 'cls_patch':
        image_features = image_features
    elif self.select_feature == 'cls':  # Our added option: keep only the CLS token
        image_features = image_features[:, 0:1]
    else:
        raise ValueError(f'Unexpected select feature: {self.select_feature}')
    return image_features
```
Then we train the model with the 'cls' option for both stages (Pre-training for Feature Alignment + Fine-tuning End-to-End). The checkpoints can be downloaded at https://huggingface.co/lst627/llava-v1.5-7b-lora-merged and https://huggingface.co/lst627/llava-v1.5-7b-lora-cls-merged. Note that LLaVA-1.5 uses the penultimate layer of the CLIP vision encoder, not the last layer.
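To make the layer and token choice concrete, here is a hedged sketch using Hugging Face transformers directly (not LLaVA's vision-tower wrapper); LLaVA-1.5 achieves the same effect through feature_select above with select_layer = -2 and the 'cls' option:

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs, output_hidden_states=True)

# Penultimate layer (select_layer = -2), CLS token only (the 'cls' option above).
penultimate = outputs.hidden_states[-2]   # (1, 577, 1024) for ViT-L/14-336px
cls_feature = penultimate[:, 0:1]         # (1, 1, 1024)
```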
For PACL (which uses patch tokens for the image) and SPARC (which uses patch tokens for the image and multiple text tokens), our code is based on an implementation of PACL; see Patch-Aligned-Contrastive-Learning for all the relevant code. When replacing the text encoder with a stronger LLM-based text encoder, we precompute all the text embeddings to accelerate training; a minimal sketch of this caching step follows below.
The checkpoints can be downloaded here and the LLM embeddings are here.
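The sketch below illustrates the precomputation idea under our own assumptions (the encoder, tokenizer, pooling choice, and file layout are placeholders, not our exact pipeline): every caption is encoded once by the LLM-based text encoder and the embeddings are cached to disk, so training only loads the cached tensors.

```python
import torch

@torch.no_grad()
def cache_text_embeddings(captions, text_encoder, tokenizer, out_path,
                          batch_size=256, device="cuda"):
    """Encode all captions once and save the embeddings, so training skips the text encoder."""
    text_encoder.eval().to(device)
    embeddings = []
    for start in range(0, len(captions), batch_size):
        batch = captions[start:start + batch_size]
        tokens = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
        # Pooling depends on the encoder; here we take the first token's hidden state.
        feats = text_encoder(**tokens).last_hidden_state[:, 0]
        embeddings.append(torch.nn.functional.normalize(feats, dim=-1).cpu())
    torch.save({"captions": captions, "embeddings": torch.cat(embeddings)}, out_path)
```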
Our code is based on a previous commit of VLM2Vec. Please refer to their repository for setting up the environment.
Our LLaVA-1.5-7B-VLM2Vec-LoRA checkpoint can be downloaded here and can be evaluated using VLM2Vec/eval.sh. If you would like to reproduce the training process, please refer to VLM2Vec/scripts/llava_1.5/run_train.sh.
If you find our code, data, or the paper useful, please cite the paper:
```bibtex
@inproceedings{li2025exploring,
  title={Exploring How Generative {MLLMs} Perceive More Than {CLIP} with the Same Vision Encoder},
  author={Li, Siting and Koh, Pang Wei and Du, Simon Shaolei},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={10101--10119},
  year={2025}
}
```

