Source code for CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning
Install the package following LLaVA-NeXT:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e ".[train]"
During training, we only use subsets of the LLaVA-OneVision training data specified in data/dataset.yaml. The LLaVA-OneVision training data can be downloaded from here.
After downloading these datasets, convert each one into the LLaVA-OneVision format, i.e., a JSON file plus an image folder. Place the converted datasets under data and train with them. An example conversion script is provided at scripts/prepare_data.py; a minimal sketch of the expected layout is shown below.
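The following sketch illustrates the annotation layout the converted JSON is expected to follow (field names match the LLaVA-OneVision convention; the id, image path, and conversation text are placeholders):

import json
from pathlib import Path

# Minimal sketch of a converted dataset: one JSON list in which each record
# points to an image (relative to the image folder) and holds a human/gpt
# conversation. All values below are placeholders.
records = [
    {
        "id": "sample_0000",
        "image": "images/sample_0000.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in this image?"},
            {"from": "gpt", "value": "A short answer describing the image."},
        ],
    }
]

Path("data").mkdir(parents=True, exist_ok=True)
with open("data/sample_dataset.json", "w") as f:
    json.dump(records, f, indent=2)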
To train CAFe, we provide a sample training script, scripts/train.sh. The training datasets are specified through --data_path and --image_folder, following the format of LLaVA-NeXT. --compute_language_loss and --compute_contrastive_loss control whether the language-modeling loss and the contrastive loss are computed, and --con_weight sets the weight of the contrastive loss.
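For intuition, the sketch below shows how these flags are assumed to combine into the training objective: an in-batch InfoNCE contrastive loss scaled by --con_weight and added to the language-modeling loss. It is a placeholder illustration, not the repository's training code:

import torch
import torch.nn.functional as F

def combined_loss(lm_loss, image_emb, text_emb, con_weight=0.1,
                  compute_language_loss=True, compute_contrastive_loss=True,
                  temperature=0.07):
    """Illustrative combination of the two losses (placeholder names/values)."""
    total = torch.zeros((), device=image_emb.device)
    if compute_language_loss:
        total = total + lm_loss
    if compute_contrastive_loss:
        # Symmetric InfoNCE over in-batch image/text pairs.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        con_loss = (F.cross_entropy(logits, labels)
                    + F.cross_entropy(logits.t(), labels)) / 2
        total = total + con_weight * con_loss
    return total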
To evaluate retrieval, run:
python eval/eval_retrieval.py \
--ckpt=/CKPT/PATH/ \
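For reference, retrieval metrics of this kind rank candidates by embedding similarity and check whether the matching item appears in the top K. A minimal recall@K sketch (assumed setup with index-aligned query/gallery pairs; not the script's actual implementation):

import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matching gallery item (assumed to share the
    same row index) appears among the top-k most similar embeddings."""
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)
    sims = query_emb @ gallery_emb.t()
    topk = sims.topk(k, dim=-1).indices
    targets = torch.arange(query_emb.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()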
Multimodal retrieval on MMEB
Clone VLM2Vec and follow the instructions there to download the MMEB dataset.
To train CAFe on MMEB, prepare the dataset in the same format described earlier and enable --multimodal_input during training.
To evaluate, run the following script (modified from VLM2Vec), replacing --subset_name with the desired dataset names:
python eval/eval_mmeb.py \
--encode_output_path output/vlm_outputs/ \
--dataset_name TIGER-Lab/MMEB-eval \
--subset_name ImageNet-1K N24News HatefulMemes \
--dataset_split test --per_device_eval_batch_size 4 \
--image_dir data/eval_images/
Please follow the instructions in lmms-eval for setup and evaluation. CAFe can be evaluated in the same way as the LLaVA-OV MLLM. For example, to evaluate on MMStar:
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model=llava_onevision \
--model_args=pretrained=${CKPT_PATH},conv_template=qwen_1_5,model_name=llava_qwen \
--tasks="mmstar" \
--batch_size=1 \
--log_samples \
--log_samples_suffix=$TASK_SUFFIX \
--output_path="./logs/" \
--wandb_args=project=lmms-eval
For hallucination evaluation, please refer to POPE and THRONE.
If you find this work useful for your research or applications, please cite it using the following BibTeX:
@article{yu2025cafe,
  title={CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning},
  author={Yu, Hao and Zhao, Zhuokai and Yan, Shen and Korycki, Lukasz and Wang, Jianyu and He, Baosheng and Liu, Jiayi and Zhang, Lizhu and Fan, Xiangjun and Yu, Hanchao},
  journal={arXiv preprint arXiv:2503.19900},
  year={2025}
}
- The repository is heavily built upon LLaVA-NeXT.