CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

Source code for CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

Model Architecture

Installation

Install the package following LLaVA-NeXT:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e ".[train]"

Dataset

During training, we use only the subsets of the LLaVA-OneVision training data specified in data/dataset.yaml.

The LLaVA-OneVision training data can be downloaded from here.

After downloading these datasets, convert each one into the LLaVA-OneVision format, i.e., a JSON annotation file plus an image folder. Place the converted datasets under data and run training on them.

An example conversion script is provided at scripts/prepare_data.py.
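For reference, below is a minimal sketch of the converted annotation format assumed here. The field names follow the common LLaVA conversation convention; check scripts/prepare_data.py for the exact schema, and note that convert_example is a hypothetical helper, not part of the repository.

# Minimal sketch of the assumed LLaVA-OneVision-style annotation format.
import json
from pathlib import Path

def convert_example(example_id: str, image_name: str, question: str, answer: str) -> dict:
    """Build one annotation entry: an image reference plus a conversation."""
    return {
        "id": example_id,
        "image": image_name,  # path relative to the image folder under data/
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

if __name__ == "__main__":
    entries = [convert_example("0001", "0001.jpg", "What is in the image?", "A cat.")]
    Path("data").mkdir(exist_ok=True)
    with open("data/converted_subset.json", "w") as f:
        json.dump(entries, f, indent=2)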

Train

To train CAFe, we provide a sample training script scripts/train.sh.

The training datasets are specified through --data_path and --image_folder, following the LLaVA-NeXT format. --compute_language_loss and --compute_contrastive_loss control whether the language modeling loss and the contrastive loss are computed, respectively. --con_weight sets the weight of the contrastive loss.
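As a rough conceptual illustration of how these flags interact, here is a hedged sketch of combining an autoregressive language modeling loss with a symmetric InfoNCE contrastive loss. This is not the repository's training code; the function, argument, and tensor names below are hypothetical.

import torch
import torch.nn.functional as F

def total_loss(lm_loss: torch.Tensor,
               image_emb: torch.Tensor,   # (B, D) pooled image embeddings
               text_emb: torch.Tensor,    # (B, D) pooled text embeddings
               con_weight: float = 0.1,
               compute_language_loss: bool = True,
               compute_contrastive_loss: bool = True,
               temperature: float = 0.07) -> torch.Tensor:
    """Combine the autoregressive LM loss with a symmetric InfoNCE loss."""
    loss = torch.zeros((), device=image_emb.device)
    if compute_language_loss:
        loss = loss + lm_loss
    if compute_contrastive_loss:
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature      # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        con_loss = 0.5 * (F.cross_entropy(logits, targets) +
                          F.cross_entropy(logits.t(), targets))
        loss = loss + con_weight * con_loss
    return loss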

Evaluation

Zero-shot image-text retrieval on MSCOCO and Flickr

python eval/eval_retrieval.py \
    --ckpt=/CKPT/PATH/
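Conceptually, the retrieval evaluation embeds images and captions with the finetuned model and reports recall@k from their cosine similarities. Below is a minimal sketch of that metric; recall_at_k is a hypothetical helper, not the script's actual code, and it assumes one paired caption per image, whereas MSCOCO and Flickr provide several captions per image.

import torch
import torch.nn.functional as F

def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 1) -> float:
    """Fraction of images whose paired caption is among the top-k retrieved texts."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                  # (N, k) retrieved text indices
    targets = torch.arange(sims.size(0)).unsqueeze(-1)   # ground-truth caption index per image
    return (topk == targets).any(dim=-1).float().mean().item()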

Multimodal retrieval on MMEB

Clone VLM2Vec and follow the instructions there to download the MMEB dataset.

To train CAFe on MMEB, prepare the dataset following the same format as described earlier and enable --multimodal_input when training.

To evaluate, run the following script (modified from VLM2Vec), replacing --subset_name with the desired dataset names:

python eval/eval_mmeb.py \
  --encode_output_path output/vlm_outputs/ \
  --dataset_name TIGER-Lab/MMEB-eval \
  --subset_name ImageNet-1K N24News HatefulMemes \
  --dataset_split test --per_device_eval_batch_size 4 \
  --image_dir data/eval_images/

Multimodal understanding

Please follow the instructions in lmms-eval for setup and evaluation. CAFe can be evaluated as the LLaVA-OV MLLM. For example, to evaluate on MMStar:

accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=${CKPT_PATH},conv_template=qwen_1_5,model_name=llava_qwen \
    --tasks="mmstar" \
    --batch_size=1 \
    --log_samples \
    --log_samples_suffix=$TASK_SUFFIX \
    --output_path="./logs/" \
    --wandb_args=project=lmms-eval

Hallucination

Please refer to POPE and THRONE.

Citation

If you find CAFe useful for your research and applications, please cite the paper using this BibTeX:

@article{yu2025cafe,
  title={CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning},
  author={Yu, Hao and Zhao, Zhuokai and Yan, Shen and Korycki, Lukasz and Wang, Jianyu and He, Baosheng and Liu, Jiayi and Zhang, Lizhu and Fan, Xiangjun and Yu, Hanchao},
  journal={arXiv preprint arXiv:2503.19900},
  year={2025}
}

Acknowledgement

  • This repository is built upon LLaVA-NeXT.
