$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Jiaqi Han*, Mingjian Jiang*, Yuxuan Song, Jure Leskovec, Stefano Ermon, Minkai Xu*^
*Equal contribution. ^Corresponding author.
Stanford University

License: MIT · arXiv:2410.21662

Introduction

We introduce $f$-PO, a novel approach that generalizes preference optimization by casting it as $f$-divergence minimization, i.e., as a distribution matching problem. The framework subsumes existing offline preference optimization methods such as DPO, while inspiring new formulations drawn from other members of the $f$-divergence family. Results demonstrate the efficacy of $f$-PO on a wide suite of benchmarks, including AlpacaEval, MT-Bench, and ArenaHard.
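For intuition, recall the $f$-divergence between distributions $p$ and $q$: for a convex $f$ with $f(1)=0$,

$$ D_f(p \,\|\, q) = \mathbb{E}_{x \sim q}\!\left[ f\!\left( \frac{p(x)}{q(x)} \right) \right], $$

where, e.g., $f(t) = t \log t$ recovers the KL divergence $\mathrm{KL}(p \,\|\, q)$. Schematically, $f$-PO trains the policy to match a preference-induced target distribution under such a divergence, and different choices of $f$ instantiate different preference optimization objectives; please see the paper for the exact training objective and the choice of $f$ that recovers DPO.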

Overview

(Framework overview figure omitted here; please see the paper for the illustration.)

Checkpoints

We provide checkpoints of the models trained in the paper, hosted on the Hugging Face Hub.
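As a minimal sketch of fetching one (the repository ID below is a placeholder, not an actual released checkpoint name), using the Hugging Face CLI:

pip install -U "huggingface_hub[cli]"
# Placeholder repo ID -- substitute the actual checkpoint repository.
huggingface-cli download MinkaiXu/fPO-llama-3-8b-placeholder \
    --local-dir checkpoints/fpo-llama-3-8b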

Environment

Please refer to the environment file for the detailed dependencies.
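Assuming the environment file is a conda specification (the filename and environment name below are assumptions; use the file shipped with this repo), setup is typically:

# Assumed filename; replace with the actual environment file in the repo.
conda env create -f environment.yml
# Assumed environment name defined inside the file.
conda activate fpo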

Experiments

Tuning Pythia-2.8B on HH and TLDR

Please refer to HH and TLDR for the procedures of data processing, generating the preference dataset with the SFT model, and labeling with the reward model.

Afterward, refer to run/hh_pref.sh, run/hh_rw.sh, run/tldr_pref.sh, and run/tldr_rw.sh for an integrated pipeline of training, inference, and GPT-4 evaluation on HH and TLDR in the preference and reward model settings.

Remember to change the variables INIT_MODEL_PATH and DATA_PATH in the scripts (e.g., run/hh_pref.sh), as well as YOUR_PATH and YOUR_API_KEY in src/api.py, for the code to work properly; a sketch follows.
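For concreteness, a minimal sketch of these edits (every path and the API key below are placeholders, not values shipped with the repo):

# In run/hh_pref.sh -- placeholder paths; point them at your own artifacts.
INIT_MODEL_PATH=/path/to/pythia-2.8b-sft      # SFT model to initialize from
DATA_PATH=/path/to/processed_hh_dataset       # output of the data processing step

# In src/api.py -- placeholder values used for GPT-4 evaluation.
#   YOUR_PATH    = "/path/to/eval_outputs"
#   YOUR_API_KEY = "sk-..."

# Then launch the integrated pipeline:
bash run/hh_pref.sh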

Tuning Llama3 and Mistral-7B

Use the following command to launch the training:

ACCELERATE_LOG_LEVEL=info \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
accelerate launch \
--main_process_port 21893 \
--config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py \
training_configs/llama-3-8b-base-alphapo.yaml

To run the full set of Llama3 and Mistral-7B experiments in both the Base and Instruct settings, swap the config file in the last line for the other files in the training_configs folder, as in the sketch below.
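For example (the config filename below is hypothetical; use the actual files under training_configs):

ACCELERATE_LOG_LEVEL=info \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
accelerate launch \
--main_process_port 21893 \
--config_file accelerate_configs/deepspeed_zero3.yaml \
scripts/run_simpo.py \
training_configs/mistral-7b-instruct-fpo.yaml  # hypothetical config name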

For evaluation on AlpacaEval, MT-Bench, and ArenaHard, see here for detailed guidance and configurations.

Citation

Please consider citing our work if you find it useful:

@article{han2024f,
  title={$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization},
  author={Han, Jiaqi and Jiang, Mingjian and Song, Yuxuan and Leskovec, Jure and Ermon, Stefano and Xu, Minkai},
  journal={arXiv preprint arXiv:2410.21662},
  year={2024}
}

Acknowledgment

This repo is built upon EXO and SimPO. We thank the authors for their great work and for open-sourcing their code.
