bupo-logo
Bottom-up Policy Optimization:
Your Language Model Policy Secretly Contains Internal Policies

arXiv License: MIT Python

📰 News

  • 2026-1-4: 🔥 We release the visualization code of internal policy entropy flow.
  • 2025-12-24: 🎉 Ranked #3 of the day on Huggingface Daily Papers.
  • 2025-12-23: 🔥 We release the BuPO code and the paper.

👁 Overview

Bottom-up Policy Optimization (BuPO) provides a novel framework that decomposes the unified LLM policy into internal layer and modular policies, reveals distinct reasoning patterns across model architectures, and introduces a bottom-up optimization algorithm that leverages these insights to enhance complex reasoning.

BuPO

🤯 Key Findings:

  • Internal Policies: Decomposes the unified LLM policy into samplable distributions from individual layers and modules (self-attention & FFN); see the sketch after this list.
  • Progressive Reasoning Pattern: Reveals a human-like "Exploration-Integration-Convergence" (EIC) pattern in Qwen models, in contrast to the abrupt convergence observed in Llama models.
  • Bottom-up Policy Optimization (BuPO): A novel two-phase RL algorithm that first optimizes an internal, lower-layer policy to reconstruct foundational reasoning, then fine-tunes the full model.
  • Enhanced Reasoning Performance: BuPO significantly outperforms standard RL baselines on complex reasoning benchmarks.
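
The sketch below illustrates one way such an internal layer policy can be read out: project an intermediate hidden state through the model's final norm and LM head to obtain a samplable token distribution (logit-lens style). The model name, layer index, and prompt are placeholders chosen for illustration, not the paper's exact setup.

# Illustrative sketch (not the exact BuPO implementation): reading out a
# samplable "internal policy" from an intermediate layer, logit-lens style.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("Let's reason step by step:", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

k = 12                                           # internal layer index (example value)
h_k = out.hidden_states[k][:, -1, :]             # layer-k hidden state at the last position
h_k = model.model.norm(h_k)                      # reuse the model's final norm (assumption)
internal_logits = model.lm_head(h_k)             # project onto the vocabulary
internal_policy = torch.softmax(internal_logits, dim=-1)   # samplable distribution
next_token = torch.multinomial(internal_policy, num_samples=1)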

🚀 Quick Start

Installation

conda create -y -n bupo python=3.10.17 && conda activate bupo
pip install -r requirements.txt
python -m pip install flash-attn --no-build-isolation
pip install -e .

Training

BuPO: specify k (the internal layer policy index) and iterative_steps (the number of internal policy optimization steps) in run_code/BuPO_qwen3.sh and run_code/BuPO_llama.sh to train the model with BuPO; a sketch of the two-phase schedule follows the commands below.

cd BuPO
conda activate bupo
bash run_code/BuPO_qwen3.sh
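
For intuition, here is a minimal sketch of how iterative_steps could gate the two training phases. All names and values are hypothetical; the actual logic lives in the verl-based trainer (see Implementation Details).

# Minimal sketch (not the repo's trainer code): phase 1 optimizes the internal
# layer-k policy for the first `iterative_steps` updates, phase 2 fine-tunes the
# full policy. All names and values here are hypothetical.
def select_phase(global_step: int, iterative_steps: int) -> str:
    return "internal_layer_k" if global_step < iterative_steps else "full_policy"

total_steps, iterative_steps = 200, 50  # example values, not the paper's settings
for step in range(total_steps):
    if select_phase(step, iterative_steps) == "internal_layer_k":
        pass  # update the layer-k internal policy (cf. _forward_micro_batch_layer_k)
    else:
        pass  # standard full-policy RL update (GRPO-style)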

GRPO:

bash run_code/GRPO_qwen3.sh

Internal Policy Entropy Plot 🥷🏼

  • First, run run_eval.sh in the scripts folder to obtain the dataset used for visualization.
  • Then, run plot_internal_entropy.py in the visualization folder to plot the internal policy entropy flow; a minimal plotting sketch follows this list.
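
As a rough illustration of what the entropy-flow plot computes, the sketch below takes a per-layer matrix of internal-policy probabilities (collected, for example, as in the earlier read-out sketch) and plots the entropy per layer. The input file name is hypothetical; the actual code is in visualization/plot_internal_entropy.py.

# Hedged sketch, not the repo's plotting script: compute and plot per-layer
# entropy from a [num_layers, vocab_size] array of internal-policy probabilities.
import numpy as np
import matplotlib.pyplot as plt

probs = np.load("internal_policy_probs.npy")               # hypothetical file name
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)    # Shannon entropy per layer

plt.plot(range(1, len(entropy) + 1), entropy, marker="o")
plt.xlabel("Layer index")
plt.ylabel("Internal policy entropy")
plt.title("Internal policy entropy flow")
plt.savefig("internal_entropy_flow.png")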

Implementation Details 🤔

Our main design lies in:

  • verl/models/custom_model: we modify the model forward-pass source files from transformers to efficiently extract internal hidden states and internal policies.
  • verl/workers/actor/dp_actor.py, _forward_micro_batch_layer_k(): switches the forward pass to compute the importance ratio of the internal layer policy and update it; a sketch of this ratio is shown below.
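
A hedged sketch of what such an importance ratio looks like for the layer-k internal policy, in the spirit of _forward_micro_batch_layer_k(); tensor names are hypothetical and this is not the repository's actual implementation.

# Hedged sketch: PPO/GRPO-style importance ratio computed on the layer-k
# internal policy instead of the final-layer policy. Names are hypothetical.
import torch

def internal_policy_ratio(new_logits_k, old_logprobs_k, response_ids):
    # new_logits_k: [batch, seq, vocab] logits from projecting layer-k hidden states
    # old_logprobs_k: [batch, seq] old log-probs of the sampled response tokens
    # response_ids: [batch, seq] sampled token ids
    new_logprobs_k = torch.log_softmax(new_logits_k, dim=-1)
    new_logprobs_k = torch.gather(new_logprobs_k, -1, response_ids.unsqueeze(-1)).squeeze(-1)
    return torch.exp(new_logprobs_k - old_logprobs_k)  # fed into the clipped surrogate objective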

🙇‍♂️ Acknowledgement

We thank the verl project for its valuable contributions to the open-source community.

📬 Contact

For questions, discussion, or collaboration opportunities, feel free to contact us!

✍️ Citation

If you find our work helpful, please cite as:

@article{tan2025bupo,
      title={Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies}, 
      author={Yuqiao Tan and Minzheng Wang and Shizhu He and Huanxuan Liao and Chengfeng Zhao and Qiunan Lu and Tian Liang and Jun Zhao and Kang Liu},
      year={2025},
      journal={arXiv preprint arXiv:2512.19673},
      url={https://arxiv.org/abs/2512.19673}
}

Star History

Star History Chart
