- 2026-01-04: 🔥 We release the visualization code for the internal policy entropy flow.
- 2025-12-24: 🎉 Ranked #3 of the day on Huggingface Daily Papers.
- 2025-12-23: 🔥 We release the BuPO code and the paper.
Bottom-up Policy Optimization provides a novel framework that decomposes the LLM policy into internal layer and module policies, reveals distinct reasoning patterns across model architectures, and introduces a bottom-up optimization algorithm that leverages these insights to enhance complex reasoning.
- Internal Policies: Decomposes the unified LLM policy into samplable distributions from individual layers and modules (self-attention & FFN); see the sketch after this list.
- Progressive Reasoning Pattern: Reveals a human-like "Exploration-Integration-Convergence" (EIC) pattern in Qwen models, contrasting with the abrupt convergence in Llama models.
- Bottom-up Policy Optimization (BuPO): A novel two-phase RL algorithm that first optimizes an internal, lower-layer policy to reconstruct foundational reasoning, then fine-tunes the full model.
- Enhanced Reasoning Performance: BuPO significantly outperforms standard RL on complex reasoning benchmarks.
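The layer-wise decomposition can be pictured with a logit-lens-style probe: project the hidden state after layer `k` through the model's own unembedding head to obtain a samplable next-token distribution. The sketch below is only an illustration under that assumption (the model name, layer index, and prompt are arbitrary choices), not the paper's exact construction:

```python
# Minimal sketch (not the official BuPO implementation): read an "internal
# layer policy" by projecting the hidden state after layer k through the
# model's unembedding head, assuming a Llama/Qwen-style decoder in transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # arbitrary choice; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

k = 12  # internal layer policy index (the `k` referred to above)
inputs = tok("Natalia sold clips to 48 of her friends.", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    h_k = out.hidden_states[k]                     # hidden state after layer k
    h_k = model.model.norm(h_k)                    # reuse the final RMSNorm
    logits_k = model.lm_head(h_k)                  # unembed: layer-k "policy" logits
    pi_k = torch.softmax(logits_k[:, -1], dim=-1)  # samplable next-token distribution

# pi_k can be sampled from or compared (e.g. via entropy) against the
# full-model policy softmax(out.logits[:, -1]).
entropy_k = -(pi_k * pi_k.clamp_min(1e-12).log()).sum().item()
print(f"layer-{k} policy entropy: {entropy_k:.3f}")
```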
conda create -y -n bupo python=3.10.17 && conda activate bupo
pip install -r requirements.txt
python -m pip install flash-attn --no-build-isolation
pip install -e .
BuPO: specify `k` (the internal layer policy index) and `iterative_steps` (the number of internal policy optimization steps) in `run_code/BuPO_qwen3.sh` and `run_code/BuPO_llama.sh` to train the model with BuPO; a toy sketch of this two-phase schedule follows the commands below.
cd BuPO
conda activate bupo
bash run_code/BuPO_qwen3.sh
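For intuition, the toy sketch below shows how `k` and `iterative_steps` could control a two-phase schedule: the first `iterative_steps` updates optimize the layer-`k` internal policy, after which training switches to the full model. The model, loss, and optimizer here are placeholders, not the released verl-based training code:

```python
# A minimal, assumption-laden sketch of the two-phase schedule controlled by
# `k` and `iterative_steps`: phase 1 optimizes the internal layer-k policy,
# phase 2 optimizes the full policy. Everything below is a toy stand-in.
import torch
import torch.nn as nn

k, iterative_steps, total_steps = 2, 5, 12

layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])  # toy decoder stack
head = nn.Linear(8, 16)  # shared unembedding for internal and full policies
opt = torch.optim.AdamW(list(layers.parameters()) + list(head.parameters()), lr=1e-3)

def policy_logits(x, upto):
    # Run only the first `upto` layers, then project through the shared head.
    for layer in layers[:upto]:
        x = torch.relu(layer(x))
    return head(x)

for step in range(total_steps):
    x = torch.randn(4, 8)
    target = torch.randint(0, 16, (4,))
    if step < iterative_steps:
        logits = policy_logits(x, upto=k)             # phase 1: layer-k internal policy
    else:
        logits = policy_logits(x, upto=len(layers))   # phase 2: full-model policy
    loss = nn.functional.cross_entropy(logits, target)  # stand-in for the RL objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```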
GRPO:
bash run_code/GRPO_qwen3.sh
- First, run `run_eval.sh` in the `scripts` folder to obtain the dataset used for visualization.
- Then, run `plot_internal_entropy.py` in the `visualization` folder to obtain the plot of the internal policy entropy flow.
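The provided script handles this end to end; the snippet below is only a placeholder illustration of the resulting plot, with dummy values you would replace by the per-layer entropies produced by `run_eval.sh`:

```python
# Placeholder plotting sketch; `plot_internal_entropy.py` in the repo is the
# reference implementation. The values below are dummies, not real results.
import matplotlib.pyplot as plt

# Replace with the mean internal-policy entropy per layer from run_eval.sh.
entropy_per_layer = [5.8, 5.6, 5.1, 4.9, 4.2, 3.5, 2.8, 1.9, 1.1, 0.6]

plt.plot(range(len(entropy_per_layer)), entropy_per_layer, marker="o")
plt.xlabel("Layer index")
plt.ylabel("Internal policy entropy")
plt.title("Internal policy entropy flow")
plt.tight_layout()
plt.savefig("internal_entropy_flow.png", dpi=200)
```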
Our main design lies in:
- `verl/models/custom_model`: we modify the source files of the model forward pass in `transformers` to efficiently obtain internal hidden states and internal policies.
- `verl/workers/actor/dp_actor.py` / `_forward_micro_batch_layer_k()`: this is where we switch to computing the importance ratio of the internal layer policy and updating it.
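For readers skimming the code, the sketch below shows the idea behind the layer-`k` importance ratio: take log-probabilities of the chosen tokens under the new and old internal layer-`k` policies and exponentiate their difference, PPO-style. The function name and tensor shapes are illustrative assumptions, not verl's actual interface:

```python
# Illustrative sketch (not verl's dp_actor code) of the layer-k importance
# ratio used to update the internal layer policy.
import torch

def internal_layer_ratio(logits_k_new, logits_k_old, actions):
    """PPO-style ratio pi_k_new(a|s) / pi_k_old(a|s) computed from layer-k logits."""
    logp_new = torch.log_softmax(logits_k_new, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    logp_old = torch.log_softmax(logits_k_old, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return torch.exp(logp_new - logp_old)

# Example: batch of 2 sequences, 3 timesteps, vocabulary of 10 tokens.
new_logits = torch.randn(2, 3, 10)
old_logits = torch.randn(2, 3, 10)
actions = torch.randint(0, 10, (2, 3))
print(internal_layer_ratio(new_logits, old_logits, actions).shape)  # torch.Size([2, 3])
```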
We thank the verl team for their valuable contributions to the open-source community.
For questions, discussion, or collaboration opportunities, feel free to contact us!
- Yuqiao Tan: tanyuqiao2025@ia.ac.cn
- Minzheng Wang: wangminzheng2023@ia.ac.cn
If you find our work helpful, please cite as:
@article{tan2025bupo,
title={Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies},
author={Yuqiao Tan and Minzheng Wang and Shizhu He and Huanxuan Liao and Chengfeng Zhao and Qiunan Lu and Tian Liang and Jun Zhao and Kang Liu},
year={2025},
journal={arXiv preprint arXiv:2512.19673},
url={https://arxiv.org/abs/2512.19673}
}

