Rulin Shao*, Shuyue Stella Li*, Rui Xin*, Scott Geng*, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer
```bash
# Our codebase is based on TTRL (https://github.com/PRIME-RL/TTRL).
git clone [email protected]:ruixin31/Rethink_RLVR
cd code

conda create -n rethink-rlvr python=3.10
conda activate rethink-rlvr

pip install -r requirements.txt
pip install flash_attn==2.7.0.post2
pip install -e .
```

To start training, run:

```bash
bash scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh
```

We include the filtered and majority-labeled data used in the paper; you can find the complete list in the `code/data` directory. For example, the ground-truth data is named `DeepScaleR`, and the Llama 3.2 3B Instruct-labeled data, filtered to keep only the incorrect labels, is in the `DeepScaleR_mv_labeled_llama3.2_3b_instruct_incorrect` folder. You may change the data source by setting the `TASK` variable in `code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh`, as sketched below.
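A minimal sketch of that edit, assuming the script assigns `TASK` directly (the exact line and its default in the script may differ):

```bash
# In code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh
# Default is the ground-truth DeepScaleR data; switch to the
# incorrect-label Llama 3.2 3B Instruct subset:
TASK=DeepScaleR_mv_labeled_llama3.2_3b_instruct_incorrect
```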
We include the list of rewards used in the paper below. You may change the reward function by setting the `REWARD` variable in `code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh`, as shown in the sketch after this list. Note that for models without a chat template, be sure to add `_r1_only` as a suffix.

- `math`: mathematical equivalence reward (the default)
- `box_only_format`: box-only formatting reward
- `contain_python_wo_backticks`: reward for mentioning Python
- `random0.5`: random reward that returns 1 with 50% probability
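For example, switching to the 50% random reward would look something like the following (again a sketch; the exact line in the script may differ):

```bash
# In code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh
REWARD=random0.5  # for models without a chat template: REWARD=random0.5_r1_only
```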
arXiv version coming soon! In the meantime, here is the link to our paper: https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f
```bibtex
@misc{shao2025spurious,
  title={Spurious Rewards: Rethinking Training Signals in RLVR},
  author={Rulin Shao and Shuyue Stella Li and Rui Xin and Scott Geng and Yiping Wang and Sewoong Oh and Simon Shaolei Du and Nathan Lambert and Sewon Min and Ranjay Krishna and Yulia Tsvetkov and Hannaneh Hajishirzi and Pang Wei Koh and Luke Zettlemoyer},
  year={2025},
  howpublished={\url{https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f}},
  note={Notion Blog}
}
```

This repository builds on TTRL, which is in turn built on top of OpenRLHF. We added asynchronous evaluation, among other custom features, to the codebase.
