Rulin Shao*, Shuyue Stella Li*, Rui Xin*, Scott Geng*, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer
```bash
# Our codebase is based on TTRL (https://github.com/PRIME-RL/TTRL).
git clone [email protected]:ruixin31/Rethink_RLVR
cd code

conda create -n rethink-rlvr python=3.10
conda activate rethink-rlvr

pip install -r requirements.txt
pip install flash_attn==2.7.0.post2
pip install -e .
```

To start training, run:

```bash
bash scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh
```

We include the filtered and majority-labeled data used in the paper; you can find the complete list in the `code/data` directory. For example, the ground-truth data is named `DeepScaleR`, and the Llama 3.2 3B Instruct-labeled data, filtered to keep only the incorrect labels, is in the `DeepScaleR_mv_labeled_llama3.2_3b_instruct_incorrect` folder. You may change the data source by setting the `TASK` variable in `code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh`, as sketched below.
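A minimal sketch of that edit, assuming the script assigns `TASK` directly (the exact line and its default in the script may differ):

```bash
# In code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh
# Default is the ground-truth DeepScaleR data; switch to the
# incorrect-label Llama 3.2 3B Instruct subset:
TASK=DeepScaleR_mv_labeled_llama3.2_3b_instruct_incorrect
```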
We include the list of rewards used in the paper below. You may change the reward function by setting the `REWARD` variable in `code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh`, as shown in the sketch after this list. Note that for models without a chat template, be sure to add `_r1_only` as a suffix.

- `math`: mathematical equivalence reward (the default)
- `box_only_format`: box-only formatting reward
- `contain_python_wo_backticks`: reward for mentioning Python
- `random0.5`: random reward that returns 1 with 50% probability
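For example, switching to the 50% random reward would look something like the following (again a sketch; the exact line in the script may differ):

```bash
# In code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh
REWARD=random0.5  # for models without a chat template: REWARD=random0.5_r1_only
```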
arXiv version coming soon! In the meantime, here is the link to our paper: https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f
```bibtex
@misc{shao2025spurious,
  title={Spurious Rewards: Rethinking Training Signals in RLVR},
  author={Rulin Shao and Shuyue Stella Li and Rui Xin and Scott Geng and Yiping Wang and Sewoong Oh and Simon Shaolei Du and Nathan Lambert and Sewon Min and Ranjay Krishna and Yulia Tsvetkov and Hannaneh Hajishirzi and Pang Wei Koh and Luke Zettlemoyer},
  year={2025},
  howpublished={\url{https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f}},
  note={Notion Blog}
}
```

This repository builds on TTRL, which is in turn built on top of OpenRLHF. We added asynchronous evaluation, among other custom features, to the codebase.
