ruixin31/Spurious_Rewards


Github Website Paper Twitter Wandb Models

Setup

# Our codebase is based on TTRL (https://github.com/PRIME-RL/TTRL).
git clone git@github.com:ruixin31/Spurious_Rewards
cd Spurious_Rewards/code

conda create -n spurious-rewards python=3.10 
conda activate spurious-rewards

pip install -r requirements.txt
pip install flash_attn==2.7.0.post2
pip install -e .
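
To confirm the environment built correctly, you could run a quick import check (a minimal sketch, not part of the original setup steps; assumes torch and flash_attn expose __version__):

# Optional sanity check (sketch): verify PyTorch and FlashAttention import and that CUDA is visible
python -c "import torch, flash_attn; print(torch.__version__, flash_attn.__version__, torch.cuda.is_available())"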

Training

bash scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh

Configurations

Data

We include the filtered and majority-vote-labeled datasets used in the paper; a complete list is in the code/data directory. For example, the ground-truth data is named DeepScaleR, and the Llama-3.2-3B-Instruct-labeled data, filtered to keep only incorrect labels, is in the DeepScaleR_mv_labeled_llama3.2_3b_instruct_incorrect folder. To change the data source, set the TASK variable in code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh.
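For instance, to train on the incorrect-label Llama split instead of the ground truth, the change inside the script might look like this (a sketch; the exact variable layout in the script may differ):

# In code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh (sketch; exact layout may differ)
# TASK=DeepScaleR                                            # ground-truth labels (default)
TASK=DeepScaleR_mv_labeled_llama3.2_3b_instruct_incorrect    # Llama-3.2-3B-Instruct labels, incorrect only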

Rewards

The reward functions used in the paper are listed below. For models without a chat template, append the _r1_only suffix to the reward name. To change the reward function, set the REWARD variable in code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh (see the sketch after the list).

  • math: Mathematical-equivalence reward (the default)
  • box_only_format: Box-only formatting reward
  • contain_python_wo_backticks: Reward for mentioning Python
  • random0.5: Random reward that returns 1 with probability 0.5
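
For example (a minimal sketch; assumes REWARD is a plain shell variable in the script and that the suffix is appended directly to the reward name):

# In code/scripts/rlvr_deepscaler_grpo_qwen_ground_truth.sh (sketch)
REWARD=random0.5                   # spurious random reward, returns 1 with probability 0.5
# REWARD=box_only_format_r1_only   # same family of rewards with the _r1_only suffix for models without a chat template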

Evaluations

To reproduce our evaluation results, use the following commands:

cd code

# Evaluation of the base model Qwen/Qwen2.5-Math-7B on MATH-500, AIME 2024/2025, and AMC (requires an NVIDIA A100 80GB PCIe for exact reproduction)
python scripts/eval_checkpoint.py --model_path Qwen/Qwen2.5-Math-7B --datasets MATH-500,AIME-2024,AIME-2025,AMC

# Evaluation of trained checkpoints, matching the scores we report in wandb (requires an NVIDIA H200 for exact reproduction)
python scripts/eval_checkpoint.py --model_path {} --datasets MATH-500,AIME-2024,AIME-2025,AMC --shards 2

Note: To exactly reproduce our temperature = 0 results, both the GPU type and the --shards parameter must match the original evaluation setup, because the batch size passed to vLLM can cause fluctuations in generation.
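
For example, evaluating one of your own trained checkpoints might look like this (a sketch; the checkpoint path is hypothetical and stands in for your training run's output directory):

# Hypothetical checkpoint path; substitute the directory your training run produced
python scripts/eval_checkpoint.py --model_path checkpoints/qwen2.5-math-7b-grpo --datasets MATH-500,AIME-2024,AIME-2025,AMC --shards 2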

Paper

Our paper is available at https://arxiv.org/abs/2506.10947.

Citation

@misc{shao2025spuriousrewardsrethinkingtraining,
      title={Spurious Rewards: Rethinking Training Signals in RLVR}, 
      author={Rulin Shao and Shuyue Stella Li and Rui Xin and Scott Geng and Yiping Wang and Sewoong Oh and Simon Shaolei Du and Nathan Lambert and Sewon Min and Ranjay Krishna and Yulia Tsvetkov and Hannaneh Hajishirzi and Pang Wei Koh and Luke Zettlemoyer},
      year={2025},
      eprint={2506.10947},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.10947}, 
}

Acknowledgments

This repository is built on TTRL, which in turn builds on OpenRLHF. We added asynchronous evaluation, among other custom features, to the codebase.
