feat: configurable truncated_reward (zero|keep|soft) to align AR length handling with verl #19
Open
leviking98z-rgb wants to merge 2 commits into
…ength handling

RewardService previously hard-zeroed the reward of any AR generation that hit max_new_tokens (anti-ramble). This diverges from verl, whose dapo reward manager either keeps the math_verify score on truncated traces (overlong_buffer.enable=False) or applies a graded overlong penalty (enable=True), but never a hard zero. As responses grow and truncation approaches 100%, the hard-zero collapses unirl reward_mean while verl stays flat. Add a truncated_reward mode on RewardService: zero (default, unchanged) | keep (= verl overlong-disabled) | soft (= verl DAPO graded overlong penalty over overlong_buffer_len tokens). Recipes opt in via reward.truncated_reward; the default preserves prior behavior.
…l parity

The qwen3 DAPO-math GRPO/DRPO recipes reproduce verl baselines whose dapo reward manager runs with overlong_buffer disabled (it scores the partial text of truncated traces, with no zeroing). Set reward.truncated_reward=keep so the reproduction matches verl; otherwise the default hard-zero collapses reward_mean as truncation rises to ~100% late in training, while verl stays flat.
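The three modes described in the commit messages above can be sketched roughly as follows. This is a minimal illustration, not the code in `unirl/reward/service.py`: the function name `adjust_reward`, the explicit `truncated` flag, the `penalty_factor` parameter, and the default values (4096 and 512 tokens) are all hypothetical. The `soft` branch follows the general shape of verl's DAPO overlong penalty (a linear ramp over the buffer, capped at zero), and it is an assumption here that the penalty depends only on response length.

```python
from statistics import mean


def adjust_reward(raw_score, response_len, truncated, *, mode="zero",
                  max_new_tokens=4096, overlong_buffer_len=512,
                  penalty_factor=1.0):
    """Hypothetical sketch of the zero | keep | soft handling of a scored trace."""
    if mode == "zero":
        # Anti-ramble guard: a truncated trace gets no reward at all.
        return 0.0 if truncated else raw_score
    if mode == "keep":
        # Score the partial text as-is, truncated or not.
        return raw_score
    if mode == "soft":
        # Linear penalty over the last `overlong_buffer_len` tokens before
        # `max_new_tokens`; responses shorter than the buffer start get none.
        expected_len = max_new_tokens - overlong_buffer_len
        exceed_len = response_len - expected_len
        penalty = min(-exceed_len / overlong_buffer_len * penalty_factor, 0.0)
        return raw_score + penalty
    raise ValueError(f"unknown truncated_reward mode: {mode!r}")


def batch_mean(mode, n_truncated, scores=(1.0, 0.0) * 5, max_new_tokens=4096):
    """Mean reward of a synthetic batch whose first `n_truncated` traces hit the cap."""
    rewards = []
    for i, score in enumerate(scores):
        truncated = i < n_truncated
        # Synthetic lengths: truncated traces sit at the cap, others are short.
        length = max_new_tokens if truncated else 1000
        rewards.append(adjust_reward(score, length, truncated, mode=mode,
                                     max_new_tokens=max_new_tokens))
    return mean(rewards)


if __name__ == "__main__":
    # Synthetic batch with raw accuracy 0.5; only the truncation rate changes.
    for n in (0, 4, 10):
        print(n, batch_mean("zero", n), batch_mean("keep", n))
```

With this synthetic batch, the `keep` mean does not depend on how many traces are truncated, while the `zero` mean falls toward 0 as the truncated share approaches 100%. The numbers are illustrative only and are not taken from the runs reported in this PR.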
Summary
`RewardService` hard-zeros the reward of any AR generation that hits `max_new_tokens` (sglang `finish=="length"`), as an anti-ramble guard. This diverges from verl: its `dapo` reward manager with `overlong_buffer.enable=False` keeps the math_verify score on the partial text (no zeroing), and with `enable=True` applies a graded overlong penalty, never a hard zero. As responses lengthen during training and truncation approaches ~100%, the hard-zero collapses `reward_mean`, while verl stays flat. Observed on Qwen3-4B DAPO-math DRPO: unirl `reward_mean` 0.53→0.36 (as truncation rises) vs verl flat at ~0.5 even at 100% truncation.

This PR adds a `truncated_reward` mode to `RewardService`:

- `zero` (default, unchanged): force reward 0 on truncated traces (anti-ramble).
- `keep`: keep the raw score on the partial text (= verl `dapo`, overlong disabled).
- `soft`: verl DAPO graded overlong penalty over the last `overlong_buffer_len` tokens before `max_new_tokens` (mirrors `verl.workers.reward_manager.dapo`).

It opts the qwen3 GRPO/DRPO DAPO-math recipes into `truncated_reward: keep` for verl parity. Default behavior is unchanged for every other recipe.

Test Plan

- `ruff check` and `ruff format --check` pass on `unirl/reward/service.py`.
- `scripts/check_recipe_targets.py` passes (967 target paths resolve).
- `truncated_reward` is a valid `RewardService.__init__` kwarg.
- … `truncated_reward: keep` to confirm the late-stage `reward_mean` no longer declines (tracks verl).

EXP
BEFORE.

AFTER.
