feat: configurable truncated_reward (zero|keep|soft) to align AR length handling with verl by leviking98z-rgb · Pull Request #19 · Tencent-Hunyuan/UniRL

leviking98z-rgb · 2026-06-10T04:19:41Z

Summary

RewardService hard-zeros the reward of any AR generation that hits max_new_tokens (sglang finish=="length"), as an anti-ramble guard. This diverges from verl: its dapo reward manager with overlong_buffer.enable=False keeps the math_verify score on the partial text (no zeroing), and with enable=True applies a graded overlong penalty — never a hard zero. As responses lengthen during training and truncation approaches ~100%, the hard-zero collapses reward_mean, while verl stays flat. Observed on Qwen3-4B DAPO-math DRPO: unirl reward_mean 0.53→0.36 (as truncation rises) vs verl flat ~0.5 even at 100% truncation.
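
To make the collapse concrete, here is a minimal sketch using synthetic scores (not data from the runs above). The helper name `reward_mean` and its arguments are hypothetical and exist only for this illustration. It compares the batch mean when truncated samples are hard-zeroed against the mean when their raw score is kept, as the truncated fraction grows.

```python
from typing import Sequence


def reward_mean(
    scores: Sequence[float],
    truncated: Sequence[bool],
    zero_truncated: bool,
) -> float:
    """Mean reward over a batch; optionally force truncated samples to 0.

    Hypothetical helper for illustration, not part of RewardService.
    """
    rewards = [
        0.0 if (zero_truncated and is_trunc) else score
        for score, is_trunc in zip(scores, truncated)
    ]
    return sum(rewards) / len(rewards)


# Synthetic batch: half the partial texts still verify as correct.
scores = [1.0, 0.0, 1.0, 0.0]

for n_trunc in range(len(scores) + 1):
    # Mark the first n_trunc samples as having hit max_new_tokens.
    truncated = [i < n_trunc for i in range(len(scores))]
    hard_zero = reward_mean(scores, truncated, zero_truncated=True)
    kept = reward_mean(scores, truncated, zero_truncated=False)
    print(f"truncated={n_trunc}/{len(scores)} hard_zero={hard_zero:.2f} keep={kept:.2f}")
```

With the raw score kept, the mean does not depend on the truncation flags at all; with hard-zeroing, it can only fall as more samples are truncated.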

This PR adds a truncated_reward mode to RewardService:

zero (default, unchanged) — force reward 0 on truncated traces (anti-ramble).
keep — keep the raw score on the partial text (= verl dapo, overlong disabled).
soft — verl DAPO graded overlong penalty over the last overlong_buffer_len tokens before max_new_tokens (mirrors verl.workers.reward_manager.dapo).
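
As a rough sketch of how the three modes differ per sample (not the actual code in unirl/reward/service.py): the function name `apply_truncated_reward`, its signature, and the `penalty_factor` parameter are assumptions made for illustration. The soft branch assumes the graded penalty is linear in the number of tokens past `max_new_tokens - overlong_buffer_len`, which is how verl's DAPO overlong penalty is understood to work.

```python
def apply_truncated_reward(
    raw_score: float,
    response_len: int,
    max_new_tokens: int,
    truncated: bool,
    mode: str = "zero",
    overlong_buffer_len: int = 0,
    penalty_factor: float = 1.0,  # assumed knob, not named in this PR
) -> float:
    """Hypothetical per-sample reward adjustment for the three modes."""
    if mode == "zero":
        # Anti-ramble guard: truncated traces get no reward.
        return 0.0 if truncated else raw_score
    if mode == "keep":
        # Score the partial text as-is, truncated or not.
        return raw_score
    if mode == "soft":
        if overlong_buffer_len <= 0:
            raise ValueError("soft mode needs overlong_buffer_len > 0")
        # Tokens beyond the point where the penalty starts to apply.
        expected_len = max_new_tokens - overlong_buffer_len
        exceed_len = response_len - expected_len
        # Linear penalty in [-penalty_factor, 0]; zero when under expected_len.
        penalty = min(-exceed_len / overlong_buffer_len * penalty_factor, 0.0)
        return raw_score + penalty
    raise ValueError(f"unknown truncated_reward mode: {mode!r}")
```

For example, with `max_new_tokens=4096` and `overlong_buffer_len=1024`, a correct response of 3584 tokens would score 0.5 under soft in this sketch, and one that reaches 4096 tokens would score 0.0.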

It opts the qwen3 GRPO/DRPO DAPO-math recipes into truncated_reward: keep for verl parity. Default behavior is unchanged for every other recipe.

Test Plan

ruff check and ruff format --check pass on unirl/reward/service.py.
scripts/check_recipe_targets.py passes (967 target paths resolve).
Both edited recipes parse as YAML and compose; truncated_reward is a valid RewardService.__init__ kwarg.
Behavioral validation in progress: a Qwen3-4B DAPO-math DRPO run with truncated_reward: keep to confirm the late-stage reward_mean no longer declines (tracks verl).

EXP

BEFORE.

AFTER.

…ength handling RewardService previously hard-zeroed the reward of any AR generation that hit max_new_tokens (anti-ramble). This diverges from verl, whose dapo reward manager keeps the math_verify score on truncated traces (overlong_buffer.enable=False) or applies a graded overlong penalty (enable=True), never a hard zero. As responses grow and truncation approaches 100%, the hard-zero collapses unirl reward_mean while verl stays flat. Add a truncated_reward mode on RewardService: zero (default, unchanged) | keep (= verl overlong-disabled) | soft (= verl DAPO graded overlong penalty over overlong_buffer_len tokens). Recipes opt in via reward.truncated_reward; default preserves prior behavior.

…l parity The qwen3 DAPO-math GRPO/DRPO recipes reproduce verl baselines whose dapo reward manager runs with overlong_buffer disabled (scores the partial text of truncated traces, no zeroing). Set reward.truncated_reward=keep so the reproduction matches verl: otherwise the default hard-zero collapses reward_mean as truncation rises to ~100% late in training, while verl stays flat.

leviking98z-rgb added 2 commits June 10, 2026 12:09

leviking98z-rgb requested a review from haonan3 June 10, 2026 04:24

leviking98z-rgb wants to merge 2 commits into main from feat/truncated-reward-modes
