feat: configurable truncated_reward (zero|keep|soft) to align AR length handling with verl#19

Open
leviking98z-rgb wants to merge 2 commits into main from feat/truncated-reward-modes

@leviking98z-rgb (Collaborator) commented Jun 10, 2026

Summary

RewardService hard-zeros the reward of any AR generation that hits max_new_tokens (sglang finish=="length") as an anti-ramble guard. This diverges from verl, whose dapo reward manager never applies a hard zero: with overlong_buffer.enable=False it keeps the math_verify score on the partial text, and with enable=True it applies a graded overlong penalty. As responses lengthen during training and the truncation rate approaches ~100%, the hard zero collapses reward_mean, while verl stays flat. Observed on Qwen3-4B DAPO-math DRPO: unirl reward_mean falls from 0.53 to 0.36 as truncation rises, whereas verl stays flat at ~0.5 even at 100% truncation.
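The collapse mechanism can be illustrated with a toy batch. This is a minimal sketch with made-up scores, not data from the runs above, and `batch_reward_mean` is a hypothetical helper rather than anything in unirl or verl. It only shows how forcing truncated traces to 0 pulls the batch mean down as the truncated fraction grows, while keeping the partial-text score leaves the mean unchanged.

```python
def batch_reward_mean(scores, truncated, zero_truncated):
    """Mean reward over a batch, optionally forcing truncated traces to 0."""
    rewards = [
        0.0 if (zero_truncated and was_truncated) else score
        for score, was_truncated in zip(scores, truncated)
    ]
    return sum(rewards) / len(rewards)


# Made-up verifier scores for four traces (illustrative only).
scores = [1.0, 0.0, 1.0, 0.0]

# Same scores, first with one truncated trace, then with all traces truncated.
for truncated in ([True, False, False, False], [True, True, True, True]):
    rate = sum(truncated) / len(truncated)
    hard_zero = batch_reward_mean(scores, truncated, zero_truncated=True)
    kept = batch_reward_mean(scores, truncated, zero_truncated=False)
    print(f"truncation={rate:.0%} hard_zero={hard_zero:.2f} kept={kept:.2f}")
```

With these toy inputs the kept mean is 0.5 at both truncation rates, while the hard-zero mean is 0.25 at 25% truncation and 0 at 100%.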

This PR adds a truncated_reward mode to RewardService:

  • zero (default, unchanged) — force reward 0 on truncated traces (anti-ramble).
  • keep — keep the raw score on the partial text (= verl dapo, overlong disabled).
  • soft — verl DAPO graded overlong penalty over the last overlong_buffer_len tokens before max_new_tokens (mirrors verl.workers.reward_manager.dapo).

It also sets truncated_reward: keep in the qwen3 GRPO/DRPO DAPO-math recipes for verl parity. Default behavior is unchanged for every other recipe.
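The three modes can be sketched as a single dispatch function. This is a rough illustration under stated assumptions, not the code in unirl/reward/service.py: the function names, the argument names, and `penalty_factor` are hypothetical, and the soft branch follows the linear overlong penalty as I understand verl's DAPO reward manager to compute it (zero until the response enters the last overlong_buffer_len tokens, then decreasing linearly to -penalty_factor at max_new_tokens). Whether the PR applies the soft penalty only to truncated traces or to every response in the buffer zone is not stated above; the sketch assumes the latter.

```python
def overlong_penalty(response_len, max_new_tokens, overlong_buffer_len, penalty_factor=1.0):
    """Linear penalty over the last `overlong_buffer_len` tokens (assumed verl-style)."""
    expected_len = max_new_tokens - overlong_buffer_len
    exceed_len = response_len - expected_len
    # 0 while below the buffer zone, down to -penalty_factor at max_new_tokens.
    return min(-exceed_len / overlong_buffer_len * penalty_factor, 0.0)


def apply_truncated_reward(score, *, mode, truncated, response_len, max_new_tokens,
                           overlong_buffer_len=0, penalty_factor=1.0):
    """Hypothetical helper showing the zero | keep | soft semantics."""
    if mode == "zero":
        # Anti-ramble guard: a truncated trace always gets reward 0.
        return 0.0 if truncated else score
    if mode == "keep":
        # Score the partial text as-is, truncated or not.
        return score
    if mode == "soft":
        if overlong_buffer_len <= 0:
            raise ValueError("soft mode needs overlong_buffer_len > 0")
        return score + overlong_penalty(
            response_len, max_new_tokens, overlong_buffer_len, penalty_factor
        )
    raise ValueError(f"unknown truncated_reward mode: {mode!r}")


# A correct answer (score 1.0) halfway into a 1024-token buffer before a 4096-token cap.
reward = apply_truncated_reward(
    1.0, mode="soft", truncated=False, response_len=3584,
    max_new_tokens=4096, overlong_buffer_len=1024,
)
print(reward)
```

Under these assumptions, a fully truncated trace in soft mode receives the full -penalty_factor on top of its raw score, so it is penalized but still distinguishable from a wrong answer when penalty_factor is small.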

Test Plan

  • ruff check and ruff format --check pass on unirl/reward/service.py.
  • scripts/check_recipe_targets.py passes (967 target paths resolve).
  • Both edited recipes parse as YAML and compose; truncated_reward is a valid RewardService.__init__ kwarg.
  • Behavioral validation in progress: a Qwen3-4B DAPO-math DRPO run with truncated_reward: keep to confirm the late-stage reward_mean no longer declines (tracks verl).

Experiments

BEFORE.
Screenshot 2026-06-10 12 27 47

AFTER.
Screenshot 2026-06-10 12 25 30

…ength handling

RewardService previously hard-zeroed the reward of any AR generation that hit
max_new_tokens (anti-ramble). This diverges from verl, whose dapo reward manager
keeps the math_verify score on truncated traces (overlong_buffer.enable=False) or
applies a graded overlong penalty (enable=True) — never a hard zero. As responses
grow and truncation approaches 100%, the hard-zero collapses unirl reward_mean
while verl stays flat.

Add a truncated_reward mode on RewardService:
  zero (default, unchanged) | keep (= verl overlong-disabled) |
  soft (= verl DAPO graded overlong penalty over overlong_buffer_len tokens).

Recipes opt in via reward.truncated_reward; default preserves prior behavior.
…l parity

The qwen3 DAPO-math GRPO/DRPO recipes reproduce verl baselines whose dapo reward
manager runs with overlong_buffer disabled (scores the partial text of truncated
traces, no zeroing). Set reward.truncated_reward=keep so the reproduction matches
verl: otherwise the default hard-zero collapses reward_mean as truncation rises to
~100% late in training, while verl stays flat.
@leviking98z-rgb leviking98z-rgb requested a review from haonan3 June 10, 2026 04:24