perf(qwen3): packed varlen replay — zero padding (verl remove_padding parity) by leviking98z-rgb · Pull Request #43 · Tencent-Hunyuan/UniRL

leviking98z-rgb · 2026-06-12T04:48:36Z

Part of the verl performance-parity series tracked in #40.

Summary

Dense replay forwards [B, P_max + T_max] with the prompt and response blocks padded separately; even sorted packing fills it at most ~65%. For B>1 the replay now concatenates real prompt + flat response tokens into ONE packed row: transformers >= 4.57 natively detects restarting position_ids (with attention_mask=None) and builds the block-causal mask, so per-sequence RoPE/logp semantics are unchanged. The chunked fp32 logp runs over hidden states gathered at the flat predict positions. Also trims the dense path prompt block to the micro true max width (the track-level concat pads to the worker-shard max: train phase 54.3s -> 46.8s measured for the trim alone).

TrainStack.micro_cost_model=sum switches packing to real-token accounting (= verl). Pays off together with the flex_attention follow-up PR — SDPA falls back to the math kernel on packed block-causal masks.

Equivalence

packed vs per-sample dense replay, fp32 weights/body: mean |dlogp| = 1e-5, max = 3e-5 (bf16 run noise is the same class as the documented dense micro batch-shape sensitivity; rollout-anchored ratio absorbs it — ratio_mean 1.0000 in-run).

Test Plan

fp32 equivalence gate (above) on Qwen3-4B-Base
16-GPU run pack16v5_unirl_drpo_gz6: ratio 0.9998-1.0001
dense path regression: B==1 takes the original code path

… parity) Dense replay forwards [B, P_max + T_max] with prompt and response blocks padded separately; packing fills it at most ~65%. B>1 replay now concatenates real prompt + flat response tokens into ONE packed row: transformers >= 4.57 natively detects the restarting position_ids (attention_mask=None) and builds the block-causal mask, so per-sequence RoPE/logp semantics are unchanged. The chunked fp32 logp runs over hidden states gathered at flat predict positions. Also trims the dense path's prompt block to the micro's true max width. EQUIVALENCE: packed vs per-sample dense replay, fp32: mean|dlogp|=1e-5, max=3e-5. TrainStack micro_cost_model=sum switches packing to real-token accounting. (Needs the flex_attention follow-up to pay off: SDPA falls back to the math kernel on packed block-causal masks.)

leviking98z-rgb · 2026-06-12T09:09:37Z

Review — request changes. The packed-stream mechanics check out: predict_index offset+n_p-1+j is exactly right (never crosses a packing boundary), n_r==0 sequences stay flat-order-aligned, grad flows correctly through index_select + the non-reentrant-checkpointed flat lm_head chunks, the dense-path prompt trim preserves the left-pad invariants and cumsum position_ids, the prepare_segment anchor reassembly matches packed output order, and the sum packer is exact verl-style token accounting.

Two blockers before merge:

Correctness depends on transformers >= 4.53 packed-position_ids detection, but the repo pins >=4.41 with no runtime guard — on older versions the forward silently attends across sequence boundaries (wrong logps, no error). Add a feature-detect/version assert at the packed-path entry (or build the block mask explicitly).
As merged alone (no attn_implementation knob on this branch), SDPA handles the packed mask on a slow path and attention cost grows to (sum L_i)^2 — likely a net slowdown on the very recipe this PR modifies, and it silently switches ALL B>1 qwen3 replays. Gate packed replay behind a default-off flag or land together with the flex follow-up (perf(qwen3): flex_attention for packed replay + length bucketing #44).

Smaller: guard n_p==0 (predict index -1); sum cost model can exceed the budget when prompt lengths are unavailable; add a real kill switch + a tiny CPU packed-vs-dense unit test.

…n-replay

Addresses #43 review: the packed path now requires transformers.masking_utils.find_packed_sequence_indices (>=4.53) and falls back to the dense path otherwise (older versions would silently attend across sequence boundaries); UNIRL_DISABLE_PACKED_REPLAY=1 disables it explicitly for A/B and rollback.

leviking98z-rgb · 2026-06-12T09:12:57Z

Review blockers addressed in 2f9fc23: the packed path now feature-detects transformers masking_utils.find_packed_sequence_indices and falls back to the dense path when absent (no silent cross-sequence attention on older pins), and UNIRL_DISABLE_PACKED_REPLAY=1 is an explicit kill switch — which also resolves the merge-alone concern (without flex the operator can disable packing; landing together with #44 remains the recommendation).

This was referenced Jun 12, 2026

VLM/LLM Infra Updating #40

Open

perf(recipe): enable token-budget packing + DP balancing for Qwen-VL recipes #48

Closed

leviking98z-rgb added 2 commits June 12, 2026 17:12

Merge branch 'perf/01-token-budget-packing' into perf/02-packed-varle…

e5c9897

…n-replay

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(qwen3): packed varlen replay — zero padding (verl remove_padding parity)#43

perf(qwen3): packed varlen replay — zero padding (verl remove_padding parity)#43
leviking98z-rgb wants to merge 3 commits into
perf/01-token-budget-packingfrom
perf/02-packed-varlen-replay

leviking98z-rgb commented Jun 12, 2026 •

edited

Loading

Uh oh!

leviking98z-rgb commented Jun 12, 2026

Uh oh!

leviking98z-rgb commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

leviking98z-rgb commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Equivalence

Test Plan

Uh oh!

leviking98z-rgb commented Jun 12, 2026

Uh oh!

leviking98z-rgb commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

leviking98z-rgb commented Jun 12, 2026 •

edited

Loading