Skip to content

perf(qwen3): packed varlen replay — zero padding (verl remove_padding parity)#43

Draft
leviking98z-rgb wants to merge 3 commits into
perf/01-token-budget-packingfrom
perf/02-packed-varlen-replay
Draft

perf(qwen3): packed varlen replay — zero padding (verl remove_padding parity)#43
leviking98z-rgb wants to merge 3 commits into
perf/01-token-budget-packingfrom
perf/02-packed-varlen-replay

Conversation

@leviking98z-rgb

@leviking98z-rgb leviking98z-rgb commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Part of the verl performance-parity series tracked in #40.

Summary

Dense replay forwards [B, P_max + T_max] with the prompt and response blocks padded separately; even sorted packing fills it at most ~65%. For B>1 the replay now concatenates real prompt + flat response tokens into ONE packed row: transformers >= 4.57 natively detects restarting position_ids (with attention_mask=None) and builds the block-causal mask, so per-sequence RoPE/logp semantics are unchanged. The chunked fp32 logp runs over hidden states gathered at the flat predict positions. Also trims the dense path prompt block to the micro true max width (the track-level concat pads to the worker-shard max: train phase 54.3s -> 46.8s measured for the trim alone).

TrainStack.micro_cost_model=sum switches packing to real-token accounting (= verl). Pays off together with the flex_attention follow-up PR — SDPA falls back to the math kernel on packed block-causal masks.

Equivalence

packed vs per-sample dense replay, fp32 weights/body: mean |dlogp| = 1e-5, max = 3e-5 (bf16 run noise is the same class as the documented dense micro batch-shape sensitivity; rollout-anchored ratio absorbs it — ratio_mean 1.0000 in-run).

Test Plan

  • fp32 equivalence gate (above) on Qwen3-4B-Base
  • 16-GPU run pack16v5_unirl_drpo_gz6: ratio 0.9998-1.0001
  • dense path regression: B==1 takes the original code path

… parity)

Dense replay forwards [B, P_max + T_max] with prompt and response blocks
padded separately; packing fills it at most ~65%. B>1 replay now concatenates
real prompt + flat response tokens into ONE packed row: transformers >= 4.57
natively detects the restarting position_ids (attention_mask=None) and builds
the block-causal mask, so per-sequence RoPE/logp semantics are unchanged. The
chunked fp32 logp runs over hidden states gathered at flat predict positions.
Also trims the dense path's prompt block to the micro's true max width.

EQUIVALENCE: packed vs per-sample dense replay, fp32: mean|dlogp|=1e-5,
max=3e-5. TrainStack micro_cost_model=sum switches packing to real-token
accounting. (Needs the flex_attention follow-up to pay off: SDPA falls back
to the math kernel on packed block-causal masks.)
@leviking98z-rgb

Copy link
Copy Markdown
Collaborator Author

Review — request changes. The packed-stream mechanics check out: predict_index offset+n_p-1+j is exactly right (never crosses a packing boundary), n_r==0 sequences stay flat-order-aligned, grad flows correctly through index_select + the non-reentrant-checkpointed flat lm_head chunks, the dense-path prompt trim preserves the left-pad invariants and cumsum position_ids, the prepare_segment anchor reassembly matches packed output order, and the sum packer is exact verl-style token accounting.

Two blockers before merge:

  1. Correctness depends on transformers >= 4.53 packed-position_ids detection, but the repo pins >=4.41 with no runtime guard — on older versions the forward silently attends across sequence boundaries (wrong logps, no error). Add a feature-detect/version assert at the packed-path entry (or build the block mask explicitly).
  2. As merged alone (no attn_implementation knob on this branch), SDPA handles the packed mask on a slow path and attention cost grows to (sum L_i)^2 — likely a net slowdown on the very recipe this PR modifies, and it silently switches ALL B>1 qwen3 replays. Gate packed replay behind a default-off flag or land together with the flex follow-up (perf(qwen3): flex_attention for packed replay + length bucketing #44).

Smaller: guard n_p==0 (predict index -1); sum cost model can exceed the budget when prompt lengths are unavailable; add a real kill switch + a tiny CPU packed-vs-dense unit test.

Addresses #43 review: the packed path now requires
transformers.masking_utils.find_packed_sequence_indices (>=4.53) and falls
back to the dense path otherwise (older versions would silently attend
across sequence boundaries); UNIRL_DISABLE_PACKED_REPLAY=1 disables it
explicitly for A/B and rollback.
@leviking98z-rgb

Copy link
Copy Markdown
Collaborator Author

Review blockers addressed in 2f9fc23: the packed path now feature-detects transformers masking_utils.find_packed_sequence_indices and falls back to the dense path when absent (no silent cross-sequence attention on older pins), and UNIRL_DISABLE_PACKED_REPLAY=1 is an explicit kill switch — which also resolves the merge-alone concern (without flex the operator can disable packing; landing together with #44 remains the recommendation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant