perf(qwen3): packed varlen replay — zero padding (verl remove_padding parity)#43
perf(qwen3): packed varlen replay — zero padding (verl remove_padding parity)#43leviking98z-rgb wants to merge 3 commits into
Conversation
… parity) Dense replay forwards [B, P_max + T_max] with prompt and response blocks padded separately; packing fills it at most ~65%. B>1 replay now concatenates real prompt + flat response tokens into ONE packed row: transformers >= 4.57 natively detects the restarting position_ids (attention_mask=None) and builds the block-causal mask, so per-sequence RoPE/logp semantics are unchanged. The chunked fp32 logp runs over hidden states gathered at flat predict positions. Also trims the dense path's prompt block to the micro's true max width. EQUIVALENCE: packed vs per-sample dense replay, fp32: mean|dlogp|=1e-5, max=3e-5. TrainStack micro_cost_model=sum switches packing to real-token accounting. (Needs the flex_attention follow-up to pay off: SDPA falls back to the math kernel on packed block-causal masks.)
|
Review — request changes. The packed-stream mechanics check out: predict_index Two blockers before merge:
Smaller: guard n_p==0 (predict index -1); |
Addresses #43 review: the packed path now requires transformers.masking_utils.find_packed_sequence_indices (>=4.53) and falls back to the dense path otherwise (older versions would silently attend across sequence boundaries); UNIRL_DISABLE_PACKED_REPLAY=1 disables it explicitly for A/B and rollback.
|
Review blockers addressed in 2f9fc23: the packed path now feature-detects transformers masking_utils.find_packed_sequence_indices and falls back to the dense path when absent (no silent cross-sequence attention on older pins), and UNIRL_DISABLE_PACKED_REPLAY=1 is an explicit kill switch — which also resolves the merge-alone concern (without flex the operator can disable packing; landing together with #44 remains the recommendation). |
Part of the verl performance-parity series tracked in #40.
Summary
Dense replay forwards
[B, P_max + T_max]with the prompt and response blocks padded separately; even sorted packing fills it at most ~65%. For B>1 the replay now concatenates real prompt + flat response tokens into ONE packed row: transformers >= 4.57 natively detects restartingposition_ids(withattention_mask=None) and builds the block-causal mask, so per-sequence RoPE/logp semantics are unchanged. The chunked fp32 logp runs over hidden states gathered at the flat predict positions. Also trims the dense path prompt block to the micro true max width (the track-level concat pads to the worker-shard max: train phase 54.3s -> 46.8s measured for the trim alone).TrainStack.micro_cost_model=sumswitches packing to real-token accounting (= verl). Pays off together with the flex_attention follow-up PR — SDPA falls back to the math kernel on packed block-causal masks.Equivalence
packed vs per-sample dense replay, fp32 weights/body: mean |dlogp| = 1e-5, max = 3e-5 (bf16 run noise is the same class as the documented dense micro batch-shape sensitivity; rollout-anchored ratio absorbs it — ratio_mean 1.0000 in-run).
Test Plan
pack16v5_unirl_drpo_gz6: ratio 0.9998-1.0001