perf(train): token-budget packed micro-batches (verl dynamic-bsz parity)#42
perf(train): token-budget packed micro-batches (verl dynamic-bsz parity)#42leviking98z-rgb wants to merge 3 commits into
Conversation
micro_batch_size=1 ran one sequence per forward/backward (32 isolated micro-steps per worker per rollout at 16 GPU) - measured 4.5x slower than verl in the train phase. Pack each mini-batch into micro-batches under a token budget (TrainStack.micro_token_budget, length-sorted first-fit), with: - index-plan micros materialized via Batch.select - micro losses weighted by sample share so the step gradient is unchanged - NCCL micro-count parity: all-reduce(MAX) the bin count per optimizer step and re-partition into exactly K bins - FSDP runs collectives per micro, so unequal counts deadlock the process group - replay-anchored algorithms scatter anchor chunks back to track order - per-step CUDA alloc/reserved watermark + replay/backward wall-clock metrics 16-GPU DRPO (Qwen3-4B, DAPO, light regime): train phase 75.2s -> 54.3s.
|
Review — request changes. Great direction (packing + NCCL parity design is sound), but two blockers:
Also (non-blocking): example YAML should set Verified good: anchor scatter-back, multi-update plan sharing, bin non-emptiness, |
…mpt lengths Addresses #42 review B1 (the CUDA-watermark insertion re-bound the original else to torch.cuda.is_available(): GPU+no-backward raised UnboundLocalError, CPU silently zeroed grad_norm) and B2 (conditions is a Dict, getattr never found prompt — the 2D cost model was unreachable and the budget counted response tokens only).
|
Review findings addressed in 9dd5cbe: B1 (grad_norm else-branch restored) and B2 (dict accessor — the 2D/sum cost model now actually receives prompt lengths). Note: the published benchmark numbers were measured with the B2 bug live, i.e. budgets effectively counted response tokens only; prompt-aware accounting may slightly change bin shapes. Non-blocking items (token-mean weighting caveat, timing-metric semantics) acknowledged — will document rather than change behavior in this PR. |
The token-budget packing path was bolted onto TrainStack (built for the diffusion/FlowGRPO "batched" schedule) via runtime flags. Extract the family-agnostic pipeline into AbstractTrainStack(Remote, ABC) and derive DiffusionTrainStack (count-based micros) and LLMTrainStack (token-budget packed micros). The diffusion-vs-LLM seam collapses to one method, _plan; packing lives in stack/_packing.py as free functions. TrainStack aliases DiffusionTrainStack so the 60+ diffusion/PE configs are untouched; the 6 examples/ar/* configs retarget to LLMTrainStack. Carries the review fixes from 9dd5cbe forward into the new structure (stack.py is removed): the grad_norm else-branch fix (B1) lives in base.py's _run_update — the grad_norm if/else and the CUDA-memory block are independent; the prompt dict-accessor fix (B2, conditions is a Dict so getattr never found "prompt") lives in _packing.py's _plan_packed so the 2D cost model actually activates. Add seq-mean guard: LLMTrainStack rejects token-budget packing unless the algorithm uses a seq-mean loss aggregation, since sample-share micro weighting is gradient-exact only for seq-mean modes. Retire "mini-batch"; the optimizer-step level is an "update" (micro_slices->micros, train->_run_update, _train_mini_batches->_run_updates, _plan_optimizer_steps->_plan). Add tests/train/test_packing.py (22 CPU-only cases): packing coverage/budget invariants, exact-K partition, count<->packed plan sample-set equivalence, 2D-packer activation via the conditions dict (B2), and the seq-mean guard.
Part of the verl performance-parity series tracked in #40.
Summary
micro_batch_size=1ran one sequence per forward/backward (32 isolated micro-steps per worker per rollout at 16 GPU) — the dominant cause of a measured 4.5x train-phase gap vs verl. This addsTrainStack.micro_token_budget(verlppo_max_token_len_per_gpuparity): each mini-batch is packed into micro-batches under a token budget (length-sorted first-fit), withBatch.selectBenchmark
16-GPU DRPO (Qwen3-4B-Base + DAPO, light regime, phase-timed): train phase 75.2s -> 54.3s; ratio_mean stays 1.000.
Test Plan
pack16_unirl_drpo_gz6(wandb, project unirl-grpo) vs baselineperfphase16_unirl_base; ratio_mean≈1.0000 across all stepsmicro_token_budget: nullpreserves old count-based behavior