perf(train): token-budget packed micro-batches (verl dynamic-bsz parity) #42

Draft
leviking98z-rgb wants to merge 3 commits into main from perf/01-token-budget-packing

@leviking98z-rgb leviking98z-rgb commented Jun 12, 2026

Collaborator

Part of the verl performance-parity series tracked in #40.

Summary

micro_batch_size=1 ran one sequence per forward/backward (32 isolated micro-steps per worker per rollout on 16 GPUs), the dominant cause of a measured 4.5x train-phase gap vs verl. This PR adds TrainStack.micro_token_budget (parity with verl's ppo_max_token_len_per_gpu): each mini-batch is packed into micro-batches under a token budget (length-sorted first-fit), with:

  • index-plan micros materialized via Batch.select
  • micro losses weighted by sample share (step gradient mathematically unchanged; unit-tested exact)
  • NCCL micro-count parity: per optimizer step, all-reduce(MAX) the bin count and re-partition into exactly K bins — FSDP runs collectives per micro, unequal counts deadlock the process group (observed: watchdog killed rank15)
  • replay-anchored algorithms scatter anchor chunks back to track order
  • diagnostics: per-step CUDA alloc/reserved watermark, per-micro replay/backward wall-clock, packing-efficiency log
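
As a rough illustration of the length-sorted first-fit packing described above, here is a minimal sketch. The function name `pack_first_fit`, the descending sort order, and the choice to give an over-budget sample its own bin are assumptions for illustration; the PR's actual packer may differ.

```python
from typing import List, Sequence


def pack_first_fit(lengths: Sequence[int], budget: int) -> List[List[int]]:
    """Pack sample indices into bins whose token totals stay within `budget`.

    Hypothetical helper: sorts longest-first (an assumption), then places each
    sample into the first bin that still has room.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    loads: List[int] = []
    for i in order:
        n = lengths[i]
        for b, load in enumerate(loads):
            if load + n <= budget:
                bins[b].append(i)
                loads[b] += n
                break
        else:
            # No existing bin fits. A sample longer than the budget still
            # gets a bin of its own so that every sample is covered.
            bins.append([i])
            loads.append(n)
    return bins


bins = pack_first_fit([700, 300, 500, 200, 400], budget=1000)
print(bins)
```

Each returned bin is a list of sample indices, which matches the "index-plan" idea: the plan holds indices only, and tensors are gathered later (in the PR, via Batch.select).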

Benchmark

16-GPU DRPO (Qwen3-4B-Base + DAPO, light regime, phase-timed): train phase 75.2s -> 54.3s; ratio_mean stays 1.000.

Test Plan

  • unit tests: bin coverage/budget invariants, exact-K partition, loss-weighting equality (1e-12)
  • 16-GPU runs: pack16_unirl_drpo_gz6 (wandb, project unirl-grpo) vs baseline perfphase16_unirl_base; ratio_mean≈1.0000 across all steps
  • legacy path regression: micro_token_budget: null preserves old count-based behavior
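
The exact-K partition invariant in the first test-plan item can be sketched as follows. This is a hedged, CPU-only illustration: `partition_exact_k` is a hypothetical name, the longest-first/least-loaded strategy is an assumption, and the sketch does not re-check the token budget. In a real run, K would come from a collective such as `torch.distributed.all_reduce(count, op=ReduceOp.MAX)`; here it is simulated with a plain `max`.

```python
from typing import List, Sequence


def partition_exact_k(lengths: Sequence[int], k: int) -> List[List[int]]:
    """Split sample indices into exactly `k` non-empty bins (hypothetical helper).

    Seeds each bin with one of the k longest samples, then assigns the rest
    longest-first to whichever bin currently holds the fewest tokens.
    """
    if k < 1 or len(lengths) < k:
        # How the PR handles a rank with fewer samples than K is not stated;
        # this sketch simply refuses.
        raise ValueError("need at least k samples to fill k non-empty bins")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = [[i] for i in order[:k]]
    loads = [lengths[i] for i in order[:k]]
    for i in order[k:]:
        b = min(range(k), key=loads.__getitem__)  # least-loaded bin
        bins[b].append(i)
        loads[b] += lengths[i]
    return bins


# Simulated per-rank bin counts; the MAX stands in for the all-reduce.
local_bin_counts = {"rank0": 3, "rank1": 4}
k = max(local_bin_counts.values())
rank0_bins = partition_exact_k([700, 300, 500, 200, 400], k)
print(k, rank0_bins)
```

Because every rank ends up with the same number of micros, every rank issues the same number of per-micro FSDP collectives, which is the property the deadlock fix relies on.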

micro_batch_size=1 ran one sequence per forward/backward (32 isolated
micro-steps per worker per rollout at 16 GPU) - measured 4.5x slower than
verl in the train phase. Pack each mini-batch into micro-batches under a
token budget (TrainStack.micro_token_budget, length-sorted first-fit), with:
- index-plan micros materialized via Batch.select
- micro losses weighted by sample share so the step gradient is unchanged
- NCCL micro-count parity: all-reduce(MAX) the bin count per optimizer step
  and re-partition into exactly K bins - FSDP runs collectives per micro, so
  unequal counts deadlock the process group
- replay-anchored algorithms scatter anchor chunks back to track order
- per-step CUDA alloc/reserved watermark + replay/backward wall-clock metrics

16-GPU DRPO (Qwen3-4B, DAPO, light regime): train phase 75.2s -> 54.3s.
@leviking98z-rgb

Collaborator Author

Review — request changes. Great direction (packing + NCCL parity design is sound), but two blockers:

  1. train() grad_norm flow broken (stack.py ~662): the CUDA-watermark block was inserted between the original if has_backward: / else:, so the else now binds to torch.cuda.is_available(). GPU + no-backward -> UnboundLocalError; CPU -> grad_norm silently forced to 0.0 + a spurious warning every step.
  2. Prompt lengths are dead code (stack.py ~437): resp_track.conditions is a Dict[str, Condition], so getattr(conditions, "prompt", None) always returns None -> the 2D packer is unreachable and the budget counts response tokens only, while the qwen3 packed replay forwards prompt+response — real micro size exceeds the budget by count * prompt_len. Use conditions.get("prompt").
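
Both blockers can be reproduced in a few lines of plain Python. The sketch below is a reduced stand-in, not the code from stack.py: the function names, the 1.5 placeholder norm, and the list used in place of a Condition value are all assumptions.

```python
def grad_norm_broken(has_backward: bool, cuda_available: bool) -> float:
    """Mimics blocker 1: a block inserted between `if` and `else` steals the `else`."""
    if has_backward:
        grad_norm = 1.5  # placeholder for the real clipped norm
    if cuda_available:
        pass  # CUDA watermark logging would go here
    else:
        grad_norm = 0.0  # now runs on every CPU step, even after a backward
    return grad_norm


def grad_norm_fixed(has_backward: bool, cuda_available: bool) -> float:
    """The two decisions are independent, so they get independent statements."""
    if has_backward:
        grad_norm = 1.5
    else:
        grad_norm = 0.0
    if cuda_available:
        pass  # CUDA watermark logging would go here
    return grad_norm


# Blocker 2: attribute lookup on a dict never sees its keys.
conditions = {"prompt": [12, 40, 7]}  # stand-in for Dict[str, Condition]
via_getattr = getattr(conditions, "prompt", None)  # always None for a dict key
via_get = conditions.get("prompt")                 # finds the entry
print(via_getattr, via_get)
```

Calling `grad_norm_broken(False, True)` raises UnboundLocalError, and `grad_norm_broken(True, False)` returns 0.0 despite the backward pass, matching the two failure modes described in item 1.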

Also (non-blocking):

  • example YAML should set micro_cost_model: sum
  • the "step gradient is unchanged" claim only holds for seq-mean agg modes; for token-mean (DRPO/GRPO default), sample-share weighting changes per-sequence weights vs both the mbs=1 baseline and verl token-count weighting
  • replay_time_s/backward_time_s are mean-per-micro, not phase totals
  • _sync_micro_count on the default PG is OK today but deadlocks if the lengths-missing fallback fires asymmetrically; assert symmetry
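
The token-mean caveat can be checked with a small exact-arithmetic example. This is an illustrative sketch only: the per-token loss values, the two-micro packing, and the helper names are made up, and it compares loss values (gradients are linear in the loss, so equal weighted losses imply equal step gradients).

```python
from fractions import Fraction as F

# Per-token losses for three sequences of different lengths (made-up values).
seqs = [[F(1), F(3)], [F(5)], [F(4)] * 4]
micros = [[0, 1], [2]]  # one possible packing of the three samples
n = len(seqs)
everything = list(range(n))


def seq_mean(idx):
    """Mean over sequences of each sequence's own token mean."""
    return sum(sum(seqs[i]) / len(seqs[i]) for i in idx) / len(idx)


def token_mean(idx):
    """Mean over all tokens in the selected sequences."""
    return sum(sum(seqs[i]) for i in idx) / sum(len(seqs[i]) for i in idx)


def packed(agg):
    """Combine micro losses weighted by sample share, len(micro) / n."""
    return sum(F(len(m), n) * agg(m) for m in micros)


print("seq-mean:   full", seq_mean(everything), "packed", packed(seq_mean))
print("token-mean: full", token_mean(everything), "packed", packed(token_mean))
```

With seq-mean aggregation the sample-share weights reproduce the full-batch loss exactly, while with token-mean aggregation the packed value drifts from the token-count-weighted one because long sequences no longer carry weight in proportion to their token count.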

Verified good: anchor scatter-back, multi-update plan sharing, bin non-emptiness, Batch.select packed-field gathering.

…mpt lengths

Addresses #42 review B1 (the CUDA-watermark insertion re-bound the
original else to torch.cuda.is_available(): GPU+no-backward raised
UnboundLocalError, CPU silently zeroed grad_norm) and B2 (conditions is a
Dict, getattr never found prompt — the 2D cost model was unreachable and
the budget counted response tokens only).
@leviking98z-rgb

Collaborator Author

Review findings addressed in 9dd5cbe: B1 (grad_norm else-branch restored) and B2 (dict accessor — the 2D/sum cost model now actually receives prompt lengths). Note: the published benchmark numbers were measured with the B2 bug live, i.e. budgets effectively counted response tokens only; prompt-aware accounting may slightly change bin shapes. Non-blocking items (token-mean weighting caveat, timing-metric semantics) acknowledged — will document rather than change behavior in this PR.

The token-budget packing path was bolted onto TrainStack (built for the
diffusion/FlowGRPO "batched" schedule) via runtime flags. Extract the
family-agnostic pipeline into AbstractTrainStack(Remote, ABC) and derive
DiffusionTrainStack (count-based micros) and LLMTrainStack (token-budget packed
micros). The diffusion-vs-LLM seam collapses to one method, _plan; packing lives
in stack/_packing.py as free functions. TrainStack aliases DiffusionTrainStack so
the 60+ diffusion/PE configs are untouched; the 6 examples/ar/* configs retarget
to LLMTrainStack.
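
The shape of that refactor might look roughly like the sketch below. Only the class names, the `_plan` seam, and the `TrainStack` alias come from the commit message; the constructor arguments, the `_plan` signature, the `micro_sizes` helper, and the inline packer are assumptions, and the `Remote` base class is omitted because its definition is not shown here.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class AbstractTrainStack(ABC):
    """Family-agnostic pipeline; subclasses only decide how micros are planned."""

    @abstractmethod
    def _plan(self, lengths: Sequence[int]) -> List[List[int]]:
        """Return micros as lists of sample indices (hypothetical signature)."""

    def micro_sizes(self, lengths: Sequence[int]) -> List[int]:
        # Hypothetical shared step that consumes the plan.
        return [len(m) for m in self._plan(lengths)]


class DiffusionTrainStack(AbstractTrainStack):
    """Count-based micros: fixed number of samples per micro."""

    def __init__(self, micro_batch_size: int) -> None:
        self.micro_batch_size = micro_batch_size

    def _plan(self, lengths: Sequence[int]) -> List[List[int]]:
        idx = list(range(len(lengths)))
        step = self.micro_batch_size
        return [idx[i:i + step] for i in range(0, len(idx), step)]


class LLMTrainStack(AbstractTrainStack):
    """Token-budget packed micros (first-fit over longest-first order)."""

    def __init__(self, micro_token_budget: int) -> None:
        self.micro_token_budget = micro_token_budget

    def _plan(self, lengths: Sequence[int]) -> List[List[int]]:
        bins: List[List[int]] = []
        loads: List[int] = []
        for i in sorted(range(len(lengths)), key=lambda j: -lengths[j]):
            for b, load in enumerate(loads):
                if load + lengths[i] <= self.micro_token_budget:
                    bins[b].append(i)
                    loads[b] += lengths[i]
                    break
            else:
                bins.append([i])
                loads.append(lengths[i])
        return bins


TrainStack = DiffusionTrainStack  # alias keeps existing configs working

print(TrainStack(2).micro_sizes([5, 5, 5]), LLMTrainStack(10).micro_sizes([7, 3, 5]))
```

Keeping the difference in a single overridable method means the shared update loop never branches on model family, which is the point of replacing the runtime flags.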

Carries the review fixes from 9dd5cbe forward into the new structure (stack.py is
removed): the grad_norm else-branch fix (B1) lives in base.py's _run_update — the
grad_norm if/else and the CUDA-memory block are independent; the prompt
dict-accessor fix (B2, conditions is a Dict so getattr never found "prompt") lives
in _packing.py's _plan_packed so the 2D cost model actually activates.

Add seq-mean guard: LLMTrainStack rejects token-budget packing unless the algorithm
uses a seq-mean loss aggregation, since sample-share micro weighting is
gradient-exact only for seq-mean modes.
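
A guard of this kind could be as small as the sketch below. The function name, the `loss_agg_mode` parameter, and the convention that seq-mean mode names start with "seq-mean" are assumptions for illustration; the commit does not show how the check is actually spelled.

```python
from typing import Optional


def check_packing_allowed(loss_agg_mode: str, micro_token_budget: Optional[int]) -> None:
    """Reject token-budget packing for non-seq-mean aggregation (hypothetical helper)."""
    if micro_token_budget is None:
        return  # legacy count-based path, nothing to guard
    if not loss_agg_mode.startswith("seq-mean"):
        raise ValueError(
            f"micro_token_budget={micro_token_budget} requires a seq-mean loss "
            f"aggregation, got {loss_agg_mode!r}: sample-share micro weighting "
            "is gradient-exact only for seq-mean modes"
        )


check_packing_allowed("seq-mean-token-mean", 8192)  # accepted
check_packing_allowed("token-mean", None)           # accepted: packing disabled
```

Failing at construction time rather than silently changing per-sequence weights keeps the "step gradient unchanged" claim true for every configuration that is allowed to run.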

Retire "mini-batch"; the optimizer-step level is an "update" (micro_slices->micros,
train->_run_update, _train_mini_batches->_run_updates, _plan_optimizer_steps->_plan).

Add tests/train/test_packing.py (22 CPU-only cases): packing coverage/budget
invariants, exact-K partition, count<->packed plan sample-set equivalence, 2D-packer
activation via the conditions dict (B2), and the seq-mean guard.