perf(train): token-budget packed micro-batches (verl dynamic-bsz parity) by leviking98z-rgb · Pull Request #42 · Tencent-Hunyuan/UniRL

leviking98z-rgb · 2026-06-12T04:48:35Z

Part of the verl performance-parity series tracked in #40.

Summary

With micro_batch_size=1, each forward/backward pass ran a single sequence (32 isolated micro-steps per worker per rollout on 16 GPUs), which was the dominant cause of a measured 4.5x train-phase gap vs verl. This PR adds TrainStack.micro_token_budget (parity with verl's ppo_max_token_len_per_gpu): each mini-batch is packed into micro-batches under a token budget (length-sorted first-fit), with:

- index-plan micros materialized via Batch.select
- micro losses weighted by sample share (step gradient mathematically unchanged; unit-tested exact)
- NCCL micro-count parity: per optimizer step, all-reduce(MAX) the bin count and re-partition into exactly K bins. FSDP runs collectives per micro, so unequal counts deadlock the process group (observed: watchdog killed rank15)
- replay-anchored algorithms scatter anchor chunks back to track order
- diagnostics: per-step CUDA alloc/reserved watermark, per-micro replay/backward wall-clock, packing-efficiency log
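
A minimal sketch of the length-sorted first-fit packing described in the list above. It assumes a descending sort and one token count per sample; pack_first_fit and its policy of giving an over-budget sample its own bin are hypothetical illustrations, not the PR's code.

```python
from typing import List


def pack_first_fit(lengths: List[int], token_budget: int) -> List[List[int]]:
    """Pack sample indices into bins whose token total stays within token_budget.

    Samples are visited longest first; each goes into the first bin with room.
    A sample longer than the budget gets a bin of its own (assumed policy).
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    loads: List[int] = []
    for i in order:
        for b, load in enumerate(loads):
            if load + lengths[i] <= token_budget:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([i])
            loads.append(lengths[i])
    return bins


lengths = [900, 120, 450, 300, 60, 700]  # hypothetical per-sample token counts
bins = pack_first_fit(lengths, token_budget=1024)
```

With these lengths, six single-sequence micro-steps collapse into three packed micros, which is the effect the PR relies on to cut the number of forward/backward passes.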

Benchmark

16-GPU DRPO (Qwen3-4B-Base + DAPO, light regime, phase-timed): train phase 75.2s -> 54.3s; ratio_mean stays 1.000.

Test Plan

- unit tests: bin coverage/budget invariants, exact-K partition, loss-weighting equality (1e-12)
- 16-GPU runs: pack16_unirl_drpo_gz6 (wandb, project unirl-grpo) vs baseline perfphase16_unirl_base; ratio_mean≈1.0000 across all steps
- legacy path regression: micro_token_budget: null preserves old count-based behavior
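
The exact-K partition invariant from the test plan can be illustrated as follows. This sketch assumes K has already been agreed on across ranks (the PR does this with an all-reduce(MAX) of the local bin counts; a plain max() over a list stands in for the collective here). partition_exact_k and the sample numbers are hypothetical.

```python
import heapq
from typing import List


def partition_exact_k(lengths: List[int], k: int) -> List[List[int]]:
    """Split sample indices into exactly k non-empty bins, balancing token load."""
    if k < 1 or len(lengths) < k:
        raise ValueError("need at least k samples to fill k non-empty bins")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    # Seed every bin with one of the k longest samples so no bin is empty.
    bins = [[i] for i in order[:k]]
    heap = [(lengths[i], b) for b, i in enumerate(order[:k])]
    heapq.heapify(heap)
    # Greedy balancing: each remaining sample joins the currently lightest bin.
    for i in order[k:]:
        load, b = heapq.heappop(heap)
        bins[b].append(i)
        heapq.heappush(heap, (load + lengths[i], b))
    return bins


# Stand-in for all-reduce(MAX): every rank adopts the largest local bin count.
local_bin_counts = [2, 3, 1]  # hypothetical counts from three ranks
k = max(local_bin_counts)

rank_lengths = [500, 400, 80, 60]  # hypothetical token counts on one rank
bins = partition_exact_k(rank_lengths, k)
```

Because every rank ends up with the same number of micros, each rank issues the same number of per-micro collectives, which is the property that avoids the deadlock described in the summary.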

micro_batch_size=1 ran one sequence per forward/backward (32 isolated micro-steps per worker per rollout at 16 GPU), measured 4.5x slower than verl in the train phase. Pack each mini-batch into micro-batches under a token budget (TrainStack.micro_token_budget, length-sorted first-fit), with:

- index-plan micros materialized via Batch.select
- micro losses weighted by sample share so the step gradient is unchanged
- NCCL micro-count parity: all-reduce(MAX) the bin count per optimizer step and re-partition into exactly K bins; FSDP runs collectives per micro, so unequal counts deadlock the process group
- replay-anchored algorithms scatter anchor chunks back to track order
- per-step CUDA alloc/reserved watermark + replay/backward wall-clock metrics

16-GPU DRPO (Qwen3-4B, DAPO, light regime): train phase 75.2s -> 54.3s.

leviking98z-rgb · 2026-06-12T09:09:36Z

Review: request changes. Great direction (the packing + NCCL parity design is sound), but two blockers:

B1. train() grad_norm flow broken (stack.py ~662): the CUDA-watermark block was inserted between the original if has_backward: / else:, so the else now binds to the torch.cuda.is_available() check. GPU + no-backward -> UnboundLocalError; CPU -> grad_norm silently forced to 0.0 plus a spurious warning every step.
B2. Prompt lengths are dead code (stack.py ~437): resp_track.conditions is a Dict[str, Condition], so getattr(conditions, "prompt", None) always returns None. The 2D packer is therefore unreachable and the budget counts response tokens only, while the qwen3 packed replay forwards prompt+response, so the real micro size exceeds the budget by count * prompt_len. Use conditions.get("prompt").
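
Both blockers above can be reproduced in isolation. The sketch below is a stripped-down illustration, not the code from stack.py: the function names, the placeholder norm value, and the prompt/response lengths are hypothetical, and a plain list of ints stands in for the Condition type.

```python
from typing import Dict, List


def grad_norm_broken(has_backward: bool, cuda_available: bool) -> float:
    if has_backward:
        grad_norm = 1.5  # placeholder for the real gradient norm
    # A block inserted here makes the `else` below pair with this `if`.
    if cuda_available:
        _watermark = 0  # placeholder for the memory diagnostics
    else:
        grad_norm = 0.0
    return grad_norm


def grad_norm_fixed(has_backward: bool, cuda_available: bool) -> float:
    # The grad_norm if/else and the CUDA block are independent again.
    if has_backward:
        grad_norm = 1.5
    else:
        grad_norm = 0.0
    if cuda_available:
        _watermark = 0
    return grad_norm


# Second blocker: attribute lookup on a dict never sees its keys.
conditions: Dict[str, List[int]] = {"prompt": [200, 200, 200, 200]}
via_getattr = getattr(conditions, "prompt", None)  # always None for a dict key
via_get = conditions.get("prompt")  # the actual prompt lengths

# Budget counted response tokens only, but the forward pass sees prompt+response.
response_lens = [256, 256, 256, 256]
budget_tokens = sum(response_lens)
actual_tokens = budget_tokens + sum(via_get)
overshoot = actual_tokens - budget_tokens  # count * prompt_len when prompts are equal
```

Calling grad_norm_broken(False, True) raises UnboundLocalError, and grad_norm_broken(True, False) returns 0.0 even though a backward pass ran, matching the two failure modes named in the review.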

Also (non-blocking):

- The example YAML should set micro_cost_model: sum.
- The "step gradient is unchanged" claim holds only for seq-mean agg modes. For token-mean (the DRPO/GRPO default), sample-share weighting changes per-sequence weights relative to both the mbs=1 baseline and verl's token-count weighting.
- replay_time_s/backward_time_s are mean-per-micro, not phase totals.
- _sync_micro_count on the default PG is OK today but deadlocks if the lengths-missing fallback fires asymmetrically; assert symmetry.
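
The weighting caveat can be checked with a small worked example. Since the loss is linear in the per-token terms, agreement of the combined loss value implies agreement of the gradient. The per-token losses and the micro split below are made-up numbers, and the helper names are hypothetical.

```python
from fractions import Fraction
from typing import List

Seq = List[int]


def seq_mean(seqs: List[Seq]) -> Fraction:
    """Mean over sequences of each sequence's mean token loss."""
    return sum(Fraction(sum(s), len(s)) for s in seqs) / len(seqs)


def token_mean(seqs: List[Seq]) -> Fraction:
    """Mean over all tokens in the batch."""
    return Fraction(sum(sum(s) for s in seqs), sum(len(s) for s in seqs))


def tokens(seqs: List[Seq]) -> int:
    return sum(len(s) for s in seqs)


# Hypothetical per-token losses for three sequences of different lengths.
batch = [[1, 3], [2], [4, 4, 4, 0]]
micros = [batch[:2], batch[2:]]  # two packed micro-batches

n, t = len(batch), tokens(batch)

# Sample-share weighting: each micro is weighted by (its samples / all samples).
seq_mean_packed = sum(Fraction(len(m), n) * seq_mean(m) for m in micros)
token_mean_sample_share = sum(Fraction(len(m), n) * token_mean(m) for m in micros)

# Token-count weighting: each micro is weighted by (its tokens / all tokens).
token_mean_token_share = sum(Fraction(tokens(m), t) * token_mean(m) for m in micros)
```

Here seq_mean_packed equals seq_mean(batch) exactly, while token_mean_sample_share (7/3) differs from token_mean(batch) (18/7); only token-count weighting recovers the token-mean value.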

Verified good: anchor scatter-back, multi-update plan sharing, bin non-emptiness, Batch.select packed-field gathering.

…mpt lengths

Addresses #42 review B1 (the CUDA-watermark insertion re-bound the original else to torch.cuda.is_available(): GPU+no-backward raised UnboundLocalError, CPU silently zeroed grad_norm) and B2 (conditions is a Dict, so getattr never found prompt; the 2D cost model was unreachable and the budget counted response tokens only).

leviking98z-rgb · 2026-06-12T09:12:56Z

Review findings addressed in 9dd5cbe: B1 (grad_norm else-branch restored) and B2 (dict accessor — the 2D/sum cost model now actually receives prompt lengths). Note: the published benchmark numbers were measured with the B2 bug live, i.e. budgets effectively counted response tokens only; prompt-aware accounting may slightly change bin shapes. Non-blocking items (token-mean weighting caveat, timing-metric semantics) acknowledged — will document rather than change behavior in this PR.

The token-budget packing path was bolted onto TrainStack (built for the diffusion/FlowGRPO "batched" schedule) via runtime flags. Extract the family-agnostic pipeline into AbstractTrainStack(Remote, ABC) and derive DiffusionTrainStack (count-based micros) and LLMTrainStack (token-budget packed micros). The diffusion-vs-LLM seam collapses to one method, _plan; packing lives in stack/_packing.py as free functions. TrainStack aliases DiffusionTrainStack so the 60+ diffusion/PE configs are untouched; the 6 examples/ar/* configs retarget to LLMTrainStack.

Carries the review fixes from 9dd5cbe forward into the new structure (stack.py is removed): the grad_norm else-branch fix (B1) lives in base.py's _run_update, where the grad_norm if/else and the CUDA-memory block are independent; the prompt dict-accessor fix (B2, conditions is a Dict so getattr never found "prompt") lives in _packing.py's _plan_packed so the 2D cost model actually activates.

Add seq-mean guard: LLMTrainStack rejects token-budget packing unless the algorithm uses a seq-mean loss aggregation, since sample-share micro weighting is gradient-exact only for seq-mean modes.

Retire "mini-batch"; the optimizer-step level is an "update" (micro_slices->micros, train->_run_update, _train_mini_batches->_run_updates, _plan_optimizer_steps->_plan).

Add tests/train/test_packing.py (22 CPU-only cases): packing coverage/budget invariants, exact-K partition, count<->packed plan sample-set equivalence, 2D-packer activation via the conditions dict (B2), and the seq-mean guard.
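
A rough sketch of the structure this commit message describes, with the diffusion-vs-LLM difference confined to _plan. Only the class names, the TrainStack alias, and _plan come from the text; the Remote base is omitted, and the constructor arguments, the loss_agg_mode string, and the simplified in-order packing are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class AbstractTrainStack(ABC):
    """Family-agnostic pipeline; subclasses only decide how micros are planned."""

    @abstractmethod
    def _plan(self, lengths: Sequence[int]) -> List[List[int]]:
        """Return micro-batches as lists of sample indices for one update."""


class DiffusionTrainStack(AbstractTrainStack):
    def __init__(self, micro_batch_size: int) -> None:
        self.micro_batch_size = micro_batch_size

    def _plan(self, lengths: Sequence[int]) -> List[List[int]]:
        # Count-based micros: fixed number of samples per micro-batch.
        idx = list(range(len(lengths)))
        step = self.micro_batch_size
        return [idx[i:i + step] for i in range(0, len(idx), step)]


class LLMTrainStack(AbstractTrainStack):
    def __init__(self, micro_token_budget: int, loss_agg_mode: str) -> None:
        # Seq-mean guard: sample-share weighting is exact only for seq-mean.
        if not loss_agg_mode.startswith("seq-mean"):
            raise ValueError("token-budget packing requires a seq-mean loss aggregation")
        self.micro_token_budget = micro_token_budget

    def _plan(self, lengths: Sequence[int]) -> List[List[int]]:
        # Simplified in-order packing (assumption; not length-sorted first-fit).
        micros: List[List[int]] = []
        current: List[int] = []
        load = 0
        for i, n in enumerate(lengths):
            if current and load + n > self.micro_token_budget:
                micros.append(current)
                current, load = [], 0
            current.append(i)
            load += n
        if current:
            micros.append(current)
        return micros


# Existing configs that name TrainStack keep the count-based behavior.
TrainStack = DiffusionTrainStack
```

Keeping the alias means configs written against TrainStack resolve to the count-based class without edits, while LLM configs opt in to packing by naming LLMTrainStack.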

This was referenced Jun 12, 2026:

VLM/LLM Infra Updating #40 (Open)
perf(recipe): enable token-budget packing + DP balancing for Qwen-VL recipes #48 (Closed)

leviking98z-rgb wants to merge 3 commits into main from perf/01-token-budget-packing

2 participants
