Add RL token throughput and packing metrics (#3877)
Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
megatron/training/training.py (Outdated)

```python
tokens_per_sec_per_gpu = tokens_per_sec / args.world_size

# For sequence packing, also compute actual tokens (non-padding)
if has_rl_utils and getattr(args, 'perform_rl_step', False) and getattr(args, 'rl_use_sequence_packing', False):
```
This looks unnecessarily complicated to me. Why do we check both that RL utils are available and that an RL step was requested? We should validate this at the argument-parsing level and crash when someone requests an RL step but the RL utils import has issues.

Also, why do we use getattr here but plain args.ARG above? Let's use args.rl_use_sequence_packing instead and let the argparser supply the default values.
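A minimal sketch of what that argparse-time validation could look like. `validate_rl_args` and the exact flag wiring are hypothetical; the real Megatron-LM arguments module differs:

```python
# Hypothetical sketch: fail fast at argument-validation time so that the
# logging path can rely on plain attribute access (args.rl_use_sequence_packing)
# instead of repeated getattr()/import-availability checks.
def validate_rl_args(args, rl_utils_importable: bool) -> None:
    if args.perform_rl_step and not rl_utils_importable:
        raise ValueError(
            "--perform-rl-step was requested but the RL utilities could not be imported"
        )
    if args.rl_use_sequence_packing and not args.perform_rl_step:
        raise ValueError("--rl-use-sequence-packing requires --perform-rl-step")
```

With validation done once up front, the logging code can drop the `has_rl_utils and getattr(...)` guards entirely.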
megatron/training/training.py (Outdated)

```python
compute_tokens = rl_utils.get_packing_compute_tokens(runtime_state.packing_context)

# Scale to global batch (all DP ranks)
actual_tokens_global = actual_tokens * mpu.get_data_parallel_world_size()
```
I would rename actual_tokens to something more meaningful, e.g. all_dp_ranks_tokens or all_ranks_tokens.
megatron/training/training.py (Outdated)

```python
log_string += f' packing_eff: {packing_efficiency:.1%} |'

# Store derived throughput metrics on RLRuntimeState so that
# downstream consumers (e.g. RLProfiler) can read them.
```
We need to log those to wandb too.
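A hedged sketch of what the wandb logging could look like. The metric key names and the `build_rl_throughput_metrics` / `log_rl_metrics` helpers are made up for illustration; Megatron obtains its wandb writer elsewhere in training.py:

```python
def build_rl_throughput_metrics(tokens_per_sec: float,
                                tokens_per_sec_per_gpu: float,
                                packing_efficiency: float) -> dict:
    # Hypothetical metric names; pick whatever matches the existing wandb namespace.
    return {
        'rl/tokens_per_sec': tokens_per_sec,
        'rl/tokens_per_sec_per_gpu': tokens_per_sec_per_gpu,
        'rl/packing_efficiency': packing_efficiency,
    }

def log_rl_metrics(wandb_writer, iteration: int, metrics: dict) -> None:
    # wandb_writer may be None when wandb logging is disabled.
    if wandb_writer is not None:
        wandb_writer.log(metrics, iteration)
```

Building the dict in one place keeps the console log line and the wandb series consistent.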
```python
    Returns:
        Total compute tokens (num_bins * bin_size) on this rank.
    """
    if packing_context is None or packing_context.packed_trajs is None:
```
Your typing says that PackingContext cannot be None
```python
num_ranks = mpu.get_data_parallel_world_size()
bins_per_rank = packing_context.packed_trajs.shape[0] if packing_context.packed_trajs is not None else 0
bin_size = packing_context.packed_trajs.shape[1] if packing_context.packed_trajs is not None else 0
total_capacity = bins_per_rank * bin_size * num_ranks
```
Is it true that every rank will have the same number of bins?
Yes. distribute_packed_bins ensures it.
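Given that guarantee, the total-capacity computation reduces to a product; a self-contained sketch under that assumption (the `FakePackingContext` stand-in below is hypothetical and only mimics the shape information used here):

```python
from dataclasses import dataclass

@dataclass
class FakePackingContext:
    # Stand-in for the real context: packed_trajs has shape (bins_per_rank, bin_size).
    bins_per_rank: int
    bin_size: int

def total_compute_capacity(ctx: FakePackingContext, dp_world_size: int) -> int:
    # Every DP rank holds the same number of equally sized bins
    # (distribute_packed_bins guarantees this), so global capacity is a product.
    return ctx.bins_per_rank * ctx.bin_size * dp_world_size
```

For example, 4 bins of 2048 tokens on each of 8 DP ranks gives a compute capacity of 65,536 tokens per step.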
/claude review
megatron/training/training.py (Outdated)

```python
# Add tokens/sec to log string
log_string += f' toks/s: {tokens_per_sec:.0f} |'
log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
```
compute_tokens is assigned here but never used. Was this intended for something (e.g., a log line or the packing_efficiency calculation)? If not, it should be removed to avoid confusion.
Suggested change:

```python
log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
actual_tokens = rl_utils.get_packing_actual_tokens(runtime_state.packing_context)
```
megatron/training/training.py (Outdated)

```python
packing_efficiency = rl_utils.get_packing_efficiency(runtime_state.packing_context)

# Add tokens/sec to log string
log_string += f' toks/s: {tokens_per_sec:.0f} |'
```
Is this going to add this metric to the log for all training? I'm not sure we use this metric a lot in pretraining, so nervous it might just be adding noise to the log.
I've moved all the extra metrics in training.py into a single if-block guarded by args.perform_rl_step; does that look good?
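That consolidated guard could look roughly like this sketch (the helper names are hypothetical; the efficiency formula assumes actual non-padding tokens divided by the padded compute capacity):

```python
def packing_efficiency(actual_tokens: int, compute_tokens: int) -> float:
    # Fraction of the padded compute budget spent on real (non-padding) tokens.
    if compute_tokens == 0:
        return 0.0
    return actual_tokens / compute_tokens

def append_rl_metrics(log_string: str, perform_rl_step: bool,
                      tokens_per_sec: float, tokens_per_sec_per_gpu: float,
                      efficiency: float) -> str:
    # All RL-only metrics live behind a single flag so pretraining logs stay clean.
    if perform_rl_step:
        log_string += f' toks/s: {tokens_per_sec:.0f} |'
        log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
        log_string += f' packing_eff: {efficiency:.1%} |'
    return log_string
```

With this shape, a pretraining run (perform_rl_step disabled) leaves the log string untouched, addressing the noise concern above.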
Force-pushed from b215575 to 2b2a0d3.
What does this PR do?
Contribution process
Pre-checks
Code review
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS. Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned. For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into the `dev` branch

The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.