
Add RL token throughput and packing metrics#3877

Open
tdene wants to merge 4 commits into NVIDIA:main from tdene:tde/observability_metrics

Conversation

@tdene
Contributor

@tdene tdene commented Mar 15, 2026

What does this PR do ?

⚠️ For major changes (either in lines of code or in impact), please first share a design doc with the team. If you're unsure of the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message @mcore-oncall or mention them in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark the PR as ready once merge conflicts are resolved and CI is passing.
The Final Review may be declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process for `dev` is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 15, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@tdene tdene marked this pull request as ready for review March 15, 2026 22:31
@tdene tdene requested a review from a team as a code owner March 15, 2026 22:31
@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 15, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 15, 2026 22:31
tokens_per_sec_per_gpu = tokens_per_sec / args.world_size

# For sequence packing, also compute actual tokens (non-padding)
if has_rl_utils and getattr(args, 'perform_rl_step', False) and getattr(args, 'rl_use_sequence_packing', False):
Contributor

This looks unnecessarily complicated to me. Why do we check both whether rl utils are available and whether an RL step was requested? We should validate this at the arg-parse level and crash when someone requests an RL step but has issues importing rl utils.

Why do we use getattr here but normal args.ARG above? Let's use args.rl_use_sequence_packing instead and let the argparser select the default values.

Contributor Author

Addressed
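For illustration, a minimal sketch of what the simplified guard might look like once the defaults come from the argument parser (the flag names here are assumptions mirroring the discussion, not the PR's actual definitions):

```python
import argparse

# Hypothetical re-creation of the reviewer's suggestion: the parser
# supplies defaults for both flags, so downstream code can use plain
# attribute access instead of getattr. Flag names are illustrative.
parser = argparse.ArgumentParser()
parser.add_argument('--perform-rl-step', action='store_true')
parser.add_argument('--rl-use-sequence-packing', action='store_true')
args = parser.parse_args(['--perform-rl-step', '--rl-use-sequence-packing'])

# With defaults guaranteed by argparse, the guard simplifies to:
packing_metrics_enabled = args.perform_rl_step and args.rl_use_sequence_packing
```

Because `store_true` actions always default to `False`, the attributes are guaranteed to exist, and an invalid combination can be rejected right after parsing rather than deep in the logging path.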

compute_tokens = rl_utils.get_packing_compute_tokens(runtime_state.packing_context)

# Scale to global batch (all DP ranks)
actual_tokens_global = actual_tokens * mpu.get_data_parallel_world_size()
Contributor

I would rename actual to something more meaningful, e.g. all_dp_ranks_tokens or all_ranks_tokens or something similar.

Contributor Author

Addressed

log_string += f' packing_eff: {packing_efficiency:.1%} |'

# Store derived throughput metrics on RLRuntimeState so that
# downstream consumers (e.g. RLProfiler) can read them.
Contributor

We need to log those to wandb too.

Contributor Author

Addressed
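As a hedged sketch of what "log those to wandb too" could look like, using a stand-in writer object rather than the real wandb client (the metric names and the writer class are illustrative, not the PR's code):

```python
# Illustrative only: a stub writer collects the metrics that would be
# forwarded to a wandb-style logger alongside the console log line.
class StubWriter:
    def __init__(self):
        self.logged = {}

    def log(self, metrics, step):
        # Record metrics keyed by training step, mimicking wandb.log.
        self.logged[step] = metrics

writer = StubWriter()
tokens_per_sec = 125000.0
tokens_per_sec_per_gpu = tokens_per_sec / 8  # assume 8 GPUs
packing_efficiency = 0.87
iteration = 42

# Mirror the console throughput numbers into the metrics writer so
# dashboards see the same values as the log string.
writer.log({
    'rl/tokens_per_sec': tokens_per_sec,
    'rl/tokens_per_sec_per_gpu': tokens_per_sec_per_gpu,
    'rl/packing_efficiency': packing_efficiency,
}, step=iteration)
```

In the real code the writer would be whatever wandb handle Megatron's training loop already holds; the point is only that the derived metrics are pushed per-iteration, not just formatted into the log string.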

Returns:
Total compute tokens (num_bins * bin_size) on this rank.
"""
if packing_context is None or packing_context.packed_trajs is None:
Contributor

Your typing says that PackingContext cannot be None

Contributor Author

Addressed
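To illustrate the typing fix, a self-contained sketch of how the helper's signature and guard might be aligned. `PackingContext` here is a hypothetical stand-in using nested lists instead of tensors; the real class lives in the PR's rl utils:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PackingContext:
    """Hypothetical stand-in; the field name mirrors the quoted diff."""
    packed_trajs: Optional[List[list]]

def get_packing_compute_tokens(packing_context: PackingContext) -> int:
    """Total compute tokens (num_bins * bin_size) on this rank.

    The annotation promises a non-None context, so only the inner
    packed_trajs field still needs a guard.
    """
    if packing_context.packed_trajs is None:
        return 0
    num_bins = len(packing_context.packed_trajs)
    bin_size = len(packing_context.packed_trajs[0]) if num_bins else 0
    return num_bins * bin_size
```

Either the annotation should be `Optional[PackingContext]` or the `packing_context is None` branch should go; keeping both, as the reviewer notes, makes the contract ambiguous.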

num_ranks = mpu.get_data_parallel_world_size()
bins_per_rank = packing_context.packed_trajs.shape[0] if packing_context.packed_trajs is not None else 0
bin_size = packing_context.packed_trajs.shape[1] if packing_context.packed_trajs is not None else 0
total_capacity = bins_per_rank * bin_size * num_ranks
Contributor

Is this true that every rank will have the same amount of bins?

Contributor Author

Yes. distribute_packed_bins ensures it.
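The uniform-bins invariant makes the capacity math a plain product. A small illustrative sketch (function names are mine, not the PR's; the invariant that every DP rank holds the same number of bins is taken from the author's reply):

```python
def total_capacity(bins_per_rank: int, bin_size: int, num_ranks: int) -> int:
    # Because distribute_packed_bins gives every data-parallel rank the
    # same number of bins, total capacity is a simple product.
    return bins_per_rank * bin_size * num_ranks

def packing_efficiency(actual_tokens: int, capacity: int) -> float:
    # Fraction of packed slots holding real (non-padding) tokens.
    return actual_tokens / capacity if capacity else 0.0
```

If the bin count could differ per rank, the capacity would instead need an all-reduce over per-rank bin counts, which is exactly why the invariant matters here.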

@tdene
Contributor Author

tdene commented Mar 16, 2026

/claude review


# Add tokens/sec to log string
log_string += f' toks/s: {tokens_per_sec:.0f} |'
log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
Contributor

compute_tokens is assigned here but never used. Was this intended for something (e.g., a log line or the packing_efficiency calculation)? If not, it should be removed to avoid confusion.

Suggested change
log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
actual_tokens = rl_utils.get_packing_actual_tokens(runtime_state.packing_context)

Contributor Author

Addressed

packing_efficiency = rl_utils.get_packing_efficiency(runtime_state.packing_context)

# Add tokens/sec to log string
log_string += f' toks/s: {tokens_per_sec:.0f} |'
Contributor

Is this going to add this metric to the log for all training? I'm not sure we use this metric a lot in pretraining, so nervous it might just be adding noise to the log.

Contributor Author

I've moved all the extra metrics in training.py into a single if-block guarded by args.perform_rl_step; does that look good?
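A rough sketch of what grouping the RL-only metrics under a single guard might look like (names mirror the discussion, not the actual PR code):

```python
# Rough sketch: RL-only throughput metrics grouped under one guard, as
# the author describes, so pretraining logs stay unchanged.
def build_log_string(args, tokens_per_sec, world_size, packing_efficiency=None):
    log_string = ''
    if args.perform_rl_step:
        tokens_per_sec_per_gpu = tokens_per_sec / world_size
        log_string += f' toks/s: {tokens_per_sec:.0f} |'
        log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
        if packing_efficiency is not None:
            log_string += f' packing_eff: {packing_efficiency:.1%} |'
    return log_string

class _Args:
    """Hypothetical stand-in for Megatron's parsed args."""
    perform_rl_step = True

sample = build_log_string(_Args(), 128000.0, 8, packing_efficiency=0.87)
```

With `perform_rl_step` false the function returns an empty string, which addresses the reviewer's concern about adding noise to pretraining logs.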

@tdene tdene force-pushed the tde/observability_metrics branch from b215575 to 2b2a0d3 Compare March 19, 2026 21:21
