
Add RL token throughput and packing metrics#3877

Open
tdene wants to merge 4 commits into NVIDIA:main from tdene:tde/observability_metrics

Conversation

@tdene
Contributor

@tdene tdene commented Mar 15, 2026

What does this PR do ?

⚠️ For major changes (either in lines of code or in impact), please first share a design doc with the team. If you're unsure of the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message @mcore-oncall or mention them in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark the PR as ready once merge conflicts are resolved and CI is passing.
The Final Review may be declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process for `dev` is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 15, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@tdene tdene marked this pull request as ready for review March 15, 2026 22:31
@tdene tdene requested a review from a team as a code owner March 15, 2026 22:31
@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 15, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 15, 2026 22:31
tokens_per_sec_per_gpu = tokens_per_sec / args.world_size

# For sequence packing, also compute actual tokens (non-padding)
if has_rl_utils and getattr(args, 'perform_rl_step', False) and getattr(args, 'rl_use_sequence_packing', False):
Contributor

This looks unnecessarily complicated to me. Why do we check both whether rl utils are available and whether an RL step was requested? We should validate this at the arg-parse level and crash when someone requests an RL step but has issues importing rl utils.

Why do we use getattr here but normal args.ARG above? Let's use args.rl_use_sequence_packing instead and let the argparser select the default values.

Contributor Author

Addressed
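For illustration, a minimal sketch of what the simplified guard might look like once the defaults come from the argument parser (the flag names here are assumptions mirroring the discussion, not the PR's actual definitions):

```python
import argparse

# Hypothetical re-creation of the reviewer's suggestion: the parser
# supplies defaults for both flags, so downstream code can use plain
# attribute access instead of getattr. Flag names are illustrative.
parser = argparse.ArgumentParser()
parser.add_argument('--perform-rl-step', action='store_true')
parser.add_argument('--rl-use-sequence-packing', action='store_true')
args = parser.parse_args(['--perform-rl-step', '--rl-use-sequence-packing'])

# With defaults guaranteed by argparse, the guard simplifies to:
packing_metrics_enabled = args.perform_rl_step and args.rl_use_sequence_packing
```

Because `store_true` actions always default to `False`, the attributes are guaranteed to exist, and an invalid combination can be rejected right after parsing rather than deep in the logging path.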

compute_tokens = rl_utils.get_packing_compute_tokens(runtime_state.packing_context)

# Scale to global batch (all DP ranks)
actual_tokens_global = actual_tokens * mpu.get_data_parallel_world_size()
Contributor

I would rename actual to something more meaningful, e.g. all_dp_ranks_tokens or all_ranks_tokens or something similar.

Contributor Author

Addressed

log_string += f' packing_eff: {packing_efficiency:.1%} |'

# Store derived throughput metrics on RLRuntimeState so that
# downstream consumers (e.g. RLProfiler) can read them.
Contributor

We need to log those to wandb too.

Contributor Author

Addressed
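As a hedged sketch of what "log those to wandb too" could look like, using a stand-in writer object rather than the real wandb client (the metric names and the writer class are illustrative, not the PR's code):

```python
# Illustrative only: a stub writer collects the metrics that would be
# forwarded to a wandb-style logger alongside the console log line.
class StubWriter:
    def __init__(self):
        self.logged = {}

    def log(self, metrics, step):
        # Record metrics keyed by training step, mimicking wandb.log.
        self.logged[step] = metrics

writer = StubWriter()
tokens_per_sec = 125000.0
tokens_per_sec_per_gpu = tokens_per_sec / 8  # assume 8 GPUs
packing_efficiency = 0.87
iteration = 42

# Mirror the console throughput numbers into the metrics writer so
# dashboards see the same values as the log string.
writer.log({
    'rl/tokens_per_sec': tokens_per_sec,
    'rl/tokens_per_sec_per_gpu': tokens_per_sec_per_gpu,
    'rl/packing_efficiency': packing_efficiency,
}, step=iteration)
```

In the real code the writer would be whatever wandb handle Megatron's training loop already holds; the point is only that the derived metrics are pushed per-iteration, not just formatted into the log string.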

Returns:
Total compute tokens (num_bins * bin_size) on this rank.
"""
if packing_context is None or packing_context.packed_trajs is None:
Contributor

Your typing says that PackingContext cannot be None

Contributor Author

Addressed
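To illustrate the typing fix, a self-contained sketch of how the helper's signature and guard might be aligned. `PackingContext` here is a hypothetical stand-in using nested lists instead of tensors; the real class lives in the PR's rl utils:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PackingContext:
    """Hypothetical stand-in; the field name mirrors the quoted diff."""
    packed_trajs: Optional[List[list]]

def get_packing_compute_tokens(packing_context: PackingContext) -> int:
    """Total compute tokens (num_bins * bin_size) on this rank.

    The annotation promises a non-None context, so only the inner
    packed_trajs field still needs a guard.
    """
    if packing_context.packed_trajs is None:
        return 0
    num_bins = len(packing_context.packed_trajs)
    bin_size = len(packing_context.packed_trajs[0]) if num_bins else 0
    return num_bins * bin_size
```

Either the annotation should be `Optional[PackingContext]` or the `packing_context is None` branch should go; keeping both, as the reviewer notes, makes the contract ambiguous.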

num_ranks = mpu.get_data_parallel_world_size()
bins_per_rank = packing_context.packed_trajs.shape[0] if packing_context.packed_trajs is not None else 0
bin_size = packing_context.packed_trajs.shape[1] if packing_context.packed_trajs is not None else 0
total_capacity = bins_per_rank * bin_size * num_ranks
Contributor

Is this true that every rank will have the same amount of bins?

Contributor Author

Yes. distribute_packed_bins ensures it.
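The uniform-bins invariant makes the capacity math a plain product. A small illustrative sketch (function names are mine, not the PR's; the invariant that every DP rank holds the same number of bins is taken from the author's reply):

```python
def total_capacity(bins_per_rank: int, bin_size: int, num_ranks: int) -> int:
    # Because distribute_packed_bins gives every data-parallel rank the
    # same number of bins, total capacity is a simple product.
    return bins_per_rank * bin_size * num_ranks

def packing_efficiency(actual_tokens: int, capacity: int) -> float:
    # Fraction of packed slots holding real (non-padding) tokens.
    return actual_tokens / capacity if capacity else 0.0
```

If the bin count could differ per rank, the capacity would instead need an all-reduce over per-rank bin counts, which is exactly why the invariant matters here.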

@tdene
Contributor Author

tdene commented Mar 16, 2026

/claude review


# Add tokens/sec to log string
log_string += f' toks/s: {tokens_per_sec:.0f} |'
log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
Contributor

compute_tokens is assigned here but never used. Was this intended for something (e.g., a log line or the packing_efficiency calculation)? If not, it should be removed to avoid confusion.

Suggested change
log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
actual_tokens = rl_utils.get_packing_actual_tokens(runtime_state.packing_context)

Contributor Author

Addressed

packing_efficiency = rl_utils.get_packing_efficiency(runtime_state.packing_context)

# Add tokens/sec to log string
log_string += f' toks/s: {tokens_per_sec:.0f} |'
Contributor

Is this going to add this metric to the log for all training? I'm not sure we use this metric a lot in pretraining, so nervous it might just be adding noise to the log.

Contributor Author

I've moved all the extra metrics in training.py into a single if-block guarded by args.perform_rl_step; does that look good?
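A rough sketch of what grouping the RL-only metrics under a single guard might look like (names mirror the discussion, not the actual PR code):

```python
# Rough sketch: RL-only throughput metrics grouped under one guard, as
# the author describes, so pretraining logs stay unchanged.
def build_log_string(args, tokens_per_sec, world_size, packing_efficiency=None):
    log_string = ''
    if args.perform_rl_step:
        tokens_per_sec_per_gpu = tokens_per_sec / world_size
        log_string += f' toks/s: {tokens_per_sec:.0f} |'
        log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
        if packing_efficiency is not None:
            log_string += f' packing_eff: {packing_efficiency:.1%} |'
    return log_string

class _Args:
    """Hypothetical stand-in for Megatron's parsed args."""
    perform_rl_step = True

sample = build_log_string(_Args(), 128000.0, 8, packing_efficiency=0.87)
```

With `perform_rl_step` false the function returns an empty string, which addresses the reviewer's concern about adding noise to pretraining logs.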

@tdene tdene force-pushed the tde/observability_metrics branch from b215575 to 2b2a0d3 Compare March 19, 2026 21:21
