feat: Add perfetto tracing for async GRPO training #1876
gspschmid wants to merge 2 commits into NVIDIA-NeMo:main
Conversation
Force-pushed from 580a4a5 to 0fa44fc
📝 Walkthrough

The changes introduce comprehensive tracing instrumentation to the async utilities and GRPO training algorithms.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@nemo_rl/algorithms/async_utils.py`:
- Around line 308-318: The code currently always creates/accumulates worker
tracers causing memory growth; change initialization and usage of
_worker_tracers to be conditional on tracing being enabled: initialize
self._worker_tracers as an empty list only if self._tracer.enabled (otherwise
set to None or skip), and update any code that creates/appends worker tracers to
check self._tracer.enabled before creating/appending; ensure collect_trace
(method collect_trace) handles the disabled case by skipping iterating over
_worker_tracers when tracing is off and still returns events from the main
tracers.
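The suggested fix can be sketched as follows. `FakeTracer`, `CollectorSketch`, and `register_worker_tracer` are hypothetical stand-ins for the actual classes in `async_utils.py`, shown only to illustrate the conditional-accumulation pattern:

```python
class FakeTracer:
    """Minimal stand-in for the PR's tracer (hypothetical, for illustration)."""

    def __init__(self, enabled):
        self.enabled = enabled
        self._events = []

    def get_events(self):
        return list(self._events)


class CollectorSketch:
    """Sketch of the suggested fix: only accumulate worker tracers when
    tracing is enabled, so a disabled run never grows per-worker state."""

    def __init__(self, tracer):
        self._tracer = tracer
        # Only allocate the list when tracing is on.
        self._worker_tracers = [] if tracer.enabled else None

    def register_worker_tracer(self, worker_tracer):
        # Skip accumulation entirely when tracing is off.
        if self._tracer.enabled:
            self._worker_tracers.append(worker_tracer)

    def collect_trace(self):
        events = self._tracer.get_events()
        # Handle the disabled case: _worker_tracers may be None.
        if self._worker_tracers:
            for worker_tracer in self._worker_tracers:
                events.extend(worker_tracer.get_events())
        return events


# Disabled tracing: no worker tracers are retained.
off = CollectorSketch(FakeTracer(enabled=False))
off.register_worker_tracer(FakeTracer(enabled=False))
assert off._worker_tracers is None
assert off.collect_trace() == []

# Enabled tracing: worker tracers are retained and drained.
on = CollectorSketch(FakeTracer(enabled=True))
on.register_worker_tracer(FakeTracer(enabled=True))
assert len(on._worker_tracers) == 1
```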
In `@nemo_rl/algorithms/grpo.py`:
- Around line 2516-2549: Replace the manual tracer.start_span("training") /
tracer.end_span("training") pair with a context-manager form so spans are always
closed on exceptions: wrap the entire training block (the code that calls
policy.prepare_for_lp_inference(), policy.get_logprobs(...),
policy.get_reference_policy_logprobs(...), policy.prepare_for_training(), and
policy.train(...)) in a single with tracer.span("training"): block instead of
start_span/end_span; do the same for the validation block (the block that
currently uses tracer.start_span("validation")/tracer.end_span("validation")) so
both spans use with tracer.span("..."):. Ensure the code nested inside remains
unchanged and indented under the with blocks so the span is properly closed on
exception.
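Why the context-manager form matters can be shown with a toy tracer (`MiniTracer` is hypothetical; the real tracer lives in `nemo_rl/utils/trace.py`):

```python
import contextlib


class MiniTracer:
    """Toy stand-in for the PR's tracer (hypothetical, for illustration only)."""

    def __init__(self):
        self.open_spans = []  # spans started but not yet ended
        self.events = []      # completed spans

    def start_span(self, name):
        self.open_spans.append(name)

    def end_span(self, name):
        self.open_spans.remove(name)
        self.events.append(name)

    @contextlib.contextmanager
    def span(self, name):
        # end_span runs in the finally block, even if the body raises.
        self.start_span(name)
        try:
            yield
        finally:
            self.end_span(name)


# Manual pairing leaks the span when the body raises:
tracer = MiniTracer()
try:
    tracer.start_span("training")
    raise RuntimeError("simulated failure mid-step")
except RuntimeError:
    pass
assert tracer.open_spans == ["training"]  # unmatched span left behind

# The with-form closes the span despite the same exception:
tracer2 = MiniTracer()
try:
    with tracer2.span("training"):
        raise RuntimeError("simulated failure mid-step")
except RuntimeError:
    pass
assert tracer2.open_spans == []  # span properly closed
assert tracer2.events == ["training"]
```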
In `@nemo_rl/utils/trace.py`:
- Line 1: The file header in nemo_rl/utils/trace.py still shows "Copyright (c)
2025, NVIDIA CORPORATION." — update this top-of-file copyright year to 2026 so
the header reads "Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved."
and ensure the updated header appears at the very top of the file (affecting the
file that defines tracing utilities in this module).
- Around line 320-321: Remove the unnecessary f-string prefixes on the two print
statements that output Perfetto/Chrome tracing links in nemo_rl/utils/trace.py:
replace print(f"View in Perfetto UI: https://ui.perfetto.dev") and print(f"Or
open in Chrome: chrome://tracing") with plain string literals (print("View in
Perfetto UI: https://ui.perfetto.dev") and print("Or open in Chrome:
chrome://tracing")) to satisfy Ruff F541; update both occurrences and run the
linter to confirm the warning is cleared.
- Line 144: Unpack the tuple from self._span_stack.pop() using an
underscore-prefixed name for the unused value: change the current unpacking
"span_name, span_start, _span_metadata = self._span_stack.pop()" to use
"_span_start" instead of "span_start" so the unused variable follows Python
convention and suppresses the Ruff warning; ensure no other references to
span_start exist in the method before committing.
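A minimal illustration of the convention (the stack-entry values here are made up):

```python
import time

# Illustrative span-stack entry: (name, start time, metadata).
span_stack = [("training", time.perf_counter(), {"step": 1})]

# Underscore-prefix the unused bindings so Ruff's unused-variable rule passes:
span_name, _span_start, _span_metadata = span_stack.pop()
assert span_name == "training"
assert span_stack == []
```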
🧹 Nitpick comments (5)
nemo_rl/utils/trace.py (2)
49-53: Rename global `RAY_AVAILABLE` to use the `G_` prefix. This aligns the global with the required naming convention (and update any references accordingly).
🔧 Suggested fix
```diff
 try:
     import ray
-    RAY_AVAILABLE = True
+    G_RAY_AVAILABLE = True
 except ImportError:
-    RAY_AVAILABLE = False
+    G_RAY_AVAILABLE = False
```

As per coding guidelines: use upper snake_case with a `G_` prefix for global variables, e.g., `G_MY_GLOBAL`.
276-335: Add Google-style docstrings for public tracing helpers.

`tracing_enabled`, `new_tracer`, `define_collect_trace`, `save_trace`, and `trace_and_time` are public helpers but currently lack docstrings.

✍️ Example (apply similarly to the other helpers)

```diff
 def tracing_enabled():
+    """Check whether tracing is enabled via environment variables.
+
+    Returns:
+        True if NEMORL_TRACE_ENABLED is truthy, else False.
+    """
     return os.environ.get("NEMORL_TRACE_ENABLED", "0").lower() in ("1", "true", "yes")
```

As per coding guidelines: use Google style docstrings for classes and functions.
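Since the suggested docstring documents the env-var behavior, here is a runnable sketch of that check. The optional `environ` parameter is an addition for testability; the real helper reads `os.environ` directly:

```python
import os


def tracing_enabled(environ=None):
    """Check whether tracing is enabled via environment variables.

    The `environ` parameter is a testability addition; the actual helper in
    nemo_rl/utils/trace.py reads os.environ directly.
    """
    env = os.environ if environ is None else environ
    return env.get("NEMORL_TRACE_ENABLED", "0").lower() in ("1", "true", "yes")


# Truthy spellings are accepted case-insensitively; anything else is off.
assert tracing_enabled({"NEMORL_TRACE_ENABLED": "1"})
assert tracing_enabled({"NEMORL_TRACE_ENABLED": "True"})
assert not tracing_enabled({})
assert not tracing_enabled({"NEMORL_TRACE_ENABLED": "off"})
```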
nemo_rl/algorithms/async_utils.py (2)
58-60: Document the `collect_trace` actor APIs. These methods are called remotely (e.g., during trace aggregation) and should carry Google-style docstrings.

✍️ Suggested update

```diff
 @define_collect_trace
 def collect_trace(self):
+    """Collect tracer events for Perfetto export."""
     return self._tracer.get_events()
@@
 @define_collect_trace
 def collect_trace(self):
+    """Collect tracer events for Perfetto export."""
     events = self._tracer.get_events()
     events.extend(self._loop_tracer.get_events())
     for worker_tracer in self._worker_tracers:
         events.extend(worker_tracer.get_events())
     return events
```

As per coding guidelines: use Google style docstrings for classes and functions.
Also applies to: 312-318
635-655: Avoid hidden config defaults and narrow the exception type. Use explicit config values (no implicit `False` defaults) and catch only expected failures from cache invalidation.

🛠️ Suggested fix

```diff
-        async_cfg = self.master_config.get("grpo", {}).get("async_grpo", {})
-        if async_cfg.get("in_flight_weight_updates", False) and async_cfg.get(
-            "recompute_kv_cache_after_weight_updates", False
-        ):
+        async_cfg = self.master_config["grpo"]["async_grpo"]
+        if async_cfg.get("in_flight_weight_updates") and async_cfg.get(
+            "recompute_kv_cache_after_weight_updates"
+        ):
             try:
                 print("🔄 Invalidating vLLM prefix/KV caches after weight update")
                 invalidated = self.policy_generation.invalidate_kv_cache()
                 if invalidated:
                     print("✅ Invalidated vLLM prefix/KV caches after weight update")
                 else:
                     print(
                         "⚠️ vLLM cache invalidation reported partial/unsuccessful on some workers"
                     )
-            except Exception as e:
+            except RuntimeError as e:
                 print(f"⚠️ Failed to invalidate vLLM caches: {e}")
```

As per coding guidelines: YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values. In try-except blocks, limit the except clause to the smallest set of errors possible.
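The rationale for dropping the implicit `False` defaults can be illustrated with a plain dict (the config contents here are invented):

```python
# Invented config for illustration; real values come from the YAML config.
cfg = {"grpo": {"async_grpo": {"in_flight_weight_updates": True}}}

# Direct indexing fails loudly if the YAML section is missing...
async_cfg = cfg["grpo"]["async_grpo"]

# ...while .get() without a hard-coded default returns None (falsy) for a
# missing key instead of silently introducing a second source of defaults:
assert async_cfg.get("in_flight_weight_updates") is True
assert async_cfg.get("recompute_kv_cache_after_weight_updates") is None

raised = False
try:
    cfg["grpo"]["missing_section"]
except KeyError:
    raised = True  # incomplete YAML surfaces immediately
assert raised
```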
nemo_rl/algorithms/grpo.py (1)
2859-2864: Narrow the exception type when saving traces. Catching `Exception` masks unexpected failures; constrain it to the expected Ray/IO errors.

🛠️ Suggested fix

```diff
-        except Exception as e:
+        except (ray.exceptions.RayError, OSError) as e:
             print(f"Error saving tracer events: {e}")
```

As per coding guidelines: in try-except blocks, limit the except clause to the smallest set of errors possible.
```diff
+tracer.start_span("training")
 print("▶ Preparing for logprob inference...")
-with timer.time("logprob_inference_prep"):
+with trace_and_time("logprob_inference_prep"):
     policy.prepare_for_lp_inference()

 print("▶ Computing logprobs...")
-with timer.time("policy_and_reference_logprobs"):
-    fprop_logprobs = policy.get_logprobs(
-        train_data,
-        timer=timer,
-    )["logprobs"]
-    reference_logprobs = policy.get_reference_policy_logprobs(
-        train_data,
-        timer=timer,
-    )["reference_logprobs"]
+with trace_and_time("policy_and_reference_logprobs"):
+    with tracer.span("policy_logprobs"):
+        fprop_logprobs = policy.get_logprobs(
+            train_data,
+            timer=timer,
+        )["logprobs"]
+
+    with tracer.span("reference_policy_logprobs"):
+        reference_logprobs = policy.get_reference_policy_logprobs(
+            train_data,
+            timer=timer,
+        )["reference_logprobs"]
 train_data["prev_logprobs"] = fprop_logprobs
 train_data["reference_policy_logprobs"] = reference_logprobs

 print("▶ Preparing for training...")
-with timer.time("training_prep"):
+with trace_and_time("training_prep"):
     policy.prepare_for_training()
     POLICY_GENERATION_STALE = True

 print("▶ Training policy...")
-with timer.time("policy_training"):
+with trace_and_time("policy_training"):
     train_results = policy.train(
         train_data,
         loss_fn,
         timer=timer,
     )
+tracer.end_span("training")
```
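The `trace_and_time` helper replacing `timer.time` above can be imagined as a context manager that both times a block and records a span. This is a hypothetical sketch, not the actual implementation in `nemo_rl/utils/trace.py`:

```python
import contextlib
import time


@contextlib.contextmanager
def trace_and_time(name, timer_log=None):
    """Hypothetical sketch: time a block and record it under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record even when the block raises, mirroring context-manager spans.
        elapsed = time.perf_counter() - start
        if timer_log is not None:
            timer_log[name] = elapsed


timings = {}
with trace_and_time("policy_training", timer_log=timings):
    time.sleep(0.01)
assert timings["policy_training"] > 0

try:
    with trace_and_time("failing_step", timer_log=timings):
        raise ValueError("boom")
except ValueError:
    pass
assert "failing_step" in timings  # recorded despite the exception
```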
🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, let's check the file exists and get its size
wc -l nemo_rl/algorithms/grpo.py

# Now let's look at the code around lines 2516-2549
sed -n '2516,2549p' nemo_rl/algorithms/grpo.py

# And the code around lines 2587-2623
sed -n '2587,2623p' nemo_rl/algorithms/grpo.py

# Let's find where tracer is defined or imported
rg -n "tracer\s*=" nemo_rl/algorithms/grpo.py | head -20

# Search for where tracer's span method is called to understand its usage
rg -n "tracer\.span\(" nemo_rl/algorithms/grpo.py | head -10

# Search for tracer import
rg -n "import.*tracer|from.*tracer" nemo_rl/algorithms/grpo.py

# Find the tracer implementation
find nemo_rl -name "trace.py" -o -name "tracer.py"

# Check if nemo_rl/utils/trace.py exists and read it
wc -l nemo_rl/utils/trace.py

# Read the trace.py file
cat -n nemo_rl/utils/trace.py
```
Use context-manager spans to guarantee closure on exceptions.
The `tracer.span()` context manager ensures `end_span()` executes in a `finally` block, preventing unmatched spans if an exception bubbles up between `start_span` and `end_span`. Refactor the "training" span (lines 2516-2549) and "validation" span (lines 2587-2623) to use `with tracer.span("name"):` instead of manual `start_span`/`end_span` calls.
Signed-off-by: Georg Stefan Schmid <gschmid@nvidia.com>
Force-pushed from 0aaaf93 to 438f285
Fwiw, I noticed that we might also be able to insert additional traces in Ray's own timeline, though I'm not sure what trade-offs that would come with. From the Ray documentation (https://docs.ray.io/en/latest/ray-observability/user-guides/ray-tracing.html#tracing) it seems that the feature is deprecated and requires OpenTelemetry as an external dependency. |
Re 1: In terms of tracing output I think it would be relatively similar, i.e. a single perfetto trace that covers spans across various Ray actors. Annotation overhead (in terms of additional code) would be comparable as well: spans are introduced via context managers in either case (https://docs.ray.io/en/latest/ray-observability/user-guides/ray-tracing.html#custom-traces). I was initially wondering whether the existing Ray infrastructure might be more robust, but given the note on top of that documentation page ("Tracing is an Alpha feature and no longer under active development/being maintained. APIs are subject to change.") I am less inclined to investigate much more deeply. In any case, the motivation behind this PR is mostly to trace a handful of steps of a training job to inform which performance optimizations we should focus on.

Re 2: I agree that the fewer lines of code we touch (and indent), the better. Where we can use decorators we probably should, and indeed it might make sense to add a more general decorator.
@youngeunkwon0405 please review |
Hi @gspschmid, thanks for your contribution. This will be a very useful feature. I have a few suggestions on this PR.
Adds high-level tracing of async GRPO in the driver process, the trajectory collector, and the replay buffer. This provides a quick visual impression of the time each component of GRPO contributes and the degree to which they overlap.
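For context, Perfetto-compatible traces are typically Chrome trace-event JSON. A minimal sketch of what a file like `nemorl_trace.json` might contain (the field names follow the trace-event format; the event payloads are invented):

```python
import json

# Two complete ("ph": "X") duration events on separate pids; ts/dur are in
# microseconds. All values here are made up for illustration.
events = [
    {"name": "training", "ph": "X", "ts": 0, "dur": 1_500_000, "pid": 0, "tid": 0},
    {"name": "generation", "ph": "X", "ts": 500_000, "dur": 2_000_000, "pid": 1, "tid": 0},
]
payload = json.dumps({"traceEvents": events})

# Round-trips cleanly; this is the general shape https://ui.perfetto.dev/ accepts.
decoded = json.loads(payload)
assert decoded["traceEvents"][0]["name"] == "training"
assert decoded["traceEvents"][1]["dur"] == 2_000_000
```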
Issues
List issues that this PR closes:
(None)
Usage
The resulting `nemorl_trace.json` can be opened in the usual perfetto trace viewer, e.g. via https://ui.perfetto.dev/. Here's what a simple example looks like (timings not necessarily representative, since this was run on a toy example, grpo_math_1B-2n8g-async-1off):



Before your PR is "Ready for review"
Pre checks:
cc @guyueh1