feat: Add perfetto tracing for async GRPO training #1876

Open
gspschmid wants to merge 2 commits into NVIDIA-NeMo:main from gspschmid:gschmid/tracing
Conversation

@gspschmid gspschmid commented Feb 4, 2026

Adds high-level tracing of async GRPO in the driver process, the trajectory collector and the replay buffer. This provides a quick visual impression of how much time each component of GRPO contributes and to what degree the components overlap.

Issues

List issues that this PR closes (syntax):

(None)

Usage

NEMORL_TRACE_ENABLED=1 NEMORL_TRACE_FILE=nemorl_trace.json RAY_ADDRESS=localhost:6379 uv run python examples/run_grpo.py --config ...

The resulting nemorl_trace.json can be opened in the usual perfetto trace viewer, e.g. via https://ui.perfetto.dev/. Here's what a simple example looks like (timings not necessarily representative, since this was run on a toy example, grpo_math_1B-2n8g-async-1off):
[Screenshot: example trace opened in the Perfetto UI]
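
For a quick numeric view of the same file without opening the UI, the trace can be summarized with a few lines of Python. This is a minimal sketch, assuming the file follows the Chrome Trace Event Format and that spans are stored as complete events (`"ph": "X"` with a `dur` field in microseconds). If the PR emits begin/end pairs instead, the events would need to be paired first. The function names are illustrative and not part of the PR.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load events from a Chrome Trace Format JSON file."""
    with open(path) as f:
        data = json.load(f)
    # The format allows a bare JSON array or an object with "traceEvents".
    return data["traceEvents"] if isinstance(data, dict) else data


def total_duration_by_name(events):
    """Sum durations (microseconds) of complete ("X") events per span name."""
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


def print_summary(path):
    """Print span names sorted by total time spent, longest first."""
    totals = total_duration_by_name(load_events(path))
    for name, micros in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<40} {micros / 1e6:10.3f} s")
```

Calling `print_summary("nemorl_trace.json")` would then list the spans that dominate the run. Note that nested spans are counted independently, so the totals do not add up to wall-clock time.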

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

cc @guyueh1

Summary by CodeRabbit

  • New Features
    • Enhanced tracing and observability infrastructure for detailed performance monitoring across training and collection workflows.
    • Added Chrome Trace Format support for compatibility with standard performance analysis tools.
    • Improved error handling and control flow logging across asynchronous operations.

@gspschmid gspschmid requested review from a team as code owners February 4, 2026 13:58
Contributor

coderabbitai bot commented Feb 4, 2026

Walkthrough

The changes introduce comprehensive tracing instrumentation to the async utilities and GRPO training algorithms. A new trace.py module provides a lightweight Tracer class with Chrome Trace Format compatibility for performance analysis. Async utilities (ReplayBuffer, AsyncTrajectoryCollector, workers) and GRPO training flows are instrumented with tracer spans and events around critical operations.

Changes

Cohort / File(s) | Summary
Tracing Infrastructure (nemo_rl/utils/trace.py) | New module introducing Tracer class with span/instant event tracking, Chrome Trace Format export, and utilities for environment-controlled tracing (tracing_enabled, new_tracer). Includes save_trace for merging local and Ray actor events, and trace_and_time context manager for combined tracing and timing.
Async Utilities Instrumentation (nemo_rl/algorithms/async_utils.py) | Added tracer initialization and collect_trace methods to ReplayBuffer and AsyncTrajectoryCollector. Wrapped key operations (push_with_wait_signal, sample, set_weight_version, worker execution, batch processing) with tracer spans and instant events. Added per-prompt worker tracers and loop-level tracing for trajectory collection.
GRPO Training Instrumentation (nemo_rl/algorithms/grpo.py) | Imported tracing utilities and created tracer instance in async_grpo_train. Replaced timer blocks with trace_and_time contexts around major training stages (step, sample, data_processing, logprob_inference_prep, training, refit, validation, checkpointing). Wrapped refit and training segments with tracer spans. Ensured tracer events are saved in finally/failure paths via save_trace.
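
To make the walkthrough concrete, here is a minimal sketch of what a span/instant tracer with Chrome Trace Format output can look like. It is not the code from nemo_rl/utils/trace.py: `MiniTracer` and `write_trace` are hypothetical names, and the choice of complete (`"X"`) events and wall-clock timestamps is an assumption made for illustration.

```python
import json
import os
import threading
import time
from contextlib import contextmanager


class MiniTracer:
    """Hypothetical stand-in for the PR's Tracer, not its actual code."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._events = []

    @staticmethod
    def _now_us():
        # Chrome Trace Format timestamps are in microseconds. Wall-clock time
        # gives events from different processes a shared time base.
        return time.time_ns() // 1000

    def _base(self, name, phase, args):
        return {"name": name, "ph": phase, "pid": os.getpid(),
                "tid": threading.get_ident(), "args": args}

    @contextmanager
    def span(self, name, **args):
        if not self.enabled:
            yield
            return
        start = self._now_us()
        try:
            yield
        finally:
            event = self._base(name, "X", args)  # "X" = complete event
            event.update(ts=start, dur=self._now_us() - start)
            self._events.append(event)

    def instant(self, name, **args):
        if self.enabled:
            event = self._base(name, "i", args)  # "i" = instant event
            event.update(ts=self._now_us(), s="t")  # thread-scoped marker
            self._events.append(event)

    def get_events(self):
        return list(self._events)


def write_trace(path, *event_lists):
    """Merge event lists (e.g. one per actor) into a single trace file."""
    merged = [event for events in event_lists for event in events]
    with open(path, "w") as f:
        json.dump({"traceEvents": merged}, f)
```

In this sketch a disabled tracer records nothing, so instrumented code paths stay cheap when tracing is off. Merging is a plain concatenation because each event already carries its own `pid` and `tid`.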

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes | ⚠️ Warning | PR introduces major tracing instrumentation across core async GRPO components but lacks test files, updated existing tests, and documented regression testing. | Add unit tests for Tracer class in test_trace.py and update existing async/GRPO tests. Document performance and convergence validation results in PR description.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately describes the main feature being added: Perfetto tracing instrumentation for async GRPO training across driver, trajectory collector, and replay buffer.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@nemo_rl/algorithms/async_utils.py`:
- Around line 308-318: The code currently always creates/accumulates worker
tracers causing memory growth; change initialization and usage of
_worker_tracers to be conditional on tracing being enabled: initialize
self._worker_tracers as an empty list only if self._tracer.enabled (otherwise
set to None or skip), and update any code that creates/appends worker tracers to
check self._tracer.enabled before creating/appending; ensure collect_trace
(method collect_trace) handles the disabled case by skipping iterating over
_worker_tracers when tracing is off and still returns events from the main
tracers.

In `@nemo_rl/algorithms/grpo.py`:
- Around line 2516-2549: Replace the manual tracer.start_span("training") /
tracer.end_span("training") pair with a context-manager form so spans are always
closed on exceptions: wrap the entire training block (the code that calls
policy.prepare_for_lp_inference(), policy.get_logprobs(...),
policy.get_reference_policy_logprobs(...), policy.prepare_for_training(), and
policy.train(...)) in a single with tracer.span("training"): block instead of
start_span/end_span; do the same for the validation block (the block that
currently uses tracer.start_span("validation")/tracer.end_span("validation")) so
both spans use with tracer.span("..."):. Ensure the code nested inside remains
unchanged and indented under the with blocks so the span is properly closed on
exception.

In `@nemo_rl/utils/trace.py`:
- Line 1: The file header in nemo_rl/utils/trace.py still shows "Copyright (c)
2025, NVIDIA CORPORATION." — update this top-of-file copyright year to 2026 so
the header reads "Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved."
and ensure the updated header appears at the very top of the file (affecting the
file that defines tracing utilities in this module).
- Around line 320-321: Remove the unnecessary f-string prefixes on the two print
statements that output Perfetto/Chrome tracing links in nemo_rl/utils/trace.py:
replace print(f"View in Perfetto UI: https://ui.perfetto.dev") and print(f"Or
open in Chrome: chrome://tracing") with plain string literals (print("View in
Perfetto UI: https://ui.perfetto.dev") and print("Or open in Chrome:
chrome://tracing")) to satisfy Ruff F541; update both occurrences and run the
linter to confirm the warning is cleared.
- Line 144: Unpack the tuple from self._span_stack.pop() using an
underscore-prefixed name for the unused value: change the current unpacking
"span_name, span_start, _span_metadata = self._span_stack.pop()" to use
"_span_start" instead of "span_start" so the unused variable follows Python
convention and suppresses the Ruff warning; ensure no other references to
span_start exist in the method before committing.
🧹 Nitpick comments (5)
nemo_rl/utils/trace.py (2)

49-53: Rename global RAY_AVAILABLE to use the G_ prefix.

This aligns the global with the required naming convention (and update any references accordingly).

🔧 Suggested fix
 try:
     import ray
-    RAY_AVAILABLE = True
+    G_RAY_AVAILABLE = True
 except ImportError:
-    RAY_AVAILABLE = False
+    G_RAY_AVAILABLE = False

As per coding guidelines, Use upper snake_case with G prefix for global variables, e.g., G_MY_GLOBAL.


276-335: Add Google-style docstrings for public tracing helpers.

tracing_enabled, new_tracer, define_collect_trace, save_trace, and trace_and_time are public helpers but currently lack docstrings.

✍️ Example (apply similarly to the other helpers)
 def tracing_enabled():
+    """Check whether tracing is enabled via environment variables.
+
+    Returns:
+        True if NEMORL_TRACE_ENABLED is truthy, else False.
+    """
     return os.environ.get("NEMORL_TRACE_ENABLED", "0").lower() in ("1", "true", "yes")

As per coding guidelines, Use Google style docstrings for classes and functions.

nemo_rl/algorithms/async_utils.py (2)

58-60: Document collect_trace actor APIs.

These methods are called remotely (e.g., during trace aggregation) and should carry Google-style docstrings.

✍️ Suggested update
     @define_collect_trace
     def collect_trace(self):
+        """Collect tracer events for Perfetto export."""
         return self._tracer.get_events()
@@
     @define_collect_trace
     def collect_trace(self):
+        """Collect tracer events for Perfetto export."""
         events = self._tracer.get_events()
         events.extend(self._loop_tracer.get_events())
         for worker_tracer in self._worker_tracers:
             events.extend(worker_tracer.get_events())
         return events

As per coding guidelines, Use Google style docstrings for classes and functions.

Also applies to: 312-318


635-655: Avoid hidden config defaults and narrow the exception type.

Use explicit config values (no implicit False defaults) and catch only expected failures from cache invalidation.

🛠️ Suggested fix
-            async_cfg = self.master_config.get("grpo", {}).get("async_grpo", {})
-            if async_cfg.get("in_flight_weight_updates", False) and async_cfg.get(
-                "recompute_kv_cache_after_weight_updates", False
-            ):
+            async_cfg = self.master_config["grpo"]["async_grpo"]
+            if async_cfg.get("in_flight_weight_updates") and async_cfg.get(
+                "recompute_kv_cache_after_weight_updates"
+            ):
                 try:
                     print("🔄 Invalidating vLLM prefix/KV caches after weight update")
                     invalidated = self.policy_generation.invalidate_kv_cache()
                     if invalidated:
                         print("✅ Invalidated vLLM prefix/KV caches after weight update")
                     else:
                         print(
                             "⚠️ vLLM cache invalidation reported partial/unsuccessful on some workers"
                         )
-                except Exception as e:
+                except RuntimeError as e:
                     print(f"⚠️ Failed to invalidate vLLM caches: {e}")

As per coding guidelines, YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values, and In try-except blocks, limit the except clause to the smallest set of errors possible.

nemo_rl/algorithms/grpo.py (1)

2859-2864: Narrow the exception type when saving traces.

Catching Exception masks unexpected failures; constrain it to the expected Ray/IO errors.

🛠️ Suggested fix
-        except Exception as e:
+        except (ray.exceptions.RayError, OSError) as e:
             print(f"Error saving tracer events: {e}")

As per coding guidelines, In try-except blocks, limit the except clause to the smallest set of errors possible.
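
The effect of narrowing the except clause can be seen in a small standalone sketch. It only handles `OSError` so that it runs without Ray installed, and `save_events` is a hypothetical helper rather than the PR's `save_trace`.

```python
import json


def save_events(events, path):
    """Write events to path; report I/O failures, let other errors propagate."""
    try:
        with open(path, "w") as f:
            json.dump({"traceEvents": events}, f)
    except OSError as e:
        # Expected failure class: missing directory, permissions, full disk.
        print(f"Error saving tracer events: {e}")
        return False
    return True
```

With this shape, a missing output directory is reported and swallowed, while a programming error such as a non-serializable event still raises `TypeError` and is not hidden behind a generic message.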

Comment on lines +2516 to +2549
+tracer.start_span("training")
 print("▶ Preparing for logprob inference...")
-with timer.time("logprob_inference_prep"):
+with trace_and_time("logprob_inference_prep"):
     policy.prepare_for_lp_inference()

 print("▶ Computing logprobs...")
-with timer.time("policy_and_reference_logprobs"):
-    fprop_logprobs = policy.get_logprobs(
-        train_data,
-        timer=timer,
-    )["logprobs"]
-    reference_logprobs = policy.get_reference_policy_logprobs(
-        train_data,
-        timer=timer,
-    )["reference_logprobs"]
+with trace_and_time("policy_and_reference_logprobs"):
+    with tracer.span("policy_logprobs"):
+        fprop_logprobs = policy.get_logprobs(
+            train_data,
+            timer=timer,
+        )["logprobs"]
+
+    with tracer.span("reference_policy_logprobs"):
+        reference_logprobs = policy.get_reference_policy_logprobs(
+            train_data,
+            timer=timer,
+        )["reference_logprobs"]
     train_data["prev_logprobs"] = fprop_logprobs
     train_data["reference_policy_logprobs"] = reference_logprobs

 print("▶ Preparing for training...")
-with timer.time("training_prep"):
+with trace_and_time("training_prep"):
     policy.prepare_for_training()
     POLICY_GENERATION_STALE = True

 print("▶ Training policy...")
-with timer.time("policy_training"):
+with trace_and_time("policy_training"):
     train_results = policy.train(
         train_data,
         loss_fn,
         timer=timer,
     )
+tracer.end_span("training")
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file exists and get its size
wc -l nemo_rl/algorithms/grpo.py

Repository: NVIDIA-NeMo/RL

Length of output: 89


🏁 Script executed:

# Now let's look at the code around lines 2516-2549
sed -n '2516,2549p' nemo_rl/algorithms/grpo.py

Repository: NVIDIA-NeMo/RL

Length of output: 1600


🏁 Script executed:

# And the code around lines 2587-2623
sed -n '2587,2623p' nemo_rl/algorithms/grpo.py

Repository: NVIDIA-NeMo/RL

Length of output: 1710


🏁 Script executed:

# Let's find where tracer is defined or imported
rg -n "tracer\s*=" nemo_rl/algorithms/grpo.py | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 88


🏁 Script executed:

# Search for where tracer's span method is called to understand its usage
rg -n "tracer\.span\(" nemo_rl/algorithms/grpo.py | head -10

Repository: NVIDIA-NeMo/RL

Length of output: 235


🏁 Script executed:

# Search for tracer import
rg -n "import.*tracer|from.*tracer" nemo_rl/algorithms/grpo.py

Repository: NVIDIA-NeMo/RL

Length of output: 115


🏁 Script executed:

# Find the tracer implementation
find nemo_rl -name "trace.py" -o -name "tracer.py"

Repository: NVIDIA-NeMo/RL

Length of output: 80


🏁 Script executed:

# Check if nemo_rl/utils/trace.py exists and read it
wc -l nemo_rl/utils/trace.py

Repository: NVIDIA-NeMo/RL

Length of output: 84


🏁 Script executed:

# Read the trace.py file
cat -n nemo_rl/utils/trace.py

Repository: NVIDIA-NeMo/RL

Length of output: 12541


Use context-manager spans to guarantee closure on exceptions.

The tracer.span() context manager ensures end_span() executes in a finally block, preventing unmatched spans if an exception bubbles up between start_span and end_span. Refactor the "training" span (lines 2516-2549) and "validation" span (lines 2587-2623) to use with tracer.span("name"): instead of manual start_span/end_span calls.
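
The difference between the two styles can be reproduced in isolation. The following is a minimal sketch with a hypothetical `StackTracer` that only tracks which spans are open. It mirrors the start_span/end_span/span method names from the review but is not the PR's implementation.

```python
from contextlib import contextmanager


class StackTracer:
    """Hypothetical tracer that tracks open spans on a stack."""

    def __init__(self):
        self.open_spans = []
        self.closed_spans = []

    def start_span(self, name):
        self.open_spans.append(name)

    def end_span(self, name):
        top = self.open_spans.pop()
        if top != name:
            raise RuntimeError(f"span mismatch: expected {top!r}, got {name!r}")
        self.closed_spans.append(name)

    @contextmanager
    def span(self, name):
        self.start_span(name)
        try:
            yield
        finally:
            # Runs on normal exit and when the body raises.
            self.end_span(name)


def failing_step():
    raise ValueError("simulated training failure")


# Manual pairing: the exception skips end_span, so the span stays open.
manual = StackTracer()
try:
    manual.start_span("training")
    failing_step()
    manual.end_span("training")  # never reached
except ValueError:
    pass

# Context manager: the finally block closes the span despite the exception.
managed = StackTracer()
try:
    with managed.span("training"):
        failing_step()
except ValueError:
    pass

print(manual.open_spans)   # ['training']
print(managed.open_spans)  # []
```

In a tracer that exports only closed spans, the leaked "training" span in the manual variant would be missing from the trace of exactly the step that failed.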

@gspschmid gspschmid changed the title Add perfetto tracing for async GRPO training feat: Add perfetto tracing for async GRPO training Feb 4, 2026
@gspschmid gspschmid requested a review from a team as a code owner February 4, 2026 15:05
Signed-off-by: Georg Stefan Schmid <gschmid@nvidia.com>
@gspschmid
Author

Fwiw, I noticed that we might also be able to insert additional traces in Ray's own timeline, though I'm not sure what trade-offs that would come with. From the Ray documentation (https://docs.ray.io/en/latest/ray-observability/user-guides/ray-tracing.html#tracing) it seems that the feature is deprecated and requires OpenTelemetry as an external dependency.

Contributor

@terrykong terrykong left a comment

thanks for the PR!

a couple of questions:

  1. Could you share how this compares to OpenTelemetry or Ray's actor timeline feature?
  2. Is it possible to generalize the decorators on utilities (ex) so that the tracing utilities stay hidden and we avoid adding more indentation in the algorithms/* files?

cc @guyueh1

@gspschmid
Author

Re 1. In terms of tracing output I think it would be relatively similar, i.e. a single perfetto trace that covers spans across various Ray actors. Annotation overhead (in terms of additional code) would be comparable as well: spans are introduced via context managers in either case (https://docs.ray.io/en/latest/ray-observability/user-guides/ray-tracing.html#custom-traces).

I was initially wondering whether the existing Ray infrastructure might be more robust, but given the note at the top of that documentation page ("Tracing is an Alpha feature and no longer under active development/being maintained. APIs are subject to change.") I am less inclined to investigate much further. In any case, the motivation behind this PR is mostly to trace a handful of steps of a training job to inform which performance optimizations we should focus on.

Re 2. I agree that the fewer lines of code we touch (and indent), the better. Where we can use decorators we probably should, and indeed it might make sense to add a more general @span decorator that subsumes timer.time, tracer.span and nvtx.annotate. Some of the functions like async_grpo_train might benefit from being broken up into smaller subfunctions, which would also give us a natural place to decorate those sections (thereby avoiding further indentation).
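
One possible shape for such a decorator, as a rough sketch: it takes the span name plus any number of context-manager factories and enters all of them around the call. The `span` decorator and the `record` helper below are hypothetical. In real use the factories would presumably be things like `timer.time`, `tracer.span` and `nvtx.annotate`, each called with the span name.

```python
import functools
from contextlib import ExitStack, contextmanager


def span(name, *context_factories):
    """Run the decorated function inside every context manager produced by
    the given factories; each factory is called with the span name."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with ExitStack() as stack:
                for factory in context_factories:
                    stack.enter_context(factory(name))
                return fn(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical recorder standing in for a timer or tracer backend.
log = []


@contextmanager
def record(kind, name):
    log.append((kind, "enter", name))
    try:
        yield
    finally:
        log.append((kind, "exit", name))


@span(
    "policy_training",
    functools.partial(record, "timer"),
    functools.partial(record, "tracer"),
)
def train_step(x):
    return x * 2
```

Because `ExitStack` unwinds in reverse order, the backends nest cleanly (the first factory listed is the outermost), and the decorated function body needs no extra indentation.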

@guyueh1
Contributor

guyueh1 commented Feb 4, 2026

@youngeunkwon0405 please review

@youngeunkwon0405
Contributor

Hi @gspschmid, thanks for your contribution. This will be a very useful feature. I have a few suggestions on this PR.

  • The names are too general. Can you change the class/method names to something more specific (e.g., Tracer --> PerfettoTracer or ChromeTracer, new_tracer --> new_tracer, save_trace --> save_trace, etc.)?
  • Can we enable this feature through the config YAML file instead of an environment variable? This would improve the feature's visibility and make it easier for more users to adopt.
  • Can you also document this feature somewhere in /RL/docs/?
  • There are merge conflicts and DCO errors that need to be resolved.
