11 changes: 7 additions & 4 deletions docs/about/concepts/index.md
@@ -1,7 +1,3 @@
---
orphan: true
---

(about-concepts)=
# Understanding Concepts for {{product_name}}

@@ -44,6 +40,12 @@ Understand how servers are configured and connected.
Understand the importance of verification and common implementation patterns.
:::

:::{grid-item-card} {octicon}`sync;1.5em;sd-mr-1` On-Policy Training
:link: on-policy-training
:link-type: ref
Understand on-policy vs off-policy training and why token/logprob mismatches cause instability.
:::

:::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Key Terminology
:link: key-terminology
:link-type: ref
@@ -62,5 +64,6 @@ Core Components <core-components>
Architecture <architecture>
Configuration System <configuration>
Task Verification <task-verification>
On-Policy Training <on-policy-training>
Key Terminology <key-terminology>
```
50 changes: 50 additions & 0 deletions docs/about/concepts/on-policy-training.md
@@ -0,0 +1,50 @@
(on-policy-training)=

# On-Policy Training

In reinforcement learning, **on-policy training** means that the model weights being trained are the same weights used for inference. When the token IDs and sampling log probabilities (logprobs) match between generation and training, you have on-policy training. When they don't match, you have off-policy training, which can cause instability or collapse, though in some cases it may be acceptable or even desirable.

## Why On-Policy Matters

Policy optimization algorithms backpropagate through a loss calculated using logprobs. When the logprobs computed during training differ from those computed during generation, the gradients become large and potentially unreliable. Small mismatches are tolerable, but large mismatches can cause training runs to crash.
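
To get a feel for how quickly a mismatch compounds, here is a back-of-the-envelope sketch that multiplies a constant per-token logprob gap across a 100-token response. The gap values and sequence length are illustrative assumptions, not measurements:

```python
import math

# Assumed per-token gap between training-time and generation-time logprobs.
for per_token_gap in (0.01, 0.1, 0.5):
    # Per-token probability ratios multiply across the response, so the
    # sequence-level ratio grows exponentially with response length.
    sequence_ratio = math.exp(per_token_gap * 100)
    print(f"gap={per_token_gap}: sequence-level ratio ~ {sequence_ratio:.3g}")
```

A per-token gap of 0.01 gives a sequence-level ratio of about 2.7, while a gap of 0.5 gives a ratio on the order of 10^21, which is why small mismatches are survivable and large ones are not.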

### Common Causes of Mismatch

Train-generation mismatch can arise from several sources, including differences between the training and inference algorithm or kernel implementations (e.g., vLLM vs. Megatron-Core), as well as from the scenarios below:

**Re-tokenization**
: When generated tokens are de-tokenized to strings and then re-tokenized for the next model call, the token IDs can change. For example, tokens that de-tokenize to `"_Ski" + "nny"` might re-tokenize as a single `"_Skinny"` token (see the sketch at the end of this section).

**Re-chat templating**
: When the model's output is parsed into structured objects (like tool calls) and then re-rendered through a chat template, the formatting can differ from the original generation.

**Non-monotonic history**
: When rollout history is modified during execution—such as truncating reasoning traces or summarizing context—the prompt token IDs at training time differ from those seen during generation.

:::{tip}
For a detailed technical explanation of these problems and their solutions, refer to {doc}`../../contribute/rl-framework-integration/openai-compatible-http-server-on-policy-correction`.
:::
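
As a minimal sketch of the re-tokenization problem, the snippet below compares generation-time token IDs with the IDs obtained by de-tokenizing and re-tokenizing the same text. It assumes the Hugging Face `transformers` library is installed; the model name and the `" Ski" + "nny"` split are only illustrative, and exact token boundaries depend on the vocabulary:

```python
from transformers import AutoTokenizer

# Any BPE-style tokenizer illustrates the point; the model name is an example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Pretend the model emitted these two pieces as separate tokens during generation.
generated_ids = (
    tokenizer.encode(" Ski", add_special_tokens=False)
    + tokenizer.encode("nny", add_special_tokens=False)
)

# De-tokenize to a string, then re-tokenize for the next model call.
text = tokenizer.decode(generated_ids)
retokenized_ids = tokenizer.encode(text, add_special_tokens=False)

# The merged string often maps to a different (and usually shorter) ID sequence,
# so the tokens seen at training time no longer match those used at generation time.
print(generated_ids, "->", retokenized_ids, "| match:", generated_ids == retokenized_ids)
```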

### Example: Reasoning Models

Models like Qwen3 that produce reasoning traces present a specific challenge. Some chat templates remove reasoning from previous turns during inference, but training sees the full trajectory. This creates a logprob mismatch because the model computes probabilities against a different context.
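
One rough way to check whether a given chat template drops prior-turn reasoning is to render a short conversation through `apply_chat_template` and look for the reasoning markers. The model name, messages, and `<think>` tags below are illustrative assumptions; the actual behavior depends on the template shipped with your model:

```python
from transformers import AutoTokenizer

# Illustrative model name; substitute the reasoning model you are training.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>Simple arithmetic.</think>\n4"},
    {"role": "user", "content": "And 3 + 3?"},
]

# The prompt that the inference server renders for the next turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# If the template strips reasoning from earlier turns, the <think> block is absent
# here, while training typically replays the raw tokens that still contain it.
print("<think>" in prompt)
```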

## Recommended Approaches

### For Models with Reasoning Traces

1. **Preferred**: Disable reasoning truncation and keep reasoning across all turns
2. **Alternative**: Disable monotonicity enforcement and on-policy token ID correction

### For Agents with Context Management

- Evaluate whether history modification is necessary for your use case
- If you must modify history, monitor training stability closely with checks disabled
- Leverage importance sampling to account for the remaining off-policy gap (see the sketch below)
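
The importance-sampling point can be made concrete with a small sketch of a PPO-style clipped, importance-weighted policy-gradient loss. This is a generic formulation with illustrative tensor values, not the actual implementation used by NeMo Gym or any particular training framework:

```python
import torch

def clipped_is_loss(train_logprobs, gen_logprobs, advantages, clip_eps=0.2):
    """Token-level policy-gradient loss with importance-sampling correction."""
    # Ratio between the training policy and the (possibly stale) generation policy.
    ratio = torch.exp(train_logprobs - gen_logprobs)
    # Clipping bounds the influence of tokens where the two policies disagree.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Example with illustrative values: a mild off-policy gap over three tokens.
loss = clipped_is_loss(
    train_logprobs=torch.tensor([-1.2, -0.8, -2.0]),
    gen_logprobs=torch.tensor([-1.0, -0.9, -2.3]),
    advantages=torch.tensor([0.5, 0.5, 0.5]),
)
print(loss)
```

Importance weighting bounds, but does not remove, the effect of the off-policy gap, so monitoring logprob divergence during training is still recommended.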

## Related Topics

- {doc}`../../contribute/rl-framework-integration/openai-compatible-http-server-on-policy-correction` — Technical details on on-policy corrections
- {doc}`../../environment-tutorials/multi-turn` — Multi-turn environments where on-policy enforcement matters
- {doc}`../../environment-tutorials/multi-step` — Multi-step environments with sequential tool calls
- {doc}`key-terminology` — RL vocabulary including rollouts, logprobs, and policy
4 changes: 4 additions & 0 deletions docs/environment-tutorials/multi-step.md
@@ -28,6 +28,10 @@ Build training environments where models make sequential tool calls, using resul

---

:::{note}
Multi-step environments enforce on-policy training by default. See {doc}`/about/concepts/on-policy-training` for more information.
:::

## Key Concepts

Before starting, understand these NeMo Gym terms:
4 changes: 4 additions & 0 deletions docs/environment-tutorials/multi-turn.md
@@ -28,6 +28,10 @@ Train models on extended conversations where context accumulates across user/ass

---

:::{note}
Multi-turn environments enforce on-policy training by default. See {doc}`/about/concepts/on-policy-training` for more information.
:::

## Why Multi-Turn RL?

Standard fine-tuning trains models on static conversation transcripts. Multi-turn RL goes further:
5 changes: 5 additions & 0 deletions docs/reference/faq.md
@@ -655,6 +655,11 @@ You may need to reformat some of your docstrings to Napoleon format docstrings h


# FAQ: NeMo Gym, training frameworks, and token IDs

:::{seealso}
For conceptual background on why token ID alignment matters, see {doc}`/about/concepts/on-policy-training`.
:::

One of the goals of NeMo Gym is to act as a rollout tool for LLM post-training, either as synthetic data generation for SFT or as training environments for RL.

RL training frameworks don't typically operate in OpenAI schema; they operate on token IDs. It is especially critical to always have the correct token IDs during training so that we stay on-policy and so that what we think the model sees is what the model actually sees. However, when providing this OpenAI-schema-compatible interface to training environment developers, we lose track of the token IDs in Gym.
22 changes: 5 additions & 17 deletions docs/training-tutorials/verl.md
@@ -180,25 +180,13 @@ ng_collect_rollouts +agent_name=math_simple_agent \

---

## Production Alternative: NeMo RL
## Alternative: NeMo RL

NeMo RL provides **working** NeMo Gym integration today:
NeMo RL provides working NeMo Gym integration today:

```bash
# Install NeMo RL (separate from NeMo Gym)
pip install nemo-rl

# Run GRPO training with NeMo Gym environment
python examples/nemo_gym/run_grpo_nemo_gym.py \
--config examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
```

NeMo RL includes:

- ✅ Full rollout orchestration
- ✅ On-policy token ID corrections
- ✅ GRPO (Group Relative Policy Optimization) and DAPO (Diversity-Aware Policy Optimization)
- ✅ Multi-node distributed training
- Efficient and scalable post-training
- SFT, DPO, GRPO, DAPO
- Multi-turn and agent environments

**Reference**: `nemo_rl/environments/nemo_gym.py`

2 changes: 0 additions & 2 deletions docs/troubleshooting/index.md
@@ -7,5 +7,3 @@ orphan: true
# Troubleshooting

Solutions for common errors and issues.