diff --git a/docs/about/concepts/index.md b/docs/about/concepts/index.md
index f0d038dbd..0fb181452 100644
--- a/docs/about/concepts/index.md
+++ b/docs/about/concepts/index.md
@@ -1,7 +1,3 @@
----
-orphan: true
----
-
 (about-concepts)=
 
 # Understanding Concepts for {{product_name}}
@@ -44,6 +40,12 @@ Understand how servers are configured and connected.
 Understand the importance of verification and common implementation patterns.
 :::
 
+:::{grid-item-card} {octicon}`sync;1.5em;sd-mr-1` On-Policy Training
+:link: on-policy-training
+:link-type: ref
+Understand on-policy vs off-policy training and why token/logprob mismatches cause instability.
+:::
+
 :::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Key Terminology
 :link: key-terminology
 :link-type: ref
@@ -62,5 +64,6 @@ Core Components
 Architecture
 Configuration System
 Task Verification
+On-Policy Training
 Key Terminology
 ```
diff --git a/docs/about/concepts/on-policy-training.md b/docs/about/concepts/on-policy-training.md
new file mode 100644
index 000000000..a51197972
--- /dev/null
+++ b/docs/about/concepts/on-policy-training.md
@@ -0,0 +1,50 @@
+(on-policy-training)=
+
+# On-Policy Training
+
+In reinforcement learning, **on-policy training** means that the model weights you are training match the model weights used for inference. When token IDs and sampling log probabilities (logprobs) match between generation and training, you have on-policy training. When they don't match, you have off-policy training—which can cause instability or collapse, though in some cases it may be acceptable or even desired.
+
+## Why On-Policy Matters
+
+Policy optimization algorithms backpropagate through a loss calculated using logprobs. When the logprobs computed during training differ from those computed during generation, the gradients become large and potentially unreliable. Small mismatches are tolerable, but large mismatches can cause training runs to crash.
+
+### Common Causes of Mismatch
+
+Several scenarios lead to a mismatch between training and generation, including differences between training and inference algorithm or kernel implementations (e.g., vLLM vs Megatron-core):
+
+**Re-tokenization**
+: When generated tokens are de-tokenized to strings and then re-tokenized for the next model call, the token IDs can change. For example, tokens that de-tokenize to `"_Ski" + "nny"` might re-tokenize as a single `"_Skinny"` token. The sketch below illustrates this round trip.
+
+**Re-chat templating**
+: When the model's output is parsed into structured objects (like tool calls) and then re-rendered through a chat template, the formatting can differ from the original generation.
+
+**Non-monotonic history**
+: When rollout history is modified during execution—such as truncating reasoning traces or summarizing context—the prompt token IDs at training time differ from those seen during generation.
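+
+The following sketch illustrates the re-tokenization round trip. It assumes the Hugging Face `transformers` package; the checkpoint name is only an example, and the exact token IDs depend on the tokenizer you use.
+
+```python
+from transformers import AutoTokenizer
+
+# Any subword tokenizer shows the effect; the checkpoint name is illustrative.
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
+
+# Pretend the model sampled a non-canonical split such as " Ski" + "nny".
+sampled_ids = (
+    tokenizer(" Ski", add_special_tokens=False)["input_ids"]
+    + tokenizer("nny", add_special_tokens=False)["input_ids"]
+)
+
+# De-tokenize to a string (what an OpenAI-compatible API returns) ...
+text = tokenizer.decode(sampled_ids)
+
+# ... then re-tokenize that string for the next model call.
+retokenized_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
+
+# The strings match, but the token ID sequences usually do not, so logprobs
+# gathered at the original positions no longer describe the sampled tokens.
+print(sampled_ids, retokenized_ids)
+print(tokenizer.decode(retokenized_ids) == text)
+```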
+
+:::{tip}
+For a detailed technical explanation of these problems and their solutions, refer to {doc}`../../contribute/rl-framework-integration/openai-compatible-http-server-on-policy-correction`.
+:::
+
+### Example: Reasoning Models
+
+Models like Qwen3 that produce reasoning traces present a specific challenge. Some chat templates remove reasoning from previous turns during inference, but training sees the full trajectory. This creates a logprob mismatch because the model computes probabilities against a different context.
+
+## Recommended Approaches
+
+### For Models with Reasoning Traces
+
+1. **Preferred**: Disable reasoning truncation and keep reasoning across all turns
+2. **Alternative**: Disable monotonicity enforcement and on-policy token ID correction
+
+### For Agents with Context Management
+
+- Evaluate whether history modification is necessary for your use case
+- If you must modify history, monitor training stability closely while on-policy checks are disabled
+- Leverage importance sampling to account for the off-policy mismatch (see the sketch below)
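+
+The sketch below shows the importance-sampling idea at the token level. It assumes PyTorch and per-token logprobs recorded at generation time and recomputed at training time; the tensor values are made up, and the snippet is not part of the NeMo Gym or NeMo RL API.
+
+```python
+import torch
+
+# Per-token logprobs of the sampled tokens: one set recorded by the inference
+# engine at generation time, one recomputed by the training framework.
+# Shapes are (batch, seq_len); the numbers are purely illustrative.
+gen_logprobs = torch.tensor([[-1.2, -0.7, -2.3]])
+train_logprobs = torch.tensor([[-1.0, -0.9, -2.0]])
+advantages = torch.tensor([[0.5, 0.5, 0.5]])
+
+# Importance ratio between the policy being trained and the policy that
+# generated the data. Perfectly on-policy data gives a ratio of exactly 1.
+ratio = torch.exp(train_logprobs - gen_logprobs)
+
+# PPO/GRPO-style clipped surrogate objective: clipping keeps tokens that have
+# drifted far off-policy from dominating the gradient.
+eps = 0.2
+unclipped = ratio * advantages
+clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
+loss = -torch.min(unclipped, clipped).mean()
+print(ratio, loss)
+```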
+
+## Related Topics
+
+- {doc}`../../contribute/rl-framework-integration/openai-compatible-http-server-on-policy-correction` — Technical details on on-policy corrections
+- {doc}`../../environment-tutorials/multi-turn` — Multi-turn environments where on-policy enforcement matters
+- {doc}`../../environment-tutorials/multi-step` — Multi-step environments with sequential tool calls
+- {doc}`key-terminology` — RL vocabulary including rollouts, logprobs, and policy
diff --git a/docs/environment-tutorials/multi-step.md b/docs/environment-tutorials/multi-step.md
index e53b214e1..ca9102eae 100644
--- a/docs/environment-tutorials/multi-step.md
+++ b/docs/environment-tutorials/multi-step.md
@@ -28,6 +28,10 @@ Build training environments where models make sequential tool calls, using resul
 
 ---
 
+:::{note}
+Multi-step environments enforce on-policy training by default. See {doc}`/about/concepts/on-policy-training` for more information.
+:::
+
 ## Key Concepts
 
 Before starting, understand these NeMo Gym terms:
diff --git a/docs/environment-tutorials/multi-turn.md b/docs/environment-tutorials/multi-turn.md
index c80b39e72..68b7686e1 100644
--- a/docs/environment-tutorials/multi-turn.md
+++ b/docs/environment-tutorials/multi-turn.md
@@ -28,6 +28,10 @@ Train models on extended conversations where context accumulates across user/ass
 
 ---
 
+:::{note}
+Multi-turn environments enforce on-policy training by default. See {doc}`/about/concepts/on-policy-training` for more information.
+:::
+
 ## Why Multi-Turn RL?
 
 Standard fine-tuning trains models on static conversation transcripts. Multi-turn RL goes further:
diff --git a/docs/reference/faq.md b/docs/reference/faq.md
index f705392c1..4d0d2c8dd 100644
--- a/docs/reference/faq.md
+++ b/docs/reference/faq.md
@@ -655,6 +655,11 @@ You may need to reformat some of your docstrings to Napoleon format docstrings h
 
 # FAQ: NeMo Gym, training frameworks, and token IDs
 
+:::{seealso}
+For conceptual background on why token ID alignment matters, see {doc}`/about/concepts/on-policy-training`.
+:::
+
 One of the goals of NeMo Gym is to act as a rollout tool for LLM post-training, either as synthetic data generation for SFT or as training environments for RL. RL training frameworks don't typically operate in OpenAI schema; they operate in token IDs. It is especially critical to always have the correct token IDs during training so that we stay on-policy and to make sure that what we think the model sees is what the model actually sees. However, when providing this OpenAI schema compatible interface to training environment developers, we lose track of the token IDs in Gym.
diff --git a/docs/training-tutorials/verl.md b/docs/training-tutorials/verl.md
index 853165b9c..88885d620 100644
--- a/docs/training-tutorials/verl.md
+++ b/docs/training-tutorials/verl.md
@@ -180,25 +180,13 @@ ng_collect_rollouts +agent_name=math_simple_agent \
 
 ---
 
-## Production Alternative: NeMo RL
+## Alternative: NeMo RL
 
-NeMo RL provides **working** NeMo Gym integration today:
+NeMo RL provides working NeMo Gym integration today:
 
-```bash
-# Install NeMo RL (separate from NeMo Gym)
-pip install nemo-rl
-
-# Run GRPO training with NeMo Gym environment
-python examples/nemo_gym/run_grpo_nemo_gym.py \
-    --config examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
-```
-
-NeMo RL includes:
-
-- ✅ Full rollout orchestration
-- ✅ On-policy token ID corrections
-- ✅ GRPO (Group Relative Policy Optimization) and DAPO (Diversity-Aware Policy Optimization)
-- ✅ Multi-node distributed training
+- Efficient and scalable post-training
+- SFT, DPO, GRPO, DAPO
+- Multi-turn and agent environments
 
 **Reference**: `nemo_rl/environments/nemo_gym.py`
diff --git a/docs/troubleshooting/index.md b/docs/troubleshooting/index.md
index 8253e9ea0..1e400a3a3 100644
--- a/docs/troubleshooting/index.md
+++ b/docs/troubleshooting/index.md
@@ -7,5 +7,3 @@ orphan: true
 # Troubleshooting
 
 Solutions for common errors and issues.
-
-