diff --git a/docs/about/architecture.md b/docs/about/architecture.md
index ef4242740..a67bdfddd 100644
--- a/docs/about/architecture.md
+++ b/docs/about/architecture.md
@@ -25,7 +25,7 @@ Model servers expose OpenAI-compatible inference endpoints for chat and response
- `POST /v1/chat/completions`
- `POST /v1/responses`

-The base model server class defines these endpoints. Concrete model servers implement them (for example, the OpenAI-backed model server). Agents call these endpoints through the shared server client.
+The base model server class defines these endpoints. Concrete model servers implement them (for example, the OpenAI or vLLM model server). Agents call these endpoints through the shared server client.

### Resources servers (environment + verification)

@@ -36,14 +36,15 @@ Resources servers expose environment lifecycle endpoints:
Individual resources servers can add domain-specific endpoints for tools or environment steps. For example:

-- A resources server can register a catch-all tool route like `POST /{path}` for tool execution.
-- Aviary-based resources servers add `POST /step` and `POST /close` for multi-step environments.
+- Individual tool routes such as `POST /get_weather` or `POST /search`
+- A resources server can register a catch-all tool route like `POST /{path}` for dynamic environments.
+- `POST /step` and `POST /close` for Gymnasium-style environments.

### Agent servers (rollout orchestration)

Agent servers expose two primary endpoints:

-- `POST /v1/responses` for multi-step interaction
+- `POST /v1/responses` for individual generations
- `POST /run` for full rollout execution and verification

The base agent server class wires these routes, while each agent implementation defines how to call model and resources servers.

@@ -63,19 +64,15 @@ The shared server client fetches the resolved configuration from the head server
The `SimpleAgent` implementation orchestrates a complete rollout and verification sequence:

-1. Call the resources server `POST /seed_session` to initialize session state.
-2. Call the agent `POST /v1/responses`. The agent calls the model server `POST /v1/responses` and issues tool calls to the resources server via `POST /{tool_name}`.
-3. Call the resources server `POST /verify` and return the verified rollout response.
+1. Call the resources server `POST /seed_session` to initialize environment state.
+2. Call the agent `POST /v1/responses`. The agent calls the model server `POST /v1/responses` and issues tool calls to the resources server via `POST /{tool_name}` to interact with the environment.
+3. Call the resources server `POST /verify` and return the rollout and reward.

The rollout collection flow uses the agent `POST /run` endpoint and writes the returned metrics to JSONL output.

-### Multi-step environments (Aviary example)
-
-Some resources servers model environments with explicit step and close endpoints. Aviary-based resources servers accept `POST /step` for environment transitions and `POST /close` to release an environment instance.
-
## Session and State

-All servers add session handling that assigns a session ID when one is not present. Agents propagate cookies between model and resources servers, which lets resources servers store per-session state. Several resources servers keep in-memory maps keyed by session ID (for example, counters or tool environments) to track environment state across steps.
+All servers add session handling that assigns a session ID on initialization.
Agents propagate cookies between model and resources servers, which lets resources servers store per-session state. Several resources servers keep in-memory maps keyed by session ID (for example, counters or tool environments) to track environment state across steps.

## Configuration and Port Resolution

diff --git a/docs/contribute/rl-framework-integration/index.md b/docs/contribute/rl-framework-integration/index.md
index 7e3ab0329..a0f3b1726 100644
--- a/docs/contribute/rl-framework-integration/index.md
+++ b/docs/contribute/rl-framework-integration/index.md
@@ -8,9 +8,22 @@ These guides cover how to integrate NeMo Gym into a new RL training framework. U
- Contributing NeMo Gym integration for a training framework that does not have one yet

:::{tip}
-Just want to train models? Use {ref}`NeMo RL ` instead.
+Just want to train models? See existing integrations:
+- {ref}`NeMo RL ` - Multi-step and multi-turn RL training at scale
+- {doc}`TRL (Hugging Face) <../training-tutorials/trl>` - GRPO training with distributed training support
+- {doc}`Unsloth <../training-tutorials/unsloth>` - Fast, memory-efficient training for single-step tasks
:::

+## Existing Integrations
+
+NeMo Gym currently integrates with the following RL training frameworks:
+
+**[NeMo RL](https://github.com/NVIDIA-NeMo/RL)**: NVIDIA's RL training framework, purpose-built for large-scale frontier model training. Provides full support for multi-step and multi-turn environments with production-grade distributed training capabilities.
+
+**[TRL](https://github.com/huggingface/trl)**: Hugging Face's transformer reinforcement learning library. Supports GRPO with single-turn and multi-turn NeMo Gym environments using vLLM generation, multi-environment training, and distributed training via Accelerate and DeepSpeed. See the {doc}`TRL tutorial <../training-tutorials/trl>` for usage examples.
+
+**[Unsloth](https://github.com/unslothai/unsloth)**: Fast, memory-efficient fine-tuning library. Supports optimized GRPO with single-step NeMo Gym environments, including low-precision training, parameter-efficient fine-tuning, and training in notebook environments. See the {doc}`Unsloth tutorial <../training-tutorials/unsloth>` for getting started.
+
## Prerequisites

Before integrating Gym into your training framework, ensure you have:

diff --git a/docs/index.md b/docs/index.md
index a8ac3bab5..21f98cc82 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -418,8 +418,8 @@ Rollout Collection
🟡 Nemotron Nano
🟡 Nemotron Super
NeMo RL GRPO
-Unsloth Training
🟡 TRL
+🟡 Unsloth
🟡 VERL
🟡 NeMo Customizer
Offline Training

diff --git a/docs/training-tutorials/trl.md b/docs/training-tutorials/trl.md
index 9c64ad9db..db561a069 100644
--- a/docs/training-tutorials/trl.md
+++ b/docs/training-tutorials/trl.md
@@ -1,139 +1,296 @@
(training-trl)=

-# TRL Training
+# RL Training with TRL

-```{warning}
-**Status: In Development** — TRL integration is planned but not yet implemented. Track progress at [GitHub Issue #548](https://github.com/NVIDIA-NeMo/Gym/issues/548).
+This tutorial demonstrates how to use [Hugging Face TRL](https://github.com/huggingface/trl) to train models in NeMo Gym environments using GRPO.

-Looking to train now? Use {doc}`NeMo RL <../tutorials/nemo-rl-grpo/index>` (production-ready) or {doc}`Unsloth <../tutorials/unsloth-training>` (single GPU).
-```
+**TRL (Transformer Reinforcement Learning)** is Hugging Face's library for post-training foundation models.
It provides implementations of algorithms like SFT, GRPO, and DPO, with support for distributed training and vLLM inference.

-Train models using [Hugging Face TRL](https://huggingface.co/docs/trl) with NeMo Gym verifiers as reward functions.
+The integration supports multi-step and multi-turn rollouts, multi-environment training, and any NeMo Gym environment (thoroughly tested: workplace assistant, reasoning gym, MCQA, and math with judge).

-## Why TRL + NeMo Gym?
+:::{card}

-**TRL** provides production-ready RL training for large language models:
+**Goal**: Train models on NeMo Gym environments using TRL's GRPO trainer.

-- PPO, DPO, ORPO algorithms out of the box
-- Seamless HuggingFace Hub integration
-- Active community and documentation
+^^^

-**NeMo Gym** adds:
+**In this tutorial, you will**:

-- Domain-specific verifiers (math, code, tool calling)
-- Standardized reward computation via HTTP API
-- Pre-built training environments
+1. Set up TRL with NeMo Gym environments
+2. Prepare NeMo Gym datasets and configure training
+3. Train models using GRPO with vLLM for optimized inference
+4. Scale to multi-node training with Slurm

-## Planned Integration Pattern
:::

-When implemented, TRL integration will use NeMo Gym verifiers as reward functions via HTTP:
+## Overview

-```text
-┌─────────────┐ HTTP ┌──────────────────────┐
-│ TRL Trainer │ ──────────▶ │ NeMo Gym Resource │
-│ (PPO/DPO) │ │ Server (Verifier) │
-│ │ ◀────────── │ │
-│ │ reward │ │
-└─────────────┘ └──────────────────────┘
-```
+The TRL integration with NeMo Gym enables:

-### Proposed Reward Wrapper
-
-The integration will expose NeMo Gym verifiers as TRL-compatible reward functions:
-
-```python
-# Proposed implementation (not yet available)
-# Location: nemo_gym/integrations/trl.py
-
-from typing import List
-
-class NeMoGymRewardFunction:
-    """Wrap a NeMo Gym resource server as a TRL reward function."""
-
-    def __init__(self, resources_server_url: str):
-        from nemo_gym import ResourcesServerClient
-        self.client = ResourcesServerClient(resources_server_url)
-
-    def __call__(self, outputs: List[str]) -> List[float]:
-        """Compute rewards for a batch of model outputs."""
-        rewards = []
-        for output in outputs:
-            response = self.client.verify(output)
-            rewards.append(response.reward)
-        return rewards
-```
+- **Multi-step and multi-turn environments**: Full support for tool calling and complex agentic tasks
+- **Multi-environment training**: Train on multiple environments simultaneously
+- **Production-scale training**: Multi-node distributed training with DeepSpeed
+- **Flexible verification**: Algorithmic verification, LLM-as-a-judge, and custom reward functions

+---
+
+## Steps to Get Started
+
+### Install TRL and NeMo Gym
+
+1. **Install TRL with vLLM support**
+
+   ```bash
+   cd trl/
+   uv venv
+   source .venv/bin/activate
+   uv sync --extra vllm
+   ```
+
+2. **Install NeMo Gym in a separate virtual environment**
+
+   ```bash
+   # deactivate trl venv
+   deactivate
+   git clone https://github.com/NVIDIA-NeMo/Gym.git
+   cd Gym
+   uv venv --python 3.12
+   source .venv/bin/activate
+   uv sync
+   ```
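+Optionally, sanity-check the two environments before moving on. The snippet below is a minimal sketch, assuming both repositories are cloned side by side under the same parent directory (run it from that parent directory) and that the `trl` and `ng_run` entry points are available in their respective virtual environments:
+
+```bash
+# Hypothetical quick check: confirm each CLI resolves inside its own venv.
+source trl/.venv/bin/activate && trl env && deactivate
+source Gym/.venv/bin/activate && ng_run --help && deactivate
+```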
+### Download a Dataset
+
1. **Prepare Dataset**

-### Usage Pattern (Planned)

   Download and prepare a NeMo Gym dataset:

-```python
-# Planned usage (not yet available)
-from trl import PPOTrainer, PPOConfig
-from nemo_gym.integrations.trl import NeMoGymRewardFunction

   ```bash
   cd Gym
   source .venv/bin/activate

-# Start a resource server (e.g., math verification)
-# nemo-gym serve math --port 8080

   # Set HuggingFace token in env.yaml
   echo "hf_token: your_token_here" >> env.yaml

-# Create reward function pointing to the server
-reward_fn = NeMoGymRewardFunction("http://localhost:8080")

   # Prepare workplace assistant dataset
   config_paths="responses_api_models/vllm_model/configs/vllm_model_for_training.yaml,\
   resources_servers/workplace_assistant/configs/workplace_assistant.yaml"

-# Use with TRL trainer
-config = PPOConfig(...)
-trainer = PPOTrainer(config=config, reward_model=reward_fn)
-trainer.train()

   ng_prepare_data "+config_paths=[${config_paths}]" \
   +output_dirpath=data/workplace_assistant \
   +mode=train_preparation \
   +should_download=true \
   +data_source=huggingface
   ```

+#### Dataset Format
+
NeMo Gym datasets are stored as JSONL. Each line contains a task with input messages, tool definitions, metadata such as ground truth for verification, and an agent server reference. The following example shows the workplace assistant dataset structure. Metadata fields can differ between datasets, as long as the corresponding resources server uses the fields appropriately.

```json
{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "..."},
      {"role": "user", "content": "Move any of jinsoo's tasks that are in review to completed"}
    ],
    "tools": [...],
    "parallel_tool_calls": false,
    "temperature": 1
  },
  "ground_truth": [
    {"name": "project_management_update_task", "arguments": "{...}"},
    ...
  ],
  "category": "workbench_project_management",
  "environment_name": "workbench",
  "agent_ref": {
    "type": "responses_api_agents",
    "name": "workplace_assistant_simple_agent"
  }
}
```

-## Target Algorithms

2. **Update Environment Config**

   Update `env.yaml` in `Gym/` to include model information:

   ```yaml
   policy_base_url: http://127.0.0.1:8000/v1
   policy_api_key: EMPTY
   policy_model_name: Qwen/Qwen2.5-1.5B-Instruct
   hf_token: ...
   ```

3. **Update Training Config**

   Update `examples/scripts/nemo_gym/config.yaml` to point to the dataset generated above, and make any other optional modifications.

### Start vLLM and NeMo Gym Servers and Run Training

The following steps run in three terminals. You can also run the processes in the background or use tmux.

1. **Start NeMo Gym Servers** (Terminal 1)

   ```bash
   cd Gym/
   source .venv/bin/activate

   config_paths="resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
   responses_api_models/vllm_model/configs/vllm_model_for_training.yaml"

   ng_run "+config_paths=[${config_paths}]"
   ```

   This starts:
   - **Agent server**: Orchestrates rollouts using resource servers and model servers
   - **Resources server**: Implements environment logic such as state management, tool implementations, and task verification
   - **Model server**: Adapts vLLM server requests to support NeMo Gym agents and on-policy RL training while ensuring OpenAI API compatibility
   - **Head server**: Manages the servers used in training and enables their discovery
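Before launching the remaining terminals, you can optionally sanity-check the dataset prepared in the Download a Dataset section. The following is a minimal sketch (standard library only) that assumes the `data/workplace_assistant/train.jsonl` output path used above, relative to the `Gym/` directory, and counts examples per agent and category:

```python
import json
from collections import Counter

agents, categories = Counter(), Counter()
with open("data/workplace_assistant/train.jsonl") as f:
    for line in f:
        task = json.loads(line)
        # Each line follows the dataset format shown above.
        agents[task["agent_ref"]["name"]] += 1
        categories[task.get("category", "unknown")] += 1

print(agents.most_common())
print(categories.most_common(5))
```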
2. **Start TRL vLLM Server on GPU 0** (Terminal 2)

-| Algorithm | TRL Support | NeMo Gym Integration |
-|-----------|-------------|----------------------|
-| PPO | ✅ Stable | 🔜 Planned |
-| DPO | ✅ Stable | 🔜 Planned |
-| ORPO | ✅ Stable | 🔜 Planned |
-| GRPO | ❌ Not in TRL | ✅ Use {doc}`NeMo RL <../tutorials/nemo-rl-grpo/index>` |

   ```bash
   cd trl/
   source .venv/bin/activate
   CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
   --model Qwen/Qwen2.5-1.5B-Instruct \
   --max-model-len 16384 \
   --host 0.0.0.0 \
   --port 8000
   ```

-## Architecture Considerations

3. **Run Training on GPU 1** (Terminal 3)

-For architects evaluating this integration:

   ```bash
   source trl/.venv/bin/activate
   cd trl/examples/scripts/nemo_gym
   export WANDB_API_KEY=...
   uv add omegaconf

-**Network Latency**
-: Verifier calls add HTTP round-trip latency per batch. Mitigate with batched verification and local resource servers.

   CUDA_VISIBLE_DEVICES=1 python train_multi_environment.py --config config.yaml
   ```

-**Distributed Training**
-: Each TRL worker connects to the same or separate resource servers. Compatibility with FSDP and DeepSpeed training modes is a design goal.

## Multi-Environment Training

-**Error Handling**
-: Planned retry logic and timeout configuration for verifier failures during training.

Train on multiple environments simultaneously. This example combines the workplace assistant dataset from the previous section with a reasoning gym dataset.

-## Contributing

1. **Generate Reasoning Gym Dataset**

-Help move TRL integration forward:

   ```bash
   cd Gym
   source .venv/bin/activate
   uv add reasoning-gym
   cd resources_servers/reasoning_gym
   python scripts/create_dataset.py \
   --task mini_sudoku \
   --size 2000 \
   --seed 42 \
   --output data/reasoning_gym/train_mini_sudoku.jsonl

-1. **Track progress**: Watch [Issue #548](https://github.com/NVIDIA-NeMo/Gym/issues/548)
-2. **Contribute**: See {doc}`../contribute/rl-framework-integration/index` for integration guidelines

   python scripts/create_dataset.py \
   --task mini_sudoku \
   --size 50 \
   --seed 24 \
   --output data/reasoning_gym/val_mini_sudoku.jsonl
   ```

-## Available Alternatives

2. **Combine and Shuffle Datasets**

-Ready to train today? These integrations work now:

   Combine datasets into one file for training:

-::::{grid} 1 2 2 2

   ```bash
   cat data/workplace_assistant/train.jsonl data/reasoning_gym/train_mini_sudoku.jsonl | shuf > train_multi_env.jsonl
   ```

3. **Start NeMo Gym Servers**

   ```bash
   config_paths="responses_api_models/vllm_model/configs/vllm_model_for_training.yaml,\
   resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
   resources_servers/reasoning_gym/configs/reasoning_gym.yaml"

   ng_run "+config_paths=[${config_paths}]"
   ```

4. **Run Training**

   Update the config to point at the new dataset, then run training:

   ```bash
   python train_multi_environment.py --config config.yaml
   ```

The training script automatically routes each example to the correct environment based on the `agent_ref` field in the dataset.

## Multi-Node Training with Slurm

An example five-node training script is provided in `submit.sh`. Nodes one through four run the training algorithm, while node five runs vLLM inference for NeMo Gym agent rollouts.

1. **Configure the Script**

   Update `submit.sh` with your Slurm account, partition, paths to your project directory, and updated training configs.

1. **Submit the Job**

   ```bash
   sbatch submit.sh
   ```

1. 
**Monitor Training** + + ```bash + tail -f logs//* + ``` + +> **Tip**: Set up wandb logging for detailed training metrics. For more details on TRL's vLLM integration, refer to the vLLM integration page. + + +## Documentation + +Visit the Hugging Face integration docs for more details: + +:::{button-link} https://huggingface.co/docs/trl/main/nemo_gym_integration +:color: primary +:class: sd-rounded-pill + +TRL NeMo-Gym Integration Guide +::: + +--- + +## Next Steps + +After completing this tutorial, explore these options: + +::::{grid} 1 1 2 2 :gutter: 3 -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` NeMo RL with GRPO -:link: ../tutorials/nemo-rl-grpo/index -:link-type: doc +:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Use Other Training Environments +:link: https://github.com/NVIDIA-NeMo/Gym#-available-resource-servers -Production-ready multi-node training with GRPO algorithm. +Browse available resource servers on GitHub to find other training environments. +++ -{bdg-success}`available` {bdg-primary}`recommended` +{bdg-secondary}`github` {bdg-secondary}`resource-servers` ::: -:::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` Unsloth Training -:link: ../tutorials/unsloth-training +:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Build a Custom Training Environment +:link: creating-resource-server :link-type: doc -Fast, memory-efficient fine-tuning on a single GPU. +Create your own resource server with custom tools and verification logic. +++ -{bdg-success}`available` {bdg-secondary}`single-gpu` +{bdg-secondary}`tutorial` {bdg-secondary}`custom-tools` ::: :::: + +## Resources + +- [TRL GitHub Repository](https://github.com/huggingface/trl) +- [TRL Documentation](https://huggingface.co/docs/trl) +- [TRL NeMo-Gym Integration](https://huggingface.co/docs/trl/main/nemo_gym_integration) +- [Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/nemo_gym/run_grpo_nemo_gym.py) +- [GRPO Trainer Documentation](https://huggingface.co/docs/trl/main/grpo_trainer) diff --git a/docs/tutorials/unsloth-training.md b/docs/training-tutorials/unsloth.md similarity index 100% rename from docs/tutorials/unsloth-training.md rename to docs/training-tutorials/unsloth.md diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index 5d6c375ea..6222a12d4 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -64,12 +64,4 @@ Learn how to set up NeMo Gym and NeMo RL training environments, run tests, prepa {bdg-primary}`training` {bdg-secondary}`rl` {bdg-secondary}`grpo` {bdg-secondary}`multi-step` ::: -:::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` Unsloth -:link: training-unsloth -:link-type: ref -Fast, memory-efficient fine-tuning for single-step tasks: math, structured outputs, instruction following, reasoning gym and more. -+++ -{bdg-primary}`training` {bdg-secondary}`unsloth` {bdg-secondary}`single-step` -::: - ::::