diff --git a/README.md b/README.md
index 24bcb773b..0f13eeaea 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@
 * **[2026/01]** 💎 **INT4 Quantization-Aware Training (QAT)**: Inspired by the Kimi K2-Thinking report, Miles now features a full-stack INT4 W4A16 QAT pipeline. This allows 1TB-scale models to fit into single-machine VRAM (e.g., NVIDIA H200), doubling rollout efficiency by eliminating cross-node bottlenecks while maintaining BF16-equivalent accuracy. [Blog](https://lmsys.org/blog/2026-01-28-int4-qat/)
 * **[2026/01]** 💎 **Unified VLM/LLM Multi-Turn Training**: We provided an implementation for the VLM multi-turn sampling paradigm. Developers only need to write a customized `rollout` function to easily start multi-turn RL for VLM, just like training LLM. [blog](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/vlm-multi-turn/readme-en.md)
 * **[2026/01]** 🤖 **Multi-Agent Co-Evolution**: Miles now supports **MrlX**, a novel asynchronous co-evolutionary framework for Multi-Agent RL. Achieve superior performance in complex tasks like Doctor-Patient simulations and DeepResearch pipelines by enabling specialized agents to evolve together symbiotically. [[Link]](https://github.com/AQ-MedAI/MrlX)
-* **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370)
+* **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370) [[Docs]](docs/en/advanced/miles-router.md#22-rollout-routing-replay-r3-for-moe)
 * **[2025/11]** 🔥 **Unified FP8 Release**: Solves the stability issues in MoE RL by ensuring training and inference use the exact same FP8 quantization logic. [[Blog]](https://lmsys.org/blog/2025-11-25-fp8-rl/)
 * **[2025/11]** ⚡ **Speculative Decoding in RL**: Integrated speculative rollout with online SFT for draft models, achieving massive throughput gains. [[Blog]](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md)
 * **[2025/11]** 🎉 **Miles Project Launch**: A joint effort by InfiXAI, Ant Group, SGLang RL Team, and the Miles community. [[Announcement]](https://lmsys.org/blog/2025-11-19-miles/)
diff --git a/docs/en/advanced/miles-router.md b/docs/en/advanced/miles-router.md
new file mode 100644
index 000000000..1eeb42b65
--- /dev/null
+++ b/docs/en/advanced/miles-router.md
@@ -0,0 +1,93 @@
+# Miles Router
+
+miles includes an optional Miles Router used during rollout / data generation. It is a lightweight HTTP router/proxy that sits in front of one or more SGLang worker servers and adds training-oriented capabilities that are not the main goal of serving-focused routers.
+
+---
+
+## 1. What is Miles Router?
+
+Miles Router is a small FastAPI service that:
+
+- Registers workers (SGLang HTTP servers) into a local pool
+- Routes requests to a selected worker (simple least-inflight load balancing)
+- Proxies arbitrary paths to the selected worker (e.g. `/generate`)
+- Runs periodic health checks and quarantines unhealthy workers
+- Supports middleware plugins (via `--miles-router-middleware-paths`) to implement rollout-specific processing (e.g. caching, request/response transforms)
+
+In miles's architecture, the router is part of the rollout system ("SGLang + router") that generates samples and pushes them into the data buffer.
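The pool behavior described above (registration, least-inflight selection, health quarantine) can be sketched in a few lines. This is a minimal illustrative sketch, not the actual Miles Router implementation; `Worker`, `LeastInflightPool`, and the worker URLs are invented names:

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """One registered SGLang HTTP server."""
    url: str
    inflight: int = 0      # requests currently being proxied to this worker
    healthy: bool = True   # flipped by a periodic health checker


class LeastInflightPool:
    def __init__(self) -> None:
        self.workers: list[Worker] = []

    def register(self, url: str) -> None:
        self.workers.append(Worker(url))

    def quarantine(self, url: str) -> None:
        # A failing health check removes the worker from rotation
        # until it recovers.
        for w in self.workers:
            if w.url == url:
                w.healthy = False

    def pick(self) -> Worker:
        # Least-inflight load balancing: route to the healthy worker
        # with the fewest requests currently in flight.
        candidates = [w for w in self.workers if w.healthy]
        if not candidates:
            raise RuntimeError("no healthy workers registered")
        return min(candidates, key=lambda w: w.inflight)


pool = LeastInflightPool()
pool.register("http://worker-a:30000")
pool.register("http://worker-b:30000")
pool.workers[0].inflight = 3   # worker-a is busier
print(pool.pick().url)         # -> http://worker-b:30000
```

The router proxies the request to `pick()`'s worker, incrementing `inflight` for the duration of the call.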
+
+### How it is launched
+
+In distributed training, miles will start a router automatically when `--sglang-router-ip` is not provided:
+
+- If `--use-miles-router` is set, miles starts Miles Router
+- Otherwise, miles starts SGLang Model Gateway
+
+---
+
+## 2. Why we need Miles Router
+
+Unlike production inference, RL rollout needs to capture additional metadata for training: token-level logprobs, loss masks, and (for MoE models) expert routing decisions. Miles Router provides these capabilities through its middleware system and passthrough proxy design.
+
+### 2.1 Radix-tree cache (transparent token management)
+
+> Use this when your rollout pipeline is text-in/text-out and you cannot reliably persist token IDs; if you already control token-in/token-out (e.g. search r1, multi-turn VLM examples), you likely don't need the radix-tree cache.
+
+Text-in/text-out interfaces can cause retokenization mismatches: re-tokenizing text at training time may produce different token sequences than rollout did, breaking the per-token alignment needed for PPO/GRPO losses.
+
+The radix-tree cache solves this transparently: it intercepts text-based requests, tokenizes them, and stores trajectories (text, token IDs, logprobs, loss masks) keyed by the text prefix. After rollout finishes, calling `/retrieve_from_text` returns the exact token sequence with aligned metadata, without requiring any changes to your rollout code.
+
+Implementation-wise, the radix-tree cache:
+
+- Accepts text plus tokens/metadata and stores them in a radix tree
+- Uses longest-prefix matching to reuse cached token sequences (enabling token-in/token-out downstream)
+- Allows insertion of new text continuations as rollout proceeds (multiple trajectories per prompt, e.g. GRPO)
+- Periodically cleans up stale nodes to control memory usage
+
+Use the radix cache when you have text-based rollout code and want token-level precision without rewriting it, or when running GRPO with multiple trajectories sharing the same prompt prefix.
+
+### 2.2 Rollout routing replay (R3) for MoE
+
+For MoE models, miles supports rollout routing replay (R3): record expert routing decisions during rollout and replay them during training to improve stability.
+
+#### SGLang side
+
+SGLang provides expert routing capture via:
+
+- `--enable-return-routed-experts`: server argument to enable routing capture
+- `RoutedExpertsCapturer`: captures `topk_ids` (selected expert IDs) at each MoE layer during the forward pass
+- `return_routed_experts`: request parameter to retrieve routing data
+- Returns `routed_experts` in the response `meta_info`: a `[seq_len - 1, num_layers, top_k]` tensor of expert IDs
+
+#### miles side
+
+miles consumes the routing data and replays it during training:
+
+- `--use-miles-router --use-rollout-routing-replay`: both flags are required to enable R3
+- Rollout sends `return_routed_experts=True` and stores results in `sample.rollout_routed_experts`
+- Training calls `fill_routing_replay()` to load routing data into `RoutingReplay` objects
+- During the forward pass, recorded routing decisions are replayed instead of recomputed
+
+#### Why Miles Router is needed
+
+We need Miles Router because the SGLang worker returns routed experts in the response (`meta_info.routed_experts`) when the request sets `return_routed_experts=true`, and Miles Router preserves this field end-to-end. SGLang Model Gateway may drop this extra metadata when it reconstructs responses with a fixed schema (see section 3).
+
+---
+
+## 3. Differences vs SGLang Model Gateway
+
+Miles Router and SGLang Model Gateway can both route requests to workers, but they are optimized for different goals.
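The metadata-preservation difference can be illustrated with a toy sketch. None of this is Miles or SGLang code; the response dict and the `passthrough` / `fixed_schema` functions are invented to show why re-serializing responses through a fixed schema silently drops fields like `meta_info.routed_experts`:

```python
# A worker response carrying extra training metadata alongside the usual
# serving fields (field values are made up for illustration).
worker_response = {
    "text": "The answer is 42.",
    "meta_info": {
        "completion_tokens": 6,
        # shape [seq_len - 1, num_layers, top_k], flattened here
        "routed_experts": [[[1, 7], [3, 5]], [[0, 2], [4, 6]]],
    },
}


def passthrough(resp: dict) -> dict:
    # A passthrough proxy forwards the worker's payload unchanged,
    # so any extra metadata survives end-to-end.
    return resp


def fixed_schema(resp: dict) -> dict:
    # A gateway that rebuilds responses from a fixed set of known
    # fields drops anything it does not model.
    known_meta = {"completion_tokens"}
    return {
        "text": resp["text"],
        "meta_info": {
            k: v for k, v in resp["meta_info"].items() if k in known_meta
        },
    }


print("routed_experts" in passthrough(worker_response)["meta_info"])   # True
print("routed_experts" in fixed_schema(worker_response)["meta_info"])  # False
```

The same trade-off applies to any rollout metadata a middleware plugin needs to see: a passthrough design keeps it available; a schema-reconstructing gateway does not.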
+
+### Key differences
+
+Miles Router is a lightweight Python/FastAPI proxy that acts as a passthrough to SGLang workers. This passthrough design enables RL-specific features like radix-tree trajectory caching and R3, which require preserving raw response metadata such as `routed_experts`.
+
+SGLang Model Gateway is a high-performance Rust-based router optimized for large-scale inference: async non-blocking routing, advanced fault tolerance (retries, circuit breakers), multiple load-balancing policies (including cache-aware routing), and PD disaggregation support. However, it reconstructs responses with a fixed schema, so it does not preserve the metadata needed for miles's R3 flow.
+
+For more details on SGLang Model Gateway, see the [official documentation](https://docs.sglang.io/advanced_features/sgl_model_gateway.html).
+
+### When to use which
+
+- Use Miles Router when you need R3 or radix-tree caching
+- Use SGLang Model Gateway for everything else (recommended default)
diff --git a/docs/en/get_started/customization.md b/docs/en/get_started/customization.md
index b1088ce64..bfd502422 100644
--- a/docs/en/get_started/customization.md
+++ b/docs/en/get_started/customization.md
@@ -417,3 +417,4 @@ Stabilize MoE RL training by recording and replaying expert routing decisions to
 | `--use-routing-replay` | Forward-backward routing consistency in training. ([arXiv:2507.18071](https://arxiv.org/abs/2507.18071)) |
 | `--use-rollout-routing-replay` | R3: Replay routing from rollout during training. **Requires `--use-miles-router`**. ([arXiv:2510.11370](https://arxiv.org/abs/2510.11370)) |
 
+For a detailed explanation of R3 and Miles Router, see [Miles Router](../advanced/miles-router.md).
diff --git a/docs/en/index.rst b/docs/en/index.rst
index afafc6796..3f08d98d0 100644
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -41,6 +41,7 @@ miles is the RL-framework behind GLM-4.7, GLM-4.6 and GLM-4.5. Apart from models
    :caption: Advanced Features
 
    _examples_synced/reproducibility/README.md
+   advanced/miles-router.md
    advanced/speculative-decoding.md
    advanced/fault-tolerance.md
    advanced/arch-support-beyond-megatron.md