Merged
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@
* **[2026/01]** 💎 **INT4 Quantization-Aware Training (QAT)**: Inspired by the Kimi K2-Thinking report, Miles now features a full-stack INT4 W4A16 QAT pipeline. This allows 1TB-scale models to fit into single-machine VRAM (e.g., NVIDIA H200), doubling rollout efficiency by eliminating cross-node bottlenecks while maintaining BF16-equivalent accuracy. [Blog](https://lmsys.org/blog/2026-01-28-int4-qat/)
* **[2026/01]** 💎 **Unified VLM/LLM Multi-Turn Training**: We provided an implementation for the VLM multi-turn sampling paradigm. Developers only need to write a customized `rollout` function to easily start multi-turn RL for VLM, just like training LLM. [blog](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/vlm-multi-turn/readme-en.md)
* **[2026/01]** 🤖 **Multi-Agent Co-Evolution**: Miles now supports **MrlX**, a novel asynchronous co-evolutionary framework for Multi-Agent RL. Achieve superior performance in complex tasks like Doctor-Patient simulations and DeepResearch pipelines by enabling specialized agents to evolve together symbiotically. [[Link]](https://github.com/AQ-MedAI/MrlX)
* **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370)
* **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370) [[Docs]](docs/en/advanced/miles-router.md#22-rollout-routing-replay-r3-for-moe)
* **[2025/11]** 🔥 **Unified FP8 Release**: Solves the stability issues in MoE RL by ensuring training and inference use the exact same FP8 quantization logic. [[Blog]](https://lmsys.org/blog/2025-11-25-fp8-rl/)
* **[2025/11]** ⚡ **Speculative Decoding in RL**: Integrated speculative rollout with online SFT for draft models, achieving massive throughput gains. [[Blog]](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md)
* **[2025/11]** 🎉 **Miles Project Launch**: A joint effort by InfiXAI, Ant Group, SGLang RL Team, and the Miles community. [[Announcement]](https://lmsys.org/blog/2025-11-19-miles/)
93 changes: 93 additions & 0 deletions docs/en/advanced/miles-router.md
@@ -0,0 +1,93 @@
# Miles Router

miles includes an optional Miles Router used during rollout / data generation. It is a lightweight HTTP router/proxy that sits in front of one or more SGLang worker servers and adds training-oriented capabilities that are not the main goal of serving-focused routers.

---

## 1. What is Miles Router?

Miles Router is a small FastAPI service that:

- Registers workers (SGLang HTTP servers) into a local pool
- Routes requests to a selected worker (simple least-inflight load balancing)
- Proxies arbitrary paths to the selected worker (e.g. `/generate`)
> **Contributor review (security-high), on lines +11 to +13:** The router's `/add_worker` endpoint allows unauthenticated registration of arbitrary worker URLs, creating a Server-Side Request Forgery (SSRF) risk: an attacker can register internal or restricted URLs and then use the router to reach them. Since this documentation introduces the feature, it should include a prominent security warning, and the implementation in `miles/router/router.py` should add authentication and URL validation.
- Runs periodic health checks and quarantines unhealthy workers
- Supports middleware plugins (via `--miles-router-middleware-paths`) to implement rollout-specific processing (e.g. caching, request/response transforms)
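
The worker-selection step above can be sketched as a least-inflight pick (a minimal sketch with illustrative names, not the actual miles implementation):

```python
def pick_worker(inflight: dict[str, int]) -> str:
    """Least-inflight load balancing: route the next request to the healthy
    worker currently serving the fewest in-progress requests (ties broken
    arbitrarily)."""
    if not inflight:
        raise RuntimeError("no healthy workers registered")
    return min(inflight, key=inflight.get)

# The router would increment the chosen worker's counter when it forwards a
# request and decrement it when the response (or stream) completes.
```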

In the miles architecture, the router is part of the rollout system ("SGLang + router") that generates samples and pushes them into the data buffer.

### How it is launched

In distributed training, miles will start a router automatically when `--sglang-router-ip` is not provided:

- If `--use-miles-router` is set, miles starts Miles Router
- Otherwise, miles starts SGLang Model Gateway

---

## 2. Why we need Miles Router

Unlike production inference, RL rollout needs to capture additional metadata for training: token-level logprobs, loss masks, and (for MoE models) expert routing decisions. Miles Router provides these capabilities through its middleware system and passthrough proxy design.

### 2.1 Radix-tree cache (transparent token management)

> Use this when your rollout pipeline is text-in/text-out and you cannot reliably persist token IDs; if you already control token-in/token-out (e.g. search r1, multiturn VLM examples), you likely don't need the radix-tree cache.

Text-in/text-out interfaces can cause retokenization mismatches: re-tokenizing text at training time may produce different token sequences than rollout did, breaking the per-token alignment needed for PPO/GRPO losses.

The radix-tree cache solves this transparently: it intercepts text-based requests, tokenizes them, and stores trajectories (text, token IDs, logprobs, loss masks) keyed by the text prefix. After rollout finishes, calling `/retrieve_from_text` returns the exact token sequence with aligned metadata, without requiring any changes to your rollout code.
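
For example, a training-side client might fetch a finished trajectory like this (the `/retrieve_from_text` endpoint name is from the docs above; the request and response payload shapes are assumptions):

```python
import json
import urllib.request

def build_retrieve_payload(text: str) -> dict:
    # Payload shape is an assumption; the docs only name the endpoint.
    return {"text": text}

def retrieve_from_text(router_url: str, text: str, timeout: float = 30.0) -> dict:
    """POST the finished rollout's text to the router and get back the exact
    token sequence with aligned logprobs and loss masks."""
    req = urllib.request.Request(
        f"{router_url}/retrieve_from_text",
        data=json.dumps(build_retrieve_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```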

Implementation-wise, the radix-tree cache:

- Accepts text plus tokens/metadata and stores them in a radix tree
- Uses longest-prefix matching to reuse cached token sequences (enabling token-in/token-out downstream)
- Allows insertion of new text continuations as rollout proceeds (multiple trajectories per prompt, e.g. GRPO)
- Periodically cleans up stale nodes to control memory usage

Use the radix cache when you have text-based rollout code and want token-level precision without rewriting, or when running GRPO with multiple trajectories sharing the same prompt prefix.
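
The prefix-matching idea can be sketched with a flat dictionary (a real radix tree shares prefixes node-by-node and also handles stale-node cleanup; all names here are illustrative):

```python
class PrefixTrajectoryCache:
    """Minimal stand-in for the radix-tree cache: store trajectories keyed by
    their text prefix and answer lookups by longest-prefix match."""

    def __init__(self):
        self._store = {}  # text -> (token_ids, logprobs, loss_mask)

    def insert(self, text, token_ids, logprobs, loss_mask):
        self._store[text] = (token_ids, logprobs, loss_mask)

    def retrieve_from_text(self, text):
        # Longest stored prefix of `text` wins, so downstream consumers get
        # the exact rollout tokens back (token-in/token-out) instead of
        # re-tokenizing the text.
        best = max((t for t in self._store if text.startswith(t)),
                   key=len, default=None)
        return None if best is None else self._store[best]
```

Multiple GRPO trajectories that share a prompt naturally share the stored prefix, which is what makes longest-prefix reuse pay off.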
> **Contributor review (security-high), on lines +32 to +47:** The radix-tree cache is shared across all users and keyed solely by text prefixes, so one user can retrieve cached token sequences and metadata for another user's prompts (cross-user data leakage), and cache poisoning is possible. The documentation should note these considerations, and the implementation should be revised to provide data isolation.

### 2.2 Rollout routing replay (R3) for MoE

For MoE models, miles supports rollout routing replay (R3): record expert routing decisions during rollout and replay them during training to improve stability.

#### SGLang side

SGLang provides expert routing capture via:

- `--enable-return-routed-experts`: server argument to enable routing capture
- `RoutedExpertsCapturer`: captures `topk_ids` (selected expert IDs) at each MoE layer during forward pass
- `return_routed_experts`: request parameter to retrieve routing data
- Returns `routed_experts` in the response's `meta_info` field: a `[seq_len - 1, num_layers, top_k]` tensor of expert IDs
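
A hedged client-side sketch (the `return_routed_experts` field and `meta_info.routed_experts` location are from the list above; the rest of the payload shape is an assumption):

```python
import json
import urllib.request

def build_generate_payload(prompt: str) -> dict:
    # Ask the worker to capture and return expert routing for this request.
    return {"text": prompt, "return_routed_experts": True}

def extract_routed_experts(response_body: dict) -> list:
    # [seq_len - 1, num_layers, top_k] nested lists of expert IDs.
    return response_body["meta_info"]["routed_experts"]

def generate_with_routing(router_url: str, prompt: str, timeout: float = 300.0):
    """Call /generate through the router and return (text, routed_experts)."""
    req = urllib.request.Request(
        f"{router_url}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["text"], extract_routed_experts(body)
```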

#### miles side

miles consumes the routing data and replays it during training:

- `--use-miles-router --use-rollout-routing-replay`: both flags required to enable R3
- Rollout sends `return_routed_experts=True` and stores results in `sample.rollout_routed_experts`
- Training calls `fill_routing_replay()` to load routing data into `RoutingReplay` objects
- During forward pass, recorded routing decisions are replayed instead of recomputed
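
The replay flow above can be sketched as follows (`RoutingReplay`, `fill_routing_replay()`, and `sample.rollout_routed_experts` are named in the docs; the shapes and signatures here are assumptions, not the miles implementation):

```python
class RoutingReplay:
    """Hold the expert IDs recorded at rollout time and hand them back
    per layer during the training forward pass."""

    def __init__(self, routed_experts):
        # routed_experts: [seq_len - 1][num_layers][top_k] expert IDs
        self.routed_experts = routed_experts

    def topk_ids(self, layer: int):
        # Replayed instead of recomputing the MoE router's top-k at this layer.
        return [step[layer] for step in self.routed_experts]

def fill_routing_replay(samples):
    # One replay object per rollout sample, loaded before the forward pass.
    return [RoutingReplay(s["rollout_routed_experts"]) for s in samples]
```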

#### Why Miles Router is needed

We need Miles Router because the SGLang worker returns routed experts in the response (`meta_info.routed_experts`) when the request sets `return_routed_experts=true`, and Miles Router preserves this field end-to-end. SGLang Model Gateway may drop this extra metadata when it reconstructs responses with a fixed schema (see section 3).

---

## 3. Differences vs SGLang Model Gateway

Miles Router and SGLang Model Gateway can both route requests to workers, but they are optimized for different goals.

### Key differences

Miles Router is a lightweight Python/FastAPI proxy that acts as a passthrough to SGLang workers. This passthrough design enables RL-specific features like radix-tree trajectory caching and R3 (which require preserving raw response metadata like `routed_experts`).

SGLang Model Gateway is a high-performance Rust-based router optimized for large-scale inference: async non-blocking routing, advanced fault tolerance (retries, circuit breakers), multiple load balancing policies (including cache-aware routing), and PD disaggregation support. However, it reconstructs responses with a fixed schema, so it does not preserve the metadata needed for the miles R3 flow.
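
The schema difference can be illustrated with two toy response handlers (a sketch; neither is the real implementation of either router):

```python
def passthrough(worker_body: dict) -> dict:
    # Miles Router style: forward the worker's JSON unchanged, so extra
    # fields such as meta_info["routed_experts"] survive end-to-end.
    return worker_body

def reconstruct_fixed_schema(worker_body: dict) -> dict:
    # Fixed-schema style: only fields the gateway knows about are copied,
    # so training-only metadata is silently dropped.
    return {
        "text": worker_body.get("text"),
        "meta_info": {
            "finish_reason": worker_body.get("meta_info", {}).get("finish_reason"),
        },
    }
```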

For more details on SGLang Model Gateway, see the [official documentation](https://docs.sglang.io/advanced_features/sgl_model_gateway.html).

### When to use which

- Use Miles Router when you need R3 or radix-tree caching
- Use SGLang Model Gateway for everything else (recommended default)

1 change: 1 addition & 0 deletions docs/en/get_started/customization.md
@@ -417,3 +417,4 @@ Stabilize MoE RL training by recording and replaying expert routing decisions to
| `--use-routing-replay` | Forward-backward routing consistency in training. ([arXiv:2507.18071](https://arxiv.org/abs/2507.18071)) |
| `--use-rollout-routing-replay` | R3: Replay routing from rollout during training. **Requires `--use-miles-router`**. ([arXiv:2510.11370](https://arxiv.org/abs/2510.11370)) |

For a detailed explanation of R3 and Miles Router, see [Miles Router](../advanced/miles-router.md).
1 change: 1 addition & 0 deletions docs/en/index.rst
@@ -41,6 +41,7 @@ miles is the RL framework behind GLM-4.7, GLM-4.6 and GLM-4.5. Apart from models
:caption: Advanced Features

_examples_synced/reproducibility/README.md
advanced/miles-router.md
advanced/speculative-decoding.md
advanced/fault-tolerance.md
advanced/arch-support-beyond-megatron.md