Merged
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@
* **[2026/01]** 💎 **INT4 Quantization-Aware Training (QAT)**: Inspired by the Kimi K2-Thinking report, Miles now features a full-stack INT4 W4A16 QAT pipeline. This allows 1TB-scale models to fit into single-machine VRAM (e.g., NVIDIA H200), doubling rollout efficiency by eliminating cross-node bottlenecks while maintaining BF16-equivalent accuracy. [Blog](https://lmsys.org/blog/2026-01-28-int4-qat/)
* **[2026/01]** 💎 **Unified VLM/LLM Multi-Turn Training**: We provided an implementation for the VLM multi-turn sampling paradigm. Developers only need to write a customized `rollout` function to easily start multi-turn RL for VLM, just like training LLM. [blog](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/vlm-multi-turn/readme-en.md)
* **[2026/01]** 🤖 **Multi-Agent Co-Evolution**: Miles now supports **MrlX**, a novel asynchronous co-evolutionary framework for Multi-Agent RL. Achieve superior performance in complex tasks like Doctor-Patient simulations and DeepResearch pipelines by enabling specialized agents to evolve together symbiotically. [[Link]](https://github.com/AQ-MedAI/MrlX)
* **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370)
* **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370) [[Docs]](docs/en/advanced/miles-router.md#22-rollout-routing-replay-r3-for-moe)
* **[2025/11]** 🔥 **Unified FP8 Release**: Solves the stability issues in MoE RL by ensuring training and inference use the exact same FP8 quantization logic. [[Blog]](https://lmsys.org/blog/2025-11-25-fp8-rl/)
* **[2025/11]** ⚡ **Speculative Decoding in RL**: Integrated speculative rollout with online SFT for draft models, achieving massive throughput gains. [[Blog]](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md)
* **[2025/11]** 🎉 **Miles Project Launch**: A joint effort by InfiXAI, Ant Group, SGLang RL Team, and the Miles community. [[Announcement]](https://lmsys.org/blog/2025-11-19-miles/)
93 changes: 93 additions & 0 deletions docs/en/advanced/miles-router.md
@@ -0,0 +1,93 @@
# Miles Router

miles includes an optional Miles Router used during rollout / data generation. It is a lightweight HTTP router/proxy that sits in front of one or more SGLang worker servers and adds training-oriented capabilities that are not the main goal of serving-focused routers.

---

## 1. What is Miles Router?

Miles Router is a small FastAPI service that:

- Registers workers (SGLang HTTP servers) into a local pool
- Routes requests to a selected worker (simple least-inflight load balancing)
- Proxies arbitrary paths to the selected worker (e.g. `/generate`)
> **Contributor review (security-high), on lines +11 to +13:** The router's `/add_worker` endpoint allows unauthenticated registration of arbitrary worker URLs, creating a Server-Side Request Forgery (SSRF) risk: an attacker can register internal or restricted URLs and then use the router to reach them. Since this documentation introduces the feature, it should include a prominent security warning, and the implementation in `miles/router/router.py` should add authentication and URL validation.
- Runs periodic health checks and quarantines unhealthy workers
- Supports middleware plugins (via `--miles-router-middleware-paths`) to implement rollout-specific processing (e.g. caching, request/response transforms)
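
The worker-selection step above can be sketched as a least-inflight pick (a minimal sketch with illustrative names, not the actual miles implementation):

```python
def pick_worker(inflight: dict[str, int]) -> str:
    """Least-inflight load balancing: route the next request to the healthy
    worker currently serving the fewest in-progress requests (ties broken
    arbitrarily)."""
    if not inflight:
        raise RuntimeError("no healthy workers registered")
    return min(inflight, key=inflight.get)

# The router would increment the chosen worker's counter when it forwards a
# request and decrement it when the response (or stream) completes.
```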

In the miles architecture, the router is part of the rollout system ("SGLang + router") that generates samples and pushes them into the data buffer.

### How it is launched

In distributed training, miles will start a router automatically when `--sglang-router-ip` is not provided:

- If `--use-miles-router` is set, miles starts Miles Router
- Otherwise, miles starts SGLang Model Gateway

---

## 2. Why we need Miles Router

Unlike production inference, RL rollout needs to capture additional metadata for training: token-level logprobs, loss masks, and (for MoE models) expert routing decisions. Miles Router provides these capabilities through its middleware system and passthrough proxy design.

### 2.1 Radix-tree cache (transparent token management)

> Use this when your rollout pipeline is text-in/text-out and you cannot reliably persist token IDs; if you already control token-in/token-out (e.g. search r1, multiturn VLM examples), you likely don't need the radix-tree cache.

Text-in/text-out interfaces can cause retokenization mismatches: re-tokenizing text at training time may produce different token sequences than rollout did, breaking the per-token alignment needed for PPO/GRPO losses.

The radix-tree cache solves this transparently: it intercepts text-based requests, tokenizes them, and stores trajectories (text, token IDs, logprobs, loss masks) keyed by the text prefix. After rollout finishes, calling `/retrieve_from_text` returns the exact token sequence with aligned metadata, without requiring any changes to your rollout code.
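
For example, a training-side client might fetch a finished trajectory like this (the `/retrieve_from_text` endpoint name is from the docs above; the request and response payload shapes are assumptions):

```python
import json
import urllib.request

def build_retrieve_payload(text: str) -> dict:
    # Payload shape is an assumption; the docs only name the endpoint.
    return {"text": text}

def retrieve_from_text(router_url: str, text: str, timeout: float = 30.0) -> dict:
    """POST the finished rollout's text to the router and get back the exact
    token sequence with aligned logprobs and loss masks."""
    req = urllib.request.Request(
        f"{router_url}/retrieve_from_text",
        data=json.dumps(build_retrieve_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```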

Implementation-wise, the radix-tree cache:

- Accepts text plus tokens/metadata and stores them in a radix tree
- Uses longest-prefix matching to reuse cached token sequences (enabling token-in/token-out downstream)
- Allows insertion of new text continuations as rollout proceeds (multiple trajectories per prompt, e.g. GRPO)
- Periodically cleans up stale nodes to control memory usage

Use the radix cache when you have text-based rollout code and want token-level precision without rewriting, or when running GRPO with multiple trajectories sharing the same prompt prefix.
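
The prefix-matching idea can be sketched with a flat dictionary (a real radix tree shares prefixes node-by-node and also handles stale-node cleanup; all names here are illustrative):

```python
class PrefixTrajectoryCache:
    """Minimal stand-in for the radix-tree cache: store trajectories keyed by
    their text prefix and answer lookups by longest-prefix match."""

    def __init__(self):
        self._store = {}  # text -> (token_ids, logprobs, loss_mask)

    def insert(self, text, token_ids, logprobs, loss_mask):
        self._store[text] = (token_ids, logprobs, loss_mask)

    def retrieve_from_text(self, text):
        # Longest stored prefix of `text` wins, so downstream consumers get
        # the exact rollout tokens back (token-in/token-out) instead of
        # re-tokenizing the text.
        best = max((t for t in self._store if text.startswith(t)),
                   key=len, default=None)
        return None if best is None else self._store[best]
```

Multiple GRPO trajectories that share a prompt naturally share the stored prefix, which is what makes longest-prefix reuse pay off.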
> **Contributor review (security-high), on lines +32 to +47:** The radix-tree cache is shared across all users and keyed solely by text prefixes, so one user can retrieve cached token sequences and metadata for another user's prompts (cross-user data leakage), and cache poisoning is possible. The documentation should note these considerations, and the implementation should be revised to provide data isolation.

### 2.2 Rollout routing replay (R3) for MoE

For MoE models, miles supports rollout routing replay (R3): record expert routing decisions during rollout and replay them during training to improve stability.

#### SGLang side

SGLang provides expert routing capture via:

- `--enable-return-routed-experts`: server argument to enable routing capture
- `RoutedExpertsCapturer`: captures `topk_ids` (selected expert IDs) at each MoE layer during forward pass
- `return_routed_experts`: request parameter to retrieve routing data
- Returns `routed_experts` in the response's `meta_info` field: a `[seq_len - 1, num_layers, top_k]` tensor of expert IDs
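
A hedged client-side sketch (the `return_routed_experts` field and `meta_info.routed_experts` location are from the list above; the rest of the payload shape is an assumption):

```python
import json
import urllib.request

def build_generate_payload(prompt: str) -> dict:
    # Ask the worker to capture and return expert routing for this request.
    return {"text": prompt, "return_routed_experts": True}

def extract_routed_experts(response_body: dict) -> list:
    # [seq_len - 1, num_layers, top_k] nested lists of expert IDs.
    return response_body["meta_info"]["routed_experts"]

def generate_with_routing(router_url: str, prompt: str, timeout: float = 300.0):
    """Call /generate through the router and return (text, routed_experts)."""
    req = urllib.request.Request(
        f"{router_url}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["text"], extract_routed_experts(body)
```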

#### miles side

miles consumes the routing data and replays it during training:

- `--use-miles-router --use-rollout-routing-replay`: both flags required to enable R3
- Rollout sends `return_routed_experts=True` and stores results in `sample.rollout_routed_experts`
- Training calls `fill_routing_replay()` to load routing data into `RoutingReplay` objects
- During forward pass, recorded routing decisions are replayed instead of recomputed
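
The replay flow above can be sketched as follows (`RoutingReplay`, `fill_routing_replay()`, and `sample.rollout_routed_experts` are named in the docs; the shapes and signatures here are assumptions, not the miles implementation):

```python
class RoutingReplay:
    """Hold the expert IDs recorded at rollout time and hand them back
    per layer during the training forward pass."""

    def __init__(self, routed_experts):
        # routed_experts: [seq_len - 1][num_layers][top_k] expert IDs
        self.routed_experts = routed_experts

    def topk_ids(self, layer: int):
        # Replayed instead of recomputing the MoE router's top-k at this layer.
        return [step[layer] for step in self.routed_experts]

def fill_routing_replay(samples):
    # One replay object per rollout sample, loaded before the forward pass.
    return [RoutingReplay(s["rollout_routed_experts"]) for s in samples]
```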

#### Why Miles Router is needed

We need Miles Router because the SGLang worker returns routed experts in the response (`meta_info.routed_experts`) when the request sets `return_routed_experts=true`, and Miles Router preserves this field end-to-end. SGLang Model Gateway may drop this extra metadata when it reconstructs responses with a fixed schema (see section 3).

---

## 3. Differences vs SGLang Model Gateway

Miles Router and SGLang Model Gateway can both route requests to workers, but they are optimized for different goals.

### Key differences

Miles Router is a lightweight Python/FastAPI proxy that acts as a passthrough to SGLang workers. This passthrough design enables RL-specific features like radix-tree trajectory caching and R3 (which require preserving raw response metadata like `routed_experts`).

SGLang Model Gateway is a high-performance Rust-based router optimized for large-scale inference: async non-blocking routing, advanced fault tolerance (retries, circuit breakers), multiple load balancing policies (including cache-aware routing), and PD disaggregation support. However, it reconstructs responses with a fixed schema, so it does not preserve the metadata needed for the miles R3 flow.
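
The schema difference can be illustrated with two toy response handlers (a sketch; neither is the real implementation of either router):

```python
def passthrough(worker_body: dict) -> dict:
    # Miles Router style: forward the worker's JSON unchanged, so extra
    # fields such as meta_info["routed_experts"] survive end-to-end.
    return worker_body

def reconstruct_fixed_schema(worker_body: dict) -> dict:
    # Fixed-schema style: only fields the gateway knows about are copied,
    # so training-only metadata is silently dropped.
    return {
        "text": worker_body.get("text"),
        "meta_info": {
            "finish_reason": worker_body.get("meta_info", {}).get("finish_reason"),
        },
    }
```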

For more details on SGLang Model Gateway, see the [official documentation](https://docs.sglang.io/advanced_features/sgl_model_gateway.html).

### When to use which

- Use Miles Router when you need R3 or radix-tree caching
- Use SGLang Model Gateway for everything else (recommended default)

1 change: 1 addition & 0 deletions docs/en/get_started/customization.md
@@ -417,3 +417,4 @@ Stabilize MoE RL training by recording and replaying expert routing decisions to
| `--use-routing-replay` | Forward-backward routing consistency in training. ([arXiv:2507.18071](https://arxiv.org/abs/2507.18071)) |
| `--use-rollout-routing-replay` | R3: Replay routing from rollout during training. **Requires `--use-miles-router`**. ([arXiv:2510.11370](https://arxiv.org/abs/2510.11370)) |

For a detailed explanation of R3 and Miles Router, see [Miles Router](../advanced/miles-router.md).
1 change: 1 addition & 0 deletions docs/en/index.rst
@@ -41,6 +41,7 @@ miles is the RL framework behind GLM-4.7, GLM-4.6 and GLM-4.5. Apart from models
:caption: Advanced Features

_examples_synced/reproducibility/README.md
advanced/miles-router.md
advanced/speculative-decoding.md
advanced/fault-tolerance.md
advanced/arch-support-beyond-megatron.md