[Doc] Add doc for slime router #538
Summary of Changes

Hello @Hecate0821, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces new documentation for the 'Miles Router' component, a lightweight HTTP router/proxy designed for training-oriented capabilities.
Code Review
This pull request adds comprehensive documentation for the Miles Router, explaining its purpose, features like the radix-tree cache and Rollout Routing Replay (R3), and its differences from the SGLang Model Gateway. However, the features it describes have significant security vulnerabilities in their current implementation. Specifically, there is a high risk of Server-Side Request Forgery (SSRF) due to unauthenticated worker registration, and a risk of cross-user data leakage in the shared radix-tree cache. These issues should be addressed in the code and noted in the documentation. The documentation itself is well-written, but there are a few minor suggestions to improve clarity and readability. Additionally, please correct the typo in the pull request title from 'slime' to 'miles'.
### 2.1 Radix-tree cache (transparent token management)

> Use this when your rollout pipeline is text-in/text-out and you cannot reliably persist token IDs; if you already control token-in/token-out (e.g. search r1, multiturn VLM examples), you likely don't need the radix-tree cache.

Text-in text-out interfaces can cause token retokenization mismatches - re-tokenizing text at training time may produce different token sequences than rollout, breaking per-token alignment needed for PPO/GRPO losses.

The radix-tree cache solves this transparently: it intercepts text-based requests, tokenizes them, and stores trajectories (text, token IDs, logprobs, loss masks) keyed by the text prefix. After rollout finishes, calling `/retrieve_from_text` returns the exact token sequence with aligned metadata, without requiring any changes to your rollout code.

Implementation-wise, the radix-tree cache:

- Accepts text plus tokens/metadata and stores them in a radix tree
- Uses longest-prefix matching to reuse cached token sequences (enabling token-in/token-out downstream)
- Allows insertion of new text continuations as rollout proceeds (multiple trajectories per prompt, e.g. GRPO)
- Periodically cleans up stale nodes to control memory usage

Use the radix cache when you have text-based rollout code and want token-level precision without rewriting, or when running GRPO with multiple trajectories sharing the same prompt prefix.
The radix-tree cache mechanism described here lacks user isolation, as it is shared across all users and keyed solely by text prefixes. This allows for cross-user data leakage, where one user can retrieve cached token sequences and metadata for another user's prompts. It also enables cache poisoning attacks. The documentation should be updated to reflect these security considerations, and the implementation should be revised to provide data isolation.
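To make the mechanism under discussion concrete, here is a minimal sketch of longest-prefix lookup from text to token IDs. It is not the miles implementation: `PrefixTokenCache` and its methods are hypothetical names, the token IDs are made up, and it uses a plain character trie rather than a compressed radix tree for brevity. Note that entries are keyed by text alone, which is the property the comment above raises: any caller presenting the same text reaches the same entry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class _Node:
    children: dict = field(default_factory=dict)
    token_ids: Optional[list] = None  # set only where a stored text ends


class PrefixTokenCache:
    """Toy text -> token-ID cache with longest-prefix lookup (hypothetical)."""

    def __init__(self):
        self._root = _Node()

    def insert(self, text, token_ids):
        """Store the token IDs produced for exactly this text."""
        node = self._root
        for ch in text:
            node = node.children.setdefault(ch, _Node())
        node.token_ids = list(token_ids)

    def longest_prefix(self, text):
        """Return (matched_text, token_ids) for the longest stored prefix."""
        node = self._root
        best = ("", [])
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.token_ids is not None:
                best = (text[: i + 1], node.token_ids)
        return best


cache = PrefixTokenCache()
cache.insert("What is 2+2?", [101, 102, 103])               # prompt
cache.insert("What is 2+2? It is 4.", [101, 102, 103, 201, 202])  # continuation

# A later turn reuses the cached tokens for the longest known prefix
# instead of re-tokenizing the whole text.
matched, ids = cache.longest_prefix("What is 2+2? It is 4. Are you sure?")
branch_matched, branch_ids = cache.longest_prefix("What is 2+2? Maybe 5.")
```

Only the unmatched suffix would need fresh tokenization in such a design; a per-user or per-run namespace in front of the text key is one way the isolation concern could be approached.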
- Registers workers (SGLang HTTP servers) into a local pool
- Routes requests to a selected worker (simple least-inflight load balancing)
- Proxies arbitrary paths to the selected worker (e.g. `/generate`)
The MilesRouter implements a proxying mechanism that allows unauthenticated registration of arbitrary worker URLs via the /add_worker endpoint. This creates a significant Server-Side Request Forgery (SSRF) risk, as an attacker can register internal or restricted URLs and then use the router to access them. Since this documentation introduces the feature, it should include a prominent security warning, and the underlying implementation in miles/router/router.py should be updated to include authentication and URL validation.
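As one possible shape for the URL validation mentioned above, the sketch below accepts a worker URL only if it is plain HTTP(S) and its host is a literal IP address inside an explicitly allowed subnet. Everything here is an assumption for illustration: `is_allowed_worker_url`, `ALLOWED_NETWORKS`, and the `10.0.0.0/24` subnet are hypothetical and do not come from `miles/router/router.py`. It also does not cover authentication, which would need to be handled separately.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical subnet where trusted SGLang workers are expected to live.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.0.0.0/24")]


def is_allowed_worker_url(url, allowed_networks=ALLOWED_NETWORKS):
    """Return True only for http(s) URLs whose host is an allowlisted IP."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or parts.hostname is None:
            return False
        # Hostnames are rejected outright in this sketch: accepting them
        # would require resolving DNS and re-checking the resolved address.
        addr = ipaddress.ip_address(parts.hostname)
    except ValueError:
        return False
    return any(addr in net for net in allowed_networks)


ok = is_allowed_worker_url("http://10.0.0.5:30000")
metadata = is_allowed_worker_url("http://169.254.169.254/latest/meta-data")
local_file = is_allowed_worker_url("file:///etc/passwd")
by_name = is_allowed_worker_url("http://localhost:8000")
```

An allowlist is used here rather than a blocklist because a list of known-bad internal ranges is easy to leave incomplete.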
- Runs periodic health checks and quarantines unhealthy workers
- Supports middleware plugins (via `--miles-router-middleware-paths`) to implement rollout-specific processing (e.g. caching, request/response transforms)

In miles's architecture, the router is part of the rollout system ("SGLang + router") that generates samples and pushes them into the data buffer.
The possessive form miles's is a bit awkward to read. For better flow, I suggest rephrasing to "In the miles architecture".
Suggested change:

Original: In miles's architecture, the router is part of the rollout system ("SGLang + router") that generates samples and pushes them into the data buffer.

Suggested: In the miles architecture, the router is part of the rollout system ("SGLang + router") that generates samples and pushes them into the data buffer.
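For readers unfamiliar with the "least-inflight" policy named in the router's feature list, a minimal sketch follows. It assumes the router keeps a per-worker count of requests in progress; `WorkerPool`, `acquire`, and `release` are hypothetical names, not the MilesRouter API, and health checks and concurrency control are left out.

```python
class WorkerPool:
    """Toy least-inflight selector over registered worker URLs (hypothetical)."""

    def __init__(self):
        self._inflight = {}  # worker URL -> number of requests in progress

    def add_worker(self, url):
        # Registering the same URL twice keeps its existing counter.
        self._inflight.setdefault(url, 0)

    def acquire(self):
        """Pick the worker with the fewest in-flight requests and count one more."""
        if not self._inflight:
            raise RuntimeError("no workers registered")
        url = min(self._inflight, key=self._inflight.get)
        self._inflight[url] += 1
        return url

    def release(self, url):
        """Call when the proxied request finishes, whether it succeeded or failed."""
        self._inflight[url] -= 1


pool = WorkerPool()
pool.add_worker("http://worker-a:30000")
pool.add_worker("http://worker-b:30000")

first = pool.acquire()    # both idle, so the first registered worker wins the tie
second = pool.acquire()   # worker-a is busy, so worker-b is chosen
pool.release(first)
third = pool.acquire()    # worker-a is idle again
```

In a real proxy the `release` call would typically sit in a `finally` block so that failed requests do not leave a worker looking permanently busy.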
- `--enable-return-routed-experts`: server argument to enable routing capture
- `RoutedExpertsCapturer`: captures `topk_ids` (selected expert IDs) at each MoE layer during forward pass
- `return_routed_experts`: request parameter to retrieve routing data
- Returns `routed_experts` in response `meta_info` - a `[seq_len - 1, num_layers, top_k]` tensor of expert IDs
This sentence could be clearer. I suggest rephrasing to explicitly state that routed_experts is in the meta_info field and then describe the tensor.
Suggested change:

Original: - Returns `routed_experts` in response `meta_info` - a `[seq_len - 1, num_layers, top_k]` tensor of expert IDs

Suggested: - Returns `routed_experts` in the response's `meta_info` field - a `[seq_len - 1, num_layers, top_k]` tensor of expert IDs
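The `[seq_len - 1, num_layers, top_k]` layout can be illustrated with a toy nested list. This is only a sketch of how the three axes index: the expert IDs are made up, `routed_experts_shape` is a hypothetical helper, and the actual wire format of `routed_experts` in `meta_info` is not specified here.

```python
def routed_experts_shape(routed):
    """Return (positions, num_layers, top_k) for a nested list; raise if ragged."""
    positions = len(routed)
    num_layers = len(routed[0])
    top_k = len(routed[0][0])
    for per_position in routed:
        if len(per_position) != num_layers:
            raise ValueError("inconsistent number of layers")
        for per_layer in per_position:
            if len(per_layer) != top_k:
                raise ValueError("inconsistent top_k")
    return positions, num_layers, top_k


# Toy example: seq_len = 4 gives 3 positions; 2 MoE layers; top_k = 2.
seq_len = 4
routed = [
    [[0, 3], [1, 2]],  # position 0: experts chosen at layer 0, then layer 1
    [[2, 3], [0, 1]],  # position 1
    [[1, 3], [0, 2]],  # position 2
]

shape = routed_experts_shape(routed)
experts_pos1_layer0 = routed[1][0]  # the top_k expert IDs for position 1, layer 0
```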
MilesRouter is a lightweight Python/FastAPI proxy that acts as a passthrough to SGLang workers. This passthrough design enables RL-specific features like radix-tree trajectory caching and R3 (which require preserving raw response metadata like `routed_experts`).

SGLang Model Gateway is a high-performance Rust-based router optimized for large-scale inference: async non-blocking routing, advanced fault tolerance (retries, circuit breakers), multiple load balancing policies (including cache-aware routing), and PD disaggregation support. However, it reconstructs responses with a fixed schema, so it does not preserve the metadata needed for miles's R3 flow.
The possessive form miles's is a bit awkward. For better readability, I suggest changing "miles's R3 flow" to "the miles R3 flow".
Suggested change:

Original: SGLang Model Gateway is a high-performance Rust-based router optimized for large-scale inference: async non-blocking routing, advanced fault tolerance (retries, circuit breakers), multiple load balancing policies (including cache-aware routing), and PD disaggregation support. However, it reconstructs responses with a fixed schema, so it does not preserve the metadata needed for miles's R3 flow.

Suggested: SGLang Model Gateway is a high-performance Rust-based router optimized for large-scale inference: async non-blocking routing, advanced fault tolerance (retries, circuit breakers), multiple load balancing policies (including cache-aware routing), and PD disaggregation support. However, it reconstructs responses with a fixed schema, so it does not preserve the metadata needed for the miles R3 flow.
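The difference between a passthrough proxy and one that rebuilds responses against a fixed schema can be shown with a small sketch. The field names in `KNOWN_META_FIELDS` and the sample response are hypothetical and do not describe the actual SGLang Model Gateway schema; the point is only that fields outside a fixed schema are dropped on reconstruction.

```python
def passthrough(worker_response):
    """Forward the worker's response unchanged."""
    return worker_response


# Hypothetical fixed schema: only these meta_info keys survive reconstruction.
KNOWN_META_FIELDS = ("finish_reason",)


def rebuild_fixed_schema(worker_response):
    """Rebuild the response from known fields only, dropping everything else."""
    meta = worker_response.get("meta_info", {})
    return {
        "text": worker_response.get("text", ""),
        "meta_info": {k: meta[k] for k in KNOWN_META_FIELDS if k in meta},
    }


worker_response = {
    "text": "4",
    "meta_info": {
        "finish_reason": "stop",
        "routed_experts": [[[0, 3], [1, 2]]],  # made-up routing data
    },
}

kept = passthrough(worker_response)
rebuilt = rebuild_fixed_schema(worker_response)
```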
No description provided.