docs: add Miles server arguments #517
Conversation
Summary of Changes (Gemini Code Assist): This pull request significantly enhances the documentation for the Miles server by introducing a dedicated guide that lists and explains all command-line arguments. The new guide gives users an exhaustive reference for fine-tuning Miles' behavior, covering everything from hardware resource allocation and training backend selection to advanced reinforcement learning parameters and debugging options, streamlining the configuration process and improving the overall user experience.
Code Review
This pull request introduces comprehensive documentation for the Miles server arguments in a new markdown file. The documentation is well-organized and detailed. My review focuses on improving clarity and consistency in the argument descriptions. I've pointed out a couple of minor inconsistencies that could confuse users. Overall, this is a valuable addition to the project.
Hi @Ratish1, does it make sense to add the arguments for checkpointing? Also, I think there are some arguments not covered in this PR, e.g., FSDP: `--deterministic-mode`. It might make sense to provide a comprehensive overview of all arguments.
Hey @zijiexia, I have added even more server arguments; could you let me know if it looks good now? Thanks.
Does this not affect disaggregated mode?
> Disable weights backuper to save host memory. By default, this feature is enabled.

Please explain in one or two lines what the weights backuper is and its trade-off.
> Path to the Huggingface checkpoint used to initialize SGLang and provide the tokenizer. It must have the same architecture as the model being trained. It doesn't necessarily need to contain the most up-to-date parameters.

This looks weird to me. I think we can only keep this:
> Skip special tokens in the response. Useful when the response is used as a prompt for the next rollout.

Is this needed in multi-turn RL? If so, please stress this.
What does this mean?
What do you mean by "manage the data by yourself"? Please make it clearer.
Did you check the consistency with https://github.com/radixark/miles/blob/main/docs/en/get_started/quick_start.md? If not, please check it 😂
| `--tis-clip-low` | Lower bound clipping threshold C for importance sampling ratios to control variance. | `0.0` | Type: float | Miles Native |
| `--custom-tis-function-path` | Path to a custom TIS or MIS function. [Ref](../get_started/customization.md#10-custom-tisrs-function---custom-tis-function-path) | `None` | Type: str | Miles Native |
| `--custom-pg-loss-reducer-function-path` | Custom reducer function for policy gradient loss. [Ref](../get_started/customization.md#11-custom-pg-loss-reducer---custom-pg-loss-reducer-function-path) | `None` | Type: str | Miles Native |
| `--use-routing-replay` | Enable [Routing Replay](https://arxiv.org/abs/2507.18071). | `False` | bool flag (set to enable) | Miles Native |
Enable R2 for MoE: record expert routing decisions during the forward pass and replay them during the backward pass. This is automatically set to `True` when `--use-rollout-routing-replay` is enabled.
| `--balance-data` | Repartition each rollout batch so each data-parallel rank gets a similar total token count via the Karmarkar-Karp method. It may be beneficial for training speed, but changes per-rank sample grouping and adds a small CPU scheduling overhead. | `False` | bool flag (set to enable) | Miles Native |
| `--data-pad-size-multiplier` | Multiplier used to calculate the sequence padding boundary. Miles rounds sequence lengths up to a multiple of `tensor_parallel_size * data_pad_size_multiplier`. This optimization ensures that matrix dimensions are aligned with NVIDIA Tensor Core requirements, maximizing throughput and reducing VRAM fragmentation. | `128` | Type: int | Miles Native |
| `--micro-batch-size` | Micro batch size per GPU. Ignored when `--use-dynamic-batch-size` is enabled. | `1` | Type: int | Megatron-LM (Reset by Miles) |
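As a rough illustration of what `--balance-data` aims for, here is a minimal sketch that assigns samples to data-parallel ranks with a greedy longest-first heuristic. This is a simplified stand-in, not the Karmarkar-Karp differencing method the table describes; the goal (similar total token counts per rank) is the same.

```python
import heapq

def balance(token_counts, num_ranks):
    """Assign samples (given by token count) to ranks so the per-rank
    token totals stay similar. Greedy longest-first heuristic; Miles
    uses the stronger Karmarkar-Karp method for this."""
    # Min-heap of (current total, rank id, assigned sample lengths).
    heap = [(0, rank, []) for rank in range(num_ranks)]
    heapq.heapify(heap)
    for n in sorted(token_counts, reverse=True):
        total, rank, samples = heapq.heappop(heap)  # lightest rank so far
        heapq.heappush(heap, (total + n, rank, samples + [n]))
    return {rank: (total, samples) for total, rank, samples in heap}

print(balance([512, 300, 290, 60, 50], 2))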
Could you add one more parameter, `--seq-length`? It's a very confusing param in Megatron but not effective in Miles at all. Ref #574 (comment)
| `--n-samples-per-prompt` | Number of responses to generate for each prompt, e.g., the group size of GRPO. | `1` | Type: int | Miles Native |
| `--global-batch-size` | Total samples per optimizer step. Automatically calculated or **overridden** if `num_steps_per_rollout` is set. | `None` | Type: int | Megatron-LM (Reset by Miles) |
| `--num-steps-per-rollout` | The number of training steps to perform using the data collected in a single rollout round. Setting this to `n` means the policy model will be updated `n` times using the same batch of rollout data. Miles ensures that `(rollout-batch-size * n-samples-per-prompt) = (global-batch-size * num-steps-per-rollout)`. If this value is not provided, you have to set `--global-batch-size` explicitly. If both are provided, `--num-steps-per-rollout` will **override** the global batch size with `global_batch_size = (rollout_batch_size * n_samples_per_prompt) // num_steps_per_rollout`. | `None` | Type: int | Miles Native |
| `--use-dynamic-batch-size` | Dynamically packs variable-length samples into micro-batches to maximize GPU utilization, ensuring the total token count per batch does not exceed `--max-tokens-per-gpu`. For example, with a 300-token limit, samples of lengths 100, 200, and 300 would be packed into two batches: `[100, 200]` and `[300]`. **Note:** Miles ensures that enabling this optimization does not affect the mathematical correctness of per-sample or per-token loss calculation. It is **strongly recommended** to enable this for maximum efficiency. | `False` | bool flag (set to enable) | Miles Native |
This can only be enabled when `--qkv-format` is `thd`; it does not work for `bshd`.
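The `[100, 200]` / `[300]` packing example for `--use-dynamic-batch-size` can be sketched as a simple first-fit pass over sorted sample lengths. This is an illustration of the packing constraint only, not Miles' actual scheduler (which also balances ranks and preserves loss correctness):

```python
def pack_samples(lengths, max_tokens_per_gpu):
    """Greedily pack sample lengths into micro-batches so the total
    token count per batch never exceeds the per-GPU budget."""
    batches, current, current_tokens = [], [], 0
    for n in sorted(lengths):
        # Start a new micro-batch once the budget would be exceeded.
        if current and current_tokens + n > max_tokens_per_gpu:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(n)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

# The example from the table: a 300-token budget with samples 100/200/300.
print(pack_samples([100, 200, 300], 300))  # [[100, 200], [300]]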
| Argument | Description | Default | Options | Source |
| :--- | :--- | :--- | :--- | :--- |
| `--train-backend` | The backend for training. Megatron is highly recommended for numerical stability and efficiency. | `"megatron"` | `megatron`, `fsdp` | Miles Native |
| `--qkv-format` | The QKV layout. | `"thd"` | `thd`, `bshd` | Miles Native |
Write more about this param. New models may not support `thd`, only `bshd`.
You might say something like: whether to pack all variable-length sequences along the token dimension. The `thd` format is the default because it is faster than `bshd`, saving padding overhead. However, for new models with novel attention architectures (e.g., sparse attention, attention sinks), the `thd` format may lack training-backend support. Use `bshd` to train those models.
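To make the layout difference concrete, here is a toy sketch (plain lists instead of tensors) of the two representations: `thd` concatenates all sequences into one token stream plus cumulative boundaries, while `bshd` pads every sequence to the longest one.

```python
def to_thd(seqs):
    """Pack variable-length sequences into one token stream plus
    cumulative sequence boundaries (cu_seqlens), as in the thd layout."""
    tokens, cu_seqlens = [], [0]
    for s in seqs:
        tokens.extend(s)
        cu_seqlens.append(len(tokens))
    return tokens, cu_seqlens

def to_bshd(seqs, pad=0):
    """Pad every sequence to the longest one, as in the bshd layout.
    The padding tokens are wasted compute, which is why thd is
    usually faster."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad] * (max_len - len(s)) for s in seqs]

seqs = [[1, 2, 3], [4, 5]]
print(to_thd(seqs))   # ([1, 2, 3, 4, 5], [0, 3, 5])
print(to_bshd(seqs))  # [[1, 2, 3], [4, 5, 0]]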
| `--log-probs-max-tokens-per-gpu` | The maximum number of tokens per GPU when calculating log probs. This is used to compute the log probs of the responses during rollout, and can be set to a larger value than `max_tokens_per_gpu` for better performance. | `None` | Type: int | Miles Native |
| `--balance-data` | Repartition each rollout batch so each data-parallel rank gets a similar total token count via the Karmarkar-Karp method. It may be beneficial for training speed, but changes per-rank sample grouping and adds a small CPU scheduling overhead. | `False` | bool flag (set to enable) | Miles Native |
| `--data-pad-size-multiplier` | Multiplier used to calculate the sequence padding boundary. Miles rounds sequence lengths up to a multiple of `tensor_parallel_size * data_pad_size_multiplier`. This optimization ensures that matrix dimensions are aligned with NVIDIA Tensor Core requirements, maximizing throughput and reducing VRAM fragmentation. | `128` | Type: int | Miles Native |
| `--micro-batch-size` | Micro batch size per GPU. Ignored when `--use-dynamic-batch-size` is enabled. | `1` | Type: int | Megatron-LM (Reset by Miles) |
This does not work for `--qkv-format=thd`.
You mean `--micro-batch-size`? That also works for `thd`; both dynamic and specified micro batch sizes work for `thd`, but only a specified one works for `bshd`.
Sorry, Yueming is right. Ignore my words.
| `--sglang-mem-fraction-static` | Fraction of GPU memory to reserve for SGLang KV cache. | `0.9` | Type: float | SGLang |
| `--sglang-server-concurrency` | Maximum number of concurrent requests. | `512` | Type: int | SGLang |
| `--sglang-router-ip` | IP address of the SGLang router. | `None` | Type: str | SGLang Gateway |
| `--sglang-router-port` | Port of the SGLang router. | `None` | Type: int | SGLang Gateway |
| Argument | Description | Default | Options | Source |
| :--- | :--- | :--- | :--- | :--- |
| `--sglang-mem-fraction-static` | Fraction of GPU memory to reserve for SGLang KV cache. | `0.9` | Type: float | SGLang |
Why 0.9 here? It's too large. 0.7 to 0.8 is good.
I think the default value for `--sglang-mem-fraction-static` is 0.9.
| Argument | Description | Default | Options | Source |
| :--- | :--- | :--- | :--- | :--- |
| `--check-weight-update-equal` | Verify that weight updates are equal across ranks. | `False` | bool flag (set to enable) | Miles Native |
| `--save-debug-rollout-data` | Path to save rollout data for offline analysis. | `None` | Type: str | Miles Native |
Add `--save-debug-rollout-data`, `--load-debug-rollout-data`, `--debug-rollout-only`, and `--debug-train-only`; refer to debug.md.
| `--disable-grpo-std-normalization` | Disable standard deviation normalization for GRPO. From [Dr.GRPO](https://arxiv.org/pdf/2503.20783) | `False` | bool flag (set to enable) | Miles Native |
| `--disable-rewards-normalization` | Disable the default group-wise reward normalization for GRPO, GSPO, and REINFORCE++. This effectively skips the baseline subtraction step. | `False` | bool flag (set to enable) | Miles Native |
| `--use-rollout-entropy` | Enable entropy calculation when computing the logprobs from the actor and reference models. This is useful for implementing custom entropy-based loss masking. | `False` | bool flag (set to enable) | Miles Native |
| `--use-rollout-logprobs` | Use rollout logprobs for importance sampling ratios; use the logprobs from the actor model if not set. If `--get-mismatch-metrics` is set, the log probs will be recomputed by the training engine, applying one more forward pass. | `False` | bool flag (set to enable) | Miles Native |
Please check the code logic in loss.py. Maybe it should be "use rollout logprobs as the old-policy logprobs for importance sampling ratios in GRPO/GSPO"?
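For readers unfamiliar with the distinction being discussed, here is a tiny illustration (simplified, per-token) of how the choice of old-policy logprobs changes the importance sampling ratio:

```python
import math

def is_ratio(new_logprob, old_logprob):
    """Per-token importance sampling ratio pi_new / pi_old,
    computed in log space for numerical stability."""
    return math.exp(new_logprob - old_logprob)

# With --use-rollout-logprobs, the "old" logprobs are the ones the
# inference engine reported during rollout; otherwise the actor model
# recomputes them with the training engine. The two can differ slightly,
# which shifts the ratio.
rollout_lp, recomputed_lp, new_lp = -1.05, -1.00, -0.90
print(is_ratio(new_lp, rollout_lp))     # ratio vs. rollout logprobs
print(is_ratio(new_lp, recomputed_lp))  # ratio vs. recomputed logprobs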
| `--rollout-batch-size` | Number of prompts per rollout batch. The total data returned should be `rollout_batch_size * n_samples_per_prompt`. | Required | Type: int | Miles Native |
| `--n-samples-per-prompt` | Number of responses to generate for each prompt, e.g., the group size of GRPO. | `1` | Type: int | Miles Native |
| `--global-batch-size` | Total samples per optimizer step. Automatically calculated or **overridden** if `num_steps_per_rollout` is set. | `None` | Type: int | Megatron-LM (Reset by Miles) |
| `--num-steps-per-rollout` | The number of training steps to perform using the data collected in a single rollout round. Setting this to `n` means the policy model will be updated `n` times using the same batch of rollout data. Miles ensures that `(rollout-batch-size * n-samples-per-prompt) = (global-batch-size * num-steps-per-rollout)`. If this value is not provided, you have to set `--global-batch-size` explicitly. If both are provided, `--num-steps-per-rollout` will **override** the global batch size with `global_batch_size = (rollout_batch_size * n_samples_per_prompt) // num_steps_per_rollout`. | `None` | Type: int | Miles Native |
Could you check what happens if multiple samples are returned in the generate function (maybe the refactored one)?
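The batch-size invariant described in the table above can be checked with a few lines of arithmetic. This sketch shows how the global batch size follows from the other three knobs when `--num-steps-per-rollout` is set:

```python
def resolve_global_batch_size(rollout_batch_size, n_samples_per_prompt,
                              num_steps_per_rollout):
    """Invariant from the docs:
    rollout_batch_size * n_samples_per_prompt
        == global_batch_size * num_steps_per_rollout."""
    total = rollout_batch_size * n_samples_per_prompt
    assert total % num_steps_per_rollout == 0, "rollout data must split evenly"
    return total // num_steps_per_rollout

# 32 prompts x 8 responses = 256 samples; 2 optimizer steps per rollout
# round gives a global batch size of 128.
print(resolve_global_batch_size(32, 8, 2))  # 128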
```python
choices=["thd", "bshd"],
default="thd",
help="The qkv layout for Megatron backend.",
help="The qkv layout.",
```
Why the change here? More details needed.
I think I changed this because the same parameter also applies to the FSDP backend. I wonder if this could make it less confusing.
Ah, yes, this was a historical issue: at the beginning it was only supported for Megatron, but after the refactor, as a general util, it should also work for FSDP (though I did not test FSDP + bshd).
```python
"which will be used as the prompt and the label respectively. "
"If you want to use a custom template, you can set --apply-chat-template to true, in that case, "
"the input should be the same structure as an openai message, e.g. [{'role': 'user', 'content': 'blabla'}]. "
"which will be used as the prompt and the label respectively."
```
| `--max-tokens-per-gpu` | The maximum number of tokens (prompt + response combined) per GPU for dynamic batch size. This parameter defines the total sequence length budget for packing samples into micro-batches during training. Note that when enabling context parallel (CP), the effective capacity is shared, so the value should be approximately `Total_Sequence_Length // cp_size`. | `None` | Type: int | Miles Native |
| `--log-probs-max-tokens-per-gpu` | The maximum number of tokens per GPU when calculating log probs. This is used to compute the log probs of the responses during rollout, and can be set to a larger value than `max_tokens_per_gpu` for better performance. | `None` | Type: int | Miles Native |
| `--balance-data` | Repartition each rollout batch so each data-parallel rank gets a similar total token count via the Karmarkar-Karp method. It may be beneficial for training speed, but changes per-rank sample grouping and adds a small CPU scheduling overhead. | `False` | bool flag (set to enable) | Miles Native |
| `--data-pad-size-multiplier` | Multiplier used to calculate the sequence padding boundary. Miles rounds sequence lengths up to a multiple of `tensor_parallel_size * data_pad_size_multiplier`. This optimization ensures that matrix dimensions are aligned with NVIDIA Tensor Core requirements, maximizing throughput and reducing VRAM fragmentation. | `128` | Type: int | Miles Native |
Add a notice: better not to change this. Values below 128 may trigger accuracy loss under `thd` with TP >= 4.
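The padding-boundary rule from the `--data-pad-size-multiplier` row is a simple round-up. A minimal sketch of the calculation:

```python
def padded_length(seq_len, tensor_parallel_size, data_pad_size_multiplier=128):
    """Round a sequence length up to the padding boundary
    tensor_parallel_size * data_pad_size_multiplier."""
    boundary = tensor_parallel_size * data_pad_size_multiplier
    return ((seq_len + boundary - 1) // boundary) * boundary

# With TP=4 and the default multiplier, lengths are rounded up to
# multiples of 512, so a 700-token sequence is padded to 1024.
print(padded_length(700, 4))  # 1024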
This line-level comment is hard to use :( If a comment spans multiple lines, the line to change should be the last one.
| Argument | Description | Default | Options | Source |
| :--- | :--- | :--- | :--- | :--- |
| `--check-weight-update-equal` | Verify that weight updates are equal across ranks. | `False` | bool flag (set to enable) | Miles Native |
Suggested change:

```diff
-| `--check-weight-update-equal` | Verify that weight updates are equal across ranks. | `False` | bool flag (set to enable) | Miles Native |
+| `--check-weight-update-equal` | Use SGLang's weight checker to check and ensure that the weights loaded from the HF checkpoint and those received from Megatron are bit-wise equal. | `False` | bool flag (set to enable) | Miles Native |
```

(Suggest being more specific about this.)
Hey @guapisolo @yueming-yuan, thank you so much for the reviews. I have addressed all of them; let me know if you need more changes. Also @guapisolo, thanks for the note about "multi-sample" returns from the generate function. I clarified this under `--custom-generate-function-path`: in the refactored interface a custom generate can return `list[Sample]`, but the default rollout/training pipelines expect each prompt group to be a flat `list[Sample]` of length `--n-samples-per-prompt` (they assert `len(group) == n_samples_per_prompt`). So if users return multiple samples per generate call, they'll need a compatible rollout pipeline that handles that structure. Let me know if this sounds good. Thanks again!
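To illustrate the group-size contract described above, here is a toy sketch. The `Sample` dataclass and `custom_generate` signature are hypothetical stand-ins for illustration only; the real types and generate interface live in the Miles codebase.

```python
from dataclasses import dataclass

# Hypothetical stand-in for Miles' Sample type, for illustration only.
@dataclass
class Sample:
    prompt: str
    response: str

def custom_generate(prompt, n_samples_per_prompt):
    """A custom generate function should return a flat list[Sample]
    whose length equals --n-samples-per-prompt; the default rollout
    pipeline asserts this group size."""
    group = [Sample(prompt, f"response-{i}")
             for i in range(n_samples_per_prompt)]
    assert len(group) == n_samples_per_prompt
    return group

group = custom_generate("2+2=?", 4)
print(len(group))  # 4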
Co-authored-by: Zijie Xia <zijie_xia@icloud.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com> Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>













This PR adds complete docs for Miles server arguments.