
Conversation


@Ratish1 Ratish1 commented Jan 25, 2026

This PR adds complete docs for Miles server arguments.

@gemini-code-assist

Summary of Changes

Hello @Ratish1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the documentation for the Miles server by introducing a dedicated guide that meticulously lists and explains all command-line arguments. This new resource provides users with an exhaustive reference for fine-tuning Miles' behavior, covering everything from hardware resource allocation and training backend selection to advanced reinforcement learning parameters and debugging options, ultimately streamlining the configuration process and improving the overall user experience.

Highlights

  • New Documentation File: A new comprehensive documentation file, docs/en/advanced/miles_server_args.md, has been added to detail all available command-line arguments for configuring the Miles server.
  • Detailed Argument Listing: The new documentation meticulously lists and describes arguments across various categories, including Cluster and Resource Management, Training Backend, Rollout Management, Data Handling, Evaluation, RL Algorithms, Logging, Fault Tolerance, Miles Router, Reward Models, Buffer Management, Multi-Token Prediction (MTP), SGLang, FSDP, Debugging, and Environment Variables.
  • Minor Code Cleanup: A small formatting correction was made in miles/utils/arguments.py to remove a trailing comma from the help string of the --log-reward-category argument.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive documentation for the Miles server arguments in a new markdown file. The documentation is well-organized and detailed. My review focuses on improving clarity and consistency in the argument descriptions. I've pointed out a couple of minor inconsistencies that could confuse users. Overall, this is a valuable addition to the project.

help=(
    "Log statistics of the category of reward, such as why the reward function considers it as failed. "
-   "Specify the key in the reward dict using this argument.",
+   "Specify the key in the reward dict using this argument."
@Ratish1 replied:
This was needed so that the command python3 train.py --help works; otherwise it returned an error.
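
For illustration, a minimal sketch (not the actual Miles code) of the failure mode: inside `help=(...)`, the trailing comma turns the implicitly concatenated strings into a tuple, and argparse crashes when it tries to format a tuple as help text.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-reward-category",
    help=(
        "Log statistics of the category of reward. "
        "Specify the key in the reward dict using this argument.",  # trailing comma -> tuple, not str
    ),
)
parser.parse_args(["--help"])  # crashes while formatting the help text instead of printing it
```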

@Ratish1 replied:
This has been merged through another PR now.

@zijiexia

Hi @Ratish1, does it make sense to add the arguments for checkpointing? Also, I think there are some arguments not covered in this PR, e.g., FSDP's --deterministic-mode. It might make sense to provide a comprehensive overview of all arguments.

@Ratish1 Ratish1 commented Jan 27, 2026

> Hi @Ratish1, does it make sense to add the arguments for checkpointing? Also, I think there are some arguments not covered in this PR, e.g., FSDP's --deterministic-mode. It might make sense to provide a comprehensive overview of all arguments.

Hey @zijiexia, I have added even more server arguments. Could you let me know if it looks good now? Thanks.

@Ratish1 Ratish1 requested a review from zijiexia January 28, 2026 05:51
@zijiexia

Hi @Ratish1, I've made some changes to the docs; see this PR: Ratish1#1. Please let me know what you think. Thanks!

@zhaochenyang20

> Total GPUs per node on the machine. Specify if using fewer than 8 GPUs per node in colocate mode.

Does this not affect disaggregated mode?

@zhaochenyang20

These expressions are clear; I just find them quite strange.


What if users turn on these params when using disaggregated mode? In other words, I think we should not have these parameters; just having --collocate is enough.

Updated the description of the --true-on-policy-mode parameter for clarity and added a reference link.
@zhaochenyang20

> Disable weights backuper to save host memory. By default, this feature is enabled.

Please explain in one or two lines what the weights backuper is and its trade-off.

Updated descriptions for several server arguments to improve clarity and added references for better understanding.
@zhaochenyang20


Explain in what case we should keep the old actor, and the trade-off of keeping it.

@zhaochenyang20


Put `--prompt-data` before `--disable-rollout-global-dataset`.

@zhaochenyang20

> Disable the global dataset for rollout. If set, the rollout will use the --prompt-data as the prompt dataset, and the prompts for rollout will be sampled from the dataset. If not set, you need to manage the data by yourself.

What do you mean by "manage the data by yourself"? Please make it clearer.

@zhaochenyang20


I think the sentence "If you want to use a custom template, you can set --apply-chat-template to true" is redundant here. Put this description under --apply-chat-template and explain how to use a customized chat template.

@zhaochenyang20

Put the first two arguments later, with their related arguments.

@zhaochenyang20


What is gbs here? Also, please state that setting --num-steps-per-rollout to n means that each batch of rollout data should update the policy model n times.

And should the default value be 1 rather than None?
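
For reference, a hedged sketch of the relationship being asked about here; the names and numbers are illustrative and follow only the description above, not Miles internals.

```python
# With --num-steps-per-rollout = n, one rollout batch is consumed in n policy
# updates, so the per-step global batch size ("gbs") is the rollout sample
# count divided by n.
rollout_batch_size = 32       # prompts sampled per rollout (assumed name)
n_samples_per_prompt = 8      # responses generated per prompt (assumed name)
num_steps_per_rollout = 4     # --num-steps-per-rollout

samples_per_rollout = rollout_batch_size * n_samples_per_prompt   # 256 samples
global_batch_size = samples_per_rollout // num_steps_per_rollout  # 64 samples per policy update
```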

@zhaochenyang20

Did you check the consistency with https://github.com/radixark/miles/blob/main/docs/en/get_started/quick_start.md?

If not, please check it 😂

@zhaochenyang20


I think "max tokens per gpu should be around max_response_len // cp_size instead of max_response_len"

Max tokens per GPU should be strongly related to the sequence length (prompt + response), not only the response length, right?
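
As a rough illustration of this point (all numbers and variable names below are assumptions, not recommendations):

```python
# With context parallelism, each GPU holds a 1/cp_size slice of every packed
# sequence, so the per-GPU token budget should be sized against the full
# sequence length (prompt + response) divided by cp_size.
max_prompt_len = 2048
max_response_len = 6144
cp_size = 2

full_seq_len = max_prompt_len + max_response_len  # 8192 tokens per sample
max_tokens_per_gpu = full_seq_len // cp_size      # ~4096 would be a sensible budget here
```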

@zhaochenyang20


I am quite confused by this parameter. I think for a given response, we should calculate all the tokens' log probs. The current explanation makes it sound like tokens beyond --log-probs-max-tokens-per-gpu will not have their log probs calculated.
Is this instead a batch-size parameter for Megatron, i.e., how many tokens' log probs are calculated per batch?

@zhaochenyang20

I don't quite understand:


Why is this related to verl? Please give a clearer explanation and the trade-off.

@zhaochenyang20

  1. Explain the first one in more detail.
  2. Are you sure it is per GPU? Micro batch size per GPU? The parameter name does not have "per GPU" in it.

@zhaochenyang20

zhaochenyang20 commented Jan 31, 2026

two more suggestions:

Based on our discussion, here are the documentation optimization requirements for the miles_server_args.md file, focused on classification and mapping:

  1. Clear Argument Attribution and Categorization
     • Source Labeling: Each argument must be explicitly labeled with its origin to distinguish between Miles native arguments, Megatron passthroughs, and SGLang/SGLang Model Gateway arguments.
  2. Explicit Mapping of Passthrough Relationships
     • Passthrough Logic: Clearly define the relationship between Miles and its underlying frameworks.
     • Naming Conventions: Document the existing prefix rules to help users identify the target backend (see the sketch below):
       • --sglang-*: Arguments passed directly to SGLang.
       • --router-*: Arguments directed to the SGLang Model Gateway/Router.
       • No Prefix: Default arguments corresponding to Megatron-LM.
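
A minimal sketch of what this prefix convention implies; the routing logic and the example argument names below are illustrative assumptions, not Miles internals.

```python
# Group raw CLI arguments by prefix before handing them to the corresponding backend.
raw_args = {
    "--sglang-mem-fraction-static": "0.8",  # hypothetical SGLang passthrough
    "--router-port": "30000",               # hypothetical Router / Model Gateway argument
    "--lr": "1e-6",                         # no prefix -> training backend (Megatron-LM)
}

sglang_args = {k: v for k, v in raw_args.items() if k.startswith("--sglang-")}
router_args = {k: v for k, v in raw_args.items() if k.startswith("--router-")}
backend_args = {k: v for k, v in raw_args.items()
                if not k.startswith(("--sglang-", "--router-"))}
```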

@Ratish1 Ratish1 commented Jan 31, 2026

> What is gbs here? Also, please state that setting `--num-steps-per-rollout` to `n` means that each batch of rollout data should update the policy model `n` times.
>
> And should the default value be 1 rather than None?

Currently in the codebase it seems to be None. Is that intended behaviour? Should I update it in this PR itself?

@Ratish1 Ratish1 commented Jan 31, 2026

> 1. Explain the first one in more detail. 2. Are you sure it is per GPU? Micro batch size per GPU? The parameter name does not have "per GPU" in it.

Yes, micro batch size is per GPU.

@zijiexia zijiexia commented Jan 31, 2026

> I am quite confused by this parameter. I think for a given response, we should calculate all the tokens' log probs. The current explanation makes it sound like tokens beyond `--log-probs-max-tokens-per-gpu` will not have their log probs calculated. Is this instead a batch-size parameter for Megatron, i.e., how many tokens' log probs are calculated per batch?

I feel like this argument is deprecated; I couldn't find anywhere it is referenced.

| `--use-dynamic-batch-size` | Dynamically packs variable-length samples into micro-batches to maximize GPU utilization, ensuring the total token count per batch does not exceed `--max-tokens-per-gpu`. For example, with a 300-token limit, samples of lengths 100, 200, and 300 would be packed into two batches: `[100, 200]` and `[300]`. **Note:** Miles ensures that enabling this optimization does not affect the mathematical correctness of per-sample or per-token loss calculation. It is **strongly recommended** to enable this for maximum efficiency. | `False` | bool flag (set to enable) | Miles Native |
| `--max-tokens-per-gpu` | The maximum number of tokens (Prompt + Response combined) per GPU for dynamic batch size. This parameter defines the total sequence length budget for packing samples into micro-batches during training. Note that when enabling context parallel (CP), the effective capacity is shared, so the value should be approximately `(Total_Sequence_Length) // cp_size`. | `None` | Type: int | Miles Native |
| `--log-probs-max-tokens-per-gpu` | The maximum number of tokens per GPU for calculating log probs. This is used to calculate the log probs of the responses during rollout, and should be set to a larger value than `max_tokens_per_gpu` if you want better performance. | `None` | Type: int | Miles Native |
| `--balance-data` | Balance the number of tokens between data parallel ranks with `karmarkar_karp` for verl. Note that this may allocate the different response of the same prompt into different training steps. | `False` | Type: bool | Megatron-LM |
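
The packing behavior described for `--use-dynamic-batch-size` above can be illustrated with a short sketch (assumed helper, not Miles code):

```python
# Greedily pack samples into micro-batches so the total token count per batch
# never exceeds the per-GPU budget.
def pack_by_token_budget(sample_lengths, max_tokens_per_gpu):
    batches, current, used = [], [], 0
    for length in sample_lengths:
        if current and used + length > max_tokens_per_gpu:
            batches.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches

print(pack_by_token_budget([100, 200, 300], max_tokens_per_gpu=300))
# -> [[100, 200], [300]], matching the example in the table above
```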
Suggested change
| `--balance-data` | Balance the number of tokens between data parallel ranks with `karmarkar_karp` for verl. Note that this may allocate the different response of the same prompt into different training steps. | `False` | Type: bool | Megatron-LM |
| `--balance-data` | Repartition each rollout batch so each data-parallel rank gets a similar total token count via Karmarkar-Karp method. It may be beneficial for training speed but changes per-rank sample grouping and adds a small CPU scheduling overhead. | `False` | Type: bool | Miles Native |

Hi @Ratish1 , can you also change the help in arguments.py accordingly?

| `--true-on-policy-mode` | Strictly align SGLang's log probs and training engine's log probs to bit-wise equal. This parameter is only used for FSDP right now. [Ref](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/mismatch/blog-en.md#truly-on-policy-training) | `False` | bool flag (set to enable) | Miles Native |
| `--train-env-vars` | Extra environment variables for training process, e.g., PyTorch memory management ones. | `{}` | Type: JSON / Dict | Miles Native |
| `--train-memory-margin-bytes` | Reserved memory margin for training in bytes. Defaults to 1GB. | `1073741824` | Type: int | Miles Native |
| `--disable-weights-backuper` | Disables the system that backups model weights (Actor, Ref, Old Actor) to CPU RAM. Disabling saves significant host memory but prevents weight-swapping features like KL-divergence. | `False` | bool flag (set to disable) | Miles Native |
Suggested change
| `--disable-weights-backuper` | Disables the system that backups model weights (Actor, Ref, Old Actor) to CPU RAM. Disabling saves significant host memory but prevents weight-swapping features like KL-divergence. | `False` | bool flag (set to disable) | Miles Native |
| `--disable-weights-backuper` | Applies to `megatron` training backend only. Disables the system that backups model weights (Actor, Ref, Old Actor) to CPU RAM. Disabling saves significant host memory but prevents features that rely on weight-swapping, such as computing KL-divergence against a reference model. **Note**: do not set `--ref-load` and `--keep-old-actor` if disable weights backuper. | `False` | bool flag (set to disable) | Miles Native |


| Argument | Description | Default | Options | Source |
| :--- | :--- | :--- | :--- | :--- |
| `--prompt-data` | Path to the prompt dataset (JSONL format) and each line should contains `--input-key` and `--label-key` which will be used as the prompt and the label respectively. If you want to use a custom template, you can set `--apply-chat-template` to true | `None` | Type: str | Miles Native |
Suggested change
| `--prompt-data` | Path to the prompt dataset (JSONL format) and each line should contains `--input-key` and `--label-key` which will be used as the prompt and the label respectively. If you want to use a custom template, you can set `--apply-chat-template` to true | `None` | Type: str | Miles Native |
| `--prompt-data` | Path to the prompt dataset (JSONL format) and each line should contains `--input-key` and `--label-key` which will be used as the prompt and the label respectively. | `None` | Type: str | Miles Native |
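
For reference, one line of the `--prompt-data` JSONL file described above might look like the sketch below; the field names "prompt" and "label" are assumptions and must match whatever `--input-key` and `--label-key` are set to.

```python
import json

example_line = {
    "prompt": "What is 2 + 2?",  # read via --input-key
    "label": "4",                # read via --label-key
}
print(json.dumps(example_line))  # one line of the JSONL prompt dataset
```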

| Argument | Description | Default | Options | Source |
| :--- | :--- | :--- | :--- | :--- |
| `--prompt-data` | Path to the prompt dataset (JSONL format) and each line should contains `--input-key` and `--label-key` which will be used as the prompt and the label respectively. If you want to use a custom template, you can set `--apply-chat-template` to true | `None` | Type: str | Miles Native |
| `--disable-rollout-global-dataset` | Disable the global dataset for rollout. If set, the rollout will use the `--prompt-data` as the prompt dataset, and the prompts for rollout will be sampled from the dataset. If not set, you need to manage the data by your self. | `False` | bool flag (set to disable) | Miles Native |
Suggested change
| `--disable-rollout-global-dataset` | Disable the global dataset for rollout. If set, the rollout will use the `--prompt-data` as the prompt dataset, and the prompts for rollout will be sampled from the dataset. If not set, you need to manage the data by your self. | `False` | bool flag (set to disable) | Miles Native |
| `--disable-rollout-global-dataset` | Disable the global dataset for rollout. By default, Miles loads `--prompt-data` into a global dataset and samples from it for rollout. Setting this flag turns off this behavior, Use this flag only when providing a custom `--rollout-function-path` (and usually a custom `--data-source-path`) that handles data loading independently. | `False` | bool flag (set to disable) | Miles Native |

| `--lr` | Learning rate for the Actor. | `1e-6` | Type: float | Megatron-LM |
| `--lr-warmup-init` | Initial learning rate for warmup. | `0.0` | Type: float | Megatron-LM |
| `--min-lr` | Minimum learning rate after decay. | `0.0` | Type: float | Megatron-LM |
| `--lr-decay-style` | Learning rate decay style. | `constant`(FSDP), `linear`(Megatron) | Type: str | Megatron-LM |
Hi @Ratish1, I think most of the arguments I put outside the Megatron/FSDP sections can be sourced back to both backends; that's why I marked both FSDP and Megatron defaults here. Could you help me double-check? Thanks!
