@coketaste
Contributor

Motivation

Integrate the sglang disagg models so that they run on a SLURM cluster.

Technical Details

Refactor madengine to run the models under sglang_disagg in the MAD-private repo.
(1) The adopted models have been added to models.json.
(2) The interface is the same as in legacy madengine, e.g., madengine run --tags sglang_disagg_pd_qwen3-32B --additional-context "{'slurm_args': {'FRAMEWORK': 'sglang_disagg', 'PREFILL_NODES': '2', 'DECODE_NODES': '2', 'PARTITION': 'amd-rccl', 'TIME': '12:00:00', 'DOCKER_IMAGE': ''}}"
(3) A slurm_args field is added to the context. Its fields are FRAMEWORK, PREFILL_NODES, DECODE_NODES, PARTITION, TIME, and DOCKER_IMAGE. If DOCKER_IMAGE is empty, the default image in run.sh is used. MODEL_NAME is read from the args attribute of the selected model's entry in models.json: it is the value that follows --model, without the flag itself, e.g., DeepSeek-V2 for --model DeepSeek-V2.
(4) If the flow finds slurm_args in the context, it executes the script 'scripts/sglang_disagg/run.sh' to submit the job to the SLURM cluster directly and skips the run_model function, which builds the Docker image and runs the container.
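
The flow in (2) to (4) can be sketched roughly as follows. This is an illustrative outline, not the code in this PR: the helper names (parse_additional_context, extract_model_name, build_slurm_env) are hypothetical, and passing the fields to run.sh as environment variables is an assumption. Only the field names, the script path, and the --model convention come from the description above.

```python
import ast
import shlex

# Field names listed in the PR description for slurm_args.
SLURM_FIELDS = (
    "FRAMEWORK", "PREFILL_NODES", "DECODE_NODES",
    "PARTITION", "TIME", "DOCKER_IMAGE",
)
# Script path named in the PR description.
RUN_SCRIPT = "scripts/sglang_disagg/run.sh"


def parse_additional_context(raw):
    """Parse the single-quoted dict string given to --additional-context."""
    context = ast.literal_eval(raw)  # accepts Python literals only, no code
    if not isinstance(context, dict):
        raise ValueError("additional context must be a dict literal")
    return context


def extract_model_name(args):
    """Return the value that follows --model in a models.json args string."""
    tokens = shlex.split(args)
    for i, token in enumerate(tokens):
        if token == "--model" and i + 1 < len(tokens):
            return tokens[i + 1]
    raise ValueError("no --model value found in args")


def build_slurm_env(context, model_entry):
    """Return variables for run.sh, or None when slurm_args is absent.

    None means the caller should fall back to the regular run_model path.
    Assumption: run.sh reads these values from its environment.
    """
    slurm_args = context.get("slurm_args")
    if not slurm_args:
        return None
    env = {k: str(slurm_args[k]) for k in SLURM_FIELDS if k in slurm_args}
    # An empty DOCKER_IMAGE is passed through unchanged; per the description,
    # run.sh then uses its default image.
    env["MODEL_NAME"] = extract_model_name(model_entry["args"])
    return env


if __name__ == "__main__":
    raw = ("{'slurm_args': {'FRAMEWORK': 'sglang_disagg', "
           "'PREFILL_NODES': '2', 'DECODE_NODES': '2', "
           "'PARTITION': 'amd-rccl', 'TIME': '12:00:00', 'DOCKER_IMAGE': ''}}")
    env = build_slurm_env(parse_additional_context(raw),
                          {"args": "--model DeepSeek-V2"})
    print(RUN_SCRIPT, env)
```

In this sketch a caller would merge the returned dict into the process environment and invoke the script (for example with subprocess.run), and would call run_model only when build_slurm_env returns None.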

Test Plan

Test Result

Submission Checklist

@coketaste coketaste self-assigned this Sep 23, 2025
@coketaste
Contributor Author

The SGLang Disagg models were added in this PR: https://github.com/ROCm/MAD-private/pull/112
