Merged
62 commits
4967316
add example
yushengsu-thu Jan 7, 2026
303e1ca
Merge branch 'radixark:main' into miles-lora-megatron
yushengsu-thu Jan 7, 2026
f593e33
fix megatron training problem
yushengsu-thu Jan 12, 2026
342d9f5
update
yushengsu-thu Jan 12, 2026
4ce9857
support training side - megatron: base + lora
yushengsu-thu Jan 12, 2026
4f5cf71
support rollout part
yushengsu-thu Jan 13, 2026
e193417
1.minor fix 2.change Lora to CanonicalLoRA - fix cuda problem. - eff…
yushengsu-thu Jan 14, 2026
382e9d5
done - but need to fix weightupdate problem
yushengsu-thu Jan 17, 2026
a2494d3
need to fix weight update
yushengsu-thu Jan 17, 2026
6594833
lora megatron backend - end2end training
yushengsu-thu Jan 18, 2026
ee92631
enable no --lora-adapter-path
yushengsu-thu Jan 21, 2026
4953c11
update script
yushengsu-thu Jan 22, 2026
8da0dfc
to-do: need to enable --no-offload-train and --no-offload-rollout
yushengsu-thu Jan 22, 2026
c43c88d
fix: fix update weight quite failure issue
gongyisheng Jan 27, 2026
e317bbb
[chore] fix offload train failure
gongyisheng Feb 6, 2026
0dc084a
[chore] improve comment
gongyisheng Feb 6, 2026
82c8309
Merge pull request #2 from gongyisheng/miles-lora-megatron
yushengsu-thu Feb 12, 2026
5ffceee
[fix] fix offload rollout garbage output issue
gongyisheng Feb 13, 2026
ac63173
Merge pull request #3 from gongyisheng/miles-lora-megatron
yushengsu-thu Feb 17, 2026
191dc05
modify examples
yushengsu-thu Feb 17, 2026
5bd4de7
re-factor and clean codes
yushengsu-thu Feb 17, 2026
55673dd
merge
yushengsu-thu Feb 17, 2026
9121cc2
refine script
yushengsu-thu Feb 17, 2026
8bc5bae
fix tms resume bug
yushengsu-thu Feb 17, 2026
94bd76e
pre-commit
yushengsu-thu Feb 17, 2026
35d8de7
update
yushengsu-thu Feb 17, 2026
d8116c2
update
yushengsu-thu Feb 17, 2026
8538866
add qwen3 4b benchmark
gongyisheng Feb 18, 2026
52a99b3
rename file
gongyisheng Feb 18, 2026
e0e3290
update script
yushengsu-thu Feb 18, 2026
3e8d889
Merge pull request #4 from gongyisheng/miles-lora-megatron
yushengsu-thu Feb 18, 2026
76ea001
update script - need to check: expandable_segments, attention-backend…
yushengsu-thu Feb 18, 2026
8245e45
add lora ci
yushengsu-thu Feb 20, 2026
d486222
fix lora ci failing
yushengsu-thu Feb 20, 2026
1bbb800
fix threshold bc MLA dismatch
yushengsu-thu Feb 20, 2026
0d3a627
fix pre-commit
yushengsu-thu Feb 20, 2026
d6c1fae
[ci] fix ci arg missing failure
gongyisheng Feb 20, 2026
5217533
Merge pull request #5 from gongyisheng/miles-lora-megatron
yushengsu-thu Feb 20, 2026
15157fa
fix ci
yushengsu-thu Feb 20, 2026
a1609e3
[ci] fix ci health check related error
gongyisheng Feb 20, 2026
c00f0e9
Merge pull request #6 from gongyisheng/miles-lora-megatron
yushengsu-thu Feb 21, 2026
5276c1e
Merge remote-tracking branch 'upstream/main' into miles-lora-megatron
yushengsu-thu Feb 21, 2026
7e6f96b
fix review part
yushengsu-thu Feb 23, 2026
7b5935f
pre-commit
yushengsu-thu Feb 24, 2026
43ddbdd
merge
yushengsu-thu Feb 24, 2026
b2740a7
update
yushengsu-thu Feb 24, 2026
f10047a
update
yushengsu-thu Feb 24, 2026
671c8ed
update
yushengsu-thu Feb 24, 2026
9012c1c
update
yushengsu-thu Feb 24, 2026
c881f99
update
yushengsu-thu Feb 24, 2026
9b59322
update
yushengsu-thu Feb 24, 2026
3151537
clean code dup
yushengsu-thu Feb 25, 2026
1740bf1
pre-commit
yushengsu-thu Feb 25, 2026
30c9243
update
yushengsu-thu Feb 25, 2026
934408e
pre-commit
yushengsu-thu Feb 25, 2026
e27aa52
fix review
yushengsu-thu Feb 25, 2026
aa90f48
del dup script
yushengsu-thu Feb 25, 2026
77a3906
add NotImplementedError
yushengsu-thu Feb 25, 2026
cf51ad1
fixed review part
yushengsu-thu Feb 26, 2026
295f818
pre-commit fix
yushengsu-thu Feb 26, 2026
79f9b91
update
yushengsu-thu Feb 26, 2026
e28526c
Merge branch 'main' into miles-lora-megatron
yushengsu-thu Feb 26, 2026
64 changes: 63 additions & 1 deletion .github/workflows/pr-test.yml
@@ -994,6 +994,68 @@ jobs:
shell: bash
run: python tests/ci/gpu_lock_exec.py --count ${{ matrix.info.num_gpus }} -- python tests/${{ matrix.info.test_file }}

e2e-test-lora:
if: (github.event_name == 'workflow_dispatch') || (github.event.pull_request && contains(github.event.pull_request.labels.*.name, 'run-ci-lora'))
runs-on: self-hosted
container:
image: radixark/miles:dev
options: >
--gpus all
--ipc=host
--shm-size=32g
--ulimit memlock=-1
--ulimit stack=67108864
--memory=0
--memory-swap=0
-v /mnt/nvme0n1/miles_ci:/data/miles_ci
-v /mnt/nvme0n1/miles_ci/models:/root/models
-v /mnt/nvme0n1/miles_ci/datasets:/root/datasets
--privileged
--ulimit nofile=65535:65535
-v /tmp:/tmp
strategy:
fail-fast: false
matrix:
info: [{"num_gpus": 8, "test_file": "e2e/lora/test_lora_qwen2.5_0.5B.py"}]
defaults:
run:
working-directory: ${{ github.workspace }}
env:
GITHUB_COMMIT_NAME: ${{ github.sha }}_${{ github.event.pull_request.number || 'non-pr' }}
WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
MILES_TEST_ENABLE_INFINITE_RUN: ${{ (github.event_name == 'workflow_dispatch' && github.event.inputs.infinite_run) || 'false' }}
MILES_TEST_USE_DEEPEP: ${{ matrix.info.use_deepep || '0' }}
MILES_TEST_USE_FP8_ROLLOUT: ${{ matrix.info.use_fp8_rollout || '0' }}
MILES_TEST_ENABLE_EVAL: ${{ matrix.info.enable_eval || '1' }}
MILES_TEST_FEW_GPU: '0'

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Cleanup Ray processes
shell: bash
run: |
pkill -9 -f 'ray::' 2>/dev/null || true
pkill -9 -f raylet 2>/dev/null || true
pkill -9 -f gcs_server 2>/dev/null || true
pkill -9 -f 'ray-dashboard' 2>/dev/null || true
pkill -9 sglang 2>/dev/null || true
ray stop --force 2>/dev/null || true
rm -rf /tmp/ray/* 2>/dev/null || true
sleep 3

- name: Install
shell: bash
run: |
cd /sgl-workspace/sglang && git fetch origin sglang-miles && git checkout FETCH_HEAD && git log --oneline -1 && pip install -e python --no-deps --break-system-packages
cd /root/Megatron-LM && git reset --hard HEAD && git log --oneline -1 && git apply $GITHUB_WORKSPACE/docker/patch/dev/megatron.patch && pip install -e . --no-deps --break-system-packages
cd $GITHUB_WORKSPACE && pip install -e . --no-deps --break-system-packages

- name: Execute
shell: bash
run: python tests/ci/gpu_lock_exec.py --count ${{ matrix.info.num_gpus }} -- python tests/${{ matrix.info.test_file }}

e2e-test-image:
if: (github.event_name == 'workflow_dispatch') || (github.event.pull_request && contains(github.event.pull_request.labels.*.name, 'run-ci-image'))
runs-on: self-hosted
@@ -1016,7 +1078,7 @@ jobs:
strategy:
fail-fast: false
matrix:
- info: [{"num_gpus": 8, "test_file": "e2e/fsdp/test_qwen3_4B_fsdp_true_on_policy.py"}, {"num_gpus": 8, "test_file": "e2e/fsdp/test_qwen3_vl_4B_fsdp.py"}, {"num_gpus": 8, "test_file": "e2e/fsdp/test_qwen3_0.6B_fsdp_distributed.py"}, {"num_gpus": 8, "test_file": "e2e/fsdp/test_qwen3_0.6B_megatron_fsdp_align.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_quick_start_glm4_9B.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_qwen3_30B_A3B.py", "use_deepep": "1", "use_fp8_rollout": "1"}, {"enable_eval": "0", "num_gpus": 8, "test_file": "e2e/megatron/test_qwen3_30B_A3B_r3.py", "use_deepep": "1", "use_fp8_rollout": "1"}, {"enable_eval": "0", "num_gpus": 8, "test_file": "e2e/megatron/test_qwen3_30B_A3B_r3.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_qwen3_4B_ppo.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_moonlight_16B_A3B.py"}, {"enable_eval": "0", "num_gpus": 8, "test_file": "e2e/megatron/test_moonlight_16B_A3B_r3.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_mimo_7B_mtp_only_grad.py"}, {"enable_eval": "0", "num_gpus": 8, "test_file": "e2e/megatron/test_glm47_flash_r3_mtp.py"}, {"num_gpus": 8, "test_file": "e2e/short/test_qwen2.5_0.5B_gsm8k_async_short.py"}, {"num_gpus": 8, "test_file": "e2e/short/test_qwen2.5_0.5B_gsm8k_short.py"}, {"num_gpus": 8, "test_file": "e2e/short/test_qwen3_0.6B_fsdp_colocated_2xGPU.py"}, {"num_gpus": 8, "test_file": "e2e/precision/test_qwen3_0.6B_parallel_check.py"}, {"num_gpus": 8, "test_file": "e2e/ckpt/test_qwen3_4B_ckpt.py"}, {"num_gpus": 8, "test_file": "e2e/ckpt/test_qwen3_4B_ckpt.py --async-save"}, {"num_gpus": 8, "test_file": "e2e/long/test_qwen2.5_0.5B_gsm8k.py"}, {"num_gpus": 8, "test_file": "e2e/long/test_qwen2.5_0.5B_gsm8k_async.py"}]
+ info: [{"num_gpus": 8, "test_file": "e2e/fsdp/test_qwen3_4B_fsdp_true_on_policy.py"}, {"num_gpus": 8, "test_file": "e2e/fsdp/test_qwen3_vl_4B_fsdp.py"}, {"num_gpus": 8, "test_file": "e2e/fsdp/test_qwen3_0.6B_fsdp_distributed.py"}, {"num_gpus": 8, "test_file": "e2e/fsdp/test_qwen3_0.6B_megatron_fsdp_align.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_quick_start_glm4_9B.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_qwen3_30B_A3B.py", "use_deepep": "1", "use_fp8_rollout": "1"}, {"enable_eval": "0", "num_gpus": 8, "test_file": "e2e/megatron/test_qwen3_30B_A3B_r3.py", "use_deepep": "1", "use_fp8_rollout": "1"}, {"enable_eval": "0", "num_gpus": 8, "test_file": "e2e/megatron/test_qwen3_30B_A3B_r3.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_qwen3_4B_ppo.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_moonlight_16B_A3B.py"}, {"enable_eval": "0", "num_gpus": 8, "test_file": "e2e/megatron/test_moonlight_16B_A3B_r3.py"}, {"num_gpus": 8, "test_file": "e2e/megatron/test_mimo_7B_mtp_only_grad.py"}, {"enable_eval": "0", "num_gpus": 8, "test_file": "e2e/megatron/test_glm47_flash_r3_mtp.py"}, {"num_gpus": 8, "test_file": "e2e/lora/test_lora_qwen2.5_0.5B.py"}, {"num_gpus": 8, "test_file": "e2e/short/test_qwen2.5_0.5B_gsm8k_async_short.py"}, {"num_gpus": 8, "test_file": "e2e/short/test_qwen2.5_0.5B_gsm8k_short.py"}, {"num_gpus": 8, "test_file": "e2e/short/test_qwen3_0.6B_fsdp_colocated_2xGPU.py"}, {"num_gpus": 8, "test_file": "e2e/precision/test_qwen3_0.6B_parallel_check.py"}, {"num_gpus": 8, "test_file": "e2e/ckpt/test_qwen3_4B_ckpt.py"}, {"num_gpus": 8, "test_file": "e2e/ckpt/test_qwen3_4B_ckpt.py --async-save"}, {"num_gpus": 8, "test_file": "e2e/long/test_qwen2.5_0.5B_gsm8k.py"}, {"num_gpus": 8, "test_file": "e2e/long/test_qwen2.5_0.5B_gsm8k_async.py"}]
defaults:
run:
working-directory: ${{ github.workspace }}
10 changes: 9 additions & 1 deletion .github/workflows/pr-test.yml.j2
@@ -39,6 +39,10 @@
{'test_file': 'e2e/long/test_qwen2.5_0.5B_gsm8k_async.py', 'num_gpus': 8},
] %>

<% set lora_tests = [
{'test_file': 'e2e/lora/test_lora_qwen2.5_0.5B.py', 'num_gpus': 8},
] %>

<% set jobs = {
'fast': {
'test_executor': 'pytest',
@@ -83,9 +87,13 @@
'label': 'run-ci-long',
'tests': long_tests,
},
'e2e-test-lora': {
'label': 'run-ci-lora',
'tests': lora_tests,
},
'e2e-test-image': {
'label': 'run-ci-image',
- 'tests': fsdp_tests + megatron_tests + short_tests + precision_tests + ckpt_tests + long_tests,
+ 'tests': fsdp_tests + megatron_tests + lora_tests + short_tests + precision_tests + ckpt_tests + long_tests,
},
} %>
name: PR Test
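The template above generates one GitHub Actions job per entry in the `jobs` mapping, each gated on a PR label (or a manual `workflow_dispatch`). A minimal Python sketch of that gating logic — the function name `should_run` and the event/label inputs are illustrative, not part of the repo:

```python
lora_tests = [{"test_file": "e2e/lora/test_lora_qwen2.5_0.5B.py", "num_gpus": 8}]

jobs = {
    "e2e-test-lora": {"label": "run-ci-lora", "tests": lora_tests},
}

def should_run(job: dict, event_name: str, pr_labels: set) -> bool:
    # Mirrors the workflow `if:` condition: run on manual dispatch,
    # or when the PR carries the job's trigger label.
    return event_name == "workflow_dispatch" or job["label"] in pr_labels

print(should_run(jobs["e2e-test-lora"], "pull_request", {"run-ci-lora"}))  # True
print(should_run(jobs["e2e-test-lora"], "pull_request", {"run-ci-long"}))  # False
```

Keeping the test lists in one template and summing them for `e2e-test-image` (as the diff does with `+ lora_tests`) means the LoRA test runs both under its own label and in the full image suite.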
165 changes: 165 additions & 0 deletions examples/lora/run-qwen2.5-0.5B-megatron-lora.sh
@@ -0,0 +1,165 @@
#!/bin/bash
export FLASHINFER_DISABLE_VERSION_CHECK=1
export GPUS_PER_NODE=8
# prevent Ray from buffering stdout/stderr
export PYTHONUNBUFFERED=1

# For rerunning the task: clean up leftover processes first
pkill sglang
ray stop --force
sleep 5 # Wait for processes to terminate gracefully
# Force kill any remaining processes.
# Note: `pkill -9 python` is broad and can be risky.
pkill -9 sglang
pkill -9 ray
pkill -9 python

set -ex


SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/../../scripts/models/qwen2.5-0.5B.sh"

CKPT_ARGS=(
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
--megatron-to-hf-mode bridge
)

LORA_ARGS=(
--lora-rank 32 # LoRA rank (typical values: 8, 16, 32, 64)
--lora-alpha 32 # LoRA alpha (usually 2x rank)
--lora-dropout 0.0 # LoRA dropout (0.0 for RL training)
--target-modules "all-linear"
--megatron-to-hf-mode bridge
)
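The rank/alpha comments above follow the standard LoRA parameterization, where the adapter's contribution is scaled by alpha/rank. A minimal NumPy sketch under assumed dimensions (896 is illustrative; real layer shapes come from the model config):

```python
import numpy as np

rank, alpha = 32, 32              # matches --lora-rank / --lora-alpha above
d_in, d_out = 896, 896            # hypothetical layer dimensions

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.02   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    # y = W x + (alpha / rank) * B (A x)
    return W @ x + (alpha / rank) * (B @ (A @ x))

# With B zero-initialized the adapter is a no-op at step 0, so training
# starts exactly from the base model's behavior.
x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)
```

With rank == alpha the scale factor is 1.0; choosing alpha = 2 × rank (as the comment suggests) doubles the adapter's effective learning-rate-like scale without changing parameter count.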
##############################
##############################
##############################

ROLLOUT_ARGS=(
--prompt-data /root/gsm8k/train.parquet
--input-key messages
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type math
--num-rollout 100
# --num-rollout 10 # only train 10 steps
--rollout-batch-size 32
# --rollout-batch-size 16 # for testing
--n-samples-per-prompt 8
--rollout-max-response-len 1024
--rollout-temperature 1

--global-batch-size 256
# --global-batch-size 32 # for testing
)
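With the settings above, one rollout produces rollout-batch-size × n-samples-per-prompt = 32 × 8 = 256 samples, which matches the global batch exactly. A quick sanity check (pure arithmetic mirroring the flags; the variable names are ours, not CLI names):

```python
rollout_batch_size = 32      # --rollout-batch-size
n_samples_per_prompt = 8     # --n-samples-per-prompt
global_batch_size = 256      # --global-batch-size

samples_per_rollout = rollout_batch_size * n_samples_per_prompt
# Each rollout should split evenly into whole optimizer steps.
assert samples_per_rollout % global_batch_size == 0
print(samples_per_rollout // global_batch_size)  # 1 optimizer step per rollout
```

The commented "for testing" values (16 × 8 = 128 samples against a global batch of 32) also divide evenly, giving 4 optimizer steps per rollout.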

EVAL_ARGS=(
# --eval-interval 20
--eval-interval 10
--eval-prompt-data gsm8k /root/gsm8k/test.parquet
--n-samples-per-eval-prompt 1
--eval-max-response-len 1024
--eval-top-k 1
)

PERF_ARGS=(
--tensor-model-parallel-size 1
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1

--use-dynamic-batch-size
--max-tokens-per-gpu 9216
)

GRPO_ARGS=(
--advantage-estimator grpo
# --use-kl-loss # if using KL loss, also pass --ref-load
--kl-loss-coef 0.00
--kl-loss-type low_var_kl
--kl-coef 0.00
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)
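The asymmetric clipping above (--eps-clip 0.2, --eps-clip-high 0.28) bounds the importance ratio to [1 − 0.2, 1 + 0.28]. A minimal scalar sketch of a PPO-style clipped objective with this asymmetric range (function name and scalar form are illustrative, not the repo's implementation):

```python
def clipped_pg_loss(ratio: float, advantage: float,
                    eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    # Clip the importance ratio to [1 - eps_low, 1 + eps_high] and take the
    # pessimistic (min) branch of the surrogate objective; return a loss.
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high) * advantage
    return -min(unclipped, clipped)

print(clipped_pg_loss(1.5, 1.0))   # ratio capped at 1.28 -> -1.28
print(clipped_pg_loss(0.5, -1.0))  # ratio floored at 0.8 -> 0.8
```

The higher upper bound lets tokens with positive advantage increase in probability a bit more aggressively than the symmetric 0.2 clip would allow.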

OPTIMIZER_ARGS=(
--optimizer adam
# --lr 1e-6
--lr 1e-5 # Higher LR often works better for LoRA
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
)

WANDB_ARGS=(
--use-wandb
--wandb-host https://wandb.ai/
--wandb-team miles-lora
--wandb-project miles-lora-megatron
--wandb-group qwen2.5-0.5B-gsm8k-test
)


SGLANG_ARGS=(
--rollout-num-gpus-per-engine 1
# --sglang-mem-fraction-static 0.7
--sglang-mem-fraction-static 0.4

# --sglang-enable-deterministic-inference
# --sglang-attention-backend flashinfer
# --deterministic-mode
)

MISC_ARGS=(
# default dropout in megatron is 0.1
--attention-dropout 0.0
--hidden-dropout 0.0
# should be good for model performance
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
# comment this out when using a model with MLA
--attention-backend flash
)


# launch the master node of ray in container
ray start --head --node-ip-address 127.0.0.1 --num-gpus $GPUS_PER_NODE --disable-usage-stats
# ray start --head --node-ip-address 127.0.0.1 --num-gpus 1 --disable-usage-stats

ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
"env_vars": {
"PYTHONPATH": "/root/Megatron-LM",
"CUDA_DEVICE_MAX_CONNECTIONS": "1",
"NCCL_ALGO": "Ring",
"NVTE_ALLOW_NONDETERMINISTIC_ALGO": "0",
"CUBLAS_WORKSPACE_CONFIG": ":4096:8"
}
}' \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node $GPUS_PER_NODE \
--colocate \
--calculate-per-token-loss \
--use-miles-router \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${LORA_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${WANDB_ARGS[@]} \
${PERF_ARGS[@]} \
${EVAL_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]} \
${ROLLOUT_ARGS[@]}


# colocate: update weights from tensors
# disaggregate: update weights via distributed communication