
## Generate profiles

### Run with OpenMPI

**Step 1:** Start a container using Docker, Enroot or others. Please refer to `../../jenkins/current_image_tags.properties` for the Docker image URI.
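
A minimal sketch of pulling the image URI out of that properties file before starting the container. The key name `LLM_DOCKER_IMAGE`, the sample URI, and the `docker run` flags are assumptions, not taken from this document; check the file and your environment for the real values:

```shell
# Stand-in properties file for illustration only; in the repo, point "props" at
# ../../jenkins/current_image_tags.properties instead.
props=$(mktemp)
printf 'LLM_DOCKER_IMAGE=nvcr.io/example/tensorrt-llm:tag\n' > "$props"

# Assumed key name; check the properties file for the real key.
IMAGE=$(grep -m1 '^LLM_DOCKER_IMAGE=' "$props" | cut -d= -f2-)
echo "$IMAGE"

# Then start a container from it, e.g. (flags are illustrative):
#   docker run --rm -it --gpus all --ipc=host "$IMAGE" bash
```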

```bash
# Run DeepSeek-R1 NVFP4
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml

# Run DeepSeek-V3.2-Exp
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM

# Run DeepSeek-V3.2-Exp with 32k context length
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --max-seq-len $((32768 + 1024 + 4)) --moe-backend DEEPGEMM --batch-size 1 --seq-len-q 32769
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --max-seq-len $((32768 + 1024 + 4)) --moe-backend DEEPGEMM --seq-len-kv-cache 32769

# Run with attention TP
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --no-enable-attention-dp
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --no-enable-attention-dp

# Run with attention TP and TRTLLMGen
NP=4 ./mpi_launch.sh -x TRTLLM_ENABLE_PDL=1 ./run.sh config_ctx.yaml --no-enable-attention-dp --moe-backend TRTLLM --balance-method NotModified
NP=4 ./mpi_launch.sh -x TRTLLM_ENABLE_PDL=1 ./run.sh config_gen.yaml --no-enable-attention-dp --moe-backend TRTLLM --balance-method NotModified

# Run with MTP3
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --batch-size 32 --seq-len-q 4

# Run 4 layers
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --layer-indices 5,6,7,8
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --layer-indices 5,6,7,8

# Scale DEP=16 to 4 GPUs: reduces the number of experts; uses MNNVL A2A if applicable
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --moe-backend WIDEEP

# Scale TEP=16 to 4 GPUs: reduces the number of attention heads and experts
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --no-enable-attention-dp

# Run Qwen3-Next (balanced routing is not implemented)
NP=2 ./mpi_launch.sh ./run.sh config_ctx.yaml --model Qwen/Qwen3-Next-80B-A3B-Instruct --layer-indices 6,7 --no-enable-attention-dp --batch-size 4
NP=2 ./mpi_launch.sh ./run.sh config_gen.yaml --model Qwen/Qwen3-Next-80B-A3B-Instruct --layer-indices 6,7 --no-enable-attention-dp --batch-size 512

# Run with DeepEP A2A
NP=4 ./mpi_launch.sh -x TRTLLM_FORCE_ALLTOALL_METHOD=DeepEP ./run.sh config_ctx.yaml --moe-backend WIDEEP
NP=4 ./mpi_launch.sh -x TRTLLM_FORCE_ALLTOALL_METHOD=DeepEP ./run.sh config_gen.yaml --moe-backend WIDEEP

# Run with imbalanced ranks: besides activating all experts, a --balance-ratio fraction of the tokens is sent to the first rank
# Note: if the balance ratio is 0, activating all experts is skipped
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --balance-method ImbalancedRanks --balance-ratio 0.5
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --balance-method ImbalancedRanks --balance-ratio 0.5

# Run with imbalanced experts and balanced ranks: besides activating all experts, a --balance-ratio fraction of the tokens is sent to the front experts on each rank
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --balance-method ImbalancedExperts --balance-ratio 0.5
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --balance-method ImbalancedExperts --balance-ratio 0.5
```

### Run with Slurm

> Tips:
> 1. If you have a running Slurm job, skip step 1 and go straight to steps 2 and 3.
> 2. Further, if you have already installed `tensorrt_llm` in the Slurm job, you can also skip step 2 and run step 3 with `export CONTAINER_NAME=aaa` specified. If you don't know the container name, run `export CONTAINER_NAME=$(SLURM_JOB_ID=$SLURM_JOB_ID ./slurm_query_container_name.sh)` to get it.

**Step 1:** On the controller node, allocate one or more nodes and record the `SLURM_JOB_ID`:

It uses the image recorded in `../../jenkins/current_image_tags.properties`. The image is downloaded to `../../enroot/` only once.

> Tips: If you want to change the image, there is no need to reallocate the Slurm job. Just start another container by running step 2 with `export CONTAINER_NAME=aaa`; step 3 will then run in the container specified by the `CONTAINER_NAME` environment variable.

**Step 3:** Run benchmarks to generate profiles. Run the following command on the controller node, where `NODES` &le; the number of allocated nodes:

```bash
# Run DeepSeek-R1 NVFP4 with wide EP: uses MNNVL A2A if applicable
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend WIDEEP

# Run with TRTLLMGen
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 TRTLLM_ENABLE_PDL=1 ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend TRTLLM

# Run with DeepEPLowLatency
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 TRTLLM_FORCE_ALLTOALL_METHOD=DeepEPLowLatency ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend WIDEEP

# You can run 4-GPU and 8-GPU tasks without reallocating the Slurm job
SLURM_JOB_ID=$SLURM_JOB_ID NODES=1 NP=4 ./slurm_launch.sh ./run.sh config_ctx.yaml
SLURM_JOB_ID=$SLURM_JOB_ID NODES=2 NP=8 ./slurm_launch.sh ./run.sh config_ctx.yaml
```
109+
110+ ### Batched run
111+
112+ By specifying a list for ` --batch-size ` on the command line (or ` batch_size ` in the YAML file), the script runs multiple configurations in a single process. This significantly reduces the total runtime because it avoids repeated library initialization and model initialization.
113+
114+ Supported list arguments:
115+ - ` --batch-size ` (or ` batch_size ` in YAML)
116+ - ` --seq-len-q ` (or ` seq_len_q ` in YAML)
117+ - ` --seq-len-kv-cache ` (or ` seq_len_kv_cache ` in YAML)
118+ - ` --balance-ratio ` (or ` balance_ratio ` in YAML)
119+
120+ Command line arguments are comma separated, for example, ` --batch-size 1,2,4 ` . Configs in the YAML file are lists, for example, ` batch_size: [1, 2, 4] ` .
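
A sweep can therefore be written directly into the config file. A sketch only: the values below are illustrative, the rest of the config's keys are omitted, and only the list-capable keys named above are shown:

```yaml
# Illustrative batched sweep; all four list-capable keys accept YAML lists.
batch_size: [32, 64, 128]
seq_len_q: [1, 2, 3, 4]
seq_len_kv_cache: [4096]
balance_ratio: [0.5]
```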

Run with OpenMPI:

```bash
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --batch-size 1,2,4 --seq-len-q 1024,8192
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --moe-backend WIDEEP --batch-size 32,64,128,256,512 --seq-len-q 1,2,3,4
```

## Parse profiles

Run the following command in the container:

```bash
python3 parse.py --world-size 4

# Specify the location of the .nsys-rep file
python3 parse.py --profile-dir ./profiles --world-size 4 --rank 0
```

It can parse only GEN-phase profiles for now.

You will receive three reports, each containing kernel timing statistics grouped by module:
1. A printed report on stdout
2. A CSV report at `profiles/report_np4_rank0.csv`
3. An HTML report at `profiles/report_np4_rank0.html`

## Troubleshooting
