From 32d9ecd474546560376ee16a25f264dcd68b2c38 Mon Sep 17 00:00:00 2001
From: Dongfeng Yu
Date: Sat, 25 Oct 2025 21:16:32 +0000
Subject: [PATCH 1/4] [None][doc] Clarify the perf best practice and supported
 hardware for gptoss

Signed-off-by: Dongfeng Yu
---
 .../quick-start-recipe-for-gpt-oss-on-trtllm.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md b/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
index 16da732a1d2..c5842c5cb17 100644
--- a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
+++ b/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
@@ -25,10 +25,10 @@ There are multiple MOE backends inside TensorRT LLM. Here are the support matrix
 
 | Device | Activation Type | MoE Weights Type | MoE Backend | Use Case |
 |------------|------------------|------------------|-------------|----------------|
-| B200/GB200 | MXFP8 | MXFP4 | TRTLLM | Low Latency |
-| B200/GB200 | MXFP8 | MXFP4 | CUTLASS | Max Throughput |
+| B200/GB200/B300 | MXFP8 | MXFP4 | TRTLLM | Low Latency and Max Throughput |
 
-The default moe backend is `CUTLASS`, so for the combination which is not supported by `CUTLASS`, one must set the `moe_config.backend` explicitly to run the model.
+The default moe backend is `CUTLASS`, so for the best possible perf, one must set the `moe_config.backend` explicitly to run the model.
+`CUTLASS` was better for max throughput at first but now we have optimized `TRTLLM` moe to be universally faster.
 
 ## Deployment Steps
 

From f7dadde930691148705e310169b4f0e3cfd8877e Mon Sep 17 00:00:00 2001
From: Dongfeng Yu
Date: Sat, 25 Oct 2025 21:17:49 +0000
Subject: [PATCH 2/4] Update doc

Signed-off-by: Dongfeng Yu
---
 .../quick-start-recipe-for-gpt-oss-on-trtllm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md b/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
index c5842c5cb17..a6a02040b12 100644
--- a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
+++ b/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
@@ -25,7 +25,7 @@ There are multiple MOE backends inside TensorRT LLM. Here are the support matrix
 
 | Device | Activation Type | MoE Weights Type | MoE Backend | Use Case |
 |------------|------------------|------------------|-------------|----------------|
-| B200/GB200/B300 | MXFP8 | MXFP4 | TRTLLM | Low Latency and Max Throughput |
+| B200/GB200/B300/GB300 | MXFP8 | MXFP4 | TRTLLM | Low Latency and Max Throughput |
 
 The default moe backend is `CUTLASS`, so for the best possible perf, one must set the `moe_config.backend` explicitly to run the model.
 `CUTLASS` was better for max throughput at first but now we have optimized `TRTLLM` moe to be universally faster.
From 3d4018e169a5a0c60d28ed2316f0aba1d995fc98 Mon Sep 17 00:00:00 2001
From: dongfengy <99041270+dongfengy@users.noreply.github.com>
Date: Mon, 27 Oct 2025 13:14:25 -0700
Subject: [PATCH 3/4] Update quick-start-recipe-for-gpt-oss-on-trtllm.md

Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
---
 .../quick-start-recipe-for-gpt-oss-on-trtllm.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md b/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
index a6a02040b12..7b32e24025b 100644
--- a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
+++ b/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
@@ -23,9 +23,9 @@ The guide is intended for developers and practitioners seeking high-throughput o
 
 There are multiple MOE backends inside TensorRT LLM. Here are the support matrix of the MOE backends.
 
-| Device | Activation Type | MoE Weights Type | MoE Backend | Use Case |
-|------------|------------------|------------------|-------------|----------------|
-| B200/GB200/B300/GB300 | MXFP8 | MXFP4 | TRTLLM | Low Latency and Max Throughput |
+| Device                | Activation Type | MoE Weights Type | MoE Backend | Use Case                       |
+|---------------------- |-----------------|------------------|-------------|--------------------------------|
+| B200/GB200/B300/GB300 | MXFP8 | MXFP4 | TRTLLM | Low Latency and Max Throughput |
 
 The default moe backend is `CUTLASS`, so for the best possible perf, one must set the `moe_config.backend` explicitly to run the model.
 `CUTLASS` was better for max throughput at first but now we have optimized `TRTLLM` moe to be universally faster.
From 58d6cf14525f4b0c131588c7647727590a48e8ce Mon Sep 17 00:00:00 2001
From: dongfengy <99041270+dongfengy@users.noreply.github.com>
Date: Mon, 27 Oct 2025 13:16:31 -0700
Subject: [PATCH 4/4] Update quick-start-recipe-for-gpt-oss-on-trtllm.md

Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
---
 .../quick-start-recipe-for-gpt-oss-on-trtllm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md b/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
index 7b32e24025b..17e16583092 100644
--- a/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
+++ b/docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
@@ -25,7 +25,7 @@ There are multiple MOE backends inside TensorRT LLM. Here are the support matrix
 
 | Device                | Activation Type | MoE Weights Type | MoE Backend | Use Case                       |
 |---------------------- |-----------------|------------------|-------------|--------------------------------|
-| B200/GB200/B300/GB300 | MXFP8 | MXFP4 | TRTLLM | Low Latency and Max Throughput |
+| B200/GB200/B300/GB300 | MXFP8           | MXFP4            | TRTLLM      | Low Latency and Max Throughput |
 
 The default moe backend is `CUTLASS`, so for the best possible perf, one must set the `moe_config.backend` explicitly to run the model.
 `CUTLASS` was better for max throughput at first but now we have optimized `TRTLLM` moe to be universally faster.
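
Note for reviewers: the `moe_config.backend` override that these patches tell readers to set is typically supplied to `trtllm-serve` as a YAML file via `--extra_llm_api_options`. A minimal sketch of such a file (the file name is an arbitrary choice, not part of the patched doc):

```yaml
# extra-llm-api-config.yml (hypothetical file name)
# Select the TRTLLM MoE backend instead of the CUTLASS default,
# as the patched guide recommends for best perf on B200/GB200/B300/GB300.
moe_config:
  backend: TRTLLM
```

This would then be passed at serve time, e.g. `trtllm-serve <model> --extra_llm_api_options extra-llm-api-config.yml`.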