
There are multiple MoE backends inside TensorRT LLM. Here is the support matrix of the MoE backends.

| Device                 | Activation Type | MoE Weights Type | MoE Backend | Use Case                       |
|------------------------|-----------------|------------------|-------------|--------------------------------|
| B200/GB200/B300/GB300  | MXFP8           | MXFP4            | TRTLLM      | Low Latency and Max Throughput |

The default MoE backend is `CUTLASS`, so for the best possible performance one must set `moe_config.backend` explicitly to `TRTLLM` when running the model. `CUTLASS` was initially faster for max-throughput scenarios, but the `TRTLLM` MoE backend has since been optimized to be universally faster.
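
As a minimal sketch of what this looks like in practice, the backend can be selected through the Python LLM API. The `MoeConfig` import path, the `moe_config` keyword, and the model path below are illustrative assumptions and may differ across TensorRT LLM releases:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MoeConfig  # assumed import path

# Explicitly select the TRTLLM MoE backend; the default is CUTLASS.
llm = LLM(
    model="/path/to/checkpoint",  # placeholder model path
    moe_config=MoeConfig(backend="TRTLLM"),
)

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```

When serving with `trtllm-serve`, the equivalent setting can typically be supplied as a `moe_config.backend` entry in the YAML file passed via `--extra_llm_api_options`.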

## Deployment Steps
