
Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE#17449

Merged
ispobock merged 40 commits into sgl-project:main from zianglih:mxfp8-no-dg
Jan 29, 2026

Conversation

@zianglih
Contributor

@zianglih zianglih commented Jan 21, 2026

Motivation

@HumansAnd

#17093
This PR adds mxfp8 quantization support to SGLang, using Triton for dense linear layer, and existing mxfp8 CUTLASS groped GEMM kernel in sgl-kernel from #13731 for MoE.

Online mxfp8 quantization from bf16 checkpoints and serving mxfp8 checkpoints directly are both supported.
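For reference, a minimal pure-Python sketch of the MX-style block quantization that online quantization performs: one shared power-of-two (e8m0) scale per block of 32 elements, with payload values kept within the e4m3 range (±448). The fp8 rounding step itself is elided, and the function names here are illustrative, not sglang APIs.

```python
import math

FP8_E4M3_MAX = 448.0   # max representable magnitude in e4m3
BLOCK = 32             # MX block size: one shared scale per 32 elements

def quantize_mxfp8_block(values):
    """Quantize one block of floats to an (e8m0 exponent, payload) pair.

    The shared scale is the smallest power of two such that the block's
    absolute max, once divided by it, fits in the e4m3 range.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    e = math.ceil(math.log2(amax / FP8_E4M3_MAX))  # e8m0 scale exponent
    scale = 2.0 ** e
    return e, [v / scale for v in values]

def dequantize(e, payload):
    """Invert the block quantization (exact here, since rounding is elided)."""
    scale = 2.0 ** e
    return [p * scale for p in payload]
```

Because the scale is a power of two, dividing and re-multiplying is lossless in this sketch; real mxfp8 additionally rounds each payload value to the nearest e4m3 code.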

Modifications

Accuracy Tests


# Eval:
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum

# bf16:
python -m sglang.launch_server --tp 2 --model Qwen/Qwen3-30B-A3B-Instruct-2507 --fp8-gemm-backend triton --moe-runner-backend cutlass  &
# Trial 1
Accuracy: 0.964
Invalid: 0.000
Latency: 14.543 s
Output throughput: 11779.580 token/s
# Trial 2
Accuracy: 0.964
Invalid: 0.000
Latency: 13.427 s
Output throughput: 12778.580 token/s
# Trial 3
Accuracy: 0.964
Invalid: 0.000
Latency: 13.049 s
Output throughput: 13128.875 token/s
# Online mxfp8:
python -m sglang.launch_server --tp 2 --model Qwen/Qwen3-30B-A3B-Instruct-2507 --quantization mxfp8 --fp8-gemm-backend triton --moe-runner-backend cutlass  &
# Trial 1
Accuracy: 0.964
Invalid: 0.000
Latency: 23.825 s
Output throughput: 7197.115 token/s
# Trial 2
Accuracy: 0.966
Invalid: 0.000
Latency: 18.390 s
Output throughput: 9310.458 token/s
# Trial 3
Accuracy: 0.966
Invalid: 0.000
Latency: 18.178 s
Output throughput: 9419.016 token/s
# Offline mxfp8:
python -m sglang.launch_server --tp 1 --model /data/models/Qwen3-30B-A3B-Instruct-2507-MXFP8 --fp8-gemm-backend triton --moe-runner-backend cutlass  &
# Trial 1
Accuracy: 0.963
Invalid: 0.000
Latency: 18.698 s
Output throughput: 9075.207 token/s
# Trial 2
Accuracy: 0.963
Invalid: 0.000
Latency: 18.297 s
Output throughput: 9273.994 token/s
# Trial 3
Accuracy: 0.963
Invalid: 0.000
Latency: 17.752 s
Output throughput: 9558.566 token/s

Benchmarking and Profiling

Next Steps

  • Latest DeepGEMM supports mxfp8 & mxfp4 kernels via recipe = (1, 1, 32). Add a DeepGEMM backend for mxfp8 once the sgl-kernel version is bumped.
  • Improve Triton mxfp8 kernel performance.
  • Blackwell mxfp8 RL integration.
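On the first bullet: my reading of recipe = (1, 1, 32) is the MX 1x32 scaling granularity, i.e. one scale per 32 contiguous elements along K for each row of an operand. A hypothetical shape helper makes the scale-tensor bookkeeping concrete (verify against the DeepGEMM docs before relying on it):

```python
def mxfp8_scale_shape(n, k, block=32):
    """For an [n, k] operand with 1x32 blocks, each row carries one
    e8m0 scale per 32 contiguous elements along k."""
    assert k % block == 0, "k must be a multiple of the MX block size"
    return (n, k // block)
```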

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


/sgl-workspace/sglang# python python/sglang/test/test_block_fp8.py -k TestMXFP8DenseLinear
test_mxfp8_dense_linear (__main__.TestMXFP8DenseLinear.test_mxfp8_dense_linear) ... [CI Test Method] TestMXFP8DenseLinear.test_mxfp8_dense_linear
ok

----------------------------------------------------------------------
Ran 1 test in 1.012s

OK
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
BEFORE
Accuracy: 0.817
Invalid: 0.000
Latency: 25.296 s
Output throughput: 9041.457 token/s

AFTER
Accuracy: 0.849
Invalid: 0.000
Latency: 14.757 s
Output throughput: 14630.467 token/s
@zianglih changed the title from "Add mxfp8 support for online quantization, Triton linear, and CUTLASS MoE" to "Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE" on Jan 21, 2026.
@ispobock
Collaborator

/tag-and-rerun-ci

@zianglih
Contributor Author

zianglih commented Jan 26, 2026

Implemented a few minor fixes for /update_weights_from_disk.
Test for online mxfp8 quantization:

python -m sglang.launch_server --tp 2 --model Qwen/Qwen3-30B-A3B-Instruct-2507 --quantization mxfp8 --fp8-gemm-backend triton --moe-runner-backend cutlass  &
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.964
Invalid: 0.000
Latency: 20.583 s
Output throughput: 8330.881 token/s

Test for Qwen3-4B-Instruct-2507-MXFP8 (dense), with /update_weights_from_disk:

python -m sglang.launch_server --tp 1 --model /data/models/Qwen3-4B-Instruct-2507-MXFP8 --fp8-gemm-backend triton --moe-runner-backend cutlass  &
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.849
Invalid: 0.000
Latency: 16.115 s
Output throughput: 13397.779 token/s
curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "/data/models/Qwen3-4B-Instruct-2507-MXFP8",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.849
Invalid: 0.000
Latency: 15.149 s
Output throughput: 14251.889 token/s
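The curl call above can also be issued from Python with only the standard library; this helper builds the same POST request (the function name is mine, not an sglang API):

```python
import json
from urllib import request

def build_update_weights_request(base_url, model_path):
    """Build the POST to /update_weights_from_disk with the same JSON
    payload as the curl invocation above."""
    body = json.dumps({
        "model_path": model_path,
        "flush_cache": True,
        "abort_all_requests": False,
    }).encode()
    return request.Request(
        base_url + "/update_weights_from_disk",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually issue it against a running server:
# req = build_update_weights_request("http://localhost:30000",
#                                    "/data/models/Qwen3-4B-Instruct-2507-MXFP8")
# with request.urlopen(req) as resp:
#     print(resp.status)
```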

Test for Qwen3-30B-A3B-Instruct-2507-MXFP8 (MoE), with /update_weights_from_disk:


python -m sglang.launch_server --tp 1 --model /data/models/Qwen3-30B-A3B-Instruct-2507-MXFP8 --fp8-gemm-backend triton --moe-runner-backend cutlass  &
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.963
Invalid: 0.000
Latency: 18.686 s
Output throughput: 9081.105 token/s
curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "/data/models/Qwen3-30B-A3B-Instruct-2507-MXFP8",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.963
Invalid: 0.000
Latency: 19.230 s
Output throughput: 8823.926 token/s

Implementation is now fully complete.

@ispobock ispobock merged commit 3c9cc44 into sgl-project:main Jan 29, 2026
205 of 223 checks passed
Comment on lines +344 to +347
if self.use_mxfp8 and not self.is_checkpoint_fp8_serialized:
    raise ValueError(
        "MXFP8 requires fp8-serialized checkpoint for linear layers."
    )
Contributor

@fxmarty-amd fxmarty-amd Jan 29, 2026


This is surprising given the snippet below?

elif self.use_mxfp8:
    if not self.is_checkpoint_fp8_serialized:
        self._quantize_mxfp8_weights(layer)
    return

Should this rather say that it is simply untested? Or should this error be removed?

Contributor Author

mxfp8 online quantization from bf16 is tested, as shown in the Accuracy Tests section. This raise is never triggered, so it can be removed.
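If the raise is removed as suggested, the resolved control flow for mxfp8 linear weights reduces to two branches; a toy sketch (names are illustrative, not the actual fp8.py methods):

```python
def mxfp8_weight_path(is_checkpoint_fp8_serialized):
    """bf16 checkpoints are quantized online; fp8-serialized checkpoints
    are loaded directly. No error path remains."""
    if not is_checkpoint_fp8_serialized:
        return "quantize_online"
    return "load_preserialized"
```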

charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Socratesa added a commit to Socratesa/sglang that referenced this pull request Feb 27, 2026
The per-tensor FP8 MoE path in process_weights_after_loading() replaced
Parameter objects with new ones via torch.nn.Parameter(), destroying
custom attributes (weight_loader, quant_method) needed by EPLB weight
hot-reload.  The block-quant path was already fixed (PR sgl-project#17449) but the
per-tensor path was missed.

Additionally, update_weights_from_disk() unconditionally called
process_weights_after_loading() on ALL modules even during partial
reloads (e.g. EPLB expert rebalancing), causing non-expert layers like
FP8 attention to be double-processed (double transpose -> shape mismatch).

Changes in fp8.py:
- Dynamic-quant path: use .data= for weight rebinding and
  .data[expert].fill_() for scale updates instead of Parameter replacement.
- Checkpoint-FP8 path: use .data.fill_() for input_scale merging;
  fill both columns of the [E,2] w13_weight_scale in-place instead of
  replacing with a new [E] Parameter.
- DeepGemm apply(): add ndim==2 guard to collapse w13_weight_scale
  from [E,2] to [E] via [:,0] before expanding to block shape.

Changes in model_runner.py:
- When weight_name_filter is set (EPLB expert rebalancing), split
  load_weights and process_weights_after_loading into two steps,
  only calling the latter on modules whose names match the filter.

Signed-off-by: Socratesa <lihaode@zju.edu.cn>
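A torch-free analogy of the failure mode this commit describes: replacing a Parameter-like object drops the attributes that were attached to it (weight_loader, quant_method), while rebinding .data in place preserves them. The Param class below is a stand-in, not sglang or torch code:

```python
class Param:
    """Stand-in for torch.nn.Parameter: a value plus loader metadata
    that sglang attaches after construction."""
    def __init__(self, data):
        self.data = data

p = Param([1.0, 2.0])
p.weight_loader = "custom_loader"   # attribute EPLB hot-reload relies on

# Replacing the object (the buggy per-tensor path) loses the attribute:
replaced = Param([3.0, 4.0])
assert not hasattr(replaced, "weight_loader")

# Rebinding .data in place (the fix) keeps it:
p.data = [3.0, 4.0]
assert p.weight_loader == "custom_loader"
```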
Socratesa added a commit to Socratesa/sglang that referenced this pull request Feb 27, 2026
Socratesa added a commit to Socratesa/sglang that referenced this pull request Feb 27, 2026
@zianglih zianglih deleted the mxfp8-no-dg branch April 6, 2026 08:07

Labels

documentation (Improvements or additions to documentation), run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants