
Conversation

@IwakuraRein (Contributor) commented on Dec 1, 2025

Purpose

  • Choose ReduceScatterSum over ReduceSum when the sizes are the same (see the reduce-scatter sketch after this list).

  • Avoid calling the fp8 quant kernel twice when using fp8 attention by creating a staging buffer for decode_ql_nope and decode_q_pe. This also eliminates the torch.cat in the FlashInfer MLA backend (see the staging-buffer sketch after this list).

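For context on the first bullet, here is a minimal torch.distributed sketch (not the vLLM code path; the function names and the even-divisibility assumption are mine) of why ReduceScatterSum can replace ReduceSum plus a local slice when the sizes line up: the summed values are identical, but each rank receives only its own shard instead of the full tensor.

```python
# Minimal sketch, assuming dim 0 is divisible by the world size and that the
# default process group is already initialized. Not the vLLM implementation.
import torch
import torch.distributed as dist

def summed_shard_via_allreduce(x: torch.Tensor) -> torch.Tensor:
    # ReduceSum-style path: every rank materializes the full sum, then slices.
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    rank, world = dist.get_rank(), dist.get_world_size()
    return x.chunk(world, dim=0)[rank]

def summed_shard_via_reduce_scatter(x: torch.Tensor) -> torch.Tensor:
    # ReduceScatterSum-style path: the sum and the scatter happen in a single
    # collective, so each rank communicates and stores only its own slice.
    world = dist.get_world_size()
    out = torch.empty(x.shape[0] // world, *x.shape[1:],
                      dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(out, x, op=dist.ReduceOp.SUM)
    return out
```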

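For the second bullet, a minimal sketch of the staging-buffer idea (plain PyTorch with illustrative names, shapes, and a per-tensor scale; the real path calls the fused fp8 quant kernel rather than a plain cast): copy both query pieces into adjacent slices of one preallocated buffer so a single quantization pass covers them, with no torch.cat.

```python
import torch

def quantize_query_staged(ql_nope: torch.Tensor,  # [tokens, heads, nope_dim]
                          q_pe: torch.Tensor,     # [tokens, heads, pe_dim]
                          scale: torch.Tensor) -> torch.Tensor:
    tokens, heads, nope_dim = ql_nope.shape
    pe_dim = q_pe.shape[-1]
    # Staging buffer holding both pieces back to back along the last dim,
    # so there is no torch.cat and no second quantization launch.
    staged = torch.empty(tokens, heads, nope_dim + pe_dim,
                         dtype=ql_nope.dtype, device=ql_nope.device)
    staged[..., :nope_dim].copy_(ql_nope)
    staged[..., nope_dim:].copy_(q_pe)
    # One quantization over the combined buffer; a simple fp8 cast stands in
    # for the fused quant kernel here.
    return (staged / scale).to(torch.float8_e4m3fn)
```
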
Test Plan

Launch command

  --dtype auto --kv-cache-dtype fp8 \
  --tensor-parallel-size 8 \
  --swap-space 16 --max-num-seqs 1024 --trust-remote-code --max-model-len 10240 --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 16384 --async-scheduling \
  --max-cudagraph-capture-size 1024 --compilation_config.cudagraph_mode FULL_DECODE_ONLY

Evaluation command

lm-eval --model local-completions --tasks gsm8k --model_args model=nvidia/DeepSeek-R1-0528-FP4-v2,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False --num_fewshot 20

Models

Test Result

  • Hopper, VLLM_ATTENTION_BACKEND=FLASHMLA

    |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
    |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
    |gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9530|±  |0.0058|
    |     |       |strict-match    |    20|exact_match|↑  |0.9522|±  |0.0059|
    
  • Blackwell, VLLM_ATTENTION_BACKEND=FLASHINFER_MLA

    |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
    |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
    |gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9538|±  |0.0058|
    |     |       |strict-match    |    20|exact_match|↑  |0.9522|±  |0.0059|
    

Perf gain: a projected 5% improvement in decode with the FlashInfer MLA backend when serving DeepSeek R1 FP4 on 4x GB200 with DEP4.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@IwakuraRein force-pushed the improve-attn-fp8-quant branch from 4c10b28 to b39aa3b on December 1, 2025 at 22:01
@IwakuraRein changed the title from "[Perf] Improve attn fp8 quant; replace ReduceSum with ReduceScatterSum" to "[Perf] Improve fp8 quant in mla; replace ReduceSum with ReduceScatterSum" on Dec 1, 2025
@IwakuraRein marked this pull request as ready for review on December 1, 2025 at 22:48

@heheda12345 (Collaborator) commented:

CC @MatthewBonanni @LucasWilkinson

@MatthewBonanni (Contributor) left a comment:

Thanks for this contribution! Can you include some benchmark results (or a snippet of a profile) to get a sense of the speedup?

@pavanimajety (Collaborator) left a comment:

LGTM, minor comments

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 4, 2025
@pavanimajety pavanimajety added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 5, 2025
@pavanimajety (Collaborator) commented:

@IwakuraRein Could you please add to the PR description why ReduceScatterSum is better than ReduceSum when the sizes are the same, or post some comparison results?

@pavanimajety pavanimajety merged commit 1fb632f into vllm-project:main Dec 8, 2025
58 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 8, 2025
therealnaveenkamal added a commit to therealnaveenkamal/vllm that referenced this pull request Dec 9, 2025
Signed-off-by: Naveenraj Kamalakannan <[email protected]>
mayoohee pushed a commit to mayoohee/vllm that referenced this pull request Dec 9, 2025
@IwakuraRein IwakuraRein deleted the improve-attn-fp8-quant branch December 9, 2025 17:54

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: Done
