[Perf] Improve fp8 quant in mla; replace ReduceSum with ReduceScatterSum #29795
Conversation
529031a to 75957c4 (compare)
Signed-off-by: Siyuan Fu <[email protected]>
4c10b28 to b39aa3b (compare)
MatthewBonanni left a comment:
Thanks for this contribution! Can you include some benchmark results (or a snippet of a profile) to get a sense of the speedup?
pavanimajety left a comment:
LGTM, minor comments
@IwakuraRein Could you please add to the PR description why ReduceScatterSum is better than ReduceSum when the sizes are the same, or post some comparison results?
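For context: with equal tensor sizes, a plain reduce-sum leaves the full summed tensor on every rank, whereas a reduce-scatter-sum leaves each rank with only its own shard of the sum, so each rank materializes roughly 1/world_size of the output and the subsequent sharded compute can read its local slice directly. Below is a minimal torch.distributed sketch of the two collectives; it is illustrative only, the tensor names are hypothetical, and it is not the vLLM code path.

```python
# Illustrative contrast of the two collectives using torch.distributed.
# Assumes an initialized process group; shapes are made up.
import torch
import torch.distributed as dist

def reduce_sum_keep_full(x: torch.Tensor) -> torch.Tensor:
    # "ReduceSum"-style: every rank ends up holding the full summed tensor.
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    return x  # shape: [world_size * chunk, hidden] on every rank

def reduce_scatter_sum_keep_shard(x: torch.Tensor) -> torch.Tensor:
    # "ReduceScatterSum"-style: each rank receives only its own shard of the
    # sum, so it materializes 1/world_size of the reduced tensor.
    world_size = dist.get_world_size()
    out = torch.empty(x.shape[0] // world_size, *x.shape[1:],
                      dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(out, x, op=dist.ReduceOp.SUM)
    return out  # shape: [chunk, hidden] on this rank
```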
Signed-off-by: Naveenraj Kamalakannan <[email protected]>
…Sum (vllm-project#29795)
Signed-off-by: Siyuan Fu <[email protected]>
Signed-off-by: mayoohee <[email protected]>
Purpose
Choose ReduceScatterSum over ReduceSum when the sizes are the same.
Avoid calling the fp8 quant kernel twice when using fp8 attention by creating a staging buffer for decode_ql_nope and decode_q_pe. This also eliminates the torch.cat in the FlashInfer MLA backend (a minimal sketch of this staging-buffer pattern appears after the Test Result section below).
Test Plan
Launch command
Evaluation command
Models
Test Result
Hopper: VLLM_ATTENTION_BACKEND=FLASHMLA
Blackwell: VLLM_ATTENTION_BACKEND=FLASHINFER_MLA
Perf gain: 5% projected improvement in decode with the FlashInfer MLA backend when serving DeepSeek R1 FP4 on 4 GB200 GPUs with DEP4
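For illustration, here is a minimal PyTorch-level sketch of the staging-buffer idea from the Purpose section. Only decode_ql_nope and decode_q_pe are names taken from the PR; the dtype, shapes, and helper functions are assumptions, and the actual vLLM kernels differ.

```python
# Illustrative sketch only; not the vLLM implementation.
import torch

FP8 = torch.float8_e4m3fn  # assumed fp8 dtype for this sketch

# Before (roughly): each query part is quantized on its own and then glued
# together, costing two quant launches plus a torch.cat copy.
def quantize_separately(decode_ql_nope: torch.Tensor,
                        decode_q_pe: torch.Tensor) -> torch.Tensor:
    q_nope_fp8 = decode_ql_nope.to(FP8)                # quant launch #1
    q_pe_fp8 = decode_q_pe.to(FP8)                     # quant launch #2
    return torch.cat([q_nope_fp8, q_pe_fp8], dim=-1)   # extra copy

# After (roughly): both parts sit in one contiguous staging buffer, so a
# single quant pass covers them and no torch.cat is needed. In the real code
# the upstream ops would presumably write into the buffer directly rather
# than via the explicit copies shown here.
def quantize_via_staging_buffer(decode_ql_nope: torch.Tensor,
                                decode_q_pe: torch.Tensor) -> torch.Tensor:
    *lead, d_nope = decode_ql_nope.shape
    d_pe = decode_q_pe.shape[-1]
    staging = torch.empty(*lead, d_nope + d_pe,
                          dtype=decode_ql_nope.dtype,
                          device=decode_ql_nope.device)
    staging[..., :d_nope].copy_(decode_ql_nope)
    staging[..., d_nope:].copy_(decode_q_pe)
    return staging.to(FP8)                             # single quant launch
```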
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.