support 32K model len on deepseek r1 W8A8 #728
What this PR does / why we need it?
Optimize NPU memory usage (#723).
With vLLM v0.8.4.rc2, DeepSeek R1 W8A8 can only support a model length of 16K; attempting to run with a model length of 32K triggers an out-of-memory (OOM) error on the NPU.
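As a minimal reproduction sketch (not taken from this PR), the failing setup corresponds to loading the model through vLLM's offline `LLM` API with a 32K max model length; the checkpoint path and tensor-parallel size below are assumptions for a typical multi-NPU deployment:

```python
# Hypothetical reproduction sketch, not from this PR: the checkpoint path
# and tensor_parallel_size are assumptions for a typical multi-NPU setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # assumed W8A8-quantized checkpoint
    max_model_len=32768,              # 32K context; OOMed before this change
    tensor_parallel_size=8,           # assumed device count
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

Before this PR, initialization with `max_model_len=32768` fails with OOM on the NPU; with the memory optimization applied, the same configuration is expected to start successfully.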
Does this PR introduce any user-facing change?
How was this patch tested?