
Conversation

@menogrey
Contributor

What this PR does / why we need it?

Add AWQ quantization support to vllm-ascend.
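
As a rough usage illustration (not taken from this PR description), an AWQ checkpoint would typically be exercised through vLLM's standard offline API once this support lands; the model path below is a placeholder, and the explicit quantization argument may be optional when the checkpoint's config already declares AWQ.

# Hedged usage sketch: load and run an AWQ-quantized checkpoint on Ascend.
# The model path is a placeholder; substitute a local AWQ checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/awq-quantized-model", quantization="awq")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)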

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill out the PR description and write a clear commit message so that reviewers and future developers can understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for AWQ quantization on Ascend NPUs. The changes include adding the AWQ quantization method to the platform, implementing the necessary quantization configuration and methods, and leveraging NPU-specific operations for AWQ. The overall approach is sound, but there is a critical issue in the implementation of the npu_fused_experts function regarding how activations are quantized and how the matrix multiplication kernel is called, which needs to be addressed.
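
For orientation, the sketch below shows the general shape of a quantization config that a platform plugin registers with vLLM. It is illustrative only: the class and method names are assumptions rather than the code from this PR, and the exact abstract hooks required vary with the vLLM version.

# Illustrative sketch (not the PR's code): the rough shape of an AWQ
# quantization config as vLLM plugins typically define one.
from typing import Any, Dict, List, Optional

import torch

from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)


class AscendAWQConfig(QuantizationConfig):  # hypothetical name
    """AWQ: low-bit grouped weight quantization, 16-bit activations."""

    def __init__(self, weight_bits: int, group_size: int, zero_point: bool):
        self.weight_bits = weight_bits
        self.group_size = group_size
        self.zero_point = zero_point

    def get_name(self) -> str:
        return "awq"

    def get_supported_act_dtypes(self) -> List[torch.dtype]:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        # Compute-capability gating is CUDA-oriented; NPU backends
        # typically sidestep it.
        return -1

    @staticmethod
    def get_config_filenames() -> List[str]:
        return ["quant_config.json", "quantize_config.json"]

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "AscendAWQConfig":
        return cls(
            weight_bits=cls.get_from_keys(config, ["bits", "w_bit"]),
            group_size=cls.get_from_keys(config, ["group_size", "q_group_size"]),
            zero_point=cls.get_from_keys(config, ["zero_point"]),
        )

    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional[QuantizeMethodBase]:
        # A real implementation returns an NPU-specific linear / fused-MoE
        # method that dispatches to torch_npu kernels; omitted here.
        return None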

Comment on lines +79 to +97
hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
if not use_wna16:
    hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
    scale_args13 = {
        "scale": [w13_scale.to(scale_dtype)],
        "per_token_scale": [pertoken_scale],
    }
else:
    scale_args13 = {
        "antiquant_scale": [w13_scale],
        "antiquant_offset": [w13_offset],
    }

hidden_states = torch_npu.npu_grouped_matmul(
    x=[hidden_states],
    weight=[w13],
    scale=[w13_scale.to(scale_dtype)],
    per_token_scale=[pertoken_scale],
    **scale_args13,

critical

The activation quantization and matrix multiplication logic in this block appears to have several issues:

  1. Redundant Quantization: When use_wna16 is False, torch_npu.npu_dynamic_quant is called twice consecutively (lines 79 and 81), which is redundant and incorrect.
  2. Incorrect Conditional Quantization: If use_wna16 is True, which implies weight-only quantization (weights in N-bit, activations kept in 16-bit), npu_dynamic_quant is still called unconditionally on line 79.
  3. Incorrect API Usage: The call to torch_npu.npu_grouped_matmul on line 92 passes arguments for both standard quantization (scale, per_token_scale) and on-the-fly dequantization (antiquant_scale, antiquant_offset) when use_wna16 is True. This is likely incorrect.

A similar set of issues exists for the second npu_grouped_matmul call (lines 106-128). The logic should be refactored to quantize the activation only when needed and to call npu_grouped_matmul with the correct, mutually exclusive set of arguments based on the use_wna16 flag, as sketched below.
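
A minimal sketch of that restructuring, reusing the variable names from the diff above (hidden_states, w13, w13_scale, w13_offset, scale_dtype, use_wna16); it is not the authoritative fix, and the remaining npu_grouped_matmul arguments are left out exactly as in the truncated snippet.

# Quantize the activation only on the W8A8 path and pass mutually
# exclusive argument sets to npu_grouped_matmul.
if not use_wna16:
    # W8A8: quantize activations once and supply per-token scales.
    hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
    gmm_kwargs = {
        "scale": [w13_scale.to(scale_dtype)],
        "per_token_scale": [pertoken_scale],
    }
else:
    # WNA16: activations stay in 16-bit; weights are dequantized on the
    # fly via the antiquant parameters.
    gmm_kwargs = {
        "antiquant_scale": [w13_scale],
        "antiquant_offset": [w13_offset],
    }

hidden_states = torch_npu.npu_grouped_matmul(
    x=[hidden_states],
    weight=[w13],
    **gmm_kwargs,
    # remaining arguments unchanged from the PR
)

Keeping the two argument dictionaries mutually exclusive also makes the W8A8 and WNA16 paths easier to test in isolation.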

