
Conversation

@menogrey
Contributor

What this PR does / why we need it?

Add AWQ quantization support to vllm-ascend.
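
As a rough usage illustration (not taken from this PR description), an AWQ checkpoint would typically be exercised through vLLM's standard offline API once this support lands; the model path below is a placeholder, and the explicit quantization argument may be optional when the checkpoint's config already declares AWQ.

# Hedged usage sketch: load and run an AWQ-quantized checkpoint on Ascend.
# The model path is a placeholder; substitute a local AWQ checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/awq-quantized-model", quantization="awq")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)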

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill out the PR description and write a clear commit message so that reviewers and future developers can understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for AWQ quantization on Ascend NPUs. The changes include adding the AWQ quantization method to the platform, implementing the necessary quantization configuration and methods, and leveraging NPU-specific operations for AWQ. The overall approach is sound, but there is a critical issue in the implementation of the npu_fused_experts function regarding how activations are quantized and how the matrix multiplication kernel is called, which needs to be addressed.
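
For orientation, the sketch below shows the general shape of a quantization config that a platform plugin registers with vLLM. It is illustrative only: the class and method names are assumptions rather than the code from this PR, and the exact abstract hooks required vary with the vLLM version.

# Illustrative sketch (not the PR's code): the rough shape of an AWQ
# quantization config as vLLM plugins typically define one.
from typing import Any, Dict, List, Optional

import torch

from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)


class AscendAWQConfig(QuantizationConfig):  # hypothetical name
    """AWQ: low-bit grouped weight quantization, 16-bit activations."""

    def __init__(self, weight_bits: int, group_size: int, zero_point: bool):
        self.weight_bits = weight_bits
        self.group_size = group_size
        self.zero_point = zero_point

    def get_name(self) -> str:
        return "awq"

    def get_supported_act_dtypes(self) -> List[torch.dtype]:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        # Compute-capability gating is CUDA-oriented; NPU backends
        # typically sidestep it.
        return -1

    @staticmethod
    def get_config_filenames() -> List[str]:
        return ["quant_config.json", "quantize_config.json"]

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "AscendAWQConfig":
        return cls(
            weight_bits=cls.get_from_keys(config, ["bits", "w_bit"]),
            group_size=cls.get_from_keys(config, ["group_size", "q_group_size"]),
            zero_point=cls.get_from_keys(config, ["zero_point"]),
        )

    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional[QuantizeMethodBase]:
        # A real implementation returns an NPU-specific linear / fused-MoE
        # method that dispatches to torch_npu kernels; omitted here.
        return None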

Comment on lines +79 to +97
hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
if not use_wna16:
    hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
    scale_args13 = {
        "scale": [w13_scale.to(scale_dtype)],
        "per_token_scale": [pertoken_scale],
    }
else:
    scale_args13 = {
        "antiquant_scale": [w13_scale],
        "antiquant_offset": [w13_offset],
    }

hidden_states = torch_npu.npu_grouped_matmul(
    x=[hidden_states],
    weight=[w13],
    scale=[w13_scale.to(scale_dtype)],
    per_token_scale=[pertoken_scale],
    **scale_args13,

critical

The activation quantization and matrix multiplication logic in this block appears to have several issues:

  1. Redundant Quantization: When use_wna16 is False, torch_npu.npu_dynamic_quant is called twice consecutively (lines 79 and 81), which is redundant and incorrect.
  2. Incorrect Conditional Quantization: If use_wna16 is True, which implies weight-only quantization (weights in N-bit, activations kept in 16-bit), npu_dynamic_quant is still called unconditionally on line 79.
  3. Incorrect API Usage: The call to torch_npu.npu_grouped_matmul on line 92 passes arguments for both standard quantization (scale, per_token_scale) and on-the-fly dequantization (antiquant_scale, antiquant_offset) when use_wna16 is True. This is likely incorrect.

A similar set of issues exists for the second npu_grouped_matmul call (lines 106-128). The logic should be refactored to quantize the activation only when needed and to call npu_grouped_matmul with the correct, mutually exclusive set of arguments based on the use_wna16 flag, as sketched below.
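
A minimal sketch of that restructuring, reusing the variable names from the diff above (hidden_states, w13, w13_scale, w13_offset, scale_dtype, use_wna16); it is not the authoritative fix, and the remaining npu_grouped_matmul arguments are left out exactly as in the truncated snippet.

# Quantize the activation only on the W8A8 path and pass mutually
# exclusive argument sets to npu_grouped_matmul.
if not use_wna16:
    # W8A8: quantize activations once and supply per-token scales.
    hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
    gmm_kwargs = {
        "scale": [w13_scale.to(scale_dtype)],
        "per_token_scale": [pertoken_scale],
    }
else:
    # WNA16: activations stay in 16-bit; weights are dequantized on the
    # fly via the antiquant parameters.
    gmm_kwargs = {
        "antiquant_scale": [w13_scale],
        "antiquant_offset": [w13_offset],
    }

hidden_states = torch_npu.npu_grouped_matmul(
    x=[hidden_states],
    weight=[w13],
    **gmm_kwargs,
    # remaining arguments unchanged from the PR
)

Keeping the two argument dictionaries mutually exclusive also makes the W8A8 and WNA16 paths easier to test in isolation.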

