
Conversation


@shen-shanshan (Collaborator) commented Nov 19, 2025

What this PR does / why we need it?

Implement and register custom AscendMMEncoderAttention.

Note: This is just a draft and needs further updates to follow the upstream interface.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: shen-shanshan <[email protected]>
@shen-shanshan marked this pull request as draft November 19, 2025 09:23
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist bot left a comment


Code Review

This pull request refactors the custom attention logic for Ascend NPUs by extracting it from the model-specific AscendQwen2_5_VisionAttention into a more general AscendMMEncoderAttention class. This is a good step towards modularity. However, the refactoring has introduced several critical issues. The new AscendMMEncoderAttention module has multiple missing imports and uninitialized attributes that will cause runtime errors. Additionally, the weight padding logic has been moved into the forward pass, which is a problematic design that can cause statefulness and race conditions. The original weight loading logic in AscendQwen2_5_VisionTransformer is now broken as it tries to call methods and access attributes that have been moved. These issues need to be addressed to make the PR functional.

Comment on lines 172 to 177
if ("attn.qkv.weight_scale" in name
or "attn.qkv.weight_offset" in name) and self.enable_pad:
param.data = self.pad_qkv_weight_scale_offset(param.data)
elif ("attn.qkv.deq_scale" in name
or "attn.qkv.quant_bias" in name) and self.enable_pad:
param.data = self.pad_qkv_deq_scale_quant_bias(param.data)

critical

This logic is broken after the refactoring. It attempts to access self.enable_pad and call padding methods (self.pad_qkv_weight_scale_offset, self.pad_qkv_deq_scale_quant_bias) that have been removed from AscendQwen2_5_VisionTransformer. This will cause AttributeErrors. This logic for padding quantization-related tensors should be moved to a load_weights method within the AscendMMEncoderAttention class, where it can correctly access its own attributes and methods.
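
A rough sketch of that suggestion (not code from this PR; it assumes AscendMMEncoderAttention keeps the enable_pad flag and the two padding helpers, and follows vLLM's usual load_weights(weights) -> set[str] shape):

# Hypothetical sketch only: real vLLM loading usually goes through each
# parameter's weight_loader; this just shows where the padding could live.
def load_weights(self, weights):
    params = dict(self.named_parameters())
    loaded = set()
    for name, loaded_weight in weights:
        param = params[name]
        param.data = loaded_weight
        if self.enable_pad:
            # Pad quantization tensors here instead of in the vision transformer.
            if "weight_scale" in name or "weight_offset" in name:
                param.data = self.pad_qkv_weight_scale_offset(param.data)
            elif "deq_scale" in name or "quant_bias" in name:
                param.data = self.pad_qkv_deq_scale_quant_bias(param.data)
        loaded.add(name)
    return loaded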

Comment on lines +16 to +18
import torch.nn.functional as F

from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention

critical

This file is missing several imports, which will lead to NameError exceptions at runtime. The following names are used but not defined: Optional, QuantizationConfig, dist_utils, is_enable_nz, torch_npu, ACL_FORMAT_FRACTAL_ND, and rearrange.

from typing import Optional

import torch
import torch.nn.functional as F
import torch_npu
from einops import rearrange

from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention
from vllm.distributed import utils as dist_utils
from vllm.model_executor.layers.quantization import QuantizationConfig
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_ND, is_enable_nz

Comment on lines +44 to +45
self.hidden_size_per_attention_head = dist_utils.divide(
    projection_size, num_heads)

critical

The attribute self.origin_hidden_size_per_attention_head is used in the forward method, but it's only initialized if self.enable_pad is true. This will cause an AttributeError when padding is disabled. It should be initialized unconditionally.

        self.hidden_size_per_attention_head = dist_utils.divide(
            projection_size, num_heads)
        self.origin_hidden_size_per_attention_head = self.hidden_size_per_attention_head

sin = F.pad(
    sin, (0, self.half_pad_hidden_size_per_attention_head))

if not self.interleaved:

critical

The self.interleaved attribute is used here but it is not initialized in the __init__ method of AscendMMEncoderAttention, which will cause an AttributeError. This property needs to be passed down from AscendQwen2_5_VisionTransformer during the instantiation of this attention layer.
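
A minimal illustration of the fix (class and parameter names below are placeholders, not the PR's real signatures):

# Illustrative pattern only: thread the flag through __init__ so that
# self.interleaved is always defined before forward() runs.
class EncoderAttentionSketch:
    def __init__(self, num_heads: int, interleaved: bool = False) -> None:
        self.num_heads = num_heads
        self.interleaved = interleaved

class VisionTransformerSketch:
    def __init__(self, num_heads: int, interleaved: bool) -> None:
        # The transformer already knows the rotary layout; pass it down.
        self.attn = EncoderAttentionSketch(num_heads, interleaved=interleaved)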

self.hidden_size_per_attention_head)
return cos_new, sin_new

return cos, sin

critical

This return statement is unreachable because the function will always return on line 186. This dead code should be removed.

Comment on lines 198 to 205
if self.enable_pad and not self.finish_pad:
    self.qkv.weight.data = self.pad_qkv_weight(self.qkv.weight.data)
    self.qkv.bias.data = self.pad_qkv_bias(self.qkv.bias.data)
    self.proj.weight.data = self.pad_proj_weight(self.proj.weight.data)
    # TODO(shen-shanshan): optimize this to avoid redundant computation.
    cos, sin = self.pad_cos_sin(cos, sin)
    # TODO(shen-shanshan): add padding for quantization.
    self.finish_pad = True

high

Modifying module parameters like self.qkv.weight.data inside the forward method is not a good practice. It makes the method stateful and not thread-safe, which can lead to race conditions and unpredictable behavior. This padding logic should be executed only once, ideally during weight loading. Consider overriding the load_weights method in this class to perform the padding after the weights have been loaded by the parent class. This would also be the right place to move the quantization padding logic from AscendQwen2_5_VisionTransformer.load_weights.
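
One way to restructure this (a sketch under the same assumptions as the load_weights suggestion above; the helper name _pad_weights_after_loading is hypothetical):

# Hypothetical sketch: run the padding exactly once after the weights are in
# place, so forward() never rewrites parameters.
def _pad_weights_after_loading(self):
    if not self.enable_pad:
        return
    self.qkv.weight.data = self.pad_qkv_weight(self.qkv.weight.data)
    self.qkv.bias.data = self.pad_qkv_bias(self.qkv.bias.data)
    self.proj.weight.data = self.pad_proj_weight(self.proj.weight.data)

This helper would be called once at the end of a load_weights override (see the sketch earlier in this review), and the enable_pad / finish_pad guard and the weight assignments could then be dropped from forward(). The cos/sin padding would still happen in forward (or be cached), since the rotary embeddings are inputs rather than parameters.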
