[MM][Draft] Implement and register custom AscendMMEncoderAttention #4279
base: main
Conversation
Signed-off-by: shen-shanshan <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the custom attention logic for Ascend NPUs by extracting it from the model-specific AscendQwen2_5_VisionAttention into a more general AscendMMEncoderAttention class. This is a good step towards modularity. However, the refactoring has introduced several critical issues. The new AscendMMEncoderAttention module has multiple missing imports and uninitialized attributes that will cause runtime errors. Additionally, the weight padding logic has been moved into the forward pass, which is a problematic design that can cause statefulness and race conditions. The original weight loading logic in AscendQwen2_5_VisionTransformer is now broken as it tries to call methods and access attributes that have been moved. These issues need to be addressed to make the PR functional.
vllm_ascend/models/qwen2_5_vl.py
Outdated
| if ("attn.qkv.weight_scale" in name | ||
| or "attn.qkv.weight_offset" in name) and self.enable_pad: | ||
| param.data = self.pad_qkv_weight_scale_offset(param.data) | ||
| elif ("attn.qkv.deq_scale" in name | ||
| or "attn.qkv.quant_bias" in name) and self.enable_pad: | ||
| param.data = self.pad_qkv_deq_scale_quant_bias(param.data) |
This logic is broken after the refactoring. It attempts to access self.enable_pad and call padding methods (self.pad_qkv_weight_scale_offset, self.pad_qkv_deq_scale_quant_bias) that have been removed from AscendQwen2_5_VisionTransformer. This will cause AttributeErrors. This logic for padding quantization-related tensors should be moved to a load_weights method within the AscendMMEncoderAttention class, where it can correctly access its own attributes and methods.
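A minimal sketch of what such an override could look like, assuming AscendMMEncoderAttention keeps the enable_pad flag and the pad_qkv_weight_scale_offset / pad_qkv_deq_scale_quant_bias helpers, and assuming the class can hook into vLLM's usual load_weights(weights) convention; everything not shown in the diff above is illustrative, not part of this PR:

from collections.abc import Iterable

import torch
from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention


class AscendMMEncoderAttention(MMEncoderAttention):

    def load_weights(
            self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        # Assumed convention: let the parent class materialize the
        # parameters first, then pad the quantization tensors in place.
        loaded = super().load_weights(weights)
        if self.enable_pad:
            for name, param in self.named_parameters():
                if "qkv.weight_scale" in name or "qkv.weight_offset" in name:
                    param.data = self.pad_qkv_weight_scale_offset(param.data)
                elif "qkv.deq_scale" in name or "qkv.quant_bias" in name:
                    param.data = self.pad_qkv_deq_scale_quant_bias(param.data)
        return loaded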
import torch.nn.functional as F

from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention
This file is missing several imports, which will lead to NameError exceptions at runtime. The following names are used but not defined: Optional, QuantizationConfig, dist_utils, is_enable_nz, torch_npu, ACL_FORMAT_FRACTAL_ND, and rearrange.
from typing import Optional
import torch
import torch.nn.functional as F
import torch_npu
from einops import rearrange
from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention
from vllm.distributed import utils as dist_utils
from vllm.model_executor.layers.quantization import QuantizationConfig
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_ND, is_enable_nz

self.hidden_size_per_attention_head = dist_utils.divide(
    projection_size, num_heads)
The attribute self.origin_hidden_size_per_attention_head is used in the forward method, but it's only initialized if self.enable_pad is true. This will cause an AttributeError when padding is disabled. It should be initialized unconditionally.
self.hidden_size_per_attention_head = dist_utils.divide(
    projection_size, num_heads)
self.origin_hidden_size_per_attention_head = self.hidden_size_per_attention_head

sin = F.pad(
    sin, (0, self.half_pad_hidden_size_per_attention_head))

if not self.interleaved:
        self.hidden_size_per_attention_head)
    return cos_new, sin_new

return cos, sin
if self.enable_pad and not self.finish_pad:
    self.qkv.weight.data = self.pad_qkv_weight(self.qkv.weight.data)
    self.qkv.bias.data = self.pad_qkv_bias(self.qkv.bias.data)
    self.proj.weight.data = self.pad_proj_weight(self.proj.weight.data)
    # TODO(shen-shanshan): optimize this to avoid redundant computation.
    cos, sin = self.pad_cos_sin(cos, sin)
    # TODO(shen-shanshan): add padding for quantization.
    self.finish_pad = True
Modifying module parameters like self.qkv.weight.data inside the forward method is not a good practice. It makes the method stateful and not thread-safe, which can lead to race conditions and unpredictable behavior. This padding logic should be executed only once, ideally during weight loading. Consider overriding the load_weights method in this class to perform the padding after the weights have been loaded by the parent class. This would also be the right place to move the quantization padding logic from AscendQwen2_5_VisionTransformer.load_weights.
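For illustration, a sketch of moving the one-time parameter padding into a load_weights override so forward stays stateless. It assumes enable_pad and the pad_qkv_weight / pad_qkv_bias / pad_proj_weight / pad_cos_sin helpers remain on AscendMMEncoderAttention; the load_weights and forward signatures shown are schematic, not the actual upstream interface:

from collections.abc import Iterable

import torch
from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention


class AscendMMEncoderAttention(MMEncoderAttention):

    def load_weights(
            self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loaded = super().load_weights(weights)
        if self.enable_pad:
            # Pad the projection weights exactly once, right after loading,
            # so forward() never mutates parameters. The quantization
            # tensors could be padded here as well.
            self.qkv.weight.data = self.pad_qkv_weight(self.qkv.weight.data)
            self.qkv.bias.data = self.pad_qkv_bias(self.qkv.bias.data)
            self.proj.weight.data = self.pad_proj_weight(self.proj.weight.data)
        return loaded

    def forward(self, x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                **kwargs) -> torch.Tensor:
        # cos/sin are per-call inputs, so their padding stays in forward
        # (or is cached per sequence length); parameter padding does not.
        if self.enable_pad:
            cos, sin = self.pad_cos_sin(cos, sin)
        return super().forward(x, cos, sin, **kwargs)

With the weights padded at load time, the finish_pad flag and the in-forward parameter mutation become unnecessary.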
Signed-off-by: shen-shanshan <[email protected]>
What this PR does / why we need it?
Implement and register custom AscendMMEncoderAttention.
Note: This is just a draft and needs further updates to follow the upstream interface.
Does this PR introduce any user-facing change?
How was this patch tested?