
Conversation


@shen-shanshan (Collaborator) commented Nov 19, 2025

What this PR does / why we need it?

Implement and register custom AscendMMEncoderAttention.

Note: This is just a draft and needs further updates to follow the upstream interface.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: shen-shanshan <[email protected]>
@shen-shanshan marked this pull request as draft November 19, 2025 09:23
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist bot left a comment


Code Review

This pull request refactors the custom attention logic for Ascend NPUs by extracting it from the model-specific AscendQwen2_5_VisionAttention into a more general AscendMMEncoderAttention class. This is a good step towards modularity. However, the refactoring has introduced several critical issues. The new AscendMMEncoderAttention module has multiple missing imports and uninitialized attributes that will cause runtime errors. Additionally, the weight padding logic has been moved into the forward pass, which is a problematic design that can cause statefulness and race conditions. The original weight loading logic in AscendQwen2_5_VisionTransformer is now broken as it tries to call methods and access attributes that have been moved. These issues need to be addressed to make the PR functional.

Comment on lines 172 to 177
if ("attn.qkv.weight_scale" in name
or "attn.qkv.weight_offset" in name) and self.enable_pad:
param.data = self.pad_qkv_weight_scale_offset(param.data)
elif ("attn.qkv.deq_scale" in name
or "attn.qkv.quant_bias" in name) and self.enable_pad:
param.data = self.pad_qkv_deq_scale_quant_bias(param.data)

critical

This logic is broken after the refactoring. It attempts to access self.enable_pad and call padding methods (self.pad_qkv_weight_scale_offset, self.pad_qkv_deq_scale_quant_bias) that have been removed from AscendQwen2_5_VisionTransformer. This will cause AttributeErrors. This logic for padding quantization-related tensors should be moved to a load_weights method within the AscendMMEncoderAttention class, where it can correctly access its own attributes and methods.
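
A rough sketch of that suggestion (not code from this PR; it assumes AscendMMEncoderAttention keeps the enable_pad flag and the two padding helpers, and follows vLLM's usual load_weights(weights) -> set[str] shape):

# Hypothetical sketch only: real vLLM loading usually goes through each
# parameter's weight_loader; this just shows where the padding could live.
def load_weights(self, weights):
    params = dict(self.named_parameters())
    loaded = set()
    for name, loaded_weight in weights:
        param = params[name]
        param.data = loaded_weight
        if self.enable_pad:
            # Pad quantization tensors here instead of in the vision transformer.
            if "weight_scale" in name or "weight_offset" in name:
                param.data = self.pad_qkv_weight_scale_offset(param.data)
            elif "deq_scale" in name or "quant_bias" in name:
                param.data = self.pad_qkv_deq_scale_quant_bias(param.data)
        loaded.add(name)
    return loaded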

Comment on lines +16 to +18
import torch.nn.functional as F

from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention

critical

This file is missing several imports, which will lead to NameError exceptions at runtime. The following names are used but not defined: Optional, QuantizationConfig, dist_utils, is_enable_nz, torch_npu, ACL_FORMAT_FRACTAL_ND, and rearrange.

from typing import Optional

import torch
import torch.nn.functional as F
import torch_npu
from einops import rearrange

from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention
from vllm.distributed import utils as dist_utils
from vllm.model_executor.layers.quantization import QuantizationConfig
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_ND, is_enable_nz

Comment on lines +44 to +45
self.hidden_size_per_attention_head = dist_utils.divide(
    projection_size, num_heads)

critical

The attribute self.origin_hidden_size_per_attention_head is used in the forward method, but it's only initialized if self.enable_pad is true. This will cause an AttributeError when padding is disabled. It should be initialized unconditionally.

        self.hidden_size_per_attention_head = dist_utils.divide(
            projection_size, num_heads)
        self.origin_hidden_size_per_attention_head = self.hidden_size_per_attention_head

sin = F.pad(
    sin, (0, self.half_pad_hidden_size_per_attention_head))

if not self.interleaved:

critical

The self.interleaved attribute is used here but it is not initialized in the __init__ method of AscendMMEncoderAttention, which will cause an AttributeError. This property needs to be passed down from AscendQwen2_5_VisionTransformer during the instantiation of this attention layer.
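
A minimal illustration of the fix (class and parameter names below are placeholders, not the PR's real signatures):

# Illustrative pattern only: thread the flag through __init__ so that
# self.interleaved is always defined before forward() runs.
class EncoderAttentionSketch:
    def __init__(self, num_heads: int, interleaved: bool = False) -> None:
        self.num_heads = num_heads
        self.interleaved = interleaved

class VisionTransformerSketch:
    def __init__(self, num_heads: int, interleaved: bool) -> None:
        # The transformer already knows the rotary layout; pass it down.
        self.attn = EncoderAttentionSketch(num_heads, interleaved=interleaved)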

self.hidden_size_per_attention_head)
return cos_new, sin_new

return cos, sin

critical

This return statement is unreachable because the function will always return on line 186. This dead code should be removed.

Comment on lines 198 to 205
if self.enable_pad and not self.finish_pad:
    self.qkv.weight.data = self.pad_qkv_weight(self.qkv.weight.data)
    self.qkv.bias.data = self.pad_qkv_bias(self.qkv.bias.data)
    self.proj.weight.data = self.pad_proj_weight(self.proj.weight.data)
    # TODO(shen-shanshan): optimize this to avoid redundant computation.
    cos, sin = self.pad_cos_sin(cos, sin)
    # TODO(shen-shanshan): add padding for quantization.
    self.finish_pad = True

high

Modifying module parameters like self.qkv.weight.data inside the forward method is not a good practice. It makes the method stateful and not thread-safe, which can lead to race conditions and unpredictable behavior. This padding logic should be executed only once, ideally during weight loading. Consider overriding the load_weights method in this class to perform the padding after the weights have been loaded by the parent class. This would also be the right place to move the quantization padding logic from AscendQwen2_5_VisionTransformer.load_weights.
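
One way to restructure this (a sketch under the same assumptions as the load_weights suggestion above; the helper name _pad_weights_after_loading is hypothetical):

# Hypothetical sketch: run the padding exactly once after the weights are in
# place, so forward() never rewrites parameters.
def _pad_weights_after_loading(self):
    if not self.enable_pad:
        return
    self.qkv.weight.data = self.pad_qkv_weight(self.qkv.weight.data)
    self.qkv.bias.data = self.pad_qkv_bias(self.qkv.bias.data)
    self.proj.weight.data = self.pad_proj_weight(self.proj.weight.data)

This helper would be called once at the end of a load_weights override (see the sketch earlier in this review), and the enable_pad / finish_pad guard and the weight assignments could then be dropped from forward(). The cos/sin padding would still happen in forward (or be cached), since the rotary embeddings are inputs rather than parameters.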
