
Conversation


@LucasWilkinson LucasWilkinson commented Dec 1, 2025

Merge after #29352

Many Mamba backends assume num_decode_reqs == num_decode_tokens when max_query_len == 1, e.g.:

num_decodes = attn_metadata.num_decode_tokens  # token count (== request count when max_query_len == 1)

This PR makes that assumption safe for now.
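For reference, a minimal self-contained sketch of that invariant (illustrative only, not the PR's actual change; the simplified helper below and its assumption that decode requests are sorted to the front of the batch are mine):

import torch

def split_decodes_and_prefills_sketch(query_start_loc: torch.Tensor, decode_threshold: int = 1):
    # query_start_loc: cumulative token offsets per request, shape [num_reqs + 1]
    seq_lens = query_start_loc[1:] - query_start_loc[:-1]
    num_decodes = int((seq_lens <= decode_threshold).sum())  # decodes assumed sorted to the front
    num_decode_tokens = int(seq_lens[:num_decodes].sum())
    if decode_threshold == 1:
        # The invariant the Mamba backends rely on: one token per decode request,
        # so the decode token count equals the decode request count.
        assert num_decode_tokens == num_decodes
    num_prefills = seq_lens.numel() - num_decodes
    num_prefill_tokens = int(seq_lens[num_decodes:].sum())
    return num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens

# Three single-token decodes followed by one 5-token prefill:
print(split_decodes_and_prefills_sketch(torch.tensor([0, 1, 2, 3, 8])))  # (3, 1, 3, 5)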

MatthewBonanni and others added 6 commits November 26, 2025 15:49
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@mergify mergify bot added the v1 label Dec 1, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix a potential out-of-bounds access in Mamba backends by ensuring num_decode_tokens is correctly handled, especially when max_query_len is 1. The changes involve refactoring how padding for CUDA graphs is handled and introducing a workaround in split_decodes_and_prefills. While the core fix in utils.py seems correct, the associated refactoring appears to have introduced critical bugs in mamba1_attn.py and mamba2_attn.py by removing necessary padding for tensors used in CUDA graph capture. This could lead to out-of-bounds memory access and must be addressed.
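
For readers skimming the inline comments below: a minimal, hedged sketch of the persistent-buffer padding pattern they refer to (the buffer size, the example indices, and the PAD_SLOT_ID value here are illustrative assumptions, not vLLM's actual constants):

import torch

PAD_SLOT_ID = -1  # assumption: sentinel value the kernels treat as "skip this slot"

num_decodes = 3        # real decode requests this step
num_decode_tokens = 8  # padded size used for CUDA graph capture/replay

# Persistent buffer reused across steps, so its tail may hold stale indices.
state_indices_buf = torch.full((64,), 42, dtype=torch.int32)
state_indices_buf[:num_decodes] = torch.tensor([10, 11, 12], dtype=torch.int32)

# Slice to the padded size and overwrite the dummy tail. Without the last
# assignment, the stale value 42 would be consumed as a cache index by the
# padded slots and could trigger out-of-bounds accesses in the kernels.
state_indices = state_indices_buf[:num_decode_tokens]
state_indices[num_decodes:] = PAD_SLOT_ID
print(state_indices.tolist())  # [10, 11, 12, -1, -1, -1, -1, -1]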

Comment on lines 134 to 146
                     block_idx_last_scheduled_token, non_blocking=True
                 )
                 block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
-                    :padded_decodes
+                    :num_decode_tokens
                 ]
-                block_idx_last_scheduled_token[num_decodes:] = 0
 
                 self.block_idx_last_computed_token[:num_decodes].copy_(
                     block_idx_last_computed_token, non_blocking=True
                 )
                 block_idx_last_computed_token = self.block_idx_last_computed_token[
-                    :padded_decodes
+                    :num_decode_tokens
                 ]
-                block_idx_last_computed_token[num_decodes:] = 0


critical

The logic to pad block_idx_last_scheduled_token and block_idx_last_computed_token with zeros for CUDA graph dummy requests has been removed. For padded requests (from num_decodes to num_decode_tokens), these tensors will contain garbage values from previous runs. This can lead to out-of-bounds memory access in the Mamba kernels, as these tensors are used for indexing. This is a critical issue that needs to be fixed.

                self.block_idx_last_scheduled_token[:num_decodes].copy_(
                    block_idx_last_scheduled_token, non_blocking=True
                )
                block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
                    :num_decode_tokens
                ]
                block_idx_last_scheduled_token[num_decodes:] = 0

                self.block_idx_last_computed_token[:num_decodes].copy_(
                    block_idx_last_computed_token, non_blocking=True
                )
                block_idx_last_computed_token = self.block_idx_last_computed_token[
                    :num_decode_tokens
                ]
                block_idx_last_computed_token[num_decodes:] = 0

Comment on lines 306 to 324
             self.state_indices_tensor[:num_decodes].copy_(
                 state_indices_tensor, non_blocking=True
             )
-            state_indices_tensor = self.state_indices_tensor[:num_input_tokens]
-            state_indices_tensor[num_decodes:] = PAD_SLOT_ID
+            state_indices_tensor = self.state_indices_tensor[:num_decode_tokens]
 
             if self.vllm_config.cache_config.enable_prefix_caching:
                 self.block_idx_last_scheduled_token[:num_decodes].copy_(
                     block_idx_last_scheduled_token, non_blocking=True
                 )
                 block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
-                    :num_input_tokens
+                    :num_decode_tokens
                 ]
-                block_idx_last_scheduled_token[num_decodes:] = 0
 
                 self.block_idx_last_computed_token[:num_decodes].copy_(
                     block_idx_last_computed_token, non_blocking=True
                 )
                 block_idx_last_computed_token = self.block_idx_last_computed_token[
-                    :num_input_tokens
+                    :num_decode_tokens
                 ]

critical

The padding logic for state_indices_tensor, block_idx_last_scheduled_token, and block_idx_last_computed_token has been removed. This is critical for correct CUDA graph capture.

  • state_indices_tensor must be padded with PAD_SLOT_ID for dummy requests to be skipped by the kernel.
  • block_idx_* tensors must be padded with 0 to avoid using garbage indices.

Without this padding, CUDA graph capture for decode requests will likely cause out-of-bounds memory access. Note that PAD_SLOT_ID will need to be re-imported.

            self.state_indices_tensor[:num_decodes].copy_(
                state_indices_tensor, non_blocking=True
            )
            state_indices_tensor = self.state_indices_tensor[:num_decode_tokens]
            state_indices_tensor[num_decodes:] = PAD_SLOT_ID

            if self.vllm_config.cache_config.enable_prefix_caching:
                self.block_idx_last_scheduled_token[:num_decodes].copy_(
                    block_idx_last_scheduled_token, non_blocking=True
                )
                block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
                    :num_decode_tokens
                ]
                block_idx_last_scheduled_token[num_decodes:] = 0

                self.block_idx_last_computed_token[:num_decodes].copy_(
                    block_idx_last_computed_token, non_blocking=True
                )
                block_idx_last_computed_token = self.block_idx_last_computed_token[
                    :num_decode_tokens
                ]
                block_idx_last_computed_token[num_decodes:] = 0

Comment on lines 12 to 15
 from vllm.v1.attention.backends.utils import (
-    PAD_SLOT_ID,
     CommonAttentionMetadata,
     compute_causal_conv1d_metadata,
     split_decodes_and_prefills,

high

The import for PAD_SLOT_ID has been removed, but it's necessary for padding state_indices_tensor during CUDA graph capture for decode requests. Please re-add this import.

Suggested change
 from vllm.v1.attention.backends.utils import (
+    PAD_SLOT_ID,
     CommonAttentionMetadata,
     compute_causal_conv1d_metadata,
     split_decodes_and_prefills,
 )

@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 2, 2025
@LucasWilkinson LucasWilkinson added this to the v0.12.0 milestone Dec 2, 2025