
Conversation


@LucasWilkinson LucasWilkinson commented Dec 1, 2025

Merge after #29352

Many Mamba backends assume num_decode_reqs == num_decode_tokens when max_query_len == 1, e.g.:

num_decodes = attn_metadata.num_decode_tokens  # token count (== request count when max_query_len == 1)

This PR makes that assumption safe for now.
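For reference, a minimal self-contained sketch of that invariant (illustrative only, not the PR's actual change; the simplified helper below and its assumption that decode requests are sorted to the front of the batch are mine):

import torch

def split_decodes_and_prefills_sketch(query_start_loc: torch.Tensor, decode_threshold: int = 1):
    # query_start_loc: cumulative token offsets per request, shape [num_reqs + 1]
    seq_lens = query_start_loc[1:] - query_start_loc[:-1]
    num_decodes = int((seq_lens <= decode_threshold).sum())  # decodes assumed sorted to the front
    num_decode_tokens = int(seq_lens[:num_decodes].sum())
    if decode_threshold == 1:
        # The invariant the Mamba backends rely on: one token per decode request,
        # so the decode token count equals the decode request count.
        assert num_decode_tokens == num_decodes
    num_prefills = seq_lens.numel() - num_decodes
    num_prefill_tokens = int(seq_lens[num_decodes:].sum())
    return num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens

# Three single-token decodes followed by one 5-token prefill:
print(split_decodes_and_prefills_sketch(torch.tensor([0, 1, 2, 3, 8])))  # (3, 1, 3, 5)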

MatthewBonanni and others added 6 commits November 26, 2025 15:49
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@mergify mergify bot added the v1 label Dec 1, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix a potential out-of-bounds access in Mamba backends by ensuring num_decode_tokens is correctly handled, especially when max_query_len is 1. The changes involve refactoring how padding for CUDA graphs is handled and introducing a workaround in split_decodes_and_prefills. While the core fix in utils.py seems correct, the associated refactoring appears to have introduced critical bugs in mamba1_attn.py and mamba2_attn.py by removing necessary padding for tensors used in CUDA graph capture. This could lead to out-of-bounds memory access and must be addressed.
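
For readers skimming the inline comments below: a minimal, hedged sketch of the persistent-buffer padding pattern they refer to (the buffer size, the example indices, and the PAD_SLOT_ID value here are illustrative assumptions, not vLLM's actual constants):

import torch

PAD_SLOT_ID = -1  # assumption: sentinel value the kernels treat as "skip this slot"

num_decodes = 3        # real decode requests this step
num_decode_tokens = 8  # padded size used for CUDA graph capture/replay

# Persistent buffer reused across steps, so its tail may hold stale indices.
state_indices_buf = torch.full((64,), 42, dtype=torch.int32)
state_indices_buf[:num_decodes] = torch.tensor([10, 11, 12], dtype=torch.int32)

# Slice to the padded size and overwrite the dummy tail. Without the last
# assignment, the stale value 42 would be consumed as a cache index by the
# padded slots and could trigger out-of-bounds accesses in the kernels.
state_indices = state_indices_buf[:num_decode_tokens]
state_indices[num_decodes:] = PAD_SLOT_ID
print(state_indices.tolist())  # [10, 11, 12, -1, -1, -1, -1, -1]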

Comment on lines 134 to 146
                     block_idx_last_scheduled_token, non_blocking=True
                 )
                 block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
-                    :padded_decodes
+                    :num_decode_tokens
                 ]
-                block_idx_last_scheduled_token[num_decodes:] = 0
 
                 self.block_idx_last_computed_token[:num_decodes].copy_(
                     block_idx_last_computed_token, non_blocking=True
                 )
                 block_idx_last_computed_token = self.block_idx_last_computed_token[
-                    :padded_decodes
+                    :num_decode_tokens
                 ]
-                block_idx_last_computed_token[num_decodes:] = 0


critical

The logic to pad block_idx_last_scheduled_token and block_idx_last_computed_token with zeros for CUDA graph dummy requests has been removed. For padded requests (from num_decodes to num_decode_tokens), these tensors will contain garbage values from previous runs. This can lead to out-of-bounds memory access in the Mamba kernels, as these tensors are used for indexing. This is a critical issue that needs to be fixed.

                self.block_idx_last_scheduled_token[:num_decodes].copy_(
                    block_idx_last_scheduled_token, non_blocking=True
                )
                block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
                    :num_decode_tokens
                ]
                block_idx_last_scheduled_token[num_decodes:] = 0

                self.block_idx_last_computed_token[:num_decodes].copy_(
                    block_idx_last_computed_token, non_blocking=True
                )
                block_idx_last_computed_token = self.block_idx_last_computed_token[
                    :num_decode_tokens
                ]
                block_idx_last_computed_token[num_decodes:] = 0

Comment on lines 306 to 324
             self.state_indices_tensor[:num_decodes].copy_(
                 state_indices_tensor, non_blocking=True
             )
-            state_indices_tensor = self.state_indices_tensor[:num_input_tokens]
-            state_indices_tensor[num_decodes:] = PAD_SLOT_ID
+            state_indices_tensor = self.state_indices_tensor[:num_decode_tokens]
 
             if self.vllm_config.cache_config.enable_prefix_caching:
                 self.block_idx_last_scheduled_token[:num_decodes].copy_(
                     block_idx_last_scheduled_token, non_blocking=True
                 )
                 block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
-                    :num_input_tokens
+                    :num_decode_tokens
                 ]
-                block_idx_last_scheduled_token[num_decodes:] = 0
 
                 self.block_idx_last_computed_token[:num_decodes].copy_(
                     block_idx_last_computed_token, non_blocking=True
                 )
                 block_idx_last_computed_token = self.block_idx_last_computed_token[
-                    :num_input_tokens
+                    :num_decode_tokens
                 ]

critical

The padding logic for state_indices_tensor, block_idx_last_scheduled_token, and block_idx_last_computed_token has been removed. This is critical for correct CUDA graph capture.

  • state_indices_tensor must be padded with PAD_SLOT_ID for dummy requests to be skipped by the kernel.
  • block_idx_* tensors must be padded with 0 to avoid using garbage indices.

Without this padding, CUDA graph capture for decode requests will likely cause out-of-bounds memory access. Note that PAD_SLOT_ID will need to be re-imported.

            self.state_indices_tensor[:num_decodes].copy_(
                state_indices_tensor, non_blocking=True
            )
            state_indices_tensor = self.state_indices_tensor[:num_decode_tokens]
            state_indices_tensor[num_decodes:] = PAD_SLOT_ID

            if self.vllm_config.cache_config.enable_prefix_caching:
                self.block_idx_last_scheduled_token[:num_decodes].copy_(
                    block_idx_last_scheduled_token, non_blocking=True
                )
                block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
                    :num_decode_tokens
                ]
                block_idx_last_scheduled_token[num_decodes:] = 0

                self.block_idx_last_computed_token[:num_decodes].copy_(
                    block_idx_last_computed_token, non_blocking=True
                )
                block_idx_last_computed_token = self.block_idx_last_computed_token[
                    :num_decode_tokens
                ]
                block_idx_last_computed_token[num_decodes:] = 0

Comment on lines 12 to 15
 from vllm.v1.attention.backends.utils import (
-    PAD_SLOT_ID,
     CommonAttentionMetadata,
     compute_causal_conv1d_metadata,
     split_decodes_and_prefills,

high

The import for PAD_SLOT_ID has been removed, but it's necessary for padding state_indices_tensor during CUDA graph capture for decode requests. Please re-add this import.

Suggested change
 from vllm.v1.attention.backends.utils import (
+    PAD_SLOT_ID,
     CommonAttentionMetadata,
     compute_causal_conv1d_metadata,
     split_decodes_and_prefills,
 )

@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 2, 2025
@LucasWilkinson LucasWilkinson added this to the v0.12.0 milestone Dec 2, 2025