[BugFix] Fix potential out of bounds blocktable access #29811
Conversation
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Code Review
This pull request aims to fix a potential out-of-bounds access in Mamba backends by ensuring num_decode_tokens is correctly handled, especially when max_query_len is 1. The changes involve refactoring how padding for CUDA graphs is handled and introducing a workaround in split_decodes_and_prefills. While the core fix in utils.py seems correct, the associated refactoring appears to have introduced critical bugs in mamba1_attn.py and mamba2_attn.py by removing necessary padding for tensors used in CUDA graph capture. This could lead to out-of-bounds memory access and must be addressed.
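For context on the split_decodes_and_prefills workaround mentioned above, here is a minimal, runnable sketch of the invariant involved; it is not the PR's actual implementation. When max_query_len == 1 every scheduled request is a decode, so num_decode_tokens must equal the number of decode requests. The helper name, signature, and the decode_threshold parameter below are illustrative assumptions.

# Sketch only: mirrors the decode/prefill split described above, assuming
# requests are ordered decodes-first; all names here are illustrative.
import torch


def split_decodes_and_prefills_sketch(
    query_start_loc: torch.Tensor,  # [num_reqs + 1] cumulative token offsets
    max_query_len: int,
    decode_threshold: int = 1,
) -> tuple[int, int, int, int]:
    """Return (num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens)."""
    num_reqs = query_start_loc.numel() - 1
    num_tokens = int(query_start_loc[-1])

    if max_query_len <= decode_threshold:
        # Fast path: the whole batch is decodes, so num_decode_tokens == num_decodes
        # by construction, which is exactly what the Mamba backends assume.
        return num_reqs, 0, num_tokens, 0

    query_lens = query_start_loc[1:] - query_start_loc[:-1]
    is_decode = (query_lens <= decode_threshold).int()
    # Requests are ordered decodes-first, so the first non-decode ends the run.
    num_decodes = int(is_decode.cumprod(dim=0).sum())
    num_decode_tokens = int(query_start_loc[num_decodes])
    return (
        num_decodes,
        num_reqs - num_decodes,
        num_decode_tokens,
        num_tokens - num_decode_tokens,
    )


if __name__ == "__main__":
    # Two single-token decodes followed by one 4-token prefill -> (2, 1, 2, 4).
    print(split_decodes_and_prefills_sketch(torch.tensor([0, 1, 2, 6]), max_query_len=4))
    # Pure decode batch: the fast path guarantees token count == request count -> (3, 0, 3, 0).
    print(split_decodes_and_prefills_sketch(torch.tensor([0, 1, 2, 3]), max_query_len=1))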
        block_idx_last_scheduled_token, non_blocking=True
    )
    block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
-       :padded_decodes
+       :num_decode_tokens
    ]
-   block_idx_last_scheduled_token[num_decodes:] = 0

    self.block_idx_last_computed_token[:num_decodes].copy_(
        block_idx_last_computed_token, non_blocking=True
    )
    block_idx_last_computed_token = self.block_idx_last_computed_token[
-       :padded_decodes
+       :num_decode_tokens
    ]
-   block_idx_last_computed_token[num_decodes:] = 0
The logic to pad block_idx_last_scheduled_token and block_idx_last_computed_token with zeros for CUDA graph dummy requests has been removed. For padded requests (from num_decodes to num_decode_tokens), these tensors will contain garbage values from previous runs. This can lead to out-of-bounds memory access in the Mamba kernels, as these tensors are used for indexing. This is a critical issue that needs to be fixed.
self.block_idx_last_scheduled_token[:num_decodes].copy_(
    block_idx_last_scheduled_token, non_blocking=True
)
block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
    :num_decode_tokens
]
block_idx_last_scheduled_token[num_decodes:] = 0
self.block_idx_last_computed_token[:num_decodes].copy_(
    block_idx_last_computed_token, non_blocking=True
)
block_idx_last_computed_token = self.block_idx_last_computed_token[
    :num_decode_tokens
]
block_idx_last_computed_token[num_decodes:] = 0

    self.state_indices_tensor[:num_decodes].copy_(
        state_indices_tensor, non_blocking=True
    )
-   state_indices_tensor = self.state_indices_tensor[:num_input_tokens]
-   state_indices_tensor[num_decodes:] = PAD_SLOT_ID
+   state_indices_tensor = self.state_indices_tensor[:num_decode_tokens]

    if self.vllm_config.cache_config.enable_prefix_caching:
        self.block_idx_last_scheduled_token[:num_decodes].copy_(
            block_idx_last_scheduled_token, non_blocking=True
        )
        block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
-           :num_input_tokens
+           :num_decode_tokens
        ]
-       block_idx_last_scheduled_token[num_decodes:] = 0

        self.block_idx_last_computed_token[:num_decodes].copy_(
            block_idx_last_computed_token, non_blocking=True
        )
        block_idx_last_computed_token = self.block_idx_last_computed_token[
-           :num_input_tokens
+           :num_decode_tokens
        ]
The padding logic for state_indices_tensor, block_idx_last_scheduled_token, and block_idx_last_computed_token has been removed. This padding is critical for correct CUDA graph capture:
- state_indices_tensor must be padded with PAD_SLOT_ID so that dummy requests are skipped by the kernel.
- The block_idx_* tensors must be padded with 0 to avoid indexing with garbage values.
Without this padding, CUDA graph capture for decode requests will likely cause out-of-bounds memory access. Note that PAD_SLOT_ID will need to be re-imported.
self.state_indices_tensor[:num_decodes].copy_(
    state_indices_tensor, non_blocking=True
)
state_indices_tensor = self.state_indices_tensor[:num_decode_tokens]
state_indices_tensor[num_decodes:] = PAD_SLOT_ID
if self.vllm_config.cache_config.enable_prefix_caching:
    self.block_idx_last_scheduled_token[:num_decodes].copy_(
        block_idx_last_scheduled_token, non_blocking=True
    )
    block_idx_last_scheduled_token = self.block_idx_last_scheduled_token[
        :num_decode_tokens
    ]
    block_idx_last_scheduled_token[num_decodes:] = 0
    self.block_idx_last_computed_token[:num_decodes].copy_(
        block_idx_last_computed_token, non_blocking=True
    )
    block_idx_last_computed_token = self.block_idx_last_computed_token[
        :num_decode_tokens
    ]
    block_idx_last_computed_token[num_decodes:] = 0

    from vllm.v1.attention.backends.utils import (
-       PAD_SLOT_ID,
        CommonAttentionMetadata,
        compute_causal_conv1d_metadata,
        split_decodes_and_prefills,
The import for PAD_SLOT_ID has been removed, but it's necessary for padding state_indices_tensor during CUDA graph capture for decode requests. Please re-add this import.
from vllm.v1.attention.backends.utils import (
    PAD_SLOT_ID,
    CommonAttentionMetadata,
    compute_causal_conv1d_metadata,
    split_decodes_and_prefills,
)
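To make the failure mode concrete, here is a small, self-contained toy, with assumed names and shapes rather than vLLM kernel code, showing why the padded slots must hold PAD_SLOT_ID instead of whatever was left in the buffer from a previous run:

# Toy illustration only: stale indices in the padded tail can index past the
# end of the block table; PAD_SLOT_ID-marked slots are skipped safely.
import torch

PAD_SLOT_ID = -1  # sentinel convention discussed in the review


def gather_states(block_table: torch.Tensor, state_indices: torch.Tensor) -> torch.Tensor:
    """Gather one row per request slot, skipping padded slots."""
    out = torch.zeros(state_indices.numel(), block_table.shape[1])
    valid = state_indices != PAD_SLOT_ID
    if valid.any():
        idx = state_indices[valid]
        assert int(idx.max()) < block_table.shape[0], "out-of-bounds block index"
        out[valid] = block_table[idx]
    return out


if __name__ == "__main__":
    block_table = torch.arange(12.0).reshape(4, 3)  # 4 blocks of size 3
    # num_decodes = 2 real requests, padded up to num_decode_tokens = 4 slots.
    padded = torch.tensor([1, 3, PAD_SLOT_ID, PAD_SLOT_ID])
    print(gather_states(block_table, padded))  # safe: padded slots stay zero
    stale = torch.tensor([1, 3, 7, 9])  # garbage left in the padded region
    try:
        gather_states(block_table, stale)
    except AssertionError as e:
        print("unsafe:", e)  # 7 and 9 exceed the 4-row block table

In the real backends the same role is played by writing PAD_SLOT_ID and 0 into the tail of the persistent tensors, as the suggestions above do.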
Merge after #29352
Lots of mamba backends assume num_decode_reqs == num_decode_tokens when max_query_len == 1. Make this safe for now.
See vllm/vllm/model_executor/layers/mamba/short_conv.py, line 138 at eaf8148.