
Conversation


@amd-hhashemi amd-hhashemi commented Dec 1, 2025

Purpose

Test Plan

Test Result



@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@mergify mergify bot added the rocm Related to AMD ROCm label Dec 1, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new GEMM kernel, wvSplitKrc, optimized for skinny matrices on ROCm GPUs using a Split-K algorithm with atomic reduction. While the initiative is good, the implementation has several critical issues that must be addressed. There are correctness bugs related to GPU architecture detection and memory access patterns that will lead to incorrect results or prevent the kernel from being used at all. Additionally, there is a significant performance issue due to an inefficient runtime loop inside the kernel. I have provided detailed comments and suggestions to fix these problems.

return out_c;
}

#if defined(__gfx950__) // TODO: Add NAVI support

critical

The preprocessor directive #if defined(__gfx950__) is likely incorrect as gfx950 is not a standard ROCm architecture name. For MI300 series GPUs, the architecture is gfx942. This will prevent the kernel from being compiled for the intended hardware, making this new feature dead code. You should use a correct macro, for example #if defined(__gfx942__) or the existing __HIP__MI3XX__ if it's meant for all MI300 series.

#if defined(__HIP__MI3XX__)  // TODO: Add NAVI support

Comment on lines +1422 to +1434
if (((K + kfitsPerRdc * kFit - 1) / (kfitsPerRdc * kFit)) * numCuWithFullK <=
    CuCount)
  while (true) {
    while (kFit > TUC_) {
      uint32_t kFit_ = kFit - TUC_;
      if (((K + (kfitsPerRdc * kFit_ - 1)) / (kfitsPerRdc * kFit_)) *
              numCuWithFullK > CuCount)
        break;
      kFit = kFit_;
    }
    if (((K + ((kfitsPerRdc - 1) * kFit - 1)) / ((kfitsPerRdc - 1) * kFit)) *
            numCuWithFullK <= CuCount)
      kfitsPerRdc--;
    else
      break;
  }

critical

This while(true) loop calculates the optimal K-split configuration (kFit and kfitsPerRdc) at runtime inside the kernel. This is highly inefficient because it is executed by every thread, leading to significant redundant computation and thread divergence. This logic should be performed once on the host, and the results passed as arguments to the kernel to avoid severe performance degradation.
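The reviewer's suggested refactor can be sketched as plain host-side C++. The names (`kFit`, `kfitsPerRdc`, `numCuWithFullK`, `CuCount`, `TUC_`) follow the quoted kernel and the search strategy mirrors the quoted loop, but this is an illustrative sketch, not the PR's implementation; the extra `kfitsPerRdc > 1` guard is an assumption added here to avoid dividing by zero.

```cpp
#include <cstdint>
#include <utility>

// Illustrative host-side version of the quoted device loop: shrink the
// K-split (kFit, kfitsPerRdc) as far as possible while the resulting CU
// footprint still fits within CuCount. Run once on the host and pass the
// results to the kernel as arguments, instead of recomputing per thread.
std::pair<uint32_t, uint32_t> pickSplitK(uint32_t K, uint32_t kFit,
                                         uint32_t kfitsPerRdc,
                                         uint32_t numCuWithFullK,
                                         uint32_t CuCount, uint32_t TUC_) {
  // CUs needed to cover K at a given (slices, fit) configuration.
  auto cusNeeded = [&](uint32_t fits, uint32_t fit) {
    return ((K + fits * fit - 1) / (fits * fit)) * numCuWithFullK;
  };
  if (cusNeeded(kfitsPerRdc, kFit) <= CuCount) {
    while (true) {
      // Shrink kFit in TUC_ steps while the configuration still fits.
      while (kFit > TUC_ && cusNeeded(kfitsPerRdc, kFit - TUC_) <= CuCount)
        kFit -= TUC_;
      // Then try dropping one reduction slice; stop when it no longer fits.
      if (kfitsPerRdc > 1 && cusNeeded(kfitsPerRdc - 1, kFit) <= CuCount)
        kfitsPerRdc--;
      else
        break;
    }
  }
  return {kFit, kfitsPerRdc};
}
```

The host pays this search cost once per launch configuration, whereas the in-kernel version repeats it in every thread of every workgroup.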

for (uint32_t k2 = 0; k2 < UNRL; k2++) {
  uint32_t k = k_str + k2 * THRDS * A_CHUNK;
  uint32_t k_ = k + threadIdx.x * A_CHUNK;
  const scalar_t* B_ = &B[min__(k_, K - A_CHUNK)];

critical

The calculation of the base address for loading from matrix B is incorrect. min__(k_, K - A_CHUNK) causes threads that should be processing data near the end of the matrix to instead load data from an earlier, incorrect position. This will lead to incorrect matrix multiplication results. The boundary handling should rely on the out-of-bounds checks and zero-padding already present later in the code.

      const scalar_t* B_ = &B[k_];

@amd-hhashemi (PR author) replied:

This is intended. The clamped load performs a junk calculation, but its result is ultimately discarded. This lets the loop be fully unrolled without the compiler getting confused by out-of-bounds-handling if-conditions inside the loop.
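The clamp-and-discard pattern the author describes can be illustrated in plain C++ (hypothetical names; a scalar loop stands in for the kernel's `A_CHUNK`-wide vector load): every lane loads from a clamped, always-valid base address so the inner loop stays branch-free and fully unrollable, and lanes whose true index was out of range simply have their contribution zeroed afterwards.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative clamp-and-discard, mirroring &B[min__(k_, K - A_CHUNK)]:
// out-of-range lanes load valid-but-wrong ("junk") data so the loop body
// has no bounds branches; their partial result is masked out afterwards.
constexpr uint32_t A_CHUNK = 4;

float laneDot(const std::vector<float>& B, const std::vector<float>& A,
              uint32_t k_, uint32_t K) {
  // Clamped base address: never reads past the end of B.
  const float* B_ = &B[std::min(k_, K - A_CHUNK)];
  float acc = 0.f;
  for (uint32_t i = 0; i < A_CHUNK; ++i)  // branch-free, unrollable body
    acc += B_[i] * A[i];
  // Discard the junk contribution from lanes whose real index was OOB.
  return (k_ + A_CHUNK <= K) ? acc : 0.f;
}
```

Under this reading the clamp is not a correctness bug but a deliberate trade: redundant work on a few boundary lanes in exchange for a branch-free, fully unrolled load loop.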

Comment on lines 138 to 143
use_skinny_reduce_counting = (
envs.VLLM_ROCM_USE_SKINNY_GEMM
and on_gfx950()
and x.dtype in [torch.float16, torch.bfloat16]
and (n == 32 and k == 2880 and (m == 640 or m == 128))
)

critical

The function on_gfx950 checks for the "gfx950" architecture, which is incorrect for MI300 series GPUs (which are gfx942). This will cause on_gfx950() to return False, and the new wvSplitKrc kernel will never be executed. This makes the new feature dead code. You should use a correct check for the target hardware. A similar issue exists in csrc/rocm/skinny_gemms.cu where the kernel is guarded by #if defined(__gfx950__).

Comment on lines 1359 to 1363
__device__ inline int min__(int a, int b) {
  int tmp;
  asm("v_min_i32_e32 %0, %2, %3 " : "=v"(tmp) : "0"(tmp), "v"(a), "v"(b));
  return tmp;
}

high

The min__ function is implemented with inline assembly that has an incorrect constraint "0"(tmp). This can lead to miscompilation. It's safer and more maintainable to use the standard min() function, which the compiler will optimize to the v_min_i32_e32 instruction. Given the critical bug in its usage at line 1485, it's best to refactor this helper function.

__device__ inline int min__(int a, int b) {
  return min(a, b);
}

@amd-hhashemi (PR author) replied:

done.

amd-hhashemi and others added 24 commits December 2, 2025 05:28
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Johnny Yang <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
…a building with DCP > 1 (vllm-project#29449)

Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
…m-project#28619)

Signed-off-by: Jinzhen Lin <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Fadi Arafeh <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models nvidia structured-output labels Dec 2, 2025
@mergify mergify bot added v1 tpu Related to Google TPUs tool-calling labels Dec 2, 2025
@mergify mergify bot added the kv-connector label Dec 2, 2025

mergify bot commented Dec 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @amd-hhashemi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 2, 2025
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Dec 2, 2025
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Dec 2, 2025
@amd-hhashemi amd-hhashemi deleted the wvSplitKrc branch December 2, 2025 05:47
