
Conversation


@amd-hhashemi amd-hhashemi commented Dec 1, 2025

Purpose

Test Plan

Test Result



@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@mergify mergify bot added the rocm Related to AMD ROCm label Dec 1, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new GEMM kernel, wvSplitKrc, optimized for skinny matrices on ROCm GPUs using a Split-K algorithm with atomic reduction. While the initiative is good, the implementation has several critical issues that must be addressed. There are correctness bugs related to GPU architecture detection and memory access patterns that will lead to incorrect results or prevent the kernel from being used at all. Additionally, there is a significant performance issue due to an inefficient runtime loop inside the kernel. I have provided detailed comments and suggestions to fix these problems.

return out_c;
}

#if defined(__gfx950__) // TODO: Add NAVI support

critical

The preprocessor directive #if defined(__gfx950__) is likely incorrect as gfx950 is not a standard ROCm architecture name. For MI300 series GPUs, the architecture is gfx942. This will prevent the kernel from being compiled for the intended hardware, making this new feature dead code. You should use a correct macro, for example #if defined(__gfx942__) or the existing __HIP__MI3XX__ if it's meant for all MI300 series.

#if defined(__HIP__MI3XX__)  // TODO: Add NAVI support

Comment on lines +1422 to +1434
if (((K + kfitsPerRdc * kFit - 1) / (kfitsPerRdc * kFit)) * numCuWithFullK <=
    CuCount)
  while (true) {
    while (kFit > TUC_) {
      uint32_t kFit_ = kFit - TUC_;
      if (((K + (kfitsPerRdc * kFit_ - 1)) / (kfitsPerRdc * kFit_)) *
              numCuWithFullK > CuCount)
        break;
      kFit = kFit_;
    }
    if (((K + ((kfitsPerRdc - 1) * kFit - 1)) / ((kfitsPerRdc - 1) * kFit)) *
            numCuWithFullK <= CuCount)
      kfitsPerRdc--;
    else
      break;
  }

critical

This while(true) loop calculates the optimal K-split configuration (kFit and kfitsPerRdc) at runtime inside the kernel. This is highly inefficient because it is executed by every thread, leading to significant redundant computation and thread divergence. This logic should be performed once on the host, and the results passed as arguments to the kernel to avoid severe performance degradation.
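The reviewer's suggested refactor can be sketched as plain host-side C++. The names (`kFit`, `kfitsPerRdc`, `numCuWithFullK`, `CuCount`, `TUC_`) follow the quoted kernel and the search strategy mirrors the quoted loop, but this is an illustrative sketch, not the PR's implementation; the extra `kfitsPerRdc > 1` guard is an assumption added here to avoid dividing by zero.

```cpp
#include <cstdint>
#include <utility>

// Illustrative host-side version of the quoted device loop: shrink the
// K-split (kFit, kfitsPerRdc) as far as possible while the resulting CU
// footprint still fits within CuCount. Run once on the host and pass the
// results to the kernel as arguments, instead of recomputing per thread.
std::pair<uint32_t, uint32_t> pickSplitK(uint32_t K, uint32_t kFit,
                                         uint32_t kfitsPerRdc,
                                         uint32_t numCuWithFullK,
                                         uint32_t CuCount, uint32_t TUC_) {
  // CUs needed to cover K at a given (slices, fit) configuration.
  auto cusNeeded = [&](uint32_t fits, uint32_t fit) {
    return ((K + fits * fit - 1) / (fits * fit)) * numCuWithFullK;
  };
  if (cusNeeded(kfitsPerRdc, kFit) <= CuCount) {
    while (true) {
      // Shrink kFit in TUC_ steps while the configuration still fits.
      while (kFit > TUC_ && cusNeeded(kfitsPerRdc, kFit - TUC_) <= CuCount)
        kFit -= TUC_;
      // Then try dropping one reduction slice; stop when it no longer fits.
      if (kfitsPerRdc > 1 && cusNeeded(kfitsPerRdc - 1, kFit) <= CuCount)
        kfitsPerRdc--;
      else
        break;
    }
  }
  return {kFit, kfitsPerRdc};
}
```

The host pays this search cost once per launch configuration, whereas the in-kernel version repeats it in every thread of every workgroup.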

for (uint32_t k2 = 0; k2 < UNRL; k2++) {
  uint32_t k = k_str + k2 * THRDS * A_CHUNK;
  uint32_t k_ = k + threadIdx.x * A_CHUNK;
  const scalar_t* B_ = &B[min__(k_, K - A_CHUNK)];

critical

The calculation of the base address for loading from matrix B is incorrect. min__(k_, K - A_CHUNK) causes threads that should be processing data near the end of the matrix to instead load data from an earlier, incorrect position. This will lead to incorrect matrix multiplication results. The boundary handling should rely on the out-of-bounds checks and zero-padding already present later in the code.

      const scalar_t* B_ = &B[k_];

@amd-hhashemi (PR author) replied:

This is intended. The clamped load performs a junk calculation, but its result is ultimately discarded. This lets the loop be fully unrolled without the compiler getting confused by out-of-bounds-handling if-conditions inside the loop.
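The clamp-and-discard pattern the author describes can be illustrated in plain C++ (hypothetical names; a scalar loop stands in for the kernel's `A_CHUNK`-wide vector load): every lane loads from a clamped, always-valid base address so the inner loop stays branch-free and fully unrollable, and lanes whose true index was out of range simply have their contribution zeroed afterwards.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative clamp-and-discard, mirroring &B[min__(k_, K - A_CHUNK)]:
// out-of-range lanes load valid-but-wrong ("junk") data so the loop body
// has no bounds branches; their partial result is masked out afterwards.
constexpr uint32_t A_CHUNK = 4;

float laneDot(const std::vector<float>& B, const std::vector<float>& A,
              uint32_t k_, uint32_t K) {
  // Clamped base address: never reads past the end of B.
  const float* B_ = &B[std::min(k_, K - A_CHUNK)];
  float acc = 0.f;
  for (uint32_t i = 0; i < A_CHUNK; ++i)  // branch-free, unrollable body
    acc += B_[i] * A[i];
  // Discard the junk contribution from lanes whose real index was OOB.
  return (k_ + A_CHUNK <= K) ? acc : 0.f;
}
```

Under this reading the clamp is not a correctness bug but a deliberate trade: redundant work on a few boundary lanes in exchange for a branch-free, fully unrolled load loop.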

Comment on lines 138 to 143
use_skinny_reduce_counting = (
envs.VLLM_ROCM_USE_SKINNY_GEMM
and on_gfx950()
and x.dtype in [torch.float16, torch.bfloat16]
and (n == 32 and k == 2880 and (m == 640 or m == 128))
)

critical

The function on_gfx950 checks for the "gfx950" architecture, which is incorrect for MI300 series GPUs (which are gfx942). This will cause on_gfx950() to return False, and the new wvSplitKrc kernel will never be executed. This makes the new feature dead code. You should use a correct check for the target hardware. A similar issue exists in csrc/rocm/skinny_gemms.cu where the kernel is guarded by #if defined(__gfx950__).

Comment on lines 1359 to 1363
__device__ inline int min__(int a, int b) {
  int tmp;
  asm("v_min_i32_e32 %0, %2, %3 " : "=v"(tmp) : "0"(tmp), "v"(a), "v"(b));
  return tmp;
}

high

The min__ function is implemented with inline assembly that has an incorrect constraint "0"(tmp). This can lead to miscompilation. It's safer and more maintainable to use the standard min() function, which the compiler will optimize to the v_min_i32_e32 instruction. Given the critical bug in its usage at line 1485, it's best to refactor this helper function.

__device__ inline int min__(int a, int b) {
  return min(a, b);
}

@amd-hhashemi (PR author) replied:

done.

amd-hhashemi and others added 24 commits December 2, 2025 05:28
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Johnny Yang <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
…a building with DCP > 1 (vllm-project#29449)

Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
…m-project#28619)

Signed-off-by: Jinzhen Lin <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
Signed-off-by: Fadi Arafeh <[email protected]>
Signed-off-by: Hashem Hashemi <[email protected]>
@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models nvidia structured-output labels Dec 2, 2025
@mergify mergify bot added v1 tpu Related to Google TPUs tool-calling labels Dec 2, 2025
@mergify mergify bot added the kv-connector label Dec 2, 2025

mergify bot commented Dec 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @amd-hhashemi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 2, 2025
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Dec 2, 2025
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Dec 2, 2025
@amd-hhashemi amd-hhashemi deleted the wvSplitKrc branch December 2, 2025 05:47
