
Conversation

@leejet
Contributor

leejet commented Oct 23, 2025

This PR fixes an invalid CUDA kernel launch issue for k_compute_batched_ptrs when ne12 or ne13 is large.

  • The previous launch used dim3 block(ne13, ne12), which can exceed the maximum number of threads per block (1024) and trigger cudaErrorInvalidConfiguration.
  • The launch now uses a fixed block size (e.g., 32×32 threads) and computes the required number of blocks in each dimension, so the thread count per block never exceeds the CUDA limit (see the sketch after this list).
  • No changes were made to the kernel logic itself; only the grid/block configuration was adjusted.
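
A minimal, self-contained sketch of the launch-configuration change described above. The kernel below is an illustrative stand-in for k_compute_batched_ptrs; the real ggml-cuda signature, pointer bookkeeping, and element sizes differ, and the bounds check is shown here only because a grid computed this way can over-cover the (ne13, ne12) range.

// Stand-in for k_compute_batched_ptrs: one thread fills one entry of a pointer table.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void compute_batched_ptrs_demo(char * base, char ** ptrs,
                                          int64_t ne12, int64_t ne13,
                                          size_t nb12, size_t nb13) {
    const int64_t i13 = blockIdx.x * blockDim.x + threadIdx.x;
    const int64_t i12 = blockIdx.y * blockDim.y + threadIdx.y;
    if (i13 >= ne13 || i12 >= ne12) {
        return; // the grid may over-cover, so out-of-range threads do nothing
    }
    ptrs[i13*ne12 + i12] = base + i13*nb13 + i12*nb12;
}

int main() {
    const int64_t ne12 = 1536, ne13 = 1;     // batch dims that break the old launch
    const size_t  nb12 = 256, nb13 = ne12*nb12;

    char  * base = nullptr;
    char ** ptrs = nullptr;
    cudaMalloc(&base, ne12*ne13*nb12);
    cudaMalloc(&ptrs, ne12*ne13*sizeof(char *));

    // Old launch: dim3 block(ne13, ne12) packs ne12*ne13 threads into a single block;
    // once ne12*ne13 > 1024 the launch fails with cudaErrorInvalidConfiguration.

    // New launch: fixed 32x32 block, grid sized to cover the ne13 x ne12 range.
    const dim3 block_dims(32, 32);
    const dim3 grid_dims((ne13 + block_dims.x - 1) / block_dims.x,
                         (ne12 + block_dims.y - 1) / block_dims.y);
    compute_batched_ptrs_demo<<<grid_dims, block_dims>>>(base, ptrs, ne12, ne13, nb12, nb13);

    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    cudaFree(ptrs);
    cudaFree(base);
    return 0;
}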

leejet requested a review from slaren as a code owner on October 23, 2025
github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 23, 2025
@jeffbolznv
Collaborator

Same comment for this one. Please add a backend test that hits this case. The bug may exist in other backends, too.

@leejet
Contributor Author

leejet commented Oct 23, 2025

The backend test has been added.

github-actions bot added the testing (Everything test related) label on Oct 23, 2025

// test cases with large batch size
test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 8, 256, {1024, 2}, {1, 1}));
test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 8, 256, {4096, 1}, {1, 1}));
Member

Wouldn't a batch size of 1024 be enough to test this?

Contributor Author

1024 happens to be the maximum number of threads per block, so it won’t trigger the issue. I’ve now changed the batch size to 1536, which is smaller than before.
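
For reference, the updated test case presumably ends up along these lines (1536 exceeds the 1024 threads-per-block limit while keeping the tensor small); the exact shape and placement of the final line in the test file may differ:

test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 8, 256, {1536, 1}, {1, 1}));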

@am17an
Collaborator

am17an commented Oct 26, 2025

@JohannesGaessler merge?

JohannesGaessler merged commit bbac6a2 into ggml-org:master on Oct 26, 2025
71 of 72 checks passed
