
vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices #15524

Merged · 4 commits merged into master from 0cc4m/vulkan-mmid-subgroup, Aug 24, 2025

Conversation

0cc4m (Collaborator) commented Aug 23, 2025

This applies the optimization @jeffbolznv contributed in #15427 to non-coopmat GPUs, as long as they support subgroup ballot operations. It closes the prompt-processing gap to ROCm on AMD GPUs for Mixture-of-Experts models.
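For reference, the host-side gate is just the core Vulkan 1.1 subgroup property query. A minimal sketch of such a check (illustrative helper name, not the actual ggml-vulkan code):

```cpp
#include <vulkan/vulkan.h>

// Sketch: query whether a physical device supports subgroup ballot in compute
// shaders, which is what the optimized MUL_MAT_ID path relies on.
static bool device_supports_subgroup_ballot(VkPhysicalDevice phys_dev) {
    VkPhysicalDeviceSubgroupProperties subgroup_props {};
    subgroup_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroup_props;

    vkGetPhysicalDeviceProperties2(phys_dev, &props2);

    return (subgroup_props.supportedOperations & VK_SUBGROUP_FEATURE_BALLOT_BIT) != 0 &&
           (subgroup_props.supportedStages & VK_SHADER_STAGE_COMPUTE_BIT) != 0;
}
```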

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 45.16 ± 0.06 | 342.07 ± 3.44 | +657.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 44.83 ± 0.02 | 317.92 ± 3.07 | +609.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 191.17 ± 0.68 | 520.07 ± 5.21 | +172.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 190.88 ± 0.65 | 508.50 ± 5.39 | +166.4% |

ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 137.88 ± 0.09 | 781.56 ± 4.59 | +466.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 136.86 ± 0.08 | 758.64 ± 3.46 | +454.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 584.38 ± 1.23 | 1181.19 ± 6.38 | +102.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 590.68 ± 2.70 | 1200.36 ± 13.60 | +103.2% |

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 35.25 ± 0.04 | 127.06 ± 0.70 | +260.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 30.81 ± 0.04 | 82.30 ± 0.30 | +167.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 115.28 ± 0.74 | 177.98 ± 1.80 | +54.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 114.72 ± 0.49 | 175.38 ± 1.30 | +52.9% |

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 458.25 ± 2.08 | 960.08 ± 3.09 | +109.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 453.05 ± 1.17 | 949.62 ± 7.72 | +109.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 914.31 ± 5.93 | 1224.32 ± 9.70 | +33.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 914.47 ± 8.57 | 1217.99 ± 7.75 | +33.2% |

(Nvidia results are without coopmat or coopmat2; those code paths already have the optimization.)

@netrunnereve This should give your gpt-oss benchmarks a good boost.

0cc4m requested a review from jeffbolznv, Aug 23, 2025 11:55
github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels, Aug 23, 2025
0cc4m changed the title from "0cc4m/vulkan mmid subgroup" to "vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices", Aug 23, 2025
0cc4m (Collaborator, Author) commented Aug 23, 2025

I see it's not working with llvmpipe currently; I'll fix it.

jeffbolznv (Collaborator) left a comment


Thanks for doing this. As MoE models become more popular, it makes sense to make this optimization available as broadly as possible.

@@ -2454,32 +2470,34 @@ static void ggml_vk_load_shaders(vk_device& device) {
CREATE_MM2(pipeline_dequant_mul_mat_mat_f16[GGML_TYPE_IQ4_NL], matmul_iq4_nl_f16, mmq_wg_denoms, warptile_mmq, vk_mat_mat_push_constants, 3)
CREATE_MM2(pipeline_dequant_mul_mat_mat_f16[GGML_TYPE_MXFP4], matmul_mxfp4_f16, mmq_wg_denoms, warptile_mmq, vk_mat_mat_push_constants, 3)

CREATE_MM2(pipeline_matmul_id_f16, matmul_id_f16, wg_denoms, warptile, vk_mat_mat_id_push_constants, 4)
assert(device->subgroup_ballot);
Collaborator:
GGML_ASSERT? (and another one below)

Collaborator Author:
Done

CREATE_MM2(GGML_TYPE_IQ4_XS, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ4_XS], matmul_id_iq4_xs_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
CREATE_MM2(GGML_TYPE_IQ4_NL, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ4_NL], matmul_id_iq4_nl_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
CREATE_MM2(GGML_TYPE_MXFP4, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_MXFP4], matmul_id_mxfp4_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
if (device->subgroup_ballot) {
Collaborator:
Should this also check whether subgroup_size_control is supported?

I bet we could assume that subgroupBallot is supported everywhere, but I think subgroup_size_control is less broadly supported.

Collaborator Author:
Yes, but IIRC this value does nothing if size control is unsupported; that check exists in the pipeline creation function.

Collaborator:
But will the shader code work correctly if subgroup_require_full_support isn't supported? There's a shared memory array sized by NUM_WARPS which assumes it can compute the number of subgroups.

Collaborator Author:
I think you're right. I added the check, and also a size control check for the reduced Intel subgroup size. I also noticed an issue where subgroup_require_full_support was only active if subgroup_size_control is supported; on AMD GCN this meant it wasn't getting set, even though it is supported.
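For anyone following along, the host side of "require full subgroups" looks roughly like this with the standard VK_EXT_subgroup_size_control structures (a sketch with illustrative helper and parameter names, not the exact ggml-vulkan pipeline creation code):

```cpp
#include <vulkan/vulkan.h>

// Sketch: fill a compute stage so that the subgroup size is pinned and full
// subgroups are required. The shader's NUM_WARPS-sized shared memory array is
// only valid when these are honoured, hence the capability checks.
static void fill_compute_stage(VkPipelineShaderStageCreateInfo & stage,
                               VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT & required,
                               VkShaderModule module,
                               uint32_t subgroup_size,
                               bool size_control_supported,
                               bool full_subgroups_supported) {
    stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = module;
    stage.pName  = "main";

    if (size_control_supported) {
        // Pin the subgroup size the warptile was tuned for.
        required = {};
        required.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
        required.requiredSubgroupSize = subgroup_size;
        stage.pNext = &required;
    }
    if (full_subgroups_supported) {
        // Guarantees the workgroup consists of full subgroups, so the shader can
        // derive the number of subgroups from workgroup size / subgroup size.
        stage.flags |= VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT;
    }
}
```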

@@ -10092,12 +10166,9 @@ static bool ggml_vk_build_graph(ggml_backend_vk_context * ctx, ggml_cgraph * cgr
}
}
if (need_sync) {
VK_LOG_DEBUG("node_idx=" << i << " sync");
Collaborator:
Was it intentional to remove these?

Collaborator Author:
Yes, there is no i outside the loop anymore; I assume you moved the check.

3Simplex commented Aug 23, 2025

AMD RX6900xt 16gb, Adrenalin Edition 25.8.1, Win11


Something to do with the context window slows this model down:

Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M.gguf -c 16384

prompt eval time =   16593.05 ms /  8041 tokens (    2.06 ms per token,   484.60 tokens per second)
       eval time =    7268.26 ms /   421 tokens (   17.26 ms per token,    57.92 tokens per second)
      total time =   23861.30 ms /  8462 tokens

Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M.gguf -c 30720

prompt eval time =   41910.78 ms /  4024 tokens (   10.42 ms per token,    96.01 tokens per second)
       eval time =    3173.70 ms /   139 tokens (   22.83 ms per token,    43.80 tokens per second)
      total time =   45084.48 ms /  4163 tokens

Qwen3-Coder-30B-A3B-Instruct-Q3_K_S.gguf

prompt eval time =   41879.03 ms /  4024 tokens (   10.41 ms per token,    96.09 tokens per second)
       eval time =    7980.29 ms /   177 tokens (   45.09 ms per token,    22.18 tokens per second)
      total time =   49859.32 ms /  4201 tokens

Qwen3-Coder-30B-A3B-Instruct-UD-Q3_K_XL.gguf

prompt eval time =   43871.21 ms /  4076 tokens (   10.76 ms per token,    92.91 tokens per second)
       eval time =    7339.46 ms /   132 tokens (   55.60 ms per token,    17.98 tokens per second)
      total time =   51210.67 ms /  4208 tokens

Qwen3-Coder-30B-A3B-Instruct.i1-IQ3_M.gguf

prompt eval time =   44394.01 ms /  4076 tokens (   10.89 ms per token,    91.81 tokens per second)
       eval time =   11684.67 ms /   254 tokens (   46.00 ms per token,    21.74 tokens per second)
      total time =   56078.68 ms /  4330 tokens

Qwen3-Coder-30B-A3B-Instruct-Q2_K_L.gguf

prompt eval time =   16956.96 ms /  8339 tokens (    2.03 ms per token,   491.77 tokens per second)
       eval time =    3096.21 ms /   182 tokens (   17.01 ms per token,    58.78 tokens per second)
      total time =   20053.17 ms /  8521 tokens

netrunnereve (Collaborator) commented:

> @netrunnereve This should give your gpt-oss benchmarks a good boost.

Thanks! I saw #15427 and thought that this was possible but never really dug into it.

On my W8100 + RX 470:

PR:

| model | size | params | backend | ngl | threads | n_batch | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 2048 | 1 | pp512 | 153.60 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 2048 | 1 | tg128 | 35.14 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 128 | 1 | pp512 | 90.23 ± 0.09 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 128 | 1 | tg128 | 35.13 ± 0.20 |

Master:

| model | size | params | backend | ngl | threads | n_batch | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 2048 | 1 | pp512 | 41.74 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 2048 | 1 | tg128 | 35.38 ± 0.16 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 128 | 1 | pp512 | 69.32 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 128 | 1 | tg128 | 35.42 ± 0.21 |

The prompt processing speeds are now pretty close to what I get with Llama 8B, and as a bonus it's also running well with the default batch size. I wish Google or Mistral would make an MoE in this 20-30B range.

@3Simplex how does the IQ2-M look on master with the same context sizes, and does the speed fall gradually or abruptly when you try different sizes between 16k and 32k?

m_warptile_id = { 128, 64, 64, 16, mul_mat_subgroup_size_16, 32, 2, tm_m, tn_m, tk_m, mul_mat_subgroup_size_16 };
s_warptile_id = { mul_mat_subgroup_size_16, 32, 32, 16, 32, 32, 2, tm_s, tn_s, tk_s, mul_mat_subgroup_size_16 };

l_warptile_mmqid = { 128, 128, 128, 32, mul_mat_subgroup_size_8 * 2, 64, 2, tm_l, tn_l, tk_l, mul_mat_subgroup_size_8 };
Collaborator:
I think you need to add these to the shared memory check in:

// Disable mul_mat_id if not enough shared memory is available
if (!ggml_vk_matmul_shmem_support(device, s_warptile_mmq, true, t)) {
    device->mul_mat_id_s[i] = false;
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, m_warptile_mmq, true, t)) {
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, l_warptile_mmq, true, t)) {
    device->mul_mat_id_l[i] = false;
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, thanks.
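For illustration, the extended check would look roughly like this, assuming the new MUL_MAT_ID tiles are named s_/m_/l_warptile_mmqid in line with the l_warptile_mmqid shown above (a sketch, not the exact merged code):

```cpp
// Sketch: also disable mul_mat_id pipeline sizes whose mmqid warptiles do not
// fit into the device's shared memory, mirroring the existing mmq check.
if (!ggml_vk_matmul_shmem_support(device, s_warptile_mmqid, true, t)) {
    device->mul_mat_id_s[i] = false;
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, m_warptile_mmqid, true, t)) {
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, l_warptile_mmqid, true, t)) {
    device->mul_mat_id_l[i] = false;
}
```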

CREATE_MM(GGML_TYPE_Q4_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q4_K].f32acc, matmul_id_q4_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_Q5_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q5_K].f32acc, matmul_id_q5_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_Q6_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q6_K].f32acc, matmul_id_q6_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_IQ1_S, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ1_S].f32acc, matmul_id_iq1_s_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
Collaborator:
Did you mean to use warptile_mmqid?

Collaborator Author:
Yes, thanks.

3Simplex commented:

> @3Simplex how does the IQ2-M look on master with the same context sizes, and does the speed fall gradually or abruptly when you try different sizes between 16k and 32k?

Same settings on the master branch, IQ2-M:

.\llama-server --model "C:\Users\Clayton\Llama.Cpp-Toolbox\Converted\Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M.gguf" --port 8083 --api-key KEY --alias Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M --jinja -c 17408 -ngl 99 -t 8
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 3997
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.512384
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 3997, n_tokens = 1949, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 3997, n_tokens = 1949
slot      release: id  0 | task 0 | stop processing: n_past = 4166, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =   36822.57 ms /  3997 tokens (    9.21 ms per token,   108.55 tokens per second)
       eval time =    1973.79 ms /   170 tokens (   11.61 ms per token,    86.13 tokens per second)
      total time =   38796.36 ms /  4167 tokens
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 3997
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.512384
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 3997, n_tokens = 1949, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 3997, n_tokens = 1949
slot      release: id  0 | task 0 | stop processing: n_past = 4143, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =   36000.94 ms /  3997 tokens (    9.01 ms per token,   111.02 tokens per second)
       eval time =    1715.73 ms /   147 tokens (   11.67 ms per token,    85.68 tokens per second)
      total time =   37716.67 ms /  4144 tokens
srv  update_slots: all slots are idle

Everything over 16384 is similar until I max out.

srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 17408, n_keep = 0, n_prompt_tokens = 3997
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.512384
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 3997, n_tokens = 1949, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 3997, n_tokens = 1949
slot      release: id  0 | task 0 | stop processing: n_past = 4143, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =   68772.96 ms /  3997 tokens (   17.21 ms per token,    58.12 tokens per second)
       eval time =    3011.76 ms /   147 tokens (   20.49 ms per token,    48.81 tokens per second)
      total time =   71784.73 ms /  4144 tokens
srv  update_slots: all slots are idle

This is the max I can run with.

srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 30720, n_keep = 0, n_prompt_tokens = 3997
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.512384
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 3997, n_tokens = 1949, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 3997, n_tokens = 1949
slot      release: id  0 | task 0 | stop processing: n_past = 4131, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =   68998.54 ms /  3997 tokens (   17.26 ms per token,    57.93 tokens per second)
       eval time =    4044.20 ms /   135 tokens (   29.96 ms per token,    33.38 tokens per second)
      total time =   73042.74 ms /  4132 tokens
srv  update_slots: all slots are idle

JohannesGaessler (Collaborator) commented:
@0cc4m just FYI, since I assumed you made these tables manually: take a look at scripts/compare-commits.sh. The script internally calls llama-bench and scripts/compare-commits.py to create a performance comparison table automatically.

0cc4m (Collaborator, Author) commented Aug 24, 2025

> @0cc4m just FYI, since I assumed you made these tables manually: take a look at scripts/compare-commits.sh. The script internally calls llama-bench and scripts/compare-commits.py to create a performance comparison table automatically.

Thank you, but it wasn't manual; I had Gemini write a script that takes two llama-bench result markdown tables and merges them into the shape I posted above.

I have pretty long compile times and often need comparisons during development (due to a lack of good profiling tools for Vulkan), which is why I keep multiple build directories around and generate the data manually.

netrunnereve (Collaborator) commented:
@3Simplex Since this is happening on master as well, you should create a new issue, as it's not really relevant to this PR. FYI, in the future please use llama-bench to get your results.

Your issue might be due to VRAM; you can try monitoring it while you run with different context sizes.

0cc4m merged commit 043fb27 into master, Aug 24, 2025 (48 checks passed)
0cc4m deleted the 0cc4m/vulkan-mmid-subgroup branch, August 24, 2025 17:36
3Simplex commented:

> @0cc4m just FYI, since I assumed you made these tables manually: take a look at scripts/compare-commits.sh. The script internally calls llama-bench and scripts/compare-commits.py to create a performance comparison table automatically.

This is exactly the kind of thing I needed to know.

> @3Simplex Since this is happening on master as well, you should create a new issue, as it's not really relevant to this PR. FYI, in the future please use llama-bench to get your results.
>
> Your issue might be due to VRAM; you can try monitoring it while you run with different context sizes.

Yeah, that's what made me laugh above. I've been meaning to learn about other functionality like the scripts and examples, but haven't had the time to read them. I was thinking that this would have made things so much easier if I had known about it earlier.

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 25, 2025
…ggml-org#15524)

* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16