
vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices #15524

Merged · 4 commits merged into master from 0cc4m/vulkan-mmid-subgroup, Aug 24, 2025

Conversation

0cc4m (Collaborator) commented Aug 23, 2025

This applies the optimization @jeffbolznv contributed in #15427 to non-coopmat GPUs, as long as they support subgroup ballot operations. It closes the prompt-processing gap to ROCm on AMD GPUs for Mixture-of-Experts models.
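For reference, the host-side gate is just the core Vulkan 1.1 subgroup property query. A minimal sketch of such a check (illustrative helper name, not the actual ggml-vulkan code):

```cpp
#include <vulkan/vulkan.h>

// Sketch: query whether a physical device supports subgroup ballot in compute
// shaders, which is what the optimized MUL_MAT_ID path relies on.
static bool device_supports_subgroup_ballot(VkPhysicalDevice phys_dev) {
    VkPhysicalDeviceSubgroupProperties subgroup_props {};
    subgroup_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroup_props;

    vkGetPhysicalDeviceProperties2(phys_dev, &props2);

    return (subgroup_props.supportedOperations & VK_SUBGROUP_FEATURE_BALLOT_BIT) != 0 &&
           (subgroup_props.supportedStages & VK_SHADER_STAGE_COMPUTE_BIT) != 0;
}
```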

ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 45.16 ± 0.06 | 342.07 ± 3.44 | +657.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 44.83 ± 0.02 | 317.92 ± 3.07 | +609.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 191.17 ± 0.68 | 520.07 ± 5.21 | +172.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 190.88 ± 0.65 | 508.50 ± 5.39 | +166.4% |

ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 137.88 ± 0.09 | 781.56 ± 4.59 | +466.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 136.86 ± 0.08 | 758.64 ± 3.46 | +454.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 584.38 ± 1.23 | 1181.19 ± 6.38 | +102.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 590.68 ± 2.70 | 1200.36 ± 13.60 | +103.2% |

ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 35.25 ± 0.04 | 127.06 ± 0.70 | +260.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 30.81 ± 0.04 | 82.30 ± 0.30 | +167.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 115.28 ± 0.74 | 177.98 ± 1.80 | +54.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 114.72 ± 0.49 | 175.38 ± 1.30 | +52.9% |

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 458.25 ± 2.08 | 960.08 ± 3.09 | +109.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 453.05 ± 1.17 | 949.62 ± 7.72 | +109.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 914.31 ± 5.93 | 1224.32 ± 9.70 | +33.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 914.47 ± 8.57 | 1217.99 ± 7.75 | +33.2% |

(Nvidia results are without coopmat or coopmat2; those code paths already have the optimization.)

@netrunnereve This should give your gpt-oss benchmarks a good boost.

0cc4m requested a review from jeffbolznv, Aug 23, 2025 11:55
github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels, Aug 23, 2025
0cc4m changed the title from "0cc4m/vulkan mmid subgroup" to "vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices", Aug 23, 2025
0cc4m (Collaborator, Author) commented Aug 23, 2025

I see it's not working with llvmpipe currently; I'll fix it.

jeffbolznv (Collaborator) left a comment


Thanks for doing this. As MoE models become more popular, it makes sense to make this optimization available as broadly as possible.

@@ -2454,32 +2470,34 @@ static void ggml_vk_load_shaders(vk_device& device) {
CREATE_MM2(pipeline_dequant_mul_mat_mat_f16[GGML_TYPE_IQ4_NL], matmul_iq4_nl_f16, mmq_wg_denoms, warptile_mmq, vk_mat_mat_push_constants, 3)
CREATE_MM2(pipeline_dequant_mul_mat_mat_f16[GGML_TYPE_MXFP4], matmul_mxfp4_f16, mmq_wg_denoms, warptile_mmq, vk_mat_mat_push_constants, 3)

CREATE_MM2(pipeline_matmul_id_f16, matmul_id_f16, wg_denoms, warptile, vk_mat_mat_id_push_constants, 4)
assert(device->subgroup_ballot);
Collaborator:
GGML_ASSERT? (and another one below)

Collaborator Author:
Done

CREATE_MM2(GGML_TYPE_IQ4_XS, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ4_XS], matmul_id_iq4_xs_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
CREATE_MM2(GGML_TYPE_IQ4_NL, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ4_NL], matmul_id_iq4_nl_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
CREATE_MM2(GGML_TYPE_MXFP4, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_MXFP4], matmul_id_mxfp4_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
if (device->subgroup_ballot) {
Collaborator:
Should this also check whether subgroup_size_control is supported?

I bet we could assume that subgroupBallot is supported everywhere, but I think subgroup_size_control is less broadly supported.

Collaborator Author:
Yes, but IIRC this value does nothing if size control is unsupported; that check exists in the pipeline creation function.

Collaborator:
But will the shader code work correctly if subgroup_require_full_support isn't supported? There's a shared memory array sized by NUM_WARPS which assumes it can compute the number of subgroups.

Collaborator Author:
I think you're right. I added the check, and also a size control check for the reduced Intel subgroup size. I also noticed an issue where subgroup_require_full_support was only active if subgroup_size_control is supported; on AMD GCN this meant it wasn't getting set, even though it is supported.
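For anyone following along, the host side of "require full subgroups" looks roughly like this with the standard VK_EXT_subgroup_size_control structures (a sketch with illustrative helper and parameter names, not the exact ggml-vulkan pipeline creation code):

```cpp
#include <vulkan/vulkan.h>

// Sketch: fill a compute stage so that the subgroup size is pinned and full
// subgroups are required. The shader's NUM_WARPS-sized shared memory array is
// only valid when these are honoured, hence the capability checks.
static void fill_compute_stage(VkPipelineShaderStageCreateInfo & stage,
                               VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT & required,
                               VkShaderModule module,
                               uint32_t subgroup_size,
                               bool size_control_supported,
                               bool full_subgroups_supported) {
    stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = module;
    stage.pName  = "main";

    if (size_control_supported) {
        // Pin the subgroup size the warptile was tuned for.
        required = {};
        required.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
        required.requiredSubgroupSize = subgroup_size;
        stage.pNext = &required;
    }
    if (full_subgroups_supported) {
        // Guarantees the workgroup consists of full subgroups, so the shader can
        // derive the number of subgroups from workgroup size / subgroup size.
        stage.flags |= VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT;
    }
}
```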

@@ -10092,12 +10166,9 @@ static bool ggml_vk_build_graph(ggml_backend_vk_context * ctx, ggml_cgraph * cgr
}
}
if (need_sync) {
VK_LOG_DEBUG("node_idx=" << i << " sync");
Collaborator:
Was it intentional to remove these?

Collaborator Author:
Yes, there is no i outside the loop anymore; I assume you moved the check.

3Simplex commented Aug 23, 2025

AMD RX6900xt 16gb, Adrenalin Edition 25.8.1, Win11


Something to do with the context window slows this model down:

Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M.gguf -c 16384

prompt eval time =   16593.05 ms /  8041 tokens (    2.06 ms per token,   484.60 tokens per second)
       eval time =    7268.26 ms /   421 tokens (   17.26 ms per token,    57.92 tokens per second)
      total time =   23861.30 ms /  8462 tokens

Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M.gguf -c 30720

prompt eval time =   41910.78 ms /  4024 tokens (   10.42 ms per token,    96.01 tokens per second)
       eval time =    3173.70 ms /   139 tokens (   22.83 ms per token,    43.80 tokens per second)
      total time =   45084.48 ms /  4163 tokens

Qwen3-Coder-30B-A3B-Instruct-Q3_K_S.gguf

prompt eval time =   41879.03 ms /  4024 tokens (   10.41 ms per token,    96.09 tokens per second)
       eval time =    7980.29 ms /   177 tokens (   45.09 ms per token,    22.18 tokens per second)
      total time =   49859.32 ms /  4201 tokens

Qwen3-Coder-30B-A3B-Instruct-UD-Q3_K_XL.gguf

prompt eval time =   43871.21 ms /  4076 tokens (   10.76 ms per token,    92.91 tokens per second)
       eval time =    7339.46 ms /   132 tokens (   55.60 ms per token,    17.98 tokens per second)
      total time =   51210.67 ms /  4208 tokens

Qwen3-Coder-30B-A3B-Instruct.i1-IQ3_M.gguf

prompt eval time =   44394.01 ms /  4076 tokens (   10.89 ms per token,    91.81 tokens per second)
       eval time =   11684.67 ms /   254 tokens (   46.00 ms per token,    21.74 tokens per second)
      total time =   56078.68 ms /  4330 tokens

Qwen3-Coder-30B-A3B-Instruct-Q2_K_L.gguf

prompt eval time =   16956.96 ms /  8339 tokens (    2.03 ms per token,   491.77 tokens per second)
       eval time =    3096.21 ms /   182 tokens (   17.01 ms per token,    58.78 tokens per second)
      total time =   20053.17 ms /  8521 tokens

netrunnereve (Collaborator) commented:

> @netrunnereve This should give your gpt-oss benchmarks a good boost.

Thanks! I saw #15427 and thought that this was possible but never really dug into it.

On my W8100 + RX 470:

PR:

| model | size | params | backend | ngl | threads | n_batch | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 2048 | 1 | pp512 | 153.60 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 2048 | 1 | tg128 | 35.14 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 128 | 1 | pp512 | 90.23 ± 0.09 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 128 | 1 | tg128 | 35.13 ± 0.20 |

Master:

| model | size | params | backend | ngl | threads | n_batch | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 2048 | 1 | pp512 | 41.74 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 2048 | 1 | tg128 | 35.38 ± 0.16 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 128 | 1 | pp512 | 69.32 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 128 | 1 | tg128 | 35.42 ± 0.21 |

The prompt processing speeds are now pretty close to what I get with Llama 8B, and as a bonus it's also running well with the default batch size. I wish Google or Mistral would make an MoE in this 20-30B range.

@3Simplex how does the IQ2-M look on master with the same context sizes, and does the speed fall gradually or abruptly when you try different sizes between 16k and 32k?

m_warptile_id = { 128, 64, 64, 16, mul_mat_subgroup_size_16, 32, 2, tm_m, tn_m, tk_m, mul_mat_subgroup_size_16 };
s_warptile_id = { mul_mat_subgroup_size_16, 32, 32, 16, 32, 32, 2, tm_s, tn_s, tk_s, mul_mat_subgroup_size_16 };

l_warptile_mmqid = { 128, 128, 128, 32, mul_mat_subgroup_size_8 * 2, 64, 2, tm_l, tn_l, tk_l, mul_mat_subgroup_size_8 };
Collaborator:
I think you need to add these to the shared memory check in:

// Disable mul_mat_id if not enough shared memory is available
if (!ggml_vk_matmul_shmem_support(device, s_warptile_mmq, true, t)) {
    device->mul_mat_id_s[i] = false;
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, m_warptile_mmq, true, t)) {
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, l_warptile_mmq, true, t)) {
    device->mul_mat_id_l[i] = false;
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, thanks.
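For illustration, the extended check would look roughly like this, assuming the new MUL_MAT_ID tiles are named s_/m_/l_warptile_mmqid in line with the l_warptile_mmqid shown above (a sketch, not the exact merged code):

```cpp
// Sketch: also disable mul_mat_id pipeline sizes whose mmqid warptiles do not
// fit into the device's shared memory, mirroring the existing mmq check.
if (!ggml_vk_matmul_shmem_support(device, s_warptile_mmqid, true, t)) {
    device->mul_mat_id_s[i] = false;
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, m_warptile_mmqid, true, t)) {
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, l_warptile_mmqid, true, t)) {
    device->mul_mat_id_l[i] = false;
}
```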

CREATE_MM(GGML_TYPE_Q4_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q4_K].f32acc, matmul_id_q4_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_Q5_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q5_K].f32acc, matmul_id_q5_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_Q6_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q6_K].f32acc, matmul_id_q6_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_IQ1_S, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ1_S].f32acc, matmul_id_iq1_s_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
Collaborator:
Did you mean to use warptile_mmqid?

Collaborator Author:
Yes, thanks.

3Simplex commented:

> @3Simplex how does the IQ2-M look on master with the same context sizes, and does the speed fall gradually or abruptly when you try different sizes between 16k and 32k?

Same settings on the master branch, IQ2-M:

.\llama-server --model "C:\Users\Clayton\Llama.Cpp-Toolbox\Converted\Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M.gguf" --port 8083 --api-key KEY --alias Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M --jinja -c 17408 -ngl 99 -t 8
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 3997
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.512384
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 3997, n_tokens = 1949, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 3997, n_tokens = 1949
slot      release: id  0 | task 0 | stop processing: n_past = 4166, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =   36822.57 ms /  3997 tokens (    9.21 ms per token,   108.55 tokens per second)
       eval time =    1973.79 ms /   170 tokens (   11.61 ms per token,    86.13 tokens per second)
      total time =   38796.36 ms /  4167 tokens
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 3997
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.512384
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 3997, n_tokens = 1949, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 3997, n_tokens = 1949
slot      release: id  0 | task 0 | stop processing: n_past = 4143, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =   36000.94 ms /  3997 tokens (    9.01 ms per token,   111.02 tokens per second)
       eval time =    1715.73 ms /   147 tokens (   11.67 ms per token,    85.68 tokens per second)
      total time =   37716.67 ms /  4144 tokens
srv  update_slots: all slots are idle

Everything over 16384 is similar until I max out.

srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 17408, n_keep = 0, n_prompt_tokens = 3997
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.512384
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 3997, n_tokens = 1949, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 3997, n_tokens = 1949
slot      release: id  0 | task 0 | stop processing: n_past = 4143, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =   68772.96 ms /  3997 tokens (   17.21 ms per token,    58.12 tokens per second)
       eval time =    3011.76 ms /   147 tokens (   20.49 ms per token,    48.81 tokens per second)
      total time =   71784.73 ms /  4144 tokens
srv  update_slots: all slots are idle

This is the max I can run with.

srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 30720, n_keep = 0, n_prompt_tokens = 3997
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.512384
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 3997, n_tokens = 1949, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 3997, n_tokens = 1949
slot      release: id  0 | task 0 | stop processing: n_past = 4131, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =   68998.54 ms /  3997 tokens (   17.26 ms per token,    57.93 tokens per second)
       eval time =    4044.20 ms /   135 tokens (   29.96 ms per token,    33.38 tokens per second)
      total time =   73042.74 ms /  4132 tokens
srv  update_slots: all slots are idle

JohannesGaessler (Collaborator) commented:
@0cc4m just FYI, since I assumed you made these tables manually: take a look at scripts/compare-commits.sh. The script internally calls llama-bench and scripts/compare-commits.py to create a performance comparison table automatically.

0cc4m (Collaborator, Author) commented Aug 24, 2025

> @0cc4m just FYI, since I assumed you made these tables manually: take a look at scripts/compare-commits.sh. The script internally calls llama-bench and scripts/compare-commits.py to create a performance comparison table automatically.

Thank you, but it wasn't manual; I had Gemini write a script that takes two llama-bench result markdown tables and merges them into the shape I posted above.

I have pretty long compile times and often need comparisons during development (due to a lack of good profiling tools for Vulkan), which is why I keep multiple build directories around and generate the data manually.

netrunnereve (Collaborator) commented:
@3Simplex Since this is happening on master as well, you should create a new issue, as it's not really relevant to this PR. FYI, in the future please use llama-bench to get your results.

Your issue might be due to VRAM; you can try monitoring it while you run with different context sizes.

0cc4m merged commit 043fb27 into master, Aug 24, 2025 (48 checks passed)
0cc4m deleted the 0cc4m/vulkan-mmid-subgroup branch, August 24, 2025 17:36
3Simplex commented:

> @0cc4m just FYI, since I assumed you made these tables manually: take a look at scripts/compare-commits.sh. The script internally calls llama-bench and scripts/compare-commits.py to create a performance comparison table automatically.

This is exactly the kind of thing I needed to know.

> @3Simplex Since this is happening on master as well, you should create a new issue, as it's not really relevant to this PR. FYI, in the future please use llama-bench to get your results.
>
> Your issue might be due to VRAM; you can try monitoring it while you run with different context sizes.

Yeah, that's what made me laugh above. I've been meaning to learn about other functionality like the scripts and examples, but haven't had the time to read them. I was thinking that this would have made things so much easier if I had known about it earlier.

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 25, 2025
…ggml-org#15524)

* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16