vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices #15524
Conversation
I see it's not working with llvmpipe currently, I'll fix it.
Thanks for doing this. As MoE becomes more popular it makes sense to have this optimization as broadly as possible.
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
@@ -2454,32 +2470,34 @@ static void ggml_vk_load_shaders(vk_device& device) {
CREATE_MM2(pipeline_dequant_mul_mat_mat_f16[GGML_TYPE_IQ4_NL], matmul_iq4_nl_f16, mmq_wg_denoms, warptile_mmq, vk_mat_mat_push_constants, 3)
CREATE_MM2(pipeline_dequant_mul_mat_mat_f16[GGML_TYPE_MXFP4], matmul_mxfp4_f16, mmq_wg_denoms, warptile_mmq, vk_mat_mat_push_constants, 3)

CREATE_MM2(pipeline_matmul_id_f16, matmul_id_f16, wg_denoms, warptile, vk_mat_mat_id_push_constants, 4)
assert(device->subgroup_ballot);
GGML_ASSERT? (and another one below)
Done
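For reference, a minimal before/after sketch of the change this review comment asks for, assuming ggml's GGML_ASSERT macro (which reports and aborts on failure even in release builds, unlike plain assert):

```cpp
// Before: a plain assert, compiled out in release builds
// assert(device->subgroup_ballot);

// After: ggml's macro, which stays active in all build types
GGML_ASSERT(device->subgroup_ballot);
```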
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
CREATE_MM2(GGML_TYPE_IQ4_XS, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ4_XS], matmul_id_iq4_xs_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
CREATE_MM2(GGML_TYPE_IQ4_NL, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ4_NL], matmul_id_iq4_nl_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
CREATE_MM2(GGML_TYPE_MXFP4, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_MXFP4], matmul_id_mxfp4_f32, mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id);
if (device->subgroup_ballot) { |
Should this also check whether subgroup_size_control is supported?
I bet we could assume that subgroupBallot is supported everywhere, but I think subgroup_size_control is less broadly supported.
Yes, but IIRC this value does nothing if size control is unsupported; that check lives in the pipeline creation function.
But will the shader code work correctly if subgroup_require_full_support isn't supported? There's a shared memory array sized by NUM_WARPS
which assumes it can compute the number of subgroups.
I think you're right. I added the check, and also a size control check for the reduced Intel subgroup size. I also noticed an issue where subgroup_require_full_support was only active if subgroup_size_control is supported; on AMD GCN this meant it wasn't getting set, even though it is supported.
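To make the resolution concrete, here is a hedged sketch (not the actual ggml-vulkan code; the struct and field names are assumed) of how pipeline creation can pin a required subgroup size only when VK_EXT_subgroup_size_control is available, while requesting full subgroups based on its own feature bit, which is the AMD GCN point raised above:

```cpp
// Sketch only: illustrates the checks discussed above, not the real ggml-vulkan code.
#include <vulkan/vulkan.h>
#include <cstdint>

struct device_caps {
    bool subgroup_size_control;          // subgroupSizeControl feature (VK_EXT_subgroup_size_control)
    bool subgroup_require_full_support;  // computeFullSubgroups feature (a separate bit in the same feature struct)
};

static void configure_stage(const device_caps & caps,
                            uint32_t required_subgroup_size,
                            VkPipelineShaderStageCreateInfo & stage,
                            VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT & size_info) {
    if (caps.subgroup_size_control && required_subgroup_size != 0) {
        // Only pin the subgroup size when the device can actually honor it.
        size_info.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
        size_info.pNext = nullptr;
        size_info.requiredSubgroupSize = required_subgroup_size;
        stage.pNext = &size_info;
    }
    if (caps.subgroup_require_full_support) {
        // Full subgroups are a separate feature bit, so gate on it directly
        // rather than on subgroup_size_control (the AMD GCN case above).
        stage.flags |= VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT;
    }
}
```

Full subgroups matter here because the shader sizes a shared memory array by NUM_WARPS, which is only well-defined when the workgroup consists of complete subgroups.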
@@ -10092,12 +10166,9 @@ static bool ggml_vk_build_graph(ggml_backend_vk_context * ctx, ggml_cgraph * cgr
    }
}
if (need_sync) {
VK_LOG_DEBUG("node_idx=" << i << " sync"); |
Was it intentional to remove these?
Yes, there is no i outside the loop. I assume you moved the check.
AMD RX 6900 XT 16 GB, Adrenalin Edition 25.8.1, Win11. Something to do with the context window slows this model down:
Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M.gguf -c 16384
Qwen3-Coder-30B-A3B-Instruct-UD-IQ2_M.gguf -c 30720
Qwen3-Coder-30B-A3B-Instruct-Q3_K_S.gguf
Qwen3-Coder-30B-A3B-Instruct-UD-Q3_K_XL.gguf
Qwen3-Coder-30B-A3B-Instruct.i1-IQ3_M.gguf
Qwen3-Coder-30B-A3B-Instruct-Q2_K_L.gguf
Thanks! I saw #15427 and thought that this was possible but never really dug into it. On my W8100 + RX 470: PR:
Master:
The prompt processing speeds are now pretty close to what I get with Llama 8B, and as a bonus it's also running well with the default batch size. I wish Google or Mistral would make a MoE in this 20-30B range. @3Simplex how does the IQ2-M look on master with the same context sizes, and does the speed fall gradually or abruptly when you try different sizes between 16k and 32k?
m_warptile_id = { 128, 64, 64, 16, mul_mat_subgroup_size_16, 32, 2, tm_m, tn_m, tk_m, mul_mat_subgroup_size_16 };
s_warptile_id = { mul_mat_subgroup_size_16, 32, 32, 16, 32, 32, 2, tm_s, tn_s, tk_s, mul_mat_subgroup_size_16 };

l_warptile_mmqid = { 128, 128, 128, 32, mul_mat_subgroup_size_8 * 2, 64, 2, tm_l, tn_l, tk_l, mul_mat_subgroup_size_8 };
I think you need to add these to the shared memory check in llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp, lines 2293 to 2302 (at fe591d2):
// Disable mul_mat_id if not enough shared memory is available
if (!ggml_vk_matmul_shmem_support(device, s_warptile_mmq, true, t)) {
    device->mul_mat_id_s[i] = false;
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, m_warptile_mmq, true, t)) {
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, l_warptile_mmq, true, t)) {
    device->mul_mat_id_l[i] = false;
You're right, thanks.
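A hedged sketch of what that addition could look like, reusing the pattern quoted above; the s_warptile_mmqid and m_warptile_mmqid names are assumed by analogy with l_warptile_mmqid from the diff and may not match the final patch:

```cpp
// Sketch: gate the subgroup mul_mat_id pipelines on the new warptiles'
// shared-memory needs, mirroring the existing mmq checks quoted above.
if (!ggml_vk_matmul_shmem_support(device, s_warptile_mmqid, true, t)) {
    device->mul_mat_id_s[i] = false;
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, m_warptile_mmqid, true, t)) {
    device->mul_mat_id_m[i] = false;
    device->mul_mat_id_l[i] = false;
} else if (!ggml_vk_matmul_shmem_support(device, l_warptile_mmqid, true, t)) {
    device->mul_mat_id_l[i] = false;
}
```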
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
CREATE_MM(GGML_TYPE_Q4_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q4_K].f32acc, matmul_id_q4_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_Q5_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q5_K].f32acc, matmul_id_q5_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_Q6_K, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_Q6_K].f32acc, matmul_id_q6_k_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
CREATE_MM(GGML_TYPE_IQ1_S, pipeline_dequant_mul_mat_mat_id[GGML_TYPE_IQ1_S].f32acc, matmul_id_iq1_s_f32, , mmq_wg_denoms, warptile_mmq, vk_mat_mat_id_push_constants, 4, _id, 0);
Did you mean to use warptile_mmqid?
Yes, thanks.
Same settings on Master branch, IQ2-M
Everything over 16384 is similar until I max out.
This is the max I can run with.
@0cc4m just FYI, since I assumed you made these tables manually: take a look at
Thank you, it wasn't manual; I had Gemini write a script that takes two llama-bench result markdown tables and merges them into the shape I posted above. I have pretty long compile times and often need comparisons during development (due to a lack of good profiling tools for Vulkan), which is why I keep multiple build directories around and generate the data manually.
@3Simplex Since this is happening on master as well, you should create a new issue as it's not really relevant to this PR. FYI, in the future please use llama-bench to get your results. Your issue might be due to VRAM; you can try monitoring that while you run with different context sizes.
This is exactly the kind of thing I needed to know.
Yeah, that's what made me laugh above. I have been meaning to learn about other functionality like the scripts and examples but haven't had the time to read them. I was thinking that it would have made things so much easier if I had known about it earlier.
…ggml-org#15524)
* vulkan: use subgroup function for mul_mat_id shader even without coopmat
* vulkan: fix compile warnings
* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id
* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
This applies the optimization @jeffbolznv contributed in #15427 to non-coopmat GPUs, as long as they support subgroup ballots. It closes the gap to ROCm prompt processing performance on AMD GPUs in Mixture-of-Experts models.
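For readers skimming the thread, here is a hedged sketch (field names assumed, not the actual ggml-vulkan code) condensing the capability gate the discussion above converges on for selecting the subgroup-optimized mul_mat_id path on non-coopmat devices:

```cpp
// Sketch only: assumed field names, summarizing the conditions discussed in this PR.
#include <cstdint>

struct vk_device_caps {
    bool coopmat;                        // coopmat devices already got this in #15427
    bool subgroup_ballot;                // subgroupBallot / ballot bit-count ops available
    bool subgroup_size_control;          // VK_EXT_subgroup_size_control
    bool subgroup_require_full_support;  // full compute subgroups
    uint32_t subgroup_size;
};

static bool use_subgroup_mul_mat_id(const vk_device_caps & d) {
    return d.subgroup_ballot &&
           d.subgroup_size_control &&
           d.subgroup_require_full_support &&
           d.subgroup_size >= 16;  // per the final commit: disabled for subgroups < 16
}
```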
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
(Nvidia result is without coopmat or coopmat2; those paths already have the optimization)
@netrunnereve This should give your gpt-oss benchmarks a good boost.