ggml : fix FA mask dim 2 and 3 #14505

Merged

merged 3 commits from gg/fa-fix-dim-2 into master on Jul 3, 2025

Conversation

ggerganov
Member

@ggerganov ggerganov commented Jul 2, 2025

In #14500, @JohannesGaessler correctly noted that the FA op did not utilize dim 3 of the mask. I overlooked this, and now, while updating #14363, I realized that we need to align the dimensions.

The fix is simple, I will try to update it myself across the Vulkan and CUDA backends later today.

Also a small fix for ggml_soft_max_ext(): it was incorrectly requiring the mask to be a 3D array. A test for this case is added.
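
For reference, a minimal sketch of the case the new test covers: building a ggml_soft_max_ext() op with a 4D mask. The sizes, the context setup and the main() wrapper are illustrative, not taken from this PR:

    #include "ggml.h"
    #include <math.h>

    int main(void) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16*1024*1024,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        const int64_t n_kv = 64, n_batch = 8, n_head = 4, n_seq = 2, n_embd_head = 128;

        // KQ scores and a 4D mask: mask dim 2 broadcasts over the heads, dim 3 matches the sequences
        struct ggml_tensor * kq   = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_kv, n_batch, n_head, n_seq);
        struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_kv, n_batch, 1,      n_seq);

        // build the op; scale = 1/sqrt(n_embd_head), no ALiBi (max_bias = 0)
        struct ggml_tensor * out = ggml_soft_max_ext(ctx, kq, mask, 1.0f/sqrtf((float) n_embd_head), 0.0f);
        (void) out;

        ggml_free(ctx);
        return 0;
    }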

@github-actions github-actions bot added the testing, ggml, Apple Metal and Nvidia GPU labels on Jul 2, 2025
@ggerganov
Member Author

ggerganov commented Jul 2, 2025

For the CUDA kernels I think I updated most of the variants, but the MMA one is more difficult to figure out. @JohannesGaessler Could you take a look at 2a20a7e and see if it makes sense? Any help with updating the MMA code is appreciated.

Edit: nvm, I think my approach was not correct.

@github-actions github-actions bot added the Vulkan label on Jul 3, 2025
@ggerganov ggerganov marked this pull request as ready for review July 3, 2025 05:26
@ggerganov
Member Author

@JohannesGaessler @jeffbolznv For now I will disable the broadcast of ggml_flash_attn_ext() in Vulkan again (it was already disabled for CUDA). Apologies for changing the API again - I messed it up a bit in #14435.

To summarize, the correct broadcast logic that we have to support is as follows:

llama.cpp/ggml/include/ggml.h, lines 1983 to 2003 in 89ee2f1:

    // q:    [n_embd_k, n_batch,     n_head,    ne3 ]
    // k:    [n_embd_k, n_kv,        n_head_kv, ne3 ]
    // v:    [n_embd_v, n_kv,        n_head_kv, ne3 ] !! not transposed !!
    // mask: [n_kv,     n_batch_pad, ne32,      ne33] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
    // res:  [n_embd_v, n_head,      n_batch,   ne3 ] !! permuted !!
    //
    // broadcast:
    //   n_head % n_head_kv == 0
    //   n_head % ne32      == 0
    //   ne3    % ne33      == 0
    //
    GGML_API struct ggml_tensor * ggml_flash_attn_ext(
            struct ggml_context * ctx,
            struct ggml_tensor  * q,
            struct ggml_tensor  * k,
            struct ggml_tensor  * v,
            struct ggml_tensor  * mask,
            float                 scale,
            float                 max_bias,
            float                 logit_softcap);

In practice, the ne32 will probably always be equal to 1, i.e. we specify the mask for a single attention head and broadcast it to all other heads. But in theory, ne32 could be any integer that can be broadcast to n_head.

On master, ne33 is always equal to 1. But with the upcoming #14363 it can also be equal to ne3 > 1 (the number of parallel sequences).

This way, the mask that we pass to ggml_soft_max_ext() and ggml_flash_attn_ext() is technically the same - the dimensions correspond 1-to-1.
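
To make the shapes concrete, here is a minimal sketch of a call that satisfies the broadcast rules above. The sizes are illustrative, and an initialized struct ggml_context * ctx (plus the includes from the earlier sketch) is assumed:

    // ne32 = 1: one mask shared by all heads; ne33 = ne3: one mask per sequence
    const int64_t n_embd_k = 128, n_embd_v = 128, n_batch = 32, n_kv = 256;
    const int64_t n_head = 32, n_head_kv = 8, ne3 = 4, ne32 = 1, ne33 = ne3;

    struct ggml_tensor * q    = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_k, n_batch, n_head,    ne3);
    struct ggml_tensor * k    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_k, n_kv,    n_head_kv, ne3);
    struct ggml_tensor * v    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_v, n_kv,    n_head_kv, ne3);
    struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_kv,
            GGML_PAD(n_batch, GGML_KQ_MASK_PAD), ne32, ne33);

    // n_head % n_head_kv == 0, n_head % ne32 == 0 and ne3 % ne33 == 0 all hold for these sizes
    struct ggml_tensor * res = ggml_flash_attn_ext(ctx, q, k, v, mask,
            1.0f/sqrtf((float) n_embd_k), 0.0f, 0.0f);
    // res: [n_embd_v, n_head, n_batch, ne3], permuted as documented above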

Merging this for now so I can continue working on top of this and later on we'll hopefully add support for these cases.

@ggerganov ggerganov merged commit 9067487 into master Jul 3, 2025
48 checks passed
@ggerganov ggerganov deleted the gg/fa-fix-dim-2 branch July 3, 2025 07:47
@JohannesGaessler
Collaborator

If we use dimension 3 to identify the sequence that a mask belongs to, do we still need broadcasting? Intuitively I would have thought that there would be a 1:1 mapping between sequences and attention masks. But in that case, wouldn't it be simpler to just run one instance of GGML_OP_FLASH_ATTN_EXT per parallel sequence? (I assume the thinking is that a single ggml op/kernel launch is going to be more efficient.)

@ggerganov
Member Author

It is correct that ne3 == ne33 always. It's just that the generalization to ne3 % ne33 == 0 should be easy, so there is no reason not to support it.

But in that case, wouldn't it be simpler to just run one instance of GGML_OP_FLASH_ATTN_EXT per parallel sequence?

Technically, when we reach the attention, we could make views of the batch and loop over each sequence, but this has a few disadvantages:

  • The number of nodes in the graph will increase, in some cases significantly
  • The number of nodes will be a function of the batch contents. Not a big problem, but it does not help either
  • There would be extra concat nodes for merging back the results from the separate streams of the FAs

It also has some advantages, but overall I think the single-shot version is better. The multi-shot strategy that you mention is currently a backup plan that I think we should consider only if the single-shot version does not work for some reason.
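
For illustration, here is a hypothetical sketch of what that multi-shot backup plan would look like at graph-build time. The names q, k, v, mask, scale and max_bias are assumed from the surrounding graph-build code; this is not the implementation in this PR:

    // one GGML_OP_FLASH_ATTN_EXT per sequence, results merged back with concat
    struct ggml_tensor * res = NULL;

    for (int64_t s = 0; s < q->ne[3]; ++s) {
        struct ggml_tensor * q_s = ggml_view_4d(ctx, q, q->ne[0], q->ne[1], q->ne[2], 1,
                q->nb[1], q->nb[2], q->nb[3], s*q->nb[3]);
        struct ggml_tensor * k_s = ggml_view_4d(ctx, k, k->ne[0], k->ne[1], k->ne[2], 1,
                k->nb[1], k->nb[2], k->nb[3], s*k->nb[3]);
        struct ggml_tensor * v_s = ggml_view_4d(ctx, v, v->ne[0], v->ne[1], v->ne[2], 1,
                v->nb[1], v->nb[2], v->nb[3], s*v->nb[3]);
        struct ggml_tensor * m_s = ggml_view_4d(ctx, mask, mask->ne[0], mask->ne[1], mask->ne[2], 1,
                mask->nb[1], mask->nb[2], mask->nb[3], (s % mask->ne[3])*mask->nb[3]);

        struct ggml_tensor * cur = ggml_flash_attn_ext(ctx, q_s, k_s, v_s, m_s, scale, max_bias, 0.0f);

        // each iteration adds view nodes plus an extra concat node to merge the streams
        res = res == NULL ? cur : ggml_concat(ctx, res, cur, 3);
    }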

@jeffbolznv
Collaborator

I started implementing this expecting it to be straightforward, but got a bit confused about handling grouped query attention. It seems like we can't group when mask->ne[2] != 1, because the different iq2 values all want different masks. Is this intended? I could disable the optimization for that case; I just want to make sure that's expected and not a problem with the definition.

@ggerganov
Member Author

Seems like we can't group when mask->ne[2] != 1 because the different iq2 values all want different masks.

I could have missed something again, but is there a reason preventing the use of the mask with an index at dimension 2: im2 = iq2 % nm2? If iq2 is the reduced dimension after GQA, then use im2 = (iq2 * qk_ratio + r) % nm2.

(here nm2 == mask->ne[2])
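
In kernel terms, the suggestion amounts to indexing along the lines of the following sketch (iq2, iq3, r and qk_ratio are the hypothetical per-thread indices from the discussion above, not names taken from the actual kernels):

    const int64_t nm2 = mask->ne[2];
    const int64_t nm3 = mask->ne[3];

    const int64_t im3 = iq3 % nm3; // sequence index   -> mask dim 3
    const int64_t im2 = iq2 % nm2; // query head index -> mask dim 2 (always 0 when nm2 == 1)

    // with GQA grouping, where iq2 indexes a K/V head and r runs over the query heads in its group:
    const int64_t im2_grouped = (iq2*qk_ratio + r) % nm2;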

@jeffbolznv
Collaborator

im2 = (iq2 * qk_ratio + r) % nm2

Right, this is the problematic part. The GQA path wants to have a single matrix per CTA, but now this would be multiple matrices per CTA.

You had previously said "In practice, the ne32 will probably always be equal to 1", so maybe it's fine to just disable it.

@ggerganov
Member Author

Yes, it's fine to disable it. Can you enable it for mask->ne[2] == 1 && mask->ne[3] > 1?

@jeffbolznv
Collaborator

Seems to work fine, latest commit enables it.

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 3, 2025
* origin/master:
Fix conditional enabling following arch checks for ggml-sycl (ggml-org#14504)
convert : correct gemma 3n conversion (ggml-org#14450)
kv-cache : use ggml_set_rows (ggml-org#14285)
ggml : fix FA mask dim 2 and 3 (ggml-org#14505)
ggml : remove kompute backend (ggml-org#14501)
CUDA: add dynamic shared mem to softmax, refactor general usage (ggml-org#14497)