ggml : fix FA mask dim 2 and 3 #14505

Merged

merged 3 commits from gg/fa-fix-dim-2 into master on Jul 3, 2025

Conversation

ggerganov
Member

@ggerganov ggerganov commented Jul 2, 2025

In #14500, @JohannesGaessler correctly noted that the FA op did not utilize dim 3 of the mask. I overlooked this, and now, while updating #14363, I realized that we need to align the dimensions.

The fix is simple, I will try to update it myself across the Vulkan and CUDA backends later today.

Also a small fix for ggml_soft_max_ext(): it was incorrectly requiring the mask to be a 3D array. A test for this case is added.
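
For reference, a minimal sketch of the case the new test covers: building a ggml_soft_max_ext() op with a 4D mask. The sizes, the context setup and the main() wrapper are illustrative, not taken from this PR:

    #include "ggml.h"
    #include <math.h>

    int main(void) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16*1024*1024,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        const int64_t n_kv = 64, n_batch = 8, n_head = 4, n_seq = 2, n_embd_head = 128;

        // KQ scores and a 4D mask: mask dim 2 broadcasts over the heads, dim 3 matches the sequences
        struct ggml_tensor * kq   = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_kv, n_batch, n_head, n_seq);
        struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_kv, n_batch, 1,      n_seq);

        // build the op; scale = 1/sqrt(n_embd_head), no ALiBi (max_bias = 0)
        struct ggml_tensor * out = ggml_soft_max_ext(ctx, kq, mask, 1.0f/sqrtf((float) n_embd_head), 0.0f);
        (void) out;

        ggml_free(ctx);
        return 0;
    }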

@github-actions github-actions bot added the testing, ggml, Apple Metal and Nvidia GPU labels on Jul 2, 2025
@ggerganov
Member Author

ggerganov commented Jul 2, 2025

For the CUDA kernels I think I updated most of the variants, but the MMA one is more difficult to figure out. @JohannesGaessler Could you take a look at 2a20a7e and see if it makes sense? Any help with updating the MMA code is appreciated.

Edit: nvm, I think my approach was not correct.

@github-actions github-actions bot added the Vulkan label on Jul 3, 2025
@ggerganov ggerganov marked this pull request as ready for review July 3, 2025 05:26
@ggerganov
Member Author

@JohannesGaessler @jeffbolznv For now I will disable the broadcast of ggml_flash_attn_ext() in Vulkan again (it was already disabled for CUDA). Apologies for changing the API again - I messed it up a bit in #14435.

To summarize, the correct broadcast logic that we have to support is as follows:

llama.cpp/ggml/include/ggml.h, lines 1983 to 2003 in 89ee2f1:

    // q:    [n_embd_k, n_batch,     n_head,    ne3 ]
    // k:    [n_embd_k, n_kv,        n_head_kv, ne3 ]
    // v:    [n_embd_v, n_kv,        n_head_kv, ne3 ] !! not transposed !!
    // mask: [n_kv,     n_batch_pad, ne32,      ne33] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
    // res:  [n_embd_v, n_head,      n_batch,   ne3 ] !! permuted !!
    //
    // broadcast:
    //   n_head % n_head_kv == 0
    //   n_head % ne32      == 0
    //   ne3    % ne33      == 0
    //
    GGML_API struct ggml_tensor * ggml_flash_attn_ext(
            struct ggml_context * ctx,
            struct ggml_tensor  * q,
            struct ggml_tensor  * k,
            struct ggml_tensor  * v,
            struct ggml_tensor  * mask,
            float                 scale,
            float                 max_bias,
            float                 logit_softcap);

In practice, the ne32 will probably always be equal to 1, i.e. we specify the mask for a single attention head and broadcast it to all other heads. But in theory, ne32 could be any integer that can be broadcast to n_head.

On master, ne33 is always equal to 1. But with the upcoming #14363 it can also be equal to ne3 > 1 (the number of parallel sequences).

This way, the mask that we pass to ggml_soft_max_ext() and ggml_flash_attn_ext() is technically the same - the dimensions correspond 1-to-1.
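
To make the shapes concrete, here is a minimal sketch of a call that satisfies the broadcast rules above. The sizes are illustrative, and an initialized struct ggml_context * ctx (plus the includes from the earlier sketch) is assumed:

    // ne32 = 1: one mask shared by all heads; ne33 = ne3: one mask per sequence
    const int64_t n_embd_k = 128, n_embd_v = 128, n_batch = 32, n_kv = 256;
    const int64_t n_head = 32, n_head_kv = 8, ne3 = 4, ne32 = 1, ne33 = ne3;

    struct ggml_tensor * q    = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_k, n_batch, n_head,    ne3);
    struct ggml_tensor * k    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_k, n_kv,    n_head_kv, ne3);
    struct ggml_tensor * v    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_v, n_kv,    n_head_kv, ne3);
    struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_kv,
            GGML_PAD(n_batch, GGML_KQ_MASK_PAD), ne32, ne33);

    // n_head % n_head_kv == 0, n_head % ne32 == 0 and ne3 % ne33 == 0 all hold for these sizes
    struct ggml_tensor * res = ggml_flash_attn_ext(ctx, q, k, v, mask,
            1.0f/sqrtf((float) n_embd_k), 0.0f, 0.0f);
    // res: [n_embd_v, n_head, n_batch, ne3], permuted as documented above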

Merging this for now so I can continue working on top of this and later on we'll hopefully add support for these cases.

@ggerganov ggerganov merged commit 9067487 into master Jul 3, 2025
48 checks passed
@ggerganov ggerganov deleted the gg/fa-fix-dim-2 branch July 3, 2025 07:47
@JohannesGaessler
Collaborator

If we use dimension 3 to identify the sequence that a mask belongs to, do we still need broadcasting? Intuitively I would have thought that there would be a 1:1 mapping between sequences and attention masks. But in that case, wouldn't it be simpler to just run one instance of GGML_OP_FLASH_ATTN_EXT per parallel sequence? (I assume the thinking is that a single ggml op/kernel launch is going to be more efficient.)

@ggerganov
Member Author

It is correct that ne3 == ne33 always. It's just that the generalization to ne3 % ne33 == 0 should be easy, so there is no reason not to support it.

But in that case, wouldn't it be simpler to just run one instance of GGML_OP_FLASH_ATTN_EXT per parallel sequence?

Technically, when we reach the attention, we could make views of the batch and loop over each sequence, but this has a few disadvantages:

  • The number of nodes in the graph will increase, in some cases significantly
  • The number of nodes will be a function of the batch contents. Not a big problem, but it does not help either
  • There would be extra concat nodes for merging back the results from the separate streams of the FAs

It also has some advantages, but overall I think the single-shot version is better. The multi-shot strategy that you mention is currently a backup plan that I think we should consider only if the single-shot version does not work for some reason.
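
For illustration, here is a hypothetical sketch of what that multi-shot backup plan would look like at graph-build time. The names q, k, v, mask, scale and max_bias are assumed from the surrounding graph-build code; this is not the implementation in this PR:

    // one GGML_OP_FLASH_ATTN_EXT per sequence, results merged back with concat
    struct ggml_tensor * res = NULL;

    for (int64_t s = 0; s < q->ne[3]; ++s) {
        struct ggml_tensor * q_s = ggml_view_4d(ctx, q, q->ne[0], q->ne[1], q->ne[2], 1,
                q->nb[1], q->nb[2], q->nb[3], s*q->nb[3]);
        struct ggml_tensor * k_s = ggml_view_4d(ctx, k, k->ne[0], k->ne[1], k->ne[2], 1,
                k->nb[1], k->nb[2], k->nb[3], s*k->nb[3]);
        struct ggml_tensor * v_s = ggml_view_4d(ctx, v, v->ne[0], v->ne[1], v->ne[2], 1,
                v->nb[1], v->nb[2], v->nb[3], s*v->nb[3]);
        struct ggml_tensor * m_s = ggml_view_4d(ctx, mask, mask->ne[0], mask->ne[1], mask->ne[2], 1,
                mask->nb[1], mask->nb[2], mask->nb[3], (s % mask->ne[3])*mask->nb[3]);

        struct ggml_tensor * cur = ggml_flash_attn_ext(ctx, q_s, k_s, v_s, m_s, scale, max_bias, 0.0f);

        // each iteration adds view nodes plus an extra concat node to merge the streams
        res = res == NULL ? cur : ggml_concat(ctx, res, cur, 3);
    }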

@jeffbolznv
Collaborator

I started implementing this expecting it to be straightforward, but got a bit confused about handling grouped query attention. It seems like we can't group when mask->ne[2] != 1, because the different iq2 values all want different masks. Is this intended? I could disable the optimization for that case; I just want to make sure that's expected and not a problem with the definition.

@ggerganov
Member Author

Seems like we can't group when mask->ne[2] != 1 because the different iq2 values all want different masks.

I could have missed something again, but is there a reason preventing the use of the mask with an index at dimension 2: im2 = iq2 % nm2? If iq2 is the reduced dimension after GQA, then use im2 = (iq2 * qk_ratio + r) % nm2.

(here nm2 == mask->ne[2])
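
In kernel terms, the suggestion amounts to indexing along the lines of the following sketch (iq2, iq3, r and qk_ratio are the hypothetical per-thread indices from the discussion above, not names taken from the actual kernels):

    const int64_t nm2 = mask->ne[2];
    const int64_t nm3 = mask->ne[3];

    const int64_t im3 = iq3 % nm3; // sequence index   -> mask dim 3
    const int64_t im2 = iq2 % nm2; // query head index -> mask dim 2 (always 0 when nm2 == 1)

    // with GQA grouping, where iq2 indexes a K/V head and r runs over the query heads in its group:
    const int64_t im2_grouped = (iq2*qk_ratio + r) % nm2;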

@jeffbolznv
Collaborator

im2 = (iq2 * qk_ratio + r) % nm2

Right, this is the problematic part. The GQA path wants to have a single matrix per CTA, but now this would be multiple matrices per CTA.

You had previously said "In practice, the ne32 will probably always be equal to 1", so maybe it's fine to just disable it.

@ggerganov
Member Author

Yes, it's fine to disable it. Can you enable it for mask->ne[2] == 1 && mask->ne[3] > 1?

@jeffbolznv
Collaborator

Seems to work fine, latest commit enables it.

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 3, 2025
* origin/master:
Fix conditional enabling following arch checks for ggml-sycl (ggml-org#14504)
convert : correct gemma 3n conversion (ggml-org#14450)
kv-cache : use ggml_set_rows (ggml-org#14285)
ggml : fix FA mask dim 2 and 3 (ggml-org#14505)
ggml : remove kompute backend (ggml-org#14501)
CUDA: add dynamic shared mem to softmax, refactor general usage (ggml-org#14497)