
Conversation

@ngxson (Collaborator) commented Jan 29, 2026

I was working on #19167 but realized that the normal (non-ngram) model is not even supported yet.

Thinking it would be simple, I gave it a try, but ended up stuck on implementing their notion of "zero-computing experts" (ref: link to paper)

[image: figure from the paper]

The main problem is that ggml_mul_mat_id isn't made for this purpose and I have no idea how to adapt it, or which ops may need to be added to make it work.


To illustrate the problem, here is how a normal MoE FFN works:

  • Calculate expert probs using the router
  • Sort & get top_k experts
  • Do FFN gate/up/down with the selected top_k experts; this is done via ggml_mul_mat_id
  • Weighted sum the output

This means we spend the same amount of computation on each token, proportional to n_expert_used
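
For reference, here is a condensed sketch of how these four steps map onto a ggml graph today (simplified from llama.cpp's MoE FFN builder; the tensor names are illustrative and the final reduction over the n_expert_used dimension is omitted):

#include "ggml.h"

// Simplified MoE FFN graph: every token always pays for exactly n_expert_used expert FFNs.
static struct ggml_tensor * moe_ffn(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,        // input          [n_embd, n_tokens]
        struct ggml_tensor  * gate_inp,   // router weight  [n_embd, n_expert]
        struct ggml_tensor  * up_exps,    // stacked up     [n_embd, n_ff,   n_expert]
        struct ggml_tensor  * gate_exps,  // stacked gate   [n_embd, n_ff,   n_expert]
        struct ggml_tensor  * down_exps,  // stacked down   [n_ff,   n_embd, n_expert]
        int n_expert, int n_expert_used) {
    const int64_t n_embd   = cur->ne[0];
    const int64_t n_tokens = cur->ne[1];

    // 1. expert probs from the router
    struct ggml_tensor * probs = ggml_soft_max(ctx, ggml_mul_mat(ctx, gate_inp, cur));   // [n_expert, n_tokens]

    // 2. top_k expert ids and their router weights
    struct ggml_tensor * selected = ggml_top_k(ctx, probs, n_expert_used);               // [n_expert_used, n_tokens]
    struct ggml_tensor * weights  = ggml_get_rows(ctx,
            ggml_reshape_3d(ctx, probs, 1, n_expert, n_tokens), selected);               // [1, n_expert_used, n_tokens]

    // 3. FFN gate/up/down for the selected experts only, via ggml_mul_mat_id
    cur = ggml_reshape_3d(ctx, cur, n_embd, 1, n_tokens);
    struct ggml_tensor * up   = ggml_mul_mat_id(ctx, up_exps,   cur, selected);          // [n_ff, n_expert_used, n_tokens]
    struct ggml_tensor * gate = ggml_mul_mat_id(ctx, gate_exps, cur, selected);          // [n_ff, n_expert_used, n_tokens]
    struct ggml_tensor * out  = ggml_mul_mat_id(ctx, down_exps,
            ggml_mul(ctx, ggml_silu(ctx, gate), up), selected);                          // [n_embd, n_expert_used, n_tokens]

    // 4. weighted (per-expert) outputs; summing over n_expert_used is left out here
    return ggml_mul(ctx, out, weights);
}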

However, with longcat-flash:

  • After the top_k selection, ONLY experts with ID < n_zero_experts go through the FFN; the rest skip the FFN altogether
  • This makes the amount of computation vary token-by-token. For example: one token can use n FFN experts, another n-1, and another 0 FFN experts (in other words, skipping the MoE altogether)

Apart from the weird MoE, the model has a double-block architecture, meaning there are 2 attentions and 2 FFNs per layer. Upon converting to GGUF, we convert it to a model with 2 * n_layer layers, which makes the implementation much easier.

@ggerganov (Member)

Huh, interesting. Likely need to extend ggml_mul_mat_id with more general alpha_i*(A @ B_i) + beta_i*C and implement special casing in the kernels for alpha_i=0 to skip the compute.
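
For illustration, a possible shape for such an extended op (a hypothetical signature, sketched here for discussion; none of these names exist in ggml today):

// hypothetical extension of ggml_mul_mat_id (sketch only):
//   dst[:, i, t] = alpha[i, t] * (as[ids[i, t]] @ b[:, :, t]) + beta[i, t] * c[:, :, t]
// kernels could special-case alpha[i, t] == 0.0f and skip the matmul entirely
struct ggml_tensor * ggml_mul_mat_id_ab(
        struct ggml_context * ctx,
        struct ggml_tensor  * as,     // stacked expert weights  [n_in,  n_out, n_expert]
        struct ggml_tensor  * b,      // activations             [n_in,  1, n_tokens]
        struct ggml_tensor  * c,      // fallback term           [n_out, n_expert_used, n_tokens]
        struct ggml_tensor  * ids,    // selected expert ids     [n_expert_used, n_tokens]
        struct ggml_tensor  * alpha,  // per-selection scale for as[ids] @ b
        struct ggml_tensor  * beta);  // per-selection scale for c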

@hebangwen

> Huh, interesting. Likely need to extend ggml_mul_mat_id with more general alpha_i*(A @ B_i) + beta_i*C and implement special casing in the kernels for alpha_i=0 to skip the compute.

Hello, I'm also paying attention to the adaptation of the longcat-flash model. Suppose B is the expert weight. If we change mul_mat_id to alpha * (A @ B) + beta * C, then since the weights of zero-compute experts are not saved and the experts activated for each token differ, there can be an interleaved expert activation pattern. In that case, we need to determine for each token whether the activated expert is a zero-compute expert. Would this have an impact on performance?

If we instead bypass the requirement of ggml_mul_mat_id to arrange activations in token order and add a new operator that rearranges the activations in expert order, like torch._grouped_gemm, would this provide better adaptability? In that case, we would only need to determine whether the currently computed expert ID is valid or whether it is a zero-compute expert. See #18369
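
To make the reordering idea concrete, here is a plain-C sketch (not an existing ggml op; the layout and names are assumptions) of an expert-ordered matmul: rows are pre-sorted so that tokens routed to the same expert are contiguous, and selections that hit a zero-compute expert are simply not placed into any group.

#include <stddef.h>

// experts:   [n_expert][n_out][n_in]
// x_sorted:  [n_rows_total][n_in], rows grouped by expert id
// row_count: [n_expert], how many rows were routed to each (real) expert
// out:       [n_rows_total][n_out], same row order as x_sorted
static void grouped_mul_mat(const float * experts, const float * x_sorted,
                            const int * row_count, float * out,
                            int n_expert, int n_in, int n_out) {
    int row0 = 0;
    for (int e = 0; e < n_expert; ++e) {
        const float * W = experts + (size_t) e*n_out*n_in;
        for (int r = row0; r < row0 + row_count[e]; ++r) {
            for (int o = 0; o < n_out; ++o) {
                float acc = 0.0f;
                for (int c = 0; c < n_in; ++c) {
                    acc += W[(size_t) o*n_in + c] * x_sorted[(size_t) r*n_in + c];
                }
                out[(size_t) r*n_out + o] = acc;
            }
        }
        row0 += row_count[e];
    }
}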

@ggerganov (Member)

@hebangwen Not sure I follow, but I think simply setting the coefficient like this should work:

# normal expert
alpha_i = 1.0f
beta_i  = 0.0f

# zero-compute expert
alpha_i = 0.0f
beta_i  = 1.0f

And the matrix C is just the MoE input (i.e. x_t from the paper).
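
One way these coefficients could be derived on the fly from the top_k output (a sketch, assuming the zero-compute experts are assigned ids at or above n_expert, i.e. past the real experts):

#include <stdint.h>

// derive per-selection alpha/beta from the selected expert ids
static void set_coefficients(const int32_t * ids, int n_expert_used, int n_expert,
                             float * alpha, float * beta) {
    for (int i = 0; i < n_expert_used; ++i) {
        const int is_real = ids[i] < n_expert;
        alpha[i] = is_real ? 1.0f : 0.0f; // scales the expert matmul
        beta [i] = is_real ? 0.0f : 1.0f; // scales C (the MoE input x_t)
    }
}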

@ngxson (Collaborator, Author) commented Jan 30, 2026

Maybe it's simpler to explain mul_mat_id in plain terms:

  • A normal mul_mat(A, B) calculates A @ B = C, simple.
  • Now, instead of having just a single B, we have B as a stack of multiple matrices: B = (B0, B1, B2, ..., Bn)
  • For MoE, we want to mul_mat A with a subset of B, something like: A @ (B2, B8, ...)
  • So mul_mat_id takes an extra param, the indexes of elements in B to be used: mul_mat_id(A, B, (2, 8)) --> (A @ B2, A @ B8)

As @ggerganov suggested, I imagine mul_mat_id would now take extra alpha, beta params:

idx = (2, 8)
alpha = (0.3, 0.0)
mul_mat_id(A, B, idx, alpha) --> (A @ B2 * 0.3, A @ B8 * 0.0)

In the example above, the computation for A @ B8 will be skipped as its alpha value is 0.0

However, one issue is that the router weight (which would play the role of alpha here) is non-zero even for zero-compute experts; it comes straight from the top_k operation that selects the activated experts. So I think just adding a check like if (idx >= B->ne[2]) could be enough: if B only has 4 experts and expert ID=5 is accessed, it's out-of-bound, and we skip the mul_mat in that case.
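
A minimal sketch of that check, written as a plain-C stand-in for the kernel (dense float math only, layouts are assumptions): any selected id that falls outside [0, n_expert) skips the matmul and leaves its output slot at zero.

#include <string.h>
#include <stdint.h>
#include <stddef.h>

// experts: [n_expert][n_out][n_in], x: [n_tokens][n_in],
// ids:     [n_tokens][n_used],      out: [n_tokens][n_used][n_out]
static void mul_mat_id_with_skip(const float * experts, const float * x, const int32_t * ids,
                                 float * out, int n_expert, int n_used, int n_tokens,
                                 int n_in, int n_out) {
    for (int t = 0; t < n_tokens; ++t) {
        for (int i = 0; i < n_used; ++i) {
            const int32_t id = ids[t*n_used + i];
            float * dst = out + ((size_t) t*n_used + i)*n_out;
            if (id >= n_expert) {
                memset(dst, 0, n_out*sizeof(float)); // zero-compute expert: no matmul
                continue;
            }
            const float * W = experts + (size_t) id*n_out*n_in;
            for (int o = 0; o < n_out; ++o) {
                float acc = 0.0f;
                for (int c = 0; c < n_in; ++c) {
                    acc += W[(size_t) o*n_in + c] * x[(size_t) t*n_in + c];
                }
                dst[o] = acc;
            }
        }
    }
}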

However, yet another problem remains even when the idea above is implemented: the output dim of mul_mat_id will not be the same as the input, and more importantly, there are also other ops like mul or gelu in between the FFN gate/up/down. We can resolve this by assuming that the output of a skipped mul_mat is all 0.0, but I don't think that's a generic solution.

Calculating the beta_i*C term will also be a bit tricky, as we need to keep only the coefficients (aka router weights) that correspond to the zero-computing experts. Something like torch.where could be necessary, but that's another rabbit hole, I think.
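
For completeness, on the CPU the filtering would amount to something like this sketch (assuming, as above, that the zero-compute experts are the ids at or beyond n_expert): only the router weights whose selected id points at a zero-compute expert contribute, and their contribution is just the scaled input.

#include <stdint.h>
#include <stddef.h>

// ids/weights: [n_tokens][n_expert_used], x/out: [n_tokens][n_embd]
static void add_zero_expert_term(const int32_t * ids, const float * weights, const float * x,
                                 float * out, int n_tokens, int n_expert_used,
                                 int n_expert, int n_embd) {
    for (int t = 0; t < n_tokens; ++t) {
        for (int i = 0; i < n_expert_used; ++i) {
            if (ids[t*n_expert_used + i] < n_expert) {
                continue; // real expert: its output comes from the FFN path
            }
            const float w = weights[t*n_expert_used + i]; // router weight of this selection
            for (int d = 0; d < n_embd; ++d) {
                out[(size_t) t*n_embd + d] += w * x[(size_t) t*n_embd + d];
            }
        }
    }
}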

@ngxson (Collaborator, Author) commented Jan 30, 2026

Seems like quite a bit more work than I initially thought, so I think we should reconsider whether this is worth implementing. Currently, only the longcat-flash family uses this technique, so it could be quite risky to add this much infrastructure to support it.

@ggerganov (Member)

Yes, seems more complicated. Let's reconsider later in case this architecture shows any promise.

Labels

  • help wanted: Needs help from the community
  • model: Model specific
  • python: python script changes
