model: support Longcat-Flash (help wanted) #19182
Conversation
Huh, interesting. Likely need to extend `ggml_mul_mat_id` for this.
Hello, I'm also paying attention to the adaptation of the longcat-flash model. Suppose we bypass the requirement of …
@hebangwen Not sure I follow, but I think simply setting the coefficients like this should work:

```
# normal expert
alpha_i = 1.0f
beta_i  = 0.0f

# zero-compute expert
alpha_i = 0.0f
beta_i  = 1.0f
```

And the matrix C is just the MoE input (i.e. …).
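To make the coefficient idea concrete, here is a minimal standalone sketch (plain C++, not ggml code; `w`, `alpha`, `beta` and the shapes are all illustrative) of the blending `out = alpha_i * (W_i · x) + beta_i * x`, where `x` plays the role of the matrix C. With `alpha = 0, beta = 1` an expert reduces to a pass-through of the input, so its matmul can be skipped entirely:

```cpp
// Per-expert alpha/beta blending: out = alpha[e] * (W[e] * x) + beta[e] * x.
// Expert 0 is a normal expert, expert 1 is a "zero-compute" (pass-through) expert.
#include <cstdio>
#include <vector>

int main() {
    const int n_embd   = 4;
    const int n_expert = 2;

    // toy expert weights: expert 0 = 2 * identity, expert 1 has no real weights
    std::vector<std::vector<float>> w(n_expert, std::vector<float>(n_embd * n_embd, 0.0f));
    for (int i = 0; i < n_embd; ++i) {
        w[0][i*n_embd + i] = 2.0f;
    }

    const float alpha[n_expert] = {1.0f, 0.0f};
    const float beta [n_expert] = {0.0f, 1.0f};

    const std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f}; // MoE input (matrix C)

    for (int e = 0; e < n_expert; ++e) {
        std::vector<float> out(n_embd, 0.0f);
        if (alpha[e] != 0.0f) {
            // the matmul only needs to run when alpha is non-zero
            for (int i = 0; i < n_embd; ++i) {
                for (int j = 0; j < n_embd; ++j) {
                    out[i] += w[e][i*n_embd + j] * x[j];
                }
            }
        }
        for (int i = 0; i < n_embd; ++i) {
            out[i] = alpha[e]*out[i] + beta[e]*x[i];
        }
        printf("expert %d out:", e);
        for (float v : out) {
            printf(" %.1f", v);
        }
        printf("\n");
    }
    return 0;
}
```

Expert 1's output is exactly the input, i.e. it behaves as a zero-computation expert.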
It can be simpler to explain the …

As @ggerganov suggested, I imagine the … In the example above, computation for … However, one issue is that the router weight … However, yet another problem, even when the idea above is implemented: the output dim of … For the calculation of the …
Seems like quite a bit more work than I initially thought, so I think we should reconsider whether this is worth implementing. Currently, only the longcat-flash family uses this technique, so it could be quite risky to add so much infrastructure to support it.
Yes, seems more complicated. Let's reconsider later in case this architecture shows any promise.
I was working on #19167 but realized that the normal (non-ngram) model is not even supported yet.
Thinking it would be simple, I gave it a try, but ended up stuck on implementing their notion of "zero-computing experts" (ref: link to paper).
The main problem is that `ggml_mul_mat_id` isn't made for this purpose, and I have no idea how to adapt it or which ops may need to be added to make it work.

To illustrate the problem, here is how a normal MoE FFN works: for each token, the router selects `n_expert_used` experts, and `ggml_mul_mat_id` then runs the FFN matmuls for every selected expert. This means we spend the same amount of computation for each token, proportional to `n_expert_used`.
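For reference, a minimal sketch (plain C++, illustrative names only, not ggml code) of this "normal" top-k MoE routing: every token runs through exactly `n_expert_used` experts, so the per-token FFN cost is constant:

```cpp
// Toy top-k MoE: every token runs through exactly n_expert_used experts,
// so per-token compute is constant.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int n_expert      = 8;
    const int n_expert_used = 2;
    const int n_tokens      = 4;

    // fake router scores: score[t][e]
    std::vector<std::vector<float>> score(n_tokens, std::vector<float>(n_expert));
    for (int t = 0; t < n_tokens; ++t) {
        for (int e = 0; e < n_expert; ++e) {
            score[t][e] = float((t*7 + e*3) % 5); // arbitrary deterministic scores
        }
    }

    for (int t = 0; t < n_tokens; ++t) {
        // pick the top n_expert_used experts for this token
        std::vector<int> ids(n_expert);
        std::iota(ids.begin(), ids.end(), 0);
        std::partial_sort(ids.begin(), ids.begin() + n_expert_used, ids.end(),
                          [&](int a, int b) { return score[t][a] > score[t][b]; });

        // every selected expert would run its FFN matmuls here (mul_mat_id)
        printf("token %d -> %d expert matmuls (experts:", t, n_expert_used);
        for (int k = 0; k < n_expert_used; ++k) {
            printf(" %d", ids[k]);
        }
        printf(")\n");
    }
    return 0;
}
```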
However, with longcat-flash, only tokens that are not routed to one of the `n_zero_experts` go through the FFN; the rest skip the FFN altogether.

Apart from the weird MoE, the model has a double-block architecture, meaning there are 2 attentions and 2 FFNs per layer. Upon converting to GGUF, we convert it to a model of `2 * n_layer` layers, which makes the implementation much easier.
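For contrast, here is a matching sketch of the longcat-flash-style routing described above, under the assumption (as I read the paper's "zero-computation experts") that the router scores `n_zero_experts` additional pass-through experts alongside the real ones. Selections that land on a zero-computation expert skip the FFN, so the number of real expert matmuls varies per token, which is the behaviour `ggml_mul_mat_id` was not designed for. All names are illustrative, not ggml API:

```cpp
// Toy longcat-flash-style MoE: the router also scores n_zero_experts
// "zero-computation" experts; selections that land on them skip the FFN,
// so per-token compute varies.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int n_expert       = 8;  // real FFN experts (ids 0 .. n_expert-1)
    const int n_zero_experts = 4;  // pass-through experts with no weights
    const int n_expert_used  = 2;
    const int n_tokens       = 4;
    const int n_total        = n_expert + n_zero_experts;

    // fake router scores over real + zero experts: score[t][e]
    std::vector<std::vector<float>> score(n_tokens, std::vector<float>(n_total));
    for (int t = 0; t < n_tokens; ++t) {
        for (int e = 0; e < n_total; ++e) {
            score[t][e] = float((t*5 + e*7) % 11); // arbitrary deterministic scores
        }
    }

    for (int t = 0; t < n_tokens; ++t) {
        std::vector<int> ids(n_total);
        std::iota(ids.begin(), ids.end(), 0);
        std::partial_sort(ids.begin(), ids.begin() + n_expert_used, ids.end(),
                          [&](int a, int b) { return score[t][a] > score[t][b]; });

        int n_matmul = 0;
        for (int k = 0; k < n_expert_used; ++k) {
            if (ids[k] < n_expert) {
                n_matmul++;   // real expert: FFN matmuls happen
            }                 // else: zero expert, output is just the input
        }
        printf("token %d -> %d of %d selected experts need matmuls\n",
               t, n_matmul, n_expert_used);
    }
    return 0;
}
```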