llama: Add option to merge gate and exp weights #19139
Conversation
Ah, we should probably also go through those to ensure they are all generators and yield from the base implementation, then.
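(For context, here is a minimal standalone sketch of the pattern being referred to: a subclass's `modify_tensors` written as a generator that yields from the base implementation for tensors it doesn't handle. The class, method signature and tensor names below are simplified stand-ins and may not match `convert_hf_to_gguf.py` exactly.)

```python
from __future__ import annotations
from typing import Iterable

class BaseModel:
    def modify_tensors(self, data, name: str, bid: int | None) -> Iterable[tuple[str, object]]:
        # Default behaviour: pass the tensor through unchanged.
        yield (name, data)

class MyMoEModel(BaseModel):
    def modify_tensors(self, data, name: str, bid: int | None) -> Iterable[tuple[str, object]]:
        if "exps" in name:
            # Model-specific handling of expert tensors would go here
            # (e.g. buffering gate/up parts until both are seen, then
            # yielding the fused tensor).
            yield (name, data)
        else:
            # Everything else defers to the base implementation.
            yield from super().modify_tensors(data, name, bid)

if __name__ == "__main__":
    m = MyMoEModel()
    print(list(m.modify_tensors([0.0], "blk.0.ffn_gate_exps.weight", 0)))
    print(list(m.modify_tensors([0.0], "blk.0.attn_norm.weight", 0)))
```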
Still, I think there are some problems with the conversion script:
I tried to address 2) above. Full disclosure: I used AI for this, to make a list of MoE models and apply all the edits. For gpt-oss I don't know what the correct way to do it is; one way is what I added in #18740, and I can add the same here.
You must update all arches in
I don't understand what you mean. Please assume I have no experience with conversion.
Sorry, I meant convert_hf_to_gguf.py, lines 489 to 491 (at 566884c).
IMO it's probably better to find a way to generalize this, rather than applying the change to each model. Also, maybe only apply this to a limited number of models first, then see if it actually works. What I'm worried about is that contributors may not be aware of this change, and some models will eventually forget to support it. Also, it adds a bit more boilerplate to new models' code. We should find a way to establish a new rule that new models should use merged gate/up by default.
My initial PR #18740 did this for only one model. The discussion was that it should be done for all models; for this we needed to call
The llama-arch.cpp file is full of boilerplate code; I don't see what this PR changes there. A function can be made to abstract the if statements, if that's what you mean; I was planning to do that for the Q,K,V merge anyway. With that said, I don't care if all models get support, especially since it involves a lot of code changes which aren't exactly fun or foolproof. My aim is to extract maximum performance out of a select few models which are proven to have many users, like gpt-oss, qwen3 and now GLM 4.7 flash. So the solution I see is this PR with some arches which have been verified to work; the rest can either fail through
Most of them are not really boilerplate code. For example: you may see most models copy the same FFN/FFN_EXPS code, but this is because the tensor dimensions must be specified, and some models can indeed have different shapes for them; this is entirely up to the original model (or the people who created the model) to decide. So they are not actually boilerplate code, they are configuration. The definitions of merged gate/up and QKV are not the same, because they can be inferred from the existing code. In terms of functional programming: you can have a pure function to transform the unmerged "definition" into the merged one (see the sketch after this comment). My suggestion is that we avoid adding llama.cpp-specific code paths like the ones you're adding into the model "definition". Instead, it can be implemented as a transformation inside

The same logic can be reused for the QKV merge. We can potentially generalize it further so that more merged-FFN and merged-QKV cases reuse exactly the same code path with minor modifications. Otherwise, if we only support the 3 models as you said:

I (more) agree with this because it makes things easier to test and to track down issues. As pointed out in my earlier comments, there were some issues regarding gpt-oss. So it makes more sense to scale down the scope of this PR and put more effort into testing it thoroughly.
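A rough, purely conceptual sketch (in Python, not llama.cpp code) of the pure transformation suggested above: the per-architecture definition only lists the unmerged expert tensors, and a separate function derives the merged layout from them. All names (`TensorDef`, `ffn_gate_up_exps`), the shapes, and the axis that gets doubled are made up for illustration.

```python
from dataclasses import dataclass

# Conceptual sketch only: none of these names exist in llama.cpp.
@dataclass(frozen=True)
class TensorDef:
    name: str                 # e.g. "ffn_gate_exps"
    shape: tuple[int, ...]    # illustrative layout: (n_expert, n_ff, n_embd)

def merge_gate_up(defs: list[TensorDef]) -> list[TensorDef]:
    """Pure function: derive a merged definition from an unmerged one.

    The per-model configuration (which tensors exist, their shapes) stays in
    the input list; the merged layout is computed from it instead of being
    hand-written for every architecture."""
    by_name = {d.name: d for d in defs}
    gate = by_name.get("ffn_gate_exps")
    up = by_name.get("ffn_up_exps")
    if gate is None or up is None or gate.shape != up.shape:
        return defs  # nothing to merge for this architecture
    n_expert, n_ff, n_embd = gate.shape
    merged = TensorDef("ffn_gate_up_exps", (n_expert, 2 * n_ff, n_embd))
    return [merged] + [d for d in defs if d not in (gate, up)]

# Example: an unmerged per-layer definition, as a model might declare it.
unmerged = [
    TensorDef("ffn_gate_exps", (64, 2048, 4096)),
    TensorDef("ffn_up_exps",   (64, 2048, 4096)),
    TensorDef("ffn_down_exps", (64, 4096, 2048)),
]
for d in merge_gate_up(unmerged):
    print(d)
```

Because the merge is derived rather than declared per model, new architectures would get it without extra work, which addresses the concern above that contributors may not be aware of the change.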
Otherwise, another way to do it is to abstract the loading of the FFN into a new function,
Not sure what you mean. We can merge Q, K, V into one big matrix, and the same for gate and up. I'm not able to see the difference, and even your proposed implementation uses the same path for both.
Re the
And btw, this pattern already exists in lines 3297 to 3309 (at 362bff9).
Edit: Either way, leave it for a follow-up.
What I meant was that, when writing the tensor loading code, the detail of whether they are merged or not should be abstracted out, because it has nothing to do with the model definition itself. In terms of functional programming, it doesn't make sense to write the merge branching into the definition itself; instead, it should be applied as a transformation on top of it. It doesn't change the implementation under the hood, but the point is that the model definition should ONLY convey what makes up the model, NOT what we added to make it faster.
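To illustrate the point, here is a small conceptual sketch (Python, hypothetical names, not llama.cpp's API): the storage can hold a single fused gate_up matrix, while code written against the model definition still sees separate `gate` and `up` tensors as views, so the definition itself never has to mention the fusion.

```python
import numpy as np

# Conceptual sketch (not llama.cpp code): storage holds one fused gate_up
# matrix, but callers still ask for "gate" and "up", so the model
# definition stays unaware of the merged representation.
class LayerWeights:
    def __init__(self, fused_gate_up: np.ndarray, n_ff: int):
        self._fused = fused_gate_up        # shape: (2 * n_ff, n_embd)
        self._n_ff = n_ff

    @property
    def gate(self) -> np.ndarray:
        return self._fused[: self._n_ff]   # view, no copy

    @property
    def up(self) -> np.ndarray:
        return self._fused[self._n_ff :]   # view, no copy

n_ff, n_embd = 8, 4
w = LayerWeights(np.arange(2 * n_ff * n_embd, dtype=np.float32).reshape(2 * n_ff, n_embd), n_ff)
x = np.ones(n_embd, dtype=np.float32)
# A caller written against the unmerged definition works unchanged:
h = (w.gate @ x) * (w.up @ x)   # simplified gated FFN, activation omitted
print(h.shape)                  # (8,)
```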
Continuing on #18740 and #18866, this adds the option `--fuse_gate_up_exps` to `convert_hf_to_gguf.py`. I've just added the gate_up tracking for deepseek2 (GLM 4.7 flash) and gpt-oss, although for gpt-oss we need even more changes (it goes through `generate_extra_tensors` for generating the expert weights). This PR is not complete, as we would need to add this check in all MoE models and their tensors, but I'm putting it out there in any case.

On a 5090:
Master
this PR (with the fused GGUF)
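For reference, a rough standalone sketch of what the conversion-side fusing described above could look like. The tensor names, the concatenation axis and the flag handling are assumptions for illustration, not necessarily what this PR implements.

```python
import numpy as np

def fuse_gate_up_exps(tensors: dict[str, np.ndarray], fuse: bool) -> dict[str, np.ndarray]:
    """Standalone sketch: pair up per-layer gate/up expert tensors and
    concatenate them into a single gate_up tensor when fusing is requested.
    The names and the concat axis are illustrative assumptions."""
    if not fuse:
        return tensors
    out: dict[str, np.ndarray] = {}
    for name, data in tensors.items():
        if name.endswith("ffn_gate_exps.weight"):
            up = tensors[name.replace("ffn_gate_exps", "ffn_up_exps")]
            # Concatenate along the feed-forward dimension (axis 1 here is an
            # assumption about the layout: [n_expert, n_ff, n_embd]).
            out[name.replace("ffn_gate_exps", "ffn_gate_up_exps")] = np.concatenate([data, up], axis=1)
        elif name.endswith("ffn_up_exps.weight"):
            continue  # already folded into the fused tensor above
        else:
            out[name] = data
    return out

# Toy usage with one layer, 4 experts, n_ff=8, n_embd=4:
t = {
    "blk.0.ffn_gate_exps.weight": np.zeros((4, 8, 4), dtype=np.float32),
    "blk.0.ffn_up_exps.weight":   np.ones((4, 8, 4), dtype=np.float32),
    "blk.0.ffn_down_exps.weight": np.zeros((4, 4, 8), dtype=np.float32),
}
print({k: v.shape for k, v in fuse_gate_up_exps(t, fuse=True).items()})
```

A real conversion path would also have to handle models whose expert tensors go through `generate_extra_tensors` (as noted above for gpt-oss) and architectures with differing gate/up shapes, which is part of why limiting the initial scope to a few verified models was suggested in the discussion.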