
Conversation

@am17an (Collaborator) commented Jan 27, 2026

Continuing on #18740 and #18866, this adds the option `--fuse_gate_up_exps` to `convert_hf_to_gguf.py`.
I've just added the gate_up tracking for deepseek2 (GLM 4.7 flash) and gpt-oss, although for gpt-oss we need even more changes (it goes through generate_extra_tensors for generating expert weights). This PR is not complete, as we would need to add this check in all MoE models and their tensors, but I'm putting it out there in any case.
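
Roughly, the idea is to buffer the stacked gate/up expert tensors per layer and, once both halves of a layer have been seen, emit a single concatenated gate_up tensor instead. A minimal sketch (hypothetical class, method and output names, not the PR's exact code):

```python
import torch

class GateUpExpsFuser:
    """Sketch: collect per-layer stacked expert gate/up tensors and fuse them."""

    def __init__(self) -> None:
        self._pending: dict[int, dict[str, torch.Tensor]] = {}

    def feed(self, bid: int, kind: str, data: torch.Tensor) -> list[tuple[str, torch.Tensor]]:
        # kind is "gate" or "up"; returns whatever is ready to be written out
        layer = self._pending.setdefault(bid, {})
        layer[kind] = data
        if "gate" not in layer or "up" not in layer:
            return []  # hold this half until the other one shows up
        gate = layer.pop("gate")  # assumed shape: [n_expert, n_ff, n_embd]
        up = layer.pop("up")      # assumed shape: [n_expert, n_ff, n_embd]
        # concatenate along the feed-forward axis so a single matmul produces [gate | up]
        fused = torch.cat((gate, up), dim=1)
        return [(f"blk.{bid}.ffn_gate_up_exps.weight", fused)]
```

In the PR itself the tracking is wired through ModelBase.modify_tensors (see the discussion below) and only kicks in when --fuse_gate_up_exps is passed.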

On a 5090:

Master

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp2048 | 6269.31 ± 11.73 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp4096 | 5833.01 ± 11.12 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp8192 | 5137.33 ± 16.16 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | tg128 | 168.98 ± 0.68 |

this PR (with the fused GGUF)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp2048 | 6846.90 ± 10.22 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp4096 | 6303.35 ± 15.73 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp8192 | 5502.23 ± 9.50 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | tg128 | 167.92 ± 0.75 |

@CISC (Collaborator) commented Jan 27, 2026

> although for gpt-oss we need even more changes (it goes through generate_extra_tensors for generating expert weights).

Ah, should probably also go through those to ensure they are all generators and yielding from base then.
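
Concretely, the pattern being asked for looks roughly like this (a minimal sketch with stand-in classes, not code lifted from convert_hf_to_gguf.py):

```python
from typing import Iterable
import torch

class ModelBase:  # stand-in for the converter's base class, for illustration only
    def generate_extra_tensors(self) -> Iterable[tuple[str, torch.Tensor]]:
        # the base implementation may synthesize shared tensors (e.g. fused gate/up)
        yield from ()

class SomeMoEModel(ModelBase):
    def generate_extra_tensors(self) -> Iterable[tuple[str, torch.Tensor]]:
        # model-specific synthesized tensors first...
        yield ("blk.0.extra.weight", torch.zeros(1))  # placeholder tensor
        # ...then always defer to the base so shared logic is not bypassed
        yield from super().generate_extra_tensors()
```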

am17an and others added 2 commits January 27, 2026 23:12

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>

@am17an force-pushed the exp_weight_merge_glm branch from c846c0b to 93c09b0 on January 27, 2026 17:11
@ngxson (Collaborator) commented Jan 27, 2026

Still, I think there are some problems with the conversion script:

  • For gpt-oss, the model already comes with gate_up fused. It's split and repacked inside generate_extra_tensors, so in theory up/gate won't be passed to modify_tensors. Instead, the GptOssModel.generate_extra_tensors() method must be aware of whether fuse_gate_up_exps is enabled or not, and repack accordingly without going through modify_tensors (which is known to do a dequant --> requant round trip).
  • The cpp code currently only supports gpt-oss and deepseek2, so I think the conversion script must throw an error for any other model. This is to avoid users producing a non-working GGUF (a sketch of such a guard follows below).
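
A minimal sketch of the guard in the second point (the helper name and set are assumptions; the two architectures listed are the ones this thread names as supported on the C++ side):

```python
# Hypothetical guard for convert_hf_to_gguf.py: refuse to fuse when the target
# architecture has no fused gate_up support in llama.cpp yet.
FUSE_GATE_UP_SUPPORTED_ARCHES = {"deepseek2", "gpt-oss"}  # per this thread

def check_fuse_gate_up_exps(arch: str, fuse_gate_up_exps: bool) -> None:
    if fuse_gate_up_exps and arch not in FUSE_GATE_UP_SUPPORTED_ARCHES:
        raise ValueError(
            f"--fuse_gate_up_exps is not supported for architecture {arch!r}; "
            "the resulting GGUF would not load in llama.cpp"
        )
```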

@am17an (Collaborator, Author) commented Jan 28, 2026

I tried to address 2) above. Full disclosure: I used AI for this, to build the list of MoE models and make all the edits.

For gpt-oss, I don't know what the correct way to do it is; one way is what I added in #18740, and I can add the same here.

@CISC (Collaborator) commented Jan 28, 2026

> I tried to address 2) above.

You must update all arches in ~~tensor_mapping.py~~ constants.py too, otherwise conversion will fail (so, really, this point was already addressed :)).

@am17an (Collaborator, Author) commented Jan 28, 2026

> You must update all arches in tensor_mapping.py too

I don't understand what you mean. Please assume I have no experience with conversion.

@CISC (Collaborator) commented Jan 28, 2026

> > You must update all arches in tensor_mapping.py too
>
> I don't understand what you mean. Please assume I have no experience with conversion.

Sorry, I meant `constants.py`, since we changed it to use `format_tensor_name`:

```python
def format_tensor_name(self, key: gguf.MODEL_TENSOR, bid: int | None = None, suffix: str = ".weight") -> str:
    if key not in gguf.MODEL_TENSORS[self.model_arch]:
        raise ValueError(f"Missing {key!r} for MODEL_TENSORS of {self.model_arch!r}")
```
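
In concrete terms, "updating an arch in constants.py" means listing the fused tensor for that architecture so the check above doesn't raise. An excerpt-style sketch, assuming the fused enum is named MODEL_TENSOR.FFN_GATE_UP_EXP (a hypothetical name here):

```python
# gguf-py/gguf/constants.py (illustrative excerpt; only the last entry is new)
MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
    MODEL_ARCH.DEEPSEEK2: [
        # ...existing tensors...
        MODEL_TENSOR.FFN_GATE_EXP,
        MODEL_TENSOR.FFN_UP_EXP,
        MODEL_TENSOR.FFN_GATE_UP_EXP,  # fused entry, needed for format_tensor_name()
    ],
    # ...every other architecture that should allow --fuse_gate_up_exps...
}
```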

@ngxson (Collaborator) commented Jan 28, 2026

IMO it's probably better to find a way to generalize this, rather than applying the change to each model. Also, maybe only apply it to a limited number of models first, then see if it actually works.

What I'm worried about is that contributors may not be aware of this change and some models will eventually be left without support for it. Also, it adds a bit more boilerplate to new models' code. We should find a way to establish a rule that new models should use merged gate/up by default.

@am17an (Collaborator, Author) commented Jan 28, 2026

My initial PR #18740 did this for only 1 model. The discussion was that it should be done for all models, and for that we needed to call ModelBase::modify_tensors from each model's modify_tensors; see #18866, #19084 and #19091, which lay the groundwork for doing this. This PR also initially did it for 1 model (GLM 4.7 flash), but it would be weird to support a flag for only one model, and I can foresee people asking for support for other models.

> Also, it adds a bit more boilerplate to new models' code.

The llama-arch.cpp file is full of boilerplate code; I don't see what this PR changes. A function can be made to abstract the if statements, if that's what you mean. I was planning to do that for the Q,K,V merge anyway.

With that said, I don't care whether all models get support, especially since it involves a lot of code changes which aren't exactly fun or foolproof. My aim is to extract maximum performance out of a select few models which are proven to have many users, like gpt-oss, qwen3 and now GLM 4.7 flash. So the solution I see is this PR with some archs which have been verified to work; the rest can either fail through constants.py, or we can hardcode the supported arch set.

@ngxson (Collaborator) commented Jan 28, 2026

> The llama-arch.cpp file is full of boilerplate code; I don't see what this PR changes. A function can be made to abstract the if statements, if that's what you mean. I was planning to do that for the Q,K,V merge anyway.

Most of them are not really boilerplate code. For example, you may see most models copy the same FFN/FFN_EXPS code, but that is because the tensor dimensions must be specified, and some models can indeed have different shapes for them; this is entirely up to the original model (or the people who created it) to decide. So they are not actually boilerplate code, they are configuration.

The definitions of merged gate/up and QKV are not the same kind of thing, because they can be inferred from the existing code. In terms of functional programming: you can have a pure function that transforms the unmerged "definition" into the merged one.

My suggestion is that we avoid adding llama.cpp-specific code paths like this into the model "definition". Instead, it can be implemented as a transformation inside create_tensor; the logic would be roughly:

  • if the tensor name tn is gate or up, see whether we have already added the other one (up or gate)
  • if yes, look up in the list of tensors whether we have a merged gate_up tensor
  • if no, treat it as normal

The same logic can be reused for merged QKV. We can potentially generalize it further so that more merged FFN and merged QKV cases reuse exactly the same code path with minor modifications.

Otherwise, if we only support the 3 models as you said:

> My aim is to extract maximum performance out of a select few models which are proven to have many users, like gpt-oss, qwen3 and now GLM 4.7 flash.

I (more) agree with this, because it makes things easier to test and to track down issues. As pointed out in my earlier comments, there were some issues regarding gpt-oss, so it makes more sense to scale down the scope of this PR and put more effort into testing it thoroughly.

@ngxson (Collaborator) commented Jan 28, 2026

> My suggestion is that we avoid adding llama.cpp-specific code paths like this into the model "definition". Instead, it can be implemented as a transformation inside create_tensor

Otherwise, another way to do it is to abstract the loading of the FFN into a new function, create_tensor_ffn. This can already significantly reduce the amount of code, while making it more aligned with build_ffn.

@am17an (Collaborator, Author) commented Jan 28, 2026

> The definitions of merged gate/up and QKV are not the same kind of thing, because they can be inferred from the existing code. In terms of functional programming: you can have a pure function that transforms the unmerged "definition" into the merged one.

Not sure what you mean. We can merge Q, K, V into one big matrix, and the same for gate and up. I'm not able to see the difference, and even your proposed implementation uses the same path for both.

> Otherwise, if we only support the 3 models as you said:

Re the create_tensors part, I'm not willing to take an entirely different implementation approach from the current PR. So we can limit the scope to the 2-3 models and move on.

@am17an (Collaborator, Author) commented Jan 28, 2026

And btw, this pattern already exists in llama-model.cpp: check for the fused tensor and fall back if it's not present.

llama.cpp/src/llama-model.cpp, lines 3297 to 3309 in 362bff9:

```cpp
layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", i), {n_embd, n_embd + 2*n_embd_gqa}, TENSOR_NOT_REQUIRED);
layer.bqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "bias",   i), {n_embd + 2*n_embd_gqa}, TENSOR_NOT_REQUIRED);

if (!layer.wqkv) {
    layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "weight", i), {n_embd, n_embd},     0);
    layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias",   i), {n_embd},             0);
    layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", i), {n_embd, n_embd_gqa}, 0);
    layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias",   i), {n_embd_gqa},         0);
    layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", i), {n_embd, n_embd_gqa}, 0);
    layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias",   i), {n_embd_gqa},         0);
}
```

@CISC (Collaborator) commented Jan 28, 2026

> And btw, this pattern already exists in llama-model.cpp: check for the fused tensor and fall back if it's not present.

A create_tensor_ffn/create_tensor_qkv sounds like a good idea, though since it would have to update layers[i] directly, that would probably not be a good name for it.

Edit: Either way, leave it for a follow-up.

@ngxson (Collaborator) commented Jan 28, 2026

> Not sure what you mean. We can merge Q, K, V into one big matrix, and the same for gate and up. I'm not able to see the difference, and even your proposed implementation uses the same path for both.

What I meant was that when writing the tensor loading code, create_tensor is merely there to describe the shape of the QKV tensors and some additional flags.

The detail of whether they are merged or not should be abstracted away, because it has nothing to do with the model definition itself.

In terms of functional programming, it doesn't make sense to write something like:

```
q = load_q(n_embd, n_dim_k)
k = load_k(n_embd, n_dim_k)
v = load_v(n_embd, n_dim_v)
if (!q && !k && !v)
    qkv = load_qkv(n_embd * 3, n_dim_qkv)
```

Instead, load_qkv can already be derived from load_q / load_k / load_v; duplicating the code is redundant. It can be abstracted away, into something like create_tensor_qkv.

It doesn't change the implementation under the hood, but the point is this: the model definition should ONLY convey what makes up the model, NOT what we added to make it faster.
