
Conversation

@am17an (Collaborator) commented Jan 27, 2026

Continuing on #18740 and #18866, this adds the option `--fuse_gate_up_exps` to `convert_hf_to_gguf.py`.
I've just added the gate_up tracking for deepseek2 (GLM 4.7 flash) and gpt-oss, although for gpt-oss we need even more changes (it goes through generate_extra_tensors for generating expert weights). This PR is not complete, as we would need to add this check in all MoE models and their tensors, but I'm putting it out there in any case.
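
Roughly, the idea is to buffer the stacked gate/up expert tensors per layer and, once both halves of a layer have been seen, emit a single concatenated gate_up tensor instead. A minimal sketch (hypothetical class, method and output names, not the PR's exact code):

```python
import torch

class GateUpExpsFuser:
    """Sketch: collect per-layer stacked expert gate/up tensors and fuse them."""

    def __init__(self) -> None:
        self._pending: dict[int, dict[str, torch.Tensor]] = {}

    def feed(self, bid: int, kind: str, data: torch.Tensor) -> list[tuple[str, torch.Tensor]]:
        # kind is "gate" or "up"; returns whatever is ready to be written out
        layer = self._pending.setdefault(bid, {})
        layer[kind] = data
        if "gate" not in layer or "up" not in layer:
            return []  # hold this half until the other one shows up
        gate = layer.pop("gate")  # assumed shape: [n_expert, n_ff, n_embd]
        up = layer.pop("up")      # assumed shape: [n_expert, n_ff, n_embd]
        # concatenate along the feed-forward axis so a single matmul produces [gate | up]
        fused = torch.cat((gate, up), dim=1)
        return [(f"blk.{bid}.ffn_gate_up_exps.weight", fused)]
```

In the PR itself the tracking is wired through ModelBase.modify_tensors (see the discussion below) and only kicks in when --fuse_gate_up_exps is passed.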

On a 5090:

Master

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp2048 | 6269.31 ± 11.73 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp4096 | 5833.01 ± 11.12 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp8192 | 5137.33 ± 16.16 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | tg128 | 168.98 ± 0.68 |

this PR (with the fused GGUF)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp2048 | 6846.90 ± 10.22 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp4096 | 6303.35 ± 15.73 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | pp8192 | 5502.23 ± 9.50 |
| deepseek2 30B.A3B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | 1 | tg128 | 167.92 ± 0.75 |

@CISC (Collaborator) commented Jan 27, 2026

> although for gpt-oss we need even more changes (it goes through generate_extra_tensors for generating expert weights).

Ah, should probably also go through those to ensure they are all generators and yielding from base then.
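
Concretely, the pattern being asked for looks roughly like this (a minimal sketch with stand-in classes, not code lifted from convert_hf_to_gguf.py):

```python
from typing import Iterable
import torch

class ModelBase:  # stand-in for the converter's base class, for illustration only
    def generate_extra_tensors(self) -> Iterable[tuple[str, torch.Tensor]]:
        # the base implementation may synthesize shared tensors (e.g. fused gate/up)
        yield from ()

class SomeMoEModel(ModelBase):
    def generate_extra_tensors(self) -> Iterable[tuple[str, torch.Tensor]]:
        # model-specific synthesized tensors first...
        yield ("blk.0.extra.weight", torch.zeros(1))  # placeholder tensor
        # ...then always defer to the base so shared logic is not bypassed
        yield from super().generate_extra_tensors()
```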

am17an and others added 2 commits January 27, 2026 23:12

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>

@am17an force-pushed the exp_weight_merge_glm branch from c846c0b to 93c09b0 on January 27, 2026 17:11
@ngxson (Collaborator) commented Jan 27, 2026

Still, I think there are some problems with the conversion script:

  • For gpt-oss, the model already comes with gate_up fused. It's split and repacked inside generate_extra_tensors, so in theory up/gate won't be passed to modify_tensors. Instead, the GptOssModel.generate_extra_tensors() method must be aware of whether fuse_gate_up_exps is enabled or not, and repack accordingly without going through modify_tensors (which is known to do a dequant --> requant round trip).
  • The cpp code currently only supports gpt-oss and deepseek2, so I think the conversion script must throw an error for any other model. This is to avoid users producing a non-working GGUF (a sketch of such a guard follows below).
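
A minimal sketch of the guard in the second point (the helper name and set are assumptions; the two architectures listed are the ones this thread names as supported on the C++ side):

```python
# Hypothetical guard for convert_hf_to_gguf.py: refuse to fuse when the target
# architecture has no fused gate_up support in llama.cpp yet.
FUSE_GATE_UP_SUPPORTED_ARCHES = {"deepseek2", "gpt-oss"}  # per this thread

def check_fuse_gate_up_exps(arch: str, fuse_gate_up_exps: bool) -> None:
    if fuse_gate_up_exps and arch not in FUSE_GATE_UP_SUPPORTED_ARCHES:
        raise ValueError(
            f"--fuse_gate_up_exps is not supported for architecture {arch!r}; "
            "the resulting GGUF would not load in llama.cpp"
        )
```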

@am17an (Collaborator, Author) commented Jan 28, 2026

I tried to address 2) above. Full disclosure: I used AI for this, to build the list of MoE models and make all the edits.

For gpt-oss, I don't know what the correct way to do it is; one way is what I added in #18740, and I can add the same here.

@CISC (Collaborator) commented Jan 28, 2026

> I tried to address 2) above.

You must update all arches in ~~tensor_mapping.py~~ constants.py too, otherwise conversion will fail (so, really, this point was already addressed :)).

@am17an (Collaborator, Author) commented Jan 28, 2026

> You must update all arches in tensor_mapping.py too

I don't understand what you mean. Please assume I have no experience with conversion.

@CISC (Collaborator) commented Jan 28, 2026

> > You must update all arches in tensor_mapping.py too
>
> I don't understand what you mean. Please assume I have no experience with conversion.

Sorry, I meant `constants.py`, since we changed it to use `format_tensor_name`:

```python
def format_tensor_name(self, key: gguf.MODEL_TENSOR, bid: int | None = None, suffix: str = ".weight") -> str:
    if key not in gguf.MODEL_TENSORS[self.model_arch]:
        raise ValueError(f"Missing {key!r} for MODEL_TENSORS of {self.model_arch!r}")
```
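
In concrete terms, "updating an arch in constants.py" means listing the fused tensor for that architecture so the check above doesn't raise. An excerpt-style sketch, assuming the fused enum is named MODEL_TENSOR.FFN_GATE_UP_EXP (a hypothetical name here):

```python
# gguf-py/gguf/constants.py (illustrative excerpt; only the last entry is new)
MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
    MODEL_ARCH.DEEPSEEK2: [
        # ...existing tensors...
        MODEL_TENSOR.FFN_GATE_EXP,
        MODEL_TENSOR.FFN_UP_EXP,
        MODEL_TENSOR.FFN_GATE_UP_EXP,  # fused entry, needed for format_tensor_name()
    ],
    # ...every other architecture that should allow --fuse_gate_up_exps...
}
```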

@ngxson (Collaborator) commented Jan 28, 2026

IMO it's probably better to find a way to generalize this, rather than applying the change to each model. Also, maybe only apply it to a limited number of models first, then see if it actually works.

What I'm worried about is that contributors may not be aware of this change and some models will eventually be left without support for it. Also, it adds a bit more boilerplate to new models' code. We should find a way to establish a rule that new models should use merged gate/up by default.

@am17an (Collaborator, Author) commented Jan 28, 2026

My initial PR #18740 did this for only 1 model. The discussion was that it should be done for all models, and for that we needed to call ModelBase::modify_tensors from each model's modify_tensors; see #18866, #19084 and #19091, which lay the groundwork for doing this. This PR also initially did it for 1 model (GLM 4.7 flash), but it would be weird to support a flag for only one model, and I can foresee people asking for support for other models.

> Also, it adds a bit more boilerplate to new models' code.

The llama-arch.cpp file is full of boilerplate code; I don't see what this PR changes. A function can be made to abstract the if statements, if that's what you mean. I was planning to do that for the Q,K,V merge anyway.

With that said, I don't care whether all models get support, especially since it involves a lot of code changes which aren't exactly fun or foolproof. My aim is to extract maximum performance out of a select few models which are proven to have many users, like gpt-oss, qwen3 and now GLM 4.7 flash. So the solution I see is this PR with some archs which have been verified to work; the rest can either fail through constants.py, or we can hardcode the supported arch set.

@ngxson (Collaborator) commented Jan 28, 2026

> The llama-arch.cpp file is full of boilerplate code; I don't see what this PR changes. A function can be made to abstract the if statements, if that's what you mean. I was planning to do that for the Q,K,V merge anyway.

Most of them are not really boilerplate code. For example, you may see most models copy the same FFN/FFN_EXPS code, but that is because the tensor dimensions must be specified, and some models can indeed have different shapes for them; this is entirely up to the original model (or the people who created it) to decide. So they are not actually boilerplate code, they are configuration.

The definitions of merged gate/up and QKV are not the same kind of thing, because they can be inferred from the existing code. In terms of functional programming: you can have a pure function that transforms the unmerged "definition" into the merged one.

My suggestion is that we avoid adding llama.cpp-specific code paths like this into the model "definition". Instead, it can be implemented as a transformation inside create_tensor; the logic would be roughly:

  • if the tensor name tn is gate or up, see whether we have already added the other one (up or gate)
  • if yes, look up in the list of tensors whether we have a merged gate_up tensor
  • if no, treat it as normal

The same logic can be reused for merged QKV. We can potentially generalize it further so that more merged FFN and merged QKV cases reuse exactly the same code path with minor modifications.

Otherwise, if we only support the 3 models as you said:

> My aim is to extract maximum performance out of a select few models which are proven to have many users, like gpt-oss, qwen3 and now GLM 4.7 flash.

I (more) agree with this, because it makes things easier to test and to track down issues. As pointed out in my earlier comments, there were some issues regarding gpt-oss, so it makes more sense to scale down the scope of this PR and put more effort into testing it thoroughly.

@ngxson (Collaborator) commented Jan 28, 2026

> My suggestion is that we avoid adding llama.cpp-specific code paths like this into the model "definition". Instead, it can be implemented as a transformation inside create_tensor

Otherwise, another way to do it is to abstract the loading of the FFN into a new function, create_tensor_ffn. This can already significantly reduce the amount of code, while making it more aligned with build_ffn.

@am17an (Collaborator, Author) commented Jan 28, 2026

> The definitions of merged gate/up and QKV are not the same kind of thing, because they can be inferred from the existing code. In terms of functional programming: you can have a pure function that transforms the unmerged "definition" into the merged one.

Not sure what you mean. We can merge Q, K, V into one big matrix, and the same for gate and up. I'm not able to see the difference, and even your proposed implementation uses the same path for both.

> Otherwise, if we only support the 3 models as you said:

Re the create_tensors part, I'm not willing to take an entirely different implementation approach from the current PR. So we can limit the scope to the 2-3 models and move on.

@am17an (Collaborator, Author) commented Jan 28, 2026

And btw, this pattern already exists in llama-model.cpp: check for the fused tensor and fall back if it's not present.

llama.cpp/src/llama-model.cpp, lines 3297 to 3309 in 362bff9:

```cpp
layer.wqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "weight", i), {n_embd, n_embd + 2*n_embd_gqa}, TENSOR_NOT_REQUIRED);
layer.bqkv = create_tensor(tn(LLM_TENSOR_ATTN_QKV, "bias",   i), {n_embd + 2*n_embd_gqa}, TENSOR_NOT_REQUIRED);

if (!layer.wqkv) {
    layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "weight", i), {n_embd, n_embd},     0);
    layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias",   i), {n_embd},             0);
    layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K, "weight", i), {n_embd, n_embd_gqa}, 0);
    layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias",   i), {n_embd_gqa},         0);
    layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V, "weight", i), {n_embd, n_embd_gqa}, 0);
    layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias",   i), {n_embd_gqa},         0);
}
```

@CISC (Collaborator) commented Jan 28, 2026

> And btw, this pattern already exists in llama-model.cpp: check for the fused tensor and fall back if it's not present.

A create_tensor_ffn/create_tensor_qkv sounds like a good idea, though since it would have to update layers[i] directly, that would probably not be a good name for it.

Edit: Either way, leave it for a follow-up.

@ngxson (Collaborator) commented Jan 28, 2026

> Not sure what you mean. We can merge Q, K, V into one big matrix, and the same for gate and up. I'm not able to see the difference, and even your proposed implementation uses the same path for both.

What I meant was that when writing the tensor loading code, create_tensor is merely there to describe the shape of the QKV tensors and some additional flags.

The detail of whether they are merged or not should be abstracted away, because it has nothing to do with the model definition itself.

In terms of functional programming, it doesn't make sense to write something like:

```
q = load_q(n_embd, n_dim_k)
k = load_k(n_embd, n_dim_k)
v = load_v(n_embd, n_dim_v)
if (!q && !k && !v)
    qkv = load_qkv(n_embd * 3, n_dim_qkv)
```

Instead, load_qkv can already be derived from load_q / load_k / load_v; duplicating the code is redundant. It can be abstracted away, into something like create_tensor_qkv.

It doesn't change the implementation under the hood, but the point is this: the model definition should ONLY convey what makes up the model, NOT what we added to make it faster.
