
Conversation

@Kuangdd01 (Contributor) commented Jan 13, 2026

Description

  • IMO, GLM4_MOE can be treated as a combination of deepseek_v3 and qwen/llama: it keeps the first layer dense, uses one shared expert per sparse layer, and adds a routing bias (route_bias), all with GQA attention. So we just combine the templates of qwen and deepseek_v3 and specify a few arguments such as rope_percent (see the sketch below this list).

  • This modification was verified with a tiny random model on LlamaFactory.

(loss curve from the tiny random model run)
  • training and the merge process both pass ✅
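A minimal sketch of what that combined template amounts to, assuming a plain config object; every field name and value below (rope_percent, first_k_dense_replace, expert counts, etc.) is an illustrative placeholder, not the real template API:

```python
# Hypothetical sketch only: GLM4_MOE as deepseek_v3-style MoE plus qwen/llama-style GQA.
# Field names and default values are placeholders, not the actual template code.
from dataclasses import dataclass


@dataclass
class Glm4MoeTemplateSketch:
    # qwen/llama side: GQA attention with partial rotary embedding.
    num_attention_heads: int = 32
    num_key_value_heads: int = 4      # GQA: fewer KV heads than query heads
    rope_percent: float = 0.5         # only part of each head dim gets RoPE

    # deepseek_v3 side: sparse MoE layers with a routing bias.
    first_k_dense_replace: int = 1    # keep the first decoder layer dense
    n_shared_experts: int = 1         # one shared expert per sparse layer
    n_routed_experts: int = 64        # placeholder count
    use_router_bias: bool = True      # route_bias on the router scores


# The tiny random model used for verification would simply shrink the sizes.
tiny_cfg = Glm4MoeTemplateSketch(
    num_attention_heads=4, num_key_value_heads=2, n_routed_experts=8
)
```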

Help Needed

  • The real model still needs to be tested
  • The MTP module should be addressed after GLM4.5

cc @hiyouga @chocoded

@chocoded (Collaborator) commented:

It appears that the final layer (Layer 46) has a different structure compared to the intermediate layers. This requires some special handling which seems to have been overlooked in the current implementation. Could you please look into this?

[Screenshot 2026-01-13 20:14:33]
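For context, a quick way to see exactly which tensors the final layer adds is to diff its parameter names against an intermediate layer. This is only a diagnostic sketch; the checkpoint path, layer indices, and safetensors layout are assumptions:

```python
# Minimal diagnostic sketch: compare parameter names of the last decoder layer
# (46 in this checkpoint) against an intermediate one to list the extra tensors.
from pathlib import Path

from safetensors import safe_open

ckpt_dir = Path("path/to/GLM4_MOE")  # placeholder path
keys: list[str] = []
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        keys.extend(f.keys())


def layer_keys(idx: int) -> set[str]:
    """Parameter names of decoder layer `idx`, with the layer prefix stripped."""
    prefix = f"model.layers.{idx}."
    return {k[len(prefix):] for k in keys if k.startswith(prefix)}


# Tensors that exist only in the final layer, i.e. the MTP-specific ones.
print(sorted(layer_keys(46) - layer_keys(10)))
```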

@PanAndy (Collaborator) commented Jan 13, 2026

@chocoded

@PanAndy PanAndy requested a review from chocoded January 13, 2026 12:31
@chocoded chocoded self-assigned this Jan 13, 2026
@Kuangdd01 (Contributor, Author) commented:

Sure, the last layer is an MTP-specific layer and we will look into this.

@chocoded chocoded closed this Jan 13, 2026
@chocoded chocoded reopened this Jan 13, 2026
@chocoded (Collaborator) commented:

For reference, you can check out the implementation here: converters and hf_invalid_keys. Could you please update the code to account for this?
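In case it helps, a minimal sketch of the kind of filter that update could add. The suffix names below are illustrative and should come from the actual key diff, and the constant/function names are not the real converters/hf_invalid_keys API:

```python
# Hypothetical sketch: have the converter skip the MTP-only tensors of the final
# GLM4_MOE layer. Suffixes are examples; use the names found in the real checkpoint
# and adapt to however hf_invalid_keys is actually structured in the repo.
MTP_LAYER_INDEX = 46  # the extra layer flagged above; read it from the config in practice

MTP_ONLY_SUFFIXES = (  # illustrative, DeepSeek-V3-style MTP tensor names
    "eh_proj.weight",
    "enorm.weight",
    "hnorm.weight",
)


def is_hf_invalid_key(name: str) -> bool:
    """Return True for HF tensor names the converter should ignore."""
    prefix = f"model.layers.{MTP_LAYER_INDEX}."
    return name.startswith(prefix) and name.endswith(MTP_ONLY_SUFFIXES)
```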
