feat(server): add cross-backend MoE expert compute foundation #375
Draft
weicj wants to merge 2 commits into
5 issues found across 25 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/moe_hybrid_storage.cpp">
<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:509">
P0: Skipping cold expert allocation here breaks the IPC daemon runtime: the offload path still expects cold expert tensors, but this branch leaves them null when `allocate_cold=false`.</violation>
</file>
  // Allocate cold expert tensors on CPU
- if (cold_count > 0) {
+ if (allocate_cold && cold_count > 0) {
P0: Skipping cold expert allocation here breaks the IPC daemon runtime: the offload path still expects cold expert tensors, but this branch leaves them null when allocate_cold=false.
<file context>
@@ -497,7 +506,7 @@ bool build_moe_hybrid_storage_from_file(
// Allocate cold expert tensors on CPU
- if (cold_count > 0) {
+ if (allocate_cold && cold_count > 0) {
ggml_init_params ip{};
ip.mem_size = 16 * ggml_tensor_overhead();
</file context>
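To make the reported failure mode concrete, here is a minimal sketch, not the PR's code. Only `allocate_cold` and `cold_count` come from the diff; `ColdTensor`, `HybridStorage`, `build_storage`, and `offload_cold_expert` are hypothetical names invented for illustration. It assumes the cold slots stay null when allocation is skipped, and shows a consumer that checks for that case instead of dereferencing a null tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a CPU-side cold expert tensor.
struct ColdTensor {
    std::vector<float> data;
};

// Hypothetical storage: one slot per cold expert. Slots stay null when
// allocation is skipped, which mirrors the situation the review describes.
struct HybridStorage {
    std::vector<std::unique_ptr<ColdTensor>> cold;
};

// Mirrors the guarded branch from the diff: cold tensors are only created
// when allocate_cold is true and there is at least one cold expert.
HybridStorage build_storage(std::size_t cold_count, bool allocate_cold) {
    HybridStorage storage;
    storage.cold.resize(cold_count);  // every slot starts as nullptr
    if (allocate_cold && cold_count > 0) {
        for (auto& slot : storage.cold) {
            slot = std::make_unique<ColdTensor>();
        }
    }
    return storage;
}

// Hypothetical consumer: reports failure instead of touching a null tensor.
bool offload_cold_expert(const HybridStorage& storage, std::size_t idx) {
    if (idx >= storage.cold.size() || !storage.cold[idx]) {
        return false;  // cold tensor was never allocated
    }
    // Real offload work on *storage.cold[idx] would go here.
    return true;
}
```

Whether the right fix is a guard like this in the offload path, or always allocating cold tensors when the IPC daemon runtime is in use, depends on code outside the shown context.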
Summary
This PR adds a cross-backend MoE expert compute foundation. It gives MoE backends one shared runtime for selected expert FFN work across local CPU fallback and a backend IPC daemon, instead of making each model backend own its own remote compute lifecycle.
`MoeExpertComputeRuntime` now owns the reusable pieces: runtime startup, IPC/CPU fallback selection, placement fingerprint reuse, and per-layer expert metadata construction. Qwen35MoE and Laguna plug into that runtime by providing model-specific layer descriptors and call sites, so future MoE adapters can reuse the same boundary without copying Qwen/Laguna lifecycle code.
The first runtime use is non-local expert offload through the existing placement model. The same shared runtime boundary also gives later MoE execution modes, including EP/DP/TP-style work, a common place to build on.
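The lifecycle described above (startup, IPC/CPU fallback selection, and placement fingerprint reuse) can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the actual `MoeExpertComputeRuntime` API: `ExpertRuntimeSketch`, `Placement`, `ComputeMode`, and the `ipc_available` flag are all hypothetical, and the fingerprint is assumed to be a plain string.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical execution modes: a backend IPC daemon or local CPU fallback.
enum class ComputeMode { CpuFallback, IpcDaemon };

// Hypothetical placement descriptor; the fingerprint identifies a placement.
struct Placement {
    std::string fingerprint;
};

// Hypothetical runtime wrapper shared by all MoE backends.
class ExpertRuntimeSketch {
public:
    // Starts (or reuses) the runtime for the given placement.
    // ipc_available is an assumed flag saying whether a daemon can be used.
    ComputeMode start(const Placement& placement, bool ipc_available) {
        // Same fingerprint as the running instance: reuse it, no restart.
        if (mode_ && fingerprint_ == placement.fingerprint) {
            return *mode_;
        }
        fingerprint_ = placement.fingerprint;
        mode_ = ipc_available ? ComputeMode::IpcDaemon
                              : ComputeMode::CpuFallback;
        ++start_count_;
        return *mode_;
    }

    // Number of real (non-reused) startups, exposed for inspection.
    int start_count() const { return start_count_; }

private:
    std::optional<ComputeMode> mode_;
    std::string fingerprint_;
    int start_count_ = 0;
};
```

In this sketch a model backend would only build its `Placement` and call `start()`; the decision between daemon and CPU fallback, and the decision to reuse a running instance, stay in the shared wrapper.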
Changes
- `moe-expert-compute` backend IPC mode, generic `DFLASH_MOE_EXPERT_COMPUTE_*` runtime knobs, and remote expert compute client/daemon implementation.
- `MoeExpertCompute` interface, `compute_batch()` prefill path, and `MoeExpertComputeRuntime` lifecycle wrapper.
- `MoeExpertLayer` construction, daemon reuse, and selected-expert global-id mapping into common code.
- `general.architecture`, so each MoE backend only supplies its model-specific metadata hook.
- `free > total` on Strix Halo-style unified memory.
Notes
- `backend-ipc ready mode=moe-expert-compute` and `chat DONE ok=true`.
- `target runtime ready arch=laguna`, `backend-ipc ready mode=moe-expert-compute`, and `chat DONE ok=true`.
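The Changes list mentions a `free > total` condition on Strix Halo-style unified memory. The PR text does not show how it is handled, so the following is only one plausible approach: clamp the reported free amount to the reported total before using it for sizing decisions. `MemInfo` and `sanitize_mem_info` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical snapshot of device memory as reported by a backend.
struct MemInfo {
    std::size_t free_bytes;
    std::size_t total_bytes;
};

// Assumed handling: free memory can never exceed total memory, so an
// inconsistent report is clamped rather than trusted.
MemInfo sanitize_mem_info(MemInfo raw) {
    raw.free_bytes = std::min(raw.free_bytes, raw.total_bytes);
    return raw;
}
```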