feat(server): add cross-backend MoE expert compute foundation #375

Draft
weicj wants to merge 2 commits into Luce-Org:main from weicj:feat-cross-backend-moe-expert-compute


@weicj weicj commented Jun 12, 2026

Collaborator

Summary

This PR adds a cross-backend MoE expert compute foundation. It gives MoE backends one shared runtime that runs selected expert FFN work either on a local CPU fallback or in a backend IPC daemon, instead of making each model backend own its own remote compute lifecycle.

MoeExpertComputeRuntime now owns the reusable pieces: runtime startup, IPC/CPU fallback selection, placement fingerprint reuse, and per-layer expert metadata construction. Qwen35MoE and Laguna plug into that runtime by providing model-specific layer descriptors and call sites, so future MoE adapters can reuse the same boundary without copying Qwen/Laguna lifecycle code.

The first use of the runtime is non-local expert offload through the existing placement model. The same shared runtime boundary also gives later MoE execution modes, including EP/DP/TP-style work, a common place to build on.
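
For reviewers who want a mental model of the boundary described above, here is a minimal sketch of a shared runtime that picks between an IPC daemon and a CPU fallback and reuses its backend while the placement fingerprint is unchanged. Every name in it (`ExpertComputeSketch`, `RuntimeSketch`, `start()`, and so on) is hypothetical and is not the PR's actual `MoeExpertCompute` or `MoeExpertComputeRuntime` API. The "prefer IPC, otherwise CPU" rule is an assumption about how fallback selection might work.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shared expert compute interface.
struct ExpertComputeSketch {
    virtual ~ExpertComputeSketch() = default;
    virtual std::string name() const = 0;
    // Runs the FFN for the selected experts of one layer; false on failure.
    virtual bool compute(int layer, const std::vector<int>& expert_ids) = 0;
};

// Local CPU fallback: always available in this sketch.
struct CpuFallbackSketch : ExpertComputeSketch {
    std::string name() const override { return "cpu"; }
    bool compute(int, const std::vector<int>&) override { return true; }
};

// Remote daemon stand-in: reachability is injected instead of probed.
struct IpcDaemonSketch : ExpertComputeSketch {
    explicit IpcDaemonSketch(bool reachable) : reachable_(reachable) {}
    std::string name() const override { return "ipc"; }
    bool ready() const { return reachable_; }
    bool compute(int, const std::vector<int>&) override { return reachable_; }
private:
    bool reachable_;
};

// Hypothetical lifecycle wrapper: owns startup and backend selection, and
// keeps the current backend while the placement fingerprint is unchanged.
class RuntimeSketch {
public:
    // Returns true when the existing backend was reused.
    bool start(const std::string& placement_fingerprint, bool daemon_reachable) {
        if (backend_ && placement_fingerprint == fingerprint_) {
            return true;  // same placement: no restart
        }
        fingerprint_ = placement_fingerprint;
        auto ipc = std::make_unique<IpcDaemonSketch>(daemon_reachable);
        if (ipc->ready()) {
            backend_ = std::move(ipc);  // assumed preference: IPC first
        } else {
            backend_ = std::make_unique<CpuFallbackSketch>();
        }
        return false;
    }

    ExpertComputeSketch* backend() const { return backend_.get(); }

private:
    std::string fingerprint_;
    std::unique_ptr<ExpertComputeSketch> backend_;
};
```

In this shape a model adapter would only call `start()` with its own fingerprint and then hand per-layer expert ids to `backend()->compute()`, which is the kind of separation the summary describes for Qwen35MoE and Laguna.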

Changes

  • Add the common moe-expert-compute backend IPC mode, generic DFLASH_MOE_EXPERT_COMPUTE_* runtime knobs, and remote expert compute client/daemon implementation.
  • Add the shared MoeExpertCompute interface, compute_batch() prefill path, and MoeExpertComputeRuntime lifecycle wrapper.
  • Move placement fingerprinting, MoeExpertLayer construction, daemon reuse, and selected-expert global-id mapping into common code.
  • Dispatch remote daemon metadata loading by general.architecture, so each MoE backend only supplies its model-specific metadata hook.
  • Add the Qwen35MoE adapter on the shared runtime, routing pipelined decode/prefill through the common path while preserving existing placement behavior.
  • Add the Laguna adapter on the same runtime, routing hybrid prefill/decode through the common path, adding metadata-only Laguna target loading for the remote daemon, and clamping UMA/HIP memory accounting when free > total on Strix Halo-style unified memory.
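
Two items in the list above are easy to picture in isolation: dispatching metadata loading by `general.architecture`, and clamping memory accounting when a unified-memory device reports free > total. The sketch below illustrates both under stated assumptions. `MetadataDispatchSketch`, `MoeMetadataSketch`, and `clamp_free_bytes` are hypothetical names, and the layer and expert counts a hook fills in are placeholders, not values from any real model.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical summary of what a model-specific metadata hook produces.
struct MoeMetadataSketch {
    int n_layers = 0;
    int n_experts = 0;
};

using MetadataHook = std::function<bool(MoeMetadataSketch&)>;

// Hypothetical registry keyed by the GGUF `general.architecture` string.
// Each MoE backend registers one hook; the daemon only looks it up.
class MetadataDispatchSketch {
public:
    void add(const std::string& arch, MetadataHook hook) {
        hooks_[arch] = std::move(hook);
    }

    // Returns false for an unregistered architecture or a failing hook.
    bool load(const std::string& arch, MoeMetadataSketch& out) const {
        auto it = hooks_.find(arch);
        if (it == hooks_.end()) {
            return false;
        }
        return it->second(out);
    }

private:
    std::map<std::string, MetadataHook> hooks_;
};

// Clamp a reported free-memory value so it never exceeds the reported total.
// This mirrors the "free > total" case the PR mentions for unified memory.
inline std::size_t clamp_free_bytes(std::size_t free_bytes, std::size_t total_bytes) {
    return free_bytes > total_bytes ? total_bytes : free_bytes;
}
```

Keeping the lookup in common code means an unknown architecture fails in one place with a clear result, rather than inside each adapter.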

Notes

  • Qwen35MoE adapter: local HIP Pro VII gfx906 parent + CUDA Tesla P4 sm61 remote expert smoke passed for both AR and DFlash on Qwopus3.6-35B-A3B APEX-I-Mini; remote lucebox HIP Strix Halo gfx1151 parent + CUDA RTX 3090 sm86 remote expert smoke also passed. Logs confirm backend-ipc ready mode=moe-expert-compute and chat DONE ok=true.
  • Laguna adapter: local HIP Pro VII gfx906 parent + CUDA Tesla P4 sm61 remote expert smoke passed on Laguna-XS.2 Q4_K_M. Remote lucebox HIP Strix Halo gfx1151 parent + CUDA RTX 3090 sm86 remote expert smoke also passed with a capped parent expert budget, confirming 3137 non-local experts, target runtime ready arch=laguna, backend-ipc ready mode=moe-expert-compute, and chat DONE ok=true.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

5 issues found across 25 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:509">
P0: Skipping cold expert allocation here breaks the IPC daemon runtime: the offload path still expects cold expert tensors, but this branch leaves them null when `allocate_cold=false`.</violation>
</file>



         // Allocate cold expert tensors on CPU
-        if (cold_count > 0) {
+        if (allocate_cold && cold_count > 0) {

P0: Skipping cold expert allocation here breaks the IPC daemon runtime: the offload path still expects cold expert tensors, but this branch leaves them null when allocate_cold=false.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_storage.cpp, line 509:

<comment>Skipping cold expert allocation here breaks the IPC daemon runtime: the offload path still expects cold expert tensors, but this branch leaves them null when `allocate_cold=false`.</comment>

<file context>
@@ -497,7 +506,7 @@ bool build_moe_hybrid_storage_from_file(
 
         // Allocate cold expert tensors on CPU
-        if (cold_count > 0) {
+        if (allocate_cold && cold_count > 0) {
             ggml_init_params ip{};
             ip.mem_size   = 16 * ggml_tensor_overhead();
</file context>
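
One way to make the concern in this comment concrete is a guard that decides, per layer, where cold experts can actually run before any tensor is dereferenced. The following is a sketch only, not the PR's code and not a proposed fix: `LayerStorageSketch`, `ColdPath`, and `choose_cold_path` are hypothetical, and `TensorSketch` stands in for a real tensor type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct TensorSketch {};  // stand-in for a real tensor type

// Hypothetical per-layer view of cold expert storage.
struct LayerStorageSketch {
    int cold_count = 0;
    // Left empty (or holding null entries) when allocation was skipped.
    std::vector<TensorSketch*> cold_experts;
};

enum class ColdPath { None, LocalCpu, Remote, Invalid };

// Decide where cold experts can run without touching a null tensor.
ColdPath choose_cold_path(const LayerStorageSketch& s, bool remote_ready) {
    if (s.cold_count == 0) {
        return ColdPath::None;  // nothing to offload for this layer
    }
    bool have_local =
        s.cold_experts.size() == static_cast<std::size_t>(s.cold_count);
    for (const TensorSketch* t : s.cold_experts) {
        if (t == nullptr) {
            have_local = false;
        }
    }
    if (have_local) {
        return ColdPath::LocalCpu;
    }
    // Cold tensors were not allocated: only a ready remote path is valid.
    return remote_ready ? ColdPath::Remote : ColdPath::Invalid;
}
```

With a check of this kind, the `allocate_cold=false` case surfaces as an explicit `Invalid` result when no remote path is ready, instead of a null access later in the offload path.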

Comment thread server/src/common/moe_expert_compute.cpp
Comment thread server/src/common/moe_expert_compute_cpu.cpp
Comment thread server/test/test_server_unit.cpp
Comment thread server/src/common/moe_expert_compute_ipc.cpp Outdated
@weicj weicj marked this pull request as draft June 12, 2026 16:55