feat(server): add cross-backend MoE expert compute foundation by weicj · Pull Request #375 · Luce-Org/lucebox-hub

weicj · 2026-06-12T12:31:00Z

Summary

This PR adds a cross-backend MoE expert compute foundation. It gives MoE backends one shared runtime for selected expert FFN work across local CPU fallback and a backend IPC daemon, instead of making each model backend own its own remote compute lifecycle.

MoeExpertComputeRuntime now owns the reusable pieces: runtime startup, IPC/CPU fallback selection, placement fingerprint reuse, and per-layer expert metadata construction. Qwen35MoE and Laguna plug into that runtime by providing model-specific layer descriptors and call sites, so future MoE adapters can reuse the same boundary without copying Qwen/Laguna lifecycle code.

The first use of the runtime is non-local expert offload through the existing placement model. The same shared runtime boundary also gives later MoE execution modes, including expert-, data-, and tensor-parallel (EP/DP/TP) style work, a common place to build on.
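To make the ownership split above concrete, here is a minimal sketch of a shared runtime that picks an IPC or CPU backend once and reuses it while the placement fingerprint is unchanged. Every name in it (`ExpertBatch`, `ExpertCompute`, `CpuFallbackCompute`, `IpcDaemonCompute`, `ExpertComputeRuntime`) is a hypothetical stand-in, and the `compute_batch()` signature is assumed; the PR's real declarations may look quite different.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// All names below are hypothetical stand-ins, not the PR's real declarations.
struct ExpertBatch {
    int layer = 0;
    std::vector<std::int32_t> expert_ids;  // experts selected by the router
};

// Common boundary: every backend answers the same compute_batch() call.
class ExpertCompute {
public:
    virtual ~ExpertCompute() = default;
    virtual std::string name() const = 0;
    virtual bool compute_batch(const ExpertBatch& batch) = 0;
};

class CpuFallbackCompute : public ExpertCompute {
public:
    std::string name() const override { return "cpu-fallback"; }
    bool compute_batch(const ExpertBatch& batch) override {
        return !batch.expert_ids.empty();  // placeholder for local FFN work
    }
};

class IpcDaemonCompute : public ExpertCompute {
public:
    std::string name() const override { return "ipc-daemon"; }
    bool compute_batch(const ExpertBatch& batch) override {
        return !batch.expert_ids.empty();  // placeholder for a daemon round trip
    }
};

// The runtime owns startup, backend selection, and fingerprint-based reuse,
// so model adapters only build batches and call compute_batch().
class ExpertComputeRuntime {
public:
    void start(bool ipc_available, const std::string& placement_fingerprint) {
        // Same placement as last time: keep the existing backend.
        if (backend_ && fingerprint_ == placement_fingerprint) return;
        fingerprint_ = placement_fingerprint;
        if (ipc_available) backend_ = std::make_unique<IpcDaemonCompute>();
        else backend_ = std::make_unique<CpuFallbackCompute>();
        ++start_count_;
    }
    bool compute_batch(const ExpertBatch& batch) {
        assert(backend_ && "start() must be called first");
        return backend_->compute_batch(batch);
    }
    std::string backend_name() const { return backend_ ? backend_->name() : "none"; }
    int start_count() const { return start_count_; }

private:
    std::unique_ptr<ExpertCompute> backend_;
    std::string fingerprint_;
    int start_count_ = 0;
};
```

In this sketch an adapter would call `start()` with its current placement fingerprint before each run; a repeated call with the same fingerprint is a no-op, which is one plausible reading of "placement fingerprint reuse".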

Changes

Add the common moe-expert-compute backend IPC mode, generic DFLASH_MOE_EXPERT_COMPUTE_* runtime knobs, and remote expert compute client/daemon implementation.
Add the shared MoeExpertCompute interface, compute_batch() prefill path, and MoeExpertComputeRuntime lifecycle wrapper.
Move placement fingerprinting, MoeExpertLayer construction, daemon reuse, and selected-expert global-id mapping into common code.
Dispatch remote daemon metadata loading by general.architecture, so each MoE backend only supplies its model-specific metadata hook.
Add the Qwen35MoE adapter on the shared runtime, routing pipelined decode/prefill through the common path while preserving existing placement behavior.
Add the Laguna adapter on the same runtime, routing hybrid prefill/decode through the common path, adding metadata-only Laguna target loading for the remote daemon, and clamping UMA/HIP memory accounting when free > total on Strix Halo-style unified memory.
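Two of the changes listed above lend themselves to a short illustration: dispatching metadata loading by the `general.architecture` string, and clamping memory accounting when a device reports free > total. The sketch below assumes a simple registry keyed by architecture name; `MetadataRegistry`, `ExpertMetadata`, `MetadataHook`, and `clamp_free_bytes` are hypothetical names, not identifiers from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: if a unified-memory device reports free > total,
// cap the free value so later budget arithmetic stays within total.
inline std::size_t clamp_free_bytes(std::size_t free_bytes, std::size_t total_bytes) {
    return std::min(free_bytes, total_bytes);
}

// Hypothetical metadata record; the real structure is not shown in the PR text.
struct ExpertMetadata {
    std::string arch;
    int n_layers = 0;
};

using MetadataHook = std::function<ExpertMetadata()>;

// Registry keyed by the model's general.architecture value. Each MoE backend
// registers one hook; the daemon looks the hook up instead of hard-coding models.
class MetadataRegistry {
public:
    void add(const std::string& arch, MetadataHook hook) {
        hooks_[arch] = std::move(hook);
    }
    // Returns false when no backend registered a hook for this architecture.
    bool load(const std::string& arch, ExpertMetadata& out) const {
        auto it = hooks_.find(arch);
        if (it == hooks_.end()) return false;
        out = it->second();
        return true;
    }

private:
    std::map<std::string, MetadataHook> hooks_;
};
```

A caller would register one hook per supported architecture at startup (for example under the key `laguna`, the arch value that appears in the logs quoted in the Notes) and treat a `false` return from `load()` as "unsupported model" rather than crashing.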

Notes

Qwen35MoE adapter: local HIP Pro VII gfx906 parent + CUDA Tesla P4 sm61 remote expert smoke passed for both AR and DFlash on Qwopus3.6-35B-A3B APEX-I-Mini; remote lucebox HIP Strix Halo gfx1151 parent + CUDA RTX 3090 sm86 remote expert smoke also passed. Logs confirm backend-ipc ready mode=moe-expert-compute and chat DONE ok=true.
Laguna adapter: local HIP Pro VII gfx906 parent + CUDA Tesla P4 sm61 remote expert smoke passed on Laguna-XS.2 Q4_K_M. Remote lucebox HIP Strix Halo gfx1151 parent + CUDA RTX 3090 sm86 remote expert smoke also passed with a capped parent expert budget, confirming 3137 non-local experts, target runtime ready arch=laguna, backend-ipc ready mode=moe-expert-compute, and chat DONE ok=true.

cubic-dev-ai

5 issues found across 25 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_storage.cpp">

<violation number="1" location="server/src/common/moe_hybrid_storage.cpp:509">
P0: Skipping cold expert allocation here breaks the IPC daemon runtime: the offload path still expects cold expert tensors, but this branch leaves them null when `allocate_cold=false`.</violation>
</file>


cubic-dev-ai · 2026-06-12T12:51:29Z


        // Allocate cold expert tensors on CPU
-        if (cold_count > 0) {
+        if (allocate_cold && cold_count > 0) {


P0: Skipping cold expert allocation here breaks the IPC daemon runtime: the offload path still expects cold expert tensors, but this branch leaves them null when allocate_cold=false.
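The failure mode described in this comment can be sketched in isolation. The example below assumes a storage object whose cold tensors stay null when allocation is skipped, and shows an offload entry point that checks for that state instead of dereferencing a null pointer. `HybridStorage`, `ColdExperts`, `build_storage`, `run_offload`, and `OffloadStatus` are hypothetical names for illustration only; whether the real offload path needs such a guard depends on code not shown here.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for the CPU-side cold expert tensors.
struct ColdExperts {
    std::vector<float> weights;
};

struct HybridStorage {
    int cold_count = 0;
    std::unique_ptr<ColdExperts> cold;  // stays null when allocation is skipped
};

// Mirrors the shape of the changed condition: allocate only when asked to.
HybridStorage build_storage(int cold_count, bool allocate_cold) {
    HybridStorage s;
    s.cold_count = cold_count;
    if (allocate_cold && cold_count > 0) {
        s.cold = std::make_unique<ColdExperts>();
        s.cold->weights.resize(static_cast<std::size_t>(cold_count));
    }
    return s;
}

enum class OffloadStatus { Ok, NothingToDo, MissingColdTensors };

// A consumer that expects cold tensors must handle the skipped-allocation case
// explicitly, otherwise it would dereference a null pointer.
OffloadStatus run_offload(const HybridStorage& s) {
    if (s.cold_count == 0) return OffloadStatus::NothingToDo;
    if (!s.cold) return OffloadStatus::MissingColdTensors;
    return OffloadStatus::Ok;
}
```

With this shape, a caller that passes `allocate_cold=false` gets an explicit `MissingColdTensors` status and can route the work elsewhere, rather than failing inside the offload path.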

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/common/moe_hybrid_storage.cpp, line 509:
<comment>Skipping cold expert allocation here breaks the IPC daemon runtime: the offload path still expects cold expert tensors, but this branch leaves them null when `allocate_cold=false`.</comment>
<file context>
@@ -497,7 +506,7 @@ bool build_moe_hybrid_storage_from_file(
         // Allocate cold expert tensors on CPU
-        if (cold_count > 0) {
+        if (allocate_cold && cold_count > 0) {
             ggml_init_params ip{};
             ip.mem_size = 16 * ggml_tensor_overhead();
</file context>

feat(server): add cross-backend MoE expert compute foundation

25c4260

cubic-dev-ai Bot reviewed Jun 12, 2026


fix(server): address MoE expert compute review feedback

e4e0d8d

weicj marked this pull request as draft June 12, 2026 16:55

weicj wants to merge 2 commits into Luce-Org:main from weicj:feat-cross-backend-moe-expert-compute
