
fix(server): restore qwen35moe hybrid prefix snapshots#362

Merged
davide221 merged 1 commit into Luce-Org:main from weicj:fix-qwen35moe-hybrid-prefix-restore
Jun 10, 2026

@weicj weicj commented Jun 10, 2026

Summary

This PR fixes prefix snapshot restore for the qwen35moe hybrid split-load path.

When hybrid placement is active, some MoE experts stay on GPU while cold experts are placed in RAM. Previously, this path did not actually restore prefix snapshots in restore_and_generate_impl(): it fell back to a full generate_impl() call, so a prefix/disk cache hit could still re-prefill from token 0.

This PR adds real restore support for the hybrid path:

  • Restore target KV / recurrent state from an existing prefix snapshot.
  • Replay only the prompt delta starting at the snapshot position.
  • Save a scoped snapshot only when the current request prefills exactly to the selected boundary, so that prompt-tail state is never saved under an earlier boundary.
  • Keep temperature-sampling restore on the full-generate fallback for correctness; the validated disk-prefix path uses greedy decoding.
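
The restore rules above can be sketched as a small decision function. This is an illustrative sketch only, not the code in this PR: `Plan`, `plan_prefill`, and their fields are hypothetical names, and the check that the snapshot must be strictly shorter than the prompt is an assumption made so that at least one token is replayed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    mode: str         # "restore" or "full_generate"
    start_pos: int    # first prompt position that still needs prefill
    delta: List[int]  # prompt tokens that must be replayed


def plan_prefill(prompt: List[int], snap_pos: Optional[int], temperature: float) -> Plan:
    """Decide between snapshot restore and the full-generate fallback (hypothetical helper)."""
    # No usable snapshot: prefill the whole prompt from token 0.
    if snap_pos is None:
        return Plan("full_generate", 0, list(prompt))
    # Temperature sampling stays on the full-generate fallback for correctness.
    if temperature > 0.0:
        return Plan("full_generate", 0, list(prompt))
    # Assumption: the snapshot must cover a strict prefix of the prompt.
    if snap_pos <= 0 or snap_pos >= len(prompt):
        return Plan("full_generate", 0, list(prompt))
    # Greedy decoding with a valid snapshot: replay only prompt[snap_pos:].
    return Plan("restore", snap_pos, list(prompt[snap_pos:]))
```

With the token counts reported under Validation, `plan_prefill(list(range(2251)), 2227, 0.0)` would choose `"restore"` and replay 24 tokens instead of 2251.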

Changes

  • Adds a protected snapshot restore helper in the qwen35 backend so the qwen35moe subclass can reuse the existing snapshot storage.
  • Allows qwen35moe hybrid generation to save a scoped snapshot at an exact boundary.
  • Allows qwen35moe hybrid restore to:
    • restore the snapshot into the target cache;
    • replay req.prompt[snap_pos:] delta tokens;
    • continue decode through the existing pipelined hybrid decode path.
  • Limits the behavioral change to qwen35moe hybrid split-load; restore/generate paths for other model backends are unchanged.
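
The exact-boundary save rule can be illustrated with a toy snapshot store. This is a sketch under stated assumptions, not the backend's storage code: `SnapshotStore`, `save_scoped`, and the `state_pos` argument are hypothetical, and the cache state is represented by an opaque bytes value.

```python
from typing import Dict


class SnapshotStore:
    """Toy in-memory store keyed by prefix length (hypothetical)."""

    def __init__(self) -> None:
        self.snapshots: Dict[int, bytes] = {}

    def save_scoped(self, boundary: int, state_pos: int, state: bytes) -> bool:
        """Save `state` under `boundary` only if it describes exactly that prefix."""
        # If the cache has already advanced past the boundary, the state
        # includes the prompt tail and must not be stored as the shorter prefix.
        if boundary <= 0 or state_pos != boundary:
            return False
        self.snapshots[boundary] = state
        return True


store = SnapshotStore()
saved_exact = store.save_scoped(boundary=2227, state_pos=2227, state=b"kv@2227")
saved_tail = store.save_scoped(boundary=2227, state_pos=2251, state=b"kv@2251")
```

Here `saved_exact` is `True` and `saved_tail` is `False`, so the entry for 2227 still holds the state captured at position 2227 rather than the later tail state.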

Validation

Real qwen35moe hybrid live probe:

  • Runtime: mixed hot/cold expert placement, max_ctx=65536
  • Prompt: 2251 tokens

Result:

  • Cold first prefill: 42.7s
  • Scoped snapshot: 2227 tokens, 118.5 MB
  • Same-process restore: disk_hit=true, prefill 1.5s
  • Cold-start disk hit after server restart:
    • scanned 1 files, 118.5 MB
    • [disk-cache] hit policy=auto:30 len=2227 slot=63 pos=2227
    • disk_hit=true
    • prefill 1.6s
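
For scale, the figures above can be turned into a replayed-token count and rough speedup ratios. The snippet only does arithmetic on the numbers reported in this PR; the variable names are illustrative.

```python
# Figures copied from the validation results above.
prompt_tokens = 2251
snapshot_tokens = 2227
cold_prefill_s = 42.7
same_process_prefill_s = 1.5
after_restart_prefill_s = 1.6

# Tokens that still need prefill after restoring the snapshot.
delta_tokens = prompt_tokens - snapshot_tokens

# Wall-clock prefill ratios relative to the cold first prefill.
speedup_same_process = cold_prefill_s / same_process_prefill_s
speedup_after_restart = cold_prefill_s / after_restart_prefill_s

print(f"replayed tokens: {delta_tokens}")
print(f"same-process speedup: {speedup_same_process:.1f}x")
print(f"after-restart speedup: {speedup_after_restart:.1f}x")
```

This prints 24 replayed tokens and speedups of about 28.5x and 26.7x. These are single-run ratios from one probe, so they indicate the order of magnitude rather than a benchmark.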

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 4 files

@davide221 davide221 merged commit f339caa into Luce-Org:main Jun 10, 2026
5 checks passed