feat(server): add scoped disk prefix cache policy #364

Open
weicj wants to merge 1 commit into Luce-Org:main from weicj:feat-scoped-disk-prefix-cache-policy

weicj commented Jun 10, 2026

Summary

This PR adds a configurable prefix-selection policy for the disk prefix cache to improve the hit rate on real agent-style workloads.

The existing disk prefix restore behavior, introduced by PR #227 and extended for target split / mixed-backend restore by PR #325 and PR #352, was oriented around saving/restoring the full prompt. That works for repeated identical requests, but agent workloads usually have a large stable context followed by a dynamic tail: tool results, recent turns, and task state can change between requests. When the full prompt is used as the cache scope, small tail changes can prevent reuse of an otherwise valuable cached prefix.

This PR adds a controllable prefix scope for the disk cache. By caching a slightly shorter but more stable prefix, the server can trade a small number of cached tokens for a higher cross-request hit rate. The scope can be a fixed token count provided by the user, or, in auto mode, the server can infer a stable boundary from recent similar requests.

CLI:

--disk-prefix-cache off
--disk-prefix-cache full
--disk-prefix-cache auto
--disk-prefix-cache auto:30
--disk-prefix-cache 1000

Requests can also override the policy through prefix_cache.scope:

{
  "prefix_cache": {
    "scope": "auto:30"
  }
}
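The override can be pictured as a small resolution step that runs per request. The sketch below is only an illustration: `effective_scope` is a hypothetical helper, not a function from this PR, and the rule that `prefix_cache.window` (named in the change list) replaces the window of an auto scope is an assumption about precedence.

```python
from collections.abc import Mapping


def effective_scope(server_default: str, request: Mapping) -> str:
    """Return the scope string to apply to one request (hypothetical helper)."""
    prefix_cache = request.get("prefix_cache")
    if not isinstance(prefix_cache, Mapping):
        # No override in the request body: fall back to the CLI setting.
        return server_default
    scope = str(prefix_cache.get("scope", server_default))
    window = prefix_cache.get("window")
    # Assumption: an explicit window only matters when the scope is auto.
    if window is not None and scope.split(":")[0] == "auto":
        return f"auto:{int(window)}"
    return scope


print(effective_scope("full", {"prefix_cache": {"scope": "auto:30"}}))
```

A request without a `prefix_cache` object simply inherits whatever `--disk-prefix-cache` was set to at startup.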

Semantics:

  • off: disables disk prefix cache restore/save.
  • full: keeps the existing full-prompt disk cache behavior, including cold-prefix and continued checkpoints.
  • auto: defaults to auto:30; searches the 30 most recent requests for the most similar token prefix, then aligns down to a safe chat boundary.
  • auto:N: sets the lookback window to N requests. N is the candidate window size; it does not require all N requests to share the same prefix.
  • N: caches/restores the first N prompt tokens, for example 1000.
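The five accepted spellings map onto four modes. As a rough sketch of how such a value could be parsed (the `ScopePolicy` type and `parse_scope` function are hypothetical names for illustration, not the PR's `DiskPrefixCachePolicy` implementation, and rejecting zero or negative values is an assumption):

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_AUTO_WINDOW = 30  # default stated in the PR description


@dataclass(frozen=True)
class ScopePolicy:
    mode: str                      # "off", "full", "auto", or "fixed"
    window: Optional[int] = None   # lookback window, auto mode only
    tokens: Optional[int] = None   # prefix length in tokens, fixed mode only


def parse_scope(value: str) -> ScopePolicy:
    """Parse off | full | auto | auto:N | N into a ScopePolicy."""
    text = value.strip().lower()
    if text in ("off", "full"):
        return ScopePolicy(mode=text)
    if text == "auto":
        return ScopePolicy(mode="auto", window=DEFAULT_AUTO_WINDOW)
    if text.startswith("auto:"):
        window = text[len("auto:"):]
        # Assumption: the window must be a positive integer.
        if not window.isdecimal() or int(window) < 1:
            raise ValueError(f"invalid auto window: {value!r}")
        return ScopePolicy(mode="auto", window=int(window))
    if text.isdecimal() and int(text) > 0:
        return ScopePolicy(mode="fixed", tokens=int(text))
    raise ValueError(f"invalid disk prefix cache scope: {value!r}")
```

Treating a bare `auto` as `auto:30` at parse time keeps the rest of the pipeline from having to know about the default.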

The server only compares token prefixes. It does not make semantic decisions about system prompts, tools, AGENTS.md, RAG, or dynamic tails. Disk keys remain token-prefix hashes, so semantically similar but token-different prompts do not collide.
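To illustrate why token-different prompts cannot share a key, here is a minimal sketch of a token-prefix hash. The PR does not describe the actual hash function or key layout, so SHA-256 over little-endian 32-bit token IDs is purely an assumption, and `prefix_key` is a hypothetical name.

```python
import hashlib
import struct
from typing import Sequence


def prefix_key(token_ids: Sequence[int], boundary: int) -> str:
    """Hash the first `boundary` token IDs into a hex digest."""
    if not 0 < boundary <= len(token_ids):
        raise ValueError("boundary must lie within the prompt")
    prefix = token_ids[:boundary]
    # Pack each ID as an unsigned 32-bit little-endian integer so the
    # byte encoding of the prefix is unambiguous.
    payload = struct.pack(f"<{len(prefix)}I", *prefix)
    return hashlib.sha256(payload).hexdigest()


# Two prompts that share the first three tokens but differ in the tail
# produce the same key when scoped to those three tokens.
print(prefix_key([11, 22, 33, 44], 3) == prefix_key([11, 22, 33, 99], 3))
```

Because only the scoped prefix is hashed, a changing tail leaves the key untouched, while a single differing token inside the prefix yields a different key.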

Window / Hit-rate Tradeoff

auto:N uses N as the recent-request lookback window. Smaller windows usually cache more recent-context tokens, but the key is more likely to drift with the dynamic tail; larger windows usually cache fewer tokens but produce a more stable hit rate.

Tokenizer-level simulation results:

| workload | auto:2 | auto:8 | auto:30 |
| --- | --- | --- | --- |
| independent agent tasks | avg 1432 tokens, 28/30 hits (93%) | avg 1432 tokens, 28/30 hits (93%) | avg 1432 tokens, 28/30 hits (93%) |
| synthetic rolling chat | avg 1988 tokens, 1/30 hits (3%) | avg 1783 tokens, 7/30 hits (23%) | avg 1452 tokens, 28/30 hits (93%) |
| real rolling trace A | avg 1776 tokens, 8/20 hits (40%) | avg 1389 tokens, 10/20 hits (50%) | avg 1049 tokens, 18/20 hits (90%) |
| real rolling trace B | avg 2204 tokens, 7/20 hits (35%) | avg 1489 tokens, 10/20 hits (50%) | avg 869 tokens, 18/20 hits (90%) |

The default auto:30 is therefore intentionally conservative: it caches fewer high-variance tail tokens in exchange for a higher disk prefix hit rate. Users who want a longer recent-context cache can still use auto:2, auto:8, or a fixed token count.

Changes

  • Adds DiskPrefixCachePolicy with off, full, auto[:window], and fixed-token modes.
  • Adds --disk-prefix-cache off|full|auto|auto:N|N.
  • Adds request-level support for the same values through prefix_cache.scope; prefix_cache.window can override the auto window.
  • Makes auto use the best-match longest common token prefix from the recent request window, aligned down to a safe chat boundary.
  • Sets the default auto window to 30.
  • Makes fixed and auto modes prefill exactly up to the selected token boundary before saving the KV snapshot, keeping the disk key and snapshot position aligned.
  • Keeps full on the existing full-prompt, cold-prefix, and continued-checkpoint behavior.
  • Keeps --kv-cache-min-tokens as the global minimum save threshold.
  • Limits --kv-cache-cold-max to full mode cold-prefix selection; it does not cap auto or fixed-number boundaries.
  • Disables auto / fixed when PFlash rewrites the effective prompt; full-prompt restore keeps the existing behavior.
  • Exposes the active disk policy in /props.full_cache.disk_policy.
  • Adds [disk-cache] auto scope: ... selected=... diagnostics for boundary selection.
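The auto boundary bullet can be read literally as "longest common token prefix against a sliding window, then round down". The sketch below implements only that literal reading. It is not the PR's selection rule, whose details are not given here, and it is not expected to reproduce the simulation numbers above. All names are hypothetical, and `safe_boundaries` (token offsets where a chat message ends) is assumed to be supplied by the caller.

```python
from collections import deque
from typing import Iterable, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token IDs that two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_auto_boundary(
    prompt: Sequence[int],
    recent: Iterable[Sequence[int]],
    safe_boundaries: Iterable[int],
    min_tokens: int = 0,
) -> int:
    """Return the scoped prefix length, or 0 when nothing is worth caching."""
    # Best match: the longest prefix shared with any request in the window.
    best = max((common_prefix_len(prompt, old) for old in recent), default=0)
    # Align down to the nearest safe boundary at or below the match.
    aligned = max((b for b in safe_boundaries if b <= best), default=0)
    # Respect a global minimum save threshold.
    return aligned if aligned >= min_tokens else 0


window = deque(maxlen=3)  # plays the role of N in auto:N
window.append([1, 2, 3, 4, 5, 9])
window.append([1, 2, 7])
print(select_auto_boundary([1, 2, 3, 4, 5, 6, 7], window, [2, 4, 6]))  # 4
```

A `deque` with `maxlen` drops the oldest prompt automatically, which matches the description of N as a candidate window rather than a requirement that all N requests agree.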

End-to-end Restore Checks

End-to-end restore closure was validated on both a 27B dense backend and an OpenClaw-style MoE request. The dense backend check showed that a scoped prefix selected by auto:30 can be persisted and found again after a server restart as a cold-start disk hit, reducing prefill from 77.6s on the cold run to 1.6s after restore. The MoE check showed that auto boundary selection maps to a stable scoped prefix on a real agent prompt shape. qwen35moe hybrid restore itself is covered by companion bugfix PR #362. This PR only covers disk prefix policy and boundary selection; the hit-rate behavior is covered by the window tradeoff simulation above.

cubic-dev-ai (bot) left a comment

No issues found across 6 files

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 10, 2026