feat(server): add scoped disk prefix cache policy #364
Open
weicj wants to merge 1 commit into
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 10, 2026.
Summary
This PR adds a configurable prefix-selection policy for the disk prefix cache to improve the hit rate on real agent-style workloads.
The existing disk prefix restore behavior, introduced by PR #227 and extended for target split / mixed-backend restore by PR #325 and PR #352, was oriented around saving/restoring the full prompt. That works for repeated identical requests, but agent workloads usually have a large stable context followed by a dynamic tail: tool results, recent turns, and task state can change between requests. When the full prompt is used as the cache scope, small tail changes can prevent reuse of an otherwise valuable cached prefix.
This PR adds a controllable prefix scope for the disk cache. By caching a slightly shorter but more stable prefix, the server can trade a small number of cached tokens for a higher cross-request hit rate. The scope can be a fixed token count provided by the user, or `auto` can infer a stable boundary from recent similar requests.

CLI: `--disk-prefix-cache off|full|auto|auto:N|N`

Requests can also override the policy through `prefix_cache.scope`:

`{ "prefix_cache": { "scope": "auto:30" } }`

Semantics:
- `off`: disables disk prefix cache restore/save.
- `full`: keeps the existing full-prompt disk cache behavior, including cold-prefix and continued checkpoints.
- `auto`: defaults to `auto:30`; searches for the most similar token prefix in the most recent 30 requests, then aligns down to a safe chat boundary.
- `auto:N`: sets the lookback window to N requests. N is the candidate window size; it does not require all N requests to share the same prefix.
- `N`: caches/restores the first N prompt tokens, for example `1000`.

The server only compares token prefixes. It does not make semantic decisions about system prompts, tools, AGENTS.md, RAG, or dynamic tails. Disk keys remain token-prefix hashes, so semantically similar but token-different prompts do not collide.
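The scope grammar above can be illustrated with a small parser. This is a minimal sketch in Python, not the PR's implementation: `ScopePolicy`, `parse_scope`, and `effective_policy` are hypothetical names, and the validation rules (positive integers only, request value wins over the CLI value) are assumptions drawn from the semantics listed above.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_AUTO_WINDOW = 30  # per the description: `auto` defaults to `auto:30`


@dataclass(frozen=True)
class ScopePolicy:
    # Hypothetical container; the PR's actual type is DiskPrefixCachePolicy.
    mode: str                     # "off", "full", "auto", or "fixed"
    window: Optional[int] = None  # lookback window, used only by "auto"
    tokens: Optional[int] = None  # prefix length, used only by "fixed"


def _positive_int(digits: str, original: str) -> int:
    # Assumed rule: only plain ASCII digits forming a value > 0 are accepted.
    if not (digits.isascii() and digits.isdigit()) or int(digits) <= 0:
        raise ValueError(f"invalid prefix cache scope: {original!r}")
    return int(digits)


def parse_scope(text: str) -> ScopePolicy:
    """Parse off | full | auto | auto:N | N into a ScopePolicy."""
    value = text.strip().lower()
    if value in ("off", "full"):
        return ScopePolicy(mode=value)
    if value == "auto":
        return ScopePolicy(mode="auto", window=DEFAULT_AUTO_WINDOW)
    if value.startswith("auto:"):
        return ScopePolicy(mode="auto", window=_positive_int(value[5:], text))
    return ScopePolicy(mode="fixed", tokens=_positive_int(value, text))


def effective_policy(cli_value: str, request: dict) -> ScopePolicy:
    """Use the request's prefix_cache.scope if present, else the CLI value."""
    override = (request.get("prefix_cache") or {}).get("scope")
    return parse_scope(str(override) if override is not None else cli_value)
```

For example, `effective_policy("full", {"prefix_cache": {"scope": "auto:30"}})` yields an `auto` policy with a window of 30, while a request without `prefix_cache` falls back to the server-wide `full` setting.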
Window / Hit-rate Tradeoff
`auto:N` uses N as the recent-request lookback window. Smaller windows usually cache more recent-context tokens, but the key is more likely to drift with the dynamic tail; larger windows usually cache fewer tokens but produce a more stable hit rate.

Tokenizer-level simulation results:

| `auto:2` | `auto:8` | `auto:30` |
| --- | --- | --- |
| 1432 tokens, 28/30 hits (93%) | 1432 tokens, 28/30 hits (93%) | 1432 tokens, 28/30 hits (93%) |
| 1988 tokens, 1/30 hits (3%) | 1783 tokens, 7/30 hits (23%) | 1452 tokens, 28/30 hits (93%) |
| 1776 tokens, 8/20 hits (40%) | 1389 tokens, 10/20 hits (50%) | 1049 tokens, 18/20 hits (90%) |
| 2204 tokens, 7/20 hits (35%) | 1489 tokens, 10/20 hits (50%) | 869 tokens, 18/20 hits (90%) |

The default `auto:30` is therefore intentionally conservative: it caches fewer high-variance tail tokens to get a higher disk prefix hit rate. Users who want a longer recent-context cache can still use `auto:2`, `auto:8`, or a fixed token count.

Changes
- `DiskPrefixCachePolicy` with `off`, `full`, `auto[:window]`, and fixed-token modes.
- `--disk-prefix-cache off|full|auto|auto:N|N`.
- `prefix_cache.scope`; `prefix_cache.window` can override the auto window.
- `auto` uses the best-match longest common token prefix from the recent request window, aligned down to a safe chat boundary.
- `fixed`/`auto` prefill exactly to the selected token boundary before saving the KV snapshot, keeping the disk key and snapshot position aligned.
- `full` stays on the existing full-prompt, cold-prefix, and continued-checkpoint behavior.
- `--kv-cache-min-tokens` serves as the global minimum save threshold.
- `--kv-cache-cold-max` applies to `full` mode cold-prefix selection; it does not cap `auto` or fixed-number boundaries.
- Special handling for `auto`/`fixed` when PFlash rewrites the effective prompt; full-prompt restore keeps the existing behavior.
- `/props.full_cache.disk_policy`.
- `[disk-cache] auto scope: ... selected=...` diagnostics for boundary selection.

End-to-end Restore Checks
End-to-end restore closure was validated on both a 27B dense backend and an OpenClaw-style MoE request. The dense backend check showed that a scoped prefix selected by `auto:30` can be persisted, found again after server restart as a cold-start disk hit, and reduce prefill from a 77.6 s cold run to 1.6 s after restore. The MoE check showed that auto boundary selection maps to a stable scoped prefix on a real agent prompt shape. qwen35moe hybrid restore itself is covered by companion bugfix PR #362. This PR only covers disk prefix policy and boundary selection; the hit-rate behavior is covered by the window tradeoff simulation above.
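The `auto` boundary selection and token-prefix keying described in this PR can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the server's code: the function names, the `safe_boundaries` input (token offsets where a chat turn ends), the SHA-256 key, and the way `min_tokens` is applied are all hypothetical stand-ins for the behavior the description names.

```python
import hashlib
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token IDs shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_auto_boundary(
    prompt: Sequence[int],
    recent: List[Sequence[int]],
    safe_boundaries: Sequence[int],
    window: int = 30,
    min_tokens: int = 0,
) -> int:
    """Pick a scoped prefix length for `prompt`; 0 means "no scoped prefix".

    Takes the best-match longest common token prefix against the last
    `window` requests, then aligns down to the nearest safe boundary.
    Only one candidate has to match; the window is not an agreement quorum.
    """
    candidates = recent[-window:] if window > 0 else []
    best = max((common_prefix_len(prompt, old) for old in candidates), default=0)
    aligned = max((b for b in safe_boundaries if b <= best), default=0)
    # Assumed interpretation of a global minimum save threshold.
    return aligned if aligned >= min_tokens else 0


def prefix_key(tokens: Sequence[int], length: int) -> str:
    """Hash the first `length` token IDs (assumed non-negative, below 2**32)."""
    h = hashlib.sha256()
    for t in tokens[:length]:
        h.update(int(t).to_bytes(4, "little"))
    return h.hexdigest()
```

Because the key depends only on the selected token prefix, two prompts that share the stable context but differ in the dynamic tail map to the same key, which is the property the scoped policy relies on.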