
Conversation

@ggerganov (Member) commented Oct 2, 2025

target #16440
rel #16117

Initial version of automatic memory offloading to host memory, using extended logic for minimizing prompt reprocessing. The host-memory prompt cache acts as a set of "extra slots" against which we can compute prefix similarity and decide to hot-swap one of them into the llama_context if doing so would reduce the processing.
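
Roughly, the decision can be pictured like this (a minimal sketch assuming a token-level prefix match; the types and function names below are invented for illustration and are not the actual server code):

```cpp
// Simplified illustration of the hot-swap decision - hypothetical types/names.
#include <cstdint>
#include <cstddef>
#include <vector>

using llama_token = int32_t;

struct host_cache_entry {
    std::vector<llama_token> tokens; // prompt tokens kept in host memory
    // ... associated context state, stored off-device ...
};

// Length of the longest common prefix between two token sequences.
static size_t common_prefix(const std::vector<llama_token> & a,
                            const std::vector<llama_token> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// Pick the host-cache entry (if any) that shares a longer prefix with the new
// prompt than the state currently loaded in the llama_context does.
static const host_cache_entry * pick_swap_candidate(
        const std::vector<host_cache_entry> & cache,
        const std::vector<llama_token>      & prompt,
        size_t                                n_cur_shared) {
    const host_cache_entry * best   = nullptr;
    size_t                   best_n = n_cur_shared;

    for (const auto & entry : cache) {
        const size_t n = common_prefix(entry.tokens, prompt);
        if (n > best_n) {
            best_n = n;
            best   = &entry; // swapping this entry in would save reprocessing
        }
    }

    return best; // nullptr => keep the current context state
}
```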

Still WIP, but it should already be usable.

Note: the mtmd workarounds are starting to cause some headaches. For example, server_tokens is not copyable, which complicates the cache logic and makes the prompt-caching feature incompatible with mtmd.
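
For context, the problem is roughly of this shape (a hypothetical sketch, not the actual server_tokens definition):

```cpp
// Hypothetical sketch of the issue - not the actual server_tokens type.
#include <cstdint>
#include <vector>

struct server_tokens_like {
    std::vector<int32_t> text_tokens;

    server_tokens_like() = default;

    // mtmd-related state makes the type move-only:
    server_tokens_like(const server_tokens_like &)             = delete;
    server_tokens_like & operator=(const server_tokens_like &) = delete;
    server_tokens_like(server_tokens_like &&)                  = default;
    server_tokens_like & operator=(server_tokens_like &&)      = default;
};

// A host-memory prompt cache wants to keep a *copy* of the prompt after the
// slot is reused, but the deleted copy constructor forbids it:
//
//     std::vector<server_tokens_like> cache;
//     cache.push_back(tokens); // error: call to deleted copy constructor
```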

Server refactor

  • Replace server_slot members with a single server_task
  • Remove server_slot.n_predict
  • Remove prompt truncation logic (obsolete and not useful anymore)
  • slot.task is now a const pointer to reflect that the task parameters should not change once the task is passed to the slot (see the sketch below)
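
A minimal sketch of how the refactored ownership might look (names approximated, not the exact code from the PR):

```cpp
// Approximate shape of the refactor - illustrative only.
#include <memory>
#include <string>

struct server_task {
    int         id        = -1;
    std::string prompt;          // task parameters are fixed at creation time
    int         n_predict = -1;
    // ...
};

struct server_slot {
    // Instead of mirroring individual task fields (n_predict, params, ...),
    // the slot holds the task itself. The const conveys that the parameters
    // must not change after the task is handed to the slot.
    std::shared_ptr<const server_task> task;

    // ... runtime state: cache tokens, sampling context, timings, ...
};
```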

TODOs

  • Set memory limit for the host-memory cache from CLI
  • Clean-up implementation
  • Test with agentic workflows
  • Multi-slot tests
  • Fix progress report

@ggerganov force-pushed the gg/prompt-cache-ext branch 2 times, most recently from 0787f03 to 5c0cec4 on October 3, 2025 18:49
@ggerganov force-pushed the gg/prompt-cache-ext branch from 5c0cec4 to 1440ec5 on October 7, 2025 07:40
@ggerganov changed the base branch from master to gg/server-checkpoints-improve on October 7, 2025 07:41
@github-actions bot added the python (python script changes) label on Oct 7, 2025
@ggerganov mentioned this pull request on Oct 7, 2025
@ggerganov force-pushed the gg/prompt-cache-ext branch from 9de8392 to cf7dd4b on October 7, 2025 15:09
@ggerganov (Member Author) commented:

Looking for some feedback on how this new logic performs in different use cases. I've been testing it with the llama.vscode agent, and it significantly improves the experience since we can now use a single server slot without trashing the prompt cache.

The current implementation should work with any model (dense, MoE, SWA, SSM, etc.). I think the default settings should be good for most use cases, though we'll probably add some options to adjust cache limits if needed.

Pay attention to these new messages in the logs:

[screenshot of the new prompt-cache log messages]

Interested in testing agentic use cases, such as Claude Code and similar, where we have a single large context with various auxiliary calls (keyword extraction, summarization, etc.) interleaved. The expectation is that prompt reprocessing should be significantly reduced in such cases.
