UPSTREAM PR #19572: server: add Anthropic-compatible cache_read_input_tokens to usage metrics (#1172)
Status: Open
…se structure
- Added `n_cache_read_input_tokens` field to `server_task_result_cmpl_final` and `server_task_result_cmpl_partial` structs
- Populated `cache_read_input_tokens` in JSON output for both final and streaming responses
- Ensured `cache_read_input_tokens` is non-negative by clamping to zero if negative
- Updated unit tests to validate presence, type, and non-negativity of `cache_read_input_tokens` in usage metrics
No meaningful performance changes were detected across 115,429 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-gemma3-cli. 🔎 Full breakdown: Loci Inspector.
Note
Source pull request: ggml-org/llama.cpp#19572
Summary
This PR adds the `cache_read_input_tokens` field to the server's usage metrics in API responses, aligning with the Anthropic API's prompt caching usage reporting.

When using llama-server as a drop-in replacement for the Anthropic API, clients expect `cache_read_input_tokens` in the `usage` object of the response. This field reports the number of input tokens that were read from the KV cache rather than being recomputed.
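For context, here is a minimal sketch of the response shape an Anthropic-compatible client might read back, assuming nlohmann::json (the JSON library llama.cpp vendors); the response body and token counts are invented for illustration, not output captured from llama-server:

```cpp
// Hedged illustration only: a made-up Anthropic-style usage object.
#include <cstdint>
#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    // Hypothetical usage block from a /v1/messages-style response.
    const json response = json::parse(R"({
        "usage": {
            "input_tokens": 128,
            "output_tokens": 42,
            "cache_read_input_tokens": 96
        }
    })");

    // Tokens served from the KV cache instead of being recomputed.
    const int64_t cached = response["usage"]["cache_read_input_tokens"].get<int64_t>();
    std::cout << "tokens read from cache: " << cached << "\n";
    return 0;
}
```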
Changes

- Added `n_cache_read_input_tokens` field to `server_task_result_cmpl_final` and `server_task_result_cmpl_partial` structs
- Populated `cache_read_input_tokens` in JSON output for both final and streaming responses
- Clamped `cache_read_input_tokens` to zero if negative (defensive programming; see the sketch after this list)
- Updated unit tests to validate non-negativity (`cache_read_input_tokens` >= 0)
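A hedged sketch of the struct field and the clamping described above; the struct is drastically simplified and `make_usage` is a hypothetical helper, not a function from this PR:

```cpp
// Minimal sketch of the described change, not the PR's actual code.
#include <algorithm>
#include <cstdint>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

struct server_task_result_cmpl_final {
    int32_t n_prompt_tokens           = 0;
    int32_t n_cache_read_input_tokens = 0; // field added by this PR
};

// Build the usage object, clamping to zero so a bookkeeping error can
// never surface to clients as a negative token count.
static json make_usage(const server_task_result_cmpl_final & res) {
    return json{
        {"input_tokens",            res.n_prompt_tokens},
        {"cache_read_input_tokens", std::max<int32_t>(0, res.n_cache_read_input_tokens)},
    };
}
```

Clamping in the serialization path keeps the defensive check in one place, so both the final and streaming responses get the same guarantee.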
Testing

- Verified that responses include `cache_read_input_tokens`:

Use Claude Code to review my PR and spot any issues.
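For illustration, a check mirroring what the PR description says the updated unit tests validate (presence, integer type, and non-negativity); `check_usage` is a hypothetical helper, not a test taken from the PR:

```cpp
// Illustrative assertions over a usage object, per the PR description.
#include <cassert>
#include <cstdint>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

static void check_usage(const json & usage) {
    assert(usage.contains("cache_read_input_tokens")); // presence
    const json & v = usage.at("cache_read_input_tokens");
    assert(v.is_number_integer());                     // type
    assert(v.get<int64_t>() >= 0);                     // non-negativity, per the clamp
}
```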