UPSTREAM PR #19572: server: add Anthropic-compatible cache_read_input_tokens to usage metrics #1172

Open
loci-dev wants to merge 2 commits into main from loci/pr-19572-anthropic-usage

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19572

Summary

This PR adds the cache_read_input_tokens field to the server's usage metrics in API responses, aligning with the Anthropic API's prompt caching usage reporting.

When using llama-server as a drop-in replacement for the Anthropic API, clients expect cache_read_input_tokens in the usage object of the response. This field reports the number of input tokens that were read from the KV cache rather than being recomputed (see the example response after this list). It is useful for:

  • monitoring cache efficiency
  • estimating cost savings
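
For illustration (all numbers invented), a usage object carrying the new field in the Anthropic style could look like this, where 1792 of the 2048 prompt tokens were served from the KV cache:

```json
{
  "usage": {
    "input_tokens": 2048,
    "output_tokens": 120,
    "cache_read_input_tokens": 1792
  }
}
```

Assuming input_tokens counts the full prompt, cache_read_input_tokens / input_tokens gives a per-request cache hit rate, which is what makes the field handy for both monitoring and cost estimation.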

Changes

  • Added n_cache_read_input_tokens field to server_task_result_cmpl_final and server_task_result_cmpl_partial structs
  • Populated cache_read_input_tokens in JSON output for both final and streaming responses
  • Clamped cache_read_input_tokens to zero if negative (defensive programming; see the sketch after this list)
  • Updated unit tests to validate:
    • presence of cache_read_input_tokens
    • correct type (integer)
    • non-negativity (>= 0)
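
A minimal sketch of the shape of this change, assuming nlohmann::json (which the server code aliases as json); the surrounding struct members and the to_json_usage helper are illustrative, not the exact upstream diff:

```cpp
#include <algorithm>
#include <cstdint>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

struct server_task_result_cmpl_final {
    int32_t n_prompt_tokens           = 0; // illustrative
    int32_t n_decoded                 = 0; // illustrative
    int32_t n_cache_read_input_tokens = 0; // the field added by this PR

    // hypothetical helper: emit the Anthropic-style usage object
    json to_json_usage() const {
        // clamp so a bookkeeping glitch never produces a negative count
        const int32_t cache_read = std::max<int32_t>(0, n_cache_read_input_tokens);
        return json{
            {"input_tokens",            n_prompt_tokens},
            {"output_tokens",           n_decoded},
            {"cache_read_input_tokens", cache_read},
        };
    }
};
```

The same field is carried on server_task_result_cmpl_partial so that streaming responses report it too.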

Testing

  • All existing server tests still pass
  • Added new assertions (mirrored in the sketch after this list) to verify that cache_read_input_tokens:
    • is present in the usage object
    • is an integer
    • is ≥ 0
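
For illustration, here are the same three checks expressed in C++ against the parsed usage object (the actual tests live in the server's Python suite; check_usage is a hypothetical helper):

```cpp
#include <cassert>
#include <cstdint>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical helper mirroring the three test assertions.
void check_usage(const json & usage) {
    assert(usage.contains("cache_read_input_tokens"));            // present
    assert(usage["cache_read_input_tokens"].is_number_integer()); // integer
    assert(usage["cache_read_input_tokens"].get<int64_t>() >= 0); // non-negative
}
```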

Use Claude Code to review my PR and spot any issues.

…se structure

- Added `n_cache_read_input_tokens` field to `server_task_result_cmpl_final` and `server_task_result_cmpl_partial` structs
- Populated `cache_read_input_tokens` in JSON output for both final and streaming responses
- Ensured `cache_read_input_tokens` is non-negative by clamping to zero if negative
- Updated unit tests to validate presence, type, and non-negativity of `cache_read_input_tokens` in usage metrics

loci-review bot commented Feb 13, 2026

No meaningful performance changes were detected across 115429 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-gemma3-cli.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev force-pushed the main branch 10 times, most recently from 10f8f26 to a6ecec6 on February 20, 2026 02:17
@loci-dev force-pushed the main branch 2 times, most recently from 9ea4a65 to c001e9f on February 22, 2026 02:17