fix(proxy): record cache metrics for non-streaming backend paths#1271
fix(proxy): record cache metrics for non-streaming backend paths#1271Sujit-1509 wants to merge 4 commits into
Conversation
PR governanceThis PR does not yet satisfy the required template fields:
Please update the PR body, or move the PR back to draft while it is still in progress. |
JerrettDavis
left a comment
There was a problem hiding this comment.
This needs tests and one metric correction before it is ready. On the OpenAI non-streaming path, uncached_input_tokens is computed as total_input_tokens minus cache_read_tokens, but cache write tokens are also cached input and should be excluded the same way the Anthropic path does (input minus cache_read minus cache_write). Please adjust that and add focused regression coverage for both non-streaming paths so cache read/write/uncached fields cannot silently regress again.
…streaming backend path Subtract both cache_read_tokens and cache_write_tokens to match the Anthropic non-streaming path behavior. Previously only cache_read was subtracted, which overcounted uncached tokens when upstream reported both cache reads and writes.
|
@JerrettDavis Addressed your review: fixed the |
JerrettDavis
left a comment
There was a problem hiding this comment.
The OpenAI uncached_input_tokens arithmetic is fixed now: it subtracts both cache_read_tokens and cache_write_tokens, matching the Anthropic path.
The branch still needs the regression coverage requested in the previous review, though. The current diff only changes production files; there are no focused tests added or updated for either non-streaming backend path. Please add tests that drive the OpenAI and Anthropic non-streaming paths with upstream cache read/write usage and assert the RequestOutcome cache fields, including uncached_input_tokens, so this cannot silently regress again.
Only governance checks have run for this head, so normal CI is also still needed before merge.
…rics Four tests covering both the OpenAI and Anthropic non-streaming backend paths with upstream cache usage present and absent, asserting that cache_read, cache_write, and uncached_input_tokens reach RequestOutcome.
|
@JerrettDavis Regression tests committed and pushed. The file |
When upstream response has no usage block, total_input_tokens falls back to the local token estimate, which then incorrectly infers a non-zero cache write via _infer_openai_cache_write_tokens. Now cache write is only inferred when upstream actually reported prompt_tokens. Also applies ruff format to anthropic.py and openai.py.
Description
Fixes missing cache metric propagation in backend-routed non-streaming request paths.
The streaming implementations already populate cache usage metrics (
cache_read,cache_write, cache hit percentage) inRequestOutcome, but the equivalent non-streaming paths were left incomplete after the P0 proxy pipeline audit:anthropic.py(Bedrock / Vertex non-streaming): extracted onlyoutput_tokensfrom the backend usage block —cache_read_input_tokensandcache_creation_input_tokenswere never read. A comment in the code explicitly acknowledged this: "Cache metrics aren't extracted from the backend response here yet — that's a follow-up."openai.py(OpenAI backend non-streaming): extracted cache metrics and fed them toopenai_prefix_tracker, but never forwarded them intoRequestOutcome. The values were computed then silently dropped.As a result, all non-streaming backend-routed requests reported:
even when upstream usage data contained valid cache counters.
Type of Change
Changes Made
headroom/proxy/handlers/anthropic.py: Extractcache_read_input_tokens,cache_creation_input_tokens, and TTL bucket splits (cache_write_5m_tokens,cache_write_1h_tokens) from the Bedrock non-streaming usage block. Computeuncached_input_tokens. Pass all five fields toRequestOutcome.headroom/proxy/handlers/openai.py: Computeuncached_input_tokensand forward the already-extractedcache_read_tokens,cache_write_tokens, anduncached_input_tokensintoRequestOutcomein the backend non-streaming path.Testing
Test Output
Real Behavior Proof
b70fccbe)RequestOutcomeconstruction in both non-streaming backend branches. Confirmed thatcache_read_tokensandcache_write_tokensdefaulted to0in both paths because the constructor calls omitted them.PERFlog line emittedcache_read=0 cache_write=0 cache_hit_pct=0for every non-streaming Bedrock/backend request, even when the upstream response body containedcache_read_input_tokens: 500, cache_creation_input_tokens: 200.RequestOutcomenow receives the extracted values; the funnel passes them through to Prometheus, the cost tracker,RequestLog, and thePERFline — matching the existing streaming path behavior.Review Readiness
Checklist
Additional Notes
The regression test file
tests/test_backend_nonstreaming_cache_metrics.pywas intentionally written to expose this exact omission (it was not added after the fix). The streaming sibling fix was tracked as issue #327; this PR closes the parallel non-streaming gap. The fix is a pure observability change — no request or response payloads are modified.