feat: add cached tokens prometheus metric #4534
base: main
Conversation
Signed-off-by: Vladislav Nosivskoy <[email protected]>
👋 Hi vladnosiv! Thank you for contributing to ai-dynamo/dynamo.
Signed-off-by: Vladislav Nosivskoy <[email protected]>
Walkthrough
This PR introduces cached token metrics tracking across the system.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Areas requiring extra attention:
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
♻️ Duplicate comments (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
291-295: get_usage helper matches completions implementation and keeps totals consistent
This get_usage accessor (clone + saturating_add for total_tokens) mirrors the completions variant and gives a single place to define how totals are derived, which is helpful now that multiple callers (usage chunk + metrics observers) rely on the same semantics. See my comment in lib/llm/src/protocols/openai/completions/delta.rs about optional factoring of the shared logic.
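For readers who have not opened the diff, a minimal sketch of what such an accessor looks like; the struct and field names below are simplified stand-ins, not the actual dynamo definitions:

```rust
// Illustrative stand-in for the OpenAI usage type; the real code uses the
// existing CompletionUsage from the OpenAI protocol definitions.
#[derive(Clone, Debug, Default)]
struct CompletionUsage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

struct DeltaGenerator {
    usage: CompletionUsage,
}

impl DeltaGenerator {
    /// Snapshot of accumulated usage; total_tokens is derived on the fly so
    /// the usage chunk and the metrics observers always agree on the total.
    /// saturating_add avoids an overflow panic in debug builds.
    fn get_usage(&self) -> CompletionUsage {
        let mut usage = self.usage.clone();
        usage.total_tokens = usage.prompt_tokens.saturating_add(usage.completion_tokens);
        usage
    }
}

fn main() {
    let generator = DeltaGenerator {
        usage: CompletionUsage { prompt_tokens: 12, completion_tokens: 30, total_tokens: 0 },
    };
    assert_eq!(generator.get_usage().total_tokens, 42);
}
```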
315-334: Token usage aggregation logic mirrors completions path
The always-on increment of self.usage.completion_tokens from delta.token_ids.len() and propagation of prompt_tokens_details are consistent with the completions DeltaGenerator, enabling metrics even when include_usage is false for the client. Any concerns about the exact BackendOutput::token_ids contract are covered in my comment on the completions DeltaGenerator.
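A rough sketch of the always-on aggregation being described, assuming (as the later comment on the completions DeltaGenerator does) that each chunk's token_ids carries only the newly generated tokens; the types are simplified placeholders:

```rust
// Simplified placeholder for the backend's per-chunk output.
struct BackendOutput {
    token_ids: Vec<u32>, // assumed to contain only the tokens new to this chunk
}

#[derive(Default)]
struct Usage {
    completion_tokens: usize,
}

#[derive(Default)]
struct DeltaGenerator {
    usage: Usage,
}

impl DeltaGenerator {
    fn observe_chunk(&mut self, delta: &BackendOutput) {
        // Aggregate unconditionally: enable_usage now only gates what is sent
        // back to the client, not what the server counts for metrics.
        self.usage.completion_tokens += delta.token_ids.len();
    }
}

fn main() {
    let mut generator = DeltaGenerator::default();
    generator.observe_chunk(&BackendOutput { token_ids: vec![1, 2, 3] });
    generator.observe_chunk(&BackendOutput { token_ids: vec![4] });
    assert_eq!(generator.usage.completion_tokens, 4);
}
```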
🧹 Nitpick comments (3)
lib/llm/src/http/service/openai.rs (1)
840-853: Consistent implementation across all streaming endpoints.
The ghost event filtering pattern is consistently applied across all three streaming paths (single completions, batch completions, and chat completions), ensuring uniform behavior for metrics collection and SSE stream filtering.
While the pattern is repeated three times, the different contexts (handler functions, types, and closures) make extraction potentially more complex than beneficial. If additional streaming endpoints are added in the future, consider extracting this pattern into a helper function.
lib/llm/src/http/service/metrics.rs (1)
245-258: observe_cached_tokens correctly enforces one observation per request; consider unary path coverage
The cached_tokens_observed flag plus Option<usize> argument ensure the histogram records at most one sample per request while tolerating multiple metric annotations, which matches the per-request semantic you want.
As a follow-up, process_response_and_observe_metrics (used on non-SSE paths) currently does not call observe_cached_tokens, so cached_sequence_length will only be populated for the EventConverter/SSE pipeline. If you also care about cached-token stats for non-streaming responses, consider invoking observe_cached_tokens(metrics.cached_tokens) there as well.
Also applies to: 839-872
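A sketch of the once-per-request guard using the prometheus crate; the struct and metric names are approximations of what metrics.rs exposes, not the actual definitions:

```rust
use prometheus::{Histogram, HistogramOpts};

// Approximation of the per-request metrics state; the real collector in
// metrics.rs tracks more than the cached-token observation.
struct ResponseMetrics {
    cached_sequence_length: Histogram,
    cached_tokens_observed: bool,
}

impl ResponseMetrics {
    /// Record the cached-token count at most once per request, even if
    /// several annotated chunks carry a value.
    fn observe_cached_tokens(&mut self, cached_tokens: Option<usize>) {
        if self.cached_tokens_observed {
            return;
        }
        if let Some(tokens) = cached_tokens {
            self.cached_sequence_length.observe(tokens as f64);
            self.cached_tokens_observed = true;
        }
    }
}

fn main() {
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "cached_sequence_length",
        "Cached tokens (prefix cache hits) per request",
    ))
    .unwrap();

    let mut metrics = ResponseMetrics {
        cached_sequence_length: histogram,
        cached_tokens_observed: false,
    };
    metrics.observe_cached_tokens(None);      // no value yet: nothing recorded
    metrics.observe_cached_tokens(Some(128)); // recorded once
    metrics.observe_cached_tokens(Some(999)); // ignored: already observed
}
```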
lib/llm/src/protocols/openai/completions/delta.rs (1)
225-229: Centralized get_usage helper is clear; consider avoiding duplication across endpoints
This implementation cleanly defines how total_tokens is derived (prompt_tokens.saturating_add(completion_tokens)) and is identical to the chat-completions variant, which keeps behavior consistent between endpoints. If we add more fields to CompletionUsage in future, it might be worth factoring this into a small shared helper to avoid the two implementations drifting, but that's optional for now.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- docs/observability/metrics.md (1 hunks)
- lib/bindings/python/src/dynamo/prometheus_names.py (2 hunks)
- lib/llm/src/http/service/metrics.rs (12 hunks)
- lib/llm/src/http/service/openai.rs (3 hunks)
- lib/llm/src/preprocessor.rs (3 hunks)
- lib/llm/src/protocols/openai.rs (1 hunks)
- lib/llm/src/protocols/openai/chat_completions/delta.rs (4 hunks)
- lib/llm/src/protocols/openai/completions/delta.rs (3 hunks)
- lib/runtime/src/metrics/prometheus_names.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
- lib/llm/src/protocols/openai.rs
- lib/llm/src/http/service/openai.rs
- lib/llm/src/protocols/openai/completions/delta.rs
- lib/llm/src/protocols/openai/chat_completions/delta.rs
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
Applied to files:
docs/observability/metrics.md
📚 Learning: 2025-09-16T00:27:43.992Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.
Applied to files:
docs/observability/metrics.md
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The ai-dynamo/dynamo team uses _total as a semantic unit suffix across all metric types (including gauges like INFLIGHT_REQUESTS_TOTAL) for internal consistency, as evidenced by patterns in prometheus_names.rs. This is a deliberate architectural choice to prioritize uniform naming conventions over strict Prometheus conventions that reserve _total only for counters.
Applied to files:
docs/observability/metrics.md
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.
Applied to files:
lib/llm/src/http/service/openai.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
- lib/llm/src/protocols/openai/completions/delta.rs
- lib/llm/src/protocols/openai/chat_completions/delta.rs
🧬 Code graph analysis (6)
lib/llm/src/protocols/openai.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
- get_usage (291-295)
- get_usage (417-419)

lib/llm/src/protocols/openai/completions/delta.rs (2)
- get_usage (225-229)
- get_usage (320-322)
lib/llm/src/http/service/openai.rs (2)
lib/async-openai/src/client.rs (1)
- stream (486-528)

lib/llm/src/http/service/metrics.rs (2)
- process_response_using_event_converter_and_observe_metrics (975-1035)
- from (964-966)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai.rs (2)
- get_usage (229-229)
- choice_from_postprocessor (214-217)

lib/llm/src/protocols/openai/chat_completions/delta.rs (3)
- get_usage (291-295)
- get_usage (417-419)
- choice_from_postprocessor (311-403)
lib/llm/src/preprocessor.rs (1)
lib/runtime/src/protocols/annotated.rs (1)
- from_data (44-51)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai.rs (1)
- get_usage (229-229)

lib/llm/src/protocols/openai/completions/delta.rs (2)
- get_usage (225-229)
- get_usage (320-322)
lib/llm/src/http/service/metrics.rs (3)
lib/llm/src/preprocessor.rs (2)
- new (122-128)
- from_annotation (83-101)

lib/bindings/python/src/dynamo/prometheus_names.py (1)
- frontend_service (38-80)

lib/llm/src/http/service/openai.rs (2)
- chat_completions (740-892)
- from (202-210)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (17)
lib/llm/src/http/service/openai.rs (2)
411-424: LGTM! Correct implementation of ghost event filtering.
The stream processing correctly implements the pattern described in the PR:
- map calls the metrics observer, which returns Result<Option<Event>>, observing metrics on all events including ghost events
- filter_map with result.transpose() elegantly converts Result<Option<T>> → Option<Result<T>>, filtering out ghost events (Ok(None)) while preserving valid events and propagating errors
- future::ready appropriately wraps the synchronous transpose() to satisfy filter_map's async requirements

This ensures server-side metrics capture all events while clients only receive non-empty SSE frames.
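The shape of that pattern, reduced to a self-contained sketch with plain String events and the futures crate standing in for the real SSE and error types:

```rust
use futures::{future, stream, StreamExt};

// Placeholder observer: in the real handler this records metrics for every
// item and returns Ok(None) for ghost (metrics-only) events.
fn observe_and_maybe_forward(item: String) -> Result<Option<String>, String> {
    if item.is_empty() { Ok(None) } else { Ok(Some(item)) }
}

fn main() {
    futures::executor::block_on(async {
        let upstream = stream::iter(vec![
            "chunk-1".to_string(),
            String::new(), // stands in for a ghost event
            "chunk-2".to_string(),
        ]);

        let client_stream = upstream
            // Observe metrics on all events, ghost events included.
            .map(observe_and_maybe_forward)
            // Result<Option<T>> -> Option<Result<T>>: Ok(None) is dropped,
            // Ok(Some(_)) and Err(_) flow through to the client stream.
            .filter_map(|result| future::ready(result.transpose()));

        let forwarded: Vec<Result<String, String>> = client_stream.collect().await;
        assert_eq!(
            forwarded,
            vec![Ok("chunk-1".to_string()), Ok("chunk-2".to_string())]
        );
    });
}
```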
573-586: Consistent pattern applied to batch path.
The same ghost event filtering pattern is correctly applied to the batch completions streaming path, maintaining consistency with the single completions path.
lib/llm/src/preprocessor.rs (2)
68-74: LLMMetricAnnotation extension with cached_tokens is safe and backward-compatible
Using Option<usize> for cached_tokens means older annotations without this field will still deserialize as None, and callers that ignore the new field are unaffected. No issues spotted with the struct change.
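To illustrate why the Option makes this backward-compatible, here is a toy annotation struct (field names approximate; the real one in preprocessor.rs carries more fields) deserializing both old and new payloads:

```rust
use serde::Deserialize;

// Toy approximation of LLMMetricAnnotation.
#[derive(Deserialize, Debug)]
struct LLMMetricAnnotation {
    input_tokens: usize,
    output_tokens: usize,
    // New optional field: payloads emitted before this PR simply omit it,
    // and deserialization falls back to None.
    #[serde(default)]
    cached_tokens: Option<usize>,
}

fn main() {
    // Annotation produced by an older frontend, without the new field.
    let old = r#"{"input_tokens": 100, "output_tokens": 20}"#;
    let parsed: LLMMetricAnnotation = serde_json::from_str(old).unwrap();
    assert_eq!(parsed.cached_tokens, None);

    // Annotation produced after this PR.
    let new = r#"{"input_tokens": 100, "output_tokens": 20, "cached_tokens": 64}"#;
    let parsed: LLMMetricAnnotation = serde_json::from_str(new).unwrap();
    assert_eq!(parsed.cached_tokens, Some(64));
    assert_eq!(parsed.output_tokens, 20);
}
```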
624-632: Final metrics/ghost-event emission logic matches the intended contract
The per-chunk annotation uses cached_tokens: None, while the final branch (only when a finish_reason was seen and usage_chunk_sent is false) synthesizes a single metrics event populated from get_usage(), including cached_tokens from prompt_tokens_details. Gating the payload on is_usage_enabled() while always emitting the annotated event cleanly supports internal metrics collection plus client-facing usage opt-out.
Also applies to: 657-687
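The gist of that final branch, with stand-in types; the real code builds an Annotated stream item and attaches LLMMetricAnnotation rather than a plain struct field:

```rust
// Stand-in types for the final-event construction in preprocessor.rs.
#[derive(Clone, Debug)]
struct Usage {
    prompt_tokens: u32,
    completion_tokens: u32,
    cached_tokens: Option<u32>,
}

#[derive(Debug)]
struct MetricAnnotation {
    output_tokens: u32,
    cached_tokens: Option<u32>,
}

#[derive(Debug)]
struct FinalEvent {
    /// Client-visible usage payload; None makes this a "ghost" event.
    data: Option<Usage>,
    /// Server-side metrics annotation; attached regardless of the payload.
    annotation: MetricAnnotation,
}

fn build_final_event(usage: Usage, usage_enabled: bool) -> FinalEvent {
    let annotation = MetricAnnotation {
        output_tokens: usage.completion_tokens,
        cached_tokens: usage.cached_tokens,
    };
    FinalEvent {
        // Honor the client's include_usage opt-out for the payload only.
        data: usage_enabled.then(|| usage.clone()),
        annotation,
    }
}

fn main() {
    let usage = Usage { prompt_tokens: 100, completion_tokens: 20, cached_tokens: Some(64) };
    let ghost = build_final_event(usage, false);
    assert!(ghost.data.is_none());
    assert_eq!(ghost.annotation.cached_tokens, Some(64));
}
```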
lib/llm/src/http/service/metrics.rs (3)
168-182: cached_sequence_length histogram integrates cleanly with existing frontend metrics
The new cached_sequence_length HistogramVec is named consistently with frontend_service::CACHED_SEQUENCE_LENGTH, reuses the ISL bucket configuration, and is properly included in Metrics::new and register(). The wiring looks correct and low-risk.
Also applies to: 442-451, 510-528, 606-618
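For context, registering such a histogram with the prometheus crate looks roughly like this; the bucket values and label set here are illustrative, not the actual ISL configuration or frontend label scheme:

```rust
use prometheus::{HistogramOpts, HistogramVec, Registry};

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();

    // Bucket values are illustrative; the real metric reuses the ISL bucket
    // configuration from metrics.rs.
    let buckets = vec![0.0, 64.0, 256.0, 1024.0, 4096.0, 16384.0];

    let cached_sequence_length = HistogramVec::new(
        HistogramOpts::new(
            "dynamo_frontend_cached_sequence_length",
            "Number of cached tokens (prefix cache hits) per request",
        )
        .buckets(buckets),
        &["model"], // assumed label set for this sketch
    )?;

    registry.register(Box::new(cached_sequence_length.clone()))?;

    // One observation per request, recorded when the final metrics annotation arrives.
    cached_sequence_length
        .with_label_values(&["example-model"])
        .observe(128.0);
    Ok(())
}
```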
969-975: EventConverter metrics path scrapes LLM metrics and filters ghost events as intended
The updated process_response_using_event_converter_and_observe_metrics observes OSL, cached tokens, TTFT/ITL, then removes the LLM metrics annotation from the outgoing Annotated so it doesn't leak into client-visible SSE. Treating events with no data and no event as service/ghost events and returning Ok(None) aligns with the HTTP handler's filter_map usage, ensuring metrics-only frames are dropped from the client stream while still counted.
Also applies to: 982-1005, 1009-1034
1396-1557: Tests provide solid coverage for cached-token metrics and ghost-event behavior
The new tests validate that cached_sequence_length is registered under the expected name, only records once per request (even with repeated calls and varying values), and that ghost LLM-metrics events are converted into Ok(None) while still incrementing the histogram. This should catch most regressions in the new metrics path.

docs/observability/metrics.md (1)
151-160: Doc entry for dynamo_frontend_cached_sequence_length matches implementation
The new bullet accurately describes the frontend histogram ("number of cached tokens (prefix cache hits) per request") and uses the same name exposed by the Rust and Python metric constants. No additional doc changes are needed.
lib/runtime/src/metrics/prometheus_names.rs (1)
116-117: CACHED_SEQUENCE_LENGTH constant is consistent with frontend metric naming
The new frontend_service::CACHED_SEQUENCE_LENGTH constant and its value "cached_sequence_length" line up with the frontend metric name (dynamo_frontend_cached_sequence_length) and the docs entry, keeping the naming centralized in this module.

lib/llm/src/protocols/openai.rs (1)
211-230: DeltaGeneratorExt::get_usage addition cleanly extends the trait
Adding get_usage(&self) -> CompletionUsage centralizes usage computation (including total_tokens) while keeping existing methods untouched and the trait object-safe. The referenced delta generators implement this method, so the new preprocessor/metrics callers have a consistent source of finalized usage stats.

lib/bindings/python/src/dynamo/prometheus_names.py (2)
58-59: New cached_sequence_length metric constant is consistent with existing naming
The CACHED_SEQUENCE_LENGTH constant and its docstring align with the nearby input/output sequence metrics and the documented dynamo_frontend_cached_sequence_length histogram; looks good for exposing the new metric through Python bindings.
98-101: KV prefix cache hit rate metric names and docs look correct
HOST_CACHE_HIT_RATE and DISK_CACHE_HIT_RATE clearly describe normalized hit rates (0.0–1.0) and fit the existing kvbm naming style. For an auto-generated surface this is sufficient; no further changes needed here.

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
270-283: create_usage_chunk now correctly reuses centralized get_usage()
Routing usage computation through self.get_usage() avoids duplicating the total_tokens calculation and keeps client-facing usage chunks and internal metrics consumers in sync.
417-419: DeltaGeneratorExt::get_usage correctly delegates to the inherent helper
Exposing get_usage through DeltaGeneratorExt lets upstream components access finalized usage for metrics without consuming the generator, while still honoring is_usage_enabled for what is sent back to clients.

lib/llm/src/protocols/openai/completions/delta.rs (3)
203-218: create_usage_chunk correctly reuses get_usage() for final usage-only chunk
Using self.get_usage() here ensures the final usage-only completion chunk and any server-side metrics readers see a consistent view of prompt/completion/total tokens without duplicating aggregation logic.
237-247: Confirm BackendOutput::token_ids semantics for completion token aggregation
self.usage.completion_tokens is incremented by delta.token_ids.len() on every streamed chunk, regardless of enable_usage. That's exactly right if delta.token_ids only contains the new tokens for that chunk; if the backend instead reports a cumulative list of token IDs per request, this would double-count completion tokens across chunks. Since this value now feeds both client-visible usage (when enabled) and internal metrics, it's worth double-checking the BackendOutput::token_ids contract and adjusting to use a per-chunk delta or backend-reported completion_usage.completion_tokens if necessary.
Also applies to: 250-255
320-322: DeltaGeneratorExt::get_usage wiring looks correct
Delegating DeltaGeneratorExt::get_usage to the inherent DeltaGenerator::get_usage keeps the trait surface small and ensures all callers share the same finalized usage computation.
Signed-off-by: Vladislav Nosivskoy <[email protected]>
/ok to test 5773040
Overview:
This PR implements the collection of the cached_sequence_length metric (prefix cache hits) and refactors the underlying token usage accounting mechanism.
Previously, the server focused on observing ISL and OSL metrics, which could be calculated incrementally during request processing. However, since the final usage chunk was optional (dependent on client request), it was impossible to consistently track metrics that are typically only available in the final completion_usage, such as cached_tokens, audio_tokens, or reasoning_tokens. To address this, the pipeline now ensures a final event containing an annotation with all finalized metrics is always generated. To maintain the user contract regarding optional usage reporting, these events can be "empty" (payload-free) but annotated. These empty events are captured by the metrics observer and subsequently filtered out, ensuring they are not transmitted to the client.
Details:
New Metric: cached_sequence_length
Added a generic cached_sequence_length histogram. This metric tracks the number of cached tokens (prefix cache hits) per request.
"Always-On" Token Accounting
Modified DeltaGenerator implementations in protocols/openai/... to always aggregate token usage internally, removing the condition if self.options.enable_usage.
Added get_usage() method to the DeltaGeneratorExt trait to safely retrieve statistics without consuming the generator state.
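A rough shape of that trait addition (simplified usage type; the real trait in protocols/openai.rs has further methods and bounds):

```rust
#[derive(Clone, Debug, Default)]
struct CompletionUsage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

// Simplified view of the trait extension: get_usage takes &self, so callers
// (preprocessor, metrics observers) can read finalized stats without
// consuming the generator, and the method stays object-safe.
trait DeltaGeneratorExt {
    fn get_usage(&self) -> CompletionUsage;
}

#[derive(Default)]
struct CompletionsDeltaGenerator {
    usage: CompletionUsage,
}

impl DeltaGeneratorExt for CompletionsDeltaGenerator {
    fn get_usage(&self) -> CompletionUsage {
        let mut usage = self.usage.clone();
        usage.total_tokens = usage.prompt_tokens.saturating_add(usage.completion_tokens);
        usage
    }
}

fn main() {
    // Works through a trait object, as the stream pipeline requires.
    let generator: Box<dyn DeltaGeneratorExt> = Box::new(CompletionsDeltaGenerator {
        usage: CompletionUsage { prompt_tokens: 7, completion_tokens: 3, total_tokens: 0 },
    });
    assert_eq!(generator.get_usage().total_tokens, 10);
}
```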
"Ghost Event" Mechanism for Metrics
Updated preprocessor.rs to always generate a final stream event containing LLMMetricAnnotation (including cached tokens and final ISL/OSL).
If the client did not request usage statistics (stream_options: {"include_usage": false}), this event contains data: None but still carries the metric annotation.
Updated process_response_using_event_converter_and_observe_metrics in metrics.rs to return Result<Option<Event>>. It returns Ok(None) for these "ghost events" after scraping the metrics.
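A simplified sketch of that observer contract; the types and field layout below are placeholders, whereas the real function works on Annotated responses and the per-request metrics collector:

```rust
// Simplified stand-ins for the annotated SSE event and the metrics collector.
#[derive(Debug)]
struct AnnotatedEvent {
    data: Option<String>,         // client-visible SSE payload
    event: Option<String>,        // SSE event name, if any
    cached_tokens: Option<usize>, // scraped from the LLM metrics annotation
}

#[derive(Default, Debug)]
struct Metrics {
    cached_tokens_observed: Option<usize>,
}

fn process_and_observe(
    mut event: AnnotatedEvent,
    metrics: &mut Metrics,
) -> Result<Option<AnnotatedEvent>, String> {
    // Scrape metrics from every event, ghost or not, and strip the annotation
    // so it never reaches the client.
    if metrics.cached_tokens_observed.is_none() {
        metrics.cached_tokens_observed = event.cached_tokens.take();
    }

    // A frame with no payload and no SSE event was only carrying metrics,
    // so it is dropped from the client stream.
    if event.data.is_none() && event.event.is_none() {
        return Ok(None);
    }
    Ok(Some(event))
}

fn main() {
    let mut metrics = Metrics::default();
    let ghost = AnnotatedEvent { data: None, event: None, cached_tokens: Some(64) };
    assert!(process_and_observe(ghost, &mut metrics).unwrap().is_none());
    assert_eq!(metrics.cached_tokens_observed, Some(64));
}
```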
Stream Filtering in HTTP Layer
Refactored the stream handling in openai.rs (for chat and completions) to use .filter_map.
This efficiently filters out the "ghost events" (where data is None) so that the HTTP client receives a standard SSE stream without empty frames, while the server side still captures the metrics.
Where should the reviewer start?
- lib/llm/src/protocols/openai/chat_completions/delta.rs: Notice that token aggregation is no longer conditional.
- lib/llm/src/preprocessor.rs: Look at the transform_postprocessor_stream logic.
- lib/llm/src/http/service/metrics.rs: Check the process_response_... signature change to Result<Option> and the new cached_sequence_length histogram.
- lib/llm/src/http/service/openai.rs: Review the filter_map implementation for streams.

Related Issues:
Relates to observability of additional token statistics.
Summary by CodeRabbit
Release Notes
New Features
- Added a cached_sequence_length histogram tracking cached tokens (prefix cache hits) per request.

Documentation
- Documented the new dynamo_frontend_cached_sequence_length metric in docs/observability/metrics.md.