feat: add cached tokens prometheus metric #4534
base: main
Conversation
Signed-off-by: Vladislav Nosivskoy <[email protected]>
👋 Hi vladnosiv! Thank you for contributing to ai-dynamo/dynamo.
Signed-off-by: Vladislav Nosivskoy <[email protected]>
Walkthrough
This PR introduces cached token metrics tracking across the system.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Areas requiring extra attention:
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
♻️ Duplicate comments (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
291-295: get_usage helper matches completions implementation and keeps totals consistent
This get_usage accessor (clone + saturating_add for total_tokens) mirrors the completions variant and gives a single place to define how totals are derived, which is helpful now that multiple callers (usage chunk + metrics observers) rely on the same semantics. See my comment in lib/llm/src/protocols/openai/completions/delta.rs about optional factoring of the shared logic.
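For readers who have not opened the diff, a minimal sketch of what such an accessor looks like; the struct and field names below are simplified stand-ins, not the actual dynamo definitions:

```rust
// Illustrative stand-in for the OpenAI usage type; the real code uses the
// existing CompletionUsage from the OpenAI protocol definitions.
#[derive(Clone, Debug, Default)]
struct CompletionUsage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

struct DeltaGenerator {
    usage: CompletionUsage,
}

impl DeltaGenerator {
    /// Snapshot of accumulated usage; total_tokens is derived on the fly so
    /// the usage chunk and the metrics observers always agree on the total.
    /// saturating_add avoids an overflow panic in debug builds.
    fn get_usage(&self) -> CompletionUsage {
        let mut usage = self.usage.clone();
        usage.total_tokens = usage.prompt_tokens.saturating_add(usage.completion_tokens);
        usage
    }
}

fn main() {
    let generator = DeltaGenerator {
        usage: CompletionUsage { prompt_tokens: 12, completion_tokens: 30, total_tokens: 0 },
    };
    assert_eq!(generator.get_usage().total_tokens, 42);
}
```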
315-334: Token usage aggregation logic mirrors completions path
The always-on increment of self.usage.completion_tokens from delta.token_ids.len() and propagation of prompt_tokens_details are consistent with the completions DeltaGenerator, enabling metrics even when include_usage is false for the client. Any concerns about the exact BackendOutput::token_ids contract are covered in my comment on the completions DeltaGenerator.
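A rough sketch of the always-on aggregation being described, assuming (as the later comment on the completions DeltaGenerator does) that each chunk's token_ids carries only the newly generated tokens; the types are simplified placeholders:

```rust
// Simplified placeholder for the backend's per-chunk output.
struct BackendOutput {
    token_ids: Vec<u32>, // assumed to contain only the tokens new to this chunk
}

#[derive(Default)]
struct Usage {
    completion_tokens: usize,
}

#[derive(Default)]
struct DeltaGenerator {
    usage: Usage,
}

impl DeltaGenerator {
    fn observe_chunk(&mut self, delta: &BackendOutput) {
        // Aggregate unconditionally: enable_usage now only gates what is sent
        // back to the client, not what the server counts for metrics.
        self.usage.completion_tokens += delta.token_ids.len();
    }
}

fn main() {
    let mut generator = DeltaGenerator::default();
    generator.observe_chunk(&BackendOutput { token_ids: vec![1, 2, 3] });
    generator.observe_chunk(&BackendOutput { token_ids: vec![4] });
    assert_eq!(generator.usage.completion_tokens, 4);
}
```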
🧹 Nitpick comments (3)
lib/llm/src/http/service/openai.rs (1)
840-853: Consistent implementation across all streaming endpoints.
The ghost event filtering pattern is consistently applied across all three streaming paths (single completions, batch completions, and chat completions), ensuring uniform behavior for metrics collection and SSE stream filtering.
While the pattern is repeated three times, the different contexts (handler functions, types, and closures) make extraction potentially more complex than beneficial. If additional streaming endpoints are added in the future, consider extracting this pattern into a helper function.
lib/llm/src/http/service/metrics.rs (1)
245-258: observe_cached_tokens correctly enforces one observation per request; consider unary path coverage
The cached_tokens_observed flag plus Option<usize> argument ensure the histogram records at most one sample per request while tolerating multiple metric annotations, which matches the per-request semantic you want.
As a follow-up, process_response_and_observe_metrics (used on non-SSE paths) currently does not call observe_cached_tokens, so cached_sequence_length will only be populated for the EventConverter/SSE pipeline. If you also care about cached-token stats for non-streaming responses, consider invoking observe_cached_tokens(metrics.cached_tokens) there as well.
Also applies to: 839-872
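A sketch of the once-per-request guard using the prometheus crate; the struct and metric names are approximations of what metrics.rs exposes, not the actual definitions:

```rust
use prometheus::{Histogram, HistogramOpts};

// Approximation of the per-request metrics state; the real collector in
// metrics.rs tracks more than the cached-token observation.
struct ResponseMetrics {
    cached_sequence_length: Histogram,
    cached_tokens_observed: bool,
}

impl ResponseMetrics {
    /// Record the cached-token count at most once per request, even if
    /// several annotated chunks carry a value.
    fn observe_cached_tokens(&mut self, cached_tokens: Option<usize>) {
        if self.cached_tokens_observed {
            return;
        }
        if let Some(tokens) = cached_tokens {
            self.cached_sequence_length.observe(tokens as f64);
            self.cached_tokens_observed = true;
        }
    }
}

fn main() {
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "cached_sequence_length",
        "Cached tokens (prefix cache hits) per request",
    ))
    .unwrap();

    let mut metrics = ResponseMetrics {
        cached_sequence_length: histogram,
        cached_tokens_observed: false,
    };
    metrics.observe_cached_tokens(None);      // no value yet: nothing recorded
    metrics.observe_cached_tokens(Some(128)); // recorded once
    metrics.observe_cached_tokens(Some(999)); // ignored: already observed
}
```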
lib/llm/src/protocols/openai/completions/delta.rs (1)
225-229: Centralized get_usage helper is clear; consider avoiding duplication across endpoints
This implementation cleanly defines how total_tokens is derived (prompt_tokens.saturating_add(completion_tokens)) and is identical to the chat-completions variant, which keeps behavior consistent between endpoints. If we add more fields to CompletionUsage in future, it might be worth factoring this into a small shared helper to avoid the two implementations drifting, but that's optional for now.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- docs/observability/metrics.md (1 hunks)
- lib/bindings/python/src/dynamo/prometheus_names.py (2 hunks)
- lib/llm/src/http/service/metrics.rs (12 hunks)
- lib/llm/src/http/service/openai.rs (3 hunks)
- lib/llm/src/preprocessor.rs (3 hunks)
- lib/llm/src/protocols/openai.rs (1 hunks)
- lib/llm/src/protocols/openai/chat_completions/delta.rs (4 hunks)
- lib/llm/src/protocols/openai/completions/delta.rs (3 hunks)
- lib/runtime/src/metrics/prometheus_names.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.
Applied to files:
- lib/llm/src/protocols/openai.rs
- lib/llm/src/http/service/openai.rs
- lib/llm/src/protocols/openai/completions/delta.rs
- lib/llm/src/protocols/openai/chat_completions/delta.rs
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
Applied to files:
docs/observability/metrics.md
📚 Learning: 2025-09-16T00:27:43.992Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.
Applied to files:
docs/observability/metrics.md
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The ai-dynamo/dynamo team uses _total as a semantic unit suffix across all metric types (including gauges like INFLIGHT_REQUESTS_TOTAL) for internal consistency, as evidenced by patterns in prometheus_names.rs. This is a deliberate architectural choice to prioritize uniform naming conventions over strict Prometheus conventions that reserve _total only for counters.
Applied to files:
docs/observability/metrics.md
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.
Applied to files:
lib/llm/src/http/service/openai.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.
Applied to files:
- lib/llm/src/protocols/openai/completions/delta.rs
- lib/llm/src/protocols/openai/chat_completions/delta.rs
🧬 Code graph analysis (6)
lib/llm/src/protocols/openai.rs (2)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
- get_usage (291-295)
- get_usage (417-419)

lib/llm/src/protocols/openai/completions/delta.rs (2)
- get_usage (225-229)
- get_usage (320-322)
lib/llm/src/http/service/openai.rs (2)
lib/async-openai/src/client.rs (1)
- stream (486-528)

lib/llm/src/http/service/metrics.rs (2)
- process_response_using_event_converter_and_observe_metrics (975-1035)
- from (964-966)
lib/llm/src/protocols/openai/completions/delta.rs (2)
lib/llm/src/protocols/openai.rs (2)
- get_usage (229-229)
- choice_from_postprocessor (214-217)

lib/llm/src/protocols/openai/chat_completions/delta.rs (3)
- get_usage (291-295)
- get_usage (417-419)
- choice_from_postprocessor (311-403)
lib/llm/src/preprocessor.rs (1)
lib/runtime/src/protocols/annotated.rs (1)
- from_data (44-51)
lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
lib/llm/src/protocols/openai.rs (1)
- get_usage (229-229)

lib/llm/src/protocols/openai/completions/delta.rs (2)
- get_usage (225-229)
- get_usage (320-322)
lib/llm/src/http/service/metrics.rs (3)
lib/llm/src/preprocessor.rs (2)
- new (122-128)
- from_annotation (83-101)

lib/bindings/python/src/dynamo/prometheus_names.py (1)
- frontend_service (38-80)

lib/llm/src/http/service/openai.rs (2)
- chat_completions (740-892)
- from (202-210)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (17)
lib/llm/src/http/service/openai.rs (2)
411-424: LGTM! Correct implementation of ghost event filtering.
The stream processing correctly implements the pattern described in the PR:
- map calls the metrics observer, which returns Result<Option<Event>>, observing metrics on all events including ghost events
- filter_map with result.transpose() elegantly converts Result<Option<T>> → Option<Result<T>>, filtering out ghost events (Ok(None)) while preserving valid events and propagating errors
- future::ready appropriately wraps the synchronous transpose() to satisfy filter_map's async requirements

This ensures server-side metrics capture all events while clients only receive non-empty SSE frames.
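The shape of that pattern, reduced to a self-contained sketch with plain String events and the futures crate standing in for the real SSE and error types:

```rust
use futures::{future, stream, StreamExt};

// Placeholder observer: in the real handler this records metrics for every
// item and returns Ok(None) for ghost (metrics-only) events.
fn observe_and_maybe_forward(item: String) -> Result<Option<String>, String> {
    if item.is_empty() { Ok(None) } else { Ok(Some(item)) }
}

fn main() {
    futures::executor::block_on(async {
        let upstream = stream::iter(vec![
            "chunk-1".to_string(),
            String::new(), // stands in for a ghost event
            "chunk-2".to_string(),
        ]);

        let client_stream = upstream
            // Observe metrics on all events, ghost events included.
            .map(observe_and_maybe_forward)
            // Result<Option<T>> -> Option<Result<T>>: Ok(None) is dropped,
            // Ok(Some(_)) and Err(_) flow through to the client stream.
            .filter_map(|result| future::ready(result.transpose()));

        let forwarded: Vec<Result<String, String>> = client_stream.collect().await;
        assert_eq!(
            forwarded,
            vec![Ok("chunk-1".to_string()), Ok("chunk-2".to_string())]
        );
    });
}
```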
573-586: Consistent pattern applied to batch path.
The same ghost event filtering pattern is correctly applied to the batch completions streaming path, maintaining consistency with the single completions path.
lib/llm/src/preprocessor.rs (2)
68-74: LLMMetricAnnotation extension with cached_tokens is safe and backward-compatible
Using Option<usize> for cached_tokens means older annotations without this field will still deserialize as None, and callers that ignore the new field are unaffected. No issues spotted with the struct change.
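To illustrate why the Option makes this backward-compatible, here is a toy annotation struct (field names approximate; the real one in preprocessor.rs carries more fields) deserializing both old and new payloads:

```rust
use serde::Deserialize;

// Toy approximation of LLMMetricAnnotation.
#[derive(Deserialize, Debug)]
struct LLMMetricAnnotation {
    input_tokens: usize,
    output_tokens: usize,
    // New optional field: payloads emitted before this PR simply omit it,
    // and deserialization falls back to None.
    #[serde(default)]
    cached_tokens: Option<usize>,
}

fn main() {
    // Annotation produced by an older frontend, without the new field.
    let old = r#"{"input_tokens": 100, "output_tokens": 20}"#;
    let parsed: LLMMetricAnnotation = serde_json::from_str(old).unwrap();
    assert_eq!(parsed.cached_tokens, None);

    // Annotation produced after this PR.
    let new = r#"{"input_tokens": 100, "output_tokens": 20, "cached_tokens": 64}"#;
    let parsed: LLMMetricAnnotation = serde_json::from_str(new).unwrap();
    assert_eq!(parsed.cached_tokens, Some(64));
    assert_eq!(parsed.output_tokens, 20);
}
```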
624-632: Final metrics/ghost-event emission logic matches the intended contract
The per-chunk annotation uses cached_tokens: None, while the final branch (only when a finish_reason was seen and usage_chunk_sent is false) synthesizes a single metrics event populated from get_usage(), including cached_tokens from prompt_tokens_details. Gating the payload on is_usage_enabled() while always emitting the annotated event cleanly supports internal metrics collection plus client-facing usage opt-out.
Also applies to: 657-687
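The gist of that final branch, with stand-in types; the real code builds an Annotated stream item and attaches LLMMetricAnnotation rather than a plain struct field:

```rust
// Stand-in types for the final-event construction in preprocessor.rs.
#[derive(Clone, Debug)]
struct Usage {
    prompt_tokens: u32,
    completion_tokens: u32,
    cached_tokens: Option<u32>,
}

#[derive(Debug)]
struct MetricAnnotation {
    output_tokens: u32,
    cached_tokens: Option<u32>,
}

#[derive(Debug)]
struct FinalEvent {
    /// Client-visible usage payload; None makes this a "ghost" event.
    data: Option<Usage>,
    /// Server-side metrics annotation; attached regardless of the payload.
    annotation: MetricAnnotation,
}

fn build_final_event(usage: Usage, usage_enabled: bool) -> FinalEvent {
    let annotation = MetricAnnotation {
        output_tokens: usage.completion_tokens,
        cached_tokens: usage.cached_tokens,
    };
    FinalEvent {
        // Honor the client's include_usage opt-out for the payload only.
        data: usage_enabled.then(|| usage.clone()),
        annotation,
    }
}

fn main() {
    let usage = Usage { prompt_tokens: 100, completion_tokens: 20, cached_tokens: Some(64) };
    let ghost = build_final_event(usage, false);
    assert!(ghost.data.is_none());
    assert_eq!(ghost.annotation.cached_tokens, Some(64));
}
```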
lib/llm/src/http/service/metrics.rs (3)
168-182: cached_sequence_length histogram integrates cleanly with existing frontend metrics
The new cached_sequence_length HistogramVec is named consistently with frontend_service::CACHED_SEQUENCE_LENGTH, reuses the ISL bucket configuration, and is properly included in Metrics::new and register(). The wiring looks correct and low-risk.
Also applies to: 442-451, 510-528, 606-618
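For context, registering such a histogram with the prometheus crate looks roughly like this; the bucket values and label set here are illustrative, not the actual ISL configuration or frontend label scheme:

```rust
use prometheus::{HistogramOpts, HistogramVec, Registry};

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();

    // Bucket values are illustrative; the real metric reuses the ISL bucket
    // configuration from metrics.rs.
    let buckets = vec![0.0, 64.0, 256.0, 1024.0, 4096.0, 16384.0];

    let cached_sequence_length = HistogramVec::new(
        HistogramOpts::new(
            "dynamo_frontend_cached_sequence_length",
            "Number of cached tokens (prefix cache hits) per request",
        )
        .buckets(buckets),
        &["model"], // assumed label set for this sketch
    )?;

    registry.register(Box::new(cached_sequence_length.clone()))?;

    // One observation per request, recorded when the final metrics annotation arrives.
    cached_sequence_length
        .with_label_values(&["example-model"])
        .observe(128.0);
    Ok(())
}
```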
969-975: EventConverter metrics path scrapes LLM metrics and filters ghost events as intended
The updated process_response_using_event_converter_and_observe_metrics observes OSL, cached tokens, TTFT/ITL, then removes the LLM metrics annotation from the outgoing Annotated so it doesn't leak into client-visible SSE. Treating events with no data and no event as service/ghost events and returning Ok(None) aligns with the HTTP handler's filter_map usage, ensuring metrics-only frames are dropped from the client stream while still counted.
Also applies to: 982-1005, 1009-1034
1396-1557: Tests provide solid coverage for cached-token metrics and ghost-event behavior
The new tests validate that cached_sequence_length is registered under the expected name, only records once per request (even with repeated calls and varying values), and that ghost LLM-metrics events are converted into Ok(None) while still incrementing the histogram. This should catch most regressions in the new metrics path.

docs/observability/metrics.md (1)
151-160: Doc entry for dynamo_frontend_cached_sequence_length matches implementation
The new bullet accurately describes the frontend histogram ("number of cached tokens (prefix cache hits) per request") and uses the same name exposed by the Rust and Python metric constants. No additional doc changes are needed.
lib/runtime/src/metrics/prometheus_names.rs (1)
116-117: CACHED_SEQUENCE_LENGTH constant is consistent with frontend metric naming
The new frontend_service::CACHED_SEQUENCE_LENGTH constant and its value "cached_sequence_length" line up with the frontend metric name (dynamo_frontend_cached_sequence_length) and the docs entry, keeping the naming centralized in this module.

lib/llm/src/protocols/openai.rs (1)
211-230: DeltaGeneratorExt::get_usage addition cleanly extends the trait
Adding get_usage(&self) -> CompletionUsage centralizes usage computation (including total_tokens) while keeping existing methods untouched and the trait object-safe. The referenced delta generators implement this method, so the new preprocessor/metrics callers have a consistent source of finalized usage stats.

lib/bindings/python/src/dynamo/prometheus_names.py (2)
58-59: New cached_sequence_length metric constant is consistent with existing naming
The CACHED_SEQUENCE_LENGTH constant and its docstring align with the nearby input/output sequence metrics and the documented dynamo_frontend_cached_sequence_length histogram; looks good for exposing the new metric through Python bindings.
98-101: KV prefix cache hit rate metric names and docs look correct
HOST_CACHE_HIT_RATE and DISK_CACHE_HIT_RATE clearly describe normalized hit rates (0.0–1.0) and fit the existing kvbm naming style. For an auto-generated surface this is sufficient; no further changes needed here.

lib/llm/src/protocols/openai/chat_completions/delta.rs (2)
270-283: create_usage_chunk now correctly reuses centralized get_usage()
Routing usage computation through self.get_usage() avoids duplicating the total_tokens calculation and keeps client-facing usage chunks and internal metrics consumers in sync.
417-419: DeltaGeneratorExt::get_usage correctly delegates to the inherent helper
Exposing get_usage through DeltaGeneratorExt lets upstream components access finalized usage for metrics without consuming the generator, while still honoring is_usage_enabled for what is sent back to clients.

lib/llm/src/protocols/openai/completions/delta.rs (3)
203-218: create_usage_chunk correctly reuses get_usage() for final usage-only chunk
Using self.get_usage() here ensures the final usage-only completion chunk and any server-side metrics readers see a consistent view of prompt/completion/total tokens without duplicating aggregation logic.
237-247: Confirm BackendOutput::token_ids semantics for completion token aggregation
self.usage.completion_tokens is incremented by delta.token_ids.len() on every streamed chunk, regardless of enable_usage. That's exactly right if delta.token_ids only contains the new tokens for that chunk; if the backend instead reports a cumulative list of token IDs per request, this would double-count completion tokens across chunks. Since this value now feeds both client-visible usage (when enabled) and internal metrics, it's worth double-checking the BackendOutput::token_ids contract and adjusting to use a per-chunk delta or backend-reported completion_usage.completion_tokens if necessary.
Also applies to: 250-255
320-322: DeltaGeneratorExt::get_usage wiring looks correct
Delegating DeltaGeneratorExt::get_usage to the inherent DeltaGenerator::get_usage keeps the trait surface small and ensures all callers share the same finalized usage computation.
Signed-off-by: Vladislav Nosivskoy <[email protected]>
/ok to test 5773040
Overview:
This PR implements the collection of the cached_sequence_length metric (prefix cache hits) and refactors the underlying token usage accounting mechanism.
Previously, the server focused on observing ISL and OSL metrics, which could be calculated incrementally during request processing. However, since the final usage chunk was optional (dependent on client request), it was impossible to consistently track metrics that are typically only available in the final completion_usage, such as cached_tokens, audio_tokens, or reasoning_tokens. To address this, the pipeline now ensures a final event containing an annotation with all finalized metrics is always generated. To maintain the user contract regarding optional usage reporting, these events can be "empty" (payload-free) but annotated. These empty events are captured by the metrics observer and subsequently filtered out, ensuring they are not transmitted to the client.
Details:
New Metric: cached_sequence_length
Added a generic cached_sequence_length histogram. This metric tracks the number of cached tokens (prefix cache hits) per request.
"Always-On" Token Accounting
Modified DeltaGenerator implementations in protocols/openai/... to always aggregate token usage internally, removing the condition if self.options.enable_usage.
Added get_usage() method to the DeltaGeneratorExt trait to safely retrieve statistics without consuming the generator state.
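A rough shape of that trait addition (simplified usage type; the real trait in protocols/openai.rs has further methods and bounds):

```rust
#[derive(Clone, Debug, Default)]
struct CompletionUsage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

// Simplified view of the trait extension: get_usage takes &self, so callers
// (preprocessor, metrics observers) can read finalized stats without
// consuming the generator, and the method stays object-safe.
trait DeltaGeneratorExt {
    fn get_usage(&self) -> CompletionUsage;
}

#[derive(Default)]
struct CompletionsDeltaGenerator {
    usage: CompletionUsage,
}

impl DeltaGeneratorExt for CompletionsDeltaGenerator {
    fn get_usage(&self) -> CompletionUsage {
        let mut usage = self.usage.clone();
        usage.total_tokens = usage.prompt_tokens.saturating_add(usage.completion_tokens);
        usage
    }
}

fn main() {
    // Works through a trait object, as the stream pipeline requires.
    let generator: Box<dyn DeltaGeneratorExt> = Box::new(CompletionsDeltaGenerator {
        usage: CompletionUsage { prompt_tokens: 7, completion_tokens: 3, total_tokens: 0 },
    });
    assert_eq!(generator.get_usage().total_tokens, 10);
}
```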
"Ghost Event" Mechanism for Metrics
Updated preprocessor.rs to always generate a final stream event containing LLMMetricAnnotation (including cached tokens and final ISL/OSL).
If the client did not request usage statistics (stream_options: {"include_usage": false}), this event contains data: None but still carries the metric annotation.
Updated process_response_using_event_converter_and_observe_metrics in metrics.rs to return Result<Option<Event>>. It returns Ok(None) for these "ghost events" after scraping the metrics.
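A simplified sketch of that observer contract; the types and field layout below are placeholders, whereas the real function works on Annotated responses and the per-request metrics collector:

```rust
// Simplified stand-ins for the annotated SSE event and the metrics collector.
#[derive(Debug)]
struct AnnotatedEvent {
    data: Option<String>,         // client-visible SSE payload
    event: Option<String>,        // SSE event name, if any
    cached_tokens: Option<usize>, // scraped from the LLM metrics annotation
}

#[derive(Default, Debug)]
struct Metrics {
    cached_tokens_observed: Option<usize>,
}

fn process_and_observe(
    mut event: AnnotatedEvent,
    metrics: &mut Metrics,
) -> Result<Option<AnnotatedEvent>, String> {
    // Scrape metrics from every event, ghost or not, and strip the annotation
    // so it never reaches the client.
    if metrics.cached_tokens_observed.is_none() {
        metrics.cached_tokens_observed = event.cached_tokens.take();
    }

    // A frame with no payload and no SSE event was only carrying metrics,
    // so it is dropped from the client stream.
    if event.data.is_none() && event.event.is_none() {
        return Ok(None);
    }
    Ok(Some(event))
}

fn main() {
    let mut metrics = Metrics::default();
    let ghost = AnnotatedEvent { data: None, event: None, cached_tokens: Some(64) };
    assert!(process_and_observe(ghost, &mut metrics).unwrap().is_none());
    assert_eq!(metrics.cached_tokens_observed, Some(64));
}
```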
Stream Filtering in HTTP Layer
Refactored the stream handling in openai.rs (for chat and completions) to use .filter_map.
This efficiently filters out the "ghost events" (where data is None) so that the HTTP client receives a standard SSE stream without empty frames, while the server side still captures the metrics.
Where should the reviewer start?
- lib/llm/src/protocols/openai/chat_completions/delta.rs: Notice that token aggregation is no longer conditional.
- lib/llm/src/preprocessor.rs: Look at the transform_postprocessor_stream logic.
- lib/llm/src/http/service/metrics.rs: Check the process_response_... signature change to Result<Option> and the new cached_sequence_length histogram.
- lib/llm/src/http/service/openai.rs: Review the filter_map implementation for streams.

Related Issues:
Relates to observability of additional token statistics.
Summary by CodeRabbit
Release Notes
New Features
- Added a cached_sequence_length histogram tracking cached tokens (prefix cache hits) per request.

Documentation
- Documented the new dynamo_frontend_cached_sequence_length metric in docs/observability/metrics.md.