
Conversation

@zhongxuanwang-nv
Member

@zhongxuanwang-nv zhongxuanwang-nv commented Nov 21, 2025

Overview:

This PR wires up accurate “cached_tokens” reporting from vLLM into Dynamo’s responses (the usage.prompt_tokens_details field). This stays OpenAI-compatible, since prompt_tokens_details.cached_tokens is part of the OpenAI API spec.

Details:

Added:

  • Expose router-estimated overlap blocks in vLLM response metadata via disaggregated_params["overlap_blocks"] in components/src/dynamo/vllm/handlers.py.
  • Combine router overlap and vLLM-reported cached token stats into a unified PromptTokensDetails.cached_tokens in lib/llm/src/kv_router/prefill_router.rs.

Changed:

  • _build_completion_usage now always emits prompt_tokens_details.cached_tokens when num_cached_tokens is non-negative (including 0), instead of treating 0 as falsy and omitting the field; see the sketch after this list. (components/src/dynamo/vllm/handlers.py)
  • generate handler now constructs a single disaggregated_params dict carrying both kv_transfer_params and overlap_blocks, ensuring downstream consumers see both router overlap and KV transfer metadata. (components/src/dynamo/vllm/handlers.py)
  • PrefillRouter::disabled now requires and stores block_size, wiring it from card.kv_cache_block_size in the engine input path so disabled routers still know the KV block size. (lib/llm/src/entrypoint/input/common.rs, lib/llm/src/kv_router/prefill_router.rs)
  • PrefillRouter::call_prefill signature extended to accept block_size and return overlap_blocks alongside the existing result and worker id, allowing it to compute cached token counts consistently. (lib/llm/src/kv_router/prefill_router.rs)
  • Prefill success path now prefers vLLM’s prompt_tokens_details.cached_tokens when available and falls back to overlap_blocks * block_size, then rewrites PromptTokensDetails to include a definitive cached_tokens while preserving any audio_tokens. (lib/llm/src/kv_router/prefill_router.rs)
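
As a reference for reviewers, here is a minimal Python sketch of the two handler-side behaviors above. The names mirror _build_completion_usage, num_cached_tokens, kv_transfer_params, and overlap_blocks from the PR, but the standalone signatures and dict shapes are simplifications for illustration, not the literal implementation in components/src/dynamo/vllm/handlers.py:

```python
from typing import Any, Optional


def _build_completion_usage(prompt_tokens: int,
                            completion_tokens: int,
                            num_cached_tokens: Optional[int]) -> dict[str, Any]:
    """Build an OpenAI-style usage dict; a cached_tokens value of 0 is still emitted."""
    usage: dict[str, Any] = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    # Explicit None / negative check: a plain truthiness test would drop 0.
    if num_cached_tokens is not None and num_cached_tokens >= 0:
        usage["prompt_tokens_details"] = {"cached_tokens": num_cached_tokens}
    return usage


def _build_disaggregated_params(kv_transfer_params: Optional[dict[str, Any]],
                                overlap_blocks: int) -> dict[str, Any]:
    """Single dict carrying both KV transfer metadata and the router-estimated overlap."""
    params: dict[str, Any] = {"overlap_blocks": overlap_blocks}
    if kv_transfer_params is not None:
        params["kv_transfer_params"] = kv_transfer_params
    return params
```

For example, _build_completion_usage(100, 20, 0) still includes prompt_tokens_details with cached_tokens equal to 0, which is exactly the zero-value case the previous truthiness check dropped.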

Breaking changes / Migrations:

  • Internal-only signature change for PrefillRouter::disabled and PrefillRouter::call_prefill (now requires block_size and returns overlap_blocks); all in-tree call sites are updated. External APIs and JSON schema remain backward compatible.

Where should the reviewer start?

  1. lib/llm/src/kv_router/prefill_router.rs
    This is where cached_tokens is computed, where overlap_blocks is introduced, and where the rewrite of PromptTokensDetails happens. Understanding this file explains why the handler changes are needed.

  2. components/src/dynamo/vllm/handlers.py
    Once the router logic is clear, the handler updates (surfacing overlap_blocks, merging disaggregated_params, and always emitting cached_tokens) make immediate sense.

  3. lib/llm/src/entrypoint/input/common.rs
    Very small but essential: this wires block_size into PrefillRouter::disabled, enabling the router to compute fallback cached-token counts.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • Relates to Dynamo Requirements doc from NAT team

Summary by CodeRabbit

  • Bug Fixes

    • Fixed token detail construction to properly handle zero and false-like values in usage reporting.
    • Improved accuracy of cached token calculations in usage metrics.
  • Improvements

    • Enhanced KV cache block overlap tracking and propagation through request processing.
    • Updated token usage reporting to include computed cache overlap information.


@copy-pr-bot

copy-pr-bot bot commented Nov 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi zhongxuanwang-nv! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added the external-contribution (Pull request is from an external contributor) and feat labels Nov 21, 2025
@zhongxuanwang-nv zhongxuanwang-nv changed the title from "feat: Cached Token Stats reporting" to "feat: cached prompt tokens reporting in vLLM" Nov 21, 2025
Signed-off-by: Zhongxuan Wang <[email protected]>
Signed-off-by: Zhongxuan Wang <[email protected]>
@zhongxuanwang-nv
Member Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Nov 21, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai bot commented Nov 21, 2025

Walkthrough

The changes introduce block_size tracking and overlap_blocks propagation through the prefill router infrastructure, spanning Python vLLM handlers and Rust router implementations. Method signatures are updated to accept and return block_size and overlap information, enabling more precise KV cache utilization tracking and completion token usage reporting.

Changes

  • Prefill Router Infrastructure (lib/llm/src/entrypoint/input/common.rs, lib/llm/src/kv_router/prefill_router.rs)
    Added block_size field (u32) to PrefillRouter and wired through construction and activation paths. Updated PrefillRouter::disabled signature to accept block_size parameter. Extended call_prefill to accept block_size parameter and return overlap_blocks (u32) alongside existing results. Implemented logic to compute final cached_tokens from overlap_blocks and block_size, or use vLLM-provided value if available, then embed into PromptTokensDetails.
  • vLLM Handler Updates (components/src/dynamo/vllm/handlers.py)
    Modified completion usage construction to only include prompt_tokens_details when num_cached_tokens is not None and >= 0. Updated PrefillWorkerHandler.generate to extract overlap_blocks from request and propagate through disaggregated_params alongside kv_transfer_params. Output construction now consistently uses disaggregated_params instead of conditionally including kv_transfer_params.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Method signature changes: Review updated PrefillRouter::disabled and call_prefill signatures for correctness across all call sites
  • Cached tokens computation logic: Verify the fallback logic correctly prioritizes vLLM-provided cached_tokens over computed overlap_blocks * block_size (a sketch of this prioritization follows this list)
  • Data propagation: Ensure overlap_blocks is correctly threaded from vLLM handlers through disaggregated_params to prefill router and back to completion usage construction
  • Type consistency: Confirm u32 overflow handling is appropriate for overlap_blocks and block_size multiplication
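
For reference, the prioritization called out in the second bullet can be sketched as follows. This is written in Python for readability even though the actual logic lives in lib/llm/src/kv_router/prefill_router.rs; the function name and the clamp that stands in for bounded u32 arithmetic are illustrative assumptions:

```python
from typing import Optional

U32_MAX = 2**32 - 1


def resolve_cached_tokens(vllm_cached_tokens: Optional[int],
                          overlap_blocks: int,
                          block_size: int) -> int:
    """Prefer the engine-reported count; otherwise estimate it from router overlap."""
    if vllm_cached_tokens is not None:
        # vLLM's own report wins, including an explicit 0.
        return vllm_cached_tokens
    # Fallback estimate; clamp to mimic keeping the result within u32 range.
    return min(overlap_blocks * block_size, U32_MAX)
```

For example, resolve_cached_tokens(None, overlap_blocks=3, block_size=16) yields 48, while an engine-reported 0 is passed through unchanged.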

Poem

🐰 A block-size we measure, through routers it flows,
Overlap blocks counted as the cache memory grows,
From Python to Rust, we compute tokens with care,
KV cache precision now answered by layer! ✨

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 42.86%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title 'feat: cached prompt tokens reporting in vLLM' clearly and concisely captures the main objective of the PR: adding cached token reporting functionality for vLLM.
  • Description check: ✅ Passed. The description follows the template structure with all required sections (Overview, Details, Where should the reviewer start, Related Issues) and provides comprehensive information about the changes.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
lib/llm/src/kv_router/prefill_router.rs (1)

186-187: Consider removing or documenting the unused _block_size parameter.

The _block_size parameter is currently unused in call_prefill. If it's reserved for future use or API consistency, consider adding a comment explaining its purpose. Otherwise, it can be removed since self.block_size is already available.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36f58e3 and def2b10.

📒 Files selected for processing (3)
  • components/src/dynamo/vllm/handlers.py (3 hunks)
  • lib/llm/src/entrypoint/input/common.rs (1 hunks)
  • lib/llm/src/kv_router/prefill_router.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
📚 Learning: 2025-05-29T00:02:35.018Z
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Applied to files:

  • lib/llm/src/entrypoint/input/common.rs
  • lib/llm/src/kv_router/prefill_router.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (.)
  • GitHub Check: tests (lib/bindings/python)
🔇 Additional comments (7)
lib/llm/src/entrypoint/input/common.rs (1)

270-272: LGTM! Clean integration of block_size parameter.

The addition of block_size from the card and its propagation to PrefillRouter::disabled correctly wires the KV cache block size through the prefill routing infrastructure.

lib/llm/src/kv_router/prefill_router.rs (3)

60-60: LGTM!

The block_size field addition correctly enables block-aware cached token computation.


65-72: LGTM! Signature extension is consistent.

The addition of block_size to the disabled constructor correctly supports the new caching logic. Internal call sites are updated per the PR summary.


314-314: LGTM!

The call_prefill invocation correctly passes self.block_size.

components/src/dynamo/vllm/handlers.py (3)

214-216: LGTM! Critical fix for zero-value reporting.

The explicit is not None and >= 0 check correctly ensures that cached_tokens=0 is included in the response, whereas the previous truthiness check would have incorrectly omitted it.


353-354: LGTM!

The extraction of overlap_blocks from the request correctly defaults to 0 when not present.


398-409: LGTM! Unified disaggregated_params construction.

The refactored logic correctly builds disaggregated_params to include both kv_transfer_params (when present) and overlap_blocks, providing complete metadata to downstream consumers.

Signed-off-by: Zhongxuan Wang <[email protected]>
@zhongxuanwang-nv zhongxuanwang-nv marked this pull request as ready for review November 21, 2025 06:23
@zhongxuanwang-nv zhongxuanwang-nv requested review from a team as code owners November 21, 2025 06:23
@zhongxuanwang-nv
Member Author

/ok to test 671fceb

@zhongxuanwang-nv zhongxuanwang-nv removed the external-contribution (Pull request is from an external contributor) label Nov 21, 2025
@vladnosiv
Contributor

vladnosiv commented Nov 21, 2025

Hello! May I ask a couple of questions?

I would like to highlight a small point: many providers bill cached tokens differently from regular tokens, since they offer a discount for cache hits. The previous version of cached-token reporting always treated the inference engine's response as the only source of truth. This PR seems to start trusting the router, using its stats as a fallback. That potentially carries risk: Dynamo's usage responses may begin to overestimate actual cache hits, and the risk mainly concerns billing schemes built on top of these numbers. Could you please explain the reasoning behind this decision?

Thanks !

@zhongxuanwang-nv
Member Author

zhongxuanwang-nv commented Nov 21, 2025

Hi @vladnosiv, thanks for your attention! I agree that it's important to get the cached_tokens count right. Please correct me if I'm wrong: I believe the original code doesn't expose cached_tokens yet, and the fallback only kicks in when vLLM reports None, which might signal:

  • An internal vLLM error, even though the cache hit should still ideally have happened.
  • A multimodal request, because vLLM doesn't report cached_tokens for those even though they are accumulated.

So in these cases we fall back to the router's estimate, since it has a global view of the KV caches. It's not perfect either: cache evictions may have happened that the router doesn't know about, because it only maintains a snapshot that is updated after a request finishes, so the reported cached_tokens is an estimate.

Also, I think we should define clearly what cached_tokens should really represent here. If it's used only for billing, it makes sense to report cached_tokens only for tokens that reside on a GPU. However, Dynamo allows cache offloading and therefore has four different cache storage mediums (GPU memory, CPU memory, local disk, remote disk), and here a remote-disk KV cache hit counts as a hit even though its cost is far higher than a GPU-memory KV cache hit. I'm not sure how billing for cached tokens should generally be determined, but I feel billing should not rely solely on this cached_tokens stat and should also consider the different mediums.

But indeed that's an excellent point — I will talk to other Dynamo folks to see what they think.

@vladnosiv
Contributor

@zhongxuanwang-nv Thanks for the detailed answer!

As for the current code, it already returns cached tokens, and only returns prompt_tokens_details = None when there's no cache hit.

I think I actually missed the fact that replacement would only occur in these (None and multimodal) cases. The case with the multimodal response is interesting. I wasn't aware of this behavior in vLLM. The case with vLLM errors does look a bit like a silent failure, but that seems unlikely.

Then, my only nitpick is that if the vLLM frontend now starts returning "prompt_tokens_details" = { "cached_tokens": 0, "audio_tokens": null } instead of omitting the details, it might look a bit inconsistent with other engines.

Regarding billing, any cache hit, regardless of its level in the cache hierarchy, still saves GPU time, and the price billed for cached tokens usually far outweighs the cost of cache retrieval. But these are already deep details, I think :)

In any case, thanks, it's much clearer now!

@zhongxuanwang-nv
Member Author

zhongxuanwang-nv commented Nov 21, 2025

Thanks @vladnosiv too!

As for the current code, it already returns cached tokens, and only returns prompt_tokens_details = None when there's no cache hit.

Hmm, interesting. When I tested with vLLM in the past, cache hits were not reported even when they occurred. So this PR adds that reporting plus a fallback option, but I'm open to changing things around!

Then, my only nitpick is that if vLLM-frontend now starts returning "prompt_tokens_details" = { "cached_tokens": 0, "audio_tokens:": null } instead of missing details, it might look a bit inconsistent with other engines.

Yes! I was planning to make sure the other front-ends report this consistently, ideally by next week :)

Regarding billing, any cache hit, regardless of the level in the cache hierarchy, still saves GPU-time, and the cost of billing cached tokens is usually such that it significantly outweighs the cost of cache retrieval. But these are already deep details i think :)

I agree, so it sounds like you would also prefer the reported value to be the sum of cache hits across all tiers? I'm planning to implement that, and will soon open a PR that adds an optional field in nvext for reporting per-tier cache hit stats. :)

Thanks for your attention again!

@zhongxuanwang-nv zhongxuanwang-nv marked this pull request as draft November 22, 2025 00:25