feat: cached prompt tokens reporting in vLLM #4530
Conversation
Signed-off-by: Zhongxuan Wang <[email protected]>
@coderabbitai review |
✅ Actions performed: Review triggered.
Walkthrough

The changes introduce block_size tracking and overlap_blocks propagation through the prefill router infrastructure, spanning the Python vLLM handlers and the Rust router implementations. Method signatures are updated to accept and return block_size and overlap information, enabling more precise KV cache utilization tracking and completion token usage reporting.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
lib/llm/src/kv_router/prefill_router.rs (1)
186-187: Consider removing or documenting the unused `_block_size` parameter. The `_block_size` parameter is currently unused in `call_prefill`. If it's reserved for future use or API consistency, consider adding a comment explaining its purpose. Otherwise, it can be removed since `self.block_size` is already available.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- components/src/dynamo/vllm/handlers.py (3 hunks)
- lib/llm/src/entrypoint/input/common.rs (1 hunks)
- lib/llm/src/kv_router/prefill_router.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (.)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (.)
- GitHub Check: tests (lib/bindings/python)
🔇 Additional comments (7)
lib/llm/src/entrypoint/input/common.rs (1)
270-272: LGTM! Clean integration of the block_size parameter. The addition of `block_size` from the card and its propagation to `PrefillRouter::disabled` correctly wires the KV cache block size through the prefill routing infrastructure.
lib/llm/src/kv_router/prefill_router.rs (3)
60-60: LGTM! The `block_size` field addition correctly enables block-aware cached token computation.
65-72: LGTM! Signature extension is consistent. The addition of `block_size` to the `disabled` constructor correctly supports the new caching logic. Internal call sites are updated per the PR summary.
314-314: LGTM! The `call_prefill` invocation correctly passes `self.block_size`.
components/src/dynamo/vllm/handlers.py (3)
214-216: LGTM! Critical fix for zero-value reporting. The explicit `is not None and >= 0` check correctly ensures that `cached_tokens=0` is included in the response, whereas the previous truthiness check would have incorrectly omitted it.
353-354: LGTM! The extraction of `overlap_blocks` from the request correctly defaults to 0 when not present.
398-409: LGTM! Unified disaggregated_params construction. The refactored logic correctly builds `disaggregated_params` to include both `kv_transfer_params` (when present) and `overlap_blocks`, providing complete metadata to downstream consumers.
/ok to test 671fceb
Hello! May I ask a couple of questions? I would like to highlight a small point: many providers bill cached tokens differently from regular ones, as they offer a discount. The previous version always treated the inference engine's response as the only source of truth for cached tokens. The current PR seems to start trusting the router, treating its stats as a fallback. This potentially carries risks, as Dynamo's usage responses may start to overestimate actual cache hits. The risks are mainly associated with billing schemes built on top of that. Could you please explain the reasoning behind this decision? Thanks!
Hi @vladnosiv, thanks for your attention! I agree that it's important to get the `cached_tokens` reporting right.

So in these cases we fall back to the router's estimate, partly because it has a global view of the KV caches. But it's not perfect either: there might be cache evictions the router doesn't know about, because it only maintains a snapshot that updates after a request finishes, so the estimate can overcount.

Also, I think we should define well what the `cached_tokens` field is supposed to mean across engines.

But indeed that's an excellent point; I will talk to other Dynamo folks to see what they think.
@zhongxuanwang-nv Thanks for the detailed answer! As for the current code, it already returns cached tokens, and only falls back to the router's estimate in certain cases. I think I actually missed the fact that the replacement would only occur in these (None and multimodal) cases.

The multimodal case is interesting; I wasn't aware of this behavior in vLLM. The case with vLLM errors does look a bit like a silent failure, but that seems unlikely.

Then, my only nitpick is that if the vLLM frontend now starts returning cached tokens this way, the other frontends should report them consistently as well.

Regarding billing, any cache hit, regardless of the level in the cache hierarchy, still saves GPU time, and cached tokens are usually billed at a rate that significantly outweighs the cost of cache retrieval. But these are already deep details, I think :) In any case, thanks, it's much clearer now!
Thanks @vladnosiv too!

Hmm, interesting, because when I did testing with vLLM in the past, even though there were cache hits, they were not reported. So this PR adds that reporting, plus a fallback option, but I'm open to changing things around!

Yes! I was planning to make sure other front-ends have consistent reporting, ideally by next week :)

I agree, so it sounds like you would also prefer the reporting to be a sum of all tiers' cache hits? I'm planning to implement that, and will also soon make a PR that adds an optional field. Thanks for your attention again!
Overview:
This PR wires up accurate `cached_tokens` reporting from vLLM into Dynamo's responses (the `usage.prompt_tokens_details` field). It's OpenAI compatible because the field is also part of the OpenAI API spec.

Details:
Added:
- `disaggregated_params["overlap_blocks"]` in components/src/dynamo/vllm/handlers.py.
- `PromptTokensDetails.cached_tokens` in lib/llm/src/kv_router/prefill_router.rs.

Changed:
- `_build_completion_usage` now always emits `prompt_tokens_details.cached_tokens` when `num_cached_tokens` is non-negative (including 0), instead of treating 0 as falsy and omitting the field. (components/src/dynamo/vllm/handlers.py)
- The `generate` handler now constructs a single `disaggregated_params` dict carrying both `kv_transfer_params` and `overlap_blocks`, ensuring downstream consumers see both router overlap and KV transfer metadata. (components/src/dynamo/vllm/handlers.py)
- `PrefillRouter::disabled` now requires and stores `block_size`, wiring it from `card.kv_cache_block_size` in the engine input path so disabled routers still know the KV block size. (lib/llm/src/entrypoint/input/common.rs, lib/llm/src/kv_router/prefill_router.rs)
- The `PrefillRouter::call_prefill` signature is extended to accept `block_size` and return `overlap_blocks` alongside the existing result and worker id, allowing it to compute cached token counts consistently. (lib/llm/src/kv_router/prefill_router.rs)
- The router uses `prompt_tokens_details.cached_tokens` when available and falls back to `overlap_blocks * block_size`, then rewrites `PromptTokensDetails` to include a definitive `cached_tokens` while preserving any `audio_tokens`. (lib/llm/src/kv_router/prefill_router.rs)

Breaking changes / Migrations:
- The signatures of `PrefillRouter::disabled` and `PrefillRouter::call_prefill` change (they now require `block_size`, and `call_prefill` returns `overlap_blocks`); all in-tree call sites are updated. External APIs and the JSON schema remain backward compatible.
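The `PromptTokensDetails` rewrite described above can be sketched as follows. This is a Python approximation of the Rust logic (the real code operates on a typed struct, not a dict), with a hypothetical function name:

```python
def finalize_prompt_tokens_details(details, overlap_blocks, block_size):
    """Ensure the response carries a definitive cached_tokens value.

    Keep the engine's own cached_tokens when present; otherwise fall back
    to the router's block-overlap estimate. Other fields such as
    audio_tokens are preserved untouched."""
    details = dict(details or {})
    if details.get("cached_tokens") is None:
        details["cached_tokens"] = overlap_blocks * block_size
    return details
```

Preserving unrelated fields matters because `prompt_tokens_details` is a shared OpenAI-compatible structure, and the router should only ever add information, never discard what the engine reported.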
lib/llm/src/kv_router/prefill_router.rs
This is where cached_tokens is computed, where overlap_blocks is introduced, and where the rewrite of PromptTokensDetails happens. Understanding this file explains why the handler changes are needed.
components/src/dynamo/vllm/handlers.py
Once the router logic is clear, the handler updates (surfacing overlap_blocks, merging disaggregated_params, and always emitting cached_tokens) make immediate sense.
lib/llm/src/entrypoint/input/common.rs
Very small but essential: this wires block_size into PrefillRouter::disabled, enabling the router to compute fallback cached-token counts.