
Conversation


@PeaBrane PeaBrane commented Oct 30, 2025

Overview:

This change introduces a new router configuration flag, active_kv_reuse (controlled via the --router-with-cascade-attention CLI flag), that determines whether the router tracks KV reuse during the decode phase. The flag defaults to false, since most inference engines do not currently implement cascade attention or make maximal use of KV reuse during decoding (beyond implicit L2 caching). When disabled, the router generates random hashes for sequence blocks during decode, preventing deduplication and avoiding the overhead of tracking active blocks that won't be reused.

For backends that do implement cascade attention (which can effectively reuse KV cache during decode), users can enable --router-with-cascade-attention to track actual sequence hashes and optimize routing decisions based on decode-time KV reuse. This maintains backwards compatibility while providing an opt-in path for advanced backends to benefit from KV-aware routing during both prefill and decode phases.
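To make the default behavior concrete, here is a minimal sketch of the decode-time hashing decision, assuming simplified stand-in types. Only the active_kv_reuse flag and the random-hash masking come from this change; the decode_block_hashes helper and the struct shape are hypothetical:

```rust
// Sketch only; requires the `uuid` crate with the `v4` feature.
use uuid::Uuid;

/// Simplified stand-in for the router's per-worker sequence tracker.
struct ActiveSequences {
    active_kv_reuse: bool,
}

impl ActiveSequences {
    /// Hypothetical helper: decide which block hashes to track for a request
    /// during decode.
    fn decode_block_hashes(&self, sequence_hashes: Vec<u64>) -> Vec<u64> {
        if self.active_kv_reuse {
            // Cascade-attention backends: keep the real hashes so overlapping
            // sequences can be deduplicated and credited as reuse.
            sequence_hashes
        } else {
            // Default: replace every hash with a random 64-bit value, so no
            // two requests ever share a tracked block during decode.
            sequence_hashes
                .iter()
                .map(|_| (Uuid::new_v4().as_u128() & 0xFFFF_FFFF_FFFF_FFFF) as u64)
                .collect()
        }
    }
}

fn main() {
    let tracker = ActiveSequences { active_kv_reuse: false };
    let hashes = tracker.decode_block_hashes(vec![1, 2, 3]);
    assert_eq!(hashes.len(), 3); // same block count, but hashes are request-unique
}
```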

Summary by CodeRabbit

Release Notes

  • New Features
    • Added the command-line flag --router-with-cascade-attention to enable KV reuse tracking during the decode phase for cascade attention backends, improving routing optimization decisions.
  • Documentation
    • Updated the documentation to describe the new cascade attention configuration option and its impact on decode-time KV reuse tracking.

Signed-off-by: PeaBrane <[email protected]>
@github-actions github-actions bot added the feat label Oct 30, 2025
@PeaBrane PeaBrane marked this pull request as ready for review October 30, 2025 18:48
@PeaBrane PeaBrane requested review from a team as code owners October 30, 2025 18:48

coderabbitai bot commented Oct 30, 2025

Walkthrough

This change adds support for a new CLI flag --router-with-cascade-attention that enables KV reuse tracking during the decode phase for cascade attention backends. The parameter is threaded through multiple configuration layers: CLI parsing, router config structures, scheduler initialization, and down to sequence tracking logic.

Changes

Cohort / File(s): Summary

  • CLI and Documentation (components/src/dynamo/frontend/main.py, docs/router/kv_cache_routing.md): Adds the new CLI flag --router-with-cascade-attention mapped to the active_kv_reuse boolean, and documents the flag's behavior for enabling KV reuse tracking during the decode phase.
  • C Bindings (lib/bindings/c/src/lib.rs): Extends dynamo_create_worker_selection_pipeline to pass Some(false) as the new active_kv_reuse parameter to KvRouterConfig construction.
  • Python Bindings (lib/bindings/python/rust/llm/entrypoint.rs): Updates the KvRouterConfig::new signature to accept an active_kv_reuse: bool parameter and propagates it to the inner RsKvRouterConfig construction.
  • Core Router Configuration (lib/llm/src/kv_router.rs): Adds a pub active_kv_reuse: bool field to KvRouterConfig with a default of false, updates the constructor signature to accept an optional active_kv_reuse parameter, and threads it through to the KvScheduler::start call (see the sketch after this list).
  • Scheduler Initialization (lib/llm/src/kv_router/scheduler.rs): Extends the KvScheduler::start method signature with the new active_kv_reuse: bool parameter and forwards it to ActiveSequencesMultiWorker construction.
  • Sequence Tracking Logic (lib/llm/src/kv_router/sequence.rs): Adds a pub active_kv_reuse: bool field to both ActiveSequences and ActiveSequencesMultiWorker, implements conditional logic in add_request that replaces sequence blocks with random hashes when active_kv_reuse is false, and updates constructors, method signatures, and test scaffolding throughout to accept and propagate the parameter.
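As a rough illustration of the threading described above, the sketch below mirrors the call chain from KvRouterConfig through KvScheduler::start to ActiveSequencesMultiWorker::new. The signatures are inferred from this summary and heavily simplified; they are assumptions for illustration, not the actual code:

```rust
/// Simplified config carrying the new flag; the real KvRouterConfig has many more fields.
#[derive(Clone, Default)]
struct KvRouterConfig {
    /// Track KV reuse during decode (cascade attention backends only). Defaults to false.
    active_kv_reuse: bool,
}

impl KvRouterConfig {
    /// Constructor accepting an optional override, mirroring the pattern described above.
    fn new(active_kv_reuse: Option<bool>) -> Self {
        Self {
            active_kv_reuse: active_kv_reuse.unwrap_or(false),
        }
    }
}

struct ActiveSequencesMultiWorker {
    active_kv_reuse: bool,
}

impl ActiveSequencesMultiWorker {
    fn new(active_kv_reuse: bool) -> Self {
        Self { active_kv_reuse }
    }
}

struct KvScheduler {
    sequences: ActiveSequencesMultiWorker,
}

impl KvScheduler {
    /// The flag is forwarded from the router config down to sequence tracking.
    fn start(active_kv_reuse: bool) -> Self {
        Self {
            sequences: ActiveSequencesMultiWorker::new(active_kv_reuse),
        }
    }
}

fn main() {
    // --router-with-cascade-attention on the CLI maps to active_kv_reuse = true.
    let config = KvRouterConfig::new(Some(true));
    let scheduler = KvScheduler::start(config.active_kv_reuse);
    assert!(scheduler.sequences.active_kv_reuse);
}
```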

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Large set of homogeneous changes following a consistent pattern (adding the same boolean parameter across multiple files)
  • The logic in lib/llm/src/kv_router/sequence.rs where sequence blocks are conditionally replaced with random hashes warrants closer inspection to verify correctness
  • Verify that test coverage adequately exercises the new active_kv_reuse=true code path, particularly the sequence hashing behavior

Poem

🐰 A flag hops forth through layers deep,
From CLI args to schedulers' keep,
KV reuse now tracked with care,
Cascade attention blooming fair! 🌸

Pre-merge checks

❌ Failed checks (1 warning)
• Description Check: ⚠️ Warning.
  Explanation: The PR description only fills in the "Overview" section of the required template and leaves three other required sections empty: "Details" (specific technical changes), "Where should the reviewer start?" (file-level guidance), and "Related Issues" (GitHub issue references). The Overview section is well written and clearly explains the feature's purpose, design rationale, and backwards compatibility, but the description addresses only 1 of 4 template sections and falls short of the repository's documentation standards.
  Resolution: Add the missing template sections: (1) Details: summarize the specific changes across files, such as the new CLI flag in main.py, the KvRouterConfig modifications in lib/llm/src/kv_router.rs, and the related scheduler and sequence changes. (2) Where should the reviewer start: recommend key files to review first (e.g., lib/llm/src/kv_router.rs for the core flag introduction). (3) Related Issues: reference any GitHub issue(s) this PR addresses with action keywords like "Closes #xxx".
✅ Passed checks (2 passed)
• Title Check: ✅ Passed. The title "feat: do not consider kv reuse for decoding for Router" is related to the changeset and refers to a genuine aspect of the change, specifically the default behavior where KV reuse is not tracked during decoding. However, it does not fully capture the main point of this PR, which is adding a new configurable flag (active_kv_reuse via --router-with-cascade-attention) that lets users control this behavior. The title reads as a behavioral change rather than clearly indicating a feature addition that introduces a new configuration option. It is still a valid description of a real aspect of the work, but it obscures the primary intent of making KV reuse tracking optional and configurable.
• Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
lib/llm/src/kv_router/sequence.rs (1)

143-152: Verify: Random hash generation approach for disabling KV reuse.

When active_kv_reuse is false, the code generates random hashes using Uuid::new_v4().as_u128() & 0xFFFFFFFFFFFFFFFF to prevent block deduplication. This approach ensures blocks are never reused during decode.

A few points to verify:

  1. Is this the intended approach? An alternative could be to skip sequence tracking entirely when active_kv_reuse is false (passing None for token_sequence), rather than replacing with random hashes.

  2. Memory overhead: Generating random hashes still creates entries in unique_blocks map. Would it be more efficient to skip tracking altogether?

  3. Test coverage: All tests currently pass active_kv_reuse=true (lines 927, 993, 1001, 1152, 1160), leaving the random hash path untested. Consider adding a test that verifies the behavior when active_kv_reuse=false to ensure blocks are not deduplicated.
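On point 3, a test along the following lines could exercise the disabled path. The randomize_hashes helper below is an illustrative stand-in for the masking applied in add_request, not the project's actual test API:

```rust
// Sketch only; requires the `uuid` crate with the `v4` feature.
use std::collections::HashSet;
use uuid::Uuid;

/// Illustrative stand-in for the hash masking applied when active_kv_reuse is false.
fn randomize_hashes(sequence_hashes: &[u64]) -> Vec<u64> {
    sequence_hashes
        .iter()
        .map(|_| (Uuid::new_v4().as_u128() & 0xFFFF_FFFF_FFFF_FFFF) as u64)
        .collect()
}

#[test]
fn randomized_hashes_do_not_deduplicate() {
    // Two requests that share an identical prefix.
    let request_a = vec![11_u64, 22, 33];
    let request_b = vec![11_u64, 22, 33];

    let tracked_a = randomize_hashes(&request_a);
    let tracked_b = randomize_hashes(&request_b);

    // With randomization, the tracked blocks of the two requests should not
    // collide, so the router sees no shared (reusable) blocks during decode.
    let unique: HashSet<u64> = tracked_a.iter().chain(tracked_b.iter()).copied().collect();
    assert_eq!(unique.len(), tracked_a.len() + tracked_b.len());
}
```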

Based on learnings

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb3a486 and 7caa794.

📒 Files selected for processing (7)
  • components/src/dynamo/frontend/main.py (2 hunks)
  • docs/router/kv_cache_routing.md (1 hunks)
  • lib/bindings/c/src/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm/entrypoint.rs (3 hunks)
  • lib/llm/src/kv_router.rs (5 hunks)
  • lib/llm/src/kv_router/scheduler.rs (3 hunks)
  • lib/llm/src/kv_router/sequence.rs (13 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3184
File: docs/architecture/kv_cache_routing.md:70-73
Timestamp: 2025-09-23T20:08:37.105Z
Learning: PeaBrane prefers to keep documentation diagrams simplified to avoid visual overload, even when this means sacrificing some technical precision for the sake of clarity and comprehension. They prioritize pedagogical effectiveness over exhaustive technical detail in architectural diagrams.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/subscriber.rs:200-223
Timestamp: 2025-09-17T20:55:41.416Z
Learning: In the dynamo codebase, PeaBrane prefers to maintain consistency with existing etcd key parsing patterns (like splitting on '/' and parsing the last segment) rather than introducing more robust parsing approaches, even when the current approach might be brittle, to keep the codebase aligned and avoid divergent patterns.
📚 Learning: 2025-08-29T10:08:18.434Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.

Applied to files:

  • lib/bindings/c/src/lib.rs
  • lib/llm/src/kv_router/scheduler.rs
  • components/src/dynamo/frontend/main.py
  • lib/bindings/python/rust/llm/entrypoint.rs
  • lib/llm/src/kv_router.rs
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.

Applied to files:

  • lib/llm/src/kv_router/scheduler.rs
  • lib/llm/src/kv_router.rs
  • lib/llm/src/kv_router/sequence.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

  • lib/llm/src/kv_router.rs
  • lib/llm/src/kv_router/sequence.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

  • lib/llm/src/kv_router/sequence.rs
📚 Learning: 2025-10-14T00:58:05.744Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3597
File: lib/llm/src/kv_router/indexer.rs:437-441
Timestamp: 2025-10-14T00:58:05.744Z
Learning: In lib/llm/src/kv_router/indexer.rs, when a KvCacheEventData::Cleared event is received, the system intentionally clears all dp_ranks for the given worker_id by calling clear_all_blocks(worker.worker_id). This is the desired behavior and should not be scoped to individual dp_ranks.

Applied to files:

  • lib/llm/src/kv_router/sequence.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane suggested using tokio::select! arm ordering with the existing biased directive in the indexer to create a natural barrier for dump requests, ensuring KV events are drained before snapshotting. This approach leverages existing architecture (biased select) to solve race conditions with minimal code changes, which aligns with their preference for contained solutions.

Applied to files:

  • lib/llm/src/kv_router/sequence.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Applied to files:

  • lib/llm/src/kv_router/sequence.rs
🧬 Code graph analysis (2)
components/src/dynamo/frontend/main.py (1)
lib/llm/src/kv_router.rs (1)
  • default (123-134)
lib/llm/src/kv_router/sequence.rs (2)
lib/bindings/python/rust/llm/entrypoint.rs (3)
  • new (46-68)
  • new (83-93)
  • new (133-179)
lib/llm/src/kv_router.rs (4)
  • new (141-164)
  • new (227-322)
  • new (489-494)
  • block_size (419-421)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
docs/router/kv_cache_routing.md (1)

178-184: LGTM! Clear documentation for the new cascade attention flag.

The documentation accurately describes the new --router-with-cascade-attention flag and its purpose. The explanation of default behavior (KV reuse not tracked during decode) and the use case (cascade attention backends) is clear.

lib/bindings/python/rust/llm/entrypoint.rs (1)

45-65: LGTM! Python bindings correctly expose active_kv_reuse.

The parameter is properly added to the Python KvRouterConfig constructor with the correct default value (false) and is correctly threaded through to the inner RsKvRouterConfig.
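For comparison, a keyword-default binding of roughly this shape keeps existing Python callers working unchanged. The struct, field set, and defaults below are placeholders for illustration, not the real entrypoint.rs code:

```rust
use pyo3::prelude::*;

/// Placeholder for the Rust-side config (RsKvRouterConfig in the real code).
#[derive(Clone, Default)]
struct InnerKvRouterConfig {
    active_kv_reuse: bool,
}

/// Python-facing wrapper: the new keyword argument defaults to false, so
/// KvRouterConfig() without arguments preserves the old behavior.
#[pyclass]
#[derive(Clone)]
struct KvRouterConfig {
    inner: InnerKvRouterConfig,
}

#[pymethods]
impl KvRouterConfig {
    #[new]
    #[pyo3(signature = (active_kv_reuse = false))]
    fn new(active_kv_reuse: bool) -> Self {
        Self {
            inner: InnerKvRouterConfig { active_kv_reuse },
        }
    }
}
```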

components/src/dynamo/frontend/main.py (1)

178-184: LGTM! CLI flag properly implemented.

The new --router-with-cascade-attention flag is correctly configured with appropriate defaults and help text. The flag is properly threaded into KvRouterConfig construction at line 281.

lib/llm/src/kv_router/scheduler.rs (1)

96-132: LGTM! Scheduler correctly propagates active_kv_reuse.

The parameter is properly threaded from KvScheduler::start through to ActiveSequencesMultiWorker::new. The clippy annotation for too_many_arguments is appropriate given the additional parameter.

lib/llm/src/kv_router.rs (1)

116-119: LGTM! Core config properly implements active_kv_reuse.

The new field is well-documented, has the correct default value (false), and is properly threaded through KvRouterConfig::new() to KvScheduler::start(). The implementation is clean and consistent with existing config patterns.

Also applies to: 132-132, 149-149, 162-162, 290-290

lib/llm/src/kv_router/sequence.rs (2)

73-75: LGTM! Field properly added to ActiveSequences.

The active_kv_reuse field is correctly added with a getter and properly initialized through the constructor.

Also applies to: 80-80, 92-92


306-316: LGTM! ActiveSequencesMultiWorker properly propagates active_kv_reuse.

The parameter is correctly threaded through the multi-worker initialization, worker spawning, and worker updates. The implementation ensures all workers are created with the consistent active_kv_reuse setting.

Also applies to: 340-341, 388-388, 404-404, 622-622

Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>