@karen-sy
Contributor

@karen-sy karen-sy commented Nov 20, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added optional local KV cache event buffering capability. When enabled, events are buffered locally with configurable size, and you can query recent events or view the full buffer before publication.
  • Chores

    • Added kvbm-hub as a new workspace member.

@github-actions github-actions bot added the feat label Nov 20, 2025
@karen-sy karen-sy marked this pull request as ready for review November 21, 2025 00:08
@karen-sy karen-sy requested a review from a team as a code owner November 21, 2025 00:08
@coderabbitai
Contributor

coderabbitai bot commented Nov 21, 2025

Walkthrough

This change adds a new workspace member lib/kvbm-hub, introduces a LocalKvIndexer type that wraps KvIndexer with event buffering capabilities, and integrates it into the event publisher to optionally buffer KV cache events locally before NATS publication.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Workspace Configuration: Cargo.toml | Adds lib/kvbm-hub to workspace members, enabling it to be built as part of the workspace. |
| KV Router Enhancement: lib/llm/src/kv_router/indexer.rs | Introduces LocalKvIndexer struct wrapping KvIndexer with a circular event buffer. Includes constructor, accessors (indexer(), get_recent_events(), get_all_events(), buffer_len()), buffer management methods, and KvIndexerInterface trait implementation that delegates to the underlying indexer while recording events. |
| Publisher Integration: lib/llm/src/kv_router/publisher.rs | Extends KvEventPublisher with optional local_indexer field and new new_with_local_indexer() constructor. Updates event processing loop to apply events to local indexer before NATS publication. Modifies start_event_processor() signature to accept optional LocalKvIndexer, threads it through task spawning, and expands test coverage for local-indexer scenarios. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • LocalKvIndexer trait delegation: Verify that trait method implementations correctly delegate to underlying indexer while maintaining buffer semantics
  • Async buffering mechanism: Confirm thread-safety of Arc/Mutex wrapped VecDeque and correctness of record_event() trimming logic
  • Conditional event flow: Review publisher integration to ensure events are consistently applied to local indexer when present and do not block NATS publication on local indexer failures
  • Test coverage: Validate new test scenarios exercise both enabled and disabled local-indexer paths, including edge cases around buffer capacity and worker registration

Poem

🐰 A buffer springs to life, events do queue,
Local indexing hops in something new,
Through publisher flows, our KV routers dance,
With optional buffering—oh what a chance!
The workspace grows, and cache blooms bright! 🌿

Pre-merge checks

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is entirely empty/placeholder content with no actual details about the changes, objectives, or related issues filled in. | Fill in the template sections with: a high-level overview of the feature, detailed description of the LocalKvIndexer wrapper and integration changes, guidance on reviewing Cargo.toml, indexer.rs, and publisher.rs, and specify the related GitHub issue number. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding a worker-local KvIndexer feature to KvEventPublisher, which aligns with the core functionality introduced across the modified files. |

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
lib/llm/src/kv_router/indexer.rs (1)

1031-1123: LocalKvIndexer buffering logic and wiring look correct overall

The wrapper correctly:

  • Keeps a bounded circular buffer of (WorkerId, KvCacheEvent) with VecDeque and trims from the front.
  • Records events before forwarding them to the underlying KvIndexer via event_sender().send(...).
  • Exposes helpers for recent/all events, clearing, and buffer length, which will be useful for debugging and tests.
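As an illustration of the buffering behavior listed above, here is a minimal sketch of a bounded buffer that trims from the front. `EventBuffer` and `Event` are hypothetical stand-ins, not the PR's `LocalKvIndexer` or `KvCacheEvent`, and the method names are assumptions chosen for the example.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical stand-ins for the PR's WorkerId and KvCacheEvent types.
type WorkerId = u64;

#[derive(Clone, Debug, PartialEq)]
struct Event {
    event_id: u64,
}

/// Bounded buffer that keeps only the most recent `max_size` events.
struct EventBuffer {
    max_size: usize,
    events: Mutex<VecDeque<(WorkerId, Event)>>,
}

impl EventBuffer {
    fn new(max_size: usize) -> Self {
        Self {
            max_size,
            events: Mutex::new(VecDeque::with_capacity(max_size)),
        }
    }

    /// Append an event, then drop the oldest entries while over the cap.
    fn record(&self, worker: WorkerId, event: Event) {
        let mut events = self.events.lock().unwrap();
        events.push_back((worker, event));
        while events.len() > self.max_size {
            events.pop_front();
        }
    }

    /// Return up to `n` of the newest events, oldest first.
    fn recent(&self, n: usize) -> Vec<(WorkerId, Event)> {
        let events = self.events.lock().unwrap();
        let skip = events.len().saturating_sub(n);
        events.iter().skip(skip).cloned().collect()
    }

    /// Number of events currently buffered.
    fn len(&self) -> usize {
        self.events.lock().unwrap().len()
    }
}
```

The lock is held only for the push and trim, which is the "tiny critical section" property the review relies on when discussing the choice of mutex.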

Two follow‑ups worth considering:

  1. Error handling in trait impls
  • apply_event_with_buffer surfaces IndexerOffline via Result, and start_event_processor logs this, which is good.
  • However, the KvIndexerInterface impl swallows errors:
    • async fn apply_event(&mut self, event: RouterEvent) ignores the Result from apply_event_with_buffer.
    • async fn remove_worker(&mut self, worker: WorkerId) ignores the Result from remove_worker_sender().send(worker).await.

This diverges from KvIndexer’s behavior (which unwrap()s and fails fast if the channels are broken). If the trait paths are ever used in production, you probably want at least a tracing::warn! on these error paths, or to mirror the “panic on impossible state” behavior.

  2. Minor ergonomics
  • Deref<Target = Arc<KvIndexer>> is usable (double‑deref to KvIndexer works), but Target = KvIndexer would be more natural and avoid exposing the Arc at the API surface.
  • Using tokio::sync::Mutex instead of std::sync::Mutex for event_buffer would avoid blocking the async scheduler, though this is likely low impact since the critical section is tiny.
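The `Deref` suggestion can be sketched roughly as follows. `Indexer` and `LocalIndexer` are hypothetical stand-ins for `KvIndexer` and `LocalKvIndexer`, and `block_size` is an invented field used only to have something to call through the wrapper.

```rust
use std::ops::Deref;
use std::sync::Arc;

// Hypothetical stand-in for the real KvIndexer.
struct Indexer {
    block_size: u32,
}

impl Indexer {
    fn block_size(&self) -> u32 {
        self.block_size
    }
}

// Hypothetical stand-in for LocalKvIndexer: it still stores an Arc internally.
struct LocalIndexer {
    inner: Arc<Indexer>,
}

impl Deref for LocalIndexer {
    // Targeting the indexer itself keeps the Arc out of the public surface.
    type Target = Indexer;

    fn deref(&self) -> &Indexer {
        &*self.inner
    }
}
```

With this shape, callers write `local.block_size()` directly and can borrow a plain `&Indexer` from the wrapper, while the wrapper remains free to change how it stores the inner value.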
lib/llm/src/kv_router/publisher.rs (3)

134-145: Local indexer construction is reasonable but has a couple of tunables

The local indexer creation is sound:

  • Uses the same CancellationToken as the publisher, so shutdown behavior is unified.
  • Uses KvIndexerMetrics::new_unregistered() to avoid polluting global metrics, which is defensible for a per‑worker helper.

Two knobs you might eventually want to expose:

  • max_buffer_size is hard‑coded as 100 with a TODO; consider threading this as a parameter on new_with_local_indexer (or via config) so it can be tuned per deployment.
  • If you ever want visibility into local indexer behavior in production, allowing injection of KvIndexerMetrics::from_component here would help.
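One possible way to thread the buffer size through the constructor is sketched below. `LocalIndexerConfig` and this simplified `Publisher` are hypothetical; the real `new_with_local_indexer` takes different arguments, and only the default of 100 comes from the review text.

```rust
/// Hypothetical configuration for the worker-local indexer.
#[derive(Clone, Copy, Debug, PartialEq)]
struct LocalIndexerConfig {
    max_buffer_size: usize,
}

impl Default for LocalIndexerConfig {
    fn default() -> Self {
        // Mirrors the value the review says is currently hard-coded.
        Self { max_buffer_size: 100 }
    }
}

// Hypothetical, heavily simplified publisher that only records the chosen cap.
struct Publisher {
    buffer_cap: Option<usize>,
}

impl Publisher {
    /// `None` disables the local indexer; `Some(config)` enables it with that cap.
    fn new_with_local_indexer(config: Option<LocalIndexerConfig>) -> Self {
        Self {
            buffer_cap: config.map(|c| c.max_buffer_size),
        }
    }
}
```

Folding the enable flag and the size into one optional config is a design choice of this sketch, not something the PR does.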

194-197: local_indexer accessor matches the internal representation

pub fn local_indexer(&self) -> Option<&Arc<LocalKvIndexer>> correctly exposes the optional local indexer for tests or higher‑level consumers.

If you later change Deref<Target> on LocalKvIndexer, you might also consider returning Option<&LocalKvIndexer> instead of the Arc, but that’s cosmetic.


1218-1279: BlockRemoved + local indexer test looks correct but could log indexer failures

The test_event_processor_block_removed_with_local_indexer test:

  • Sends a Stored event followed by a Removed event through the same channel.
  • Ensures:
    • The global publish path sees both events.
    • The local indexer’s find_matches returns no hits after the removal.

That correctly validates that KvCacheEventData::Removed is applied to the local indexer.

Given that LocalKvIndexer::remove_worker and apply_event trait methods currently ignore send errors, consider adding minimal logging there as well, to mirror the “log and continue” behavior used in start_event_processor. This isn’t test‑blocking but would help diagnose broken channels in non‑publisher call sites.
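A rough sketch of the suggested minimal logging, under stated assumptions: `std::sync::mpsc` stands in for the async channel, `eprintln!` stands in for `tracing::warn!`, and the `bool` return value exists only so the example can be checked. `LocalIndexer` here is hypothetical and is not the PR's type.

```rust
use std::sync::mpsc::{channel, Sender};

type WorkerId = u64;

// Hypothetical stand-in holding the sender side of a removal channel.
struct LocalIndexer {
    remove_tx: Sender<WorkerId>,
}

impl LocalIndexer {
    /// Forward a removal request; log instead of silently dropping a send error.
    fn remove_worker(&self, worker: WorkerId) -> bool {
        match self.remove_tx.send(worker) {
            Ok(()) => true,
            Err(err) => {
                // The real code would likely use tracing::warn! here.
                eprintln!("remove_worker({worker}) was not delivered: {err}");
                false
            }
        }
    }
}
```

Dropping the receiver makes the next send fail, which is the broken-channel case the comment above is concerned about.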

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f18b23 and a415a1a.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • Cargo.toml (1 hunks)
  • lib/llm/src/kv_router/indexer.rs (1 hunks)
  • lib/llm/src/kv_router/publisher.rs (10 hunks)
🧰 Additional context used
🧠 Learnings (8)
📚 Learning: 2025-10-14T00:58:05.744Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3597
File: lib/llm/src/kv_router/indexer.rs:437-441
Timestamp: 2025-10-14T00:58:05.744Z
Learning: In lib/llm/src/kv_router/indexer.rs, when a KvCacheEventData::Cleared event is received, the system intentionally clears all dp_ranks for the given worker_id by calling clear_all_blocks(worker.worker_id). This is the desired behavior and should not be scoped to individual dp_ranks.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
  • lib/llm/src/kv_router/publisher.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
  • lib/llm/src/kv_router/publisher.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane suggested using tokio::select! arm ordering with the existing biased directive in the indexer to create a natural barrier for dump requests, ensuring KV events are drained before snapshotting. This approach leverages existing architecture (biased select) to solve race conditions with minimal code changes, which aligns with their preference for contained solutions.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
  • lib/llm/src/kv_router/publisher.rs
📚 Learning: 2025-05-29T00:02:35.018Z
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
  • lib/llm/src/kv_router/publisher.rs
📚 Learning: 2025-09-17T20:55:06.333Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.

Applied to files:

  • lib/llm/src/kv_router/publisher.rs
📚 Learning: 2025-05-30T06:38:09.630Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

  • lib/llm/src/kv_router/publisher.rs
📚 Learning: 2025-06-05T01:02:15.318Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

  • lib/llm/src/kv_router/publisher.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Applied to files:

  • lib/llm/src/kv_router/publisher.rs
🔇 Additional comments (9)
Cargo.toml (1)

18-20: Workspace member addition looks fine

Adding "lib/kvbm-hub" to the workspace members is consistent with the existing layout; no issues from this file’s perspective. Just ensure the new crate exists and has its own Cargo.toml before merging.

Run cargo metadata or cargo check -p kvbm-hub locally to verify the new member is wired correctly.

lib/llm/src/kv_router/publisher.rs (8)

4-9: Importing LocalKvIndexer and compute_block_hash_for_seq is appropriate

Bringing LocalKvIndexer and compute_block_hash_for_seq into the publisher module matches their usage below (local indexer wiring and block hash computation). No issues here.


95-107: Constructor split/new_with_local_indexer maintains backward compatibility

  • KvEventPublisher::new(...) now delegates to new_with_local_indexer(..., enable_local_indexer = false), so existing callers see identical behavior.
  • new_with_local_indexer provides an explicit opt‑in for worker‑local indexing, which is a clean extension point.

Looks good from an API‑evolution standpoint.
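The delegation pattern described here looks roughly like the sketch below. This `Publisher` is a hypothetical simplification: the real constructors take more arguments and build an actual `LocalKvIndexer`.

```rust
// Hypothetical placeholder for the worker-local indexer.
struct LocalIndexer;

// Hypothetical, simplified publisher.
struct Publisher {
    worker_id: u64,
    local_indexer: Option<LocalIndexer>,
}

impl Publisher {
    /// Existing entry point: behavior is unchanged because the feature stays off.
    fn new(worker_id: u64) -> Self {
        Self::new_with_local_indexer(worker_id, false)
    }

    /// Explicit opt-in for worker-local indexing.
    fn new_with_local_indexer(worker_id: u64, enable_local_indexer: bool) -> Self {
        let local_indexer = enable_local_indexer.then(|| LocalIndexer);
        Self { worker_id, local_indexer }
    }

    /// Accessor for the optional local indexer.
    fn local_indexer(&self) -> Option<&LocalIndexer> {
        self.local_indexer.as_ref()
    }
}
```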


159-175: Threading local_indexer into the event processor is correct

Passing a cloned Option<Arc<LocalKvIndexer>> into the spawned start_event_processor task is the right way to:

  • Share the local indexer across all events for this worker.
  • Keep its lifetime tied to the publisher/event loop without extra shutdown plumbing.

No functional issues spotted here.


216-255: Event processor correctly applies local indexer then publishes, with good failure isolation

The updated start_event_processor:

  • Wraps incoming KvCacheEvent into a RouterEvent with the worker’s ID (unchanged behavior).
  • If local_indexer is present, calls apply_event_with_buffer(router_event.clone()):
    • This ensures the worker‑local indexer stays in sync and records the event before global distribution.
    • On failure (IndexerOffline), logs a warning but continues.
  • Always attempts to publish the same RouterEvent to NATS, regardless of local indexer outcome.

This is the right failure‑isolation boundary: local indexer issues don’t take down or stall global event publishing. Logging at warn! is appropriate, since a permanently offline local indexer indicates misconfiguration or a shutdown condition.

No changes requested here.
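For readers who want the failure-isolation boundary spelled out, here is a synchronous sketch. Everything in it is a stand-in: the real `start_event_processor` is async and publishes to NATS, whereas this version pushes into a `Vec`, and the `LocalIndexer` trait and `OfflineIndexer` type are hypothetical.

```rust
// Hypothetical, simplified event type.
#[derive(Clone, Debug, PartialEq)]
struct RouterEvent {
    worker_id: u64,
    event_id: u64,
}

// Stand-in for the error returned when the local indexer is unavailable.
#[derive(Debug)]
struct IndexerOffline;

// Hypothetical trait capturing only the call the processor makes.
trait LocalIndexer {
    fn apply_event_with_buffer(&mut self, event: RouterEvent) -> Result<(), IndexerOffline>;
}

// A local indexer that always fails, to exercise the warning path.
struct OfflineIndexer;

impl LocalIndexer for OfflineIndexer {
    fn apply_event_with_buffer(&mut self, _event: RouterEvent) -> Result<(), IndexerOffline> {
        Err(IndexerOffline)
    }
}

/// Apply the event locally if an indexer is present, then always publish it.
fn process_event(
    local: Option<&mut dyn LocalIndexer>,
    published: &mut Vec<RouterEvent>,
    event: RouterEvent,
) {
    if let Some(indexer) = local {
        if let Err(err) = indexer.apply_event_with_buffer(event.clone()) {
            // Log and continue: a local failure must not block publication.
            eprintln!("local indexer failed, continuing: {err:?}");
        }
    }
    // Stand-in for the NATS publish call.
    published.push(event);
}
```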


1047-1107: MockComponent and updated test_start_event_processor stay valid after the signature change

The MockComponent stand‑in for EventPublisher remains minimal and correct, and test_start_event_processor has been updated to pass None for the new local_indexer parameter, still asserting a single publish to QUEUE_NAME.

This keeps the original behavior covered while leaving room for local‑indexer tests.


1142-1213: test_start_event_processor_with_local_indexer exercises the happy path well

This test validates the critical behavior when a local indexer is enabled:

  • It verifies the NATS publish path still receives exactly one event on QUEUE_NAME.
  • It uses get_workers_sender on the underlying KvIndexer (through LocalKvIndexer) to confirm that worker 1 is now tracked, proving the local indexer applied the event.

The use of KvIndexerMetrics::new_unregistered() and a dedicated CancellationToken is consistent with how the production code constructs a local indexer.

No issues here.


1284-1343: AllBlocksCleared behavior with local indexer is covered and matches router semantics

This test:

  • Stores a block, then sends a KvCacheEventData::Cleared event.
  • Verifies:
    • Local indexer find_matches yields no matches, i.e., the worker’s blocks are cleared.
    • Two events were still published globally.

This aligns with the intended behavior in RadixTree::apply_event where Cleared clears all blocks for the worker (all dp_ranks), as per prior design notes. Based on learnings.


1348-1389: Local indexer failure test correctly ensures NATS publishing proceeds

test_event_processor_local_indexer_failure_continues intentionally cancels the local indexer’s token before sending an event, then:

  • Runs start_event_processor with that dead indexer.
  • Asserts exactly one published event to NATS.

This proves the “log‑warn but keep publishing” behavior on local indexer failure works as intended.

Implementation and test are aligned; no change requested.

@PeaBrane
Contributor

@karen-sy Thanks for putting this up! I like the general shape. When the CI is green, I’ll go through it in detail.

@PeaBrane PeaBrane added the router Relates to routing, KV-aware routing, etc. label Nov 21, 2025
/// A thin wrapper around KvIndexer that buffers recent events
/// (e.g. which may be queued by router upon startup)
///
pub struct LocalKvIndexer {
Contributor

We would probably also need the following features

  1. Some check that the event ids are queued in consecutive integers.
  2. Some sort of binary search mechanism (can use a crate), to pinpoint the event in the buffer with event_id = ... and get the events from that point on
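A minimal sketch of both requested features, assuming events sit in a `VecDeque` ordered by ascending `event_id`. The `Event` type and both function names are hypothetical; the standard library's `VecDeque::binary_search_by_key` is used here in place of an external crate.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for a buffered KV cache event.
#[derive(Clone, Debug, PartialEq)]
struct Event {
    event_id: u64,
}

/// True when ids increase by exactly one from front to back.
fn ids_are_consecutive(buffer: &VecDeque<Event>) -> bool {
    buffer
        .iter()
        .zip(buffer.iter().skip(1))
        .all(|(a, b)| a.event_id.checked_add(1) == Some(b.event_id))
}

/// Events starting at `event_id`, or `None` if that id is not in the buffer.
/// Relies on the buffer being sorted by `event_id`.
fn events_from(buffer: &VecDeque<Event>, event_id: u64) -> Option<Vec<Event>> {
    let idx = buffer
        .binary_search_by_key(&event_id, |e| e.event_id)
        .ok()?;
    Some(buffer.iter().skip(idx).cloned().collect())
}
```

If the ids are known to be consecutive, the index could also be computed by subtracting the front id from the requested id; the binary search only needs the ordering to hold.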

Contributor Author

For sure, I'm going to prototype the router<->localindexer comm route first (initially assuming that the router only ever asks for a full event dump) and then implement those features

Labels

feat router Relates to routing, KV-aware routing, etc. size/XXL

3 participants