Skip to content

Conversation

@grahamking
Copy link
Contributor

@grahamking grahamking commented Oct 8, 2025

This will gradually replace etcd_client, and later extend to replacing nats_client. It is backed by etcd normally, or memory when etcd is not available.

Also use it to list instances, as an example and first usage.

This is the same interface that ModelDeploymentCards have always used for publishing and reading, so it is well tested.

Summary by CodeRabbit

  • New Features
    • Introduces a unified storage backend, enabling both in-memory and etcd-backed operation.
    • Adds a connection identifier for storage clients to aid diagnostics.
  • Refactor
    • Migrates instance discovery and runtime services to the new storage abstraction for consistency across components.
    • Streamlines configuration: external coordination now requires an available storage backend.
  • Bug Fixes
    • Improves resilience of health checks and instance listing by gracefully handling storage errors and continuing to serve responses with clear warnings.

@grahamking grahamking requested a review from a team as a code owner October 8, 2025 21:19
@github-actions github-actions bot added the feat label Oct 8, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 8, 2025

Walkthrough

Replaces direct etcd usage with a storage abstraction across runtime and HTTP layers. Adds storage_client accessors, updates initialization paths to create Memory or Etcd-backed storage, and migrates instance listing to bucket-based iteration. Extends KeyValueStore with connection_id and implements it for memory, etcd, and NATS backends. Adjusts builders and signatures accordingly.

Changes

Cohort / File(s) Summary of Changes
HTTP service state and config
lib/llm/src/http/service/service_v2.rs
Removed Default derive on State. Added storage_client(&self) -> Arc<dyn KeyValueStore>. Changed State::new_with_etcd to require concrete etcd::Client (no Option). Builder now requires present etcd_client and errors if missing.
HTTP health endpoint retrieval
lib/llm/src/http/service/health.rs
Replaced conditional etcd fetch with state.storage_client() usage. On error, logs warning mentioning storage and returns empty list. Removed else branch for missing etcd.
Distributed runtime storage abstraction
lib/runtime/src/distributed.rs, lib/runtime/src/lib.rs
Added storage_client: Arc<dyn KeyValueStore> field and accessor. On static mode, use MemoryStorage and no etcd client; otherwise wrap etcd in EtcdStorage. Imported storage traits/impls. Internal struct layout updated.
Instance listing via storage buckets
lib/runtime/src/instances.rs, lib/runtime/src/component.rs
Migrated from etcd prefix scans to KeyValueStore bucket iteration. list_all_instances now takes Arc<dyn KeyValueStore>. Deserializes entries from bucket values; logs warnings on parse errors. Component::list_instances updated similarly with adjusted error messages.
Storage trait and implementations
lib/runtime/src/storage/key_value_store.rs, lib/runtime/src/storage/key_value_store/etcd.rs, lib/runtime/src/storage/key_value_store/mem.rs, lib/runtime/src/storage/key_value_store/nats.rs
Added fn connection_id(&self) -> u64 to KeyValueStore. Implemented connection_id for EtcdStorage (lease ID), MemoryStorage (random init field), and NATSStorage (server client_id). MemoryStorage gains private connection_id field and initializes it in new().

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant HTTP as HTTP Service (health)
  participant State
  participant Storage as KeyValueStore
  participant Bucket

  Client->>HTTP: GET /health
  HTTP->>State: storage_client()
  State-->>HTTP: Arc<dyn KeyValueStore>
  HTTP->>Storage: get_bucket(INSTANCE_ROOT_PATH)
  alt bucket exists
    Storage-->>HTTP: Bucket
    HTTP->>Bucket: entries()
    Bucket-->>HTTP: [(name, bytes)] (iter)
    loop for each entry
      HTTP->>HTTP: deserialize Instance from bytes
      alt deserialization error
        HTTP->>HTTP: log warn with name
      end
    end
    HTTP-->>Client: 200 with instances summary
  else bucket missing/error
    HTTP->>HTTP: log warn (storage error)
    HTTP-->>Client: 200 with empty instances
  end
Loading
sequenceDiagram
  autonumber
  participant Builder as Runtime Builder
  participant RT as DistributedRuntime
  participant Etcd as etcd::Client
  participant Storage as KeyValueStore

  alt is_static == true
    Builder->>RT: init(is_static=true)
    RT->>RT: storage_client = MemoryStorage::new()
    RT->>RT: etcd_client = None
  else is_static == false
    Builder->>RT: init(is_static=false, etcd_client)
    RT->>Storage: EtcdStorage::new(Etcd)
    RT->>RT: storage_client = EtcdStorage
    RT->>RT: etcd_client = Some(Etcd)
  end
  RT-->>Builder: DistributedRuntime { storage_client, ... }
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

In burrows of code, I twitch my nose,
From etcd thickets to storage rows.
Buckets brim with bytes to glean,
IDs sparkle—u64 sheen.
I thump, I hop, I validate,
New paths mapped, no keys to wait.
Carrots cached—release shipmate! 🥕🚀

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 19.05% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description Check ⚠️ Warning The pull request description does not follow the required repository template because it lacks the mandated “Overview,” “Details,” “Where should the reviewer start?,” and “Related Issues” sections and instead provides only free-form paragraphs without the specified headings or structure. Please update the PR description to use the provided template by adding an “Overview” that summarizes the changes, a “Details” section describing file-level modifications, a “Where should the reviewer start?” section highlighting key files, and a “Related Issues” section with any issue references and appropriate action keywords.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title succinctly captures the primary change of introducing a storage_client field into DistributedRuntime, reflecting the main structural update without extraneous details or noise.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
lib/runtime/src/instances.rs (1)

21-27: Preserve deterministic ordering when listing instances.

entries() hands back a HashMap, so iterating it yields a non-deterministic order. The previous etcd-backed path returned keys in lexical order, so this change can reshuffle results between calls/tests. Please sort by key (or iterate a BTreeMap) before pushing into the output vector to keep list ordering stable.

-    let entries = bucket.entries().await?;
-    let mut instances = Vec::with_capacity(entries.len());
-    for (name, bytes) in entries.into_iter() {
+    let entries = bucket.entries().await?;
+    let mut sorted_entries: Vec<_> = entries.into_iter().collect();
+    sorted_entries.sort_by(|(a, _), (b, _)| a.cmp(b));
+
+    let mut instances = Vec::with_capacity(sorted_entries.len());
+    for (name, bytes) in sorted_entries {
         match serde_json::from_slice::<Instance>(&bytes) {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b5bab7 and 89e1281.

📒 Files selected for processing (10)
  • lib/llm/src/http/service/health.rs (1 hunks)
  • lib/llm/src/http/service/service_v2.rs (5 hunks)
  • lib/runtime/src/component.rs (1 hunks)
  • lib/runtime/src/distributed.rs (4 hunks)
  • lib/runtime/src/instances.rs (1 hunks)
  • lib/runtime/src/lib.rs (2 hunks)
  • lib/runtime/src/storage/key_value_store.rs (1 hunks)
  • lib/runtime/src/storage/key_value_store/etcd.rs (1 hunks)
  • lib/runtime/src/storage/key_value_store/mem.rs (4 hunks)
  • lib/runtime/src/storage/key_value_store/nats.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-17T00:50:44.845Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.

Applied to files:

  • lib/runtime/src/storage/key_value_store/mem.rs
🧬 Code graph analysis (9)
lib/runtime/src/storage/key_value_store/etcd.rs (3)
lib/runtime/src/storage/key_value_store.rs (1)
  • connection_id (79-79)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • connection_id (103-105)
lib/runtime/src/storage/key_value_store/nats.rs (1)
  • connection_id (49-51)
lib/runtime/src/instances.rs (4)
lib/runtime/src/storage/key_value_store.rs (1)
  • entries (196-196)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • entries (131-149)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • entries (202-212)
lib/runtime/src/storage/key_value_store/nats.rs (1)
  • entries (171-184)
lib/runtime/src/storage/key_value_store.rs (3)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (49-53)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • connection_id (103-105)
lib/runtime/src/storage/key_value_store/nats.rs (1)
  • connection_id (49-51)
lib/runtime/src/distributed.rs (2)
lib/llm/src/http/service/service_v2.rs (2)
  • etcd_client (119-121)
  • storage_client (123-125)
lib/runtime/src/transports/etcd.rs (3)
  • etcd_client (129-131)
  • new (93-126)
  • new (516-559)
lib/llm/src/http/service/health.rs (1)
lib/runtime/src/instances.rs (1)
  • list_all_instances (16-33)
lib/llm/src/http/service/service_v2.rs (2)
lib/runtime/src/distributed.rs (3)
  • new (43-191)
  • etcd_client (276-278)
  • storage_client (281-283)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • default (26-28)
lib/runtime/src/storage/key_value_store/nats.rs (3)
lib/runtime/src/storage/key_value_store.rs (1)
  • connection_id (79-79)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (49-53)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • connection_id (103-105)
lib/runtime/src/component.rs (4)
lib/runtime/src/storage/key_value_store.rs (1)
  • entries (196-196)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • entries (131-149)
lib/runtime/src/storage/key_value_store/mem.rs (1)
  • entries (202-212)
lib/runtime/src/storage/key_value_store/nats.rs (1)
  • entries (171-184)
lib/runtime/src/storage/key_value_store/mem.rs (3)
lib/runtime/src/storage/key_value_store.rs (1)
  • connection_id (79-79)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
  • connection_id (49-53)
lib/runtime/src/storage/key_value_store/nats.rs (1)
  • connection_id (49-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (.)
🔇 Additional comments (13)
lib/llm/src/http/service/service_v2.rs (4)

22-24: LGTM!

The imports for the storage abstraction layer are appropriate and align with the introduction of storage_client in the State struct.


36-36: LGTM!

The storage_client field is properly initialized with a default MemoryStorage implementation in the State::new() constructor. This provides a sensible fallback when etcd is not available.

Also applies to: 82-82


123-125: LGTM!

The public accessor storage_client() properly returns Arc<dyn KeyValueStore>, enabling callers to interact with the storage abstraction without depending on etcd directly.


306-308: LGTM!

The builder now enforces that etcd_client must be present, bailing with a clear error message if missing. This aligns with the signature change in new_with_etcd.

lib/runtime/src/storage/key_value_store/etcd.rs (1)

48-53: LGTM!

The connection_id() implementation correctly returns the etcd lease ID cast to u64. The comment clearly explains why the cast from i64 to u64 is safe, providing valuable context for future maintainers.

lib/llm/src/http/service/health.rs (1)

55-60: LGTM!

The update correctly replaces direct etcd usage with the storage abstraction via state.storage_client(). Error handling is appropriate, logging a warning and returning an empty list on failure, which prevents the health check from failing completely.

lib/runtime/src/storage/key_value_store/nats.rs (1)

48-50: LGTM!

The connection_id() implementation for NATSStorage correctly returns the client ID from the NATS server info, providing a stable connection identifier consistent with the trait contract.

lib/runtime/src/component.rs (1)

242-259: LGTM!

The migration from etcd-specific operations to the storage abstraction is well-implemented. The method now uses storage_client() and iterates over bucket.entries(), which is more idiomatic for the new abstraction. Error handling includes the key name in the error message, which aids debugging.

lib/runtime/src/lib.rs (1)

54-56: LGTM!

The addition of the storage_client field and the necessary imports properly integrate the storage abstraction into DistributedRuntime. The field is private, maintaining encapsulation, and the public accessor is provided in distributed.rs.

Also applies to: 157-157

lib/runtime/src/storage/key_value_store/mem.rs (2)

11-11: LGTM!

The addition of a randomly generated connection_id field to MemoryStorage is appropriate for an in-memory storage backend. Each instance gets a unique identifier, which is consistent with the behavior of other storage backends.

Also applies to: 22-22, 63-63


102-105: LGTM!

The connection_id() implementation correctly returns the stored connection ID, fulfilling the KeyValueStore trait requirement.

lib/runtime/src/storage/key_value_store.rs (1)

78-79: No external KeyValueStore implementations found. All in-tree implementations have been updated; this breaking change is contained within the repo.

lib/runtime/src/distributed.rs (1)

48-57: Nice storage-client initialization split.

The branch cleanly wires a memory fallback while keeping the etcd-backed path, making the staged migration straightforward.

Copy link
Contributor

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - but I think we should look at overall what services the runtime needs to provide - I'm not sure we need a distributed key value store - maybe its more that we need a way to store configuration and we need discovery? for model information we will use model express, for router - I'm thinking we would embed prefix tree storage into the router library, for discovery likewise we will use discovery interfaces... what remains in that case for key value store?

This will gradually replace `etcd_client`. It is backed by etcd
normally, or memory when etcd is not available.

Also use it to list instances, as an example and first usage.

This is the same interface that ModelDeploymentCards have always used
for publishing and reading, so it is well tested.

Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
`*Storage` -> `*Store`
`storage_client` -> `store`
`etcd_root` -> `instance_root`

Add a comment explaining that although `store` is `dyn KeyValueStore`
right now, it has ambitions to be the one-ring of Dynamo storage.

Signed-off-by: Graham King <[email protected]>
@grahamking
Copy link
Contributor Author

Updated names, added a comment clarifying that it hopes to be a more general interface than just key-value, and rebased on main.

@grahamking
Copy link
Contributor Author

LGTM - but I think we should look at overall what services the runtime needs to provide - I'm not sure we need a distributed key value store - maybe its more that we need a way to store configuration and we need discovery? for model information we will use model express, for router - I'm thinking we would embed prefix tree storage into the router library, for discovery likewise we will use discovery interfaces... what remains in that case for key value store?

We're doing that work here right? ai-dynamo/enhancements#39

@grahamking grahamking merged commit 7a7d397 into main Oct 10, 2025
29 of 30 checks passed
@grahamking grahamking deleted the gk-drt-key-value branch October 10, 2025 15:45
Copy link
Contributor

@keivenchang keivenchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm late to the game as this is already merged. Most of the other questions are already asked and answered. Concerns:

  • Woule love to see comments explaining the reasons for some of the changes. For example, explain that Store now contains the same EtcdClient/etcd::Client because of xyz reasons (e.g. // This is just a transition phase)
  • Naming. Thanks for making naming more consistent (e.g. Storage->Store). How about variable names, do you plan to rename variables like etcd_client (too specific) and store (too generic) to something like keyval_store in later PRs?
  • Would love to see examples or documents on how we can replace KeyValueStore (currently EtcdClient) with memory instances in unit tests.

metrics: Arc<Metrics>,
manager: Arc<ModelManager>,
etcd_client: Option<etcd::Client>,
store: Arc<dyn KeyValueStore>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already merged but for future PRs:

I'm looking at this and I'm thinking, wait, couldn't you set etc_client as etcd::Client, and also store as EtcdStore (which contains etcd::Client)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, I see that you're doing that in 92 (double storing etcd).

It would be good to see a comment saying something like
// Right now I'll store both during this transition phase. In the next PR I will get rid of etcd_client and just use store.


// Publish the Model Deployment Card to KV store
let kvstore: Box<dyn KeyValueStore> = Box::new(EtcdStorage::new(etcd_client.clone()));
let kvstore: Box<dyn KeyValueStore> = Box::new(EtcdStore::new(etcd_client.clone()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI is taking over the word kv and in the future this variable is probably best renamed as just keyval_store

ziqifan617 pushed a commit that referenced this pull request Oct 10, 2025
ziqifan617 pushed a commit that referenced this pull request Oct 20, 2025
nv-tusharma pushed a commit that referenced this pull request Oct 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants