-
Notifications
You must be signed in to change notification settings - Fork 663
feat: Introduce storage_client in DistributedRuntime #3507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughReplaces direct etcd usage with a storage abstraction across runtime and HTTP layers. Adds storage_client accessors, updates initialization paths to create Memory or Etcd-backed storage, and migrates instance listing to bucket-based iteration. Extends KeyValueStore with connection_id and implements it for memory, etcd, and NATS backends. Adjusts builders and signatures accordingly. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant HTTP as HTTP Service (health)
participant State
participant Storage as KeyValueStore
participant Bucket
Client->>HTTP: GET /health
HTTP->>State: storage_client()
State-->>HTTP: Arc<dyn KeyValueStore>
HTTP->>Storage: get_bucket(INSTANCE_ROOT_PATH)
alt bucket exists
Storage-->>HTTP: Bucket
HTTP->>Bucket: entries()
Bucket-->>HTTP: [(name, bytes)] (iter)
loop for each entry
HTTP->>HTTP: deserialize Instance from bytes
alt deserialization error
HTTP->>HTTP: log warn with name
end
end
HTTP-->>Client: 200 with instances summary
else bucket missing/error
HTTP->>HTTP: log warn (storage error)
HTTP-->>Client: 200 with empty instances
end
sequenceDiagram
autonumber
participant Builder as Runtime Builder
participant RT as DistributedRuntime
participant Etcd as etcd::Client
participant Storage as KeyValueStore
alt is_static == true
Builder->>RT: init(is_static=true)
RT->>RT: storage_client = MemoryStorage::new()
RT->>RT: etcd_client = None
else is_static == false
Builder->>RT: init(is_static=false, etcd_client)
RT->>Storage: EtcdStorage::new(Etcd)
RT->>RT: storage_client = EtcdStorage
RT->>RT: etcd_client = Some(Etcd)
end
RT-->>Builder: DistributedRuntime { storage_client, ... }
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/runtime/src/instances.rs (1)
21-27: Preserve deterministic ordering when listing instances.
entries()hands back aHashMap, so iterating it yields a non-deterministic order. The previous etcd-backed path returned keys in lexical order, so this change can reshuffle results between calls/tests. Please sort by key (or iterate aBTreeMap) before pushing into the output vector to keep list ordering stable.- let entries = bucket.entries().await?; - let mut instances = Vec::with_capacity(entries.len()); - for (name, bytes) in entries.into_iter() { + let entries = bucket.entries().await?; + let mut sorted_entries: Vec<_> = entries.into_iter().collect(); + sorted_entries.sort_by(|(a, _), (b, _)| a.cmp(b)); + + let mut instances = Vec::with_capacity(sorted_entries.len()); + for (name, bytes) in sorted_entries { match serde_json::from_slice::<Instance>(&bytes) {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
lib/llm/src/http/service/health.rs(1 hunks)lib/llm/src/http/service/service_v2.rs(5 hunks)lib/runtime/src/component.rs(1 hunks)lib/runtime/src/distributed.rs(4 hunks)lib/runtime/src/instances.rs(1 hunks)lib/runtime/src/lib.rs(2 hunks)lib/runtime/src/storage/key_value_store.rs(1 hunks)lib/runtime/src/storage/key_value_store/etcd.rs(1 hunks)lib/runtime/src/storage/key_value_store/mem.rs(4 hunks)lib/runtime/src/storage/key_value_store/nats.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-17T00:50:44.845Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Applied to files:
lib/runtime/src/storage/key_value_store/mem.rs
🧬 Code graph analysis (9)
lib/runtime/src/storage/key_value_store/etcd.rs (3)
lib/runtime/src/storage/key_value_store.rs (1)
connection_id(79-79)lib/runtime/src/storage/key_value_store/mem.rs (1)
connection_id(103-105)lib/runtime/src/storage/key_value_store/nats.rs (1)
connection_id(49-51)
lib/runtime/src/instances.rs (4)
lib/runtime/src/storage/key_value_store.rs (1)
entries(196-196)lib/runtime/src/storage/key_value_store/etcd.rs (1)
entries(131-149)lib/runtime/src/storage/key_value_store/mem.rs (1)
entries(202-212)lib/runtime/src/storage/key_value_store/nats.rs (1)
entries(171-184)
lib/runtime/src/storage/key_value_store.rs (3)
lib/runtime/src/storage/key_value_store/etcd.rs (1)
connection_id(49-53)lib/runtime/src/storage/key_value_store/mem.rs (1)
connection_id(103-105)lib/runtime/src/storage/key_value_store/nats.rs (1)
connection_id(49-51)
lib/runtime/src/distributed.rs (2)
lib/llm/src/http/service/service_v2.rs (2)
etcd_client(119-121)storage_client(123-125)lib/runtime/src/transports/etcd.rs (3)
etcd_client(129-131)new(93-126)new(516-559)
lib/llm/src/http/service/health.rs (1)
lib/runtime/src/instances.rs (1)
list_all_instances(16-33)
lib/llm/src/http/service/service_v2.rs (2)
lib/runtime/src/distributed.rs (3)
new(43-191)etcd_client(276-278)storage_client(281-283)lib/runtime/src/storage/key_value_store/mem.rs (1)
default(26-28)
lib/runtime/src/storage/key_value_store/nats.rs (3)
lib/runtime/src/storage/key_value_store.rs (1)
connection_id(79-79)lib/runtime/src/storage/key_value_store/etcd.rs (1)
connection_id(49-53)lib/runtime/src/storage/key_value_store/mem.rs (1)
connection_id(103-105)
lib/runtime/src/component.rs (4)
lib/runtime/src/storage/key_value_store.rs (1)
entries(196-196)lib/runtime/src/storage/key_value_store/etcd.rs (1)
entries(131-149)lib/runtime/src/storage/key_value_store/mem.rs (1)
entries(202-212)lib/runtime/src/storage/key_value_store/nats.rs (1)
entries(171-184)
lib/runtime/src/storage/key_value_store/mem.rs (3)
lib/runtime/src/storage/key_value_store.rs (1)
connection_id(79-79)lib/runtime/src/storage/key_value_store/etcd.rs (1)
connection_id(49-53)lib/runtime/src/storage/key_value_store/nats.rs (1)
connection_id(49-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (.)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (.)
🔇 Additional comments (13)
lib/llm/src/http/service/service_v2.rs (4)
22-24: LGTM!The imports for the storage abstraction layer are appropriate and align with the introduction of
storage_clientin theStatestruct.
36-36: LGTM!The
storage_clientfield is properly initialized with a defaultMemoryStorageimplementation in theState::new()constructor. This provides a sensible fallback when etcd is not available.Also applies to: 82-82
123-125: LGTM!The public accessor
storage_client()properly returnsArc<dyn KeyValueStore>, enabling callers to interact with the storage abstraction without depending on etcd directly.
306-308: LGTM!The builder now enforces that
etcd_clientmust be present, bailing with a clear error message if missing. This aligns with the signature change innew_with_etcd.lib/runtime/src/storage/key_value_store/etcd.rs (1)
48-53: LGTM!The
connection_id()implementation correctly returns the etcd lease ID cast tou64. The comment clearly explains why the cast fromi64tou64is safe, providing valuable context for future maintainers.lib/llm/src/http/service/health.rs (1)
55-60: LGTM!The update correctly replaces direct etcd usage with the storage abstraction via
state.storage_client(). Error handling is appropriate, logging a warning and returning an empty list on failure, which prevents the health check from failing completely.lib/runtime/src/storage/key_value_store/nats.rs (1)
48-50: LGTM!The
connection_id()implementation forNATSStoragecorrectly returns the client ID from the NATS server info, providing a stable connection identifier consistent with the trait contract.lib/runtime/src/component.rs (1)
242-259: LGTM!The migration from etcd-specific operations to the storage abstraction is well-implemented. The method now uses
storage_client()and iterates overbucket.entries(), which is more idiomatic for the new abstraction. Error handling includes the key name in the error message, which aids debugging.lib/runtime/src/lib.rs (1)
54-56: LGTM!The addition of the
storage_clientfield and the necessary imports properly integrate the storage abstraction intoDistributedRuntime. The field is private, maintaining encapsulation, and the public accessor is provided indistributed.rs.Also applies to: 157-157
lib/runtime/src/storage/key_value_store/mem.rs (2)
11-11: LGTM!The addition of a randomly generated
connection_idfield toMemoryStorageis appropriate for an in-memory storage backend. Each instance gets a unique identifier, which is consistent with the behavior of other storage backends.Also applies to: 22-22, 63-63
102-105: LGTM!The
connection_id()implementation correctly returns the stored connection ID, fulfilling theKeyValueStoretrait requirement.lib/runtime/src/storage/key_value_store.rs (1)
78-79: No externalKeyValueStoreimplementations found. All in-tree implementations have been updated; this breaking change is contained within the repo.lib/runtime/src/distributed.rs (1)
48-57: Nice storage-client initialization split.The branch cleanly wires a memory fallback while keeping the etcd-backed path, making the staged migration straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - but I think we should look at overall what services the runtime needs to provide - I'm not sure we need a distributed key value store - maybe its more that we need a way to store configuration and we need discovery? for model information we will use model express, for router - I'm thinking we would embed prefix tree storage into the router library, for discovery likewise we will use discovery interfaces... what remains in that case for key value store?
This will gradually replace `etcd_client`. It is backed by etcd normally, or memory when etcd is not available. Also use it to list instances, as an example and first usage. This is the same interface that ModelDeploymentCards have always used for publishing and reading, so it is well tested. Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
`*Storage` -> `*Store` `storage_client` -> `store` `etcd_root` -> `instance_root` Add a comment explaining that although `store` is `dyn KeyValueStore` right now, it has ambitions to be the one-ring of Dynamo storage. Signed-off-by: Graham King <[email protected]>
1a33f6e to
e2c90e6
Compare
|
Updated names, added a comment clarifying that it hopes to be a more general interface than just key-value, and rebased on main. |
We're doing that work here right? ai-dynamo/enhancements#39 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm late to the game as this is already merged. Most of the other questions are already asked and answered. Concerns:
- Woule love to see comments explaining the reasons for some of the changes. For example, explain that Store now contains the same EtcdClient/etcd::Client because of xyz reasons (e.g.
// This is just a transition phase) - Naming. Thanks for making naming more consistent (e.g. Storage->Store). How about variable names, do you plan to rename variables like
etcd_client(too specific) andstore(too generic) to something likekeyval_storein later PRs? - Would love to see examples or documents on how we can replace KeyValueStore (currently EtcdClient) with memory instances in unit tests.
| metrics: Arc<Metrics>, | ||
| manager: Arc<ModelManager>, | ||
| etcd_client: Option<etcd::Client>, | ||
| store: Arc<dyn KeyValueStore>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already merged but for future PRs:
I'm looking at this and I'm thinking, wait, couldn't you set etc_client as etcd::Client, and also store as EtcdStore (which contains etcd::Client)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, I see that you're doing that in 92 (double storing etcd).
It would be good to see a comment saying something like
// Right now I'll store both during this transition phase. In the next PR I will get rid of etcd_client and just use store.
|
|
||
| // Publish the Model Deployment Card to KV store | ||
| let kvstore: Box<dyn KeyValueStore> = Box::new(EtcdStorage::new(etcd_client.clone())); | ||
| let kvstore: Box<dyn KeyValueStore> = Box::new(EtcdStore::new(etcd_client.clone())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AI is taking over the word kv and in the future this variable is probably best renamed as just keyval_store
Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
This will gradually replace
etcd_client, and later extend to replacing nats_client. It is backed by etcd normally, or memory when etcd is not available.Also use it to list instances, as an example and first usage.
This is the same interface that ModelDeploymentCards have always used for publishing and reading, so it is well tested.
Summary by CodeRabbit