fix: introduce service discovery interface (1/n) #3937

mohammedabdulwahhab · 2025-10-28T17:25:43Z

Introduce a discovery client interface
Create an attribute on the DistributedRuntime to persist the client
Create a mock discovery backend using memory (to be replaced with KeyValueStoreDiscoveryClient)

Summary by CodeRabbit

New Features

Service discovery system now integrated into the runtime, enabling registration and monitoring of service endpoints.
Supports dynamic tracking of service instances with real-time event notifications for added and removed services.
Mock discovery implementation included for testing and development purposes with thread-safe, in-memory storage.

Signed-off-by: mohammedabdulwahhab <[email protected]>

mohammedabdulwahhab · 2025-10-28T17:36:49Z

lib/runtime/src/discovery/mock.rs

@@ -0,0 +1,174 @@
+// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


This mock will eventually be replaced with one that uses the memory impl for KeyValueStoreManager

coderabbitai · 2025-10-28T17:41:43Z

Walkthrough

This PR introduces a pluggable service discovery system to the runtime. It defines core discovery abstractions (DiscoveryKey, DiscoveryInstance, DiscoveryEvent, DiscoveryClient trait) and provides a mock in-memory implementation for testing. The discovery client is integrated into DistributedRuntime as a lazy-initialized field.

Changes

Cohort / File(s)	Summary
Discovery Module Interface `lib/runtime/src/discovery/mod.rs`	Establishes discovery abstraction layer with DiscoveryKey, DiscoveryInstance, and DiscoveryEvent enums. Defines DiscoveryClient trait with instance_id(), serve(), and list_and_watch() methods. Re-exports mock types.
Mock Discovery Implementation `lib/runtime/src/discovery/mock.rs`	Implements in-memory mock discovery system with SharedMockRegistry (thread-safe Arc<Mutex>) and MockDiscoveryClient. Includes streaming list_and_watch() with 10ms polling interval and unit test verifying Added/Removed event propagation.
DistributedRuntime Integration `lib/runtime/src/distributed.rs`, `lib/runtime/src/lib.rs`	Adds a lazy-initialized discovery_client field (Arc<OnceCell<Arc<dyn DiscoveryClient>>>) to DistributedRuntime and a discovery_client() async method. Declares new pub mod discovery.
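
As a rough illustration of the data model summarized in the table, here is a minimal std-only sketch. The enum names and the `Endpoint { namespace, component, endpoint }` shape come from this review; the extra `key` field on `DiscoveryInstance` and the `Registry` type with its `register`/`list` methods are hypothetical stand-ins, not the PR's actual SharedMockRegistry API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// What is being discovered. Only the `Endpoint` shape is quoted in the review.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DiscoveryKey {
    Endpoint { namespace: String, component: String, endpoint: String },
}

/// A registered instance. The `key` field is an assumption made for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DiscoveryInstance {
    Endpoint { key: DiscoveryKey, instance_id: String },
}

/// Events delivered to watchers.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DiscoveryEvent {
    Added(DiscoveryInstance),
    Removed(String),
}

/// Hypothetical in-memory registry. Cloning it shares the same underlying map.
#[derive(Clone, Default)]
pub struct Registry {
    instances: Arc<Mutex<HashMap<DiscoveryKey, Vec<DiscoveryInstance>>>>,
}

impl Registry {
    /// Record an instance under `key` and return what was stored.
    pub fn register(&self, key: DiscoveryKey, instance_id: &str) -> DiscoveryInstance {
        let instance = DiscoveryInstance::Endpoint {
            key: key.clone(),
            instance_id: instance_id.to_string(),
        };
        let mut map = self.instances.lock().unwrap();
        map.entry(key).or_default().push(instance.clone());
        instance
    }

    /// Snapshot of the instances currently registered under `key`.
    pub fn list(&self, key: &DiscoveryKey) -> Vec<DiscoveryInstance> {
        let map = self.instances.lock().unwrap();
        map.get(key).cloned().unwrap_or_default()
    }
}
```

Because the map sits behind `Arc<Mutex<..>>`, a clone handed to another client observes the same registrations, which is the property a shared mock registry needs.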

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

Review trait design and streaming implementation in mock discovery for correctness of event emission logic
Verify async/await patterns and Arc usage in mock registry are thread-safe
Confirm lazy initialization pattern with OnceCell is correctly integrated in DistributedRuntime

Poem

🐰 A discovery path is now laid,
With traits and streams in arclight shade,
Mock instances hop and play,
Events added, swept away—
The registry hops to serve the day!

Pre-merge checks

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The pull request description provides useful information about the changes (introducing a discovery client interface, persisting it on DistributedRuntime, and creating a mock implementation) but does not follow the required template structure. The description is missing critical sections including an Overview header, a Details section with proper organization, the "Where should the reviewer start?" section that identifies specific files for review, and the Related Issues section with action keywords. While the content touches on what was changed, the failure to follow the template structure and omission of key navigation and linking information makes the description largely incomplete compared to the template requirements.	Revise the pull request description to follow the provided template structure. Add an Overview section summarizing the purpose, organize the changes under a Details section header, add a "Where should the reviewer start?" section identifying lib/runtime/src/discovery/mod.rs and lib/runtime/src/discovery/mock.rs as key files for review, and include a Related Issues section with the appropriate action keyword (e.g., "Closes #xxx" or "Relates to #xxx") if applicable.
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title "fix: introduce service discovery interface (1/n)" directly and clearly describes the main objective of the pull request. The changes establish a new discovery client interface, add it to DistributedRuntime, and provide a mock implementation, all of which are accurately captured by the title. The title is concise, specific, and avoids vague terminology while properly conveying the primary change from the changeset.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/runtime/src/distributed.rs (1)
20-24: Name collision confirmed: error imported twice (compile error E0252)

The figment::error import conflicts with the super module's re-export of anyhow::anyhow as error. Line 234 uses the anyhow macro (error!("...")), while figment::error is unused. Remove the figment import.
-use figment::error;

🧹 Nitpick comments (8)

lib/runtime/src/lib.rs (1)

99-101: Trait-object field is fine; consider a type alias for clarity

Arc<OnceCell<Arc<dyn discovery::DiscoveryClient>>> is correct. For readability and fewer repeated bounds, consider aliasing:
+// in lib/runtime/src/discovery/mod.rs (or a shared prelude)
+pub type DiscoveryClientRef = Arc<dyn DiscoveryClient + 'static>;
+
-// here
-discovery_client: Arc<OnceCell<Arc<dyn discovery::DiscoveryClient>>>,
+discovery_client: Arc<OnceCell<DiscoveryClientRef>>,
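
A small sketch of how such an alias reads in a lazily initialized field. It uses `std::sync::OnceLock` as a synchronous stand-in for the async OnceCell in the PR, and `StaticClient`, `Runtime`, and `init_calls` are hypothetical names introduced only for this illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

pub trait DiscoveryClient: Send + Sync {
    fn instance_id(&self) -> String;
}

/// The alias suggested above, without repeating the trait-object type everywhere.
pub type DiscoveryClientRef = Arc<dyn DiscoveryClient>;

/// Hypothetical client used only to have something to construct.
struct StaticClient(String);

impl DiscoveryClient for StaticClient {
    fn instance_id(&self) -> String {
        self.0.clone()
    }
}

/// Simplified runtime: clones share the same lazily filled slot.
#[derive(Clone, Default)]
pub struct Runtime {
    discovery_client: Arc<OnceLock<DiscoveryClientRef>>,
    init_calls: Arc<AtomicUsize>, // counts how often the initializer ran
}

impl Runtime {
    /// Build the client on first use, then hand out clones of the same Arc.
    pub fn discovery_client(&self) -> DiscoveryClientRef {
        self.discovery_client
            .get_or_init(|| {
                self.init_calls.fetch_add(1, Ordering::SeqCst);
                Arc::new(StaticClient("drt-1".to_string())) as DiscoveryClientRef
            })
            .clone()
    }
}
```

The initializer runs at most once even when several clones of the runtime ask for the client.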

lib/runtime/src/distributed.rs (1)

228-239: Wire a usable default for early testing instead of hard error

Returning Err(error!(...)) blocks adopters. Prefer initializing a mock when self.is_static or under a cfg(test)/feature flag.

-    pub async fn discovery_client(&self) -> Result<Arc<dyn DiscoveryClient>> {
-        let client = self
-            .discovery_client
-            .get_or_try_init(async {
-                // TODO: Replace when KeyValueDiscoveryClient is implemented
-                Err(error!("No discovery clients yet implemented."))
-            })
-            .await?;
-        Ok(client.clone())
-    }
+    pub async fn discovery_client(&self) -> Result<Arc<dyn DiscoveryClient>> {
+        let client = self
+            .discovery_client
+            .get_or_try_init(async {
+                #[cfg(feature = "discovery-mock")]
+                {
+                    // Use in-memory mock for static or test environments
+                    let registry = crate::discovery::SharedMockRegistry::new();
+                    let mock = crate::discovery::MockDiscoveryClient::new(
+                        format!("drt-{}", self.connection_id()),
+                        registry,
+                    );
+                    return Ok(Arc::new(mock) as Arc<dyn DiscoveryClient>);
+                }
+                // TODO: Replace with KeyValueDiscoveryClient (etcd/KV-backed)
+                Err(error!("No discovery clients yet implemented"))
+            })
+            .await?;
+        Ok(client.clone())
+    }

Add imports near the top:

+use crate::discovery::{MockDiscoveryClient, SharedMockRegistry};

lib/runtime/src/discovery/mod.rs (3)

36-43: Include key context in Removed events

Removed(String) only carries instance_id. Downstream often needs key context for metrics/logs; consider:
-pub enum DiscoveryEvent {
-    Added(DiscoveryInstance),
-    Removed(String),
-}
+pub enum DiscoveryEvent {
+    Added(DiscoveryInstance),
+    Removed { key: DiscoveryKey, instance_id: String },
+}
This avoids relying on external state to correlate removals.
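
To illustrate why the key helps, here is a hedged sketch of a consumer that keeps per-key instance counts from events alone. The `Removed { key, instance_id }` shape is the one proposed above; the `key` field on `DiscoveryInstance` and the `apply` function are assumptions made for the example.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DiscoveryKey {
    Endpoint { namespace: String, component: String, endpoint: String },
}

/// Assumption: an instance records the key it was registered under.
#[derive(Debug, Clone)]
pub enum DiscoveryInstance {
    Endpoint { key: DiscoveryKey, instance_id: String },
}

#[derive(Debug, Clone)]
pub enum DiscoveryEvent {
    Added(DiscoveryInstance),
    Removed { key: DiscoveryKey, instance_id: String },
}

/// Hypothetical consumer: updates a live-instance count per key.
/// No side table mapping instance ids back to keys is needed.
pub fn apply(counts: &mut HashMap<DiscoveryKey, usize>, event: &DiscoveryEvent) {
    match event {
        DiscoveryEvent::Added(DiscoveryInstance::Endpoint { key, .. }) => {
            *counts.entry(key.clone()).or_insert(0) += 1;
        }
        DiscoveryEvent::Removed { key, .. } => {
            if let Some(n) = counts.get_mut(key) {
                *n = n.saturating_sub(1);
            }
        }
    }
}
```

With `Removed(String)` the same consumer would have to remember which key each instance id belonged to.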

45-47: Stream alias is good; consider futures_core::Stream

To reduce dependency surface in trait signatures, you can use futures_core::Stream instead of futures::Stream. Optional.
-use futures::Stream;
+use futures_core::Stream;

49-60: Lifecycle: add an unserve/deregister path

Most backends need explicit deregistration or a lease guard. Consider adding:
async fn unserve(&self, instance: &DiscoveryInstance) -> Result<()>;
Alternatively, make serve return a drop-guard that deregisters on Drop.
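
The drop-guard alternative could look roughly like this. It is a synchronous, std-only sketch: `Registered`, `ServeGuard`, and this free-standing `serve` function are hypothetical and only show the RAII shape, not the trait method in the PR.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical shared set of registered instance ids.
pub type Registered = Arc<Mutex<HashSet<String>>>;

/// Guard returned by the hypothetical `serve`; deregisters when dropped.
pub struct ServeGuard {
    registry: Registered,
    instance_id: String,
}

/// Register `instance_id` and return a guard that owns the registration.
pub fn serve(registry: &Registered, instance_id: &str) -> ServeGuard {
    registry.lock().unwrap().insert(instance_id.to_string());
    ServeGuard {
        registry: Arc::clone(registry),
        instance_id: instance_id.to_string(),
    }
}

impl Drop for ServeGuard {
    fn drop(&mut self) {
        // Best effort: skip cleanup on a poisoned lock rather than panic in drop.
        if let Ok(mut set) = self.registry.lock() {
            set.remove(&self.instance_id);
        }
    }
}
```

A real backend would likely need async cleanup or a lease, which `Drop` cannot await, so an explicit unserve may still be required alongside a guard.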

lib/runtime/src/discovery/mock.rs (3)

10-20: Avoid std::sync::Mutex in async code

Using std::sync::Mutex inside async tasks can block the scheduler. Since this is test-only, it’s acceptable, but prefer parking_lot::Mutex (fast) or tokio::sync::Mutex if held across .await.

-use std::sync::{Arc, Mutex};
+use std::sync::Arc;
+use parking_lot::Mutex;

69-110: Polling loop: use interval and configurable period

Tight sleep(10ms) loops can cause jitter and unnecessary wakeups. Use tokio::time::interval and make the period configurable.

-        let stream = async_stream::stream! {
-            let mut known_instances = HashSet::new();
-            loop {
+        let stream = async_stream::stream! {
+            let mut known_instances = HashSet::new();
+            let mut ticker = tokio::time::interval(tokio::time::Duration::from_millis(50));
+            loop {
                 // ...
-                tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
+                ticker.tick().await;
             }
         };
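
Independent of how the loop is paced, the per-tick work is a set difference between the ids already reported and the current snapshot. A minimal sketch of that step, with events simplified to carry only an instance id and with `diff_tick` as a hypothetical helper name:

```rust
use std::collections::HashSet;

/// Simplified events: both variants carry only the instance id.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DiscoveryEvent {
    Added(String),
    Removed(String),
}

/// Compare the ids reported so far (`known`) with the current snapshot,
/// return the events to emit, and update `known` in place.
pub fn diff_tick(known: &mut HashSet<String>, current: &HashSet<String>) -> Vec<DiscoveryEvent> {
    // Collect owned ids first so `known` can be mutated afterwards.
    let mut added: Vec<String> = current.difference(known).cloned().collect();
    let mut removed: Vec<String> = known.difference(current).cloned().collect();
    // Sort for a deterministic event order; HashSet iteration order is arbitrary.
    added.sort();
    removed.sort();

    let mut events = Vec::with_capacity(added.len() + removed.len());
    for id in added {
        known.insert(id.clone());
        events.push(DiscoveryEvent::Added(id));
    }
    for id in removed {
        known.remove(&id);
        events.push(DiscoveryEvent::Removed(id));
    }
    events
}
```

Keeping this step as a pure function makes the event logic testable without a timer or a stream.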

118-174: Harden test with timeouts and helper removal API

Tests can hang if events don’t arrive. Wrap next() with timeout, and consider exposing a remove() helper instead of mutating internals.

-        let event = stream.next().await.unwrap().unwrap();
+        let event = tokio::time::timeout(
+            tokio::time::Duration::from_secs(1),
+            stream.next()
+        ).await.expect("event timed out").unwrap().unwrap();

Optionally add:

impl SharedMockRegistry {
    pub fn remove(&self, key: &DiscoveryKey, instance_id: &str) {
        let mut g = self.instances.lock().unwrap();
        if let Some(vec) = g.get_mut(key) {
            // Drop only the matching Endpoint; keep any other instance kinds untouched.
            vec.retain(|i| !matches!(i, DiscoveryInstance::Endpoint { instance_id: id, .. } if id == instance_id));
        }
    }
}

Then use registry.remove(&key, "instance-1");.
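
The same bounded-wait idea can be shown without an async runtime. This sketch swaps in `std::sync::mpsc::Receiver::recv_timeout` for `tokio::time::timeout`, so it is an analogy rather than the code the test would use; `next_event` is a hypothetical helper.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Wait for the next event, but fail with a message instead of hanging forever.
pub fn next_event<T>(rx: &mpsc::Receiver<T>, limit: Duration) -> Result<T, String> {
    rx.recv_timeout(limit).map_err(|e| match e {
        RecvTimeoutError::Timeout => format!("no event within {:?}", limit),
        RecvTimeoutError::Disconnected => "event source closed".to_string(),
    })
}
```

A test that expects an event then fails quickly with a readable message when the event never arrives.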

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 259b2d3 and e262cd3.

📒 Files selected for processing (4)

lib/runtime/src/discovery/mock.rs (1 hunks)
lib/runtime/src/discovery/mod.rs (1 hunks)
lib/runtime/src/distributed.rs (3 hunks)
lib/runtime/src/lib.rs (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

lib/runtime/src/discovery/mock.rs (1)

lib/runtime/src/discovery/mod.rs (3)

instance_id (53-53)

serve (56-56)

list_and_watch (59-59)

lib/runtime/src/discovery/mod.rs (2)

lib/bindings/python/src/dynamo/_core.pyi (1)

Endpoint (133-174)

lib/runtime/src/discovery/mock.rs (3)

instance_id (40-42)

serve (44-67)

list_and_watch (69-110)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)

GitHub Check: vllm (amd64)
GitHub Check: trtllm (arm64)
GitHub Check: sglang
GitHub Check: operator (amd64)
GitHub Check: vllm (arm64)
GitHub Check: trtllm (amd64)
GitHub Check: operator (arm64)
GitHub Check: clippy (.)
GitHub Check: tests (lib/runtime/examples)
GitHub Check: tests (lib/bindings/python)
GitHub Check: clippy (lib/bindings/python)
GitHub Check: clippy (launch/dynamo-run)
GitHub Check: tests (.)
GitHub Check: Build and Test - dynamo
GitHub Check: tests (launch/dynamo-run)

🔇 Additional comments (3)

lib/runtime/src/lib.rs (1)

25-25: Module exposure looks good

Publicly exposing pub mod discovery; aligns with the new API surface. No issues.

lib/runtime/src/distributed.rs (1)

94-94: OnceCell for discovery client: LGTM

Lazy shared init via Arc<OnceCell<...>> matches existing patterns (e.g., tcp_server).

lib/runtime/src/discovery/mod.rs (1)

12-22: Key shape is fine for v1

DiscoveryKey::Endpoint { namespace, component, endpoint } is a clear minimal surface. No issues.

lib/runtime/src/discovery/mock.rs

grahamking · 2025-10-28T20:05:04Z

lib/runtime/src/discovery/mod.rs

+        component: String,
+        endpoint: String,
+    },
+}


What other types do you anticipate having here?

grahamking · 2025-10-28T20:06:27Z

lib/runtime/src/discovery/mod.rs

+    fn instance_id(&self) -> String;
+
+    /// Registers an object in the discovery plane with the instance id
+    async fn serve(&self, key: DiscoveryKey) -> Result<DiscoveryInstance>;


Could this be register? The first word in the comment is "Registers".

serve makes me think of a server, like an HTTP server for example, so I expect a long running thread.

Sure, I was also thinking something along the lines of publish or broadcast could work

grahamking · 2025-10-28T20:08:51Z

lib/runtime/src/discovery/mod.rs

+    async fn serve(&self, key: DiscoveryKey) -> Result<DiscoveryInstance>;
+
+    /// Returns a stream of discovery events (Added/Removed) for the given discovery key
+    async fn list_and_watch(&self, key: DiscoveryKey) -> Result<DiscoveryStream>;


To discover new models you watch model_card::ROOT_PATH which is v1/mdc. So not a DiscoveryKey.

To discover new instances you watch component::INSTANCE_ROOT_PATH which is v1/instances.
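
One way to picture the point about watch roots: the thing being watched is a path prefix, which differs for models and instances. The sketch below uses only the two prefixes quoted above; `WatchTarget` and `is_under` are hypothetical names, and the layout of keys beneath each root is not specified in this thread.

```rust
/// What a watcher wants to observe. Hypothetical type, not part of the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WatchTarget {
    Models,
    Instances,
}

impl WatchTarget {
    /// Root prefix to watch, using the values quoted in the comment above.
    pub fn root_path(self) -> &'static str {
        match self {
            WatchTarget::Models => "v1/mdc",
            WatchTarget::Instances => "v1/instances",
        }
    }
}

/// True when `key` is the root itself or lives in a path segment below it.
pub fn is_under(target: WatchTarget, key: &str) -> bool {
    match key.strip_prefix(target.root_path()) {
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}
```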

grahamking · 2025-10-28T20:11:09Z

lib/runtime/src/distributed.rs

+                // TODO: Replace when KeyValueDiscoveryClient is implemented
+                Err(error!("No discovery clients yet implemented."))
+            })
+            .await?;


Could you initialize it in new? Then you don't need the OnceCell.

Agreed, this makes sense
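
A rough sketch of what initializing in `new` buys. The types are simplified stand-ins (`Runtime` for DistributedRuntime, `StaticClient` as a hypothetical client), and the constructor is synchronous here, which the real one may not be.

```rust
use std::sync::Arc;

pub trait DiscoveryClient: Send + Sync {
    fn instance_id(&self) -> String;
}

/// Hypothetical client used only to have something to construct.
pub struct StaticClient {
    id: String,
}

impl DiscoveryClient for StaticClient {
    fn instance_id(&self) -> String {
        self.id.clone()
    }
}

/// Simplified runtime: the client is built once, in the constructor.
pub struct Runtime {
    discovery_client: Arc<dyn DiscoveryClient>,
}

impl Runtime {
    pub fn new(id: &str) -> Self {
        Runtime {
            discovery_client: Arc::new(StaticClient { id: id.to_string() }),
        }
    }

    /// Plain accessor: no lazy-init state, no Result, no await.
    pub fn discovery_client(&self) -> Arc<dyn DiscoveryClient> {
        Arc::clone(&self.discovery_client)
    }
}
```

Any construction failure then surfaces from `new` rather than from the first call to the accessor.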

fix: introduce service discovery interface

aaa5166

Signed-off-by: mohammedabdulwahhab <[email protected]>

pull-request-size bot added the size/L label Oct 28, 2025

github-actions bot added the fix label Oct 28, 2025

fix: resolve merge conflicts

e262cd3

Signed-off-by: mohammedabdulwahhab <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 28, 2025 17:35 Inactive

mohammedabdulwahhab marked this pull request as ready for review October 28, 2025 17:35

mohammedabdulwahhab requested a review from a team as a code owner October 28, 2025 17:35

copy-pr-bot bot temporarily deployed to GITLAB October 28, 2025 17:36 Inactive

Merge branch 'main' into mabdulwahhab/etcdless-interface-2

19baa05

copy-pr-bot bot temporarily deployed to GITLAB October 28, 2025 18:04 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 28, 2025 18:08 Inactive

Merge branch 'main' into mabdulwahhab/etcdless-interface-2

4a40b91

copy-pr-bot bot temporarily deployed to GITLAB October 28, 2025 19:40 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 28, 2025 19:41 Inactive
