@mohammedabdulwahhab
Contributor

@mohammedabdulwahhab mohammedabdulwahhab commented Oct 28, 2025

  • Introduce a discovery client interface
  • Create an attribute on the DistributedRuntime to persist the client
  • Create a mock discovery backend using memory (to be replaced with KeyValueStoreDiscoveryClient)

Summary by CodeRabbit

New Features

  • Service discovery system now integrated into the runtime, enabling registration and monitoring of service endpoints.
  • Supports dynamic tracking of service instances with real-time event notifications for added and removed services.
  • Mock discovery implementation included for testing and development purposes with thread-safe, in-memory storage.

Signed-off-by: mohammedabdulwahhab <[email protected]>
@mohammedabdulwahhab mohammedabdulwahhab marked this pull request as ready for review October 28, 2025 17:35
@mohammedabdulwahhab mohammedabdulwahhab requested a review from a team as a code owner October 28, 2025 17:35
@@ -0,0 +1,174 @@
// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor Author

This mock will eventually be replaced with one that uses the memory impl for KeyValueStoreManager

@coderabbitai
Contributor

coderabbitai bot commented Oct 28, 2025

Walkthrough

This PR introduces a pluggable service discovery system to the runtime. It defines core discovery abstractions (DiscoveryKey, DiscoveryInstance, DiscoveryEvent, DiscoveryClient trait) and provides a mock in-memory implementation for testing. The discovery client is integrated into DistributedRuntime as a lazy-initialized field.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Discovery Module Interface: lib/runtime/src/discovery/mod.rs | Establishes discovery abstraction layer with DiscoveryKey, DiscoveryInstance, and DiscoveryEvent enums. Defines DiscoveryClient trait with instance_id(), serve(), and list_and_watch() methods. Re-exports mock types. |
| Mock Discovery Implementation: lib/runtime/src/discovery/mock.rs | Implements in-memory mock discovery system with SharedMockRegistry (thread-safe Arc<Mutex>) and MockDiscoveryClient. Includes streaming list_and_watch() with 10ms polling interval and unit test verifying Added/Removed event propagation. |
| DistributedRuntime Integration: lib/runtime/src/distributed.rs, lib/runtime/src/lib.rs | Adds lazy-initialized discovery_client field (Arc<OnceCell<Arc<dyn DiscoveryClient>>>) to DistributedRuntime and discovery_client() async method. Declares new pub mod discovery. |
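
To make the shape of these abstractions concrete, here is a minimal std-only sketch. It is not the PR's code: the real trait is async and list_and_watch() returns a stream, whereas this version is synchronous and only covers the "list" half. The fields of DiscoveryInstance, the derive lists, and the String error type are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// What is being discovered. Field names follow the review's description of
/// `DiscoveryKey::Endpoint`; the derives are an assumption.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DiscoveryKey {
    Endpoint { namespace: String, component: String, endpoint: String },
}

/// A registered object. The exact fields in the PR are not shown; these are illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DiscoveryInstance {
    Endpoint { key: DiscoveryKey, instance_id: String },
}

/// Synchronous stand-in for the PR's async `DiscoveryClient` trait.
pub trait DiscoveryClient {
    fn instance_id(&self) -> String;
    fn serve(&self, key: DiscoveryKey) -> Result<DiscoveryInstance, String>;
    /// Only the "list" half of `list_and_watch`; watching is omitted here.
    fn list(&self, key: &DiscoveryKey) -> Vec<DiscoveryInstance>;
}

/// Shared in-memory registry; every clone points at the same map.
#[derive(Clone, Default)]
pub struct SharedMockRegistry {
    instances: Arc<Mutex<HashMap<DiscoveryKey, Vec<DiscoveryInstance>>>>,
}

pub struct MockDiscoveryClient {
    instance_id: String,
    registry: SharedMockRegistry,
}

impl MockDiscoveryClient {
    pub fn new(instance_id: impl Into<String>, registry: SharedMockRegistry) -> Self {
        Self { instance_id: instance_id.into(), registry }
    }
}

impl DiscoveryClient for MockDiscoveryClient {
    fn instance_id(&self) -> String {
        self.instance_id.clone()
    }

    fn serve(&self, key: DiscoveryKey) -> Result<DiscoveryInstance, String> {
        let instance = DiscoveryInstance::Endpoint {
            key: key.clone(),
            instance_id: self.instance_id.clone(),
        };
        // Hold the lock only long enough to record the registration.
        let mut guard = self.registry.instances.lock().map_err(|e| e.to_string())?;
        guard.entry(key).or_default().push(instance.clone());
        Ok(instance)
    }

    fn list(&self, key: &DiscoveryKey) -> Vec<DiscoveryInstance> {
        let guard = self.registry.instances.lock().unwrap();
        guard.get(key).cloned().unwrap_or_default()
    }
}
```

Two clients built from clones of the same SharedMockRegistry see each other's registrations, which is what lets a single-process test exercise both the serving and the watching side.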

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • Review trait design and streaming implementation in mock discovery for correctness of event emission logic
  • Verify async/await patterns and Arc usage in mock registry are thread-safe
  • Confirm lazy initialization pattern with OnceCell is correctly integrated in DistributedRuntime

Poem

🐰 A discovery path is now laid,
With traits and streams in arclight shade,
Mock instances hop and play,
Events added, swept away—
The registry hops to serve the day!

Pre-merge checks

❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request description provides useful information about the changes (introducing a discovery client interface, persisting it on DistributedRuntime, and creating a mock implementation) but does not follow the required template structure. The description is missing critical sections including an Overview header, a Details section with proper organization, the "Where should the reviewer start?" section that identifies specific files for review, and the Related Issues section with action keywords. While the content touches on what was changed, the failure to follow the template structure and omission of key navigation and linking information makes the description largely incomplete compared to the template requirements. | Revise the pull request description to follow the provided template structure. Add an Overview section summarizing the purpose, organize the changes under a Details section header, add a "Where should the reviewer start?" section identifying lib/runtime/src/discovery/mod.rs and lib/runtime/src/discovery/mock.rs as key files for review, and include a Related Issues section with the appropriate action keyword (e.g., "Closes #xxx" or "Relates to #xxx") if applicable. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "fix: introduce service discovery interface (1/n)" directly and clearly describes the main objective of the pull request. The changes establish a new discovery client interface, add it to DistributedRuntime, and provide a mock implementation, all of which are accurately captured by the title. The title is concise, specific, and avoids vague terminology while properly conveying the primary change from the changeset. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/runtime/src/distributed.rs (1)

20-24: Name collision confirmed: error imported twice (compile error E0252)

The figment::error import conflicts with the super module's re-export of anyhow::anyhow as error. Line 234 uses the anyhow macro (error!("...")), while figment::error is unused. Remove the figment import.

-use figment::error;
🧹 Nitpick comments (8)
lib/runtime/src/lib.rs (1)

99-101: Trait-object field is fine; consider a type alias for clarity

Arc<OnceCell<Arc<dyn discovery::DiscoveryClient>>> is correct. For readability and fewer repeated bounds, consider aliasing:

+// in lib/runtime/src/discovery/mod.rs (or a shared prelude)
+pub type DiscoveryClientRef = Arc<dyn DiscoveryClient + 'static>;
+
-// here
-discovery_client: Arc<OnceCell<Arc<dyn discovery::DiscoveryClient>>>,
+discovery_client: Arc<OnceCell<DiscoveryClientRef>>,
lib/runtime/src/distributed.rs (1)

228-239: Wire a usable default for early testing instead of hard error

Returning Err(error!(...)) blocks adopters. Prefer initializing a mock when self.is_static or under a cfg(test)/feature flag.

-    pub async fn discovery_client(&self) -> Result<Arc<dyn DiscoveryClient>> {
-        let client = self
-            .discovery_client
-            .get_or_try_init(async {
-                // TODO: Replace when KeyValueDiscoveryClient is implemented
-                Err(error!("No discovery clients yet implemented."))
-            })
-            .await?;
-        Ok(client.clone())
-    }
+    pub async fn discovery_client(&self) -> Result<Arc<dyn DiscoveryClient>> {
+        let client = self
+            .discovery_client
+            .get_or_try_init(async {
+                #[cfg(feature = "discovery-mock")]
+                {
+                    // Use in-memory mock for static or test environments
+                    let registry = crate::discovery::SharedMockRegistry::new();
+                    let mock = crate::discovery::MockDiscoveryClient::new(
+                        format!("drt-{}", self.connection_id()),
+                        registry,
+                    );
+                    return Ok(Arc::new(mock) as Arc<dyn DiscoveryClient>);
+                }
+                // TODO: Replace with KeyValueDiscoveryClient (etcd/KV-backed)
+                Err(error!("No discovery clients yet implemented"))
+            })
+            .await?;
+        Ok(client.clone())
+    }

Add imports near the top:

+use crate::discovery::{MockDiscoveryClient, SharedMockRegistry};
lib/runtime/src/discovery/mod.rs (3)

36-43: Include key context in Removed events

Removed(String) only carries instance_id. Downstream often needs key context for metrics/logs; consider:

-pub enum DiscoveryEvent {
-    Added(DiscoveryInstance),
-    Removed(String),
-}
+pub enum DiscoveryEvent {
+    Added(DiscoveryInstance),
+    Removed { key: DiscoveryKey, instance_id: String },
+}

This lets consumers correlate removals from the event alone, without consulting external state.
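
As a rough illustration of why the key helps, the sketch below keeps per-key live counts using nothing but the events. It is a simplification under stated assumptions: Added is flattened to carry a key and an id rather than a DiscoveryInstance, and LiveCounts is a hypothetical consumer, not something in the PR.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DiscoveryKey {
    Endpoint { namespace: String, component: String, endpoint: String },
}

/// Simplified event type: both variants carry the key (assumption for this sketch).
#[derive(Debug, Clone)]
pub enum DiscoveryEvent {
    Added { key: DiscoveryKey, instance_id: String },
    Removed { key: DiscoveryKey, instance_id: String },
}

/// Hypothetical consumer that tracks how many instances are live per key.
#[derive(Default)]
pub struct LiveCounts {
    per_key: HashMap<DiscoveryKey, usize>,
}

impl LiveCounts {
    pub fn apply(&mut self, event: &DiscoveryEvent) {
        match event {
            DiscoveryEvent::Added { key, .. } => {
                *self.per_key.entry(key.clone()).or_insert(0) += 1;
            }
            // Because the key travels with the event, no side table mapping
            // instance_id back to its key is needed here.
            DiscoveryEvent::Removed { key, .. } => {
                if let Some(n) = self.per_key.get_mut(key) {
                    *n = n.saturating_sub(1);
                }
            }
        }
    }

    pub fn count(&self, key: &DiscoveryKey) -> usize {
        self.per_key.get(key).copied().unwrap_or(0)
    }
}
```

With Removed(String), the same consumer would have to remember which key each instance_id was added under.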


45-47: Stream alias is good; consider futures_core::Stream

To reduce dependency surface in trait signatures, you can use futures_core::Stream instead of futures::Stream. Optional.

-use futures::Stream;
+use futures_core::Stream;

49-60: Lifecycle: add an unserve/deregister path

Most backends need explicit deregistration or a lease guard. Consider adding:

async fn unserve(&self, instance: &DiscoveryInstance) -> Result<()>;

Alternatively, make serve return a drop-guard that deregisters on Drop.
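
The drop-guard alternative could look roughly like the following std-only sketch. Registry and Registration are hypothetical names, the registry is a plain set of ids rather than the PR's keyed map, and a real backend would likely need an async or lease-based release instead of a synchronous Drop.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical in-memory registry of live instance ids.
#[derive(Clone, Default)]
pub struct Registry {
    ids: Arc<Mutex<HashSet<String>>>,
}

impl Registry {
    /// Registers the id and returns a guard that deregisters it when dropped.
    pub fn register(&self, instance_id: &str) -> Registration {
        self.ids.lock().unwrap().insert(instance_id.to_string());
        Registration {
            registry: self.clone(),
            instance_id: instance_id.to_string(),
        }
    }

    pub fn contains(&self, instance_id: &str) -> bool {
        self.ids.lock().unwrap().contains(instance_id)
    }
}

#[must_use = "dropping the guard deregisters the instance immediately"]
pub struct Registration {
    registry: Registry,
    instance_id: String,
}

impl Drop for Registration {
    fn drop(&mut self) {
        // Tolerate a poisoned lock: Drop should not panic.
        if let Ok(mut ids) = self.registry.ids.lock() {
            ids.remove(&self.instance_id);
        }
    }
}
```

The guard ties the registration's lifetime to a value the caller must hold, so forgetting to call an explicit unserve cannot leave a stale entry behind.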

lib/runtime/src/discovery/mock.rs (3)

10-20: Avoid std::sync::Mutex in async code

Using std::sync::Mutex inside async tasks can block the scheduler. Since this is test-only, it’s acceptable, but prefer parking_lot::Mutex (fast) or tokio::sync::Mutex if held across .await.

-use std::sync::{Arc, Mutex};
+use std::sync::Arc;
+use parking_lot::Mutex;

69-110: Polling loop: use interval and configurable period

Tight sleep(10ms) loops can cause jitter and unnecessary wakeups. Use tokio::time::interval and make the period configurable.

-        let stream = async_stream::stream! {
-            let mut known_instances = HashSet::new();
-            loop {
+        let stream = async_stream::stream! {
+            let mut known_instances = HashSet::new();
+            let mut ticker = tokio::time::interval(tokio::time::Duration::from_millis(50));
+            loop {
                 // ...
-                tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
+                ticker.tick().await;
             }
         };
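
Whatever drives the loop, each tick reduces to comparing the current registry snapshot with the set of already-announced instances. A std-only sketch of that step follows; diff_snapshot and the simplified Event type are hypothetical, and the events are sorted only to make the output deterministic.

```rust
use std::collections::HashSet;

/// Simplified event carrying only an instance id (assumption for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Added(String),
    Removed(String),
}

/// Compares the current snapshot with `known`, updates `known` in place,
/// and returns the events a watcher would emit for this tick.
pub fn diff_snapshot(known: &mut HashSet<String>, current: &HashSet<String>) -> Vec<Event> {
    // Collect both differences before mutating `known`.
    let mut added: Vec<String> = current.difference(known).cloned().collect();
    let mut removed: Vec<String> = known.difference(current).cloned().collect();
    added.sort();
    removed.sort();

    let mut events = Vec::with_capacity(added.len() + removed.len());
    for id in added {
        known.insert(id.clone());
        events.push(Event::Added(id));
    }
    for id in removed {
        known.remove(&id);
        events.push(Event::Removed(id));
    }
    events
}
```

Keeping this step as a pure function makes it testable without any timer, which is independent of whether the loop uses sleep or an interval.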

118-174: Harden test with timeouts and helper removal API

Tests can hang if events don’t arrive. Wrap next() with timeout, and consider exposing a remove() helper instead of mutating internals.

-        let event = stream.next().await.unwrap().unwrap();
+        let event = tokio::time::timeout(
+            tokio::time::Duration::from_secs(1),
+            stream.next()
+        ).await.expect("event timed out").unwrap().unwrap();

Optionally add:

impl SharedMockRegistry {
    pub fn remove(&self, key: &DiscoveryKey, instance_id: &str) {
        let mut g = self.instances.lock().unwrap();
        if let Some(vec) = g.get_mut(key) {
            vec.retain(|i| matches!(i, DiscoveryInstance::Endpoint { instance_id: id, .. } if id != instance_id));
        }
    }
}

Then use registry.remove(&key, "instance-1");.
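
The same "never wait forever" idea can be shown without an async runtime. The sketch below uses a std channel and recv_timeout as a stand-in for wrapping stream.next() in tokio::time::timeout; next_event and spawn_producer are hypothetical helpers, and events are plain strings for brevity.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Waits for the next event, failing instead of hanging if none arrives in time.
pub fn next_event(rx: &Receiver<String>, limit: Duration) -> Result<String, RecvTimeoutError> {
    rx.recv_timeout(limit)
}

/// Spawns a background producer that sends the given events in order.
pub fn spawn_producer(events: Vec<String>) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for event in events {
            // Stop quietly if the receiver has gone away.
            if tx.send(event).is_err() {
                break;
            }
        }
    });
    rx
}
```

A test built this way fails with a clear timeout error when an expected event is missing, rather than stalling the whole test run.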

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 259b2d3 and e262cd3.

📒 Files selected for processing (4)
  • lib/runtime/src/discovery/mock.rs (1 hunks)
  • lib/runtime/src/discovery/mod.rs (1 hunks)
  • lib/runtime/src/distributed.rs (3 hunks)
  • lib/runtime/src/lib.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
lib/runtime/src/discovery/mock.rs (1)
lib/runtime/src/discovery/mod.rs (3)
  • instance_id (53-53)
  • serve (56-56)
  • list_and_watch (59-59)
lib/runtime/src/discovery/mod.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • Endpoint (133-174)
lib/runtime/src/discovery/mock.rs (3)
  • instance_id (40-42)
  • serve (44-67)
  • list_and_watch (69-110)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (launch/dynamo-run)
🔇 Additional comments (3)
lib/runtime/src/lib.rs (1)

25-25: Module exposure looks good

Publicly exposing pub mod discovery; aligns with the new API surface. No issues.

lib/runtime/src/distributed.rs (1)

94-94: OnceCell for discovery client: LGTM

Lazy shared init via Arc<OnceCell<...>> matches existing patterns (e.g., tcp_server).
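
For readers unfamiliar with the pattern, here is a std-only sketch of lazy shared initialization. It substitutes std::sync::OnceLock for the PR's async OnceCell and an infallible initializer for get_or_try_init; Runtime, StubClient, and the init_calls counter are hypothetical and exist only to show that the initializer runs once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

pub trait DiscoveryClient: Send + Sync {
    fn instance_id(&self) -> String;
}

/// Placeholder client used only by this sketch.
struct StubClient;

impl DiscoveryClient for StubClient {
    fn instance_id(&self) -> String {
        "stub".to_string()
    }
}

/// Hypothetical runtime handle; clones share the same cell.
#[derive(Clone, Default)]
pub struct Runtime {
    discovery_client: Arc<OnceLock<Arc<dyn DiscoveryClient>>>,
    init_calls: Arc<AtomicUsize>,
}

impl Runtime {
    /// Builds the client on first use; later calls return the same Arc.
    pub fn discovery_client(&self) -> Arc<dyn DiscoveryClient> {
        self.discovery_client
            .get_or_init(|| {
                self.init_calls.fetch_add(1, Ordering::SeqCst);
                Arc::new(StubClient) as Arc<dyn DiscoveryClient>
            })
            .clone()
    }

    pub fn init_calls(&self) -> usize {
        self.init_calls.load(Ordering::SeqCst)
    }
}
```

Because the cell sits behind an Arc, every clone of the runtime observes the same client once any one of them has initialized it.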

lib/runtime/src/discovery/mod.rs (1)

12-22: Key shape is fine for v1

DiscoveryKey::Endpoint { namespace, component, endpoint } is a clear minimal surface. No issues.

component: String,
endpoint: String,
},
}
Contributor

What other types do you anticipate having here?

fn instance_id(&self) -> String;

/// Registers an object in the discovery plane with the instance id
async fn serve(&self, key: DiscoveryKey) -> Result<DiscoveryInstance>;
Contributor

Could this be register? The first word in the comment is "Registers".

serve makes me think of a server, like an HTTP server for example, so I expect a long running thread.

Contributor Author

Sure, I was also thinking something along the lines of publish or broadcast could work

async fn serve(&self, key: DiscoveryKey) -> Result<DiscoveryInstance>;

/// Returns a stream of discovery events (Added/Removed) for the given discovery key
async fn list_and_watch(&self, key: DiscoveryKey) -> Result<DiscoveryStream>;
Contributor

@grahamking grahamking Oct 28, 2025

To discover new models you watch model_card::ROOT_PATH which is v1/mdc. So not a DiscoveryKey.

Contributor

To discover new instances you watch component::INSTANCE_ROOT_PATH which is v1/instances.
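
One way to picture watching a root path rather than a typed key is a prefix scan over an ordered key-value map. The sketch below is only an illustration: list_prefix is a hypothetical helper, the BTreeMap stands in for the real store, and the key layout beneath v1/mdc and v1/instances is assumed, not taken from the codebase.

```rust
use std::collections::BTreeMap;

/// Returns every (key, value) pair whose key starts with `prefix`.
/// Pass the prefix with a trailing slash so "v1/mdc/" does not match "v1/mdcx".
pub fn list_prefix<'a>(
    store: &'a BTreeMap<String, String>,
    prefix: &str,
) -> Vec<(&'a str, &'a str)> {
    store
        // Keys are ordered, so start at the prefix and stop at the first miss.
        .range(prefix.to_string()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(k, v)| (k.as_str(), v.as_str()))
        .collect()
}
```

A typed DiscoveryKey would need some mapping onto such prefixes if it is to cover both the model and the instance roots mentioned above.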

// TODO: Replace when KeyValueDiscoveryClient is implemented
Err(error!("No discovery clients yet implemented."))
})
.await?;
Contributor

Could you initialize it in new? Then you don't need the OnceCell.

Contributor Author

Agreed, this makes sense
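
A minimal sketch of what eager initialization might look like, assuming the client can be constructed synchronously inside new. Runtime, InMemoryClient, the u64 connection id, and the String error type are hypothetical; only the "drt-{id}" naming is borrowed from the bot's suggestion earlier in this review.

```rust
use std::sync::Arc;

pub trait DiscoveryClient: Send + Sync {
    fn instance_id(&self) -> String;
}

/// Placeholder client used only by this sketch.
pub struct InMemoryClient {
    instance_id: String,
}

impl DiscoveryClient for InMemoryClient {
    fn instance_id(&self) -> String {
        self.instance_id.clone()
    }
}

/// Hypothetical runtime that owns its client from construction onward.
#[derive(Clone)]
pub struct Runtime {
    discovery_client: Arc<dyn DiscoveryClient>,
}

impl Runtime {
    /// Any construction failure surfaces here, once, instead of on first use.
    pub fn new(connection_id: u64) -> Result<Self, String> {
        let client = InMemoryClient {
            instance_id: format!("drt-{connection_id}"),
        };
        Ok(Self {
            discovery_client: Arc::new(client),
        })
    }

    /// Plain accessor: no cell, no async, no Result.
    pub fn discovery_client(&self) -> Arc<dyn DiscoveryClient> {
        Arc::clone(&self.discovery_client)
    }
}
```

The accessor becomes infallible and synchronous, at the cost of requiring a working client before the runtime can be built.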
