feat: Set instance endpoint status and endpoint health status #3411
Walkthrough
Updates system_health logic to clone the status when inserting, parse endpoint strings (namespace.component.endpoint-instance_id) to derive a base endpoint (without the trailing instance ID), and insert the same status for both the exact endpoint and the derived base endpoint. No signature or public API changes.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Caller
participant SystemHealth
Caller->>SystemHealth: set_endpoint_health_status(endpoint, status)
activate SystemHealth
SystemHealth->>SystemHealth: Insert status for exact endpoint (clone)
SystemHealth->>SystemHealth: Parse endpoint (namespace.component.endpoint-instance_id)
alt Base endpoint extracted
SystemHealth->>SystemHealth: Derive base endpoint (remove -instance_id)
SystemHealth->>SystemHealth: Insert same status for base endpoint
else No base endpoint
SystemHealth->>SystemHealth: No additional insert
end
deactivate SystemHealth
SystemHealth-->>Caller: return
Estimated code review effort: 2 (Simple), ~10 minutes
nnshah1
left a comment
Instead of parsing the endpoint string and duplicating the status, can we use the endpoint instead of the endpoint subject?
…e name with lease id Signed-off-by: [email protected] <[email protected]>
Tested; see https://linear.app/nvidia/issue/DIS-702/sglang-engine-health-check#comment-f3f164cf. We now register the instance using the endpoint name, since there will always be a one-to-one mapping between endpoint and instance.
indrajit96
left a comment
LGTM!
Signed-off-by: [email protected] <[email protected]>
Overview:
This PR enhances the endpoint health status management in the SystemHealth module to automatically set health status for both endpoint instances and their corresponding base endpoints. This ensures that health status is properly propagated for both instance-specific endpoints and their base endpoint names.
Details:
The changes modify the set_endpoint_health_status method in lib/runtime/src/system_health.rs to:
Set the health status for the full endpoint instance (with instance ID)
Automatically extract and set the same health status for the base endpoint name (e.g., "generate")
Endpoint format: namespace.component.endpoint-instance_id
Example: sglang-agg_backend.generate-5e7b99a870acae05
This change ensures that health checks and status queries can work at both the instance level and the base endpoint level, providing more flexibility in health monitoring and status reporting.
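The parse-and-duplicate behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in lib/runtime/src/system_health.rs: the HealthStatus enum, the SystemHealth struct layout, and the base_endpoint helper are all hypothetical names introduced here, and the sketch assumes the instance ID is whatever follows the last hyphen in the final dot-separated segment.

```rust
use std::collections::HashMap;

// Hypothetical status type; the real enum in the runtime may differ.
#[derive(Clone, Debug, PartialEq)]
enum HealthStatus {
    Ready,
    NotReady,
}

// Hypothetical container; only the map needed for this sketch is modeled.
#[derive(Default)]
struct SystemHealth {
    endpoint_health: HashMap<String, HealthStatus>,
}

/// Hypothetical helper: given `namespace.component.endpoint-instance_id`,
/// return the base endpoint name (e.g. "generate"), or None when no
/// trailing `-instance_id` is present.
fn base_endpoint(endpoint_subject: &str) -> Option<&str> {
    // Look only at the last dot-separated segment, so hyphens in the
    // namespace or component (e.g. "sglang-agg_backend") are ignored.
    let last_segment = endpoint_subject.rsplit('.').next()?;
    // Split on the final hyphen: left is the base name, right is the ID.
    let (base, instance_id) = last_segment.rsplit_once('-')?;
    if base.is_empty() || instance_id.is_empty() {
        return None;
    }
    Some(base)
}

impl SystemHealth {
    /// Store the status under the exact endpoint string and, when a base
    /// endpoint can be derived, under the base name as well.
    fn set_endpoint_health_status(&mut self, endpoint: &str, status: HealthStatus) {
        self.endpoint_health
            .insert(endpoint.to_string(), status.clone());
        if let Some(base) = base_endpoint(endpoint) {
            self.endpoint_health.insert(base.to_string(), status);
        }
    }
}
```

One edge case worth noting under these assumptions: an endpoint name that itself contains a hyphen but carries no instance ID (say, a hypothetical `ns.comp.my-endpoint`) would be split at that hyphen, so the heuristic is only safe when an instance ID is always appended.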
Where should the reviewer start?
lib/runtime/src/system_health.rs (lines 93-109): Focus on the set_endpoint_health_status method implementation
Review the endpoint name parsing logic
Verify the string manipulation for extracting base endpoint names is robust
Consider edge cases for endpoint naming conventions
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
DIS-702
Summary by CodeRabbit