Skip to content

Conversation

@tzulingk
Copy link
Contributor

@tzulingk tzulingk commented Oct 3, 2025

Overview:

This PR enhances the endpoint health status management in the SystemHealth module to automatically set health status for both endpoint instances and their corresponding base endpoints. This ensures that health status is properly propagated for both instance-specific endpoints and their base endpoint names.

Env

DYN_SYSTEM_ENABLED=true
DYN_SYSTEM_PORT=9090
DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS="generate"

Before:

# Check worker health after known failure
$ curl localhost:${DYN_SYSTEM_PORT}/health
...
"generate" - healthy   # ISSUE 1 - we don't update health of 'generate' endpoint, but we check it as the signal
...
"generate_<instance_id>" - not healthy   # ISSUE 2 - we have a separate endpoint that we update but don't check

After:

# Check worker health after known failure
$ curl localhost:${DYN_SYSTEM_PORT}/health
...
"generate" - NOT healthy  # FIX 1 - we update the health of 'generate' endpoint
# FIX 2 - we remove the extra instance-specific endpoint here since there's only 1 per DRT/pod 

Details:

The changes modify the set_endpoint_health_status method in lib/runtime/src/system_health.rs to:

  1. Parse endpoint names to extract the base endpoint name from the full endpoint identifier
    Endpoint format: namespace.component.endpoint-instance_id
    Example: sglang-agg_backend.generate-5e7b99a870acae05
  2. Set health status for both levels:
    Sets health status for the full endpoint instance (with instance ID)
    Automatically extracts and sets the same health status for the base endpoint name (e.g., "generate")
  3. Handle instance ID suffix removal: The implementation identifies and removes the instance ID suffix (everything after the last hyphen) to derive the base endpoint name
    This change ensures that health checks and status queries can work at both the instance level and the base endpoint level, providing more flexibility in health monitoring and status reporting.

Where should the reviewer start?

lib/runtime/src/system_health.rs (lines 93-109): Focus on the set_endpoint_health_status method implementation
Review the endpoint name parsing logic
Verify the string manipulation for extracting base endpoint names is robust
Consider edge cases for endpoint naming conventions

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-702

Summary by CodeRabbit

  • New Features
    • Health status set for an instance-specific endpoint now also updates the corresponding base endpoint, improving aggregated visibility in health views.
  • Bug Fixes
    • Ensures endpoint health updates are consistently reflected across related endpoints, reducing mismatches in monitoring dashboards.
  • Chores
    • No changes to the public API; behavior refined to improve status propagation.

@tzulingk tzulingk requested a review from a team as a code owner October 3, 2025 18:38
@github-actions github-actions bot added the feat label Oct 3, 2025
@tzulingk tzulingk enabled auto-merge (squash) October 3, 2025 18:39
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 3, 2025

Walkthrough

Updates system_health logic to clone status when inserting, parse endpoint strings (namespace.component.endpoint-instance_id) to derive a base endpoint (without the trailing instance ID), and insert the same status for both the exact endpoint and the derived base endpoint. No signature or public API changes.

Changes

Cohort / File(s) Summary of Changes
System health status propagation
lib/runtime/src/system_health.rs
Clone status for reuse; parse endpoint to extract base name by removing suffix after last hyphen in the final segment; insert health status for exact endpoint and derived base endpoint; no API/signature changes.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant SystemHealth

  Caller->>SystemHealth: set_endpoint_health_status(endpoint, status)
  activate SystemHealth
  SystemHealth->>SystemHealth: Insert status for exact endpoint (clone)
  SystemHealth->>SystemHealth: Parse endpoint (namespace.component.endpoint-instance_id)
  alt Base endpoint extracted
    SystemHealth->>SystemHealth: Derive base endpoint (remove -instance_id)
    SystemHealth->>SystemHealth: Insert same status for base endpoint
  else No base endpoint
    SystemHealth->>SystemHealth: No additional insert
  end
  deactivate SystemHealth
  SystemHealth-->>Caller: return
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I hop between dots in the meadow of names,
Sniff out the base from hyphenated games.
Clone a carrot of status, share it twice,
Exact and base get the same crisp slice.
Burrow secured, endpoints align—
Health blooms green in the warren’s design. 🥕

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The PR description correctly includes the Overview, Details, and Where should the reviewer start sections per the template, but the Related Issues section lacks an action keyword and proper bullet formatting as specified in the repository’s description template. Please update the Related Issues section to include a dash and an appropriate action keyword (for example “- Relates to DIS-702” or “- Closes DIS-702”) to fully comply with the template’s required format.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title clearly summarizes the primary change by indicating that the feature adds setting of health status for both endpoint instances and endpoints generally, which matches the core update to the set_endpoint_health_status method.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1faf015 and 7835f1b.

📒 Files selected for processing (1)
  • lib/runtime/src/system_health.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
lib/runtime/src/system_health.rs (2)

95-95: LGTM: Clone enables dual status insertion.

Cloning the status here is necessary to allow reuse for the base endpoint insertion at line 106.


97-108: Parsing logic is correct—instance_id is an integer
Instance.instance_id is an i64 (numeric), so IDs never contain hyphens. rfind('-') therefore cleanly separates any hyphens in the endpoint name from the numeric ID. No changes required.

Likely an incorrect or invalid review comment.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of parsing and duplicating the status - can we use the endpoint instead of the endpoint subject?

@tzulingk
Copy link
Contributor Author

tzulingk commented Oct 6, 2025

Tested see https://linear.app/nvidia/issue/DIS-702/sglang-engine-health-check#comment-f3f164cf.

Now we registered the instance using endpoint name, since we will always have 1 endpoint - 1 instance mapping

@tzulingk tzulingk requested review from indrajit96 and nnshah1 October 6, 2025 18:29
Copy link
Contributor

@indrajit96 indrajit96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tzulingk tzulingk merged commit 332482a into main Oct 6, 2025
25 of 26 checks passed
@tzulingk tzulingk deleted the tzulingk/health_status branch October 6, 2025 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants