
feat(grpc): implement continuous Watch streaming for health servicers #917

Draft
V2arK wants to merge 9 commits into lightseekorg:main from V2arK:feat/grpc-watch-continuous-stream

Conversation


@V2arK V2arK commented Mar 26, 2026

Description

Problem

SGLangHealthServicer.Watch() and VllmHealthServicer.Watch() yield a single response then close the stream. This violates the gRPC Health Checking Protocol, which requires Watch to be a long-lived server-streaming RPC that sends updates whenever the service's health status changes.

Additionally, SGLangHealthServicer.Watch() delegates to self.Check(), which calls context.set_code(NOT_FOUND) and context.set_details() for unknown services, polluting the streaming response context.

Follow-up from #885. Ref: vllm-project/vllm#38016.

Solution

Add HealthWatchMixin providing the Watch loop skeleton (poll + asyncio.Event for immediate shutdown wakeup, yield-on-change, cancel handling). Both servicers integrate the mixin and implement _compute_watch_status() and _is_shutting_down().

  • SGLang: sync status computation (dict lookup + scheduler responsiveness check)
  • vLLM: async status computation (await async_llm.check_health())

The mixin's _resolve_watch_status() bridge method auto-detects sync vs async implementations via asyncio.iscoroutine(), so each servicer uses its natural calling convention.
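The loop and bridge described above can be sketched as follows. The class and hook names (HealthWatchMixin, Watch, _resolve_watch_status, _compute_watch_status, _is_shutting_down, _notify_shutdown, WATCH_POLL_INTERVAL_S) come from this PR, but the body is illustrative rather than the actual implementation: it yields raw status ints instead of HealthCheckResponse messages, and it uses inspect.isawaitable(), a slightly more general check than asyncio.iscoroutine().

```python
import asyncio
import inspect


class HealthWatchMixin:
    """Sketch of the Watch skeleton: poll, yield on change, wake early on shutdown.

    Subclasses provide _compute_watch_status() (sync or async) and
    _is_shutting_down(). Real code yields HealthCheckResponse messages;
    this sketch yields raw status ints.
    """

    WATCH_POLL_INTERVAL_S = 1.0

    def __init__(self):
        self._watch_shutdown_event = asyncio.Event()

    def _notify_shutdown(self):
        # Called on shutdown (e.g. from set_not_serving()); wakes any
        # in-flight Watch stream immediately instead of after a full poll.
        self._watch_shutdown_event.set()

    async def _resolve_watch_status(self, service_name):
        """Bridge: accept both sync and async _compute_watch_status impls."""
        result = self._compute_watch_status(service_name)
        if inspect.isawaitable(result):
            result = await result
        return result

    async def Watch(self, request, context):
        last_status = None
        while True:
            status = await self._resolve_watch_status(request.service)
            if status != last_status:  # send only on change, no duplicates
                last_status = status
                yield status
            if self._is_shutting_down():
                return
            try:
                # Sleep one poll interval, but wake at once if shutdown fires.
                await asyncio.wait_for(
                    self._watch_shutdown_event.wait(),
                    timeout=self.WATCH_POLL_INTERVAL_S,
                )
            except asyncio.TimeoutError:  # asyncio.* spelling for 3.10 compat
                pass
```

A subclass only implements the two hooks; a sync return value and an awaitable both work through the same bridge.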

Spec deviation: for unknown services, the stream sends SERVICE_UNKNOWN once then exits (spec says keep open for dynamic registration, but smg services are statically defined).

Test Plan

cd grpc_servicer
pip install -e ".[test]"
pytest tests/ -xvs

Unit tests: 14/14 passed (macOS + x86 Linux)

#  Test                                                                  SGLang  vLLM
1  Initial status sent immediately                                       PASS    PASS
2  Status change yields new response                                     PASS    PASS
3  Shutdown exits stream                                                 PASS    PASS
4  Client cancel handled cleanly                                         PASS    PASS
5  Unknown service: SERVICE_UNKNOWN, no context.set_code                 PASS    PASS
6  No duplicate sends on stable status                                   PASS    PASS
7  Shutdown edge case (graceful_exit poll / shutdown overrides healthy)  PASS    PASS

vLLM E2E Watch deferred -- requires vllm-project/vllm#38016 to register grpc.health.v1.Health in the gRPC server.

Checklist
  • cargo +nightly fmt passes (no Rust changes)
  • cargo clippy --all-targets --all-features -- -D warnings passes (no Rust changes)
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • Chores

    • Updated package version to 0.6.0.
    • Added test dependencies: pytest, pytest-asyncio, and pytest-timeout.
  • New Features

    • Refactored health check streaming protocol implementation.
  • Tests

    • Added test configuration module with fixtures for gRPC mocking and health status constants.
    • Added comprehensive test coverage for health streaming behavior across service implementations.

V2arK added 9 commits March 26, 2026 12:06
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
TDD red phase: 7 tests for SGLangHealthServicer.Watch() continuous
streaming. 5 fail against current single-yield implementation.
Adds sglang MagicMock stubs to conftest to allow collection without
a full SGLang installation.

Signed-off-by: Honglin Zhu <honglin@nvidia.com>
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
TDD red phase: 7 tests for VllmHealthServicer.Watch() continuous
streaming. 3 fail (exits_on_shutdown, engine_failure, no_duplicate)
as expected; 4 pass against current single-yield stub. Also adds
vllm module stubs to conftest so tests collect without vLLM installed.

Signed-off-by: Honglin <honglin@nvidia.com>
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
_notify_shutdown() now also sets self._watch_notified_shutdown = True
so subclasses can detect explicit shutdown (via set_not_serving()) in
_is_shutting_down() independently of their engine-specific flags.

Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
@V2arK V2arK marked this pull request as draft March 26, 2026 17:05

coderabbitai bot commented Mar 26, 2026

📝 Walkthrough

This PR updates the grpc-servicer package to version 0.6.0, introducing a new HealthWatchMixin base class that centralizes gRPC Health Checking Protocol Watch RPC streaming behavior. The SGLang and vLLM health servicers are refactored to inherit from this mixin, replacing their inline Watch implementations. Test dependencies are added along with comprehensive test suites covering the new functionality.

Changes

  • Package Configuration (grpc_servicer/pyproject.toml):
    Version bumped from 0.5.1 to 0.6.0; added optional test dependencies group with pytest, pytest-asyncio, and pytest-timeout.
  • Health Watch Mixin (grpc_servicer/smg_grpc_servicer/health_watch.py):
    New module introducing the HealthWatchMixin class that implements the streaming Watch RPC with status polling, change detection, and graceful shutdown handling.
  • Servicer Refactoring (grpc_servicer/smg_grpc_servicer/sglang/health_servicer.py, grpc_servicer/smg_grpc_servicer/vllm/health_servicer.py):
    Both servicers now inherit from HealthWatchMixin; removed inline Watch implementations; added _compute_watch_status() and _is_shutting_down() hooks specific to each servicer's health logic.
  • Test Infrastructure (grpc_servicer/tests/conftest.py):
    New pytest configuration with module mocks for vllm and sglang dependencies, health status constants, and fixtures for gRPC context and request messages.
  • Health Watch Tests (grpc_servicer/tests/test_sglang_health_watch.py, grpc_servicer/tests/test_vllm_health_watch.py):
    Comprehensive async test suites validating Watch stream behavior: initial status emission, status change detection, graceful shutdown, client cancellation handling, unknown service responses, and no duplicate messages during stable periods.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

grpc

Suggested reviewers

  • CatherineSue
  • slin1237
  • gongwei-130

Poem

🐰 Hops with glee through Watch RPC streams,
A mixin born of healthcheck dreams,
Status flows when changes appear,
Graceful shutdowns crystal clear,
Tests ensure the logic's sound, 🏥✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 61.70%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly and specifically summarizes the main change: implementing continuous Watch streaming for health servicers.



@github-actions github-actions bot added dependencies Dependency updates tests Test changes labels Mar 26, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a shared HealthWatchMixin to implement the gRPC Health Checking Protocol's Watch RPC for both SGLang and vLLM inference engines. The mixin provides a continuous streaming response that updates clients on health status changes or server shutdown. The PR also includes a version bump to 0.6.0, the addition of test-specific dependencies in pyproject.toml, and a comprehensive suite of unit tests. Feedback was provided regarding the use of inspect.isawaitable() for more robust detection of asynchronous results in the mixin's status resolution logic.

    async def _resolve_watch_status(self, service_name: str) -> int:
        """Call _compute_watch_status, handling both sync and async impls."""
        result = self._compute_watch_status(service_name)
        if asyncio.iscoroutine(result):

Severity: medium

For more robust detection of awaitable results from _compute_watch_status, it's better to use inspect.isawaitable() instead of asyncio.iscoroutine(). isawaitable() is more general and correctly handles not just coroutines from async def functions, but also other awaitable objects like asyncio.Future or custom objects with an __await__ method. This makes the mixin more resilient to different implementation patterns in subclasses.

You'll need to add import inspect at the top of the file.

Suggested change:

-        if asyncio.iscoroutine(result):
+        if inspect.isawaitable(result):
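The distinction the reviewer is drawing can be demonstrated in a few lines: a coroutine from an `async def` call satisfies both predicates, but an `asyncio.Future` (or any object with `__await__`) is awaitable without being a coroutine, so `asyncio.iscoroutine()` would skip the needed `await`.

```python
import asyncio
import inspect

async def main():
    async def coro_fn():
        return 1

    coro = coro_fn()
    fut = asyncio.get_running_loop().create_future()
    fut.set_result(1)

    # An `async def` coroutine passes both checks...
    coro_checks = (asyncio.iscoroutine(coro), inspect.isawaitable(coro))
    # ...but a Future is awaitable without being a coroutine.
    fut_checks = (asyncio.iscoroutine(fut), inspect.isawaitable(fut))

    await coro  # silence "coroutine was never awaited"
    return coro_checks, fut_checks

coro_checks, fut_checks = asyncio.run(main())
print(coro_checks, fut_checks)  # (True, True) (False, True)
```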


V2arK commented Mar 26, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration
Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b35dd2b6-6b84-4827-892b-4ba6ea2d1214

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@coderabbitai review


coderabbitai bot commented Mar 26, 2026

@V2arK Sure, I'll review the PR now!


🧠 Learnings used
Learnt from: V2arK
Repo: lightseekorg/smg PR: 885
File: grpc_servicer/smg_grpc_servicer/vllm/health_servicer.py:0-0
Timestamp: 2026-03-24T18:53:58.838Z
Learning: In repo lightseekorg/smg, the single-yield Watch() pattern in grpc health servicers is intentional and consistent across both `grpc_servicer/smg_grpc_servicer/sglang/health_servicer.py` and `grpc_servicer/smg_grpc_servicer/vllm/health_servicer.py`. Kubernetes only uses the Check RPC, so Watch yields one status response then closes. A persistent-stream Watch is a known future improvement to be implemented for both servicers together in a follow-up PR. Do not flag the single-yield Watch as a protocol violation in these files.

Learnt from: V2arK
Repo: lightseekorg/smg PR: 885
File: grpc_servicer/smg_grpc_servicer/vllm/health_servicer.py:49-52
Timestamp: 2026-03-24T18:53:15.536Z
Learning: In repo lightseekorg/smg, `VllmHealthServicer` (grpc_servicer/smg_grpc_servicer/vllm/health_servicer.py) is intentionally only defined in smg — it is a library component. Instantiation, gRPC server registration, and `set_not_serving()` invocation during graceful shutdown are all wired in the consumer repo (vLLM's `vllm/entrypoints/grpc_server.py`, `serve_grpc()` function). This is the same pattern as `SGLangHealthServicer`, which is defined in smg but wired externally by the consumer. Do not flag missing server wiring in health_servicer.py as a bug.
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 012e5f6e16


                    self._watch_shutdown_event.wait(),
                    timeout=self.WATCH_POLL_INTERVAL_S,
                )
            except TimeoutError:

P1: Catch asyncio.TimeoutError in Watch poll loop

Watch() currently catches built-in TimeoutError, but on Python 3.10 (which is supported via requires-python >=3.10) asyncio.wait_for() raises asyncio.TimeoutError instead. When a stream is healthy and no shutdown event occurs for one poll interval, that timeout escapes the loop and aborts the RPC rather than continuing to poll, so long-lived watch streams terminate unexpectedly in normal operation.
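The version split the reviewer describes can be shown with a small self-contained poll helper (the function name and intervals here are illustrative, not from the PR): on Python 3.10, `asyncio.wait_for()` raises `asyncio.TimeoutError`, which is not a subclass of the builtin `TimeoutError`; from 3.11 on, the two names are aliases, so catching `asyncio.TimeoutError` is safe on every supported version.

```python
import asyncio

async def poll_once(event: asyncio.Event, interval: float) -> str:
    """One iteration of a shutdown-aware poll sleep (illustrative helper)."""
    try:
        await asyncio.wait_for(event.wait(), timeout=interval)
        return "woken"
    except asyncio.TimeoutError:
        # On 3.10 this is the only spelling that catches the timeout;
        # on 3.11+ asyncio.TimeoutError IS the builtin, so it still works.
        return "timed out"

async def main():
    ev = asyncio.Event()
    first = await poll_once(ev, 0.01)   # nothing sets the event: times out
    ev.set()
    second = await poll_once(ev, 0.01)  # event already set: wakes immediately
    return first, second

first, second = asyncio.run(main())
print(first, second)  # timed out woken
```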



@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/sglang/health_servicer.py`:
- Around line 149-176: Extract the hard-coded 30s timeout into a class-level
constant (e.g., SCHEDULER_RESPONSIVENESS_TIMEOUT_S) and replace the literal 30
in both _compute_watch_status and Check with
self.SCHEDULER_RESPONSIVENESS_TIMEOUT_S; add the constant to the class
definition, update the time_since comparison in _compute_watch_status and the
corresponding check in Check() to use that constant, and ensure any tests or
other methods referencing the 30s behavior use the new constant name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 033c0e2b-a6dd-400c-b263-c6c6868ccb12

📥 Commits

Reviewing files that changed from the base of the PR and between cb8407f and 012e5f6.

📒 Files selected for processing (8)
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/health_watch.py
  • grpc_servicer/smg_grpc_servicer/sglang/health_servicer.py
  • grpc_servicer/smg_grpc_servicer/vllm/health_servicer.py
  • grpc_servicer/tests/__init__.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_sglang_health_watch.py
  • grpc_servicer/tests/test_vllm_health_watch.py

Comment on lines +149 to +176
    def _is_shutting_down(self) -> bool:
        # _watch_notified_shutdown is set by _notify_shutdown() in set_not_serving();
        # gracefully_exit covers external shutdown from the request manager.
        return self.request_manager.gracefully_exit or self._watch_notified_shutdown

    def _compute_watch_status(self, service_name: str) -> int:
        """Sync status computation -- no I/O needed."""
        if self.request_manager.gracefully_exit:
            return NOT_SERVING

        if service_name == self.OVERALL_SERVER:
            return self._serving_status.get(self.OVERALL_SERVER, NOT_SERVING)

        if service_name == self.SGLANG_SERVICE:
            base_status = self._serving_status.get(self.SGLANG_SERVICE, NOT_SERVING)
            if base_status != SERVING:
                return base_status
            time_since = time.time() - self.request_manager.last_receive_tstamp
            if time_since > 30 and len(self.request_manager.rid_to_state) > 0:
                logger.warning(
                    "Scheduler not responsive (%.1fs, %d pending)",
                    time_since,
                    len(self.request_manager.rid_to_state),
                )
                return NOT_SERVING
            return SERVING

        return SERVICE_UNKNOWN

🧹 Nitpick | 🔵 Trivial

LGTM with minor note.

The implementations are correct:

  • _is_shutting_down() appropriately checks both gracefully_exit and _watch_notified_shutdown to handle external shutdown signals
  • _compute_watch_status() correctly mirrors Check() logic as a sync method
💡 Consider extracting the magic number 30 as a constant

The 30-second scheduler responsiveness timeout is duplicated between Check() (line 127) and _compute_watch_status() (line 167). Consider extracting this as a class constant for maintainability:

+    SCHEDULER_RESPONSIVENESS_TIMEOUT_S = 30
+
     # Service names we support
     OVERALL_SERVER = ""

Then use self.SCHEDULER_RESPONSIVENESS_TIMEOUT_S in both methods.


