feat(grpc): implement continuous Watch streaming for health servicers #917
V2arK wants to merge 9 commits into lightseekorg:main
Conversation
Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
TDD red phase: 7 tests for SGLangHealthServicer.Watch() continuous streaming. 5 fail against current single-yield implementation. Adds sglang MagicMock stubs to conftest to allow collection without a full SGLang installation. Signed-off-by: Honglin Zhu <honglin@nvidia.com> Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
TDD red phase: 7 tests for VllmHealthServicer.Watch() continuous streaming. 3 fail (exits_on_shutdown, engine_failure, no_duplicate) as expected; 4 pass against current single-yield stub. Also adds vllm module stubs to conftest so tests collect without vLLM installed. Signed-off-by: Honglin <honglin@nvidia.com> Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
_notify_shutdown() now also sets self._watch_notified_shutdown = True so subclasses can detect explicit shutdown (via set_not_serving()) in _is_shutting_down() independently of their engine-specific flags. Signed-off-by: Honglin Cao <Caohonglin317@hotmail.com>
📝 Walkthrough

This PR updates the grpc-servicer package to version 0.6.0, introducing a new `HealthWatchMixin`.

Changes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Code Review
This pull request introduces a shared HealthWatchMixin to implement the gRPC Health Checking Protocol's Watch RPC for both SGLang and vLLM inference engines. The mixin provides a continuous streaming response that updates clients on health status changes or server shutdown. The PR also includes a version bump to 0.6.0, the addition of test-specific dependencies in pyproject.toml, and a comprehensive suite of unit tests. Feedback was provided regarding the use of inspect.isawaitable() for more robust detection of asynchronous results in the mixin's status resolution logic.
```python
    async def _resolve_watch_status(self, service_name: str) -> int:
        """Call _compute_watch_status, handling both sync and async impls."""
        result = self._compute_watch_status(service_name)
        if asyncio.iscoroutine(result):
```
For more robust detection of awaitable results from _compute_watch_status, it's better to use inspect.isawaitable() instead of asyncio.iscoroutine(). isawaitable() is more general and correctly handles not just coroutines from async def functions, but also other awaitable objects like asyncio.Future or custom objects with an __await__ method. This makes the mixin more resilient to different implementation patterns in subclasses.
You'll need to add import inspect at the top of the file.
```diff
-        if asyncio.iscoroutine(result):
+        if inspect.isawaitable(result):
```
@coderabbitai review

✅ Actions performed: review triggered.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 012e5f6e16
```python
                    self._watch_shutdown_event.wait(),
                    timeout=self.WATCH_POLL_INTERVAL_S,
                )
            except TimeoutError:
```
There was a problem hiding this comment.
Catch asyncio.TimeoutError in Watch poll loop
Watch() currently catches built-in TimeoutError, but on Python 3.10 (which is supported via requires-python >=3.10) asyncio.wait_for() raises asyncio.TimeoutError instead. When a stream is healthy and no shutdown event occurs for one poll interval, that timeout escapes the loop and aborts the RPC rather than continuing to poll, so long-lived watch streams terminate unexpectedly in normal operation.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/sglang/health_servicer.py`:
- Around line 149-176: Extract the hard-coded 30s timeout into a class-level
constant (e.g., SCHEDULER_RESPONSIVENESS_TIMEOUT_S) and replace the literal 30
in both _compute_watch_status and Check with
self.SCHEDULER_RESPONSIVENESS_TIMEOUT_S; add the constant to the class
definition, update the time_since comparison in _compute_watch_status and the
corresponding check in Check() to use that constant, and ensure any tests or
other methods referencing the 30s behavior use the new constant name.
📒 Files selected for processing (8)
- grpc_servicer/pyproject.toml
- grpc_servicer/smg_grpc_servicer/health_watch.py
- grpc_servicer/smg_grpc_servicer/sglang/health_servicer.py
- grpc_servicer/smg_grpc_servicer/vllm/health_servicer.py
- grpc_servicer/tests/__init__.py
- grpc_servicer/tests/conftest.py
- grpc_servicer/tests/test_sglang_health_watch.py
- grpc_servicer/tests/test_vllm_health_watch.py
```diff
+    def _is_shutting_down(self) -> bool:
+        # _watch_notified_shutdown is set by _notify_shutdown() in set_not_serving();
+        # gracefully_exit covers external shutdown from the request manager.
+        return self.request_manager.gracefully_exit or self._watch_notified_shutdown
+
-        Yields:
-            HealthCheckResponse messages when status changes
-        """
-        service_name = request.service
-        logger.debug(f"Health watch request for service: '{service_name}'")
-
-        # Send current status
-        response = await self.Check(request, context)
-        yield response
-
-        # Note: Full Watch implementation would monitor status changes
-        # and stream updates. For K8s probes, Check is sufficient.
+    def _compute_watch_status(self, service_name: str) -> int:
+        """Sync status computation -- no I/O needed."""
+        if self.request_manager.gracefully_exit:
+            return NOT_SERVING
+
+        if service_name == self.OVERALL_SERVER:
+            return self._serving_status.get(self.OVERALL_SERVER, NOT_SERVING)
+
+        if service_name == self.SGLANG_SERVICE:
+            base_status = self._serving_status.get(self.SGLANG_SERVICE, NOT_SERVING)
+            if base_status != SERVING:
+                return base_status
+            time_since = time.time() - self.request_manager.last_receive_tstamp
+            if time_since > 30 and len(self.request_manager.rid_to_state) > 0:
+                logger.warning(
+                    "Scheduler not responsive (%.1fs, %d pending)",
+                    time_since,
+                    len(self.request_manager.rid_to_state),
+                )
+                return NOT_SERVING
+            return SERVING
+
+        return SERVICE_UNKNOWN
```
🧹 Nitpick | 🔵 Trivial

LGTM with minor note.

The implementations are correct:
- `_is_shutting_down()` appropriately checks both `gracefully_exit` and `_watch_notified_shutdown` to handle external shutdown signals
- `_compute_watch_status()` correctly mirrors `Check()` logic as a sync method
💡 Consider extracting the magic number 30 as a constant

The 30-second scheduler responsiveness timeout is duplicated between `Check()` (line 127) and `_compute_watch_status()` (line 167). Consider extracting this as a class constant for maintainability:

```diff
+    SCHEDULER_RESPONSIVENESS_TIMEOUT_S = 30
+
     # Service names we support
     OVERALL_SERVER = ""
```

Then use `self.SCHEDULER_RESPONSIVENESS_TIMEOUT_S` in both methods.
Description
Problem
`SGLangHealthServicer.Watch()` and `VllmHealthServicer.Watch()` yield a single response then close the stream. This violates the gRPC Health Checking Protocol, which requires Watch to be a long-lived server-streaming RPC that sends updates whenever the service's health status changes.

Additionally, `SGLangHealthServicer.Watch()` delegates to `self.Check()`, which calls `context.set_code(NOT_FOUND)` and `context.set_details()` for unknown services, polluting the streaming response context.

Follow-up from #885. Ref: vllm-project/vllm#38016.
Solution
Add `HealthWatchMixin` providing the Watch loop skeleton (poll + `asyncio.Event` for immediate shutdown wakeup, yield-on-change, cancel handling). Both servicers integrate the mixin and implement `_compute_watch_status()` and `_is_shutting_down()` (vLLM's async implementation calls `await async_llm.check_health()`).

The mixin's `_resolve_watch_status()` bridge method auto-detects sync vs async implementations via `asyncio.iscoroutine()`, so each servicer uses its natural calling convention.

Spec deviation: for unknown services, the stream sends `SERVICE_UNKNOWN` once then exits (the spec says to keep the stream open for dynamic registration, but smg services are statically defined).

Test Plan
- Unit tests: 14/14 passed (macOS + x86 Linux)
- vLLM E2E Watch deferred -- requires vllm-project/vllm#38016 to register `grpc.health.v1.Health` in the gRPC server.

Checklist
- `cargo +nightly fmt` passes (no Rust changes)
- `cargo clippy --all-targets --all-features -- -D warnings` passes (no Rust changes)

Summary by CodeRabbit
Chores
New Features
Tests