fix: add prefill metrics support for SGLang disaggregated mode #3992

keivenchang · 2025-10-30T17:35:50Z

Overview:

Adds prefill metrics support for SGLang disaggregated mode. Both prefill and decode workers now expose metrics on separate ports (8081 and 8082) for single-GPU testing.

Details:

New Launch Script: Added disagg_same_gpu.sh for single-GPU disaggregated testing with dual metrics
Prometheus Registry Refactoring: Extracted setup_prometheus_registry() for reusability across init functions
Test Enhancement: Added disaggregated_same_gpu test validating metrics from both workers
Test Infrastructure: Extended metric_payload_default() to accept port parameter for multi-port testing

Where should the reviewer start?

components/backends/sglang/launch/disagg_same_gpu.sh - new launch script with metrics on ports 8081/8082
components/src/dynamo/sglang/publisher.py - Prometheus registry refactoring
tests/serve/test_sglang.py - dual metrics validation test

Related Issues:

DIS-911

/coderabbit profile chill

Summary by CodeRabbit

Release Notes

New Features
- Added support for disaggregated GPU serving on a single GPU with customizable memory allocation (default 24%)
- Enhanced metrics collection with Prometheus registry support for prefill and decode worker monitoring
Tests
- Added test configuration for disaggregated same GPU deployment scenarios

- Add disagg_same_gpu.sh for single-GPU disaggregated testing with metrics - Both prefill (port 8081) and decode (port 8082) workers expose metrics - Refactor setup_prometheus_registry() for reusability across init functions - Add port parameter to metric_payload_default() for multi-port testing - Add test validation for dual metrics endpoints - Include 10-second wait between workers to prevent OOM race conditions - Add aggressive memory optimizations and KV router mode for frontend Signed-off-by: Keiven Chang <[email protected]>

coderabbitai · 2025-10-30T17:39:11Z

Walkthrough

This pull request introduces disaggregated GPU inference on a single GPU with coordinated worker orchestration. Changes include a new shell script that manages prefill and decode worker processes, Prometheus metrics registry setup functions, refactored metrics initialization across SGLang components, a new test configuration, and an enhanced metrics payload utility.

Changes

Cohort / File(s)	Summary
GPU Orchestration Script `components/backends/sglang/launch/disagg_same_gpu.sh`	New shell script orchestrating single-GPU disaggregated setup with optional memory fraction parameter, GPU memory availability checks (requires ≥24 GB), graceful shutdown traps, and sequential launch of ingress router (port 8000), prefill worker (port 8081), and decode worker (port 8082) with 5-second initialization delay.
Metrics Registry Setup `components/src/dynamo/sglang/publisher.py`	Added new `setup_prometheus_registry()` function to create and configure CollectorRegistry with MultiProcessCollector and engine metrics filtering. Refactored `setup_sgl_metrics()` signature to accept component parameter and return tuple of publisher, task, and metric list. Centralized Prometheus registry initialization outside setup_sgl_metrics.
Metrics Initialization `components/src/dynamo/sglang/main.py`	Imported and integrated `setup_prometheus_registry()` calls conditionally in `init()`, `init_prefill()`, and `init_embedding()` functions following `setup_sgl_metrics()` invocation when metrics are enabled. Removed inline Prometheus setup comments.
Test Configuration `tests/serve/test_sglang.py`	Added new "disaggregated_same_gpu" test configuration using disagg_same_gpu.sh script, GPU 1, Qwen3-0.6B model, with chat payloads and dual metric probes on ports 8081 and 8082.
Payload Utility Enhancement `tests/utils/payload_builder.py`	Added optional `port` parameter (default 8081) to `metric_payload_default()` function and propagated it to MetricsPayload construction.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

disagg_same_gpu.sh: Verify GPU memory validation logic (torch.cuda.mem_get_info), process launch sequencing, port assignments, and cleanup trap correctness
publisher.py: Confirm refactoring maintains backward compatibility, registry configuration with MultiProcessCollector properly filters metrics, and tuple return type aligns with all callers
main.py: Validate consistent application of setup_prometheus_registry() across three initialization functions and correct conditional logic

Poem

🐰 A disaggregated dream on silicon so bright,
Two workers split the labor—prefill and decode in flight!
With metrics dancing merrily on ports 8081, 8082,
One GPU now orchestrates what once took two—
A hop, a skip, a Prometheus-tracked debut! ✨

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The PR title "fix: add prefill metrics support for SGLang disaggregated mode" is directly related to the main changes in the pull request. The changeset introduces metrics support for SGLang disaggregated mode, with a new launch script that exposes metrics on ports 8081 (prefill) and 8082 (decode), Prometheus registry refactoring, and test enhancements to validate this functionality. While the title emphasizes "prefill metrics support" specifically, this accurately reflects the primary enhancement being added, and the decode metrics follow naturally as part of the same refactoring effort. The title is clear, concise, and specific enough that a reviewer scanning the history would understand the key change.
Description Check	✅ Passed	The PR description is well-structured and covers all major template sections. It includes a clear overview explaining what prefill metrics support means in this context, detailed bullet points describing the specific changes (launch script, Prometheus refactoring, test enhancements, and test infrastructure), and specific file paths for reviewer focus. The "Related Issues" section references "DIS-911," though it deviates from the template's expected format of using action keywords (Closes/Fixes/Resolves) with a GitHub issue number. This is a minor formatting variation rather than a missing critical section, and the issue reference is clearly provided.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

components/src/dynamo/sglang/publisher.py (1)

199-223: Consider removing unused engine parameter.

The engine parameter is not used in the function body. Unless you're keeping it for future extensibility or API consistency, consider removing it to clean up the signature.

Apply this diff if the parameter is not needed:

-def setup_prometheus_registry(
-    engine: sgl.Engine, generate_endpoint: Endpoint
-) -> CollectorRegistry:
+def setup_prometheus_registry(generate_endpoint: Endpoint) -> CollectorRegistry:
     """Set up Prometheus registry for SGLang metrics collection.
 
     SGLang uses multiprocess architecture where metrics are stored in shared memory.
@@ -207,7 +206,6 @@
     (typically port 8081) when DYN_SYSTEM_ENABLED=true.
 
     Args:
-        engine: The SGLang engine instance.
         generate_endpoint: The Dynamo endpoint for generation requests.
 
     Returns:

And update the call sites in main.py:

-        setup_prometheus_registry(engine, generate_endpoint)
+        setup_prometheus_registry(generate_endpoint)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1da9d70 and 300fff3.

📒 Files selected for processing (5)

components/backends/sglang/launch/disagg_same_gpu.sh (1 hunks)
components/src/dynamo/sglang/main.py (4 hunks)
components/src/dynamo/sglang/publisher.py (1 hunks)
tests/serve/test_sglang.py (1 hunks)
tests/utils/payload_builder.py (2 hunks)

🧰 Additional context used

🧠 Learnings (5)

📓 Common learnings

Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.

Learnt from: keivenchang
PR: ai-dynamo/dynamo#3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.

📚 Learning: 2025-09-24T17:26:17.225Z

Learnt from: tzulingk
PR: ai-dynamo/dynamo#3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.

Applied to files:

tests/utils/payload_builder.py

📚 Learning: 2025-08-25T23:24:42.076Z

Learnt from: tzulingk
PR: ai-dynamo/dynamo#2666
File: components/backends/trtllm/src/dynamo/trtllm/publisher.py:0-0
Timestamp: 2025-08-25T23:24:42.076Z
Learning: WorkerMetricsPublisher.create_endpoint method signature has been updated in _core.pyi to include metrics_labels parameter: `def create_endpoint(self, component: str, metrics_labels: Optional[List[Tuple[str, str]]] = None)`, making the metrics_labels parameter optional with default value of None.

Applied to files:

components/src/dynamo/sglang/publisher.py

📚 Learning: 2025-09-16T00:26:43.641Z

Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

components/src/dynamo/sglang/main.py

📚 Learning: 2025-06-05T01:04:24.775Z

Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.

Applied to files:

components/src/dynamo/sglang/main.py

🧬 Code graph analysis (3)

tests/serve/test_sglang.py (1)

tests/utils/payload_builder.py (2)

chat_payload_default (18-41)

metric_payload_default (65-80)

components/src/dynamo/sglang/publisher.py (2)

lib/bindings/python/src/dynamo/_core.pyi (2)

Endpoint (133-174)

endpoint (117-121)

components/src/dynamo/common/utils/prometheus.py (1)

register_engine_metrics_callback (30-73)

components/src/dynamo/sglang/main.py (1)

components/src/dynamo/sglang/publisher.py (2)

setup_prometheus_registry (199-223)

setup_sgl_metrics (226-252)

🪛 Ruff (0.14.2)

components/src/dynamo/sglang/publisher.py

200-200: Unused function argument: engine

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: operator (amd64)
GitHub Check: operator (arm64)
GitHub Check: trtllm (arm64)
GitHub Check: sglang (amd64)
GitHub Check: sglang (arm64)
GitHub Check: vllm (arm64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (9)

components/src/dynamo/sglang/main.py (2)

21-21: LGTM: Clean import of centralized registry setup.

The import of setup_prometheus_registry aligns with the refactoring that extracts Prometheus registry initialization into a reusable function.

101-103: LGTM: Consistent Prometheus registry initialization across worker types.

The conditional registry setup follows a consistent pattern across init(), init_prefill(), and init_embedding() functions. The separation of concerns—metrics publishing via setup_sgl_metrics() and registry wiring via setup_prometheus_registry()—improves modularity.

Also applies to: 162-164, 212-214

components/backends/sglang/launch/disagg_same_gpu.sh (4)

12-27: LGTM: Robust GPU memory validation.

The memory check using PyTorch is defensive and provides clear error messages. The 24GB requirement is appropriate for disaggregated mode, and the Python-based floating-point comparison avoids external dependencies like bc.

29-36: LGTM: Proper cleanup handling for background processes.

The trap correctly kills the background processes (ingress and prefill). The decode worker runs in foreground, so it's the main process that the trap handles when it exits.

64-71: LGTM: Well-documented initialization delay.

The 5-second sleep is clearly explained with the three initialization steps (model loading, checkpoint cleanup, service discovery). While this is a fixed delay that might not suit all environments, it's a reasonable starting point for single-GPU testing.

43-91: LGTM: Clean separation of prefill and decode workers.

The worker configurations are symmetric and well-structured:

Prefill runs on port 8081 (background)

Decode runs on port 8082 (foreground)

Both use the same memory fraction and model settings

Environment variables (DYN_SYSTEM_ENABLED, DYN_SYSTEM_PORT) correctly enable metrics on separate ports

components/src/dynamo/sglang/publisher.py (1)

226-252: LGTM: Clean refactoring with component parameter.

The addition of the component parameter and the updated return type improve the function's reusability. The publisher is now correctly initialized with the component, aligning with the updated DynamoSglangPublisher constructor.

tests/utils/payload_builder.py (1)

65-80: LGTM: Clean extension for multi-port metrics testing.

The addition of the port parameter with a sensible default (8081) enables the test infrastructure to validate metrics from multiple workers on different ports. The change is backwards compatible and aligns with the disaggregated testing requirements.

tests/serve/test_sglang.py (1)

67-82: LGTM: Well-structured test for disaggregated single-GPU mode.

The new test configuration properly validates the disaggregated setup:

Uses the new disagg_same_gpu.sh script

Includes metric payloads for both prefill (8081) and decode (8082) workers

Clear inline documentation of the test's purpose

Consistent with existing SGLang test patterns

rmccorm4 · 2025-10-30T21:29:15Z

tests/serve/test_sglang.py

        models_port=8000,
        request_payloads=[chat_payload_default(), completion_payload_default()],
    ),
+    "disaggregated_same_gpu": SGLangConfig(


consider reusing 2gpu disagg.sh it it helps stabilize OOM error you're seeing on CI

keivenchang requested a review from ishandhanani October 30, 2025 17:35

keivenchang self-assigned this Oct 30, 2025

keivenchang requested review from a team as code owners October 30, 2025 17:35

pull-request-size bot added the size/L label Oct 30, 2025

github-actions bot added the fix label Oct 30, 2025

coderabbitai bot reviewed Oct 30, 2025

View reviewed changes

rmccorm4 reviewed Oct 30, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add prefill metrics support for SGLang disaggregated mode #3992

fix: add prefill metrics support for SGLang disaggregated mode #3992

keivenchang commented Oct 30, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 30, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

rmccorm4 Oct 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: add prefill metrics support for SGLang disaggregated mode #3992

Are you sure you want to change the base?

fix: add prefill metrics support for SGLang disaggregated mode #3992

Conversation

keivenchang commented Oct 30, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues:

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Oct 30, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

rmccorm4 Oct 30, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

keivenchang commented Oct 30, 2025 •

edited by coderabbitai bot

Loading