Skip to content

Conversation

@keivenchang
Copy link
Contributor

@keivenchang keivenchang commented Oct 30, 2025

Overview:

Adds prefill metrics support for SGLang disaggregated mode. Both prefill and decode workers now expose metrics on separate ports (8081 and 8082) for single-GPU testing.

Details:

  • New Launch Script: Added disagg_same_gpu.sh for single-GPU disaggregated testing with dual metrics
  • Prometheus Registry Refactoring: Extracted setup_prometheus_registry() for reusability across init functions
  • Test Enhancement: Added disaggregated_same_gpu test validating metrics from both workers
  • Test Infrastructure: Extended metric_payload_default() to accept port parameter for multi-port testing

Where should the reviewer start?

  • components/backends/sglang/launch/disagg_same_gpu.sh - new launch script with metrics on ports 8081/8082
  • components/src/dynamo/sglang/publisher.py - Prometheus registry refactoring
  • tests/serve/test_sglang.py - dual metrics validation test

Related Issues:

DIS-911

/coderabbit profile chill

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for disaggregated GPU serving on a single GPU with customizable memory allocation (default 24%)
    • Enhanced metrics collection with Prometheus registry support for prefill and decode worker monitoring
  • Tests

    • Added test configuration for disaggregated same GPU deployment scenarios

- Add disagg_same_gpu.sh for single-GPU disaggregated testing with metrics
- Both prefill (port 8081) and decode (port 8082) workers expose metrics
- Refactor setup_prometheus_registry() for reusability across init functions
- Add port parameter to metric_payload_default() for multi-port testing
- Add test validation for dual metrics endpoints
- Include 10-second wait between workers to prevent OOM race conditions
- Add aggressive memory optimizations and KV router mode for frontend

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang self-assigned this Oct 30, 2025
@keivenchang keivenchang requested review from a team as code owners October 30, 2025 17:35
@github-actions github-actions bot added the fix label Oct 30, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 30, 2025

Walkthrough

This pull request introduces disaggregated GPU inference on a single GPU with coordinated worker orchestration. Changes include a new shell script that manages prefill and decode worker processes, Prometheus metrics registry setup functions, refactored metrics initialization across SGLang components, a new test configuration, and an enhanced metrics payload utility.

Changes

Cohort / File(s) Summary
GPU Orchestration Script
components/backends/sglang/launch/disagg_same_gpu.sh
New shell script orchestrating single-GPU disaggregated setup with optional memory fraction parameter, GPU memory availability checks (requires ≥24 GB), graceful shutdown traps, and sequential launch of ingress router (port 8000), prefill worker (port 8081), and decode worker (port 8082) with 5-second initialization delay.
Metrics Registry Setup
components/src/dynamo/sglang/publisher.py
Added new setup_prometheus_registry() function to create and configure CollectorRegistry with MultiProcessCollector and engine metrics filtering. Refactored setup_sgl_metrics() signature to accept component parameter and return tuple of publisher, task, and metric list. Centralized Prometheus registry initialization outside setup_sgl_metrics.
Metrics Initialization
components/src/dynamo/sglang/main.py
Imported and integrated setup_prometheus_registry() calls conditionally in init(), init_prefill(), and init_embedding() functions following setup_sgl_metrics() invocation when metrics are enabled. Removed inline Prometheus setup comments.
Test Configuration
tests/serve/test_sglang.py
Added new "disaggregated_same_gpu" test configuration using disagg_same_gpu.sh script, GPU 1, Qwen3-0.6B model, with chat payloads and dual metric probes on ports 8081 and 8082.
Payload Utility Enhancement
tests/utils/payload_builder.py
Added optional port parameter (default 8081) to metric_payload_default() function and propagated it to MetricsPayload construction.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • disagg_same_gpu.sh: Verify GPU memory validation logic (torch.cuda.mem_get_info), process launch sequencing, port assignments, and cleanup trap correctness
  • publisher.py: Confirm refactoring maintains backward compatibility, registry configuration with MultiProcessCollector properly filters metrics, and tuple return type aligns with all callers
  • main.py: Validate consistent application of setup_prometheus_registry() across three initialization functions and correct conditional logic

Poem

🐰 A disaggregated dream on silicon so bright,
Two workers split the labor—prefill and decode in flight!
With metrics dancing merrily on ports 8081, 8082,
One GPU now orchestrates what once took two—
A hop, a skip, a Prometheus-tracked debut!

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The PR title "fix: add prefill metrics support for SGLang disaggregated mode" is directly related to the main changes in the pull request. The changeset introduces metrics support for SGLang disaggregated mode, with a new launch script that exposes metrics on ports 8081 (prefill) and 8082 (decode), Prometheus registry refactoring, and test enhancements to validate this functionality. While the title emphasizes "prefill metrics support" specifically, this accurately reflects the primary enhancement being added, and the decode metrics follow naturally as part of the same refactoring effort. The title is clear, concise, and specific enough that a reviewer scanning the history would understand the key change.
Description Check ✅ Passed The PR description is well-structured and covers all major template sections. It includes a clear overview explaining what prefill metrics support means in this context, detailed bullet points describing the specific changes (launch script, Prometheus refactoring, test enhancements, and test infrastructure), and specific file paths for reviewer focus. The "Related Issues" section references "DIS-911," though it deviates from the template's expected format of using action keywords (Closes/Fixes/Resolves) with a GitHub issue number. This is a minor formatting variation rather than a missing critical section, and the issue reference is clearly provided.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
components/src/dynamo/sglang/publisher.py (1)

199-223: Consider removing unused engine parameter.

The engine parameter is not used in the function body. Unless you're keeping it for future extensibility or API consistency, consider removing it to clean up the signature.

Apply this diff if the parameter is not needed:

-def setup_prometheus_registry(
-    engine: sgl.Engine, generate_endpoint: Endpoint
-) -> CollectorRegistry:
+def setup_prometheus_registry(generate_endpoint: Endpoint) -> CollectorRegistry:
     """Set up Prometheus registry for SGLang metrics collection.
 
     SGLang uses multiprocess architecture where metrics are stored in shared memory.
@@ -207,7 +206,6 @@
     (typically port 8081) when DYN_SYSTEM_ENABLED=true.
 
     Args:
-        engine: The SGLang engine instance.
         generate_endpoint: The Dynamo endpoint for generation requests.
 
     Returns:

And update the call sites in main.py:

-        setup_prometheus_registry(engine, generate_endpoint)
+        setup_prometheus_registry(generate_endpoint)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1da9d70 and 300fff3.

📒 Files selected for processing (5)
  • components/backends/sglang/launch/disagg_same_gpu.sh (1 hunks)
  • components/src/dynamo/sglang/main.py (4 hunks)
  • components/src/dynamo/sglang/publisher.py (1 hunks)
  • tests/serve/test_sglang.py (1 hunks)
  • tests/utils/payload_builder.py (2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-24T17:26:17.225Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.

Applied to files:

  • tests/utils/payload_builder.py
📚 Learning: 2025-08-25T23:24:42.076Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#2666
File: components/backends/trtllm/src/dynamo/trtllm/publisher.py:0-0
Timestamp: 2025-08-25T23:24:42.076Z
Learning: WorkerMetricsPublisher.create_endpoint method signature has been updated in _core.pyi to include metrics_labels parameter: `def create_endpoint(self, component: str, metrics_labels: Optional[List[Tuple[str, str]]] = None)`, making the metrics_labels parameter optional with default value of None.

Applied to files:

  • components/src/dynamo/sglang/publisher.py
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

  • components/src/dynamo/sglang/main.py
📚 Learning: 2025-06-05T01:04:24.775Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.

Applied to files:

  • components/src/dynamo/sglang/main.py
🧬 Code graph analysis (3)
tests/serve/test_sglang.py (1)
tests/utils/payload_builder.py (2)
  • chat_payload_default (18-41)
  • metric_payload_default (65-80)
components/src/dynamo/sglang/publisher.py (2)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • Endpoint (133-174)
  • endpoint (117-121)
components/src/dynamo/common/utils/prometheus.py (1)
  • register_engine_metrics_callback (30-73)
components/src/dynamo/sglang/main.py (1)
components/src/dynamo/sglang/publisher.py (2)
  • setup_prometheus_registry (199-223)
  • setup_sgl_metrics (226-252)
🪛 Ruff (0.14.2)
components/src/dynamo/sglang/publisher.py

200-200: Unused function argument: engine

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (9)
components/src/dynamo/sglang/main.py (2)

21-21: LGTM: Clean import of centralized registry setup.

The import of setup_prometheus_registry aligns with the refactoring that extracts Prometheus registry initialization into a reusable function.


101-103: LGTM: Consistent Prometheus registry initialization across worker types.

The conditional registry setup follows a consistent pattern across init(), init_prefill(), and init_embedding() functions. The separation of concerns—metrics publishing via setup_sgl_metrics() and registry wiring via setup_prometheus_registry()—improves modularity.

Also applies to: 162-164, 212-214

components/backends/sglang/launch/disagg_same_gpu.sh (4)

12-27: LGTM: Robust GPU memory validation.

The memory check using PyTorch is defensive and provides clear error messages. The 24GB requirement is appropriate for disaggregated mode, and the Python-based floating-point comparison avoids external dependencies like bc.


29-36: LGTM: Proper cleanup handling for background processes.

The trap correctly kills the background processes (ingress and prefill). The decode worker runs in foreground, so it's the main process that the trap handles when it exits.


64-71: LGTM: Well-documented initialization delay.

The 5-second sleep is clearly explained with the three initialization steps (model loading, checkpoint cleanup, service discovery). While this is a fixed delay that might not suit all environments, it's a reasonable starting point for single-GPU testing.


43-91: LGTM: Clean separation of prefill and decode workers.

The worker configurations are symmetric and well-structured:

  • Prefill runs on port 8081 (background)
  • Decode runs on port 8082 (foreground)
  • Both use the same memory fraction and model settings
  • Environment variables (DYN_SYSTEM_ENABLED, DYN_SYSTEM_PORT) correctly enable metrics on separate ports
components/src/dynamo/sglang/publisher.py (1)

226-252: LGTM: Clean refactoring with component parameter.

The addition of the component parameter and the updated return type improve the function's reusability. The publisher is now correctly initialized with the component, aligning with the updated DynamoSglangPublisher constructor.

tests/utils/payload_builder.py (1)

65-80: LGTM: Clean extension for multi-port metrics testing.

The addition of the port parameter with a sensible default (8081) enables the test infrastructure to validate metrics from multiple workers on different ports. The change is backwards compatible and aligns with the disaggregated testing requirements.

tests/serve/test_sglang.py (1)

67-82: LGTM: Well-structured test for disaggregated single-GPU mode.

The new test configuration properly validates the disaggregated setup:

  • Uses the new disagg_same_gpu.sh script
  • Includes metric payloads for both prefill (8081) and decode (8082) workers
  • Clear inline documentation of the test's purpose
  • Consistent with existing SGLang test patterns

models_port=8000,
request_payloads=[chat_payload_default(), completion_payload_default()],
),
"disaggregated_same_gpu": SGLangConfig(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider reusing 2gpu disagg.sh it it helps stabilize OOM error you're seeing on CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants