- 
                Notifications
    You must be signed in to change notification settings 
- Fork 663
fix: add prefill metrics support for SGLang disaggregated mode #3992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: add prefill metrics support for SGLang disaggregated mode #3992
Conversation
- Add disagg_same_gpu.sh for single-GPU disaggregated testing with metrics - Both prefill (port 8081) and decode (port 8082) workers expose metrics - Refactor setup_prometheus_registry() for reusability across init functions - Add port parameter to metric_payload_default() for multi-port testing - Add test validation for dual metrics endpoints - Include 10-second wait between workers to prevent OOM race conditions - Add aggressive memory optimizations and KV router mode for frontend Signed-off-by: Keiven Chang <[email protected]>
| WalkthroughThis pull request introduces disaggregated GPU inference on a single GPU with coordinated worker orchestration. Changes include a new shell script that manages prefill and decode worker processes, Prometheus metrics registry setup functions, refactored metrics initialization across SGLang components, a new test configuration, and an enhanced metrics payload utility. Changes
 Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 
 Poem
 Pre-merge checks❌ Failed checks (1 warning)
 ✅ Passed checks (2 passed)
 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/src/dynamo/sglang/publisher.py (1)
199-223: Consider removing unusedengineparameter.The
engineparameter is not used in the function body. Unless you're keeping it for future extensibility or API consistency, consider removing it to clean up the signature.Apply this diff if the parameter is not needed:
-def setup_prometheus_registry( - engine: sgl.Engine, generate_endpoint: Endpoint -) -> CollectorRegistry: +def setup_prometheus_registry(generate_endpoint: Endpoint) -> CollectorRegistry: """Set up Prometheus registry for SGLang metrics collection. SGLang uses multiprocess architecture where metrics are stored in shared memory. @@ -207,7 +206,6 @@ (typically port 8081) when DYN_SYSTEM_ENABLED=true. Args: - engine: The SGLang engine instance. generate_endpoint: The Dynamo endpoint for generation requests. Returns:And update the call sites in
main.py:- setup_prometheus_registry(engine, generate_endpoint) + setup_prometheus_registry(generate_endpoint)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- components/backends/sglang/launch/disagg_same_gpu.sh(1 hunks)
- components/src/dynamo/sglang/main.py(4 hunks)
- components/src/dynamo/sglang/publisher.py(1 hunks)
- tests/serve/test_sglang.py(1 hunks)
- tests/utils/payload_builder.py(2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-24T17:26:17.225Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.
Applied to files:
- tests/utils/payload_builder.py
📚 Learning: 2025-08-25T23:24:42.076Z
Learnt from: tzulingk
PR: ai-dynamo/dynamo#2666
File: components/backends/trtllm/src/dynamo/trtllm/publisher.py:0-0
Timestamp: 2025-08-25T23:24:42.076Z
Learning: WorkerMetricsPublisher.create_endpoint method signature has been updated in _core.pyi to include metrics_labels parameter: `def create_endpoint(self, component: str, metrics_labels: Optional[List[Tuple[str, str]]] = None)`, making the metrics_labels parameter optional with default value of None.
Applied to files:
- components/src/dynamo/sglang/publisher.py
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
Applied to files:
- components/src/dynamo/sglang/main.py
📚 Learning: 2025-06-05T01:04:24.775Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
Applied to files:
- components/src/dynamo/sglang/main.py
🧬 Code graph analysis (3)
tests/serve/test_sglang.py (1)
tests/utils/payload_builder.py (2)
chat_payload_default(18-41)
metric_payload_default(65-80)
components/src/dynamo/sglang/publisher.py (2)
lib/bindings/python/src/dynamo/_core.pyi (2)
Endpoint(133-174)
endpoint(117-121)components/src/dynamo/common/utils/prometheus.py (1)
register_engine_metrics_callback(30-73)
components/src/dynamo/sglang/main.py (1)
components/src/dynamo/sglang/publisher.py (2)
setup_prometheus_registry(199-223)
setup_sgl_metrics(226-252)
🪛 Ruff (0.14.2)
components/src/dynamo/sglang/publisher.py
200-200: Unused function argument: engine
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: operator (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (9)
components/src/dynamo/sglang/main.py (2)
21-21: LGTM: Clean import of centralized registry setup.The import of
setup_prometheus_registryaligns with the refactoring that extracts Prometheus registry initialization into a reusable function.
101-103: LGTM: Consistent Prometheus registry initialization across worker types.The conditional registry setup follows a consistent pattern across
init(),init_prefill(), andinit_embedding()functions. The separation of concerns—metrics publishing viasetup_sgl_metrics()and registry wiring viasetup_prometheus_registry()—improves modularity.Also applies to: 162-164, 212-214
components/backends/sglang/launch/disagg_same_gpu.sh (4)
12-27: LGTM: Robust GPU memory validation.The memory check using PyTorch is defensive and provides clear error messages. The 24GB requirement is appropriate for disaggregated mode, and the Python-based floating-point comparison avoids external dependencies like
bc.
29-36: LGTM: Proper cleanup handling for background processes.The trap correctly kills the background processes (ingress and prefill). The decode worker runs in foreground, so it's the main process that the trap handles when it exits.
64-71: LGTM: Well-documented initialization delay.The 5-second sleep is clearly explained with the three initialization steps (model loading, checkpoint cleanup, service discovery). While this is a fixed delay that might not suit all environments, it's a reasonable starting point for single-GPU testing.
43-91: LGTM: Clean separation of prefill and decode workers.The worker configurations are symmetric and well-structured:
- Prefill runs on port 8081 (background)
- Decode runs on port 8082 (foreground)
- Both use the same memory fraction and model settings
- Environment variables (
DYN_SYSTEM_ENABLED,DYN_SYSTEM_PORT) correctly enable metrics on separate portscomponents/src/dynamo/sglang/publisher.py (1)
226-252: LGTM: Clean refactoring with component parameter.The addition of the
componentparameter and the updated return type improve the function's reusability. The publisher is now correctly initialized with the component, aligning with the updatedDynamoSglangPublisherconstructor.tests/utils/payload_builder.py (1)
65-80: LGTM: Clean extension for multi-port metrics testing.The addition of the
portparameter with a sensible default (8081) enables the test infrastructure to validate metrics from multiple workers on different ports. The change is backwards compatible and aligns with the disaggregated testing requirements.tests/serve/test_sglang.py (1)
67-82: LGTM: Well-structured test for disaggregated single-GPU mode.The new test configuration properly validates the disaggregated setup:
- Uses the new
disagg_same_gpu.shscript- Includes metric payloads for both prefill (8081) and decode (8082) workers
- Clear inline documentation of the test's purpose
- Consistent with existing SGLang test patterns
| models_port=8000, | ||
| request_payloads=[chat_payload_default(), completion_payload_default()], | ||
| ), | ||
| "disaggregated_same_gpu": SGLangConfig( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider reusing 2gpu disagg.sh it it helps stabilize OOM error you're seeing on CI
Overview:
Adds prefill metrics support for SGLang disaggregated mode. Both prefill and decode workers now expose metrics on separate ports (8081 and 8082) for single-GPU testing.
Details:
disagg_same_gpu.shfor single-GPU disaggregated testing with dual metricssetup_prometheus_registry()for reusability across init functionsdisaggregated_same_gputest validating metrics from both workersmetric_payload_default()to accept port parameter for multi-port testingWhere should the reviewer start?
components/backends/sglang/launch/disagg_same_gpu.sh- new launch script with metrics on ports 8081/8082components/src/dynamo/sglang/publisher.py- Prometheus registry refactoringtests/serve/test_sglang.py- dual metrics validation testRelated Issues:
DIS-911
/coderabbit profile chill
Summary by CodeRabbit
Release Notes
New Features
Tests