Commit 32ac359

keivenchang authored and ziqifan617 committed
feat: add SGLang and vLLM passthrough metrics on Dynamo backend worker (#3539)
Signed-off-by: Keiven Chang <[email protected]>
Co-authored-by: Keiven Chang <[email protected]>
1 parent b04b0a9 commit 32ac359

File tree

17 files changed
+593 −69 lines changed
Lines changed: 99 additions & 0 deletions

# SGLang Prometheus Metrics

**📚 Official Documentation**: [SGLang Production Metrics](https://docs.sglang.ai/references/production_metrics.html)

This document describes how SGLang Prometheus metrics are exposed in Dynamo.

## Overview

When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This lets you access both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.

For the complete and authoritative list of SGLang metrics, always refer to the official documentation linked above.

Dynamo runtime metrics are documented in [docs/guides/metrics.md](../../../docs/guides/metrics.md).
## Metric Reference

The official documentation includes:

- Complete metric definitions with HELP and TYPE descriptions
- Example metric output in Prometheus exposition format
- Counter, Gauge, and Histogram metrics
- Metric labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`)
- A setup guide for Prometheus + Grafana monitoring
- Troubleshooting tips and configuration examples

## Metric Categories

SGLang provides metrics in the following categories (all prefixed with `sglang:`):

- Throughput metrics
- Resource usage
- Latency metrics
- Disaggregation metrics (when enabled)

**Note:** Specific metrics are subject to change between SGLang versions. Always refer to the [official documentation](https://docs.sglang.ai/references/production_metrics.html) or inspect the `/metrics` endpoint for your SGLang version.
## Enabling Metrics in Dynamo

SGLang metrics are automatically exposed when running SGLang through Dynamo with metrics enabled.

## Inspecting Metrics

To see the actual metrics available in your SGLang version:

### 1. Launch SGLang with Metrics Enabled

```bash
# Set environment variables
export DYN_SYSTEM_ENABLED=true
export DYN_SYSTEM_PORT=8081

# Start SGLang worker with metrics enabled
python -m dynamo.sglang --model <model_name> --enable-metrics

# Wait for the engine to initialize
```

Metrics will be available at `http://localhost:8081/metrics`.
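Engine metrics only show up once SGLang finishes initializing, so a scrape made immediately after launch may return no `sglang:` series yet. A small stdlib-only readiness poll can bridge that gap — this is a sketch, not part of the integration; the URL and `sglang:` prefix follow the example above, so adjust them for your deployment:

```python
import time
import urllib.request


def has_engine_metrics(exposition_text: str, prefix: str = "sglang:") -> bool:
    """Return True if any non-comment metric line starts with the prefix."""
    return any(
        line.startswith(prefix)
        for line in exposition_text.splitlines()
        if line and not line.startswith("#")
    )


def wait_for_metrics(url: str, prefix: str = "sglang:", timeout_s: float = 300.0) -> bool:
    """Poll the /metrics endpoint until engine metrics appear or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if has_engine_metrics(resp.read().decode("utf-8"), prefix):
                    return True
        except OSError:
            pass  # endpoint not up yet; keep polling
        time.sleep(2)
    return False
```

Call `wait_for_metrics("http://localhost:8081/metrics")` after launching the worker; it returns `True` once `sglang:`-prefixed series are being served.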
### 2. Fetch Metrics via curl

```bash
curl -s http://localhost:8081/metrics | grep "^sglang:"
```
### 3. Example Output

**Note:** The metrics shown below are examples and may vary with your SGLang version. Always inspect your actual `/metrics` endpoint for the current list.

```
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
```
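Scraped exposition text like the sample above can be post-processed with plain string handling. A minimal stdlib sketch (the values are the illustrative numbers from the example output, not live data; it assumes label values contain no spaces, which holds for these metrics):

```python
def parse_simple_metrics(text: str) -> dict[str, float]:
    """Map 'name{labels} value' lines to {metric_name: value}, skipping # comments."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, _, value_part = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # drop the {label="..."} block
        metrics[name] = float(value_part)
    return metrics


sample = """\
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
"""

parsed = parse_simple_metrics(sample)
print(parsed["sglang:cache_hit_rate"])  # 0.0075
```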
## Implementation Details

- SGLang uses multiprocess metrics collection via `prometheus_client.multiprocess.MultiProcessCollector`
- Metrics are filtered by the `sglang:` prefix before being exposed
- The integration uses Dynamo's `register_engine_metrics_callback()` function
- Metrics appear after SGLang engine initialization completes
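The prefix filter keeps metric lines that start with `sglang:` plus their `# HELP`/`# TYPE` comment lines. A stdlib sketch of that behavior (the regex mirrors the pattern used in the integration code; the sample lines are illustrative):

```python
import re


def filter_by_prefix(exposition_text: str, prefix: str) -> str:
    """Keep metric lines starting with prefix and their HELP/TYPE comment lines."""
    escaped = re.escape(prefix)
    pattern = rf"^(?:{escaped}|# (?:HELP|TYPE) {escaped})"
    kept = [ln for ln in exposition_text.split("\n") if re.match(pattern, ln)]
    out = "\n".join(kept)
    # Exposition format expects a trailing newline on non-empty output
    return out + "\n" if out and not out.endswith("\n") else out


mixed = (
    "# HELP sglang:cache_hit_rate The cache hit rate\n"
    "# TYPE sglang:cache_hit_rate gauge\n"
    'sglang:cache_hit_rate{model_name="m"} 0.0075\n'
    "# HELP dynamo_component_uptime_seconds Uptime\n"
    "dynamo_component_uptime_seconds 42.0\n"
)

print(filter_by_prefix(mixed, "sglang:"))
```

Only the three `sglang:` lines survive; the `dynamo_*` series stay on the endpoint via Dynamo's own runtime metrics path rather than this callback.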
## See Also

### SGLang Metrics

- [Official SGLang Production Metrics](https://docs.sglang.ai/references/production_metrics.html)
- [SGLang GitHub - Metrics Collector](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/metrics/collector.py)

### Dynamo Metrics

- **Dynamo Metrics Guide**: See `docs/guides/metrics.md` for complete documentation on Dynamo runtime metrics
- **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
  - Available at the same `/metrics` endpoint alongside SGLang metrics
- **Integration Code**: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
Lines changed: 104 additions & 0 deletions

# vLLM Prometheus Metrics

**📚 Official Documentation**: [vLLM Metrics Design](https://docs.vllm.ai/en/latest/design/metrics.html)

This document describes how vLLM Prometheus metrics are exposed in Dynamo.

## Overview

When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This lets you access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.

For the complete and authoritative list of vLLM metrics, always refer to the official documentation linked above.

Dynamo runtime metrics are documented in [docs/guides/metrics.md](../../../docs/guides/metrics.md).
## Metric Reference

The official documentation includes:

- Complete metric definitions with detailed explanations
- Counter, Gauge, and Histogram metrics
- Metric labels (e.g., `model_name`, `finished_reason`, `scheduling_event`)
- Design rationale and implementation details
- Information about the v1 metrics migration
- Future work and deprecated metrics

## Metric Categories

vLLM provides metrics in the following categories (all prefixed with `vllm:`):

- Request metrics
- Performance metrics
- Resource usage
- Scheduler metrics
- Disaggregation metrics (when enabled)

**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.
## Enabling Metrics in Dynamo

vLLM metrics are automatically exposed when running vLLM through Dynamo with metrics enabled.

## Inspecting Metrics

To see the actual metrics available in your vLLM version:

### 1. Launch vLLM with Metrics Enabled

```bash
# Set environment variables
export DYN_SYSTEM_ENABLED=true
export DYN_SYSTEM_PORT=8081

# Start vLLM worker (metrics are enabled by default via --disable-log-stats=false)
python -m dynamo.vllm --model <model_name>

# Wait for the engine to initialize
```

Metrics will be available at `http://localhost:8081/metrics`.
### 2. Fetch Metrics via curl

```bash
curl -s http://localhost:8081/metrics | grep "^vllm:"
```
### 3. Example Output

**Note:** The metrics shown below are examples and may vary with your vLLM version. Always inspect your actual `/metrics` endpoint for the current list.

```
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
```
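Histogram `_sum` and `_count` series like those above give a running mean directly (mean = sum / count). A stdlib sketch using the sample numbers shown — illustrative values, not a benchmark:

```python
def mean_from_histogram(text: str, base_name: str) -> float:
    """Compute mean = _sum / _count for a Prometheus histogram in exposition text."""
    total = count = None
    for line in text.splitlines():
        if line.startswith(f"{base_name}_sum"):
            total = float(line.rsplit(" ", 1)[1])
        elif line.startswith(f"{base_name}_count"):
            count = float(line.rsplit(" ", 1)[1])
    if total is None or count is None or count == 0:
        raise ValueError(f"histogram {base_name} not found or empty")
    return total / count


sample = (
    'vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0\n'
    'vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38\n'
)

mean_ttft = mean_from_histogram(sample, "vllm:time_to_first_token_seconds")
print(f"{mean_ttft:.3f}s")  # 0.542s
```

Since these are cumulative counters, the mean over a window comes from the deltas of `_sum` and `_count` between two scrapes rather than their absolute values.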
## Implementation Details

- vLLM v1 uses multiprocess metrics collection via `prometheus_client.multiprocess`
- `PROMETHEUS_MULTIPROC_DIR`: vLLM sets this environment variable to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, which are aggregated when `/metrics` is scraped.
- Metrics are filtered by the `vllm:` prefix before being exposed
- The integration uses Dynamo's `register_engine_metrics_callback()` function
- Metrics appear after vLLM engine initialization completes
- vLLM v1 metrics differ from v0 — see the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for migration details
## See Also

### vLLM Metrics

- [Official vLLM Metrics Design Documentation](https://docs.vllm.ai/en/latest/design/metrics.html)
- [vLLM Production Metrics User Guide](https://docs.vllm.ai/en/latest/user/production_metrics.html)
- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/engine/metrics)

### Dynamo Metrics

- **Dynamo Metrics Guide**: See `docs/guides/metrics.md` for complete documentation on Dynamo runtime metrics
- **Dynamo Runtime Metrics**: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
  - Available at the same `/metrics` endpoint alongside vLLM metrics
- **Integration Code**: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration

components/src/dynamo/common/__init__.py

Lines changed: 3 additions & 2 deletions

```diff
@@ -9,9 +9,10 @@
 Main submodules:
 - config_dump: Configuration dumping and system diagnostics utilities
+- utils: Common utilities including environment and prometheus helpers
 """

-from dynamo.common import config_dump
+from dynamo.common import config_dump, utils

 try:
     from ._version import __version__
@@ -23,4 +24,4 @@
 except Exception:
     __version__ = "0.0.0+unknown"

-__all__ = ["__version__", "config_dump"]
+__all__ = ["__version__", "config_dump", "utils"]
```
Lines changed: 16 additions & 0 deletions

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""
Dynamo Common Utils Module

This module contains shared utility functions used across multiple
Dynamo backends and components.

Submodules:
- prometheus: Prometheus metrics collection and logging utilities
"""

from dynamo.common.utils import prometheus

__all__ = ["prometheus"]
```
Lines changed: 129 additions & 0 deletions

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""
Prometheus metrics utilities for Dynamo components.

This module provides shared functionality for collecting and exposing Prometheus metrics
from backend engines (SGLang, vLLM, etc.) via Dynamo's metrics endpoint.

Note: Engine metrics take time to appear after engine initialization,
while Dynamo runtime metrics are available immediately after component creation.
"""

import logging
import re
from typing import TYPE_CHECKING, Optional

from prometheus_client import generate_latest

from dynamo._core import Endpoint

# Import CollectorRegistry only for type hints to avoid importing prometheus_client at module load time.
# prometheus_client must be imported AFTER set_prometheus_multiproc_dir() is called.
# See main.py worker() function for detailed explanation.
if TYPE_CHECKING:
    from prometheus_client import CollectorRegistry


def register_engine_metrics_callback(
    endpoint: Endpoint,
    registry: "CollectorRegistry",
    metric_prefix: str,
    engine_name: str,
) -> None:
    """
    Register a callback to expose engine Prometheus metrics via Dynamo's metrics endpoint.

    This registers a callback that is invoked when /metrics is scraped, passing through
    engine-specific metrics alongside Dynamo runtime metrics.

    Args:
        endpoint: Dynamo endpoint object with metrics.register_prometheus_expfmt_callback()
        registry: Prometheus registry to collect from (e.g., REGISTRY or CollectorRegistry)
        metric_prefix: Prefix to filter metrics (e.g., "vllm:" or "sglang:")
        engine_name: Name of the engine for logging (e.g., "vLLM" or "SGLang")

    Example:
        from prometheus_client import REGISTRY
        register_engine_metrics_callback(
            generate_endpoint, REGISTRY, "vllm:", "vLLM"
        )
    """

    def get_expfmt() -> str:
        """Callback to return engine Prometheus metrics in exposition format"""
        return get_prometheus_expfmt(registry, metric_prefix_filter=metric_prefix)

    endpoint.metrics.register_prometheus_expfmt_callback(get_expfmt)


def get_prometheus_expfmt(
    registry,
    metric_prefix_filter: Optional[str] = None,
) -> str:
    """
    Get Prometheus metrics from a registry formatted as text using the standard text encoder.

    Collects all metrics from the registry and returns them in Prometheus text exposition format.
    Optionally filters metrics by prefix.

    Prometheus exposition format consists of:
    - Comment lines starting with # (HELP and TYPE declarations)
    - Metric lines with format: metric_name{label="value"} metric_value timestamp

    Example output format:
        # HELP vllm:request_success_total Number of successful requests
        # TYPE vllm:request_success_total counter
        vllm:request_success_total{model="llama2",endpoint="generate"} 150.0
        # HELP vllm:time_to_first_token_seconds Time to first token
        # TYPE vllm:time_to_first_token_seconds histogram
        vllm:time_to_first_token_seconds_bucket{model="llama2",le="0.01"} 10.0
        vllm:time_to_first_token_seconds_bucket{model="llama2",le="0.1"} 45.0
        vllm:time_to_first_token_seconds_count{model="llama2"} 50.0
        vllm:time_to_first_token_seconds_sum{model="llama2"} 2.5

    Args:
        registry: Prometheus registry to collect from.
            Pass CollectorRegistry with MultiProcessCollector for SGLang.
            Pass REGISTRY for vLLM single-process mode.
        metric_prefix_filter: Optional prefix to filter displayed metrics (e.g., "vllm:").
            If None, returns all metrics. (default: None)

    Returns:
        Formatted metrics text in Prometheus exposition format. Returns empty string on error.

    Example:
        from prometheus_client import REGISTRY
        metrics_text = get_prometheus_expfmt(REGISTRY)
        print(metrics_text)

        # With filter
        vllm_metrics = get_prometheus_expfmt(REGISTRY, metric_prefix_filter="vllm:")
    """
    try:
        # Generate metrics in Prometheus text format
        metrics_text = generate_latest(registry).decode("utf-8")

        if metric_prefix_filter:
            # Filter lines: keep metric lines starting with prefix and their HELP/TYPE comments
            escaped_prefix = re.escape(metric_prefix_filter)
            pattern = rf"^(?:{escaped_prefix}|# (?:HELP|TYPE) {escaped_prefix})"
            filtered_lines = [
                line for line in metrics_text.split("\n") if re.match(pattern, line)
            ]
            result = "\n".join(filtered_lines)
            # Ensure result ends with a newline
            if result and not result.endswith("\n"):
                result += "\n"
            return result
        else:
            # Ensure metrics_text ends with a newline
            if metrics_text and not metrics_text.endswith("\n"):
                metrics_text += "\n"
            return metrics_text

    except Exception as e:
        logging.error(f"Error getting metrics: {e}")
        return ""
```

components/src/dynamo/sglang/publisher.py

Lines changed: 12 additions & 0 deletions

```diff
@@ -9,8 +9,10 @@
 import sglang as sgl
 import zmq
 import zmq.asyncio
+from prometheus_client import CollectorRegistry, multiprocess
 from sglang.srt.utils import get_local_ip_auto, get_zmq_socket

+from dynamo.common.utils.prometheus import register_engine_metrics_callback
 from dynamo.llm import (
     ForwardPassMetrics,
     KvStats,
@@ -217,6 +219,16 @@ async def setup_sgl_metrics(
     publisher.init_engine_metrics_publish()
     publisher.init_kv_event_publish()

+    # Register Prometheus metrics callback if enabled
+    if engine.server_args.enable_metrics:
+        # SGLang uses multiprocess architecture where metrics are stored in shared memory.
+        # MultiProcessCollector aggregates metrics from all worker processes.
+        registry = CollectorRegistry()
+        multiprocess.MultiProcessCollector(registry)
+        register_engine_metrics_callback(
+            generate_endpoint, registry, "sglang:", "SGLang"
+        )
+
     task = asyncio.create_task(publisher.run())
     logging.info("SGLang metrics loop started")
     return publisher, task, metrics_labels
```
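The registration flow in this diff can be exercised without a running engine by stubbing the endpoint object. Everything below is a hypothetical stand-in for illustration only — the real `Endpoint` comes from `dynamo._core`, a real registry would be a `CollectorRegistry` fed by `MultiProcessCollector`, and the metric line is a made-up sample:

```python
class _StubMetrics:
    """Hypothetical stand-in for endpoint.metrics; stores the registered callback."""

    def __init__(self):
        self.callback = None

    def register_prometheus_expfmt_callback(self, cb):
        self.callback = cb


class _StubEndpoint:
    """Hypothetical stand-in for dynamo._core.Endpoint."""

    def __init__(self):
        self.metrics = _StubMetrics()


def register_stub_callback(endpoint, expfmt_source, metric_prefix):
    """Same shape as register_engine_metrics_callback: register a filtered-expfmt callback."""

    def get_expfmt() -> str:
        keep = (metric_prefix, f"# HELP {metric_prefix}", f"# TYPE {metric_prefix}")
        return "".join(
            line
            for line in expfmt_source().splitlines(keepends=True)
            if line.startswith(keep)
        )

    endpoint.metrics.register_prometheus_expfmt_callback(get_expfmt)


endpoint = _StubEndpoint()
register_stub_callback(
    endpoint,
    lambda: "sglang:num_running_reqs 3.0\ndynamo_component_uptime_seconds 42.0\n",
    "sglang:",
)
print(endpoint.metrics.callback(), end="")  # sglang:num_running_reqs 3.0
```

The callback is invoked lazily at scrape time, which is why registration can happen before the engine has produced any metrics.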

0 commit comments