feat(observability): Add HTTP sidecar for ingesting engine metrics from SGLang gRPC workers#813
Conversation
Signed-off-by: Scott Lee <scott@together.ai>
📝 Walkthrough
The gRPC server now runs a background Prometheus HTTP metrics sidecar (with multiprocess registry setup and graceful shutdown). Host:port formatting and metrics-port derivation helpers were added. The worker manager's fan-out was generalized and now derives protocol-aware metrics URLs for both HTTP and gRPC workers.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    rect rgba(240,248,255,0.5)
    participant Client
    participant WarmupThread as Warmup Thread
    participant GRPC as gRPC Server
    participant MetricsHTTP as Metrics Sidecar
    participant Scheduler as Scheduler Processes
    participant Prometheus
    end
    Client->>GRPC: connect to gRPC address (host:port)
    GRPC->>WarmupThread: spawn warmup connection to formatted host:port
    GRPC->>Scheduler: launch scheduler processes (inherits PROMETHEUS multiproc dir)
    GRPC->>MetricsHTTP: start metrics HTTP server on derived metrics port
    Prometheus->>MetricsHTTP: scrape /metrics
    Note right of MetricsHTTP: serves multiprocess CollectorRegistry /metrics
    Client->>GRPC: shutdown request
    GRPC->>MetricsHTTP: shutdown, close, join thread (timeout)
    GRPC->>WarmupThread: join warmup thread
    GRPC->>Scheduler: stop schedulers
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses the current limitation where SGLang gRPC workers do not expose engine metrics directly. It introduces a new architecture that enables observability for these workers by deploying a dedicated HTTP sidecar alongside each gRPC servicer. This sidecar collects metrics from a shared Prometheus multiprocess directory and exposes them via an HTTP endpoint, allowing SMG to retrieve them.
Code Review
The pull request introduces Prometheus metrics collection for the gRPC servicer and model gateway. In sglang/server.py, it adds functionality to set up Prometheus multiprocess directories and starts a dedicated HTTP server to expose /metrics endpoints, running on grpc_port + 1. In worker_manager.rs, the fan_out mechanism is refactored to support custom URL building, and a new metrics_url function is added to correctly determine the metrics endpoint for both HTTP and gRPC workers (by inferring the sidecar port for gRPC). Review comments suggest moving imports related to Prometheus and the HTTP server to the top of sglang/server.py for better Python best practices and dependency management. Additionally, the reviewer noted that the new metrics HTTP server in sglang/server.py lacks graceful shutdown, potentially leading to resource leaks or un-flushed metrics, and recommended explicit lifecycle management. Finally, the reviewer pointed out that the URL parsing logic in metrics_url in worker_manager.rs could be more robust, as it might silently generate incorrect URLs for malformed gRPC worker addresses, suggesting improved error handling and logging.
```python
from sglang.srt.utils import set_prometheus_multiproc_dir

set_prometheus_multiproc_dir()
```
It's generally considered best practice in Python to place all import statements at the top of the file, outside of conditional blocks. Importing inside an if statement can lead to unexpected behavior if the condition is not met, or if the module is later needed elsewhere in the file where the import might not have executed. Moving this import to the top of the file improves readability and ensures the dependency is always available.
```python
from sglang.srt.utils import set_prometheus_multiproc_dir
```

```python
metrics_port = server_args.port + 1
_start_metrics_http_server(server_args.host, metrics_port)
```
The _start_metrics_http_server function starts a daemon thread that runs httpd.serve_forever(). While daemon=True ensures the program can exit, it doesn't allow for a graceful shutdown of the HTTP server itself. For better resource management and to ensure metrics are properly flushed before exit, the serve_grpc function should manage the lifecycle of the metrics HTTP server, including calling httpd.shutdown() during the overall server shutdown process. This would require _start_metrics_http_server to return the HTTPServer instance and the thread.
References
- When managing resources (e.g., workers) that are acquired and need to be released, ensure that all acquired resources are tracked immediately and consistently, especially when exceptions might occur during subsequent operations (e.g., launching new resources). This prevents resource leaks by ensuring that even partially successful acquisitions are accounted for cleanup.
```python
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import CollectorRegistry, generate_latest, multiprocess
```
Similar to the previous comment, it's best practice to place imports at the top of the file. Importing http.server and prometheus_client inside this function means they are imported every time the function is called (though in this case, it's only once). Moving these to the top of the file improves code clarity and adheres to standard Python style.
```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from prometheus_client import CollectorRegistry, generate_latest, multiprocess
```

```rust
let stripped = url
    .trim_start_matches("grpc://")
    .trim_start_matches("http://")
    .trim_start_matches("https://");
if let Some((host, port_str)) = stripped.rsplit_once(':') {
    if let Ok(port) = port_str.parse::<u16>() {
        return format!("http://{}:{}/{}", host, port + 1, endpoint);
    }
}
format!("http://{stripped}/{endpoint}")
```
The URL parsing logic in metrics_url for ConnectionMode::Grpc could be more robust. If stripped.rsplit_once(':') returns None (meaning no colon, so no port) or port_str.parse::<u16>() fails, the function falls back to format!("http://{stripped}/{endpoint}"). This might result in an invalid or unintended URL if the original grpc:// address was malformed (e.g., grpc://host without a port). It would be safer to log a warning or error and perhaps return an Option<String> or Result<String, Error> to indicate a failure in constructing a valid metrics URL, rather than silently creating a potentially incorrect one.
References
- Instead of silently ignoring potential failures (e.g., from serialization), log them as warnings to aid in debugging. In Rust, prefer using `unwrap_or_else` to log an error over `unwrap_or_default`, which would fail silently.
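The defensive parsing the reviewer asks for can be sketched in Python (the gateway code itself is Rust; `grpc_metrics_url` is a hypothetical name, and the `+ 1` port derivation mirrors the diff as reviewed): return `None` for a malformed address instead of silently building a wrong URL.

```python
from typing import Optional


def grpc_metrics_url(url: str, endpoint: str) -> Optional[str]:
    """Derive the sidecar URL, or None when the address cannot yield one."""
    for scheme in ("grpc://", "http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    # rpartition on the last colon also handles bracketed IPv6 hosts.
    host, sep, port_str = url.rpartition(":")
    if not sep or not port_str.isdigit():
        return None  # no port: caller should log and skip, not guess
    port = int(port_str)
    if port + 1 > 65535:
        return None  # the derived sidecar port would overflow the TCP range
    return f"http://{host}:{port + 1}/{endpoint}"
```

A `None` return gives the caller a place to log a warning, matching the review's suggestion of `Option<String>`/`Result<String, Error>` on the Rust side.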
Signed-off-by: Scott Lee <scott@together.ai>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 918c8c9c7b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
metrics_port = server_args.port + 1
metrics_httpd = _start_metrics_http_server(server_args.host, metrics_port)
```
Avoid deriving the metrics sidecar port from `grpc_port + 1`
Using server_args.port + 1 here breaks common multi-worker gRPC layouts: the repo’s own example in docs/getting-started/grpc-workers.md:103-105 uses grpc://worker1:50051 grpc://worker2:50052, so worker 1’s metrics sidecar will try to bind 50052, colliding with worker 2’s gRPC listener. With enable_metrics on, adjacent worker ports make startup fail nondeterministically or leave one worker without metrics, so the sidecar needs a collision-free port assignment instead of assuming +1 is free.
```python
def log_message(self, format, *args):
    pass

httpd = HTTPServer((host, port), MetricsHandler)
```
Use an IPv6-capable HTTP server for the metrics sidecar
server_args.host is documented to support IPv6 values such as :: and ::1 (docs/reference/configuration.md:34-37), but HTTPServer is AF_INET-only. In that configuration HTTPServer((host, port), ...) raises gaierror: Address family for hostname not supported, so enabling metrics turns an otherwise valid IPv6 gRPC worker into a startup failure. The sidecar needs to select an AF_INET6 server when the host is IPv6.
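One way to make the sidecar IPv6-capable is to choose the address family from `getaddrinfo` before binding. This is a hedged sketch, not the PR's code; `DualFamilyHTTPServer` and `Handler` are hypothetical names.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


class DualFamilyHTTPServer(HTTPServer):
    """HTTPServer that binds AF_INET6 when the host resolves to an IPv6 address."""

    def __init__(self, addr, handler):
        host, port = addr
        # getaddrinfo resolves "::"/"::1" to AF_INET6 and "0.0.0.0"/"127.0.0.1"
        # to AF_INET; take the family of the first result.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        self.address_family = infos[0][0]
        super().__init__(addr, handler)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, fmt, *args):
        pass
```

With this, `DualFamilyHTTPServer(("::", metrics_port), Handler)` would bind an IPv6 socket instead of raising `gaierror` as the stock `HTTPServer` does.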
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/core/worker_manager.rs (1)
56-79: ⚠️ Potential issue | 🟠 Major: Do not forward worker API keys to the unauthenticated metrics endpoint.
`fan_out_with()` unconditionally applies bearer auth to all requests, and `get_engine_metrics()` uses it to fetch metrics from `http://host:(port+1)/metrics` for gRPC workers. The metrics HTTP server in `sglang/server.py` has no authentication check, so this unnecessarily exposes bearer tokens to an endpoint that doesn't validate them. Make auth opt-in per call site or explicitly skip `bearer_auth()` when targeting the metrics endpoint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/sglang/server.py`:
- Around line 273-280: The metrics HTTP server is started before the main try
block so if _start_metrics_http_server(server_args.host, server_args.port + 1)
raises (e.g., port unavailable or overflow) cleanup (servicer.shutdown(),
server.stop(), and the scheduler termination loop) is skipped; move the metrics
server initialization (the metrics_port calculation and the call to
_start_metrics_http_server) into the same try block that starts the gRPC server
so failures trigger the existing shutdown/stop/termination logic, and keep the
metrics_httpd variable declared outside so it can still be referenced in the
finally/cleanup code.
- Around line 310-313: The shutdown logic currently only calls
metrics_httpd.shutdown(), which leaves the listening socket and thread alive;
update _start_metrics_http_server to return both the HTTPServer instance and its
background thread (e.g., return httpd, metrics_thread), then change the shutdown
sequence to call metrics_httpd.shutdown(), metrics_httpd.server_close(), and
metrics_thread.join() (similar to how warmup_thread is handled) so the socket is
closed and the thread is properly joined.
In `@model_gateway/src/core/worker_manager.rs`:
- Around line 101-107: The current port increment uses plain arithmetic (port +
1) which can overflow for u16; change the code that handles rsplit_once for
host/port_str to parse port into port: u16 and then call port.checked_add(1). If
checked_add returns Some(new_port) build and return the formatted URL using
new_port and endpoint (same format as before), otherwise fall back to the
existing graceful error path used when parsing fails (i.e., do not wrap/overflow
— return the original/unchanged URL or the existing fallback behavior). Update
references around the variables stripped, host, port_str, port and endpoint
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 111e50f6-dd9d-4b33-9c7a-81ea6a0e41e5
📒 Files selected for processing (2)
grpc_servicer/smg_grpc_servicer/sglang/server.py
model_gateway/src/core/worker_manager.rs
```rust
let stripped = url
    .trim_start_matches("grpc://")
    .trim_start_matches("http://")
    .trim_start_matches("https://");
if let Some((host, port_str)) = stripped.rsplit_once(':') {
    if let Ok(port) = port_str.parse::<u16>() {
        return format!("http://{}:{}/{}", host, port + 1, endpoint);
```
🧩 Analysis chain: verification scripts were run against repository lightseekorg/smg.
Guard the port + 1 arithmetic with checked_add(1).
For grpc://host:65535, port + 1 overflows u16—with overflow checks enabled it panics, and without them it wraps to 0, producing http://host:0/.... Use checked_add(1) and fall back to the existing graceful error path on line 110 instead of deriving a wrapped port.
Signed-off-by: Scott Lee <scott@together.ai>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7634160757
```rust
}

fn parse_worker_host_port(url: &str) -> Option<(String, u16)> {
    let parsed = url::Url::parse(strip_dp_rank_suffix(url)).ok()?;
```
Preserve DP ranks when scraping gRPC engine metrics
When --dp-aware is enabled, CreateWorkerStep registers one Worker per rank (model_gateway/src/core/steps/worker/local/create_worker.rs:126-157). Calling strip_dp_rank_suffix() here collapses all of those gRPC URLs onto the same sidecar endpoint, but get_engine_metrics() still labels each response with the original worker_addr and aggregate_metrics() keeps both series by adding that label (model_gateway/src/core/worker_manager.rs:356-363, model_gateway/src/core/metrics_aggregator.rs:42-48). In a DP-aware SGLang deployment, /engine_metrics will therefore emit the same exposition once per rank, multiplying every counter/gauge by dp_size instead of reporting the real values.
```python
if grpc_port + METRICS_PORT_OFFSET <= MAX_TCP_PORT:
    return grpc_port + METRICS_PORT_OFFSET

wrapped_port = grpc_port - METRICS_PORT_OFFSET
```
Use a collision-free mapping for the sidecar metrics port
This offset-and-wrap rule still collides for same-host workers whose gRPC ports differ by 10,000. For example, a worker on 50051 binds its sidecar on 60051, while a worker on 60051 wraps back to 50051, so each sidecar tries to take the other worker's gRPC listener. Because serve_grpc() binds the HTTP sidecar with this function and the gateway mirrors the same mapping in model_gateway/src/core/worker_manager.rs, enabling metrics still breaks valid multi-worker layouts even though the adjacent-port case was fixed.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/core/worker_manager.rs (1)
64-80: ⚠️ Potential issue | 🟠 Major: Do not forward worker API keys to the derived gRPC metrics URL.
`fan_out_with()` still unconditionally applies `worker.api_key()`, but `metrics_url()` rewrites gRPC workers to `http://host:.../metrics`. If a gRPC worker has an API key configured, this sends that secret over plain HTTP to an unauthenticated sidecar that does not need it. Plumb auth separately here and suppress it for gRPC metrics scrapes.
Also applies to: 125-143
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 363b0059-98f4-4f61-a936-730279a9e7da
📒 Files selected for processing (2)
grpc_servicer/smg_grpc_servicer/sglang/server.py
model_gateway/src/core/worker_manager.rs
```rust
fn strip_dp_rank_suffix(url: &str) -> &str {
    if let Some(at_pos) = url.rfind('@') {
        let suffix = &url[at_pos + 1..];
        if !suffix.is_empty() && suffix.chars().all(|c| c.is_ascii_digit()) {
            return &url[..at_pos];
        }
    }
    url
}
```
Deduplicate scrapes that collapse onto the same sidecar endpoint.
strip_dp_rank_suffix() intentionally makes URLs like grpc://host:port@0 and @1 resolve to the same metrics URL, but get_engine_metrics() still fans out once per logical worker. That will fetch the same multiprocess payload multiple times and emit duplicate worker_addr series, inflating any cross-worker sums. De-dup on the derived metrics URL before calling fan_out_with(), and use a canonical label for that endpoint.
Also applies to: 125-143, 347-354
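The dedup step this comment asks for can be sketched in Python (the real code is Rust; `strip_dp_rank` mirrors `strip_dp_rank_suffix`, and `unique_metrics_targets` is a hypothetical helper): canonicalize each worker address first, then scrape each unique endpoint once and label the series with the canonical URL.

```python
def strip_dp_rank(url: str) -> str:
    """Drop a trailing '@<digits>' DP-rank suffix, if present."""
    base, at, suffix = url.rpartition("@")
    if at and suffix.isdigit():
        return base
    return url


def unique_metrics_targets(worker_urls):
    """Map workers to canonical metrics endpoints, keeping each one once."""
    seen, targets = set(), []
    for u in worker_urls:
        canonical = strip_dp_rank(u)
        if canonical not in seen:
            seen.add(canonical)
            targets.append(canonical)  # label the series with the canonical URL
    return targets
```

With `dp_size` ranks on one host, all ranks collapse onto a single target, so the sidecar's multiprocess payload is fetched once instead of being multiplied per rank.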
This pull request has been automatically marked as stale because it has not had any activity within 14 days. It will be automatically closed if no further activity occurs within 16 days. Leave a comment if you feel this pull request should remain open. Thank you! |
Hi @scottjlee, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```shell
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
Description
Problem
Currently, the SMG `/engine_metrics` endpoint does not support emitting engine metrics when running in gRPC mode. Moreover, in gRPC mode, SGLang does not expose an HTTP endpoint which SMG can directly retrieve metrics from.

Solution
Add a lightweight metrics sidecar to fetch metrics emitted by the SGLang engine (written to `PROMETHEUS_MULTIPROC_DIR`) and expose them in SMG via an HTTP server.

Changes
- `server.py`: When `enable_metrics` is set, initialize the Prometheus multiprocess directory and start a lightweight HTTP server on `grpc_port + 10000` (if it overflows, fall back to `grpc_port - 10000`) that serves `/metrics` using the multiprocess collector.
- `worker_manager.rs`: Introduce utility functions for gRPC workers to generate sidecar URLs with the new port mapping. The engine metrics path now uses this so Prometheus scrapes reach gRPC workers correctly and expose them via the `/engine_metrics` endpoint.
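The port mapping described above can be sketched as follows. The constants `METRICS_PORT_OFFSET` and `MAX_TCP_PORT` appear in the diff; the function name is illustrative.

```python
METRICS_PORT_OFFSET = 10_000
MAX_TCP_PORT = 65_535


def derive_metrics_port(grpc_port: int) -> int:
    """Sidecar port: grpc_port + 10000, wrapping down when it would overflow."""
    if grpc_port + METRICS_PORT_OFFSET <= MAX_TCP_PORT:
        return grpc_port + METRICS_PORT_OFFSET
    return grpc_port - METRICS_PORT_OFFSET
```

For example, a worker on 50051 gets its sidecar on 60051. As the Codex review notes, the wrap rule can still collide when same-host workers' gRPC ports differ by exactly 10,000 (50051 maps to 60051 while 60051 wraps back to 50051).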
Example output of engine metrics via SMG:
Checklist

- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Summary by CodeRabbit