
feat(observability): Add HTTP sidecar for ingesting engine metrics from SGLang gRPC workers#813

Closed
scottjlee wants to merge 6 commits into main from sjl/0318-engine-metrics-sglang

Conversation

@scottjlee (Collaborator) commented Mar 19, 2026

Description

Problem

Currently, the SMG /engine_metrics endpoint does not emit engine metrics when running in gRPC mode. Moreover, in gRPC mode SGLang does not expose an HTTP endpoint from which SMG can retrieve metrics directly.

Solution

Add a lightweight metrics sidecar that reads the metrics the SGLang engine writes to PROMETHEUS_MULTIPROC_DIR and exposes them to SMG via an HTTP server.
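A minimal sketch of such a sidecar using prometheus_client's multiprocess support (the function name and signature here are illustrative, not the actual server.py implementation):

```python
import os

from prometheus_client import CollectorRegistry, start_http_server
from prometheus_client import multiprocess


def start_metrics_sidecar(port: int, multiproc_dir: str) -> None:
    """Serve metrics aggregated across engine processes on /metrics."""
    # SGLang engine processes write per-process metric shards here.
    os.environ["PROMETHEUS_MULTIPROC_DIR"] = multiproc_dir
    registry = CollectorRegistry()
    # Merge the per-process shards from the directory on every scrape.
    multiprocess.MultiProcessCollector(registry)
    # prometheus_client's built-in HTTP server exposes /metrics
    # on a daemon thread, so this call returns immediately.
    start_http_server(port, registry=registry)
```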

Changes

  • gRPC servicer (server.py): When enable_metrics is set, initialize the Prometheus multiprocess directory and start a lightweight HTTP server on grpc_port + 10000 (falling back to grpc_port - 10000 if that overflows the valid port range) that serves /metrics using the multiprocess collector.
  • Worker manager (worker_manager.rs): Introduce utility functions for gRPC workers to generate sidecar URLs with the new port mapping. The engine-metrics path now uses these functions, so Prometheus scrapes reach gRPC workers correctly and the results are exposed via the /engine_metrics endpoint.
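The port mapping described above can be sketched as follows (a Python sketch with a hypothetical helper name; the actual utilities live in worker_manager.rs):

```python
MAX_PORT = 65535  # highest valid TCP port
SIDECAR_PORT_OFFSET = 10000


def sidecar_url(host: str, grpc_port: int) -> str:
    """Map a gRPC worker address to its metrics sidecar URL."""
    port = grpc_port + SIDECAR_PORT_OFFSET
    if port > MAX_PORT:
        # Offset would overflow the port range: subtract it instead.
        port = grpc_port - SIDECAR_PORT_OFFSET
    return f"http://{host}:{port}/metrics"
```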

Test Plan

Example output of engine metrics via SMG:
curl http://127.0.0.1:30001/engine_metrics
# HELP sglang_num_used_tokens The number of used tokens.
# TYPE sglang_num_used_tokens gauge
sglang_num_used_tokens{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 852

# HELP sglang_engine_load_weights_time The time taken for the engine to load weights.
# TYPE sglang_engine_load_weights_time gauge
sglang_engine_load_weights_time{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_spec_accept_rate The average acceptance rate of speculative decoding (`accepted tokens / total draft tokens` in batch).
# TYPE sglang_spec_accept_rate gauge
sglang_spec_accept_rate{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_grammar_queue_reqs The number of requests in the grammar waiting queue.
# TYPE sglang_num_grammar_queue_reqs gauge
sglang_num_grammar_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_gen_throughput The generation throughput (token/s).
# TYPE sglang_gen_throughput gauge
sglang_gen_throughput{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 450.51289169675726

# HELP sglang_num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang_num_queue_reqs gauge
sglang_num_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_decode_transfer_queue_reqs The number of requests in the decode transfer queue.
# TYPE sglang_num_decode_transfer_queue_reqs gauge
sglang_num_decode_transfer_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_cache_config_info Cache configuration information.
# TYPE sglang_cache_config_info gauge
sglang_cache_config_info{num_pages="410615",page_size="1",worker_addr="grpc://127.0.0.1:8001"} 1

# HELP sglang_num_running_reqs_offline_batch The number of running low-priority offline batch requests(label is 'batch').
# TYPE sglang_num_running_reqs_offline_batch gauge
sglang_num_running_reqs_offline_batch{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_full_token_usage The token usage for full attention layers.
# TYPE sglang_full_token_usage gauge
sglang_full_token_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.002074936375923919

# HELP sglang_num_paused_reqs The number of paused requests by async weight sync.
# TYPE sglang_num_paused_reqs gauge
sglang_num_paused_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pid="3161",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_decode_sum_seq_lens The sum of all sequence lengths in decode.
# TYPE sglang_decode_sum_seq_lens gauge
sglang_decode_sum_seq_lens{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 933

# HELP sglang_num_prefill_inflight_queue_reqs The number of requests in the prefill inflight queue.
# TYPE sglang_num_prefill_inflight_queue_reqs gauge
sglang_num_prefill_inflight_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_utilization The utilization.
# TYPE sglang_utilization gauge
sglang_utilization{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_cuda_graph_passes_total Total number of forward passes categorized by CUDA graph.
# TYPE sglang_cuda_graph_passes_total counter
sglang_cuda_graph_passes_total{engine_type="unified",mode="prefill_cuda_graph",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 6
sglang_cuda_graph_passes_total{engine_type="unified",mode="decode_cuda_graph",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 311

# HELP sglang_num_prefill_prealloc_queue_reqs The number of requests in the prefill prealloc queue.
# TYPE sglang_num_prefill_prealloc_queue_reqs gauge
sglang_num_prefill_prealloc_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_cache_hit_rate The prefix cache hit rate.
# TYPE sglang_cache_hit_rate gauge
sglang_cache_hit_rate{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_realtime_tokens_total Total number of tokens processed (updated on each log interval). mode_ prefill_compute, prefill_cache, decode.
# TYPE sglang_realtime_tokens_total counter
sglang_realtime_tokens_total{engine_type="unified",mode="prefill_compute",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 55
sglang_realtime_tokens_total{engine_type="unified",mode="decode",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 901
sglang_realtime_tokens_total{engine_type="unified",mode="prefill_cache",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 123

# HELP sglang_pending_prealloc_token_usage The token usage for pending preallocated tokens (not preallocated yet).
# TYPE sglang_pending_prealloc_token_usage gauge
sglang_pending_prealloc_token_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_running_reqs The number of running requests.
# TYPE sglang_num_running_reqs gauge
sglang_num_running_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 3

# HELP sglang_mamba_usage The token usage for Mamba layers.
# TYPE sglang_mamba_usage gauge
sglang_mamba_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_max_total_num_tokens Maximum total number of tokens in the KV cache pool.
# TYPE sglang_max_total_num_tokens gauge
sglang_max_total_num_tokens{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 410615

# HELP sglang_engine_startup_time The time taken for the engine to start up.
# TYPE sglang_engine_startup_time gauge
sglang_engine_startup_time{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_token_usage The token usage.
# TYPE sglang_token_usage gauge
sglang_token_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.002074936375923919

# HELP sglang_new_token_ratio The new token ratio.
# TYPE sglang_new_token_ratio gauge
sglang_new_token_ratio{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.7

# HELP sglang_num_decode_prealloc_queue_reqs The number of requests in the decode prealloc queue.
# TYPE sglang_num_decode_prealloc_queue_reqs gauge
sglang_num_decode_prealloc_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang_spec_accept_length gauge
sglang_spec_accept_length{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_swa_token_usage The token usage for SWA layers.
# TYPE sglang_swa_token_usage gauge
sglang_swa_token_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_unique_running_routing_keys Number of unique routing keys in running batch.
# TYPE sglang_num_unique_running_routing_keys gauge
sglang_num_unique_running_routing_keys{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_per_stage_req_latency_seconds The latency of each stage of requests.
# TYPE sglang_per_stage_req_latency_seconds histogram
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.001"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.0016200000000000001"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.0026244000000000007"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.004251528"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.006887475360000002"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.011157710083200004"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.018075490334784006"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.029282294342350097"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.047437316834607154"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.07684845327206359"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.12449449430074302"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.20168108076720373"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.3267233508428701"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.5292918283654495"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.8574527619520282"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1.389073474362286"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2.2502990284669036"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="3.6454844261163837"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="5.905684770308542"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="9.567209327899839"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="15.49887911119774"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="25.10818416014034"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="40.67525833942735"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="65.89391850987231"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="106.74814798599316"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="172.93199973730896"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="280.1498395744405"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="453.84274011059364"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="735.2252389791618"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1191.064887146242"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="+Inf"} 8
sglang_per_stage_req_latency_seconds_sum{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.0001513659954071045
sglang_per_stage_req_latency_seconds_count{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.001"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.0016200000000000001"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.0026244000000000007"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.004251528"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.006887475360000002"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.011157710083200004"} 3
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.018075490334784006"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.029282294342350097"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.047437316834607154"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.07684845327206359"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.12449449430074302"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.20168108076720373"} 7
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.3267233508428701"} 7
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.5292918283654495"} 7
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.8574527619520282"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1.389073474362286"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2.2502990284669036"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="3.6454844261163837"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="5.905684770308542"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="9.567209327899839"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="15.49887911119774"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="25.10818416014034"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="40.67525833942735"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="65.89391850987231"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="106.74814798599316"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="172.93199973730896"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="280.1498395744405"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="453.84274011059364"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="735.2252389791618"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1191.064887146242"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="+Inf"} 8
sglang_per_stage_req_latency_seconds_sum{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 1.169399008154869
sglang_per_stage_req_latency_seconds_count{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 8

# HELP sglang_queue_time_seconds Histogram of queueing time in seconds.
# TYPE sglang_queue_time_seconds histogram
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0"} 0
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.1"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.2"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.5"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="3"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="4"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="5"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="10"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="15"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="20"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="30"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="40"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="50"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="60"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="70"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="80"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="90"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="100"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="200"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="300"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="400"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="500"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="600"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="700"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="800"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="900"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1000"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1200"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1400"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1600"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1800"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2000"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2500"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="3000"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="+Inf"} 8
sglang_queue_time_seconds_sum{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.002929016947746277
sglang_queue_time_seconds_count{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 8

# HELP sglang_is_cuda_graph Whether the batch is using CUDA graph.
# TYPE sglang_is_cuda_graph gauge
sglang_is_cuda_graph{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_routing_key_running_req_count Distribution of routing keys by running request count (gt < count <= le).
# TYPE sglang_routing_key_running_req_count gauge
sglang_routing_key_running_req_count{engine_type="unified",gt="0",le="1",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="1",le="2",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="2",le="3",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="3",le="5",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="5",le="7",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="7",le="10",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="10",le="20",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="20",le="50",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="50",le="100",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="100",le="200",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="200",le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_retracted_reqs The number of retracted requests.
# TYPE sglang_num_retracted_reqs gauge
sglang_num_retracted_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pid="3161",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_routing_key_all_req_count Distribution of routing keys by running+waiting request count (gt < count <= le).
# TYPE sglang_routing_key_all_req_count gauge
sglang_routing_key_all_req_count{engine_type="unified",gt="0",le="1",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="1",le="2",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="2",le="3",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="3",le="5",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="5",le="7",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="7",le="10",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="10",le="20",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="20",le="50",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="50",le="100",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="100",le="200",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="200",le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features
    • Exposes Prometheus metrics via a background HTTP endpoint alongside the gRPC server, with graceful startup and shutdown.
    • Adds robust host/port handling (including IPv6) for service addresses and warmup connections.
    • Extends metrics collection to reliably target both HTTP and gRPC workers by deriving correct metrics URLs across connection modes.

Signed-off-by: Scott Lee <scott@together.ai>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Walkthrough

gRPC server now computes and runs a background Prometheus HTTP metrics sidecar (with multiprocess registry setup and graceful shutdown). Host:port formatting and metrics-port derivation helpers added. Worker manager generalized fan-out and now derives protocol-aware metrics URLs for both HTTP and gRPC workers.

Changes

Cohort / File(s) Summary
gRPC server & metrics sidecar
grpc_servicer/smg_grpc_servicer/sglang/server.py
Added _derive_metrics_port(grpc_port: int), _format_host_port(host: str, port: int), and _start_metrics_http_server(host: str, port: int). Replaced raw f"{host}:{port}" usage with _format_host_port(...). When server_args.enable_metrics is set: calls set_prometheus_multiproc_dir() before scheduler start, starts a background HTTP metrics server (computed metrics port), and tracks/shuts down the metrics HTTP server (shutdown, close, join with timeout) during serve shutdown.
Worker manager metrics URL & fan-out
model_gateway/src/core/worker_manager.rs
Introduced fan_out_with(...) to accept a caller-provided url_fn(worker, endpoint). Added helpers strip_dp_rank_suffix, parse_worker_host_port, derive_grpc_metrics_port, and metrics_url(...) to build protocol-aware metrics URLs (HTTP uses /{endpoint}, gRPC derives an HTTP metrics port via offset/wrap). Updated WorkerManager::get_engine_metrics to call fan_out_with(..., metrics_url) and preserve worker original worker.url() in responses.
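The sidecar lifecycle described above (background HTTP thread, explicit shutdown/close/join) can be sketched with the standard library alone. Here `render_metrics` is a stand-in for calling `prometheus_client.generate_latest` over a multiprocess `CollectorRegistry`; the function name and signature are illustrative, not the PR's exact code:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def start_metrics_http_server(host, port, render_metrics):
    """Start a quiet /metrics server on a background thread.

    Returns (httpd, thread) so the caller can later run the graceful
    shutdown sequence: httpd.shutdown(); httpd.server_close(); thread.join().
    """

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            payload = render_metrics()  # bytes in Prometheus text format
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, fmt, *args):
            # Keep the sidecar from spamming the worker's stderr per scrape.
            pass

    httpd = HTTPServer((host, port), MetricsHandler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    return httpd, thread
```

Returning both the server and its thread is what the review comments below ask for: it lets `serve_grpc` close the listening socket and join the thread during shutdown instead of relying on `daemon=True` alone.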

Sequence Diagram(s)

sequenceDiagram
    autonumber
    rect rgba(240,248,255,0.5)
    participant Client
    participant WarmupThread as Warmup Thread
    participant GRPC as gRPC Server
    participant MetricsHTTP as Metrics Sidecar
    participant Scheduler as Scheduler Processes
    participant Prometheus
    end

    Client->>GRPC: connect to gRPC address (host:port)
    GRPC->>WarmupThread: spawn warmup connection to formatted host:port
    GRPC->>Scheduler: launch scheduler processes (inherits PROMETHEUS multiproc dir)
    GRPC->>MetricsHTTP: start metrics HTTP server on derived metrics port
    Prometheus->>MetricsHTTP: scrape /metrics
    Note right of MetricsHTTP: serves multiprocess CollectorRegistry /metrics
    Client->>GRPC: shutdown request
    GRPC->>MetricsHTTP: shutdown, close, join thread (timeout)
    GRPC->>WarmupThread: join warmup thread
    GRPC->>Scheduler: stop schedulers
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped from host to port with cheer,

metrics whispered close and near,
sidecar hummed a tiny tune,
schedulers danced beneath the moon,
Prometheus peered — all systems clear!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the main change: adding an HTTP sidecar for metrics ingestion from SGLang gRPC workers, which is the core objective of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses the current limitation where SGLang gRPC workers do not expose engine metrics directly. It introduces a new architecture that enables observability for these workers by deploying a dedicated HTTP sidecar alongside each gRPC servicer. This sidecar collects metrics from a shared Prometheus multiprocess directory and exposes them via an HTTP endpoint, allowing the SMG /engine_metrics endpoint to successfully retrieve and aggregate performance data from all worker types.

Highlights

  • Enabled Prometheus Metrics for gRPC Workers: Introduced a mechanism to collect and expose Prometheus metrics from SGLang gRPC workers, which previously lacked this capability.
  • HTTP Metrics Sidecar: Implemented a lightweight HTTP sidecar for gRPC servicers, running on grpc_port + 1, to serve /metrics endpoints for Prometheus scraping.
  • Worker Manager Integration: Updated the worker manager to intelligently route metric collection requests, converting gRPC worker addresses to the corresponding HTTP sidecar addresses for metrics retrieval.
  • Prometheus Multiprocess Directory Setup: Configured the Prometheus multiprocess directory in the gRPC servicer to ensure child scheduler processes correctly write metrics to shared files.

@github-actions github-actions bot added the model-gateway Model gateway crate changes label Mar 19, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces Prometheus metrics collection for the gRPC servicer and model gateway. In sglang/server.py, it adds functionality to set up Prometheus multiprocess directories and starts a dedicated HTTP server to expose /metrics endpoints, running on grpc_port + 1. In worker_manager.rs, the fan_out mechanism is refactored to support custom URL building, and a new metrics_url function is added to correctly determine the metrics endpoint for both HTTP and gRPC workers (by inferring the sidecar port for gRPC). Review comments suggest moving imports related to Prometheus and the HTTP server to the top of sglang/server.py for better Python best practices and dependency management. Additionally, the reviewer noted that the new metrics HTTP server in sglang/server.py lacks graceful shutdown, potentially leading to resource leaks or un-flushed metrics, and recommended explicit lifecycle management. Finally, the reviewer pointed out that the URL parsing logic in metrics_url in worker_manager.rs could be more robust, as it might silently generate incorrect URLs for malformed gRPC worker addresses, suggesting improved error handling and logging.

Comment on lines +45 to +47
from sglang.srt.utils import set_prometheus_multiproc_dir

set_prometheus_multiproc_dir()

medium

It's generally considered best practice in Python to place all import statements at the top of the file, outside of conditional blocks. Importing inside an if statement can lead to unexpected behavior if the condition is not met, or if the module is later needed elsewhere in the file where the import might not have executed. Moving this import to the top of the file improves readability and ensures the dependency is always available.

from sglang.srt.utils import set_prometheus_multiproc_dir

Comment on lines +277 to +278
metrics_port = server_args.port + 1
_start_metrics_http_server(server_args.host, metrics_port)

medium

The _start_metrics_http_server function starts a daemon thread that runs httpd.serve_forever(). While daemon=True ensures the program can exit, it doesn't allow for a graceful shutdown of the HTTP server itself. For better resource management and to ensure metrics are properly flushed before exit, the serve_grpc function should manage the lifecycle of the metrics HTTP server, including calling httpd.shutdown() during the overall server shutdown process. This would require _start_metrics_http_server to return the HTTPServer instance and the thread.

References
  1. When managing resources (e.g., workers) that are acquired and need to be released, ensure that all acquired resources are tracked immediately and consistently, especially when exceptions might occur during subsequent operations (e.g., launching new resources). This prevents resource leaks by ensuring that even partially successful acquisitions are accounted for cleanup.

Comment on lines +335 to +337
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

medium

Similar to the previous comment, it's best practice to place imports at the top of the file. Importing http.server and prometheus_client inside this function means they are imported every time the function is called (though in this case, it's only once). Moving these to the top of the file improves code clarity and adheres to standard Python style.

from http.server import BaseHTTPRequestHandler, HTTPServer
from prometheus_client import CollectorRegistry, generate_latest, multiprocess

Comment on lines +101 to +110
let stripped = url
    .trim_start_matches("grpc://")
    .trim_start_matches("http://")
    .trim_start_matches("https://");
if let Some((host, port_str)) = stripped.rsplit_once(':') {
    if let Ok(port) = port_str.parse::<u16>() {
        return format!("http://{}:{}/{}", host, port + 1, endpoint);
    }
}
format!("http://{stripped}/{endpoint}")

medium

The URL parsing logic in metrics_url for ConnectionMode::Grpc could be more robust. If stripped.rsplit_once(':') returns None (meaning no colon, so no port) or port_str.parse::<u16>() fails, the function falls back to format!("http://{stripped}/{endpoint}"). This might result in an invalid or unintended URL if the original grpc:// address was malformed (e.g., grpc://host without a port). It would be safer to log a warning or error and perhaps return an Option<String> or Result<String, Error> to indicate a failure in constructing a valid metrics URL, rather than silently creating a potentially incorrect one.

References
  1. Instead of silently ignoring potential failures (e.g., from serialization), log them as warnings to aid in debugging. In Rust, prefer using unwrap_or_else to log an error over unwrap_or_default which would fail silently.

scottjlee and others added 3 commits March 18, 2026 21:40
@scottjlee scottjlee marked this pull request as ready for review March 19, 2026 19:55

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 918c8c9c7b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +278 to +279
metrics_port = server_args.port + 1
metrics_httpd = _start_metrics_http_server(server_args.host, metrics_port)

P1 Badge Avoid deriving the metrics sidecar port from grpc_port + 1

Using server_args.port + 1 here breaks common multi-worker gRPC layouts: the repo’s own example in docs/getting-started/grpc-workers.md:103-105 uses grpc://worker1:50051 grpc://worker2:50052, so worker 1’s metrics sidecar will try to bind 50052, colliding with worker 2’s gRPC listener. With enable_metrics on, adjacent worker ports make startup fail nondeterministically or leave one worker without metrics, so the sidecar needs a collision-free port assignment instead of assuming +1 is free.


    def log_message(self, format, *args):
        pass

httpd = HTTPServer((host, port), MetricsHandler)

P1 Badge Use an IPv6-capable HTTP server for the metrics sidecar

server_args.host is documented to support IPv6 values such as :: and ::1 (docs/reference/configuration.md:34-37), but HTTPServer is AF_INET-only. In that configuration HTTPServer((host, port), ...) raises gaierror: Address family for hostname not supported, so enabling metrics turns an otherwise valid IPv6 gRPC worker into a startup failure. The sidecar needs to select an AF_INET6 server when the host is IPv6.

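One way to address the IPv6 issue raised above is to choose the address family from the host literal before binding. This is a hypothetical sketch (`DualFamilyHTTPServer` is not a name from this PR); `socketserver.TCPServer` reads `self.address_family` when creating its socket, so setting it before `super().__init__` is sufficient:

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


class DualFamilyHTTPServer(HTTPServer):
    """HTTPServer that binds with AF_INET6 when the host looks like an
    IPv6 literal (e.g. '::' or '::1'); plain HTTPServer is AF_INET-only."""

    def __init__(self, server_address, handler_cls, bind_and_activate=True):
        host, _port = server_address
        if ":" in host:
            # Must be set before super().__init__ creates the socket.
            self.address_family = socket.AF_INET6
        super().__init__(server_address, handler_cls, bind_and_activate)
```

A production version would likely resolve the host with `socket.getaddrinfo` rather than checking for a colon, but the colon heuristic covers the documented `::` and `::1` cases.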


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/core/worker_manager.rs (1)

56-79: ⚠️ Potential issue | 🟠 Major

Do not forward worker API keys to the unauthenticated metrics endpoint.

fan_out_with() unconditionally applies bearer auth to all requests, and get_engine_metrics() uses it to fetch metrics from http://host:(port+1)/metrics for gRPC workers. The metrics HTTP server in sglang/server.py has no authentication check, so this unnecessarily exposes bearer tokens to an endpoint that doesn't validate them. Make auth opt-in per call site or explicitly skip bearer_auth() when targeting the metrics endpoint.



📥 Commits

Reviewing files that changed from the base of the PR and between e231a73 and 918c8c9.

📒 Files selected for processing (2)
  • grpc_servicer/smg_grpc_servicer/sglang/server.py
  • model_gateway/src/core/worker_manager.rs

Comment on lines +101 to +107
let stripped = url
.trim_start_matches("grpc://")
.trim_start_matches("http://")
.trim_start_matches("https://");
if let Some((host, port_str)) = stripped.rsplit_once(':') {
if let Ok(port) = port_str.parse::<u16>() {
return format!("http://{}:{}/{}", host, port + 1, endpoint);

⚠️ Potential issue | 🟡 Minor


Guard the port + 1 arithmetic with checked_add(1).

For grpc://host:65535, port + 1 overflows u16—with overflow checks enabled it panics, and without them it wraps to 0, producing http://host:0/.... Use checked_add(1) and fall back to the existing graceful error path on line 110 instead of deriving a wrapped port.


Signed-off-by: Scott Lee <scott@together.ai>
Signed-off-by: Scott Lee <scott@together.ai>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7634160757


}

fn parse_worker_host_port(url: &str) -> Option<(String, u16)> {
    let parsed = url::Url::parse(strip_dp_rank_suffix(url)).ok()?;

P1: Preserve DP ranks when scraping gRPC engine metrics

When --dp-aware is enabled, CreateWorkerStep registers one Worker per rank (model_gateway/src/core/steps/worker/local/create_worker.rs:126-157). Calling strip_dp_rank_suffix() here collapses all of those gRPC URLs onto the same sidecar endpoint, but get_engine_metrics() still labels each response with the original worker_addr and aggregate_metrics() keeps both series by adding that label (model_gateway/src/core/worker_manager.rs:356-363, model_gateway/src/core/metrics_aggregator.rs:42-48). In a DP-aware SGLang deployment, /engine_metrics will therefore emit the same exposition once per rank, multiplying every counter/gauge by dp_size instead of reporting the real values.


Comment on lines +49 to +52
if grpc_port + METRICS_PORT_OFFSET <= MAX_TCP_PORT:
    return grpc_port + METRICS_PORT_OFFSET

wrapped_port = grpc_port - METRICS_PORT_OFFSET

P2: Use a collision-free mapping for the sidecar metrics port

This offset-and-wrap rule still collides for same-host workers whose gRPC ports differ by 10,000. For example, a worker on 50051 binds its sidecar on 60051, while a worker on 60051 wraps back to 50051, so each sidecar tries to take the other worker's gRPC listener. Because serve_grpc() binds the HTTP sidecar with this function and the gateway mirrors the same mapping in model_gateway/src/core/worker_manager.rs, enabling metrics still breaks valid multi-worker layouts even though the adjacent-port case was fixed.

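The collision this comment describes can be reproduced with a small sketch of the offset-and-wrap rule, mirroring the Python snippet quoted above (constant and function names are taken from that snippet, not from the actual gateway code):

```rust
// Sketch of the offset-and-wrap sidecar port mapping described in the PR.
const METRICS_PORT_OFFSET: u16 = 10_000;

fn metrics_port(grpc_port: u16) -> u16 {
    // In u16 arithmetic, checked_add returns None exactly when the sum
    // would exceed the maximum TCP port (65_535); in that case the rule
    // wraps downward instead, which is where the collision comes from.
    grpc_port
        .checked_add(METRICS_PORT_OFFSET)
        .unwrap_or(grpc_port - METRICS_PORT_OFFSET)
}
```

Two same-host workers on 50051 and 60051 map onto each other's listeners: `metrics_port(50051)` is 60051, while `metrics_port(60051)` wraps back to 50051, so each sidecar tries to bind the other worker's gRPC port.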


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/core/worker_manager.rs (1)

64-80: ⚠️ Potential issue | 🟠 Major

Do not forward worker API keys to the derived gRPC metrics URL.

fan_out_with() still unconditionally applies worker.api_key(), but metrics_url() rewrites gRPC workers to http://host:.../metrics. If a gRPC worker has an API key configured, this sends that secret over plain HTTP to an unauthenticated sidecar that does not need it. Plumb auth separately here and suppress it for gRPC metrics scrapes.

Also applies to: 125-143

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 64 - 80, The current
fan_out_with() builder unconditionally applies worker.api_key() when building
requests (in the async block creating WorkerResponse), which forwards gRPC
worker API keys to rewritten metrics URLs; change this so auth is only attached
when appropriate by detecting metrics scrapes and skipping bearer_auth for gRPC
metrics. Concretely: in the closure in fan_out_with()/the async move block that
calls url_fn(worker, endpoint) (and constructs full_url), determine whether this
request is a metrics scrape for a gRPC worker (e.g., via worker.kind()/is_grpc()
or by comparing endpoint/metrics_url semantics) and only call
req.bearer_auth(api_key) when NOT a gRPC metrics scrape; keep WorkerResponse {
url: worker_url, result: ... } behavior otherwise and ensure any other places
(the second occurrence around lines 125-143) follow the same conditional auth
logic.
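The suggested gating can be sketched as a small predicate. `should_attach_bearer_auth` and its string-based inputs are hypothetical; the real code would derive them from the worker type and endpoint semantics in worker_manager.rs:

```rust
// Sketch of the suggested auth gating: attach the worker API key only
// when the request is NOT a metrics scrape rewritten to a gRPC worker's
// HTTP sidecar, so the key never travels to the unauthenticated sidecar.
fn should_attach_bearer_auth(worker_url: &str, endpoint: &str) -> bool {
    let is_grpc_worker = worker_url.starts_with("grpc://");
    let is_metrics_scrape = endpoint == "/metrics";
    !(is_grpc_worker && is_metrics_scrape)
}
```

The fan-out closure would then call `req.bearer_auth(api_key)` only when this predicate holds, leaving all non-metrics gRPC traffic and all HTTP workers unchanged.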

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 363b0059-98f4-4f61-a936-730279a9e7da

📥 Commits

Reviewing files that changed from the base of the PR and between 918c8c9 and 7634160.

📒 Files selected for processing (2)
  • grpc_servicer/smg_grpc_servicer/sglang/server.py
  • model_gateway/src/core/worker_manager.rs

Comment on lines +92 to +99
fn strip_dp_rank_suffix(url: &str) -> &str {
    if let Some(at_pos) = url.rfind('@') {
        let suffix = &url[at_pos + 1..];
        if !suffix.is_empty() && suffix.chars().all(|c| c.is_ascii_digit()) {
            return &url[..at_pos];
        }
    }
    url

⚠️ Potential issue | 🟠 Major

Deduplicate scrapes that collapse onto the same sidecar endpoint.

strip_dp_rank_suffix() intentionally makes URLs like grpc://host:port@0 and @1 resolve to the same metrics URL, but get_engine_metrics() still fans out once per logical worker. That will fetch the same multiprocess payload multiple times and emit duplicate worker_addr series, inflating any cross-worker sums. De-dup on the derived metrics URL before calling fan_out_with(), and use a canonical label for that endpoint.

Also applies to: 125-143, 347-354

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 92 - 99, The current
logic in strip_dp_rank_suffix() makes addresses like "grpc://host:port@0" and
"@1" canonicalize to the same metrics URL but get_engine_metrics() still fans
out per original worker address, causing duplicate scrapes and duplicate
worker_addr series; update get_engine_metrics() (and the other similar blocks at
the regions referenced) to first map each original worker address to its
canonical metrics URL via strip_dp_rank_suffix(), deduplicate the list of
canonical URLs, and then call fan_out_with() once per unique canonical URL; also
ensure that the emitted label for the endpoint uses the canonical URL (or a new
canonical_label) instead of the original per-worker address so identical
endpoints produce a single metrics series.
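The suggested dedup can be sketched on top of the `strip_dp_rank_suffix()` shown in the diff; `dedup_scrape_targets` is a hypothetical helper, not part of the actual worker_manager.rs API:

```rust
use std::collections::HashSet;

// Copied from the quoted diff hunk: strips a trailing "@<dp_rank>" suffix
// so per-rank addresses canonicalize to one sidecar endpoint.
fn strip_dp_rank_suffix(url: &str) -> &str {
    if let Some(at_pos) = url.rfind('@') {
        let suffix = &url[at_pos + 1..];
        if !suffix.is_empty() && suffix.chars().all(|c| c.is_ascii_digit()) {
            return &url[..at_pos];
        }
    }
    url
}

// Hypothetical helper: keep one scrape target per canonical URL so the
// fan-out hits each sidecar exactly once, regardless of dp_size.
fn dedup_scrape_targets<'a>(workers: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    workers
        .iter()
        .map(|w| strip_dp_rank_suffix(w))
        .filter(|canonical| seen.insert(*canonical))
        .collect()
}
```

With this in place, `["grpc://h:8001@0", "grpc://h:8001@1"]` yields a single scrape of `grpc://h:8001`, and the canonical URL doubles as the series label.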

@github-actions

github-actions bot commented Apr 4, 2026

This pull request has been automatically marked as stale because it has not had any activity within 14 days. It will be automatically closed if no further activity occurs within 16 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale PR has been inactive for 14+ days label Apr 4, 2026
@mergify
Contributor

mergify bot commented Apr 9, 2026

Hi @scottjlee, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase PR has merge conflicts that need to be resolved label Apr 9, 2026
@scottjlee scottjlee closed this Apr 9, 2026
@lightseek-bot lightseek-bot deleted the sjl/0318-engine-metrics-sglang branch April 10, 2026 04:02

Labels

model-gateway Model gateway crate changes needs-rebase PR has merge conflicts that need to be resolved stale PR has been inactive for 14+ days
