
feat(observability): Add HTTP sidecar for ingesting engine metrics from SGLang gRPC workers#813

Closed
scottjlee wants to merge 6 commits into main from sjl/0318-engine-metrics-sglang

Conversation

@scottjlee (Collaborator) commented Mar 19, 2026

Description

Problem

Currently, the SMG /engine_metrics endpoint does not emit engine metrics when running in gRPC mode. Moreover, in gRPC mode SGLang does not expose an HTTP endpoint from which SMG can retrieve metrics directly.

Solution

Add a lightweight metrics sidecar that reads the metrics the SGLang engine writes to PROMETHEUS_MULTIPROC_DIR and exposes them to SMG via an HTTP server.
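A minimal sketch of such a sidecar using prometheus_client's multiprocess support (the function name and signature here are illustrative, not the actual server.py implementation):

```python
import os

from prometheus_client import CollectorRegistry, start_http_server
from prometheus_client import multiprocess


def start_metrics_sidecar(port: int, multiproc_dir: str) -> None:
    """Serve metrics aggregated across engine processes on /metrics."""
    # SGLang engine processes write per-process metric shards here.
    os.environ["PROMETHEUS_MULTIPROC_DIR"] = multiproc_dir
    registry = CollectorRegistry()
    # Merge the per-process shards from the directory on every scrape.
    multiprocess.MultiProcessCollector(registry)
    # prometheus_client's built-in HTTP server exposes /metrics
    # on a daemon thread, so this call returns immediately.
    start_http_server(port, registry=registry)
```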

Changes

  • gRPC servicer (server.py): When enable_metrics is set, initialize the Prometheus multiprocess directory and start a lightweight HTTP server on grpc_port + 10000 (falling back to grpc_port - 10000 if that overflows the valid port range) that serves /metrics using the multiprocess collector.
  • Worker manager (worker_manager.rs): Introduce utility functions for gRPC workers to generate sidecar URLs with the new port mapping. The engine-metrics path now uses these functions, so Prometheus scrapes reach gRPC workers correctly and the results are exposed via the /engine_metrics endpoint.
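The port mapping described above can be sketched as follows (a Python sketch with a hypothetical helper name; the actual utilities live in worker_manager.rs):

```python
MAX_PORT = 65535  # highest valid TCP port
SIDECAR_PORT_OFFSET = 10000


def sidecar_url(host: str, grpc_port: int) -> str:
    """Map a gRPC worker address to its metrics sidecar URL."""
    port = grpc_port + SIDECAR_PORT_OFFSET
    if port > MAX_PORT:
        # Offset would overflow the port range: subtract it instead.
        port = grpc_port - SIDECAR_PORT_OFFSET
    return f"http://{host}:{port}/metrics"
```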

Test Plan

Example output of engine metrics via SMG:
curl http://127.0.0.1:30001/engine_metrics
# HELP sglang_num_used_tokens The number of used tokens.
# TYPE sglang_num_used_tokens gauge
sglang_num_used_tokens{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 852

# HELP sglang_engine_load_weights_time The time taken for the engine to load weights.
# TYPE sglang_engine_load_weights_time gauge
sglang_engine_load_weights_time{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_spec_accept_rate The average acceptance rate of speculative decoding (`accepted tokens / total draft tokens` in batch).
# TYPE sglang_spec_accept_rate gauge
sglang_spec_accept_rate{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_grammar_queue_reqs The number of requests in the grammar waiting queue.
# TYPE sglang_num_grammar_queue_reqs gauge
sglang_num_grammar_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_gen_throughput The generation throughput (token/s).
# TYPE sglang_gen_throughput gauge
sglang_gen_throughput{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 450.51289169675726

# HELP sglang_num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang_num_queue_reqs gauge
sglang_num_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_decode_transfer_queue_reqs The number of requests in the decode transfer queue.
# TYPE sglang_num_decode_transfer_queue_reqs gauge
sglang_num_decode_transfer_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_cache_config_info Cache configuration information.
# TYPE sglang_cache_config_info gauge
sglang_cache_config_info{num_pages="410615",page_size="1",worker_addr="grpc://127.0.0.1:8001"} 1

# HELP sglang_num_running_reqs_offline_batch The number of running low-priority offline batch requests(label is 'batch').
# TYPE sglang_num_running_reqs_offline_batch gauge
sglang_num_running_reqs_offline_batch{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_full_token_usage The token usage for full attention layers.
# TYPE sglang_full_token_usage gauge
sglang_full_token_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.002074936375923919

# HELP sglang_num_paused_reqs The number of paused requests by async weight sync.
# TYPE sglang_num_paused_reqs gauge
sglang_num_paused_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pid="3161",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_decode_sum_seq_lens The sum of all sequence lengths in decode.
# TYPE sglang_decode_sum_seq_lens gauge
sglang_decode_sum_seq_lens{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 933

# HELP sglang_num_prefill_inflight_queue_reqs The number of requests in the prefill inflight queue.
# TYPE sglang_num_prefill_inflight_queue_reqs gauge
sglang_num_prefill_inflight_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_utilization The utilization.
# TYPE sglang_utilization gauge
sglang_utilization{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_cuda_graph_passes_total Total number of forward passes categorized by CUDA graph.
# TYPE sglang_cuda_graph_passes_total counter
sglang_cuda_graph_passes_total{engine_type="unified",mode="prefill_cuda_graph",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 6
sglang_cuda_graph_passes_total{engine_type="unified",mode="decode_cuda_graph",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 311

# HELP sglang_num_prefill_prealloc_queue_reqs The number of requests in the prefill prealloc queue.
# TYPE sglang_num_prefill_prealloc_queue_reqs gauge
sglang_num_prefill_prealloc_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_cache_hit_rate The prefix cache hit rate.
# TYPE sglang_cache_hit_rate gauge
sglang_cache_hit_rate{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_realtime_tokens_total Total number of tokens processed (updated on each log interval). mode_ prefill_compute, prefill_cache, decode.
# TYPE sglang_realtime_tokens_total counter
sglang_realtime_tokens_total{engine_type="unified",mode="prefill_compute",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 55
sglang_realtime_tokens_total{engine_type="unified",mode="decode",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 901
sglang_realtime_tokens_total{engine_type="unified",mode="prefill_cache",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 123

# HELP sglang_pending_prealloc_token_usage The token usage for pending preallocated tokens (not preallocated yet).
# TYPE sglang_pending_prealloc_token_usage gauge
sglang_pending_prealloc_token_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_running_reqs The number of running requests.
# TYPE sglang_num_running_reqs gauge
sglang_num_running_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 3

# HELP sglang_mamba_usage The token usage for Mamba layers.
# TYPE sglang_mamba_usage gauge
sglang_mamba_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_max_total_num_tokens Maximum total number of tokens in the KV cache pool.
# TYPE sglang_max_total_num_tokens gauge
sglang_max_total_num_tokens{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 410615

# HELP sglang_engine_startup_time The time taken for the engine to start up.
# TYPE sglang_engine_startup_time gauge
sglang_engine_startup_time{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_token_usage The token usage.
# TYPE sglang_token_usage gauge
sglang_token_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.002074936375923919

# HELP sglang_new_token_ratio The new token ratio.
# TYPE sglang_new_token_ratio gauge
sglang_new_token_ratio{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.7

# HELP sglang_num_decode_prealloc_queue_reqs The number of requests in the decode prealloc queue.
# TYPE sglang_num_decode_prealloc_queue_reqs gauge
sglang_num_decode_prealloc_queue_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang_spec_accept_length gauge
sglang_spec_accept_length{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_swa_token_usage The token usage for SWA layers.
# TYPE sglang_swa_token_usage gauge
sglang_swa_token_usage{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_unique_running_routing_keys Number of unique routing keys in running batch.
# TYPE sglang_num_unique_running_routing_keys gauge
sglang_num_unique_running_routing_keys{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_per_stage_req_latency_seconds The latency of each stage of requests.
# TYPE sglang_per_stage_req_latency_seconds histogram
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.001"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.0016200000000000001"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.0026244000000000007"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.004251528"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.006887475360000002"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.011157710083200004"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.018075490334784006"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.029282294342350097"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.047437316834607154"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.07684845327206359"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.12449449430074302"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.20168108076720373"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.3267233508428701"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.5292918283654495"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.8574527619520282"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1.389073474362286"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2.2502990284669036"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="3.6454844261163837"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="5.905684770308542"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="9.567209327899839"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="15.49887911119774"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="25.10818416014034"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="40.67525833942735"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="65.89391850987231"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="106.74814798599316"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="172.93199973730896"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="280.1498395744405"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="453.84274011059364"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="735.2252389791618"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1191.064887146242"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="+Inf"} 8
sglang_per_stage_req_latency_seconds_sum{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.0001513659954071045
sglang_per_stage_req_latency_seconds_count{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="request_process",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.001"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.0016200000000000001"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.0026244000000000007"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.004251528"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.006887475360000002"} 0
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.011157710083200004"} 3
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.018075490334784006"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.029282294342350097"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.047437316834607154"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.07684845327206359"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.12449449430074302"} 4
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.20168108076720373"} 7
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.3267233508428701"} 7
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.5292918283654495"} 7
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.8574527619520282"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1.389073474362286"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2.2502990284669036"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="3.6454844261163837"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="5.905684770308542"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="9.567209327899839"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="15.49887911119774"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="25.10818416014034"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="40.67525833942735"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="65.89391850987231"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="106.74814798599316"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="172.93199973730896"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="280.1498395744405"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="453.84274011059364"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="735.2252389791618"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1191.064887146242"} 8
sglang_per_stage_req_latency_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="+Inf"} 8
sglang_per_stage_req_latency_seconds_sum{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 1.169399008154869
sglang_per_stage_req_latency_seconds_count{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",stage="prefill_forward",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 8

# HELP sglang_queue_time_seconds Histogram of queueing time in seconds.
# TYPE sglang_queue_time_seconds histogram
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0"} 0
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.1"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.2"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="0.5"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="3"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="4"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="5"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="10"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="15"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="20"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="30"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="40"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="50"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="60"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="70"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="80"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="90"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="100"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="200"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="300"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="400"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="500"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="600"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="700"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="800"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="900"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1000"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1200"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1400"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1600"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="1800"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2000"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="2500"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="3000"} 8
sglang_queue_time_seconds_bucket{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001",le="+Inf"} 8
sglang_queue_time_seconds_sum{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0.002929016947746277
sglang_queue_time_seconds_count{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 8

# HELP sglang_is_cuda_graph Whether the batch is using CUDA graph.
# TYPE sglang_is_cuda_graph gauge
sglang_is_cuda_graph{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_routing_key_running_req_count Distribution of routing keys by running request count (gt < count <= le).
# TYPE sglang_routing_key_running_req_count gauge
sglang_routing_key_running_req_count{engine_type="unified",gt="0",le="1",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="1",le="2",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="2",le="3",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="3",le="5",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="5",le="7",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="7",le="10",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="10",le="20",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="20",le="50",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="50",le="100",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="100",le="200",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_running_req_count{engine_type="unified",gt="200",le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_num_retracted_reqs The number of retracted requests.
# TYPE sglang_num_retracted_reqs gauge
sglang_num_retracted_reqs{engine_type="unified",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pid="3161",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

# HELP sglang_routing_key_all_req_count Distribution of routing keys by running+waiting request count (gt < count <= le).
# TYPE sglang_routing_key_all_req_count gauge
sglang_routing_key_all_req_count{engine_type="unified",gt="0",le="1",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="1",le="2",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="2",le="3",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="3",le="5",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="5",le="7",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="7",le="10",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="10",le="20",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="20",le="50",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="50",le="100",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="100",le="200",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0
sglang_routing_key_all_req_count{engine_type="unified",gt="200",le="+Inf",model_name="meta-llama/Llama-3.1-8B-Instruct",moe_ep_rank="0",pp_rank="0",tp_rank="0",worker_addr="grpc://127.0.0.1:8001"} 0

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features
    • Exposes Prometheus metrics via a background HTTP endpoint alongside the gRPC server, with graceful startup and shutdown.
    • Adds robust host/port handling (including IPv6) for service addresses and warmup connections.
    • Extends metrics collection to reliably target both HTTP and gRPC workers by deriving correct metrics URLs across connection modes.

Signed-off-by: Scott Lee <scott@together.ai>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Walkthrough

gRPC server now computes and runs a background Prometheus HTTP metrics sidecar (with multiprocess registry setup and graceful shutdown). Host:port formatting and metrics-port derivation helpers added. Worker manager generalized fan-out and now derives protocol-aware metrics URLs for both HTTP and gRPC workers.

Changes

Cohort / File(s) Summary
gRPC server & metrics sidecar
grpc_servicer/smg_grpc_servicer/sglang/server.py
Added _derive_metrics_port(grpc_port: int), _format_host_port(host: str, port: int), and _start_metrics_http_server(host: str, port: int). Replaced raw f"{host}:{port}" usage with _format_host_port(...). When server_args.enable_metrics is set: calls set_prometheus_multiproc_dir() before scheduler start, starts a background HTTP metrics server (computed metrics port), and tracks/shuts down the metrics HTTP server (shutdown, close, join with timeout) during serve shutdown.
Worker manager metrics URL & fan-out
model_gateway/src/core/worker_manager.rs
Introduced fan_out_with(...) to accept a caller-provided url_fn(worker, endpoint). Added helpers strip_dp_rank_suffix, parse_worker_host_port, derive_grpc_metrics_port, and metrics_url(...) to build protocol-aware metrics URLs (HTTP uses /{endpoint}, gRPC derives an HTTP metrics port via offset/wrap). Updated WorkerManager::get_engine_metrics to call fan_out_with(..., metrics_url) and preserve worker original worker.url() in responses.
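The sidecar lifecycle described above (background HTTP thread, explicit shutdown/close/join) can be sketched with the standard library alone. Here `render_metrics` is a stand-in for calling `prometheus_client.generate_latest` over a multiprocess `CollectorRegistry`; the function name and signature are illustrative, not the PR's exact code:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def start_metrics_http_server(host, port, render_metrics):
    """Start a quiet /metrics server on a background thread.

    Returns (httpd, thread) so the caller can later run the graceful
    shutdown sequence: httpd.shutdown(); httpd.server_close(); thread.join().
    """

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            payload = render_metrics()  # bytes in Prometheus text format
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, fmt, *args):
            # Keep the sidecar from spamming the worker's stderr per scrape.
            pass

    httpd = HTTPServer((host, port), MetricsHandler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    return httpd, thread
```

Returning both the server and its thread is what the review comments below ask for: it lets `serve_grpc` close the listening socket and join the thread during shutdown instead of relying on `daemon=True` alone.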

Sequence Diagram(s)

sequenceDiagram
    autonumber
    rect rgba(240,248,255,0.5)
    participant Client
    participant WarmupThread as Warmup Thread
    participant GRPC as gRPC Server
    participant MetricsHTTP as Metrics Sidecar
    participant Scheduler as Scheduler Processes
    participant Prometheus
    end

    Client->>GRPC: connect to gRPC address (host:port)
    GRPC->>WarmupThread: spawn warmup connection to formatted host:port
    GRPC->>Scheduler: launch scheduler processes (inherits PROMETHEUS multiproc dir)
    GRPC->>MetricsHTTP: start metrics HTTP server on derived metrics port
    Prometheus->>MetricsHTTP: scrape /metrics
    Note right of MetricsHTTP: serves multiprocess CollectorRegistry /metrics
    Client->>GRPC: shutdown request
    GRPC->>MetricsHTTP: shutdown, close, join thread (timeout)
    GRPC->>WarmupThread: join warmup thread
    GRPC->>Scheduler: stop schedulers
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped from host to port with cheer,

metrics whispered close and near,
sidecar hummed a tiny tune,
schedulers danced beneath the moon,
Prometheus peered — all systems clear!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the main change: adding an HTTP sidecar for metrics ingestion from SGLang gRPC workers, which is the core objective of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses the current limitation where SGLang gRPC workers do not expose engine metrics directly. It introduces a new architecture that enables observability for these workers by deploying a dedicated HTTP sidecar alongside each gRPC servicer. This sidecar collects metrics from a shared Prometheus multiprocess directory and exposes them via an HTTP endpoint, allowing the SMG /engine_metrics endpoint to successfully retrieve and aggregate performance data from all worker types.

Highlights

  • Enabled Prometheus Metrics for gRPC Workers: Introduced a mechanism to collect and expose Prometheus metrics from SGLang gRPC workers, which previously lacked this capability.
  • HTTP Metrics Sidecar: Implemented a lightweight HTTP sidecar for gRPC servicers, running on grpc_port + 1, to serve /metrics endpoints for Prometheus scraping.
  • Worker Manager Integration: Updated the worker manager to intelligently route metric collection requests, converting gRPC worker addresses to the corresponding HTTP sidecar addresses for metrics retrieval.
  • Prometheus Multiprocess Directory Setup: Configured the Prometheus multiprocess directory in the gRPC servicer to ensure child scheduler processes correctly write metrics to shared files.

@github-actions github-actions bot added the model-gateway Model gateway crate changes label Mar 19, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces Prometheus metrics collection for the gRPC servicer and model gateway. In sglang/server.py, it adds functionality to set up Prometheus multiprocess directories and starts a dedicated HTTP server to expose /metrics endpoints, running on grpc_port + 1. In worker_manager.rs, the fan_out mechanism is refactored to support custom URL building, and a new metrics_url function is added to correctly determine the metrics endpoint for both HTTP and gRPC workers (by inferring the sidecar port for gRPC). Review comments suggest moving imports related to Prometheus and the HTTP server to the top of sglang/server.py for better Python best practices and dependency management. Additionally, the reviewer noted that the new metrics HTTP server in sglang/server.py lacks graceful shutdown, potentially leading to resource leaks or un-flushed metrics, and recommended explicit lifecycle management. Finally, the reviewer pointed out that the URL parsing logic in metrics_url in worker_manager.rs could be more robust, as it might silently generate incorrect URLs for malformed gRPC worker addresses, suggesting improved error handling and logging.

Comment on lines +45 to +47
from sglang.srt.utils import set_prometheus_multiproc_dir

set_prometheus_multiproc_dir()

medium

It's generally considered best practice in Python to place all import statements at the top of the file, outside of conditional blocks. Importing inside an if statement can lead to unexpected behavior if the condition is not met, or if the module is later needed elsewhere in the file where the import might not have executed. Moving this import to the top of the file improves readability and ensures the dependency is always available.

from sglang.srt.utils import set_prometheus_multiproc_dir

Comment on lines +277 to +278
metrics_port = server_args.port + 1
_start_metrics_http_server(server_args.host, metrics_port)

medium

The _start_metrics_http_server function starts a daemon thread that runs httpd.serve_forever(). While daemon=True ensures the program can exit, it doesn't allow for a graceful shutdown of the HTTP server itself. For better resource management and to ensure metrics are properly flushed before exit, the serve_grpc function should manage the lifecycle of the metrics HTTP server, including calling httpd.shutdown() during the overall server shutdown process. This would require _start_metrics_http_server to return the HTTPServer instance and the thread.

References
  1. When managing resources (e.g., workers) that are acquired and need to be released, ensure that all acquired resources are tracked immediately and consistently, especially when exceptions might occur during subsequent operations (e.g., launching new resources). This prevents resource leaks by ensuring that even partially successful acquisitions are accounted for cleanup.

Comment on lines +335 to +337
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

medium

Similar to the previous comment, it's best practice to place imports at the top of the file. Importing http.server and prometheus_client inside this function means they are imported every time the function is called (though in this case, it's only once). Moving these to the top of the file improves code clarity and adheres to standard Python style.

from http.server import BaseHTTPRequestHandler, HTTPServer
from prometheus_client import CollectorRegistry, generate_latest, multiprocess

Comment on lines +101 to +110
let stripped = url
    .trim_start_matches("grpc://")
    .trim_start_matches("http://")
    .trim_start_matches("https://");
if let Some((host, port_str)) = stripped.rsplit_once(':') {
    if let Ok(port) = port_str.parse::<u16>() {
        return format!("http://{}:{}/{}", host, port + 1, endpoint);
    }
}
format!("http://{stripped}/{endpoint}")

medium

The URL parsing logic in metrics_url for ConnectionMode::Grpc could be more robust. If stripped.rsplit_once(':') returns None (meaning no colon, so no port) or port_str.parse::<u16>() fails, the function falls back to format!("http://{stripped}/{endpoint}"). This might result in an invalid or unintended URL if the original grpc:// address was malformed (e.g., grpc://host without a port). It would be safer to log a warning or error and perhaps return an Option<String> or Result<String, Error> to indicate a failure in constructing a valid metrics URL, rather than silently creating a potentially incorrect one.

References
  1. Instead of silently ignoring potential failures (e.g., from serialization), log them as warnings to aid in debugging. In Rust, prefer using unwrap_or_else to log an error over unwrap_or_default which would fail silently.

scottjlee and others added 3 commits March 18, 2026 21:40
@scottjlee scottjlee marked this pull request as ready for review March 19, 2026 19:55

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 918c8c9c7b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +278 to +279
metrics_port = server_args.port + 1
metrics_httpd = _start_metrics_http_server(server_args.host, metrics_port)

P1 Badge Avoid deriving the metrics sidecar port from grpc_port + 1

Using server_args.port + 1 here breaks common multi-worker gRPC layouts: the repo’s own example in docs/getting-started/grpc-workers.md:103-105 uses grpc://worker1:50051 grpc://worker2:50052, so worker 1’s metrics sidecar will try to bind 50052, colliding with worker 2’s gRPC listener. With enable_metrics on, adjacent worker ports make startup fail nondeterministically or leave one worker without metrics, so the sidecar needs a collision-free port assignment instead of assuming +1 is free.


    def log_message(self, format, *args):
        pass

httpd = HTTPServer((host, port), MetricsHandler)

P1 Badge Use an IPv6-capable HTTP server for the metrics sidecar

server_args.host is documented to support IPv6 values such as :: and ::1 (docs/reference/configuration.md:34-37), but HTTPServer is AF_INET-only. In that configuration HTTPServer((host, port), ...) raises gaierror: Address family for hostname not supported, so enabling metrics turns an otherwise valid IPv6 gRPC worker into a startup failure. The sidecar needs to select an AF_INET6 server when the host is IPv6.

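One way to address the IPv6 issue raised above is to choose the address family from the host literal before binding. This is a hypothetical sketch (`DualFamilyHTTPServer` is not a name from this PR); `socketserver.TCPServer` reads `self.address_family` when creating its socket, so setting it before `super().__init__` is sufficient:

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


class DualFamilyHTTPServer(HTTPServer):
    """HTTPServer that binds with AF_INET6 when the host looks like an
    IPv6 literal (e.g. '::' or '::1'); plain HTTPServer is AF_INET-only."""

    def __init__(self, server_address, handler_cls, bind_and_activate=True):
        host, _port = server_address
        if ":" in host:
            # Must be set before super().__init__ creates the socket.
            self.address_family = socket.AF_INET6
        super().__init__(server_address, handler_cls, bind_and_activate)
```

A production version would likely resolve the host with `socket.getaddrinfo` rather than checking for a colon, but the colon heuristic covers the documented `::` and `::1` cases.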


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/core/worker_manager.rs (1)

56-79: ⚠️ Potential issue | 🟠 Major

Do not forward worker API keys to the unauthenticated metrics endpoint.

fan_out_with() unconditionally applies bearer auth to all requests, and get_engine_metrics() uses it to fetch metrics from http://host:(port+1)/metrics for gRPC workers. The metrics HTTP server in sglang/server.py has no authentication check, so this unnecessarily exposes bearer tokens to an endpoint that doesn't validate them. Make auth opt-in per call site or explicitly skip bearer_auth() when targeting the metrics endpoint.



📥 Commits

Reviewing files that changed from the base of the PR and between e231a73 and 918c8c9.

📒 Files selected for processing (2)
  • grpc_servicer/smg_grpc_servicer/sglang/server.py
  • model_gateway/src/core/worker_manager.rs

Comment on lines +101 to +107
let stripped = url
.trim_start_matches("grpc://")
.trim_start_matches("http://")
.trim_start_matches("https://");
if let Some((host, port_str)) = stripped.rsplit_once(':') {
if let Ok(port) = port_str.parse::<u16>() {
return format!("http://{}:{}/{}", host, port + 1, endpoint);

⚠️ Potential issue | 🟡 Minor


Guard the port + 1 arithmetic with checked_add(1).

For grpc://host:65535, port + 1 overflows u16—with overflow checks enabled it panics, and without them it wraps to 0, producing http://host:0/.... Use checked_add(1) and fall back to the existing graceful error path on line 110 instead of deriving a wrapped port.


Signed-off-by: Scott Lee <scott@together.ai>
Signed-off-by: Scott Lee <scott@together.ai>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7634160757


}

fn parse_worker_host_port(url: &str) -> Option<(String, u16)> {
    let parsed = url::Url::parse(strip_dp_rank_suffix(url)).ok()?;

P1: Preserve DP ranks when scraping gRPC engine metrics

When --dp-aware is enabled, CreateWorkerStep registers one Worker per rank (model_gateway/src/core/steps/worker/local/create_worker.rs:126-157). Calling strip_dp_rank_suffix() here collapses all of those gRPC URLs onto the same sidecar endpoint, but get_engine_metrics() still labels each response with the original worker_addr and aggregate_metrics() keeps both series by adding that label (model_gateway/src/core/worker_manager.rs:356-363, model_gateway/src/core/metrics_aggregator.rs:42-48). In a DP-aware SGLang deployment, /engine_metrics will therefore emit the same exposition once per rank, multiplying every counter/gauge by dp_size instead of reporting the real values.


Comment on lines +49 to +52
if grpc_port + METRICS_PORT_OFFSET <= MAX_TCP_PORT:
    return grpc_port + METRICS_PORT_OFFSET

wrapped_port = grpc_port - METRICS_PORT_OFFSET

P2: Use a collision-free mapping for the sidecar metrics port

This offset-and-wrap rule still collides for same-host workers whose gRPC ports differ by 10,000. For example, a worker on 50051 binds its sidecar on 60051, while a worker on 60051 wraps back to 50051, so each sidecar tries to take the other worker's gRPC listener. Because serve_grpc() binds the HTTP sidecar with this function and the gateway mirrors the same mapping in model_gateway/src/core/worker_manager.rs, enabling metrics still breaks valid multi-worker layouts even though the adjacent-port case was fixed.

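The collision this comment describes can be reproduced with a small sketch of the offset-and-wrap rule, mirroring the Python snippet quoted above (constant and function names are taken from that snippet, not from the actual gateway code):

```rust
// Sketch of the offset-and-wrap sidecar port mapping described in the PR.
const METRICS_PORT_OFFSET: u16 = 10_000;

fn metrics_port(grpc_port: u16) -> u16 {
    // In u16 arithmetic, checked_add returns None exactly when the sum
    // would exceed the maximum TCP port (65_535); in that case the rule
    // wraps downward instead, which is where the collision comes from.
    grpc_port
        .checked_add(METRICS_PORT_OFFSET)
        .unwrap_or(grpc_port - METRICS_PORT_OFFSET)
}
```

Two same-host workers on 50051 and 60051 map onto each other's listeners: `metrics_port(50051)` is 60051, while `metrics_port(60051)` wraps back to 50051, so each sidecar tries to bind the other worker's gRPC port.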


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/core/worker_manager.rs (1)

64-80: ⚠️ Potential issue | 🟠 Major

Do not forward worker API keys to the derived gRPC metrics URL.

fan_out_with() still unconditionally applies worker.api_key(), but metrics_url() rewrites gRPC workers to http://host:.../metrics. If a gRPC worker has an API key configured, this sends that secret over plain HTTP to an unauthenticated sidecar that does not need it. Plumb auth separately here and suppress it for gRPC metrics scrapes.

Also applies to: 125-143

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 64 - 80, The current
fan_out_with() builder unconditionally applies worker.api_key() when building
requests (in the async block creating WorkerResponse), which forwards gRPC
worker API keys to rewritten metrics URLs; change this so auth is only attached
when appropriate by detecting metrics scrapes and skipping bearer_auth for gRPC
metrics. Concretely: in the closure in fan_out_with()/the async move block that
calls url_fn(worker, endpoint) (and constructs full_url), determine whether this
request is a metrics scrape for a gRPC worker (e.g., via worker.kind()/is_grpc()
or by comparing endpoint/metrics_url semantics) and only call
req.bearer_auth(api_key) when NOT a gRPC metrics scrape; keep WorkerResponse {
url: worker_url, result: ... } behavior otherwise and ensure any other places
(the second occurrence around lines 125-143) follow the same conditional auth
logic.
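The suggested gating can be sketched as a small predicate. `should_attach_bearer_auth` and its string-based inputs are hypothetical; the real code would derive them from the worker type and endpoint semantics in worker_manager.rs:

```rust
// Sketch of the suggested auth gating: attach the worker API key only
// when the request is NOT a metrics scrape rewritten to a gRPC worker's
// HTTP sidecar, so the key never travels to the unauthenticated sidecar.
fn should_attach_bearer_auth(worker_url: &str, endpoint: &str) -> bool {
    let is_grpc_worker = worker_url.starts_with("grpc://");
    let is_metrics_scrape = endpoint == "/metrics";
    !(is_grpc_worker && is_metrics_scrape)
}
```

The fan-out closure would then call `req.bearer_auth(api_key)` only when this predicate holds, leaving all non-metrics gRPC traffic and all HTTP workers unchanged.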

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 363b0059-98f4-4f61-a936-730279a9e7da

📥 Commits

Reviewing files that changed from the base of the PR and between 918c8c9 and 7634160.

📒 Files selected for processing (2)
  • grpc_servicer/smg_grpc_servicer/sglang/server.py
  • model_gateway/src/core/worker_manager.rs

Comment on lines +92 to +99
fn strip_dp_rank_suffix(url: &str) -> &str {
    if let Some(at_pos) = url.rfind('@') {
        let suffix = &url[at_pos + 1..];
        if !suffix.is_empty() && suffix.chars().all(|c| c.is_ascii_digit()) {
            return &url[..at_pos];
        }
    }
    url

⚠️ Potential issue | 🟠 Major

Deduplicate scrapes that collapse onto the same sidecar endpoint.

strip_dp_rank_suffix() intentionally makes URLs like grpc://host:port@0 and @1 resolve to the same metrics URL, but get_engine_metrics() still fans out once per logical worker. That will fetch the same multiprocess payload multiple times and emit duplicate worker_addr series, inflating any cross-worker sums. De-dup on the derived metrics URL before calling fan_out_with(), and use a canonical label for that endpoint.

Also applies to: 125-143, 347-354

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 92 - 99, The current
logic in strip_dp_rank_suffix() makes addresses like "grpc://host:port@0" and
"@1" canonicalize to the same metrics URL but get_engine_metrics() still fans
out per original worker address, causing duplicate scrapes and duplicate
worker_addr series; update get_engine_metrics() (and the other similar blocks at
the regions referenced) to first map each original worker address to its
canonical metrics URL via strip_dp_rank_suffix(), deduplicate the list of
canonical URLs, and then call fan_out_with() once per unique canonical URL; also
ensure that the emitted label for the endpoint uses the canonical URL (or a new
canonical_label) instead of the original per-worker address so identical
endpoints produce a single metrics series.
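The suggested dedup can be sketched on top of the `strip_dp_rank_suffix()` shown in the diff; `dedup_scrape_targets` is a hypothetical helper, not part of the actual worker_manager.rs API:

```rust
use std::collections::HashSet;

// Copied from the quoted diff hunk: strips a trailing "@<dp_rank>" suffix
// so per-rank addresses canonicalize to one sidecar endpoint.
fn strip_dp_rank_suffix(url: &str) -> &str {
    if let Some(at_pos) = url.rfind('@') {
        let suffix = &url[at_pos + 1..];
        if !suffix.is_empty() && suffix.chars().all(|c| c.is_ascii_digit()) {
            return &url[..at_pos];
        }
    }
    url
}

// Hypothetical helper: keep one scrape target per canonical URL so the
// fan-out hits each sidecar exactly once, regardless of dp_size.
fn dedup_scrape_targets<'a>(workers: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    workers
        .iter()
        .map(|w| strip_dp_rank_suffix(w))
        .filter(|canonical| seen.insert(*canonical))
        .collect()
}
```

With this in place, `["grpc://h:8001@0", "grpc://h:8001@1"]` yields a single scrape of `grpc://h:8001`, and the canonical URL doubles as the series label.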

@github-actions

github-actions bot commented Apr 4, 2026

This pull request has been automatically marked as stale because it has not had any activity within 14 days. It will be automatically closed if no further activity occurs within 16 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale PR has been inactive for 14+ days label Apr 4, 2026
@mergify
Contributor

mergify bot commented Apr 9, 2026

Hi @scottjlee, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase PR has merge conflicts that need to be resolved label Apr 9, 2026
@scottjlee scottjlee closed this Apr 9, 2026
@lightseek-bot lightseek-bot deleted the sjl/0318-engine-metrics-sglang branch April 10, 2026 04:02

Labels

model-gateway Model gateway crate changes needs-rebase PR has merge conflicts that need to be resolved stale PR has been inactive for 14+ days
