@Bihan Bihan commented Oct 21, 2025

Intro
We want to make it possible to create a gateway that extends the standard gateway functionality with additional features (all sgl-router features, such as cache-aware routing), while keeping all the standard gateway features (such as authentication and rate limits).

For the user, using such a gateway should be very simple, e.g. setting router to sglang in the gateway configuration:

type: gateway
name: sglang-gateway

backend: aws
region: eu-west-1

domain: example.com
router: sglang

Everything else should look the same to the user: the same service endpoint, working authentication and rate limits, etc.

While this first experimental version brings only the minimum features (routing replica traffic through the router: dstack's gateway/nginx -> sglang-router -> replica workers), in the future it may be extended with router-specific scaling metrics, such as TTFT, E2E latency, Prefill-Decode Disaggregation, etc.

As this is the first experimental version, the most critical goal is to come up with the minimum, thoroughly tested changes that allow embedding router: sglang without breaking any existing functionality.

Note:

  1. In this version, pip and sglang-router are installed on the gateway machine regardless of whether router: sglang is present in the gateway config. To make this conditional in the future, it should be implemented across all backends that support gateways.

  2. Modified the upstream block of src/dstack/_internal/proxy/gateway/resources/nginx/service.jinja2 to respect router: sglang in the gateway config:

upstream {{ domain }}.upstream {
    {% if router == "sglang" %}
    server 127.0.0.1:3000;  # SGLang router on the gateway
    {% else %}
    {% for replica in replicas %}
    server unix:{{ replica.socket }};  # replica {{ replica.id }}
    {% endfor %}
    {% endif %}
}
  3. Created a new nginx conf: src/dstack/_internal/proxy/gateway/resources/nginx/sglang_workers.jinja2

This nginx conf forwards HTTP to Unix sockets. dstack workers listen on Unix sockets, while the sglang-router speaks HTTP, so this bridge lets the router reach each worker via a local TCP port.

# Worker 1
upstream sglang_worker_1_upstream {
    server unix:/tmp/tmpazynu7m5/replica.sock;
}

server {
    listen 127.0.0.1:10001;
    access_log off; # disable access logs for this internal endpoint

    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://sglang_worker_1_upstream;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Connection "";
        proxy_set_header Upgrade $http_upgrade;
    }
}

# Worker 2
upstream sglang_worker_2_upstream {
    server unix:/tmp/tmpazynu7m6/replica.sock;
}

server {
    listen 127.0.0.1:10002;
    access_log off; # disable access logs for this internal endpoint

    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://sglang_worker_2_upstream;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Connection "";
        proxy_set_header Upgrade $http_upgrade;
    }
}
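For illustration, the per-worker bridge above could be generated along these lines. This is a hypothetical sketch: the block template, the `render_worker_conf` function name, and the `10000 + index` port scheme are assumptions for illustration only, while the actual config is rendered from sglang_workers.jinja2:

```python
# Hypothetical sketch: emit one HTTP->Unix-socket bridge block per replica so
# the sglang-router (which speaks HTTP) can reach workers on local TCP ports.
# The template and port scheme are assumptions, not the dstack implementation.
WORKER_BLOCK = """\
# Worker {idx}
upstream sglang_worker_{idx}_upstream {{
    server unix:{socket};
}}

server {{
    listen 127.0.0.1:{port};
    access_log off;

    location / {{
        proxy_pass http://sglang_worker_{idx}_upstream;
        proxy_http_version 1.1;
    }}
}}
"""


def render_worker_conf(sockets: list[str], base_port: int = 10000) -> str:
    """Assign each replica socket a local TCP port (base_port + 1, + 2, ...)."""
    return "\n".join(
        WORKER_BLOCK.format(idx=i, socket=sock, port=base_port + i)
        for i, sock in enumerate(sockets, start=1)
    )


if __name__ == "__main__":
    print(render_worker_conf(["/tmp/a/replica.sock", "/tmp/b/replica.sock"]))
```

The resulting ports (127.0.0.1:10001, 127.0.0.1:10002, ...) are what the router is pointed at as its worker URLs.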

How To Test

Step 1
Replace the return value of the get_dstack_gateway_wheel method as shown in the example below (for the exact path, see here).

Eg:

def get_dstack_gateway_wheel(build: str) -> str:
    channel = "release" if settings.DSTACK_RELEASE else "stgn"
    base_url = f"https://dstack-gateway-downloads.s3.amazonaws.com/{channel}"
    if build == "latest":
        r = requests.get(f"{base_url}/latest-version", timeout=5)
        r.raise_for_status()
        build = r.text.strip()
        logger.debug("Found the latest gateway build: %s", build)
    # return f"{base_url}/dstack_gateway-{build}-py3-none-any.whl"
    return "https://bihan-test-bucket.s3.eu-west-1.amazonaws.com/dstack_gateway-0.0.0-py3-none-any.whl"

Step 2

Apply the gateway config below.

type: gateway
name: sglang-gateway

backend: aws
region: eu-west-1

domain: example.com
router: sglang

Step 3
Update the DNS records for the gateway domain.

Step 4

Apply the service config below.

type: service
name: sglang-service

python: 3.12
nvcc: true

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct

commands:
  - pip install --upgrade pip
  - pip install uv
  - uv pip install sglang --prerelease=allow
  - python -m sglang.launch_server --model-path $MODEL_ID --host 0.0.0.0 --port 8000 --enable-metrics

port: 8000
model: meta-llama/Llama-3.2-3B-Instruct

resources:
  gpu: 24GB

replicas: 0..3
scaling:
  metric: rps
  target: 1

Step 5
To automate requests and test autoscaling, you can use the script below:
autoscale_test_sglang.py

import asyncio
import aiohttp
import time

# ==== Configuration ====
URL = "https://sglang-service.bihan-gateway.dstack.ai/v1/chat/completions" # <-- replace with your endpoint
TOKEN = "esdfds3263-c36d-41db-ba9b-0d31df4efb15e"   # <-- replace with your token
RPS = 2            # requests per second
DURATION = 1800        # duration in seconds
# =======================

HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}

PAYLOAD = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"}
    ]
}


async def send_request(session, idx):
    """Send a single request and print full response"""
    try:
        async with session.post(URL, headers=HEADERS, json=PAYLOAD) as resp:
            text = await resp.text()
            print(f"\n[{idx}] Status: {resp.status}")
            print(f"Response:\n{text}\n")
    except Exception as e:
        print(f"[{idx}] Error: {e}")


async def run_load_test():
    total_requests = RPS * DURATION
    interval = 1.0 / RPS

    async with aiohttp.ClientSession() as session:
        start_time = time.perf_counter()
        tasks = []

        for i in range(total_requests):
            tasks.append(asyncio.create_task(send_request(session, i + 1)))
            await asyncio.sleep(interval)

        await asyncio.gather(*tasks)

        elapsed = time.perf_counter() - start_time
        print(f"\n✅ Sent {total_requests} requests in {elapsed:.2f}s "
              f"(~{total_requests/elapsed:.2f} RPS)")


if __name__ == "__main__":
    asyncio.run(run_load_test())

Step 6
After updating the token and the service endpoint, run the script above from your local machine: python autoscale_test_sglang.py.

Once the automated requests start hitting the service endpoint, dstack submits the job. When the service gets deployed and the /health check from the sglang-router responds with 200 (as shown in the logs below), you will start to see responses from the model.

As the automated requests continue, dstack first scales up to 3 jobs and later adjusts to 2 jobs. If we stop the requests, dstack scales down to 0 jobs.
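This behavior follows from metric: rps with target: 1 and replicas: 0..3. A simplified sketch of the proportional-scaling arithmetic, for intuition only (not dstack's actual autoscaler implementation; the function name and rule are assumptions):

```python
import math


def desired_replicas(observed_rps: float, target_rps: float,
                     min_replicas: int, max_replicas: int) -> int:
    # Simplified proportional rule: run enough replicas so that each one
    # handles roughly target_rps. Illustrative only, not dstack's code.
    if observed_rps <= 0:
        return min_replicas  # with replicas: 0..3 this means scaling down to 0
    wanted = math.ceil(observed_rps / target_rps)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(2.0, 1.0, 0, 3))   # script sends ~2 RPS -> 2 replicas
print(desired_replicas(0.0, 1.0, 0, 3))   # traffic stopped -> 0 replicas
print(desired_replicas(10.0, 1.0, 0, 3))  # capped by the replicas: 0..3 max
```

Under this rule, the sustained ~2 RPS load from the test script settles at 2 replicas, and stopping the requests drives the count to 0, matching the observed behavior.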

Logs:

[2025-10-16 07:01:38] INFO:     Application startup complete.
[2025-10-16 07:01:38] INFO:     Uvicorn running on https://sglang-service.bihan-gateway.dstack.ai (Press CTRL+C to quit)
[2025-10-16 07:01:39] INFO:     127.0.0.1:3580 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-16 07:01:39] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:02:07] INFO:     127.0.0.1:3906 - "GET /health HTTP/1.1" 503 Service Unavailable
[2025-10-16 07:02:46] INFO:     127.0.0.1:3592 - "POST /generate HTTP/1.1" 200 OK
[2025-10-16 07:02:46] The server is fired up and ready to roll!
[2025-10-16 07:03:07] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:03:08] INFO:     127.0.0.1:3516 - "GET /health HTTP/1.1" 200 OK
[2025-10-16 07:03:08] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:03:09] INFO:     127.0.0.1:3790 - "GET /health HTTP/1.1" 200 OK

Step 7
You can also use the dstack frontend at http://localhost:3000/projects/main/models/sglang-service for manual requests.

Note: You can check the sglang-router logs with cat ~/dstack/router_logs/sgl-router.

Also, maybe in the future we can show the sglang-router's logs instead of the replica's logs in the dstack CLI.

Eg:

sglang-service provisioning completed (running)
Service is published at:
  https://sglang-service.bihan-gateway.dstack.ai
Model meta-llama/Llama-3.2-3B-Instruct is published at:
  https://gateway.bihan-gateway.dstack.ai


2025-10-16 06:59:05  INFO sglang_router_rs::core::worker_manager: src/core/worker_manager.rs:1077: Waiting for 2 workers to become healthy. Unhealthy: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
...
...
2025-10-16 07:03:08  INFO sglang_router_rs::core::worker_manager: src/core/worker_manager.rs:1111: All 2 workers are healthy: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
...
...
2025-10-16 07:03:08  INFO sglang_router_rs::server: src/server.rs:1066: Router ready | workers: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
2025-10-16 07:03:08  INFO sglang_router_rs::server: src/server.rs:1094: Starting server on 0.0.0.0:3000 

Bihan commented Oct 21, 2025

Completed

  1. Add an example of how to test the new router

  2. Please ensure auto-scaling works (incl. downscaling to 0), and also that dstack uses the router's API to add/remove workers without restarting the gateway

Next Steps

  1. And only after that, refactor the code to move the sgl-router implementation to a separate sglang-related subclass, to ensure the normal gateway code doesn't contain any sgl-router-specific code (similar to how each backend encapsulates its own logic)

  2. Ensure tests are working

@Bihan Bihan requested a review from peterschmidt85 October 21, 2025 03:32
