@Bihan Bihan commented Oct 21, 2025

Intro
We want to make it possible to create a gateway that extends the standard gateway functionality with additional features (all sgl-router features, such as cache-aware routing), while keeping all the standard gateway features (such as authentication and rate limits).

For the user, using such a gateway should be very simple, e.g. setting router to sglang in the gateway configuration:

type: gateway
name: sglang-gateway

backend: aws
region: eu-west-1

domain: example.com
router: sglang

Everything else should look the same to the user: the same service endpoint, working authentication and rate limits, etc.

While this first experimental version brings only the minimum features (routing replica traffic through the router: dstack's gateway/nginx -> sglang-router -> replica workers), in the future it may be extended with router-specific scaling metrics, such as TTFT, E2E latency, Prefill-Decode Disaggregation, etc.

As this is the first experimental version, the most critical goal is to come up with the minimum, thoroughly tested changes that allow embedding router: sglang without breaking any existing functionality.

Note:

  1. In this version, pip and sglang-router are installed on the gateway machine regardless of whether router: sglang is present in the gateway config. To make this conditional in the future, it should be implemented across all backends that support gateways.

  2. Modified the upstream block of src/dstack/_internal/proxy/gateway/resources/nginx/service.jinja2 to respect router: sglang in the gateway config:

upstream {{ domain }}.upstream {
    {% if router == "sglang" %}
    server 127.0.0.1:3000;  # SGLang router on the gateway
    {% else %}
    {% for replica in replicas %}
    server unix:{{ replica.socket }};  # replica {{ replica.id }}
    {% endfor %}
    {% endif %}
}
  3. Created a new nginx conf: src/dstack/_internal/proxy/gateway/resources/nginx/sglang_workers.jinja2

This nginx conf forwards HTTP to Unix sockets. dstack workers listen on Unix sockets, while the sglang-router speaks HTTP, so this bridge lets the router reach each worker via a local TCP port.

# Worker 1
upstream sglang_worker_1_upstream {
    server unix:/tmp/tmpazynu7m5/replica.sock;
}

server {
    listen 127.0.0.1:10001;
    access_log off; # disable access logs for this internal endpoint

    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://sglang_worker_1_upstream;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Connection "";
        proxy_set_header Upgrade $http_upgrade;
    }
}

# Worker 2
upstream sglang_worker_2_upstream {
    server unix:/tmp/tmpazynu7m6/replica.sock;
}

server {
    listen 127.0.0.1:10002;
    access_log off; # disable access logs for this internal endpoint

    proxy_read_timeout 300s;
    proxy_send_timeout 300s;

    location / {
        proxy_pass http://sglang_worker_2_upstream;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Connection "";
        proxy_set_header Upgrade $http_upgrade;
    }
}
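For illustration, the per-worker bridge above could be generated along these lines. This is a hypothetical sketch: the block template, the `render_worker_conf` function name, and the `10000 + index` port scheme are assumptions for illustration only, while the actual config is rendered from sglang_workers.jinja2:

```python
# Hypothetical sketch: emit one HTTP->Unix-socket bridge block per replica so
# the sglang-router (which speaks HTTP) can reach workers on local TCP ports.
# The template and port scheme are assumptions, not the dstack implementation.
WORKER_BLOCK = """\
# Worker {idx}
upstream sglang_worker_{idx}_upstream {{
    server unix:{socket};
}}

server {{
    listen 127.0.0.1:{port};
    access_log off;

    location / {{
        proxy_pass http://sglang_worker_{idx}_upstream;
        proxy_http_version 1.1;
    }}
}}
"""


def render_worker_conf(sockets: list[str], base_port: int = 10000) -> str:
    """Assign each replica socket a local TCP port (base_port + 1, + 2, ...)."""
    return "\n".join(
        WORKER_BLOCK.format(idx=i, socket=sock, port=base_port + i)
        for i, sock in enumerate(sockets, start=1)
    )


if __name__ == "__main__":
    print(render_worker_conf(["/tmp/a/replica.sock", "/tmp/b/replica.sock"]))
```

The resulting ports (127.0.0.1:10001, 127.0.0.1:10002, ...) are what the router is pointed at as its worker URLs.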

How To Test

Step 1
Replace the return value of the get_dstack_gateway_wheel method as shown in the example below (for the exact path, see here).

Eg:

def get_dstack_gateway_wheel(build: str) -> str:
    channel = "release" if settings.DSTACK_RELEASE else "stgn"
    base_url = f"https://dstack-gateway-downloads.s3.amazonaws.com/{channel}"
    if build == "latest":
        r = requests.get(f"{base_url}/latest-version", timeout=5)
        r.raise_for_status()
        build = r.text.strip()
        logger.debug("Found the latest gateway build: %s", build)
    # return f"{base_url}/dstack_gateway-{build}-py3-none-any.whl"
    return "https://bihan-test-bucket.s3.eu-west-1.amazonaws.com/dstack_gateway-0.0.0-py3-none-any.whl"

Step 2

Apply the gateway config below.

type: gateway
name: sglang-gateway

backend: aws
region: eu-west-1

domain: example.com
router: sglang

Step 3
Update the DNS records for the gateway domain.

Step 4

Apply the service config below.

type: service
name: sglang-service

python: 3.12
nvcc: true

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct

commands:
  - pip install --upgrade pip
  - pip install uv
  - uv pip install sglang --prerelease=allow
  - python -m sglang.launch_server --model-path $MODEL_ID --host 0.0.0.0 --port 8000 --enable-metrics

port: 8000
model: meta-llama/Llama-3.2-3B-Instruct

resources:
  gpu: 24GB

replicas: 0..3
scaling:
  metric: rps
  target: 1

Step 5
To automate requests and test autoscaling, you can use the script below:
autoscale_test_sglang.py

import asyncio
import aiohttp
import time

# ==== Configuration ====
URL = "https://sglang-service.bihan-gateway.dstack.ai/v1/chat/completions" # <-- replace with your endpoint
TOKEN = "esdfds3263-c36d-41db-ba9b-0d31df4efb15e"   # <-- replace with your token
RPS = 2            # requests per second
DURATION = 1800        # duration in seconds
# =======================

HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}

PAYLOAD = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"}
    ]
}


async def send_request(session, idx):
    """Send a single request and print full response"""
    try:
        async with session.post(URL, headers=HEADERS, json=PAYLOAD) as resp:
            text = await resp.text()
            print(f"\n[{idx}] Status: {resp.status}")
            print(f"Response:\n{text}\n")
    except Exception as e:
        print(f"[{idx}] Error: {e}")


async def run_load_test():
    total_requests = RPS * DURATION
    interval = 1.0 / RPS

    async with aiohttp.ClientSession() as session:
        start_time = time.perf_counter()
        tasks = []

        for i in range(total_requests):
            tasks.append(asyncio.create_task(send_request(session, i + 1)))
            await asyncio.sleep(interval)

        await asyncio.gather(*tasks)

        elapsed = time.perf_counter() - start_time
        print(f"\n✅ Sent {total_requests} requests in {elapsed:.2f}s "
              f"(~{total_requests/elapsed:.2f} RPS)")


if __name__ == "__main__":
    asyncio.run(run_load_test())

Step 6
After updating the token and the service endpoint, run the script above from your local machine: python autoscale_test_sglang.py.

Once the automated requests start hitting the service endpoint, dstack submits the job. When the service gets deployed and the /health check from the sglang-router responds with 200 (as shown in the logs below), you will start to see responses from the model.

As the automated requests continue, dstack first scales up to 3 jobs and later adjusts to 2 jobs. If we stop the requests, dstack scales down to 0 jobs.
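This behavior follows from metric: rps with target: 1 and replicas: 0..3. A simplified sketch of the proportional-scaling arithmetic, for intuition only (not dstack's actual autoscaler implementation; the function name and rule are assumptions):

```python
import math


def desired_replicas(observed_rps: float, target_rps: float,
                     min_replicas: int, max_replicas: int) -> int:
    # Simplified proportional rule: run enough replicas so that each one
    # handles roughly target_rps. Illustrative only, not dstack's code.
    if observed_rps <= 0:
        return min_replicas  # with replicas: 0..3 this means scaling down to 0
    wanted = math.ceil(observed_rps / target_rps)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(2.0, 1.0, 0, 3))   # script sends ~2 RPS -> 2 replicas
print(desired_replicas(0.0, 1.0, 0, 3))   # traffic stopped -> 0 replicas
print(desired_replicas(10.0, 1.0, 0, 3))  # capped by the replicas: 0..3 max
```

Under this rule, the sustained ~2 RPS load from the test script settles at 2 replicas, and stopping the requests drives the count to 0, matching the observed behavior.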

Logs:

[2025-10-16 07:01:38] INFO:     Application startup complete.
[2025-10-16 07:01:38] INFO:     Uvicorn running on https://sglang-service.bihan-gateway.dstack.ai (Press CTRL+C to quit)
[2025-10-16 07:01:39] INFO:     127.0.0.1:3580 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-16 07:01:39] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:02:07] INFO:     127.0.0.1:3906 - "GET /health HTTP/1.1" 503 Service Unavailable
[2025-10-16 07:02:46] INFO:     127.0.0.1:3592 - "POST /generate HTTP/1.1" 200 OK
[2025-10-16 07:02:46] The server is fired up and ready to roll!
[2025-10-16 07:03:07] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:03:08] INFO:     127.0.0.1:3516 - "GET /health HTTP/1.1" 200 OK
[2025-10-16 07:03:08] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-16 07:03:09] INFO:     127.0.0.1:3790 - "GET /health HTTP/1.1" 200 OK

Step 7
You can also use the dstack frontend at http://localhost:3000/projects/main/models/sglang-service for manual requests.

Note: You can check the sglang-router logs with cat ~/dstack/router_logs/sgl-router.

Also, maybe in the future we can show the sglang-router's logs instead of the replica's logs in the dstack CLI.

Eg:

sglang-service provisioning completed (running)
Service is published at:
  https://sglang-service.bihan-gateway.dstack.ai
Model meta-llama/Llama-3.2-3B-Instruct is published at:
  https://gateway.bihan-gateway.dstack.ai


2025-10-16 06:59:05  INFO sglang_router_rs::core::worker_manager: src/core/worker_manager.rs:1077: Waiting for 2 workers to become healthy. Unhealthy: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
...
...
2025-10-16 07:03:08  INFO sglang_router_rs::core::worker_manager: src/core/worker_manager.rs:1111: All 2 workers are healthy: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
...
...
2025-10-16 07:03:08  INFO sglang_router_rs::server: src/server.rs:1066: Router ready | workers: ["http://127.0.0.1:10002", "http://127.0.0.1:10001"]
2025-10-16 07:03:08  INFO sglang_router_rs::server: src/server.rs:1094: Starting server on 0.0.0.0:3000 

Bihan commented Oct 21, 2025

Completed

  1. Add an example of how to test the new router

  2. Please ensure auto-scaling works (incl. downscaling to 0), and also that dstack uses the router's API to add/remove workers without restarting the gateway

Next Steps

  1. And only after that, refactor the code to move the sgl-router implementation to a separate sglang-related subclass, to ensure the normal gateway code doesn't contain any sgl-router-specific code (similar to how each backend encapsulates its own logic)

  2. Ensure tests are working

@Bihan Bihan requested a review from peterschmidt85 October 21, 2025 03:32
