-
Notifications
You must be signed in to change notification settings - Fork 198
Add sglang router minimal support #3210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Completed
Next Steps
|
Intro
We want to make it possible to create a gateway which extends the gateway functionality with additional features (all sgl-router features such as cache aware routing, etc) while keeping all the standard gateway features (such as authentication, rate limits).
For the user, using such gateway should be very simple, e.g. setting router to sglang in gateway configurations. Eg:
The rest for the user should look the same - the
same service endpoint,authenticationandrate limits working,etc.While this first
experimental versionshould only bring minimum features - allow to route replicas traffic through the router (dstack’s gateway/ngnix -> sglang-router -> replica workers), in the future this may be extended with router-specific scaling metrics, such as ttft, e2e, Prefill-Decode Disaggregation, etc).As the first experimental version, the most critical is to come up with the minimum changes that are tested thoroughly that would allow embedding the
router: sglangwithout breaking any existing functionality.Note:
In this version installation of pip & sglang-router is done in gateway machine, irrespective of whether
router:sglangis in gateway config or not. To make it conditional in future, it should be implemented across backends that support gateway.Modified upstream block of
src/dstack/_internal/proxy/gateway/resources/nginx/service.jinja2to respectrouter: sglangin gateway config.src/dstack/_internal/proxy/gateway/resources/nginx/sglang_workers.jinja2This nginx conf forwards HTTP to Unix socket. dstack workers listen on Unix sockets, while the sglang-router speaks HTTP, so this bridge lets the router reach each worker via local TCP ports.
How To Test
Step 1
Replace return value as shown in below example in method
get_dstack_gateway_wheel(exact path see here) .Eg:
Step 2
Apply below gateway config.
Step 3
Update DNS
Step 4
Apply below service config
Step 5
To automate request and test autoscaling, you can use below script:
autoscale_test_sglang.pyStep 6
After updating
tokenandservice endpoint, run above scriptpython autoscale_test_sglang.pyfrom your local machine.Once the automated requests start hitting the service endpoint; dstack submits the job. When the service get's deployed and
/healthcheck from sglang-router responds with 200 as shown below, you will start to see response from the model.As the automated requests continue, first dstack scales up to 3 jobs and later adjusts to 2 jobs. If we stop the requests, dstack scales down to 0 jobs.
Logs:
Step 7
You can also use dstack-frontend `http://localhost:3000/projects/main/models/sglang-service for manual requests.
Note: You can check sglang-router logs: cat ~/dstack/router_logs/sgl-router.
Also, maybe in the future we can show sglang-router's log instead of replica's log in dstack CLI
Eg: