# Home
A custom vLLM fork with bug fixes and features for GPT-OSS (Harmony) models, the Responses API, and GPU snapshotting for near-zero cold starts.
Each release is built from upstream vLLM nightly (as of the release date) with a set of open PRs merged on top. The fork tracks upstream closely; it is not a permanent divergence. As PRs are merged upstream, they drop out of the fork.
Releases are tagged with a version like 0.17.2rc0+garrio.2, where 0.17.2rc0 is the upstream vLLM base version and garrio.2 is the fork's patch number.
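The base/patch split in a tag can be recovered mechanically. A minimal sketch (the tag format comes from the example above; the parsing function itself is illustrative, not part of the fork):

```python
import re

def parse_fork_tag(tag: str) -> tuple[str, int]:
    """Split a fork tag into (upstream vLLM base version, fork patch number)."""
    m = re.fullmatch(r"(?P<base>[0-9A-Za-z.]+)\+garrio\.(?P<patch>\d+)", tag)
    if m is None:
        raise ValueError(f"not a fork tag: {tag!r}")
    return m.group("base"), int(m.group("patch"))

print(parse_fork_tag("0.17.2rc0+garrio.2"))  # ('0.17.2rc0', 2)
```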
Each release includes a prebuilt wheel (CUDA 12.9, x86_64). The latest public wheel at the time of writing is:

`pip install 'https://github.com/will-deines/vllm/releases/download/0.18.1rc0%2Bgarrio.8/vllm-0.18.1rc0+garrio.8-cp38-abi3-manylinux_2_31_x86_64.whl'`

Check the Releases page for the latest version.
Each release merges the following open PRs against vllm-project/vllm, plus the small amount of public-branch integration work needed to ship them cleanly:
| PR | Description |
|---|---|
| #35934 | `suspend()`/`resume()` for CRIU-safe snapshots: tears down NCCL and distributed state before a CRIU/cuda-checkpoint snapshot and rebuilds it after restore. Weights stay on GPU. Enables near-zero cold starts with Modal `gpu_snapshot` and similar platforms. |
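The lifecycle ordering the snapshot PR describes can be sketched with a toy model. The class and attribute names below are illustrative stand-ins, not the actual vLLM API surface:

```python
# Toy model of the suspend/resume lifecycle around a CRIU/cuda-checkpoint
# snapshot. Names are illustrative, NOT the real vLLM engine API.

class SnapshotSafeEngine:
    def __init__(self):
        self.comm_alive = True      # stands in for NCCL / distributed state
        self.weights_on_gpu = True  # weights are never moved off the GPU

    def suspend(self):
        # Tear down process-group/NCCL state so the platform can safely
        # freeze the process.
        self.comm_alive = False

    def resume(self):
        # Rebuild communication state after restore; weights were left
        # resident on the GPU the whole time.
        self.comm_alive = True

engine = SnapshotSafeEngine()
engine.suspend()   # ... platform takes the snapshot here ...
assert not engine.comm_alive and engine.weights_on_gpu
engine.resume()    # ... after restore ...
assert engine.comm_alive and engine.weights_on_gpu
```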
| PR | Description |
|---|---|
| #37433 | `tool_choice` support (`auto`/`required`/`none`) for GPT-OSS models in the Responses API, using structural tag grammars. |
| #35904 | Structured output + reasoning via structural tag embedding: JSON schema enforcement is scoped to the final channel so it doesn't clobber reasoning output. |
| #35905 | Pluggable `ResponseStore` abstraction: swap the in-memory response store for Redis/Postgres/DynamoDB via the `VLLM_RESPONSES_STORE_BACKEND` env var. |
| #35903 | Stateless multi-turn via an `encrypted_content` state carrier: multi-turn conversations without server-side storage (no `VLLM_ENABLE_RESPONSES_API_STORE` required). |
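The state-carrier idea behind the stateless multi-turn PR can be illustrated with a round-trip sketch. This is conceptual only: the real carrier is encrypted, whereas here a plain base64 blob stands in, and the function names are made up for the example:

```python
import base64
import json

# Illustrative only: the real encrypted_content carrier is encrypted; a
# base64-encoded JSON blob stands in here to show the round-trip shape.

def pack_state(messages: list) -> str:
    """Serialize conversation state into an opaque string the client carries."""
    return base64.urlsafe_b64encode(json.dumps(messages).encode()).decode()

def unpack_state(carrier: str) -> list:
    """Recover conversation state from the carrier on the next turn."""
    return json.loads(base64.urlsafe_b64decode(carrier.encode()).decode())

turn1 = [{"role": "user", "content": "hi"},
         {"role": "assistant", "content": "hello"}]
carrier = pack_state(turn1)            # handed back to the client
assert unpack_state(carrier) == turn1  # server keeps no state between turns
```

Because the client carries the whole conversation state back on each request, the server never needs a response store at all.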
| PR | Description |
|---|---|
| #36011 | Fix streaming cross-channel delta accumulation: when a token batch crossed a channel boundary (analysis → commentary), content was misclassified by the end-of-batch channel state. |
| #35907 | Fix analysis-channel tool calls and preserve reasoning across turns: tool calls on the analysis channel were silently misrouted, and encoder-side `auto_drop_analysis` destroyed reasoning context between tool-calling turns. |
| #35906 | Sanitize leaked Harmony control tokens: three-layer defense against leaked `<\|channel\|>`, `<\|constrain\|>`, etc. in tool names and recipients. Includes `ResilientStreamableParser` for token-level recovery. |
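To make the sanitization concrete, here is a single-layer sketch of stripping Harmony control tokens from a tool name. The actual PR ships a three-layer defense; this regex-based helper is only an illustration of the first line of it:

```python
import re

# Harmony control tokens look like <|channel|>, <|constrain|>, <|end|>, ...
# This pattern and helper are illustrative, not the PR's implementation.
HARMONY_CONTROL = re.compile(r"<\|[a-z_]+\|>")

def sanitize_tool_name(name: str) -> str:
    """Strip any leaked Harmony control tokens from a tool name or recipient."""
    return HARMONY_CONTROL.sub("", name)

print(sanitize_tool_name("get_weather<|channel|>"))  # get_weather
```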
| PR / branch work | Description |
|---|---|
| #37949 | Prefer Triton on SM8x and narrow SM89 sink-prefill tuning: keeps GPT-OSS on Triton when attention sinks are required on Ada/Ampere, and applies the larger SM89 sink-prefill tile only to the medium-length prefills that were wins in offline L40S validation. |
| public/main integration | The public branch also carries the prerequisite GPT-OSS CUDA policy and selector-compatibility changes that make the Triton sink path the default serving path for `openai/gpt-oss-20b` on a single L40S. |
- **Suspend/Resume with Modal `gpu_snapshot`**: near-zero cold starts using the `suspend()`/`resume()` API with Modal containers.
- **Serving GPT-OSS on Modal**: deploy the fork on Modal with the current single-L40S tuning for `openai/gpt-oss-20b`, including the current Triton-based attention stack, bounded batching, fast boot, and Responses/tool-calling support.