# Home
A custom vLLM fork with bug fixes and features for GPT-OSS (Harmony) models, the Responses API, and GPU snapshotting for near-zero cold starts.
Each release is built from upstream vLLM nightly (as of the release date) with a set of open PRs merged on top. The fork tracks upstream closely; it is not a permanent divergence. As PRs are merged upstream, they drop out of the fork.
Releases are tagged with a version like 0.17.2rc0+garrio.2, where 0.17.2rc0 is the upstream vLLM base version and garrio.2 is the fork's patch number.
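The base/patch split in a tag can be recovered mechanically. A minimal sketch (the tag format comes from the example above; the parsing function itself is illustrative, not part of the fork):

```python
import re

def parse_fork_tag(tag: str) -> tuple[str, int]:
    """Split a fork tag into (upstream vLLM base version, fork patch number)."""
    m = re.fullmatch(r"(?P<base>[0-9A-Za-z.]+)\+garrio\.(?P<patch>\d+)", tag)
    if m is None:
        raise ValueError(f"not a fork tag: {tag!r}")
    return m.group("base"), int(m.group("patch"))

print(parse_fork_tag("0.17.2rc0+garrio.2"))  # ('0.17.2rc0', 2)
```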
Each release includes a prebuilt wheel (CUDA 12.9, x86_64). The latest public wheel at the time of writing is:

`pip install 'https://github.com/will-deines/vllm/releases/download/0.18.1rc0%2Bgarrio.8/vllm-0.18.1rc0+garrio.8-cp38-abi3-manylinux_2_31_x86_64.whl'`

Check the Releases page for the latest version.
Each release merges the following open PRs against vllm-project/vllm, plus the small amount of public-branch integration work needed to ship them cleanly:
| PR | Description |
|---|---|
| #35934 | `suspend()`/`resume()` for CRIU-safe snapshots: tears down NCCL and distributed state before a CRIU/cuda-checkpoint snapshot and rebuilds it after restore. Weights stay on GPU. Enables near-zero cold starts with Modal `gpu_snapshot` and similar platforms. |
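The lifecycle ordering the snapshot PR describes can be sketched with a toy model. The class and attribute names below are illustrative stand-ins, not the actual vLLM API surface:

```python
# Toy model of the suspend/resume lifecycle around a CRIU/cuda-checkpoint
# snapshot. Names are illustrative, NOT the real vLLM engine API.

class SnapshotSafeEngine:
    def __init__(self):
        self.comm_alive = True      # stands in for NCCL / distributed state
        self.weights_on_gpu = True  # weights are never moved off the GPU

    def suspend(self):
        # Tear down process-group/NCCL state so the platform can safely
        # freeze the process.
        self.comm_alive = False

    def resume(self):
        # Rebuild communication state after restore; weights were left
        # resident on the GPU the whole time.
        self.comm_alive = True

engine = SnapshotSafeEngine()
engine.suspend()   # ... platform takes the snapshot here ...
assert not engine.comm_alive and engine.weights_on_gpu
engine.resume()    # ... after restore ...
assert engine.comm_alive and engine.weights_on_gpu
```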
| PR | Description |
|---|---|
| #37433 | `tool_choice` support (`auto`/`required`/`none`) for GPT-OSS models in the Responses API, using structural tag grammars. |
| #35904 | Structured output + reasoning via structural tag embedding: JSON schema enforcement is scoped to the final channel so it doesn't clobber reasoning output. |
| #35905 | Pluggable `ResponseStore` abstraction: swap the in-memory response store for Redis/Postgres/DynamoDB via the `VLLM_RESPONSES_STORE_BACKEND` env var. |
| #35903 | Stateless multi-turn via an `encrypted_content` state carrier: multi-turn conversations without server-side storage (no `VLLM_ENABLE_RESPONSES_API_STORE` required). |
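The state-carrier idea behind the stateless multi-turn PR can be illustrated with a round-trip sketch. This is conceptual only: the real carrier is encrypted, whereas here a plain base64 blob stands in, and the function names are made up for the example:

```python
import base64
import json

# Illustrative only: the real encrypted_content carrier is encrypted; a
# base64-encoded JSON blob stands in here to show the round-trip shape.

def pack_state(messages: list) -> str:
    """Serialize conversation state into an opaque string the client carries."""
    return base64.urlsafe_b64encode(json.dumps(messages).encode()).decode()

def unpack_state(carrier: str) -> list:
    """Recover conversation state from the carrier on the next turn."""
    return json.loads(base64.urlsafe_b64decode(carrier.encode()).decode())

turn1 = [{"role": "user", "content": "hi"},
         {"role": "assistant", "content": "hello"}]
carrier = pack_state(turn1)            # handed back to the client
assert unpack_state(carrier) == turn1  # server keeps no state between turns
```

Because the client carries the whole conversation state back on each request, the server never needs a response store at all.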
| PR | Description |
|---|---|
| #36011 | Fix streaming cross-channel delta accumulation: when a token batch crossed a channel boundary (analysis → commentary), content was misclassified by the end-of-batch channel state. |
| #35907 | Fix analysis-channel tool calls and preserve reasoning across turns: tool calls on the analysis channel were silently misrouted, and encoder-side `auto_drop_analysis` destroyed reasoning context between tool-calling turns. |
| #35906 | Sanitize leaked Harmony control tokens: three-layer defense against leaked `<\|channel\|>`, `<\|constrain\|>`, etc. in tool names and recipients. Includes `ResilientStreamableParser` for token-level recovery. |
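To make the sanitization concrete, here is a single-layer sketch of stripping Harmony control tokens from a tool name. The actual PR ships a three-layer defense; this regex-based helper is only an illustration of the first line of it:

```python
import re

# Harmony control tokens look like <|channel|>, <|constrain|>, <|end|>, ...
# This pattern and helper are illustrative, not the PR's implementation.
HARMONY_CONTROL = re.compile(r"<\|[a-z_]+\|>")

def sanitize_tool_name(name: str) -> str:
    """Strip any leaked Harmony control tokens from a tool name or recipient."""
    return HARMONY_CONTROL.sub("", name)

print(sanitize_tool_name("get_weather<|channel|>"))  # get_weather
```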
| PR / branch work | Description |
|---|---|
| #37949 | Prefer Triton on SM8x and narrow SM89 sink-prefill tuning: keeps GPT-OSS on Triton when attention sinks are required on Ada/Ampere, and applies the larger SM89 sink-prefill tile only to the medium-length prefills that were wins in offline L40S validation. |
| public/main integration | The public branch also carries the prerequisite GPT-OSS CUDA policy and selector-compatibility changes that make the Triton sink path the default serving path for `openai/gpt-oss-20b` on a single L40S. |
- **Suspend/Resume with Modal `gpu_snapshot`**: near-zero cold starts using the `suspend()`/`resume()` API with Modal containers.
- **Serving GPT-OSS on Modal**: deploy the fork on Modal with the current single-L40S tuning for `openai/gpt-oss-20b`, including the current Triton-based attention stack, bounded batching, fast boot, and Responses/tool-calling support.