Merged
## Summary

- Adds a `/v1/responses` endpoint to `vllm-mlx`
- Supports `input`, `previous_response_id`, function tools, stored response objects, and streaming response events

## Why this PR exists
FortBench local runs now pivot through `vllm-mlx` on Apple Silicon for both Codex and OpenCode. Codex local mode expects a Responses-compatible backend. Upstream `vllm-mlx` already has strong OpenAI Chat/Anthropic support, but it does not yet have a native `/v1/responses` surface.

This PR keeps the scope intentionally narrow: it adds the core Responses API without mixing in Codex-specific prompt normalization or unrelated loader/cache/runtime fixes.
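For reviewers unfamiliar with the Responses surface, a request against the new endpoint looks roughly like this. This is a sketch of the request body shape only; the model name and the `get_weather` tool are illustrative placeholders, not part of this PR:

```python
# Sketch of a /v1/responses request body. Field names follow the OpenAI
# Responses API; the model name and get_weather tool are placeholders.
payload = {
    "model": "mlx-community/placeholder-model",
    "input": [
        {"role": "user", "content": "What's the weather in Lisbon?"}
    ],
    # Chain onto a previously stored response instead of resending history.
    "previous_response_id": "resp_abc123",
    "tools": [
        {
            "type": "function",
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    "store": True,    # persist the response object for later chaining
    "stream": False,  # set True to receive streaming response events
}
```

The four feature areas in the summary map directly onto the fields above: `input`, `previous_response_id`, `tools`, and the `store`/`stream` flags.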
## Why this is independently deployable

- Does not modify `/v1/chat/completions`, `/v1/completions`, or `/v1/messages`

## Related upstream context
### `waybarrios/vllm-mlx`

- waybarrios/vllm-mlx#46 introduced the Anthropic Messages API, prefix-cache improvements, and agentic reliability fixes
- waybarrios/vllm-mlx#50 added Harmony format parsers for GPT-OSS models
- waybarrios/vllm-mlx#53 added the GPT-OSS reasoning parser for the channel-based token format
- waybarrios/vllm-mlx#127 merged Qwen3.5 (text-only) model support and fixed reasoning + tool streaming
- waybarrios/vllm-mlx#173, #177, and #210 fixed `tool_choice` handling: honoring `tool_choice="none"` by stripping tools and suppressing parsing (#173), parsing tool calls in the streaming reasoning branch (#177), and respecting `tool_choice="none"` by excluding tools from the template (#210)

### `vllm-project/vllm`

The broader Responses ecosystem is still moving fast upstream as well. Relevant open work includes:

- vllm#35905: pluggable `ResponseStore` abstraction
- vllm#37727: bugfix for `instructions` leaking through `previous_response_id`
- vllm#36445: Responses API streaming tool-call support for non-Harmony models
- vllm#37433: `tool_choice` support (auto / required / none) for GPT-OSS/Harmony
- vllm#37739: fix `default_chat_template_kwargs` handling in the Responses API

This PR is intentionally smaller than that body of work. It provides the core endpoint FortBench needs locally and leaves the more advanced policy/store/tool-choice machinery for follow-up work.
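The pluggable `ResponseStore` work in vllm#35905 frames the storage problem this PR also faces. The behavior needed here can be pictured as a minimal in-memory store that resolves a `previous_response_id` chain back into conversation history. This is a hedged sketch of the concept, not the PR's actual classes:

```python
import uuid


class InMemoryResponseStore:
    """Minimal sketch of stored-response chaining via previous_response_id.

    Illustrative only; names and structure are hypothetical.
    """

    def __init__(self):
        # response id -> {"input": [...], "output": [...], "prev": id or None}
        self._responses = {}

    def save(self, input_items, output_items, previous_response_id=None):
        """Persist one turn and return its generated response id."""
        response_id = f"resp_{uuid.uuid4().hex[:12]}"
        self._responses[response_id] = {
            "input": list(input_items),
            "output": list(output_items),
            "prev": previous_response_id,
        }
        return response_id

    def resolve_history(self, previous_response_id):
        """Walk the chain back to its root; return items oldest-first."""
        chunks = []
        rid = previous_response_id
        while rid is not None:
            record = self._responses[rid]  # KeyError -> unknown id (caller maps to 404)
            chunks.append(record["input"] + record["output"])
            rid = record["prev"]
        history = []
        for chunk in reversed(chunks):
            history.extend(chunk)
        return history
```

Usage: saving turn 1, then turn 2 with `previous_response_id` set to turn 1's id, makes `resolve_history` on turn 2 return all four items in order, which is exactly what the endpoint needs to rebuild the prompt without the client resending history.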
### `ggml-org/llama.cpp`

The design here was also informed by prior llama.cpp Responses work:

- llama.cpp#18486: partial `/v1/responses` server support
- llama.cpp#19720: OpenAI Responses API compliance follow-up
- llama.cpp#19873: mirroring `/v1/responses` to `/responses`
- llama.cpp#20079: Codex prompt normalization, translating the "developer" role to "system" for chat-template compatibility (merged/closed)

One explicit lesson we carried forward: unsupported tools or partially supported response items should not hard-fail the whole request path.
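That lesson can be made concrete: rather than rejecting a request that mixes function tools with tool types the backend cannot execute, filter and warn. A minimal sketch, assuming dict-shaped tool definitions; the helper name is hypothetical and not code from this PR:

```python
import logging

logger = logging.getLogger(__name__)

# Assumption for this sketch: only function tools are executable locally.
SUPPORTED_TOOL_TYPES = {"function"}


def partition_tools(tools):
    """Keep supported tools; log and drop the rest instead of failing.

    Hypothetical helper illustrating the 'don't hard-fail' lesson.
    """
    supported, dropped = [], []
    for tool in tools or []:
        if tool.get("type") in SUPPORTED_TOOL_TYPES:
            supported.append(tool)
        else:
            dropped.append(tool.get("type", "unknown"))
    if dropped:
        logger.warning("Ignoring unsupported tool types: %s", sorted(set(dropped)))
    return supported
```

With this shape, a Codex request carrying an unrecognized built-in tool still produces a usable completion from the function tools it does carry, instead of a 400 for the whole request.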
### `vllm-project/vllm-metal`

I also reviewed `vllm-metal` for overlap. The current active work there is lower-level Apple Silicon runtime work, such as paged KV caching, unified prefill/decode, and Qwen smoke coverage, not Responses front-end API work:

- vllm-metal#125: fix paged-attention KV cache dtype and size accounting (issue #119)
- vllm-metal#172: [Continuous Batching] unified prefilling & decoding prototype
- vllm-metal#174: bump mlx-lm/mlx-vlm deps and add a Qwen3.5-0.8B smoke test

## Validation
```shell
PYTHONPATH=/Users/ert/code/vllm-mlx /Users/ert/code/.venv/bin/python -m pytest tests/test_responses_api.py -q
python3 -m compileall vllm_mlx
```

## What could still improve
- A `vllm-mlx`-native Responses implementation, if one lands upstream
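As a closing reference for reviewers, the streaming response events mentioned in the summary follow the typed server-sent-event pattern of the Responses API. A rough sketch of the framing a client sees, where event names follow the OpenAI taxonomy but the generator itself is illustrative, not this PR's implementation:

```python
import json


def sse_events(response_id, text_chunks):
    """Yield SSE-framed lines approximating a Responses streaming session.

    Illustrative only: a real implementation emits many more event types
    (response.output_item.added, response.content_part.added, ...) with
    fuller payloads.
    """

    def event(name, data):
        # SSE framing: an event name line, a data line, then a blank line.
        return f"event: {name}\ndata: {json.dumps(data)}\n\n"

    yield event("response.created",
                {"response": {"id": response_id, "status": "in_progress"}})
    for chunk in text_chunks:
        yield event("response.output_text.delta", {"delta": chunk})
    yield event("response.completed",
                {"response": {"id": response_id, "status": "completed"}})
```

The `response.created` / delta / `response.completed` bracketing is what Codex's local mode consumes when `stream: true` is set on the request.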