Merged
## Summary

- Adds a `/v1/responses` endpoint to `vllm-mlx`
- Supports `input`, `previous_response_id`, function tools, stored response objects, and streaming response events

## Why this PR exists
FortBench local runs now pivot through `vllm-mlx` on Apple Silicon for both Codex and OpenCode. Codex local mode expects a Responses-compatible backend. Upstream `vllm-mlx` already has strong OpenAI Chat/Anthropic support, but it does not yet have a native `/v1/responses` surface.

This PR keeps the scope intentionally narrow: it adds the core Responses API without mixing in Codex-specific prompt normalization or unrelated loader/cache/runtime fixes.
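For reviewers unfamiliar with the Responses surface, a request against the new endpoint looks roughly like this. This is a sketch of the request body shape only; the model name and the `get_weather` tool are illustrative placeholders, not part of this PR:

```python
# Sketch of a /v1/responses request body. Field names follow the OpenAI
# Responses API; the model name and get_weather tool are placeholders.
payload = {
    "model": "mlx-community/placeholder-model",
    "input": [
        {"role": "user", "content": "What's the weather in Lisbon?"}
    ],
    # Chain onto a previously stored response instead of resending history.
    "previous_response_id": "resp_abc123",
    "tools": [
        {
            "type": "function",
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    "store": True,    # persist the response object for later chaining
    "stream": False,  # set True to receive streaming response events
}
```

The four feature areas in the summary map directly onto the fields above: `input`, `previous_response_id`, `tools`, and the `store`/`stream` flags.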
## Why this is independently deployable

- Does not modify `/v1/chat/completions`, `/v1/completions`, or `/v1/messages`

## Related upstream context
### `waybarrios/vllm-mlx`

- waybarrios/vllm-mlx#46 introduced the Anthropic Messages API, prefix-cache improvements, and agentic reliability fixes
- waybarrios/vllm-mlx#50 added Harmony format parsers for GPT-OSS models
- waybarrios/vllm-mlx#53 added the GPT-OSS reasoning parser for the channel-based token format
- waybarrios/vllm-mlx#127 merged Qwen3.5 (text-only) model support and fixed reasoning + tool streaming
- waybarrios/vllm-mlx#173, #177, and #210 fixed `tool_choice` handling: honoring `tool_choice="none"` by stripping tools and suppressing parsing (#173), parsing tool calls in the streaming reasoning branch (#177), and respecting `tool_choice="none"` by excluding tools from the template (#210)

### `vllm-project/vllm`

The broader Responses ecosystem is still moving fast upstream as well. Relevant open work includes:

- vllm#35905: pluggable `ResponseStore` abstraction
- vllm#37727: bugfix for `instructions` leaking through `previous_response_id`
- vllm#36445: Responses API streaming tool-call support for non-Harmony models
- vllm#37433: `tool_choice` support (auto / required / none) for GPT-OSS/Harmony
- vllm#37739: fix `default_chat_template_kwargs` handling in the Responses API

This PR is intentionally smaller than that body of work. It provides the core endpoint FortBench needs locally and leaves the more advanced policy/store/tool-choice machinery for follow-up work.
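The pluggable `ResponseStore` work in vllm#35905 frames the storage problem this PR also faces. The behavior needed here can be pictured as a minimal in-memory store that resolves a `previous_response_id` chain back into conversation history. This is a hedged sketch of the concept, not the PR's actual classes:

```python
import uuid


class InMemoryResponseStore:
    """Minimal sketch of stored-response chaining via previous_response_id.

    Illustrative only; names and structure are hypothetical.
    """

    def __init__(self):
        # response id -> {"input": [...], "output": [...], "prev": id or None}
        self._responses = {}

    def save(self, input_items, output_items, previous_response_id=None):
        """Persist one turn and return its generated response id."""
        response_id = f"resp_{uuid.uuid4().hex[:12]}"
        self._responses[response_id] = {
            "input": list(input_items),
            "output": list(output_items),
            "prev": previous_response_id,
        }
        return response_id

    def resolve_history(self, previous_response_id):
        """Walk the chain back to its root; return items oldest-first."""
        chunks = []
        rid = previous_response_id
        while rid is not None:
            record = self._responses[rid]  # KeyError -> unknown id (caller maps to 404)
            chunks.append(record["input"] + record["output"])
            rid = record["prev"]
        history = []
        for chunk in reversed(chunks):
            history.extend(chunk)
        return history
```

Usage: saving turn 1, then turn 2 with `previous_response_id` set to turn 1's id, makes `resolve_history` on turn 2 return all four items in order, which is exactly what the endpoint needs to rebuild the prompt without the client resending history.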
### `ggml-org/llama.cpp`

The design here was also informed by prior llama.cpp Responses work:

- llama.cpp#18486: partial `/v1/responses` server support
- llama.cpp#19720: OpenAI Responses API compliance follow-up
- llama.cpp#19873: mirroring `/v1/responses` to `/responses`
- llama.cpp#20079: Codex prompt normalization, translating the "developer" role to "system" for chat-template compatibility (merged/closed)

One explicit lesson we carried forward: unsupported tools or partially supported response items should not hard-fail the whole request path.
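That lesson can be made concrete: rather than rejecting a request that mixes function tools with tool types the backend cannot execute, filter and warn. A minimal sketch, assuming dict-shaped tool definitions; the helper name is hypothetical and not code from this PR:

```python
import logging

logger = logging.getLogger(__name__)

# Assumption for this sketch: only function tools are executable locally.
SUPPORTED_TOOL_TYPES = {"function"}


def partition_tools(tools):
    """Keep supported tools; log and drop the rest instead of failing.

    Hypothetical helper illustrating the 'don't hard-fail' lesson.
    """
    supported, dropped = [], []
    for tool in tools or []:
        if tool.get("type") in SUPPORTED_TOOL_TYPES:
            supported.append(tool)
        else:
            dropped.append(tool.get("type", "unknown"))
    if dropped:
        logger.warning("Ignoring unsupported tool types: %s", sorted(set(dropped)))
    return supported
```

With this shape, a Codex request carrying an unrecognized built-in tool still produces a usable completion from the function tools it does carry, instead of a 400 for the whole request.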
### `vllm-project/vllm-metal`

I also reviewed `vllm-metal` for overlap. The current active work there is lower-level Apple Silicon runtime work, such as paged KV caching, unified prefill/decode, and Qwen smoke coverage, not Responses front-end API work:

- vllm-metal#125: fix paged-attention KV cache dtype and size accounting (issue #119)
- vllm-metal#172: [Continuous Batching] unified prefilling & decoding prototype
- vllm-metal#174: bump mlx-lm/mlx-vlm deps and add a Qwen3.5-0.8B smoke test

## Validation
```shell
PYTHONPATH=/Users/ert/code/vllm-mlx /Users/ert/code/.venv/bin/python -m pytest tests/test_responses_api.py -q
python3 -m compileall vllm_mlx
```

## What could still improve
- A `vllm-mlx`-native Responses implementation, if one lands upstream
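As a closing reference for reviewers, the streaming response events mentioned in the summary follow the typed server-sent-event pattern of the Responses API. A rough sketch of the framing a client sees, where event names follow the OpenAI taxonomy but the generator itself is illustrative, not this PR's implementation:

```python
import json


def sse_events(response_id, text_chunks):
    """Yield SSE-framed lines approximating a Responses streaming session.

    Illustrative only: a real implementation emits many more event types
    (response.output_item.added, response.content_part.added, ...) with
    fuller payloads.
    """

    def event(name, data):
        # SSE framing: an event name line, a data line, then a blank line.
        return f"event: {name}\ndata: {json.dumps(data)}\n\n"

    yield event("response.created",
                {"response": {"id": response_id, "status": "in_progress"}})
    for chunk in text_chunks:
        yield event("response.output_text.delta", {"delta": chunk})
    yield event("response.completed",
                {"response": {"id": response_id, "status": "completed"}})
```

The `response.created` / delta / `response.completed` bracketing is what Codex's local mode consumes when `stream: true` is set on the request.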