Add OpenAI Responses API core #1

Merged
krystophny merged 1 commit into main from feature/openai-responses-api on Mar 24, 2026

Conversation

@krystophny (Collaborator) commented Mar 24, 2026

Summary

  • add a clean OpenAI-compatible /v1/responses endpoint to vllm-mlx
  • support string and message-array input, previous_response_id, function tools, stored response objects, and streaming response events
  • degrade gracefully when the request references unsupported built-in tools or unknown response item types, instead of hard-failing the request path
  • add focused unit coverage for the new protocol surface
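To make the supported request shapes concrete, here is a minimal sketch of the payloads described above. The field names (`input`, `previous_response_id`, `tools`, `output_text`) follow the public OpenAI Responses API; the helper function and example values are illustrative, not code from this PR.

```python
# Hypothetical sketch of the request/response shapes the endpoint accepts;
# field names follow the public OpenAI Responses API, not this PR's code.

def build_responses_request(model, user_input, previous_response_id=None, tools=None):
    """Accept either a plain string or a message array as `input`."""
    if isinstance(user_input, str):
        input_payload = user_input
    else:
        input_payload = [
            {"role": m["role"], "content": m["content"]} for m in user_input
        ]
    request = {"model": model, "input": input_payload}
    if previous_response_id is not None:
        request["previous_response_id"] = previous_response_id
    if tools:
        request["tools"] = tools  # e.g. function tools
    return request

# A stored response object, in the shape such an endpoint would return.
example_response = {
    "id": "resp_abc123",  # illustrative id
    "object": "response",
    "status": "completed",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "hello"}],
        }
    ],
}
```

Passing `previous_response_id` is what lets a client chain turns against stored response objects instead of resending the full history.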

Why this PR exists

FortBench local runs now route through vllm-mlx on Apple Silicon for both Codex and OpenCode. Codex local mode expects a Responses-compatible backend. Upstream vllm-mlx already has strong OpenAI Chat/Anthropic support, but it does not yet have a native /v1/responses surface.

This PR keeps the scope intentionally narrow: it adds the core Responses API without mixing in Codex-specific prompt normalization or unrelated loader/cache/runtime fixes.

Why this is independently deployable

  • it adds a new API surface without changing /v1/chat/completions, /v1/completions, or /v1/messages
  • unsupported built-in tools degrade to backend notes instead of crashing the request
  • Codex-specific prompt-shape behavior is kept out of this PR; it lives in the stacked PR #5, "Add Codex Responses prompt normalization"
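The degrade-instead-of-crash behavior can be sketched as a simple partition over the requested tools. This is a hypothetical illustration of the policy, not the PR's actual implementation; `KNOWN_TOOL_TYPES` and the note format are assumptions.

```python
# Hypothetical sketch of the degrade-don't-fail policy described above.
# KNOWN_TOOL_TYPES and the note shape are assumptions, not this PR's code.

KNOWN_TOOL_TYPES = {"function"}  # built-in tools (web_search, etc.) unsupported here

def partition_tools(tools):
    """Split requested tools into supported ones and backend notes."""
    supported, notes = [], []
    for tool in tools or []:
        tool_type = tool.get("type")
        if tool_type in KNOWN_TOOL_TYPES:
            supported.append(tool)
        else:
            # Instead of raising (which would hard-fail the request path),
            # record a note the backend can surface alongside the response.
            notes.append(f"unsupported built-in tool ignored: {tool_type}")
    return supported, notes
```

The same pattern extends to unknown response item types: skip and annotate rather than reject the whole request.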

Related upstream context

waybarrios/vllm-mlx

vllm-project/vllm

The broader Responses ecosystem is still moving fast upstream as well. This PR is intentionally smaller than that body of work: it provides the core endpoint FortBench needs locally and leaves the more advanced policy/store/tool-choice machinery for follow-up work.

ggml-org/llama.cpp

The design here was also informed by prior llama.cpp Responses work.

One explicit lesson we carried forward: unsupported tools or partially-supported response items should not hard-fail the whole request path.

vllm-project/vllm-metal

I also reviewed vllm-metal for overlap. The current active work there is lower-level Apple Silicon runtime work such as paged KV, unified prefill/decode, and Qwen smoke support, not Responses front-end API work.

Validation

  • PYTHONPATH=/Users/ert/code/vllm-mlx /Users/ert/code/.venv/bin/python -m pytest tests/test_responses_api.py -q
  • python3 -m compileall vllm_mlx
  • local FortBench MLX rerun currently active against this stack for Codex + OpenCode on the 20-task corpus

What could still improve

  • fuller spec parity around advanced built-in tools and persistence/store semantics
  • more exhaustive streaming event compliance coverage
  • dedicated end-to-end tests for mixed reasoning + tools across more parser families
  • eventual alignment with any upstream vllm-mlx native Responses implementation if one lands
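The streaming-compliance gap above can be checked against a minimal event sequence. The event names (`response.created`, `response.output_text.delta`, `response.completed`) follow the public OpenAI Responses streaming API; the generator itself is a test-fixture sketch, not this PR's implementation.

```python
import json

def stream_events(response_id, text_chunks):
    """Yield SSE lines for a minimal Responses streaming sequence.

    Event names follow the public OpenAI Responses API; this is a
    compliance-test sketch, not the PR's implementation.
    """
    def sse(event, data):
        # Standard server-sent-events framing: event line, data line, blank line.
        return f"event: {event}\ndata: {json.dumps(data)}\n\n"

    yield sse("response.created", {"response": {"id": response_id, "status": "in_progress"}})
    for chunk in text_chunks:
        yield sse("response.output_text.delta", {"delta": chunk})
    yield sse("response.completed", {"response": {"id": response_id, "status": "completed"}})
```

A compliance test would assert the ordering: exactly one `response.created` first, deltas in the middle, one `response.completed` last.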

@krystophny changed the title from "Add OpenAI Responses API support" to "Add OpenAI Responses API with Codex compatibility" on Mar 24, 2026
@krystophny force-pushed the feature/openai-responses-api branch from dd838de to c7f7364 on March 24, 2026 08:24
@krystophny changed the title from "Add OpenAI Responses API with Codex compatibility" to "Add OpenAI Responses API core" on Mar 24, 2026
@krystophny merged commit 6c47291 into main on Mar 24, 2026