feat: reasoning output responses api #5206
robinnarsinghranabhat wants to merge 29 commits into llamastack:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged. @robinnarsinghranabhat please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
✱ Stainless preview builds
This PR will update the SDKs below. Edit this comment to update it; it will appear in the SDK's changelogs.
✅ llama-stack-client-go (studio · conflict)
✅ llama-stack-client-python (studio · code · diff)
✅ llama-stack-client-node (studio · conflict)
✅ llama-stack-client-openapi (studio · code · diff)
This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Force-pushed 776cb1f to 8ad6647 (compare)
✅ Recordings committed successfully. Recordings from the integration tests have been committed to this PR.
Force-pushed 8ad6647 to a3337d1 (compare)
cdoern
left a comment
please use the Ollama-reasoning suite to add integration tests for this; see 87dc40b#diff-40d732a0defb244aec12e21fbd9cd387cbf212732f269549475db8de3877480c for more details on how I added some for the inference API previously.
I added tests to the ollama-reasoning suite with As I had only tested with Sorry, I don't understand llamastack CI better.
mattf
left a comment
@robinnarsinghranabhat please directly include details of how you ran bfcl
mattf
left a comment
given that reasoning is not part of the chat completions api, but it is part of the responses api, and we implement our responses api atop our chat completions api,
we need an internal standard for how to propagate reasoning content that is not exposed in our public chat completions api:
- pick a name: magic_toc_tokens
- require chat providers to populate magic_toc_tokens when appropriate
- detect the magic_toc_tokens field in the responses impl and convert it to Reasoning output
- ensure we do not leak magic_toc_tokens to users
(1) is going to become an implementation gap between providers, e.g. how do you get cot tokens from the openai provider's chat api? we'll probably have to move to responses.
(1) is going to be hard work on provider adapters, e.g. vllm configured w/o a reasoning parser will return model specific cot tokens in the response or different versions of vllm will put the reasoning content in different response fields
this pr puts the adapter-specific reasoning parsing into the responses adapter and declares only a partial implementation. if it were to complete the implementation, it would have a web of provider-specific code in the responses impl and would become unmaintainable.
as written, this pr puts us on an unmaintainable path.
some other ideas -
- add stack_chat_completions_with_reasoning to the Inference contract, for internal use only by the responses implementation
- add responses to the Inference contract for providers who can implement it. care will be needed here to ensure the provider responses loop does not execute any tools and no credentials are passed along.
**Score:** 83.1% · **Issues:** 48 · **Missing:** 20

#### `/chat/completions`
@cdoern please confirm that this throws an error for novel outputs. for instance, is an error raised if the spec says fields x, y, z are to be returned and we return x, y, z & p?
```python
class OpenAIChatCompletionResponseMessage(BaseModel):
    """An assistant message returned in a chat completion response."""

    model_config = ConfigDict(extra="allow")
```
To build the next_turn_messages for the next round of calling chat-completions.
@mattf Really appreciate this thorough review!
I agree with this as a long-term plan. As long as we stick to chat-completions, I see a need to standardize message conversion as well between responses and chat-completions.
But I notice that the current responses adapter already expects such a field. I inferred this as a contract where the provider-specific streaming-cc implementation is responsible for populating it.
Updated the description.
good catch. i'd call that an oops that needs to be resolved. as implemented it means users will silently get different levels of service.
@mattf Maybe it was a mistake, but isn't the current implementation implicitly doing what you suggested, with the name of choice being that field?
I am not sure if this PR should be closed then. Any ideas on where we are with prioritization on defining it?
cdoern
left a comment
@robinnarsinghranabhat, take a look at the check labeled Stainless SDK Builds / run-integration-tests / Integration Tests. These tests generate a NEW client based on your changes and run the entire suite. It is ok if some of the regular integration tests fail, as long as their stainless equivalents pass, if the issue is the client.
cdoern
left a comment
I agree with @mattf: basically this impl is backwards. The API should not have specific handling per provider; the API needs to have contracts that each provider implements differently.
specific issues:
- `model_config = ConfigDict(extra="allow")` on `OpenAIAssistantMessageParam` (src/llama_stack_api/inference/models.py:636): this opens up the assistant message model to accept any arbitrary field, which is a sledgehammer approach just to smuggle a reasoning field through. It bypasses Pydantic validation and could let malformed data through silently.
- Reasoning is stuffed into Chat Completions types via setattr/getattr hacks (streaming.py:693, utils.py:321-325): the code does things like `msg.reasoning = reasoning` and `getattr(choice.message, "reasoning", None)` on types that don't have a `reasoning` field. This only works because of the `extra="allow"` hack above. It's an untyped, informal contract: nothing enforces it, nothing documents it at the type level.
- Provider-specific reasoning parsing lives in the Responses layer (streaming.py:578-590): this means each new provider's quirks will need to be handled here, in the wrong layer.
- `_get_preceding_reasoning` is fragile (utils.py:424-433): it only looks at the single item immediately before the current one. If the input ordering ever changes, or if there are multiple reasoning items, this silently drops reasoning content.
- `ChatCompletionResult.reasoning` is a flat `str | None` (types.py:71): reasoning content from providers can be structured (multiple segments, summaries, etc.), but this flattens it all into a single concatenated string, losing structure.
- Partial provider coverage: the PR only handles Ollama/vLLM-style reasoning. OpenAI's own reasoning, Gemini's tags, and other providers are explicitly not covered, making this a partial implementation that will need the same pattern repeated per provider.
These all tie back to Matt's core point: reasoning extraction should be a provider-level concern with a well-typed internal contract, not ad-hoc field smuggling through the Responses layer.
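On the flat `str | None` point specifically, a structured internal type preserves segment boundaries while still offering a flattened view for callers that only want a string. A minimal sketch, with hypothetical names (not the PR's types):

```python
from dataclasses import dataclass


@dataclass
class ReasoningSegment:
    """One provider-emitted piece of reasoning, tagged with its kind."""
    text: str
    kind: str = "reasoning"  # e.g. "reasoning" or "summary"


@dataclass
class ReasoningContent:
    segments: list[ReasoningSegment]

    def flattened(self) -> str:
        # Lossy view: equivalent to what a flat `str | None` field stores.
        return "".join(s.text for s in self.segments)


rc = ReasoningContent(segments=[
    ReasoningSegment("step one. "),
    ReasoningSegment("overall plan", kind="summary"),
])
# Structure is preserved for the Responses layer...
assert [s.kind for s in rc.segments] == ["reasoning", "summary"]
# ...while flattening loses the segment boundaries:
assert rc.flattened() == "step one. overall plan"
```

The flat string is the degenerate case of this type, so migrating to it later would be backwards compatible with any code that only calls `flattened()`.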
reasoning can be enabled via the request's reasoning parameter, and the reasoning comes back to the user as an output message w/ a required(!) summary field and optional content / encrypted_content fields.
a reasonable and simple path forward:
someone will come along later and fill out the provider implementations (3). in the meantime, we give users confidence that we're doing what they request.
@mattf @cdoern Made some changes while trying to keep things minimal and not break anything. This is WIP, tested with Ollama for now.

**Main Ideas**

**A Confusing Inconsistency:**
cdoern
left a comment
this is moving in the right direction! some questions:
Force-pushed 3f03d95 to 5195cba (compare)
- Move reasoning fallback from router to Responses layer so it's
testable in unit tests. When provider raises NotImplementedError,
log critical warning and fall back to regular CC instead of crashing.
- Add openai_chat_completions_with_reasoning to Bedrock adapter
- Add tests: supported provider uses reasoning path, unsupported
provider falls back gracefully
- Router now passes through directly — Responses layer owns the
fallback logic
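The fallback described in this commit message might look roughly like the following. `openai_chat_completions_with_reasoning` is the method name from the commit above; the surrounding helper and the plain `chat_completions` fallback name are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def chat_with_reasoning_fallback(provider, params: dict) -> dict:
    """Try the reasoning-aware path; if the provider does not implement
    it, log loudly and fall back to a regular chat completion instead
    of crashing the request."""
    try:
        return provider.openai_chat_completions_with_reasoning(params)
    except NotImplementedError:
        logger.critical(
            "provider %s does not support reasoning; falling back to "
            "regular chat completions",
            getattr(provider, "name", provider),
        )
        return provider.chat_completions(params)
```

Keeping this in the Responses layer rather than the router means the fallback decision is unit-testable without standing up routing.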
The current LlamaStack client deserializes response output as dicts, not typed objects. Use a _get_attr helper for dict-compatible assertions so tests work with both the current client (dicts) and the OpenAI client (typed objects). Remove stray pdb breakpoint.
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…breaking fallback
The router mutates params.model (strips provider prefix like openai/). When reasoning fallback triggers, the mutated params can't be routed again. Pass a copy to the reasoning method so the original stays intact.
When provider doesn't support reasoning and falls back to regular CC, clear reasoning_effort from params — providers like OpenAI's gpt-4o reject unrecognized reasoning_effort parameter with 400 error.
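The two commit messages above both amount to never handing the fallback path the original params object. A dict-based sketch (the real params are a typed model; the helper name is hypothetical):

```python
from copy import deepcopy


def prepare_fallback_params(params: dict) -> dict:
    """Copy params so router mutations (e.g. stripping an 'openai/'
    model prefix) don't corrupt the caller's object, and drop
    reasoning_effort, which non-reasoning models reject with a 400."""
    safe = deepcopy(params)
    safe.pop("reasoning_effort", None)
    return safe
```

A defensive copy at the boundary is cheaper to reason about than tracking which downstream layer mutates which field.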
Force-pushed 525222c to af47f5f (compare)
Just seeing this after I pushed. I was overwriting with
cdoern
left a comment
There was a problem hiding this comment.
I think there may be a better way to fix these mypy errs but I could be wrong
```diff
 for msg in params.messages:
     if isinstance(msg, AssistantMessageWithReasoning) and msg.reasoning_content:
-        msg.reasoning = msg.reasoning_content
+        msg.reasoning = msg.reasoning_content  # type: ignore[attr-defined]
```
is this the right solution?
this one is still here, any way we can fix this rather than ignoring the err?
please run pre-commit before pushing to the PR if possible 🙏
## What does this PR do?

Closed and re-opened 5087.

Adds end-to-end reasoning output support for LlamaStack's Responses API endpoint, enabling reasoning models (e.g. gpt-oss via Ollama/vLLM) to propagate their chain-of-thought reasoning through the LlamaStack Responses API pipeline.

## Test Plan

### 1. BFCL Evals

**1.1 GPT-OSS-120B:** vllm-chat-completions (1), llamastack-chat-completions (2.1) and llamastack-responses (3.1) are now equivalent.

**1.2 GPT-OSS-20B:** similarly, we see that Row 1.2 and Row 2.2 are equivalent, meaning llamastack-responses itself brings no regression.

More details of the above table:

### 2. Manual verification with Ollama and LlamaStack -> Ollama on gpt-oss:20b

How: checked that response.output has reasoning objects in the right order (source of truth being what they look like in the ollama and openai providers directly), and exercised llamastack + ollama with an MCP server, where tool orchestration happens on the LlamaStack server side.

### 3. Server-side MCP tool orchestration (OpenAI vs LlamaStack comparison)

Compared end-to-end outputs produced by OpenAI vs LlamaStack -> OpenAI on gpt-5-mini. Manually compared the response output structure when using function calls and MCP tool calls in a multi-turn scenario.

Output structure comparison, OpenAI vs LlamaStack on gpt-5-mini:

| Turn | OpenAI | LlamaStack |
| --- | --- | --- |
| T1 | [McpListTools, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, OutputMessage] |
| T2 | [ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] |

* Minor pre-existing difference: LlamaStack re-emits McpListTools on every request; OpenAI only emits it on T1. This is existing MCP behavior, not related to reasoning changes.

### BFCL Evals Setup

Used this guide for setting up the evaluation, then tested with vllm v0.17 and ollama 0.6.1.