feat: reasoning output responses api #5206
robinnarsinghranabhat wants to merge 29 commits into llamastack:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged. @robinnarsinghranabhat please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
✱ Stainless preview builds
This PR will update the SDKs below. Edit this comment to update it; it will appear in the SDK's changelogs.
✅ llama-stack-client-go (studio · conflict)
✅ llama-stack-client-python (studio · code · diff)
✅ llama-stack-client-node (studio · conflict)
✅ llama-stack-client-openapi (studio · code · diff)
This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Force-pushed 776cb1f to 8ad6647 (compare)
✅ Recordings committed successfully. Recordings from the integration tests have been committed to this PR.
Force-pushed 8ad6647 to a3337d1 (compare)
cdoern
left a comment
please use the Ollama-reasoning suite to add integration tests for this; see 87dc40b#diff-40d732a0defb244aec12e21fbd9cd387cbf212732f269549475db8de3877480c for more details on how I added some for the inference API previously.
I added tests to the ollama-reasoning suite with As I had only tested with Sorry, I don't understand llamastack CI better.
mattf
left a comment
@robinnarsinghranabhat please directly include details of how you ran bfcl
mattf
left a comment
given that reasoning is not part of the chat completions api, but it is part of the responses api, and we implement our responses api atop our chat completions api,
we need an internal standard for how to propagate reasoning content that is not exposed in our public chat completions api:
- pick a name: magic_toc_tokens
- require chat providers to populate magic_toc_tokens when appropriate
- detect the magic_toc_tokens field in the responses impl and convert it to Reasoning output
- ensure we do not leak magic_toc_tokens to users
(1) is going to become an implementation gap between providers, e.g. how do you get cot tokens from the openai provider's chat api? we'll probably have to move to responses.
(1) is going to be hard work on provider adapters, e.g. vllm configured w/o a reasoning parser will return model specific cot tokens in the response or different versions of vllm will put the reasoning content in different response fields
this pr puts the adapter-specific reasoning parsing into the responses adapter and declares only a partial implementation. if it were to complete the implementation, it would have a web of provider-specific code in the responses impl and would become unmaintainable.
as written, this pr puts us on an unmaintainable path.
some other ideas -
- add stack_chat_completions_with_reasoning to the Inference contract, for internal use only by the responses implementation
- add responses to the Inference contract for providers who can implement it. care will be needed here to ensure the provider responses loop does not execute any tools and no credentials are passed along.
**Score:** 83.1% · **Issues:** 48 · **Missing:** 20

#### `/chat/completions`
@cdoern please confirm that this throws an error for novel outputs. for instance, is an error raised if the spec says fields x, y, z are to be returned and we return x, y, z & p?
```python
class OpenAIChatCompletionResponseMessage(BaseModel):
    """An assistant message returned in a chat completion response."""

    model_config = ConfigDict(extra="allow")
```
To build the next_turn_messages for the next round of calling chat-completions.
@mattf Really appreciate this thorough review!
I agree with this as a long-term plan. As long as we stick to chat-completions, I see a need to standardize message conversion as well between responses and chat-completions.
But I notice that the current responses adapter already expects such a field. I inferred this as a contract where the provider-specific streaming-cc implementation is responsible for populating it.
Updated the description.
good catch. i'd call that an oops that needs to be resolved. as implemented it means users will silently get different levels of service.
@mattf Maybe it was a mistake, but isn't the current implementation implicitly doing what you suggested, with the name of choice being that field?
I am not sure if this PR should be closed then. Any ideas on where we are with prioritization on defining it?
cdoern
left a comment
@robinnarsinghranabhat, take a look at the check labeled Stainless SDK Builds / run-integration-tests / Integration Tests. These tests generate a NEW client based on your changes and run the entire suite. It is ok if some of the regular integration tests fail, as long as their stainless equivalents pass, if the issue is the client.
cdoern
left a comment
I agree with @mattf: basically this impl is backwards. The API should not have specific handling per provider; the API needs to have contracts that each provider implements differently.
specific issues:
- `model_config = ConfigDict(extra="allow")` on `OpenAIAssistantMessageParam` (src/llama_stack_api/inference/models.py:636): this opens up the assistant message model to accept any arbitrary field, which is a sledgehammer approach just to smuggle a reasoning field through. It bypasses Pydantic validation and could let malformed data through silently.
- Reasoning is stuffed into Chat Completions types via setattr/getattr hacks (streaming.py:693, utils.py:321-325): the code does things like `msg.reasoning = reasoning` and `getattr(choice.message, "reasoning", None)` on types that don't have a `reasoning` field. This only works because of the `extra="allow"` hack above. It's an untyped, informal contract: nothing enforces it, nothing documents it at the type level.
- Provider-specific reasoning parsing lives in the Responses layer (streaming.py:578-590): this means each new provider's quirks will need to be handled here, in the wrong layer.
- `_get_preceding_reasoning` is fragile (utils.py:424-433): it only looks at the single item immediately before the current one. If the input ordering ever changes, or if there are multiple reasoning items, this silently drops reasoning content.
- `ChatCompletionResult.reasoning` is a flat `str | None` (types.py:71): reasoning content from providers can be structured (multiple segments, summaries, etc.), but this flattens it all into a single concatenated string, losing structure.
- Partial provider coverage: the PR only handles Ollama/vLLM-style reasoning. OpenAI's own reasoning, Gemini's tags, and other providers are explicitly not covered, making this a partial implementation that will need the same pattern repeated per provider.
These all tie back to Matt's core point: reasoning extraction should be a provider-level concern with a well-typed internal contract, not ad-hoc field smuggling through the Responses layer.
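On the flat `str | None` point specifically, a structured internal type preserves segment boundaries while still offering a flattened view for callers that only want a string. A minimal sketch, with hypothetical names (not the PR's types):

```python
from dataclasses import dataclass


@dataclass
class ReasoningSegment:
    """One provider-emitted piece of reasoning, tagged with its kind."""
    text: str
    kind: str = "reasoning"  # e.g. "reasoning" or "summary"


@dataclass
class ReasoningContent:
    segments: list[ReasoningSegment]

    def flattened(self) -> str:
        # Lossy view: equivalent to what a flat `str | None` field stores.
        return "".join(s.text for s in self.segments)


rc = ReasoningContent(segments=[
    ReasoningSegment("step one. "),
    ReasoningSegment("overall plan", kind="summary"),
])
# Structure is preserved for the Responses layer...
assert [s.kind for s in rc.segments] == ["reasoning", "summary"]
# ...while flattening loses the segment boundaries:
assert rc.flattened() == "step one. overall plan"
```

The flat string is the degenerate case of this type, so migrating to it later would be backwards compatible with any code that only calls `flattened()`.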
reasoning can be enabled via the request's reasoning parameter, and the reasoning comes back to the user as an output message w/ a required(!) summary field and optional content / encrypted_content fields.
a reasonable and simple path forward:
someone will come along later and fill out the provider implementations (3). in the meantime, we give users confidence that we're doing what they request.
@mattf @cdoern Made some changes while trying to keep things minimal and not break anything. This is WIP, tested with Ollama for now.

**Main Ideas**

**A Confusing Inconsistency:**
cdoern
left a comment
this is moving in the right direction! some questions:
Force-pushed 3f03d95 to 5195cba (compare)
- Move reasoning fallback from router to Responses layer so it's
testable in unit tests. When provider raises NotImplementedError,
log critical warning and fall back to regular CC instead of crashing.
- Add openai_chat_completions_with_reasoning to Bedrock adapter
- Add tests: supported provider uses reasoning path, unsupported
provider falls back gracefully
- Router now passes through directly — Responses layer owns the
fallback logic
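The fallback described in this commit message might look roughly like the following. `openai_chat_completions_with_reasoning` is the method name from the commit above; the surrounding helper and the plain `chat_completions` fallback name are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def chat_with_reasoning_fallback(provider, params: dict) -> dict:
    """Try the reasoning-aware path; if the provider does not implement
    it, log loudly and fall back to a regular chat completion instead
    of crashing the request."""
    try:
        return provider.openai_chat_completions_with_reasoning(params)
    except NotImplementedError:
        logger.critical(
            "provider %s does not support reasoning; falling back to "
            "regular chat completions",
            getattr(provider, "name", provider),
        )
        return provider.chat_completions(params)
```

Keeping this in the Responses layer rather than the router means the fallback decision is unit-testable without standing up routing.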
The current LlamaStack client deserializes response output as dicts, not typed objects. Use a _get_attr helper for dict-compatible assertions so tests work with both the current client (dicts) and the OpenAI client (typed objects). Remove stray pdb breakpoint.
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…breaking fallback
The router mutates params.model (strips provider prefix like openai/). When reasoning fallback triggers, the mutated params can't be routed again. Pass a copy to the reasoning method so the original stays intact.
When provider doesn't support reasoning and falls back to regular CC, clear reasoning_effort from params — providers like OpenAI's gpt-4o reject unrecognized reasoning_effort parameter with 400 error.
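The two commit messages above both amount to never handing the fallback path the original params object. A dict-based sketch (the real params are a typed model; the helper name is hypothetical):

```python
from copy import deepcopy


def prepare_fallback_params(params: dict) -> dict:
    """Copy params so router mutations (e.g. stripping an 'openai/'
    model prefix) don't corrupt the caller's object, and drop
    reasoning_effort, which non-reasoning models reject with a 400."""
    safe = deepcopy(params)
    safe.pop("reasoning_effort", None)
    return safe
```

A defensive copy at the boundary is cheaper to reason about than tracking which downstream layer mutates which field.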
Force-pushed 525222c to af47f5f (compare)
Just seeing this after I pushed. I was overwriting with
cdoern
left a comment
There was a problem hiding this comment.
I think there may be a better way to fix these mypy errs but I could be wrong
```diff
 for msg in params.messages:
     if isinstance(msg, AssistantMessageWithReasoning) and msg.reasoning_content:
-        msg.reasoning = msg.reasoning_content
+        msg.reasoning = msg.reasoning_content  # type: ignore[attr-defined]
```
is this the right solution?
this one is still here, any way we can fix this rather than ignoring the err?
please run pre-commit before pushing to the PR if possible 🙏
## What does this PR do?

Closed and re-opened 5087.

Adds end-to-end reasoning output support for LlamaStack's Responses API endpoint, enabling reasoning models (e.g. gpt-oss via Ollama/vLLM) to propagate their chain-of-thought reasoning through the LlamaStack Responses API pipeline.

## Test Plan

### 1. BFCL Evals

**1.1 GPT-OSS-120B:** vllm-chat-completions (1), llamastack-chat-completions (2.1) and llamastack-responses (3.1) are now equivalent.

**1.2 GPT-OSS-20B:** similarly, we see that Row 1.2 and Row 2.2 are equivalent, meaning llamastack-responses itself brings no regression.

More details of the above table:

### 2. Manual verification with Ollama and LlamaStack -> Ollama on gpt-oss:20b

How: checked that response.output has reasoning objects in the right order (source of truth being what they look like in the ollama and openai providers directly), and exercised llamastack + ollama with an MCP server, where tool orchestration happens on the LlamaStack server side.

### 3. Server-side MCP tool orchestration (OpenAI vs LlamaStack comparison)

Compared end-to-end outputs produced by OpenAI vs LlamaStack -> OpenAI on gpt-5-mini. Manually compared the response output structure when using function calls and MCP tool calls in a multi-turn scenario.

Output structure comparison, OpenAI vs LlamaStack on gpt-5-mini:

| Turn | OpenAI | LlamaStack |
| --- | --- | --- |
| T1 | [McpListTools, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, OutputMessage] |
| T2 | [ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] | [McpListTools, ReasoningItem, McpCall, McpCall, ReasoningItem, OutputMessage] |

* Minor pre-existing difference: LlamaStack re-emits McpListTools on every request; OpenAI only emits it on T1. This is existing MCP behavior, not related to reasoning changes.

### BFCL Evals Setup

Used this guide for setting up the evaluation, then tested with vllm v0.17 and ollama 0.6.1.