[Harmony] Fix analysis-channel tool calls and preserve reasoning across turns #35907
will-deines wants to merge 6 commits into vllm-project:main
Conversation
Code Review
This pull request addresses two bugs in GPT-OSS Harmony model handling. The first fix correctly routes tool calls made on the analysis channel by parsing them as ResponseFunctionToolCall objects, making the completed-message parser consistent with the streaming and in-progress paths. The second fix prevents the loss of reasoning context in multi-turn conversations by disabling the openai_harmony encoder's analysis-message dropping, allowing vLLM's own selective filtering to function correctly. No high- or critical-severity security issues were identified during the audit. The changes are well implemented and covered by targeted unit tests, and they improve the robustness and performance of Harmony-based models.
Force-pushed from 3c7c674 to d6013c5
bbrowning left a comment
I gave this a once-over and it looks good to me. The fix to consider both the analysis and commentary channels for tool calls matches real-world conditions and aligns with what we already do in the streaming case. It's also safe to stop having the openai/harmony library auto-drop analysis messages, because we already take care of that internally, and the harmony library was inconsistent in how it applied that auto-dropping.
A coworker of mine was running some BFCL multi-turn test suites against gpt-oss models and vLLM's Responses API implementation and hit the exact bug this fixes: an unexpected McpCall being returned.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 84ce3814c4
Thanks for flagging this @bbrowning and @chatgpt-codex-connector — good catch. With the encoder no longer stripping, the Responses path had no analysis filtering, so stale chain-of-thought from completed turns would accumulate in multi-turn conversations. Fix: replaced the no-op block with auto_drop_analysis_messages().
Hi @will-deines, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then commit the changes and push to your branch.
1 similar comment
Force-pushed from 9a5721b to 32554be
…ss turns

Backport of PR vllm-project#35907 fixes:
- Widen channel check to accept function calls on analysis channel
- Disable encoder-side auto_drop_analysis to prevent double-filtering
- Replace no-op loop with auto_drop_analysis_messages() in serving.py
@bbrowning We can see vLLM Responses catching up towards vLLM-CC. I expect vLLM-CC (row 1) to be slightly better in an apples-to-apples comparison, as it supports reasoning. I will also note that there is still another error popping up (very infrequently) while running BFCL-eval on vLLM Responses; I will raise a separate issue for it.
@robinnarsinghranabhat Thank you for reporting! I'm interested in your comment "...I expect vLLM-CC (row 1) to be slightly better in an apples-to-apples comparison as it supports reasoning". Can you help me understand this better? Also, can you help me understand the other issue you're raising? I have run into a problem where control tokens leak into tool calls, and I have another PR that sanitizes them, which is working for me in production. If that's the issue, you can try out PR #35906.
…ble Responses API tool_choice=required

Three fixes on top of cherry-picked upstream PR vllm-project#33306:
1. EBNF grammar: tool_block now accepts both commentary and analysis channels, matching GPT-OSS behavior found in our PR vllm-project#35907.
2. adjust_request: handle both ChatCompletion and Responses API tool formats; guard response_format access for ResponsesRequest.
3. Responses API: remove NotImplementedError guard, add adjust_request call in _make_request_with_harmony so the EBNF grammar flows through to sampling params.
I noticed that for gpt-oss models, vLLM Responses doesn't render the reasoning fields in the input prompts (that
To be clear, this error of However, there is another problem of In the third row, I stripped those
Force-pushed from 59fb2ef to 9cc9ce2
Traveling, so I can't take a deep look right now, but based on those numbers we should check for regressions in how we prompt these models with Chat Completions. They are extremely sensitive to the prompting exactly following the expected Harmony formatting.
…ss turns

Two fixes for GPT-OSS Harmony model behavior:
1. Accept function calls on the analysis channel in harmony_to_response_output() to match the streaming/in-progress parsers that already handle both channels.
2. Disable the openai_harmony encoder's auto_drop_analysis to prevent double-filtering with vLLM's auto_drop_analysis_messages(), preserving reasoning context between tool-calling turns.

Signed-off-by: Will Deines <will@garr.io>
The render_for_completion change (auto_drop_analysis=False) disabled the Harmony encoder's built-in analysis stripping so that vLLM could handle it explicitly via auto_drop_analysis_messages(). The Chat Completions path already called this function, but the Responses path did not — it had a no-op slice/reappend block that was relying on the encoder to strip analysis at render time. With the encoder no longer stripping, the Responses path had zero analysis filtering, causing stale chain-of-thought from completed turns to accumulate across multi-turn conversations (bloating prompt tokens and potentially altering model behavior).

Replace the no-op block with auto_drop_analysis_messages(), which drops analysis messages from completed turns (before the last assistant final message) while preserving analysis from the current in-progress turn that the model needs for context. This makes both API paths consistent.

Signed-off-by: Will Deines <will@garr.io>
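The "no-op slice/reappend block" mentioned here can be illustrated with a minimal sketch. The exact original code is not shown in this thread, so the shape below is an assumption, just to show why such a cycle leaves the message list unchanged:

```python
# Illustration (assumed shape, not the actual vLLM code) of why a
# slice-delete-reappend cycle leaves the message list unchanged.
messages = ["system", "analysis-1", "final-1", "user-2"]

tail = messages[2:]    # slice off a tail segment...
del messages[2:]       # ...delete it from the list...
messages.extend(tail)  # ...then re-append it unchanged: a no-op

print(messages)  # the list is exactly what we started with
```

Because the block never filtered anything itself, the path silently depended on the encoder's render-time stripping, which this PR disables.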
Fixes ruff-format pre-commit failure (missing two blank lines between TestRenderForCompletion and TestResponseInputToHarmonyReasoningItem). Signed-off-by: Will Deines <will@garr.io>
The triggered_tags sub-dispatch grammar is fully permissive (all tokens
allowed between triggers). The model sometimes outputs a preamble message
<|channel|>commentary<|message|> with the tool call tokens as content:
<|channel|>commentary<|message|><|channel|>commentary to=functions.X<|message|>{args}<|end|>
harmony_to_response_output then produces a ResponseOutputMessage with
the raw channel tokens as text, failing tool_choice=required assertions.
Add _try_extract_embedded_function_call() to detect this pattern in
_parse_message_no_recipient: if a preamble's content starts with a
function-call channel prefix, re-parse it as ResponseFunctionToolCall.
Signed-off-by: Will Deines <will@garr.io>
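The detection described in this commit message can be sketched roughly as follows. The helper name, regex, and exact token format here are assumptions for illustration, not the actual vLLM implementation:

```python
import json
import re

# Hypothetical sketch: detect a preamble whose content is really a raw
# function-call channel sequence, and extract the call name and arguments.
EMBEDDED_CALL_RE = re.compile(
    r"^<\|channel\|>(?:commentary|analysis) to=functions\.(?P<name>[\w.-]+)"
    r"<\|message\|>(?P<args>.*?)(?:<\|end\|>)?$",
    re.DOTALL,
)

def try_extract_embedded_function_call(content: str):
    """Return (name, args) if the preamble content embeds a tool call, else None."""
    match = EMBEDDED_CALL_RE.match(content.strip())
    if match is None:
        return None
    name, raw_args = match.group("name"), match.group("args")
    try:
        json.loads(raw_args)  # arguments must be well-formed JSON
    except json.JSONDecodeError:
        return None
    return name, raw_args

sample = '<|channel|>commentary to=functions.get_weather<|message|>{"city": "Oslo"}<|end|>'
print(try_extract_embedded_function_call(sample))  # ('get_weather', '{"city": "Oslo"}')
print(try_extract_embedded_function_call("just a plain preamble"))  # None
```

On a match, the real fix re-parses the content into a ResponseFunctionToolCall instead of emitting it as message text.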
Force-pushed from 9cc9ce2 to bdbf625
…ol-calls

Signed-off-by: Will Deines <will@garr.io>
Motivation
GPT-OSS Harmony models exhibit three behaviors that stock vLLM handles incorrectly, causing silent misrouting of tool calls and loss of reasoning context in multi-turn tool-calling conversations.
Bug 1: Tool calls on the analysis channel are silently misrouted

GPT-OSS models sometimes emit function calls on the `analysis` channel instead of `commentary`. The completed-message parser (`harmony_to_response_output`) only accepted function calls on `commentary`, so analysis-channel function calls fell through to `_parse_mcp_call()`, producing incorrect MCP call output items instead of function tool calls.

This was an inconsistency: `parser_state_to_response_output()` (streaming) and the in-progress parser already accepted function calls on both channels. Only the completed-message path was missing the widened check.

Bug 2: Reasoning lost between tool-calling turns

The `openai_harmony` library defaults to `auto_drop_analysis=True` when rendering conversations for completion, stripping all analysis messages. vLLM already has its own `auto_drop_analysis_messages()` that selectively drops prior-turn analysis while preserving current-turn reasoning. The encoder's blanket drop on top of vLLM's selective drop caused double-filtering that destroyed the model's reasoning context between tool-calling turns.

Additionally, the Responses API path had a no-op slice-delete-reappend cycle in `serving.py` that was relying on the encoder to strip analysis at render time. With the encoder's `auto_drop_analysis` disabled (Fix 2), the Responses path had zero analysis filtering, causing stale chain-of-thought from completed turns to accumulate across multi-turn conversations.

Bug 3: Embedded function calls in preamble content

The `triggered_tags` sub-dispatch grammar is fully permissive (all tokens allowed between triggers). The model sometimes outputs a preamble message `<|channel|>commentary<|message|>` whose content contains the raw function-call channel tokens. `harmony_to_response_output` then produces a `ResponseOutputMessage` with the raw channel tokens as text content, rather than a `ResponseFunctionToolCall`. This causes `tool_choice=required` assertions to fail and corrupts the output.

Changes

Fix 1: Accept function calls on analysis channel (`harmony.py`)

Widened the channel check in `harmony_to_response_output()` from `== "commentary"` to `in ("commentary", "analysis")`, making the completed-message parser consistent with the streaming and in-progress paths.

Fix 2: Disable encoder-side analysis dropping (`harmony_utils.py`) + add Responses API filtering (`serving.py`)

Two-part fix:

1. Pass `RenderConversationConfig(auto_drop_analysis=False)` to the `openai_harmony` encoder in `render_for_completion()`. This prevents the encoder from double-dropping analysis messages that vLLM already selectively filters via `auto_drop_analysis_messages()`.
2. Replace the no-op slice-delete-reappend cycle in `serving.py` `_construct_input_messages_with_harmony()` with a proper call to `auto_drop_analysis_messages()`. This makes the Responses API path consistent with the Chat Completions path, which already called this function. Without this, disabling the encoder's auto-drop left the Responses path with zero analysis filtering.
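A minimal, self-contained sketch of the selective policy described above (simplified dict messages and a hypothetical helper name; vLLM's auto_drop_analysis_messages() operates on Harmony message objects, not dicts):

```python
# Simplified sketch of the selective-drop policy: drop analysis messages
# that precede the last assistant "final" message (completed turns), but
# keep analysis from the current in-progress turn. The dict message shape
# is an assumption for illustration, not vLLM's actual types.
def drop_prior_turn_analysis(messages: list[dict]) -> list[dict]:
    # Find the boundary of the last completed turn.
    last_final = max(
        (i for i, m in enumerate(messages)
         if m["role"] == "assistant" and m.get("channel") == "final"),
        default=-1,
    )
    return [
        m for i, m in enumerate(messages)
        if not (i <= last_final and m.get("channel") == "analysis")
    ]

convo = [
    {"role": "user", "channel": None, "content": "hi"},
    {"role": "assistant", "channel": "analysis", "content": "old reasoning"},
    {"role": "assistant", "channel": "final", "content": "hello"},
    {"role": "user", "channel": None, "content": "call a tool"},
    {"role": "assistant", "channel": "analysis", "content": "current reasoning"},
]
print([m["content"] for m in drop_prior_turn_analysis(convo)])
# ['hi', 'hello', 'call a tool', 'current reasoning']
```

The encoder's blanket drop, by contrast, would also remove "current reasoning", which is exactly the context the model needs for the in-progress tool-calling turn.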
finalmessage is intentional per the Harmony spec (confirmed in #35779 discussion). This fix does not change the dropping policy — it prevents double-filtering where the encoder's blanket drop destroys current-turn reasoning the model needs for tool-call context, before vLLM's selectiveauto_drop_analysis_messages()even runs.Fix 3: Extract embedded function calls from preamble content (
harmony.py)Add
_try_extract_embedded_function_call()to detect when a preamble's content starts with a function-call channel prefix (e.g.<|channel|>commentary to=functions.X<|message|>{args}). When detected, the raw content is re-parsed into a properResponseFunctionToolCallinstead of being emitted as a text message. Called from_parse_message_no_recipient()before falling back to_parse_final_message().Files Changed
vllm/entrypoints/openai/responses/harmony.py_try_extract_embedded_function_call()(Fix 3); call it from_parse_message_no_recipient()vllm/entrypoints/openai/parser/harmony_utils.pyauto_drop_analysisinrender_for_completion()(Fix 2, part 1)vllm/entrypoints/openai/responses/serving.pyauto_drop_analysis_messages()(Fix 2, part 2)tests/entrypoints/openai/responses/test_harmony_utils.pytest_analysis_with_function_recipient_creates_function_calltests/entrypoints/openai/parser/test_harmony_utils.pyTestRenderForCompletionwithtest_preserves_analysisandtest_preserves_reasoning_across_tool_turnsRelated Issues / PRs
auto_drop_analysis_messages()algorithm; rejected because prior-turn analysis drop is intentional per Harmony spec. Confirms our approach is correct: the algorithm is fine, but the encoder's blanketauto_drop_analysis=Trueis the problemResponseOutputMessages, not hidden reasoning. Our fix extends this: when a preamble's content is an embedded function call, extract it asResponseFunctionToolCallDesign Decisions
Why widen the channel check instead of fixing the model? The model's behavior of emitting tool calls on
analysisis valid per the Harmony protocol — the streaming and in-progress parsers already handle it. The completed-message parser was the only inconsistent path.Why disable
auto_drop_analysisat the encoder instead of removingauto_drop_analysis_messages()? vLLM'sauto_drop_analysis_messages()implements the correct selective dropping policy (only prior-turn analysis before afinalmessage), which is intentional per the Harmony spec (confirmed in [Bug]: Harmony models incorrectly drops prior-turn analysis channel in multi-turn conversations #35779 discussion; fix: preserve prior-turn analysis messages in Harmony multi-turn conversations #35826, which proposed changing the algorithm, was closed as not-a-bug). The encoder's blanketauto_drop_analysis=Trueis the problem — it fires before vLLM's selective drop, destroying current-turn reasoning the model needs for tool-call context. Disabling the encoder's drop preserves vLLM's intentional filtering while preventing the double-drop.Why re-parse embedded function calls instead of fixing the grammar? The
triggered_tagssub-dispatch grammar intentionally allows all tokens between triggers (free-text). Constraining it to prevent embedded channel sequences would require changes to the xgrammar structural tag specification. Re-parsing at the output layer is a surgical fix that handles the model's actual behavior without modifying the grammar contract. This extends the preamble visibility work in [Bugfix] Fix Harmony preamble visibility in Responses API #32114 — preambles are visible messages, but when their content is actually a function call, we should extract it.Test Plan
pytest tests/entrypoints/openai/parser/test_harmony_utils.py -v— 61 passedpytest tests/entrypoints/openai/responses/test_harmony_utils.py -v— 23 passed