### Your current environment
- vLLM version: 0.9.1 (built from source)
- Model: GLM-4.5-Air (AWQ 4-bit quantized)
- Hardware: NVIDIA GB10
- Flags: `--reasoning-parser glm45 --enable-reasoning`
### 🐛 Describe the bug

## Bug Description

The GLM-4.5 reasoning parser (`--reasoning-parser glm45`) fails to extract `reasoning_content` during streaming chat completions when no tools are included in the request. The `<think>` tags leak into the `content` field while `reasoning_content` remains `null`.
Observed Behavior
| Scenario | reasoning_content |
content |
|---|---|---|
| WITH tools in request | ✅ Correctly populated | ✅ Clean |
| WITHOUT tools in request | ❌ null |
❌ Contains <think>...</think> tags |
## Expected Behavior

Both scenarios should correctly populate `reasoning_content` with the thinking text and `content` with the final response.
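For concreteness, a sketch of what the stream should look like (SSE framing abbreviated, field values illustrative):

```
data: {"choices":[{"delta":{"reasoning_content":"The user asks for 2+2..."}}]}

data: {"choices":[{"delta":{"content":"2 + 2 = 4."}}]}
```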
## Root Cause

In `vllm/entrypoints/openai/serving_chat.py`, line 1034 passes `output.token_ids` directly to the reasoning parser without converting it using `as_list()`.

Line 1034 (BUG):

```python
elif self.reasoning_parser:
    delta_message = reasoning_parser.extract_reasoning_content_streaming(
        previous_text, current_text, delta_text,
        previous_token_ids, current_token_ids,
        output.token_ids,  # <-- raw GenericSequence (BUG!)
    )
```
## Comparison with the Working Code Path (Line 939, WITH Tools)

```python
elif tool_choice_auto and self.reasoning_parser:
    output_token_ids = as_list(output.token_ids)  # <-- correctly converted
    delta_message = reasoning_parser.extract_reasoning_content_streaming(
        previous_text, current_text, delta_text,
        previous_token_ids, current_token_ids,
        output_token_ids,  # <-- uses the converted list
    )
```
The type of `output.token_ids` is `GenericSequence[int]`, which may be backed by a NumPy array or another sequence type whose membership and equality checks behave differently from a Python list's.
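A minimal sketch of the kind of divergence (illustrative only; the token IDs are made up, and this is not the glm45 parser's actual code). Slice comparisons that return a single bool for a list return an element-wise boolean array for a NumPy-backed sequence, so code written against list semantics can take the wrong branch or raise:

```python
import numpy as np

token_list = [1001, 1002, 1003]      # plain Python list
token_arr = np.asarray(token_list)   # NumPy-backed "GenericSequence"

# List comparison yields one bool; NumPy yields an element-wise array.
print(token_list[-2:] == [1002, 1003])  # True
print(token_arr[-2:] == [1002, 1003])   # [ True  True]

# In a condition, the list behaves as expected; the array raises.
if token_list[-2:] == [1002, 1003]:
    print("list branch taken")
try:
    if token_arr[-2:] == [1002, 1003]:
        print("never reached")
except ValueError as exc:
    print(f"array comparison raised: {exc}")
```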
## Pattern Evidence

Every other similar code path uses `as_list()`:

- Lines 742-744: `current_token_ids = previous_token_ids + as_list(output.token_ids)`
- Line 746: `current_token_ids = as_list(output.token_ids)`
- Line 881: `output_token_ids = as_list(output.token_ids)`
- Line 939: `output_token_ids = as_list(output.token_ids)`
- Line 1083: `output_token_ids=as_list(output.token_ids)`

Line 1034 is the ONLY place that passes `output.token_ids` directly.
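For reference, the presumed contract of the `as_list()` helper (a sketch, not the actual vLLM source): pass lists through unchanged and materialize anything else into a plain Python list so callers get list semantics.

```python
from collections.abc import Sequence
from typing import TypeVar

T = TypeVar("T")

def as_list(seq: Sequence[T]) -> list[T]:
    # Sketch of the presumed behavior: copy non-list sequences
    # (NumPy array, array.array, ...) into a real list so that
    # `in` and `==` downstream behave like list operations.
    return seq if isinstance(seq, list) else list(seq)
```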
## Fix

One-line change at `serving_chat.py:1034`:

```diff
-    output.token_ids,
+    as_list(output.token_ids),
```
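For clarity, the branch after the change (the diff above applied to the snippet quoted earlier), mirroring the working tools path:

```python
elif self.reasoning_parser:
    delta_message = reasoning_parser.extract_reasoning_content_streaming(
        previous_text, current_text, delta_text,
        previous_token_ids, current_token_ids,
        as_list(output.token_ids),  # <-- now materialized to a plain list
    )
```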
## Reproduction

```bash
# Start server
python -m vllm.entrypoints.openai.api_server \
  --model GLM-4.5-Air \
  --reasoning-parser glm45 \
  --enable-reasoning
```

```bash
# Test streaming WITHOUT tools (fails)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.5-Air",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "stream": true
  }'
# Result: <think> tags appear in content, reasoning_content is null
```

```bash
# Test streaming WITH tools (works)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.5-Air",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "tools": [{"type": "function", "function": {"name": "noop", "parameters": {}}}],
    "tool_choice": "none",
    "stream": true
  }'
# Result: reasoning_content correctly populated
```
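To check the fix programmatically rather than eyeballing the SSE stream, a small client-side sketch (assumes the server above and the `openai` Python client; `reasoning_content` is a vLLM extension field on the delta, so it is read defensively):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="GLM-4.5-Air",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    stream=True,
)

saw_reasoning = False
for chunk in stream:
    delta = chunk.choices[0].delta
    # reasoning_content is a non-OpenAI extra field; may be absent.
    if getattr(delta, "reasoning_content", None):
        saw_reasoning = True
    if delta.content:
        # The bug's signature: <think> tags leaking into content.
        assert "<think>" not in delta.content, "<think> leaked into content"

assert saw_reasoning, "reasoning_content was never populated"
print("OK: reasoning streamed separately from content")
```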
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.