# [Bug]: GLM-4.5 reasoning parser streaming fails without tools in request - missing as_list() conversion #29763

@sygenaithanos

## Your current environment

- vLLM version: 0.9.1 (built from source)
- Model: GLM-4.5-Air (AWQ 4-bit quantized)
- Hardware: NVIDIA GB10
- Flags: `--reasoning-parser glm45 --enable-reasoning`

## 🐛 Describe the bug

The GLM-4.5 reasoning parser (`--reasoning-parser glm45`) fails to extract `reasoning_content` during streaming chat completions when no tools are included in the request. The `<think>` tags leak into the `content` field while `reasoning_content` remains `null`.

## Observed Behavior

| Scenario | `reasoning_content` | `content` |
| --- | --- | --- |
| WITH tools in request | ✅ Correctly populated | ✅ Clean |
| WITHOUT tools in request | ❌ `null` | ❌ Contains `<think>...</think>` tags |

## Expected Behavior

Both scenarios should populate `reasoning_content` with the thinking text and `content` with the final response.

## Root Cause

In `vllm/entrypoints/openai/serving_chat.py`, line 1034 passes `output.token_ids` directly to the reasoning parser without first converting it with `as_list()`.

Line 1034 (BUG):

```python
elif self.reasoning_parser:
    delta_message = reasoning_parser.extract_reasoning_content_streaming(
        previous_text, current_text, delta_text,
        previous_token_ids, current_token_ids,
        output.token_ids,  # <-- RAW GenericSequence (BUG!)
    )
```


## Comparison with working code path (line 939, WITH tools)

```python
elif tool_choice_auto and self.reasoning_parser:
    output_token_ids = as_list(output.token_ids)  # <-- correctly converted
    delta_message = reasoning_parser.extract_reasoning_content_streaming(
        previous_text, current_text, delta_text,
        previous_token_ids, current_token_ids,
        output_token_ids,  # <-- uses the converted list
    )
```
The type of `output.token_ids` is `GenericSequence[int]`, which may be backed by a NumPy array or another sequence type where operators such as `in` and `==` behave differently than they do on a Python list.
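
A minimal sketch of that divergence (the token ids here are hypothetical; which specific comparison trips up the GLM-4.5 parser depends on its internals):

```python
import numpy as np

tokens_list = [151667, 9906, 151668]   # hypothetical token ids
tokens_array = np.array(tokens_list)   # what output.token_ids may actually be

# Scalar membership happens to agree for both:
print(151668 in tokens_list)    # True
print(151668 in tokens_array)   # True

# Equality does not: a list comparison yields a single bool,
# a NumPy array comparison yields an elementwise boolean array.
print(tokens_list == [151667, 9906, 151668])    # True
print(tokens_array == [151667, 9906, 151668])   # [ True  True  True]

# Branching on the array comparison therefore raises an error.
try:
    if tokens_array == [151667, 9906, 151668]:
        pass
except ValueError as e:
    print(e)  # "The truth value of an array ... is ambiguous"
```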

## Pattern Evidence

Every other similar code path uses `as_list()`:

- Lines 742-744: `current_token_ids = previous_token_ids + as_list(output.token_ids)`
- Line 746: `current_token_ids = as_list(output.token_ids)`
- Line 881: `output_token_ids = as_list(output.token_ids)`
- Line 939: `output_token_ids = as_list(output.token_ids)`
- Line 1083: `output_token_ids=as_list(output.token_ids)`

Line 1034 is the ONLY place that passes `output.token_ids` directly.
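
For reference, a sketch of what a helper like `as_list()` needs to do here (the actual vLLM implementation may differ):

```python
def as_list(seq):
    """Materialize any sequence (e.g. a tuple or NumPy array) into a
    plain Python list so that `in`, `==`, and `+` follow list semantics."""
    return seq if isinstance(seq, list) else list(seq)
```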
## Fix

One-line change at `serving_chat.py:1034`:

```diff
- output.token_ids,
+ as_list(output.token_ids),
```
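
With the fix applied, the no-tools branch mirrors the tool-call branch shown above:

```python
elif self.reasoning_parser:
    delta_message = reasoning_parser.extract_reasoning_content_streaming(
        previous_text, current_text, delta_text,
        previous_token_ids, current_token_ids,
        as_list(output.token_ids),  # converted, same as the tool-call path
    )
```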


## Reproduction

```bash
# Start server
python -m vllm.entrypoints.openai.api_server \
  --model GLM-4.5-Air \
  --reasoning-parser glm45 \
  --enable-reasoning

# Test streaming WITHOUT tools (fails)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.5-Air",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "stream": true
  }'
# Result: <think> tags appear in content, reasoning_content is null

# Test streaming WITH tools (works)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-4.5-Air",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "tools": [{"type": "function", "function": {"name": "noop", "parameters": {}}}],
    "tool_choice": "none",
    "stream": true
  }'
# Result: reasoning_content correctly populated
```
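
For a quicker check, a small client-side script also works (a sketch; assumes the `openai` Python package and the server started above, and reads the non-standard `reasoning_content` field defensively):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="GLM-4.5-Air",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # reasoning_content is a vLLM extension, so read it defensively.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(f"REASONING: {reasoning!r}")
    if delta.content:
        print(f"CONTENT: {delta.content!r}")
```

On the buggy path, the `<think>` text streams under CONTENT; with the fix it should stream under REASONING instead.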

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
