Description
On HIP (gfx1151 / Radeon 8060S / Strix Halo) at commit c95dfca,
DFlash DDTree speculative decode produces corrupted output when
more than one tool is defined. The server falls back to AR decode
only in streaming mode, masking the issue for Hermes clients.
Non-streaming requests are completely broken with tools.
PFlash prefill compression destroys tool definitions at any
context size.
Steps to Reproduce
Server:
dflash_server Qwen3.6-27B-Q4_K_M.gguf \
--port 8010 --host 0.0.0.0 \
--draft dflash-draft-3.6-q4_k_m.gguf \
--ddtree --ddtree-budget 12 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--fa-window 2048 \
--prefix-cache-slots 32
Request (2+ tools, 6.5K context, non-streaming):
curl /v1/chat/completions -d '{
"model": "luce-dflash",
"messages": [
{"role":"system","content":"You are Hermes. Use tools.<6K pad>"},
{"role":"user","content":"Search for TODO in the code"}
],
"tools": [{bash, grep, read, ...}],
"max_tokens": 256,
"temperature": 0
}'
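For anyone who wants to script the reproduction, here is a minimal Python sketch of the same request. Several things in it are assumptions and not from the report: the base URL (localhost on the port given to the server), the three tool schemas (the report elides the real definitions as `{bash, grep, read, ...}`, so these are illustrative stand-ins), and the helper names. The padding text is left to the caller because the original pad is not shown. `classify_response` assumes the server returns the usual OpenAI-compatible response shape.

```python
import json
import urllib.request

# Assumption: the server is reachable locally on the --port used above.
BASE_URL = "http://localhost:8010"


def make_tool(name, description, param):
    """Build one OpenAI-style function tool (hypothetical schema)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {param: {"type": "string"}},
                "required": [param],
            },
        },
    }


def build_request(pad_text, stream=False, temperature=0):
    """Mirror the curl payload; pad_text stands in for the elided <6K pad>."""
    return {
        "model": "luce-dflash",
        "messages": [
            {"role": "system", "content": "You are Hermes. Use tools." + pad_text},
            {"role": "user", "content": "Search for TODO in the code"},
        ],
        # Stand-in definitions; the real tool schemas are not in the report.
        "tools": [
            make_tool("bash", "Run a shell command", "command"),
            make_tool("grep", "Search files for a pattern", "pattern"),
            make_tool("read", "Read a file", "path"),
        ],
        "max_tokens": 256,
        "temperature": temperature,
        "stream": stream,
    }


def classify_response(resp):
    """Label a non-streaming response as tool_calls, garbled, blank, or text."""
    message = resp["choices"][0].get("message") or {}
    content = message.get("content") or ""
    if message.get("tool_calls"):
        return "tool_calls"
    # Stray tool-call markup in plain content matches the corruption seen here.
    if "</arg_value>" in content or "<arg_key>" in content:
        return "garbled"
    if not content.strip():
        return "blank"
    return "text"


def send(payload):
    """POST a non-streaming request and return the parsed JSON body."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)
```

With the server running, `classify_response(send(build_request(pad_text)))` labels the outcome of one run. Repeating it with `temperature` above 0 exercises the AR path for comparison.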
Current Behavior
- Non-streaming + tools (spec-decode):
finish=stop, content garbled e.g. #1-1">grep</arg_value>
or blank content, never produces tool_calls.
- Non-streaming + tools (AR, temperature>0):
Works correctly, returns tool_calls. But no DFlash speedup.
- Streaming + tools: Works correctly but falls back to
[ar-decode] at ~10.9 tok/s instead of [spec-decode].
- PFlash (--prefill-compression auto): Compresses 6.5K->277
tokens (4.4% kept). Tool definitions in the system prompt
are lost, causing finish=stop instead of tool_calls.
Threshold setting appears not to be honored at moderate
context sizes.
Expected Behavior
- Spec-decode (temperature=0) should correctly generate
tool_calls with 2+ tools at all context sizes.
- PFlash should preserve tool definition content or not
trigger below the configured threshold.
- DFlash DDTree speedup should apply to tool-call requests.
Logs
Non-streaming spec-decode (broken):
[spec-decode] tokens=256 ... finish=stop
raw output: \n<arg_key>#1-1">grep</arg_value>...
Streaming (works, but AR only):
[ar-decode] tokens=26 speed=10.91 tok/s finish=tool_calls
PFlash:
[pflash] 6366 -> 272 -> 277 tokens (4.4% kept)
finish=stop
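To triage many runs, the log lines above can be parsed mechanically. The following is a rough Python sketch that assumes the log format is exactly as quoted here; the regexes and function names are illustrative and not part of the server. It extracts the decode path and finish reason, and recomputes the PFlash kept ratio from the first and last token counts. The report does not say what the middle number in the `[pflash]` line means, so it is ignored.

```python
import re

# Patterns derived only from the log lines quoted in this report.
DECODE_RE = re.compile(r"\[(spec-decode|ar-decode)\]\s+tokens=(\d+).*?finish=(\w+)")
PFLASH_RE = re.compile(r"\[pflash\]\s+(\d+)\s*->\s*(\d+)\s*->\s*(\d+)\s+tokens")


def parse_decode(line):
    """Return decode path, token count, and finish reason, or None."""
    m = DECODE_RE.search(line)
    if not m:
        return None
    return {"path": m.group(1), "tokens": int(m.group(2)), "finish": m.group(3)}


def pflash_kept_percent(line):
    """Return final/original token percentage from a [pflash] line, or None."""
    m = PFLASH_RE.search(line)
    if not m:
        return None
    original, _middle, final = (int(g) for g in m.groups())
    return 100.0 * final / original


def is_suspect(line):
    """Flag a spec-decode run that ended with finish=stop (the broken case here)."""
    info = parse_decode(line)
    return bool(info) and info["path"] == "spec-decode" and info["finish"] == "stop"
```

Applied to the PFlash line above, the ratio comes out at 277 / 6366, which rounds to the 4.4% the server reports.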
Environment
- GPU: Radeon 8060S (gfx1151), ROCm 7.2.3, 62GB VRAM/iGPU
- Model: Qwen3.6-27B-Q4_K_M.gguf
- Draft: dflash-draft-3.6-q4_k_m.gguf (GGUF, 1.0GB)
- Commit: c95dfca
- Docker: rocm/dev-ubuntu-24.04:7.2.3-complete
Related