DDTree spec-decode corrupts tool call output; PFlash destroys tool definitions #327

@auryn-macmillan

Description

On HIP (gfx1151 / Radeon 8060S / Strix Halo) at commit c95dfca,
DFlash DDTree speculative decode produces corrupted output when
two or more tools are defined. The server falls back to AR decode
in streaming mode only, which masks the issue for Hermes clients.
Non-streaming requests with tools never return tool_calls under
spec-decode. Separately, PFlash prefill compression drops the
tool definitions at any context size.

Steps to Reproduce

Server:
dflash_server Qwen3.6-27B-Q4_K_M.gguf \
  --port 8010 --host 0.0.0.0 \
  --draft dflash-draft-3.6-q4_k_m.gguf \
  --ddtree --ddtree-budget 12 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --fa-window 2048 \
  --prefix-cache-slots 32

Request (2+ tools, 6.5K context, non-streaming):
curl http://localhost:8010/v1/chat/completions -d '{
  "model": "luce-dflash",
  "messages": [
    {"role":"system","content":"You are Hermes. Use tools.<6K pad>"},
    {"role":"user","content":"Search for TODO in the code"}
  ],
  "tools": [{bash, grep, read, ...}],
  "max_tokens": 256,
  "temperature": 0
}'
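
For triage, it may help to script the request so the identical payload can be sent with different `stream` and `temperature` values. The following is a minimal sketch using only the Python standard library, not the reporter's actual client. Assumptions: the base URL `http://localhost:8010` is inferred from the server flags above, the endpoint accepts an OpenAI-style body, and the `bash` and `grep` tool schemas built by the hypothetical helper `make_tool` are placeholders, because the real tool definitions are elided in this report.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8010"  # assumption: inferred from --port 8010


def make_tool(name, description):
    """Build a placeholder OpenAI-style function tool (hypothetical schema)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {"input": {"type": "string"}},
                "required": ["input"],
            },
        },
    }


def build_payload(system_prompt, stream=False, temperature=0):
    """Build the chat request; only stream/temperature vary between runs."""
    return {
        "model": "luce-dflash",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Search for TODO in the code"},
        ],
        # Placeholder tools; the report only says "2+ tools".
        "tools": [
            make_tool("bash", "Run a shell command"),
            make_tool("grep", "Search files for a pattern"),
        ],
        "max_tokens": 256,
        "temperature": temperature,
        "stream": stream,
    }


def post_chat(payload, base_url=BASE_URL):
    """Send a non-streaming request and return the decoded JSON body."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `post_chat(build_payload(prompt))` would exercise the non-streaming path. `post_chat` does not parse streamed responses, and the system prompt still has to be padded to roughly 6K tokens by the caller to match the context size in this report.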

Current Behavior

  • Non-streaming + tools (spec-decode):
    finish=stop; content is garbled, e.g. #1-1">grep</arg_value>,
    or blank; tool_calls is never produced.

  • Non-streaming + tools (AR, temperature>0):
    Works correctly and returns tool_calls, but without the
    DFlash speedup.

  • Streaming + tools: Works correctly, but falls back to
    [ar-decode] at ~10.9 tok/s instead of [spec-decode].

  • PFlash (--prefill-compression auto): Compresses 6.5K->277
    tokens (4.4% kept). Tool definitions in the system prompt
    are lost, causing finish=stop instead of tool_calls.
    The threshold setting appears not to be honored at moderate
    context sizes.
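
The failure modes listed above can be told apart mechanically. The sketch below classifies one response choice, assuming the server returns an OpenAI-style `choices[i].message` object with optional `tool_calls` and `content` fields (an assumption, not confirmed from the server source). The marker list is likewise an assumption built from the `<arg_key>` / `</arg_value>` fragments seen in the logs of this report and may need extending.

```python
# Tool-call markup that should never appear in plain content.
# Assumption: derived from the fragments visible in this report's logs.
LEAK_MARKERS = ("<arg_key>", "</arg_key>", "<arg_value>", "</arg_value>")


def classify_choice(choice):
    """Return 'tool_calls', 'leaked_markup', 'blank', or 'text'."""
    message = choice.get("message") or {}
    if message.get("tool_calls"):
        return "tool_calls"
    content = message.get("content") or ""
    if any(marker in content for marker in LEAK_MARKERS):
        return "leaked_markup"
    if not content.strip():
        return "blank"
    return "text"
```

Under these assumptions, the broken non-streaming spec-decode case would show up as `leaked_markup` or `blank`, while the AR and streaming cases would report `tool_calls`.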

Expected Behavior

  • Spec-decode (temperature=0) should correctly generate
    tool_calls with 2+ tools at all context sizes.
  • PFlash should preserve tool definition content or not
    trigger below the configured threshold.
  • DFlash DDTree speedup should apply to tool-call requests.

Logs

Non-streaming spec-decode (broken):
[spec-decode] tokens=256 ... finish=stop
raw output: \n<arg_key>#1-1">grep</arg_value>...

Streaming (works, but AR only):
[ar-decode] tokens=26 speed=10.91 tok/s finish=tool_calls

PFlash:
[pflash] 6366 -> 272 -> 277 tokens (4.4% kept)
finish=stop
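
As a sanity check on the PFlash log line, the kept ratio can be recomputed from the token counts. This is a small sketch that assumes the line format shown above. The meaning of the middle number (272) is not documented in this report, so the helper only compares the first and last counts.

```python
import re

# Matches e.g. "[pflash] 6366 -> 272 -> 277 tokens (4.4% kept)".
PFLASH_RE = re.compile(r"\[pflash\]\s+(\d+)\s*->\s*(\d+)\s*->\s*(\d+)\s+tokens")


def pflash_kept_ratio(line):
    """Return final/original token ratio, or None if the line does not match."""
    match = PFLASH_RE.search(line)
    if match is None:
        return None
    original, _intermediate, final = (int(g) for g in match.groups())
    return final / original
```

For the line above, 277 / 6366 is about 4.35%, which agrees with the logged "4.4% kept", whereas 272 / 6366 would be about 4.27%. So the logged percentage appears to be computed from the final count.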

Environment

  • GPU: Radeon 8060S (gfx1151), ROCm 7.2.3, 62GB VRAM/iGPU
  • Model: Qwen3.6-27B-Q4_K_M.gguf
  • Draft: dflash-draft-3.6-q4_k_m.gguf (GGUF, 1.0GB)
  • Commit: c95dfca
  • Docker: rocm/dev-ubuntu-24.04:7.2.3-complete
