DDTree spec-decode corrupts tool call output; PFlash destroys tool definitions

## Description

On HIP (gfx1151 / Radeon 8060S / Strix Halo) at commit c95dfca,
DFlash DDTree speculative decode produces corrupted output when
two or more tools are defined. The server falls back to AR decode
in streaming mode only, which masks the issue for Hermes clients.
Non-streaming requests with tools at temperature=0 (the
spec-decode path) never return tool_calls.
Separately, PFlash prefill compression destroys tool definitions
at any context size.

### Steps to Reproduce

Server:
  dflash_server Qwen3.6-27B-Q4_K_M.gguf \
    --port 8010 --host 0.0.0.0 \
    --draft dflash-draft-3.6-q4_k_m.gguf \
    --ddtree --ddtree-budget 12 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --fa-window 2048 \
    --prefix-cache-slots 32

Request (2+ tools, 6.5K context, non-streaming):
  curl http://localhost:8010/v1/chat/completions \
    -H "Content-Type: application/json" -d '{
    "model": "luce-dflash",
    "messages": [
      {"role":"system","content":"You are Hermes. Use tools.<6K pad>"},
      {"role":"user","content":"Search for TODO in the code"}
    ],
    "tools": [{bash, grep, read, ...}],
    "max_tokens": 256,
    "temperature": 0
  }'
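
The request above can also be scripted so each run is classified automatically. This is a minimal sketch, not the project's own test harness. It assumes the server accepts OpenAI-style `function` tool entries and returns an OpenAI-style response (`choices[0].message.tool_calls`). The tool schemas, the `x` padding, the helper names, and the `localhost:8010` URL are all hypothetical stand-ins, since the issue elides the real tool definitions and system prompt.

```python
import json
import urllib.request


def make_tool(name, description):
    """Build a placeholder OpenAI-style tool entry (hypothetical schema)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {"input": {"type": "string"}},
                "required": ["input"],
            },
        },
    }


def build_payload(pad_chars=0, stream=False, temperature=0):
    """Mirror the request in this issue; pad_chars stands in for '<6K pad>'."""
    system = "You are Hermes. Use tools." + "x" * pad_chars
    return {
        "model": "luce-dflash",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": "Search for TODO in the code"},
        ],
        "tools": [
            make_tool("bash", "Run a shell command"),
            make_tool("grep", "Search files for a pattern"),
            make_tool("read", "Read a file"),
        ],
        "max_tokens": 256,
        "temperature": temperature,
        "stream": stream,
    }


def classify(response):
    """Label a non-streaming response as tool_calls, leaked_markup, or no_tool_call."""
    message = response["choices"][0].get("message", {})
    if message.get("tool_calls"):
        return "tool_calls"
    content = message.get("content") or ""
    # Tool-call markup showing up in plain content matches the garbled output above.
    if "<arg_key>" in content or "</arg_value>" in content:
        return "leaked_markup"
    return "no_tool_call"


def post(payload, url="http://localhost:8010/v1/chat/completions"):
    """Send the payload to the server (URL is an assumption) and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the server running, `classify(post(build_payload(pad_chars=24000)))` would be expected to return `"tool_calls"` once the bug is fixed; the padding length needed to reach roughly 6.5K tokens depends on the tokenizer and is a guess here.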

### Current Behavior

- **Non-streaming + tools (spec-decode)**:
  finish=stop; content is garbled (e.g. `#1-1">grep</arg_value>`)
  or blank, and tool_calls is never produced.

- **Non-streaming + tools (AR, temperature>0)**:
  Works correctly and returns tool_calls, but without the
  DFlash speedup.

- **Streaming + tools**: Works correctly, but falls back to
  [ar-decode] at ~10.9 tok/s instead of [spec-decode].

- **PFlash (--prefill-compression auto)**: Compresses the
  6.5K-token prompt to 277 tokens (4.4% kept). Tool definitions
  in the system prompt are lost, so the request ends with
  finish=stop instead of tool_calls. The threshold setting
  appears not to be honored at moderate context sizes.

### Expected Behavior

- Spec-decode (temperature=0) should correctly generate
  tool_calls with 2+ tools at all context sizes.
- PFlash should preserve tool definition content or not
  trigger below the configured threshold.
- DFlash DDTree speedup should apply to tool-call requests.

### Logs

Non-streaming spec-decode (broken):
  [spec-decode] tokens=256 ... finish=stop
  raw output: `\n<arg_key>#1-1">grep</arg_value>...`

Streaming (works, but AR only):
  [ar-decode] tokens=26 speed=10.91 tok/s finish=tool_calls

PFlash:
  [pflash] 6366 -> 272 -> 277 tokens (4.4% kept)
  finish=stop
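
The kept percentage in the PFlash line can be rechecked from the token counts. The sketch below parses a log line of the shape shown above and divides the final count by the original count; the regular expression and the function name are hypothetical, and the meaning of the middle number (272) is not stated in the log, so it is ignored.

```python
import re

# Matches lines shaped like: [pflash] 6366 -> 272 -> 277 tokens (4.4% kept)
PFLASH_RE = re.compile(r"\[pflash\]\s+(\d+)\s*->\s*(\d+)\s*->\s*(\d+)\s+tokens")


def kept_fraction(line):
    """Return final_tokens / original_tokens, or None if the line does not match."""
    match = PFLASH_RE.search(line)
    if match is None:
        return None
    original, _intermediate, final = (int(group) for group in match.groups())
    return final / original


line = "[pflash] 6366 -> 272 -> 277 tokens (4.4% kept)"
print(f"{kept_fraction(line):.1%}")  # prints 4.4%
```

277 / 6366 is about 0.0435, which agrees with the 4.4% reported by the server.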

### Environment

- GPU: Radeon 8060S (gfx1151), ROCm 7.2.3, 62GB VRAM/iGPU
- Model: Qwen3.6-27B-Q4_K_M.gguf
- Draft: dflash-draft-3.6-q4_k_m.gguf (GGUF, 1.0GB)
- Commit: c95dfca
- Docker: rocm/dev-ubuntu-24.04:7.2.3-complete

### Related

- #143 (closed, old python server)
- #229 (closed, OpenCode tool calls, "works in Hermes")
