GGUF DFlash speculative decoding for Qwen3.5/Qwen3.6 27B.
C++/CUDA runtime on top of ggml. Default path: Qwen3.6-27B Q4_K_M target + Lucebox Q4_K_M GGUF DFlash draft.
Qwen3.5 reference: 129.5 tok/s mean on HumanEval (3.43x vs AR); best demo run: 207.6 tok/s vs 38.0 tok/s AR (5.46x).
Blog post · Benchmarks · Discord · lucebox.com
AR (tok/s) DFlash (tok/s) Speedup
HumanEval 37.78 129.52 3.43x
Math500 37.71 110.51 2.93x
GSM8K 37.65 96.15 2.55x
DFlash is a speculative decoder. A small draft proposes multiple tokens, and the target verifies them in one forward pass. DDTree verifies a tree of candidates instead of a single chain, which improves acceptance length at the same target compute budget.
This repo provides the GGUF target path and runtime pieces needed to run that stack on consumer GPUs:
- C++/CUDA decode loop on top of ggml, without libllama or PyTorch at runtime.
- Qwen3.5/Qwen3.6
qwen35GGUF target support. - DFlash draft loading from GGUF or safetensors.
- DDTree verify with tree-aware SSM rollback kernels.
- TQ3_0 and asymmetric K/V cache quantization for long context.
The default setup fits on a 24 GB RTX 3090: ~16 GB Q4_K_M target, 1.84 GB GGUF draft, DDTree verify state, and KV cache.
Qwen3.5-27B Q4_K_M, concurrency=1, n_gen=256, 10 prompts/dataset:
| Task | AR tok/s | DFlash+DDTree tok/s | AL | Speedup |
|---|---|---|---|---|
| HumanEval | 37.78 | 129.52 | 8.31 | 3.43× |
| Math500 | 37.71 | 110.51 | 7.04 | 2.93× |
| GSM8K | 37.65 | 96.15 | 6.14 | 2.55× |
AR = autoregressive (test_generate). DFlash+DDTree = tree verify at budget=22 with fast rollback (test_dflash). AL = Acceptance Length, average committed tokens per draft/verify step. Reproduce via python3 scripts/bench_llm.py.
Up to 256K context on 24 GB via TQ3_0 KV cache (3.5 bpv, default; Q4_0 legacy path tops out near 128K) + sliding target_feat ring (4096 slots). TQ3 = ~9.7× memory saving vs F16; Q4_0 = 8×.
| Prompt length | KV | Prefill time | Decode tok/s (FA window=2048) |
|---|---|---|---|
| 520 (HE) | Q8_0 | 0.06 s | ~104 (window inactive) |
| 32K | Q8_0 | 38 s | ~95 (interp.) |
| 64K | Q4_0 | 126 s | 91 (PR #26) |
| 128K | TQ3_0 | ~10 min | ~85–95 (sliding active) |
Prefill numbers assume --max-ctx sized to the prompt (auto-fit in run.py / bench_llm.py). Oversizing — e.g. --max-ctx=131072 on a 32K prompt — triggers FA stride over unused KV and slows prefill ~27× at that ratio.
HE 10-prompt bench mean in 128K mode (ctx=131072, ddtree-budget=16, FA window=2048): 134.78 tok/s at AL 8.33.
Decode tok/s assume the default sliding-window flash attention (--fa-window 2048, lossless: 100% acceptance at all window sizes). Disable with --fa-window 0 for full attention; expect ~25 tok/s at 60K+. Tune the window via python3 scripts/run.py --fa-window N or --fa-window N on test_dflash/dflash_server (sweet spot 1024–2048; bigger windows trade speed for marginally tighter attention).
Set DFLASH27B_KV_TQ3=1 (TQ3_0, 3.5 bpv, default) or DFLASH27B_KV_Q4=1 (Q4_0, 4.5 bpv, legacy) to enable. Full sweep in RESULTS.md.
The cache now supports independent quantization types for keys and values, optimizing memory-asymmetric workloads. Set DFLASH27B_KV_K=<type> and DFLASH27B_KV_V=<type> via environment or CLI flags. Supported types (case-insensitive): f16, bf16, q4_0, q4_1, q5_0, q5_1, q8_0, tq3_0.
Supported (K, V) pairs:
- K ∈ {F16, BF16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0} × V ∈ {F16, BF16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, TQ3_0}
- K = TQ3_0 × V ∈ {F16, BF16, Q4_0, Q8_0, TQ3_0}
Unsupported pairs abort at allocation with a printed list. Precedence (high→low): per-axis _KV_K/_KV_V override legacy shorthand (--kv-tq3 / --kv-q4 / --kv-f16); legacy shorthand last-wins among themselves. Default: q8_0 for both.
Environment variables:
DFLASH27B_KV_K=q8_0 DFLASH27B_KV_V=q4_0 ./test_dflash …CLI flags on test_dflash / test_generate:
./test_dflash … -ctk q8_0 -ctv q4_0
./test_dflash … --cache-type-k=q8_0 --cache-type-v=q4_0Target layer-split harness on test_dflash:
./test_dflash target.gguf draft.safetensors prompt.bin 128 out.bin \
--target-gpus=1,2,3,4 --target-layer-split=1,1,1,1
./test_dflash target.gguf draft.safetensors prompt.bin 128 out.bin \
--draft-gpu=0 --target-gpus=0,1 --target-layer-split=1,3 \
--target-split-load-draft
./test_dflash target.gguf draft.gguf prompt.bin 128 out.bin \
--draft-gpu=0 --target-gpus=0,1 --target-layer-split=1,1 \
--target-split-dflashWhen more than one target GPU is listed, test_dflash runs a non-daemon
target-only layer-split harness. Each target GPU loads only its assigned
contiguous layer range; output_norm and output.weight stay on the last
target GPU. --target-split-load-draft additionally loads the DFlash draft
on --draft-gpu (.safetensors or quantized draft .gguf), captures target
features into a draft-side mirror, and runs a small draft forward smoke. This
allows capacity checks where the draft and a target layer range share one GPU
before serving integration. --target-split-dflash runs the same split target
placement through a chain DFlash decode loop and reports acceptance length.
CLI flags on scripts/run.py:
python3 scripts/run.py --ctk q8_0 --ctv q4_0 --prompt "hello"
python3 scripts/run.py --cache-type-k q8_0 --cache-type-v q4_0 --prompt "hello"TQ3 semantics under asymmetry: TQ3_0 is K-side-driven. The FWHT rotation is applied to the query and inverse-applied to the attention output only when K=TQ3_0. V's type is independent of rotation. The 256-stride context alignment is triggered if either K or V is TQ3_0.
Legacy --kv-tq3 / --kv-q4 / --kv-f16 flags continue to work as symmetric shorthand for backward compatibility.
Qwen3.6-27B is the default integration path. It uses the same qwen35 target architecture as Qwen3.5, and the docs use the Lucebox GGUF DFlash draft by default.
# 1. target
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
# 2. matched 3.6 draft (GGUF, used by default by scripts/run.py and dflash_server)
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q4_k_m.gguf --local-dir models/draft/
# 3. bench
DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf python3 scripts/bench_he.py --n-gen 128The default draft path is discovered under models/draft/. Scripts prefer dflash-draft-*.gguf, then any .gguf, then model.safetensors. Explicit .gguf and safetensors drafts still work via DFLASH_DRAFT / --draft; qwen35-compatible targets remain swappable via DFLASH_TARGET / --target.
dflash_server serves the same client-facing local API surface used by the
harnesses. It supports /health,
/v1/models, OpenAI Chat Completions including streaming and tool metadata,
OpenAI Responses for Codex, Anthropic Messages for Claude Code, and Open WebUI
model metadata.
Build it with the rest of the CUDA runtime:
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target dflash_server -jRun it directly:
./build/dflash_server models/Qwen3.6-27B-Q4_K_M.gguf \
--draft models/draft/dflash-draft-3.6-q4_k_m.gguf \
--host 127.0.0.1 --port 18080 \
--max-ctx 32768 --max-tokens 512 \
--fa-window 2048 \
--ddtree --ddtree-budget 22 \
--model-name luce-dflashdflash_server can run as a PFlash compression proxy in front of any
OpenAI-compatible backend instead of doing local inference. When
--prefill-upstream-base is set, each request is compressed (PFlash) and
forwarded upstream: compressed requests are sent as a raw prompt to
<base>/v1/completions (the compressed text already carries chat-template
markup, so this avoids double-templating), while uncompressed requests pass
through to <base>/v1/chat/completions. Streaming and non-streaming responses
are rewritten back to the Chat Completions shape. With no upstream flags the
server is byte-identical to local-inference mode.
./build/dflash_server models/Qwen3.6-27B-Q4_K_M.gguf \
--prefill-compression auto --prefill-threshold 10000 \
--prefill-drafter models/Qwen3-0.6B-BF16.gguf \
--prefill-curve 10000:0.5 40000:0.2 100000:0.1 \
--prefill-upstream-base http://127.0.0.1:8099 \
--prefill-upstream-model my-upstream-model \
--port 8080New PFlash flags:
| Flag | Purpose |
|---|---|
--prefill-curve T:R [T:R ...] |
Piecewise keep-ratio curve. Linear interpolation over (tokens, ratio) breakpoints, e.g. 10000:0.5 40000:0.2 100000:0.1 (2× compression at 10K tokens, 5× at 40K, 10× at 100K+). Overrides --prefill-keep-ratio; a per-session bandit override still takes precedence. |
--prefill-upstream-base <URL> |
OpenAI-compatible upstream base URL. Enables proxy mode. |
--prefill-upstream-key <KEY> |
Bearer token sent to the upstream. |
--prefill-upstream-model <NAME> |
Model name sent on forwarded requests. |
Then point OpenAI-compatible clients at http://127.0.0.1:18080/v1, or probe
the server with:
python3 ../harness/client_test_runner.py probe \
--url http://127.0.0.1:18080 \
--clients allOn fragile external RTX links, use the same conservative NVIDIA profile as the RTX mixed-hardware notes before running long prompts.
Reference measurements from the Qwen3.6 bring-up on RTX 3090:
| Target | Draft | Bench | AL | Accept | Mean tok/s |
|---|---|---|---|---|---|
| Qwen3.5-27B Q4_K_M | z-lab/Qwen3.5-27B-DFlash | HumanEval (README config) | 8.33 | ~65% | 134.78 |
| Qwen3.6-27B Q4_K_M | z-lab/Qwen3.5-27B-DFlash (mismatch) | HumanEval (10 prompts, n_gen=128) | 4.74 | 30.6% | 73.67 |
| Qwen3.6-27B Q4_K_M | z-lab/Qwen3.6-27B-DFlash safetensors | HumanEval (10 prompts, n_gen=128) | 5.05 | 32.3% | 77.77 |
| Qwen3.6-27B Q4_K_M | z-lab/Qwen3.5-27B-DFlash (mismatch) | Math (10 prompts, n_gen=128) | 3.63 | 23.7% | 57.00 |
Full bench_llm.py suite on Qwen3.6-27B UD-Q4_K_XL, 10 prompts, n_gen=256, RTX 3090 24 GB, auto-fit --max-ctx:
| Bench | AR tok/s | DFlash tok/s | AL | Speedup |
|---|---|---|---|---|
| HumanEval | 34.90 | 78.16 | 5.94 | 2.24× |
| GSM8K | 34.89 | 59.65 | 4.43 | 1.71× |
| Math500 | 35.13 | 69.77 | 5.15 | 1.99× |
| Mean | 34.97 | 69.19 | 5.17 | 1.98× |
For MoE targets whose experts don't fit in VRAM, dflash can split experts
across the GPU and CPU: the most-used (hot) experts stay resident on the
GPU, the rest (cold) live in host RAM and are evaluated on the CPU,
overlapped with the GPU hot path. This trades some decode/prefill speed for
VRAM headroom, and is computed automatically at load by a dynamic-placement
pass. It applies to both MoE arches: qwen35/qwen36 and laguna.
When it triggers. If the experts fit in the available VRAM budget, all experts load to GPU (no split, fastest path). Otherwise placement keeps as many hot experts as the budget allows and routes the rest to CPU. You can also shrink the budget manually to force a split (e.g. to free VRAM for a longer context or a larger target).
--spark turns the split into a self-tuning command: it enables the bounded
GPU expert cache, sizes it from the VRAM target, loads a placement profile
next to the model (<model>.gguf.spark.csv) if present, and keeps refining it
from live traffic across restarts. Cap total VRAM with --spark-vram <GiB>, or
pin the cache ring directly with --spark-slots <N> (default: auto-sized from
the VRAM target).
dflash_server models/laguna-xs2-Q4_K_M.gguf --spark # size to the card
dflash_server models/laguna-xs2-Q4_K_M.gguf --spark --spark-vram 14 # cap total VRAM
dflash_server models/laguna-xs2-Q4_K_M.gguf --spark --spark-slots 48 # pin 48 cache slots/layerUnder offload, laguna decodes the whole token in one fused graph
(laguna_step_hybrid), so throughput stays near the all-GPU ceiling (e.g.
~100 tok/s at 60% residency vs ~118 all-GPU on an RTX 3090); set
DFLASH_LAGUNA_NO_SINGLE_GRAPH=1 to fall back to the per-layer path.
| Env | Arch | Effect |
|---|---|---|
DFLASH_EXPERT_BUDGET_MB N |
both | Cap hot-expert VRAM to N MB. Applies only when N is below the auto-computed budget; experts beyond it go cold (CPU). |
DFLASH_EXPERT_BUDGET_PCT P |
laguna | Keep hot experts to P% (0<P<100) of total expert bytes. Applies only when below the auto budget. |
DFLASH_MAX_CONTEXT N |
both | Override the max context used when sizing the KV cache (more KV = less VRAM left for hot experts). |
Substitute <ARCH> = LAGUNA or QWEN35MOE:
| Env | Effect |
|---|---|
DFLASH_<ARCH>_HOTNESS <file> |
Expert frequency/hotness file driving which experts are placed hot. |
DFLASH_<ARCH>_TELEMETRY 1 |
Log per-layer hot/cold FFN timing telemetry. |
DFLASH_<ARCH>_SWAP_MAX N |
Max hot/cold promotions per request boundary (runtime re-placement); 0 disables swapping. |
DFLASH_<ARCH>_SWAP_MIN_GAIN N |
Min observed-frequency gain before a cold expert is promoted to hot. |
DFLASH_<ARCH>_NEXT_PLACEMENT_OUT <file> |
Dump the placement chosen this run (warm-start the hotness file next time). |
DFLASH_QWEN35MOE_RUNTIME_STATS_OUT <file> |
(qwen only) Dump runtime routing-frequency stats. |
Substitute <ARCH> = LAGUNA or QWEN35MOE. --spark sets these for you.
| Env | Effect |
|---|---|
DFLASH_SPARK 1 |
Enable the autotuning Spark path (set by --spark). |
DFLASH_SPARK_VRAM_MB N |
Total VRAM target Spark sizes the hot tier + cache to (set by --spark-vram). |
DFLASH_<ARCH>_EXPERT_CACHE 1 |
Bounded GPU expert cache: swap selected cold experts into spare slots (LRU) so they are served on-GPU; cold-miss falls toward 0 after warmup. |
DFLASH_<ARCH>_CACHE_SLOTS N |
Cache slots per layer (default: auto-sized from the VRAM target; --spark-slots N is the CLI equivalent). |
DFLASH_LAGUNA_GPU_REMAP 1 |
Serve the cache through the unified on-GPU FFN (required for the laguna cache to take effect). |
DFLASH_LAGUNA_NO_SINGLE_GRAPH 1 |
Fall back to per-layer decode instead of the default single-graph hybrid. |
# Force ~8 GB of hot experts on GPU; the rest run cold on the CPU.
DFLASH_EXPERT_BUDGET_MB=8000 ./build/dflash_server models/laguna-xs2-Q4_K_M.gguf --port 8000
# Startup log e.g.: "dynamic placement result: 4717 hot experts, 5267 cold experts"When a layer's hot-expert stack is reduced (i.e. a genuine split), the
ggml-cuda MMQ mul_mat_id kernel illegal-accesses for certain batch sizes on
both HIP/gfx1151 and CUDA/sm_86. As a guard, hybrid-split prefill is
sliced into ≤4-token sub-batches (forcing the stable MMVQ path); decode
(single-token) is unaffected. This costs some prefill throughput on split
layers and is removed once the kernel is fixed upstream.
Poolside Laguna-XS.2 is a 40-layer MoE LLM with 256 experts (top-8) plus an always-on shared expert, per-layer head counts [48,64,64,64]×10, and a per-layer SWA pattern (window 512). It is architecturally distinct from qwen35, so dflash adds a hand-rolled CUDA forward path (Path A, ggml-only — no libllama dependency) that mirrors the qwen35 stack. The Q4_K_M GGUF lands at 18.77 GiB on a single RTX 3090; tok_embd stays CPU-only (110 MiB) to keep the GPU budget under 24 GB.
test_dflash peeks general.architecture from the target GGUF at startup
and dispatches by arch:
qwen35/qwen36→ existing DFlash + DDTree pipeline (no change).laguna→dflash::common::run_laguna_daemon()(no spec-decode, no DDTree).
The daemon stdin/stream-fd protocol is identical, so dflash_server
drives both arches end-to-end. The only thing the user changes is the model path.
cmake --build build --target test_dflash test_laguna_daemon pflash_daemon -j
# 19 GB Q4_K_M target + 1.2 GB Qwen3-0.6B BF16 drafter + tokenizers
hf download Lucebox/Laguna-XS.2-GGUF laguna-xs2-Q4_K_M.gguf --local-dir models/
hf download unsloth/Qwen3-0.6B-GGUF Qwen3-0.6B-BF16.gguf --local-dir models/
hf download poolside/Laguna-XS.2 --local-dir models/Laguna-XS-2 \
--include 'tokenizer*' '*.json'
# OpenAI-compatible HTTP server.
# dflash_server routes to run_laguna_daemon() when arch=laguna.
./build/dflash_server models/laguna-xs2-Q4_K_M.gguf \
--max-ctx 16384 --port 8000
curl -sN http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"luce-dflash","messages":[{"role":"user","content":"Hi"}],"max_tokens":32}'
# Smoke (loader only, no forward)
./build/smoke_load_target_laguna models/laguna-xs2-Q4_K_M.gguf
# Variable-N TTFT bench (DFLASH_KV_TYPE=q4_0 for ctx > 32K, DFLASH_CHUNK=2048 default)
DFLASH_KV_TYPE=q4_0 ./build/bench_laguna_ttft models/laguna-xs2-Q4_K_M.gguf '4096,16384,65536'
# NIAH single-needle, with PFlash compression. The driver still spawns the
# standalone test_laguna_daemon binary so it can run without dflash_server.
python3 scripts/laguna_pflash_niah.py \
--target models/laguna-xs2-Q4_K_M.gguf \
--drafter models/Qwen3-0.6B-BF16.gguf \
--laguna-tok models/Laguna-XS-2 \
--drafter-tok Qwen/Qwen3-0.6B \
--pflash-bin ./build/pflash_daemon \
--laguna-bin ./build/test_laguna_daemon \
--ctx 131072 --depth 0.5 --keep 0.10 --target-kv q4_0| Context | KV | keep | drafter (s) | target prefill (s) | end-to-end TTFT | NIAH |
|---|---|---|---|---|---|---|
| 4 096 | Q8_0 | 0.10 | 1.54 | 0.39 | 1.92 s | ✅ |
| 65 536 | Q4_0 | 0.10 | ~5 | ~6 | ~11 s | ✅ |
| 65 536 | Q4_0 | 0.20 | ~5 | ~8 | ~13 s | ✅ |
| 65 536 | Q4_0 | 0.30 | ~5 | ~10 | ~15 s | ✅ |
| 65 536 | Q4_0 | 0.50 | ~5 | ~17 | ~22 s | ✅ |
| 131 072 | Q4_0 | 0.10 | 11.11 | 4.79 | 15.91 s | ✅ |
| 131 072 | Q4_0 | 0.20 | 11.20 | 13.55 | 24.75 s | ✅ |
| 131 072 | Q4_0 | 0.30 | 11.41 | 26.43 | 37.84 s | ✅ |
The 131K keep=0.10 run depends on token-boundary repair in scripts/laguna_pflash_niah.py::cross_tok_compressed. The driver recovers kept-token positions, groups consecutive runs, expands each run to whitespace boundaries, and decodes the union once. That keeps multi-token needles intact after compression.
The table above uses synthetic uniform filler. Pass --filler-file <path> to
use a real corpus instead (file or directory; directories are recursively
concatenated). On server/src (1.3 MiB of C++/CUDA, ctx=16K, depth=0.5):
| keep | drafter compressed | NIAH |
|---|---|---|
| no-compress | (full haystack) | ✅ PASS |
| 0.10 | 1486 / 15278 | ❌ FAIL |
| 0.15 | 2254 / 15278 | ❌ FAIL |
| 0.20 | 3022 / 15278 | ❌ FAIL |
| 0.30 | 4558 / 15278 | ✅ PASS |
| 0.50 | 7630 / 15278 | ✅ PASS |
No-compress passes, so the model itself reads the needle out of code; the
failure mode at low keep is the drafter dropping the needle line. Code has
a denser distribution of "important" tokens than synthetic filler (every
identifier, syntax token, and semantic span looks informative to the
attention-based scorer), so the needle line doesn't stand out as an outlier
until retention is around 3× the synthetic-filler threshold (~0.10 → ~0.30).
Production implication: route by prompt source. Synthetic / prose prompts
tolerate keep=0.10; code-heavy prompts want keep ≥ 0.30 to keep recall
intact. A NIAH-aware drafter (or a hybrid scorer that boosts low-frequency
tokens) is the path to bring code recall to the same ratio as prose.
test_laguna_daemon is greedy by default. When scripts/laguna_serve.py (or any caller) appends samp=temp,top_p,top_k,rep_pen,seed to a generate line, the daemon strips it and runs a CPU sampler chain (rep_penalty → top_k → softmax(temp) → top_p → draw) over the prompt + emitted history. Verified that greedy / temp=2.0 seed=42 / temp=2.0 seed=43 / top_p=0.5 produce four distinct outputs on the same prompt.
- No Laguna spec-decode draft published yet. Current decode is autoregressive only (~111 tok/s on RTX 3090). When a matched draft lands, the DFlash + DDTree machinery already in
test_dflashports across. - Prefix cache + in-process PFlash compression are disabled on the laguna path. Both require
SNAPSHOT/RESTORE/FREE_SNAPSHOTandcompress/park/unparkcommands insiderun_laguna_daemon. Tracked as follow-ups; the qwen35 path uses them today. - Path B (in-process Python drafter) is not used here. Path A keeps the dflash daemon ggml-only.
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash
# Build (CUDA 12+, CMake 3.18+, sm_60+ GPU including Pascal; CUDA 13+ required for Jetson AGX Thor sm_110)
# Pass -DCMAKE_CUDA_ARCHITECTURES matching your GPU. Common values:
# 60;61 = Pascal P100/P40 (scalar flashprefill fallback, no WMMA)
# 70 = V100 (F16 WMMA kernels, BF16 draft → FP16 at load)
# 75 = 2080 Ti (F16 WMMA kernels, auto-converts BF16 draft → FP16 at load)
# 86 = RTX 3090 / A40 (native BF16 WMMA)
# 89 = RTX 4090
# 90 = H100
# 120 = Blackwell / DGX Spark
# 110 = Jetson AGX Thor (CUDA 13+)
# Omitting the flag falls back to the CMake-set default ("60;61;62;70;75;86"),
# which compiles Pascal (scalar), Volta/Turing (F16 WMMA), and Ampere+ (BF16 WMMA)
# flashprefill paths.
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --target test_dflash -j
# Fetch models: ~16 GB target + 0.98 GB Lucebox Q4_K_M GGUF DFlash draft.
# Quickstart pins to Qwen3.6-27B (latest release). For Qwen3.5-27B swap in
# unsloth/Qwen3.5-27B-GGUF + z-lab/Qwen3.5-27B-DFlash; arch is identical so
# no rebuild is needed.
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q4_k_m.gguf --local-dir models/draft/
# Streaming one-shot generate (run.py defaults to models/Qwen3.6-27B-Q4_K_M.gguf;
# override with --target or DFLASH_TARGET=... env var).
python3 scripts/run.py --prompt "def fibonacci(n):"
# Multi-turn chat REPL
python3 examples/chat.py
# OpenAI-compatible HTTP server (drop-in for Open WebUI / LM Studio / Cline).
./build/dflash_server models/Qwen3.6-27B-Q4_K_M.gguf \
--draft models/draft/dflash-draft-3.6-q4_k_m.gguf --port 8000
# Reproduce paper numbers
python3 scripts/bench_llm.py # HE + GSM8K + Math500
python3 scripts/bench_he.py --n-gen 256 --ddtree-budget 22 # minimal HE benchLong-context mode (up to 256K):
DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=16 \
build/test_dflash models/Qwen3.6-27B-Q4_K_M.gguf \
models/draft/dflash-draft-3.6-q4_k_m.gguf /tmp/long_prompt.bin 64 /tmp/out.bin \
--fast-rollback --ddtree --ddtree-budget=16 --max-ctx=4096 # align_up(prompt + n_gen + 64, 256); raise up to 262144 for long promptsRequirements: NVIDIA sm_60+ GPU (Pascal P40 24GB, V100 32GB, 2080 Ti, 3090, A10, A40, 4090) or Jetson AGX Thor sm_110, CUDA 12+ (CUDA 13+ required for Thor), 22+ GB VRAM, ~80 GB disk. Pascal GPUs use the scalar flashprefill fallback (no WMMA); P100 (12/16 GB) cannot fit the 27B target + draft under the 22 GB minimum. On Volta (SM 7.0) and Turing (SM 7.5), BF16 draft weights are auto-converted to FP16 at load time for tensor core acceleration.
python -c "import torch; p=torch.cuda.get_device_properties(0); print(p.name, 'sm_%d%d'%(p.major,p.minor), p.multi_processor_count,'SMs', round(p.total_memory/1e9,1),'GB')"
nvcc --versionnvcc --version # must show >= 12.9
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/server
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release # CMake auto-adds sm_121
cmake --build build --target test_dflash dflash_server -jOn GB10 (128 GB unified), re-sweep --ddtree-budget (larger tree = more verify throughput until memory bandwidth saturates) and consider skipping KV quantization entirely (--cache-type-k f16 --cache-type-v f16).
nvcc --version # must show >= 13.0
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/server
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release # CMake auto-adds Thor arch
cmake --build build --target test_dflash dflash_server -j- DDTree
--ddtree-budget: 22 on 3090 + Q4_K_M + 24 GB (default). RTX 5090 = 40 (swept). GB10 = re-sweep. - TQ3_0 KV sized for 24 GB. On 32 GB (5090) or 128 GB (GB10) you can push context further or skip quantization.
- Headline numbers (207 tok/s demo, 129.5 HumanEval, 2.8× vs SGLang AWQ) are RTX 3090 stock. RTX 5090 numbers (205 tok/s HE, 4.84×) are in RESULTS.md. Ada / GB10 / Thor not yet swept, PRs welcome.
Same DFlash + PFlash stack on AMD GPUs. PR #119 ports the Phase 2 rocWMMA flashprefill kernels to HIP. End-to-end on a Ryzen AI MAX+ 395 box (Radeon 8060S iGPU, gfx1151, 128 GiB LPDDR5X-8000 unified): 37.0 tok/s DFlash decode on Qwen3.5-27B Q4_K_M, 27.6 s TTFT @ 16K with NIAH retrieval intact. 3.08× decode and 2.24× prefill over llama.cpp HIP AR on the same iGPU. End-to-end wall clock at a 16K prompt + 1K generation workload: 2.66× faster than vanilla llama.cpp.
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/server
# Build for gfx1151 (Strix Halo). Swap arch for gfx1100 / gfx1201.
cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DDFLASH27B_GPU_BACKEND=hip \
-DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
-DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -jDFLASH27B_HIP_SM80_EQUIV=ON enables the rocWMMA Phase 2 flashprefill kernels (path that delivers the prefill speedup). OFF falls back to ggml's flash_attn_ext (slower but no rocwmma headers needed).
Per-arch DDTree tuning: gfx1151 (Strix Halo iGPU, bandwidth-bound on LPDDR5X) peaks at --ddtree-budget=22. gfx1100 (7900 XTX, GDDR6) prefers budget=8 per the PR #156 cross-arch perf plan. Run scripts/bench_he.py --ddtree-budget N to verify on your card.
Drafter recipe for max decode: target = Qwen3.5-27B Q4_K_M, drafter = same gen quantized to Q8_0 via server/scripts/quantize_draft_q8.py. Matching Q8_0 GGUF on the unsloth Qwen3.6 target needs DFLASH27B_DRAFT_SWA=2048 for sliding-window correctness.
See also: docs/HIP_PERF_PLAN.md (perf sweeps), docs/MIXED_BACKEND.md (mixed CUDA+HIP runs).
Block-diffusion draft. Each step, the draft sees [last_target_token, MASK×15] plus the last 5 captured target hidden states. It denoises the masks in a single forward, producing 16 candidate tokens conditioned on real target features. Structurally stronger than chain EAGLE: every position conditions on the same captured context, not its own noisy predictions.
DDTree tree verify. Instead of one chain of 16 candidates, a best-first tree of up to 22 nodes spans the top-K branches at each position. One target forward verifies the whole tree via a causal mask derived from parent pointers. Budget=22 is the sweet spot where draft accuracy plateaus. Chain pre-seed matters: pure best-first construction with greedy verify on a quantized target can rescue an inferior suffix; the chain_seed=true flag in build_ddtree recovered AL from ~4 to ~9.
Per-step rollback, kernel-free. Before verify, the target's recurrent state (SSM intermediate, conv window, KV cache) is snapshotted; after accept, restored to the committed prefix. Three custom CUDA kernels keep rollback off the critical path:
| Kernel | Purpose |
|---|---|
ggml_gated_delta_net_tree_persist |
Direct-writes SSM intermediates into a persistent buffer, skipping a 9 ms ggml_cpy per step |
ggml_ssm_conv_tree |
Tree-aware conv state gather: each sibling reads its K-1 window along the DDTree parent chain, not DFS order |
Sliding target_feat ring |
4096-slot ring via (pos % cap), enables 128K without holding 6.6 GB of captured features |
Prefill and decode share one graph builder; chain mode is just DDTree with budget=n_spec+1 and no branching.
Qwen3.5-27B is not a dense transformer. llama.cpp calls the arch qwen35:
- 64 layers. Every 4th is full softmax attention, the rest are Gated DeltaNet (linear attention with learned recurrence)
- M-RoPE, dimension sections
[11, 11, 10, 0] - 24 Q heads, 4 KV heads, key/value length 256
- SSM state cache alongside the KV cache
The DeltaNet primitive is already a first-class ggml op (ggml_gated_delta_net). Our fork of llama.cpp adds three tree-mode variants (ggml_ssm_conv_tree, ggml_gated_delta_net_tree, ggml_gated_delta_net_tree_persist) so DDTree verify can roll back SSM state in place, without a replay forward. The full engine (graph builders + decode loop + rollback + kernels) is ~2000 lines.
- llama.cpp: runs Qwen3.5-27B via GGUF but has no DFlash integration. Chain EAGLE isn't enough; block diffusion + DDTree needs a custom decode loop that bypasses
llama_decode. - vLLM / SGLang: Qwen3.5-27B in BF16 is 54 GB, so a single 24 GB card forces a quantized path. GGUF for this arch is broken on SGLang as of 2026-04 and vLLM is dropping GGUF support. AWQ runs on SGLang as plain autoregressive at 46.6 tok/s but can't host the BF16 draft + DDTree tree state alongside it on 24 GB. Q4_K_M GGUF is the only format that fits the full spec-decode stack, this repo runs it at 129.5 tok/s mean on HumanEval, 2.8× faster than SGLang AWQ autoregressive on the same hardware.
- z-lab reference: vLLM / SGLang integrations ship DFlash as a speculative-decoding method, but only on BF16 weights benchmarked on NVIDIA B200 (54+ GB VRAM). No GGUF path.
Current scope:
- Batch size 1, single-user local inference target (Ollama / LM Studio use case)
- Target family: Qwen3.5/Qwen3.6
qwen35GGUF targets. Other architectures need their own graph builder. - Draft formats: DFlash GGUF drafts and z-lab-style safetensors drafts.
- Optional sampling:
temperature,top_p,top_k,seed, andfrequency_penaltyare honored on the OpenAI endpoint. The DDTree verify skeleton stays argmax (preserves accept rate); only the committed token at each verify step is drawn from a small CPU sampler chain (rep-pen → top-k → top-p → temp → multinomial).temperature=0(default) keeps the path bit-exact greedy. Full Leviathan-style rejection sampling on the tree is still a future addition. - Backends: CUDA is the primary path. CUDA sm_60+ is supported: Pascal (sm_60-69) uses scalar F16 fallback, Volta/Turing (sm_70-75) use F16 WMMA kernels, and Ampere+ use native BF16 WMMA. HIP support exists for the documented AMD path. No Metal.
- Quantized target: Q4_K_M fits the full stack on 24 GB. Higher target quantizations may improve acceptance if they fit.
Correctness: test_vs_oracle validates the draft graph at cos sim 0.999812 vs the PyTorch reference. The target graph matches llama.cpp's models/qwen35.cpp semantically and produces bit-identical output to test_generate in autoregressive mode.
Open an issue or PR against Luce-Org/lucebox-hub. Good first picks:
- Leviathan-style rejection sampling on each DDTree branch (the current implementation samples only the committed token; full prob-matching across the tree is the next step)
- Full llama.cpp integration: new arch,
llama-speculative-dflash.cpp,llama-cli/llama-serverwiring
@software{luce_dflash_2026,
title = {Luce DFlash: GGUF port of block-diffusion speculative decoding for Qwen3.5-27B on consumer GPUs},
author = {Lucebox},
url = {https://github.com/Luce-Org/lucebox-hub/tree/main/dflash},
year = {2026}
}
@article{dflash2026,
title = {DFlash: Block-Diffusion Speculative Decoding},
author = {z-lab},
journal = {arXiv:2602.06036},
year = {2026}
}
@article{ddtree2026,
title = {Accelerating Speculative Decoding with Block Diffusion Draft Trees},
author = {Ringel, Liran and Romano, Yaniv},
journal = {arXiv:2604.12989},
year = {2026}
}Apache 2.0 · Lucebox · Discord
Inspired by z-lab/DFlash, liranringel/ddtree, ggml-org/llama.cpp.
The DFlash server natively supports the Responses API (/v1/responses), which is the
only wire protocol used by OpenAI Codex.
./build/dflash_server models/Qwen3.5-27B-Q4_K_M.gguf \
--draft models/Qwen3.5-3B-f16.safetensors \
--ddtree --ddtree-budget 22 --port 8080Create or edit ~/.codex/config.toml:
model = "luce-dflash"
model_provider = "dflash"
[model_providers.dflash]
name = "DFlash"
base_url = "http://localhost:8080/v1"
wire_api = "responses"
supports_websockets = falseNo env_key is needed — the local server accepts any token.
codex --provider dflash "Explain this codebase"| Feature | Status |
|---|---|
Responses API (POST /v1/responses) |
✅ |
| Function/tool calling | ✅ |
| Streaming (SSE) | ✅ |
| Codex models endpoint | ✅ |
| Reasoning / thinking | ✅ (effort: low / medium) |
| WebSockets | ❌ (not needed) |

