Luce-Org · dusterbloom · May 24, 2026 · May 24, 2026 · cubic-dev-ai · May 24, 2026
diff --git a/bench/2026-05-25_ee_n_sweep/PLAN.md b/bench/2026-05-25_ee_n_sweep/PLAN.md
@@ -0,0 +1,131 @@
+# ee N-sweep Plan: baseline / ee3 / ee5 / ee7
+
+Scope: NIAH @ 32K/64K/128K + 5-client multi-turn bandit session, all four conditions.
+Goal: find the smallest N where quality and accept_rate hold vs ee7.
+
+## Trigger
+
+User says "GPU is free, run the ee_n sweep."
+
+Pre-flight GPU check:
+
+    flock -n /tmp/lucebox-gpu.lock echo ok
+
+Exit 0 means GPU is free. Exit 1 means another process holds the lock — wait or
+ask the user.
+
+## Pre-flight: generate NIAH case files (once, CPU only)
+
+Run once before the NIAH sweep. Requires `transformers` and `Qwen3-0.6B` tokenizer.
+
+    python3 pflash/tests/niah_gen.py --context 32768  --n 3 -o /tmp/niah_32768.jsonl
+    python3 pflash/tests/niah_gen.py --context 65536  --n 3 -o /tmp/niah_65536.jsonl
+    python3 pflash/tests/niah_gen.py --context 131072 --n 3 -o /tmp/niah_131072.jsonl
+
+Check niah_gen.py's flags first — --context / --n / -o are the expected interface
+but verify before running if in doubt:
+
+    python3 pflash/tests/niah_gen.py --help
+
+## Commands (literal copy-paste, run from worktree root)
+
+    cd /home/peppi/Dev/lucebox-hub/.claude/worktrees/drafter-fastpath
+
+    # Step 1: NIAH sweep (~50 min total, serialized)
+    dflash/bench/run_ee_n_sweep.sh
+
+    # Step 2: multi-client (~140 min total, serialized)
+    dflash/bench/run_ee_n_multiclient.sh
+
+Both scripts accept an optional output_dir as $1.
+
+## Expected output layout
+
+    dflash/bench/results/2026-05-25_ee_n_sweep/
+        raw_results.json
+        SUMMARY.md                          # written by the sweep script
+        baseline_32768_case{0,1,2}_server.log
+        baseline_65536_case{0,1,2}_server.log
+        baseline_131072_case{0,1,2}_server.log
+        ee3_*.log  ee5_*.log  ee7_*.log
+
+    .claude/worktrees/drafter-fastpath/bench/results/2026-05-25_ee_n_multiclient/
+        baseline/
+            claude_code.csv  claude_code.log  claude_code_server.log
+            hermes.csv       hermes.log       hermes_server.log
+            opencode.csv     opencode.log     opencode_server.log
+            pi.csv           pi.log           pi_server.log
+            codex.csv        codex.log        codex_server.log
+        ee3/   ee5/   ee7/   (same structure)
+
+## Decision gate
+
+Smallest N where ALL hold vs ee7:
+
+1. NIAH @ 32K and 64K: within +-1 needle (e.g. 2/3 or 3/3 both acceptable)
+2. accept_rate across 5 clients: mean within +-2 pp of ee7
+3. drafter wall: <= ee7 wall at each context (smaller N must be faster)
+4. No crashes: zero ggml_view_3d asserts, zero server OOM
+
+Outcome mapping:
+- ee3 passes all -> propose ee3 as new production default (follow-up PR after #274 merges)
+- ee5 passes, ee3 fails -> ee5
+- neither passes -> ee7 stays default, close the N-reduction investigation
+
+## Estimated cost
+
+- GPU wall: ~3 hr serialized (NIAH 50 min + multi-client 140 min)
+- Disk: ~50 MB
+- Compute cost: $0 (local RTX 3090)
+
+## Hard stops
+
+- Any condition raises ggml_view_3d assert -> STOP, Bug #42 regressed, file issue
+- flock wait exceeds 30 min -> STOP, something holds the GPU lock unexpectedly
+- A condition's NIAH drops to 0/3 at 32K or 64K -> STOP, do not run further
+- Server OOM on baseline (no early-exit) at 128K -> expected if VRAM too tight;
+  note it and continue with ee conditions
+
+## Out of scope (deferred)
+
+- 1K-16K NIAH: already 3/3 on ee7 per 2026-05-21_ee7_broad results; ee3/ee5 speedup
+  at short context is negligible, gate is long-context quality
+- ee10, ee14, ee2: user explicitly scoped to baseline + ee3 + ee5 + ee7
+- Cross-family drafter (SmolLM2): loader-ready but kernel not generalized yet
+
+## Flag mismatch notes (for next session)
+
+### NIAH script (run_niah_ee7_longctx.py)
+
+The existing `run_niah_ee7_longctx.py` only understands conditions "baseline",
+"ee14", "ee7" — it hardcodes env var logic for those three. It does NOT accept
+"ee3" or "ee5".
+
+Resolution: the new `run_ee_n_sweep_niah.py` script was written for this sweep.
+It accepts arbitrary N via CONDITION_SPECS dict and handles env var injection
+directly. Do NOT call run_niah_ee7_longctx.py for the N-sweep.
+
+### bandit-session required flags
+
+`client_test_runner.py bandit-session` requires --target, --draft, --bin (no
+defaults). The multi-client script passes all three hardcoded to:
+  --target /home/peppi/models/qwen3.6-27b-q4km/Qwen3.6-27B-Q4_K_M.gguf
+  --draft  /home/peppi/models/Qwen3-0.6B-BF16.gguf
+  --bin    dflash/build/dflash_server
+
+If model paths have changed, edit CLIENTS_TARGET / DRAFTER / BIN vars at the
+top of run_ee_n_multiclient.sh before running.
+
+### niah_gen.py interface
+
+Verify niah_gen.py flags before generating cases:
+  python3 pflash/tests/niah_gen.py --help
+
+The expected interface is --context / --n / -o. If the flags differ, adjust
+the generate commands above.
+
+### env prefix for baseline in multi-client script
+
+The shell uses `env $env_prefix python3 ...`. When env_prefix is empty (baseline),
+this collapses to `env python3 ...` which is valid — env with no assignments is
+a no-op passthrough.
diff --git a/dflash/README.md b/dflash/README.md
@@ -353,6 +353,22 @@ python3 scripts/bench_llm.py                                 # HE + GSM8K + Math
 python3 scripts/bench_he.py --n-gen 256 --ddtree-budget 22   # minimal HE bench
 ```
 
+**Early-exit drafter (requires PR #274 — `PFLASH_DRAFTER_EARLY_EXIT_N`):**
+
+Truncate the drafter forward at layer N instead of running all 28 layers. N=3 is the validated production default on RTX 3090 + Qwen2.5-0.5B-BF16:
+
+```bash
+PFLASH_DRAFTER_EARLY_EXIT_N=3 PFLASH_DRAFTER_SCORE_LAYERS=3 \
+  build/dflash_server ...
+```
+
+Headline numbers vs baseline (RTX 3090, Q4_K_M target, 0.5B-BF16 drafter):
+- 6.9× drafter speedup at 32K, 24.3× at 128K
+- accept_rate delta vs ee7: +1.2 pp (within ±2 pp gate across all 5 clients)
+- NIAH 3/3 at 32K, 64K, 128K (Bug #42 fix included)
+
+Reproduce: `dflash/bench/run_ee_n_sweep.sh` (NIAH + multi-client N-sweep) and `dflash/bench/run_ee_n_multiclient.sh` (5-client accept_rate comparison).
+
 **Long-context mode (up to 256K):**
 ```bash
 DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=16 \

diff --git a/dflash/bench/run_ee_n_multiclient.sh b/dflash/bench/run_ee_n_multiclient.sh
@@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+# Multi-client bandit-session × {baseline, ee3, ee5, ee7} × {claude_code, hermes, opencode, pi, codex}
+# Each server boot is flock-serialized on /tmp/lucebox-gpu.lock.
+#
+# Usage: dflash/bench/run_ee_n_multiclient.sh [output_dir]
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+WORKTREE="/home/peppi/Dev/lucebox-hub/.claude/worktrees/harness-adapters"
+DRIVER="$WORKTREE/harness/client_test_runner.py"
+
+TARGET="/home/peppi/models/qwen3.6-27b-q4km/Qwen3.6-27B-Q4_K_M.gguf"
+DRAFTER="/home/peppi/models/Qwen3-0.6B-BF16.gguf"
+BIN="$ROOT/dflash/build/dflash_server"
+
+OUTDIR="${1:-$ROOT/bench/results/2026-05-25_ee_n_multiclient}"
+mkdir -p "$OUTDIR"
+
+# Verify driver exists.
+if [[ ! -f "$DRIVER" ]]; then
+    echo "ERROR: driver not found: $DRIVER"
+    echo "Check that the harness-adapters worktree is checked out."
+    exit 1
+fi
+
+CLIENTS=(claude_code hermes opencode pi codex)
+# format: name:EARLY_EXIT_N:SCORE_LAYERS (0 means unset -> full drafter layers)
+CONDITIONS=("baseline:0:0" "ee3:3:3" "ee5:5:5" "ee7:7:7")
+
+echo "=== ee N-sweep multi-client start ($(date)) ==="
+
+for cond_spec in "${CONDITIONS[@]}"; do
+    name="${cond_spec%%:*}"
+    rest="${cond_spec#*:}"
+    early_n="${rest%%:*}"
+    score_n="${rest#*:}"
+
+    cond_dir="$OUTDIR/$name"
+    mkdir -p "$cond_dir"
+
+    for client in "${CLIENTS[@]}"; do
+        echo "=== $name x $client ($(date)) ==="
+
+        # Build env vars for early-exit (unset for baseline).
+        # DFLASH_SERVER_BIN: overrides harness default cpp binary path.
+        # PYTHONPATH: needed for 'from harness.metrics_parser import ...' inside driver.
+        export DFLASH_SERVER_BIN="$BIN"
+        export PYTHONPATH="$WORKTREE"
+        if [[ "$early_n" != "0" ]]; then
+            export PFLASH_DRAFTER_EARLY_EXIT_N="$early_n"
+            export PFLASH_DRAFTER_SCORE_LAYERS="$score_n"
+        else
+            unset PFLASH_DRAFTER_EARLY_EXIT_N PFLASH_DRAFTER_SCORE_LAYERS 2>/dev/null || true
+        fi
+
+        flock -w 1800 /tmp/lucebox-gpu.lock \
+            python3 "$DRIVER" bandit-session \
+                --client "$client" \
+                --turns 3 \
+                --target "$TARGET" \
+                --draft "$DRAFTER" \
+                --bin "$BIN" \
+                --output "$cond_dir/${client}.csv" \
+        2>&1 | tee "$cond_dir/${client}.log" \
+            || echo "FAIL: $name x $client (see $cond_dir/${client}.log)"
+
+        # Capture server log if the harness wrote one to the standard evidence dir.
+        latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true)
+        if [[ -n "$latest_server_log" ]]; then
+            cp "$latest_server_log" "$cond_dir/${client}_server.log"
+        fi
+    done
+done
+
+echo "=== multi-client done. Results under $OUTDIR ($(date)) ==="
diff --git a/dflash/bench/run_ee_n_sweep.sh b/dflash/bench/run_ee_n_sweep.sh
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+# N-sweep NIAH bench: baseline + ee3 + ee5 + ee7 @ 32K / 64K / 128K
+#
+# Each server boot is flock-serialized on /tmp/lucebox-gpu.lock.
+# Pre-flight: NIAH case files must exist under CASES_DIR (see PLAN.md for gen command).
+#
+# Usage: dflash/bench/run_ee_n_sweep.sh [output_dir] [cases_dir]
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+OUTDIR="${1:-$ROOT/dflash/bench/results/2026-05-25_ee_n_sweep}"
+CASES_DIR="${2:-/tmp}"
+
+mkdir -p "$OUTDIR"
+
+echo "=== ee N-sweep NIAH start ($(date)) ==="
+echo "    out-dir:   $OUTDIR"
+echo "    cases-dir: $CASES_DIR"
+
+# Verify case files exist before acquiring the GPU lock.
+for ctx in 32768 65536 131072; do
+    f="$CASES_DIR/niah_${ctx}.jsonl"
+    if [[ ! -f "$f" ]]; then
+        echo "ERROR: missing $f"
+        echo "Generate with:"
+        echo "  python3 $ROOT/pflash/tests/niah_gen.py --context $ctx --n 3 -o $f"
+        exit 1
+    fi
+done
+
+# Run all conditions in a single flock-serialized call.
+# The Python script handles per-server serialization internally (one server per case).
+flock -w 1800 /tmp/lucebox-gpu.lock -c "
+    python3 $ROOT/dflash/bench/run_ee_n_sweep_niah.py \
+        --out-dir '$OUTDIR' \
+        --cases-dir '$CASES_DIR'
+" || { echo "FAIL: GPU lock timeout or sweep error"; exit 1; }
+
+echo "=== N-sweep NIAH done. Results under $OUTDIR ($(date)) ==="