131 changes: 131 additions & 0 deletions bench/2026-05-25_ee_n_sweep/PLAN.md
@@ -0,0 +1,131 @@
# ee N-sweep Plan: baseline / ee3 / ee5 / ee7

Scope: NIAH @ 32K/64K/128K + 5-client multi-turn bandit session, all four conditions.
Goal: find the smallest N where quality and accept_rate hold vs ee7.

## Trigger

User says "GPU is free, run the ee_n sweep."

Pre-flight GPU check:

    flock -n /tmp/lucebox-gpu.lock echo ok

Exit 0 means GPU is free. Exit 1 means another process holds the lock — wait or
ask the user.
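
The probe above can be wrapped so that a caller branches on its exit status. This is a minimal sketch: `gpu_is_free` is a hypothetical helper name, not something defined in the repo, and it assumes the util-linux `flock` binary is on `PATH`.

```shell
# Hypothetical helper: succeeds only if the GPU lock can be taken right now.
gpu_is_free() {
    local lock_file="${1:-/tmp/lucebox-gpu.lock}"
    # flock -n does not block: it exits non-zero at once if another process
    # holds the lock (or if the lock file cannot be opened).
    flock -n "$lock_file" true
}
```

A caller could then write `if gpu_is_free; then dflash/bench/run_ee_n_sweep.sh; else echo "GPU busy"; fi`. The lock is released as soon as the probe returns, so this only reports the state at that instant; the sweep scripts still take the lock themselves.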

## Pre-flight: generate NIAH case files (once, CPU only)

Run once before the NIAH sweep. Requires `transformers` and `Qwen3-0.6B` tokenizer.

    python3 pflash/tests/niah_gen.py --context 32768 --n 3 -o /tmp/niah_32768.jsonl
    python3 pflash/tests/niah_gen.py --context 65536 --n 3 -o /tmp/niah_65536.jsonl
    python3 pflash/tests/niah_gen.py --context 131072 --n 3 -o /tmp/niah_131072.jsonl

Check niah_gen.py's flags first — --context / --n / -o are the expected interface
but verify before running if in doubt:

    python3 pflash/tests/niah_gen.py --help

## Commands (literal copy-paste, run from worktree root)

    cd /home/peppi/Dev/lucebox-hub/.claude/worktrees/drafter-fastpath

    # Step 1: NIAH sweep (~50 min total, serialized)
    dflash/bench/run_ee_n_sweep.sh

    # Step 2: multi-client (~140 min total, serialized)
    dflash/bench/run_ee_n_multiclient.sh

Both scripts accept an optional output_dir as $1. run_ee_n_sweep.sh also accepts
an optional cases_dir as $2 (default: /tmp).

## Expected output layout

    dflash/bench/results/2026-05-25_ee_n_sweep/
      raw_results.json
      SUMMARY.md                              # written by the sweep script
      baseline_32768_case{0,1,2}_server.log
      baseline_65536_case{0,1,2}_server.log
      baseline_131072_case{0,1,2}_server.log
      ee3_*.log  ee5_*.log  ee7_*.log

    bench/results/2026-05-25_ee_n_multiclient/    # default OUTDIR of run_ee_n_multiclient.sh, relative to the worktree root
      baseline/
        claude_code.csv  claude_code.log  claude_code_server.log
        hermes.csv       hermes.log       hermes_server.log
        opencode.csv     opencode.log     opencode_server.log
        pi.csv           pi.log           pi_server.log
        codex.csv        codex.log        codex_server.log
      ee3/  ee5/  ee7/                        (same structure)

## Decision gate

Smallest N where ALL hold vs ee7:

1. NIAH @ 32K and 64K: needle count within ±1 of ee7 (e.g. if ee7 scores 3/3, both 2/3 and 3/3 pass)
2. accept_rate across 5 clients: mean within ±2 pp of ee7
3. drafter wall: <= ee7 wall at each context (a smaller N must not be slower)
4. No crashes: zero ggml_view_3d asserts, zero server OOM

Outcome mapping:
- ee3 passes all -> propose ee3 as new production default (follow-up PR after #274 merges)
- ee5 passes, ee3 fails -> ee5
- neither passes -> ee7 stays default, close the N-reduction investigation
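
The three numeric gate criteria can be sketched as small shell predicates. The function names are hypothetical, and the sketch assumes the inputs (needle hits, mean accept_rate in percent, drafter wall time in seconds) have already been pulled out of raw_results.json and the per-client CSVs by hand or by another script; it does not parse those files.

```shell
# Criterion 1: candidate needle count within +-1 of ee7.
niah_ok() {      # usage: niah_ok <candidate_hits> <ee7_hits>
    local diff=$(( $1 - $2 ))
    [ "$diff" -ge -1 ] && [ "$diff" -le 1 ]
}

# Criterion 2: mean accept_rate within +-2 percentage points of ee7.
accept_ok() {    # usage: accept_ok <candidate_mean_pct> <ee7_mean_pct>
    awk -v c="$1" -v e="$2" 'BEGIN { d = c - e; if (d < 0) d = -d; exit !(d <= 2.0) }'
}

# Criterion 3: candidate drafter wall time no slower than ee7.
wall_ok() {      # usage: wall_ok <candidate_seconds> <ee7_seconds>
    awk -v c="$1" -v e="$2" 'BEGIN { exit !(c + 0 <= e + 0) }'
}
```

Each predicate returns exit status 0 on pass, so they chain with `&&`. Criterion 4 (no crashes) is a log inspection rather than a numeric comparison and is not covered here.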

## Estimated cost

- GPU wall: ~3 hr serialized (NIAH 50 min + multi-client 140 min)
- Disk: ~50 MB
- Compute cost: $0 (local RTX 3090)

## Hard stops

- Any condition raises ggml_view_3d assert -> STOP, Bug #42 regressed, file issue
- flock wait exceeds 30 min -> STOP, something holds the GPU lock unexpectedly
- A condition's NIAH drops to 0/3 at 32K or 64K -> STOP, do not run further
- Not a stop: server OOM on baseline (no early-exit) at 128K is expected if VRAM is
  too tight; note it and continue with the ee conditions
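
The first hard stop can be checked mechanically after each condition. A minimal sketch, assuming the assert message in the server log contains the literal string `ggml_view_3d` (the exact wording of the assert is not shown in this plan) and that server logs end in `server.log` as in the expected output layout; `logs_with_view3d_assert` is a hypothetical helper name.

```shell
# Hypothetical helper: print every server log under <results_dir> that
# mentions ggml_view_3d; prints nothing if no log matches.
logs_with_view3d_assert() {    # usage: logs_with_view3d_assert <results_dir>
    # -r recurse, -l print file names only, -F match a fixed string.
    grep -rlF --include='*server.log' 'ggml_view_3d' "$1" 2>/dev/null || true
}
```

An empty result means no log matched. Any printed path is a reason to stop and inspect that log before running further conditions.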

## Out of scope (deferred)

- 1K-16K NIAH: already 3/3 on ee7 per 2026-05-21_ee7_broad results; ee3/ee5 speedup
at short context is negligible, gate is long-context quality
- ee10, ee14, ee2: user explicitly scoped to baseline + ee3 + ee5 + ee7
- Cross-family drafter (SmolLM2): loader-ready but kernel not generalized yet

## Flag mismatch notes (for next session)

### NIAH script (run_niah_ee7_longctx.py)

The existing `run_niah_ee7_longctx.py` only understands conditions "baseline",
"ee14", "ee7" — it hardcodes env var logic for those three. It does NOT accept
"ee3" or "ee5".

Resolution: the new `run_ee_n_sweep_niah.py` script was written for this sweep.
It accepts arbitrary N via CONDITION_SPECS dict and handles env var injection
directly. Do NOT call run_niah_ee7_longctx.py for the N-sweep.

### bandit-session required flags

`client_test_runner.py bandit-session` requires --target, --draft, --bin (no
defaults). The multi-client script passes all three, hardcoded to:

    --target /home/peppi/models/qwen3.6-27b-q4km/Qwen3.6-27B-Q4_K_M.gguf
    --draft  /home/peppi/models/Qwen3-0.6B-BF16.gguf
    --bin    dflash/build/dflash_server

If model paths have changed, edit the TARGET / DRAFTER / BIN vars at the
top of run_ee_n_multiclient.sh before running.

### niah_gen.py interface

Verify niah_gen.py flags before generating cases:

    python3 pflash/tests/niah_gen.py --help

The expected interface is --context / --n / -o. If the flags differ, adjust
the generate commands above.

### env handling for baseline in multi-client script

The multi-client script does not build an `env` prefix. For ee conditions it
exports PFLASH_DRAFTER_EARLY_EXIT_N and PFLASH_DRAFTER_SCORE_LAYERS; for baseline
(spec `baseline:0:0`) it unsets both, so nothing leaks in from a previous
condition and the driver runs with the full drafter.
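
The per-condition environment handling in run_ee_n_multiclient.sh can be sketched in isolation. `apply_condition` is a hypothetical wrapper name; the spec format `name:EARLY_EXIT_N:SCORE_LAYERS` and the two variable names are taken from the script.

```shell
# Hypothetical wrapper around the script's export/unset logic.
apply_condition() {    # usage: apply_condition <name:EARLY_EXIT_N:SCORE_LAYERS>
    local spec="$1"
    local rest="${spec#*:}"        # drop the leading "name:"
    local early_n="${rest%%:*}"    # text before the next ":"
    local score_n="${rest#*:}"     # text after it
    if [ "$early_n" != "0" ]; then
        export PFLASH_DRAFTER_EARLY_EXIT_N="$early_n"
        export PFLASH_DRAFTER_SCORE_LAYERS="$score_n"
    else
        # Baseline: clear anything left over from a previous condition.
        unset PFLASH_DRAFTER_EARLY_EXIT_N PFLASH_DRAFTER_SCORE_LAYERS
    fi
}
```

Calling `apply_condition "baseline:0:0"` after an ee run leaves both variables unset, which is what makes the baseline measurement independent of the order of conditions.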
16 changes: 16 additions & 0 deletions dflash/README.md
@@ -353,6 +353,22 @@ python3 scripts/bench_llm.py # HE + GSM8K + Math
python3 scripts/bench_he.py --n-gen 256 --ddtree-budget 22 # minimal HE bench
```

**Early-exit drafter (requires PR #274 — `PFLASH_DRAFTER_EARLY_EXIT_N`):**

Truncate the drafter forward pass at layer N instead of running all 28 layers. N=3 is the validated production default on RTX 3090 + Qwen3-0.6B-BF16:

```bash
PFLASH_DRAFTER_EARLY_EXIT_N=3 PFLASH_DRAFTER_SCORE_LAYERS=3 \
build/dflash_server ...
```

Headline numbers vs baseline (RTX 3090, Q4_K_M target, Qwen3-0.6B-BF16 drafter):
- 6.9× drafter speedup at 32K, 24.3× at 128K
- accept_rate delta vs ee7: +1.2 pp (within ±2 pp gate across all 5 clients)
- NIAH 3/3 at 32K, 64K, 128K (Bug #42 fix included)

Reproduce: `dflash/bench/run_ee_n_sweep.sh` (NIAH N-sweep at 32K/64K/128K) and `dflash/bench/run_ee_n_multiclient.sh` (5-client accept_rate comparison).

**Long-context mode (up to 256K):**
```bash
DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=16 \
78 changes: 78 additions & 0 deletions dflash/bench/run_ee_n_multiclient.sh
@@ -0,0 +1,78 @@
#!/usr/bin/env bash

P2: Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_multiclient.sh, line 12:

<comment>Hardcoded machine-specific absolute paths prevent others from running the reproducibility benchmark</comment>

<file context>
@@ -0,0 +1,78 @@
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+WORKTREE="/home/peppi/Dev/lucebox-hub/.claude/worktrees/harness-adapters"
+DRIVER="$WORKTREE/harness/client_test_runner.py"
+
</file context>
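
One possible way to address this comment, sketched under assumptions: each path keeps the PR's value as its default but can be overridden from the environment. The `LUCEBOX_*` variable names and the `resolve_paths` function are hypothetical and do not exist in the repo.

```shell
# Hypothetical: environment overrides with the PR's hardcoded values as defaults.
resolve_paths() {
    WORKTREE="${LUCEBOX_HARNESS_WORKTREE:-/home/peppi/Dev/lucebox-hub/.claude/worktrees/harness-adapters}"
    TARGET="${LUCEBOX_TARGET_GGUF:-/home/peppi/models/qwen3.6-27b-q4km/Qwen3.6-27B-Q4_K_M.gguf}"
    DRAFTER="${LUCEBOX_DRAFTER_GGUF:-/home/peppi/models/Qwen3-0.6B-BF16.gguf}"
    # The driver path is derived, so it follows whichever worktree was chosen.
    DRIVER="$WORKTREE/harness/client_test_runner.py"
}
```

With this shape the author's machine behaves as before, while another user can export the three variables instead of editing the script.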

# Multi-client bandit-session × {baseline, ee3, ee5, ee7} × {claude_code, hermes, opencode, pi, codex}
# Each server boot is flock-serialized on /tmp/lucebox-gpu.lock.
#
# Usage: dflash/bench/run_ee_n_multiclient.sh [output_dir]

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"

WORKTREE="/home/peppi/Dev/lucebox-hub/.claude/worktrees/harness-adapters"
DRIVER="$WORKTREE/harness/client_test_runner.py"

TARGET="/home/peppi/models/qwen3.6-27b-q4km/Qwen3.6-27B-Q4_K_M.gguf"
DRAFTER="/home/peppi/models/Qwen3-0.6B-BF16.gguf"
BIN="$ROOT/dflash/build/dflash_server"

OUTDIR="${1:-$ROOT/bench/results/2026-05-25_ee_n_multiclient}"
mkdir -p "$OUTDIR"

# Verify driver exists.
if [[ ! -f "$DRIVER" ]]; then
    echo "ERROR: driver not found: $DRIVER"
    echo "Check that the harness-adapters worktree is checked out."
    exit 1
fi

CLIENTS=(claude_code hermes opencode pi codex)
# format: name:EARLY_EXIT_N:SCORE_LAYERS (0 means unset -> full drafter layers)
CONDITIONS=("baseline:0:0" "ee3:3:3" "ee5:5:5" "ee7:7:7")

echo "=== ee N-sweep multi-client start ($(date)) ==="

for cond_spec in "${CONDITIONS[@]}"; do
    name="${cond_spec%%:*}"
    rest="${cond_spec#*:}"
    early_n="${rest%%:*}"
    score_n="${rest#*:}"

    cond_dir="$OUTDIR/$name"
    mkdir -p "$cond_dir"

    for client in "${CLIENTS[@]}"; do
        echo "=== $name x $client ($(date)) ==="

        # Build env vars for early-exit (unset for baseline).
        # DFLASH_SERVER_BIN: overrides harness default cpp binary path.
        # PYTHONPATH: needed for 'from harness.metrics_parser import ...' inside driver.
        export DFLASH_SERVER_BIN="$BIN"
        export PYTHONPATH="$WORKTREE"
        if [[ "$early_n" != "0" ]]; then
            export PFLASH_DRAFTER_EARLY_EXIT_N="$early_n"
            export PFLASH_DRAFTER_SCORE_LAYERS="$score_n"
        else
            unset PFLASH_DRAFTER_EARLY_EXIT_N PFLASH_DRAFTER_SCORE_LAYERS 2>/dev/null || true
        fi

        flock -w 1800 /tmp/lucebox-gpu.lock \
            python3 "$DRIVER" bandit-session \
                --client "$client" \
                --turns 3 \
                --target "$TARGET" \
                --draft "$DRAFTER" \
                --bin "$BIN" \
                --output "$cond_dir/${client}.csv" \
            2>&1 | tee "$cond_dir/${client}.log" \
            || echo "FAIL: $name x $client (see $cond_dir/${client}.log)"

        # Capture server log if the harness wrote one to the standard evidence dir.
        latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true)

P2: Server-log capture uses global ls -t latest rather than run-specific path, causing stale/mismatched server logs after failures.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/bench/run_ee_n_multiclient.sh, line 71:

<comment>Server-log capture uses global `ls -t` latest rather than run-specific path, causing stale/mismatched server logs after failures.</comment>

<file context>
@@ -0,0 +1,78 @@
+            || echo "FAIL: $name x $client (see $cond_dir/${client}.log)"
+
+        # Capture server log if the harness wrote one to the standard evidence dir.
+        latest_server_log=$(ls -t "$WORKTREE"/dflash/bench/results/*_adaptive_evidence/server.log 2>/dev/null | head -1 || true)
+        if [[ -n "$latest_server_log" ]]; then
+            cp "$latest_server_log" "$cond_dir/${client}_server.log"
</file context>
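
One possible way to address this comment, sketched under assumptions: touch a marker file just before each run and copy the server log only if it was modified after that marker. `copy_fresh_server_log` is a hypothetical helper; it still relies on the existing lookup to find the candidate log and only rejects stale ones.

```shell
# Hypothetical helper: copy <src> to <dst> only if <src> is newer than <marker>.
copy_fresh_server_log() {    # usage: copy_fresh_server_log <marker> <src> <dst>
    local marker="$1" src="$2" dst="$3"
    # -nt is true when src has a later modification time than marker.
    if [ -f "$src" ] && [ "$src" -nt "$marker" ]; then
        cp "$src" "$dst"
    else
        return 1    # stale or missing: leave dst untouched
    fi
}
```

In the loop this would look like `touch "$cond_dir/.${client}.start"` before the `flock` call and `copy_fresh_server_log "$cond_dir/.${client}.start" "$latest_server_log" "$cond_dir/${client}_server.log" || true` afterwards. Having the harness write the server log to a run-specific path would be more robust, but whether it supports that is not shown in this PR.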

        if [[ -n "$latest_server_log" ]]; then
            cp "$latest_server_log" "$cond_dir/${client}_server.log"
        fi
    done
done

echo "=== multi-client done. Results under $OUTDIR ($(date)) ==="
42 changes: 42 additions & 0 deletions dflash/bench/run_ee_n_sweep.sh
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
# N-sweep NIAH bench: baseline + ee3 + ee5 + ee7 @ 32K / 64K / 128K
#
# Each server boot is flock-serialized on /tmp/lucebox-gpu.lock.
# Pre-flight: NIAH case files must exist under CASES_DIR (see PLAN.md for gen command).
#
# Usage: dflash/bench/run_ee_n_sweep.sh [output_dir] [cases_dir]

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"

OUTDIR="${1:-$ROOT/dflash/bench/results/2026-05-25_ee_n_sweep}"
CASES_DIR="${2:-/tmp}"

mkdir -p "$OUTDIR"

echo "=== ee N-sweep NIAH start ($(date)) ==="
echo " out-dir: $OUTDIR"
echo " cases-dir: $CASES_DIR"

# Verify case files exist before acquiring the GPU lock.
for ctx in 32768 65536 131072; do
    f="$CASES_DIR/niah_${ctx}.jsonl"
    if [[ ! -f "$f" ]]; then
        echo "ERROR: missing $f"
        echo "Generate with:"
        echo "  python3 $ROOT/pflash/tests/niah_gen.py --context $ctx --n 3 -o $f"
        exit 1
    fi
done

# Run all conditions in a single flock-serialized call.
# The Python script handles per-server serialization internally (one server per case).
# The command is passed as separate arguments (not via -c "...") so that paths
# containing spaces or quotes survive intact.
flock -w 1800 /tmp/lucebox-gpu.lock \
    python3 "$ROOT/dflash/bench/run_ee_n_sweep_niah.py" \
        --out-dir "$OUTDIR" \
        --cases-dir "$CASES_DIR" \
    || { echo "FAIL: GPU lock timeout or sweep error"; exit 1; }

echo "=== N-sweep NIAH done. Results under $OUTDIR ($(date)) ==="