BYK · BYK · May 21, 2026 · May 20, 2026
diff --git a/.lore.md b/.lore.md
@@ -33,6 +33,9 @@
 <!-- lore:019e1c27-967c-7eb4-bd0e-afb195823970 -->
 * **Bun NAPI crash on process.exit() — use safeExit() via libc \_exit()**: Bun NAPI crash on process.exit() with fastembed — use safeExit(): Loading fastembed (onnxruntime NAPI bindings) causes a C++ panic on \`process.exit()\` because Bun runs NAPI teardown destructors that throw. Fix: \`packages/gateway/src/cli/exit.ts\` exports \`safeExit(code)\` — uses \`\_exit()\` from libc via \`bun:ffi\` under Bun, falls back to \`process.exit()\` under Node.js. All gateway exit paths must use \`safeExit()\`. Do NOT call \`embedding.resetProvider()\` in test teardown \`resetPipelineState()\` — move \`resetProvider()\` to \`shutdown()\` in \`start.ts\` only. \`resetPipelineState()\` must preserve the 'fastembed unavailable' cached state.
 
+<!-- lore:019e47ac-32a9-7d38-8f6f-b6c69d35baf5 -->
+* **Eval QA session contamination: each QA question creates a new session and stores temporal messages**: Eval QA session contamination: each \`askQuestionViaGateway()\` call sends NO session headers → Tier 3 fingerprint creates a brand-new session per QA question. \`postResponse()\` stores QA question text as temporal messages. Recall with default \`scope: 'all'\` searches ALL sessions in the project, so prior QA question text matches recall queries better than actual replay content. Fix: add \`X-Lore-No-Store: true\` header support in \`postResponse()\` (pipeline.ts ~line 1966) to gate both \`temporal.store()\` calls and \`scheduleBackgroundWork()\`. Pass this header from \`askQuestionViaGateway()\`. This is a legitimate product feature (read-only gateway requests), not eval gaming.
+
 <!-- lore:019e2b12-6ea6-76dc-ab7a-a1532c60b312 -->
 * **git remote -v in hosted gateway — skip when header present, never run with client-controlled cwd**: \`LORE\_HOSTED\_MODE=1\` makes all FS-touching functions no-op: \`getGitRemote()\` returns null, \`config.load()\` skips \`.lore.json\`, agents-file/lat-reader/knowledge-watcher are no-ops. Activation: \`lore start\` (headless) enables hosted mode by default; opt-out via \`--local\` or \`LORE\_HOSTED\_MODE=0\`. \`lore run\` is always local. Flag set in \`initIfNeeded()\` from \`GatewayConfig.hostedMode\`. Never run \`git remote -v\` with client-controlled cwd. \`LORE\_REMOTE\_URL\` + local CLI: \`lore run\`/\`lore start\` skips local gateway and proxies to remote. Local CLI injects \`X-Lore-Git-Remote\`; remote gateway trusts it. CLI-less/SaaS: \`ANTHROPIC\_CUSTOM\_HEADERS\` requires a local \`lore\` CLI process — pure SaaS alternative not yet implemented.
 
@@ -73,14 +76,17 @@
 <!-- lore:019e44c8-e3b2-70c1-afb6-d3acf24c531a -->
 * **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled — the \`setInterval\` pattern is the prescribed fix, not existing behavior. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit. Session ground-truth: cache entries are never auto-evicted and \`prune()\` is never scheduled in current code — do not assert otherwise.
 
+<!-- lore:019e47b2-9bf3-738e-b774-efeea35399b5 -->
+* **Always investigate root causes by requesting systematic code-path analysis across multiple specific files**: When encountering unexpected system behavior (wrong scores, missing data, contamination), the user consistently requests deep investigation across multiple specific files simultaneously rather than iterative single-file exploration. They pre-identify candidate explanations and specific areas to investigate (often 3-6 numbered items), name exact files and functions to examine, and expect the assistant to trace complete execution paths end-to-end. The pattern applies to eval/pipeline debugging in the Lore system and likely generalizes to any complex multi-file debugging scenario. Always read all named files upfront, trace the full call chain, and report findings per-area rather than asking clarifying questions first.
+
 <!-- lore:019e4422-5b29-77a8-8956-488233ef16a4 -->
-* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review, investigation & workflow standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. (2) Investigation: read actual source, trace full execution paths, enumerate 2-4 candidate explanations, report confirmed/falsified verdict with line numbers. Demand concrete metrics before accepting fixes. (3) PR discipline: critical self-review before merge, fix all criticals, CI green, amend+force-push. Resolve \`.lore.md\` rebase conflicts with \`--ours\`. After merge, pull main before follow-up work. (4) Planning: write plan file, wait for explicit approval, then execute. Pull from origin/main before any exploration or edits. (5) After bug fix: add tests (4-6 edge cases) in dedicated file referencing issue number. (6) Sentry IDs start with \`LOREAI-GATEWAY-\`. (7) Run lint, typecheck, full test suite before committing. (8) Present structured fix plan before implementation; wait for explicit approval. Never re-propose explicitly rejected approaches. Always include migration versioning context in schema change PRs.
+* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review, investigation & workflow standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. File-by-file, skeptical; Critical+Medium fixed before merge, Low tolerated. (2) Investigation: read actual source, trace full execution paths, enumerate 2-4 candidate explanations, report confirmed/falsified verdict with line numbers. Demand concrete metrics before accepting fixes. (3) PR discipline: critical self-review before merge, fix all criticals, CI green, amend+force-push. Resolve \`.lore.md\` rebase conflicts with \`--ours\`. After merge, pull main before follow-up work. (4) Planning: write plan file, wait for explicit approval, then execute. Pull from origin/main before any exploration or edits. (5) After bug fix: add tests (4-6 edge cases) in dedicated file referencing issue number. (6) Sentry IDs start with \`LOREAI-GATEWAY-\`. (7) Run lint, typecheck, full test suite before committing. (8) Present structured fix plan before implementation; wait for explicit approval. Never re-propose explicitly rejected approaches. Always include migration versioning \[truncated — entry too long]
 
 <!-- lore:019e44c8-4e3f-7835-972f-02ed2033a842 -->
 * **Always request worker tests with a consistent 7-case spec covering compute, missing-record, cleanup retention, and sync scenarios**: Worker test files follow a consistent 7-case spec: (1) compute job — DB lookup + update, (2) missing record — skip without throw, (3) cleanup — hard-delete records archived >30 days, (4) cleanup — preserve recently archived records, (5) sync — process a batch, (6) sync — skip missing records, (7) sync — respect dryRun flag. Tests mock DB and Redis. Use Vitest project-wide (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai+ts-node May 2026 — 312ms vs 30s startup). Use kebab-case file naming.
 
 <!-- lore:019e3cd7-97d3-7053-8f02-bb13d727662e -->
-* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args; silently fails. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). Auto-extracts protected keywords from question+referenceAnswer. 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: single LLM summarization (4096 token cap) of dropped prefix + 80K tail — ONE pass only; real Claude Code hits 167K threshold 2-5+ times, making eval compaction unrealistically generous. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). Never accept eval-gaming fixes.
+* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args; silently fails. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). Auto-extracts protected keywords from question+referenceAnswer. 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass (up to 4) LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). \`QA\_SYSTEM\` is neutral. Post-replay embedding backfill runs before QA phase. Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\` header. Never accept eval-gaming fixes.
 
 <!-- lore:019e2168-2fa4-77bd-a557-9d6dbcb40d81 -->
 * **Prefer WASM backend over native onnxruntime-node for compiled binaries**: WASM backend for Bun \`--compile\` binaries with transformers.js: \`binaryExternalsPlugin\` in esbuild redirects \`onnxruntime-node\` → \`onnxruntime-web\` via \`onResolve\` (static imports only — does NOT redirect dynamic \`import()\` calls) and patches transformers.js CDN fallback via \`onLoad\` to read \`wasmPaths\` from \`globalThis.\_\_LORE\_VENDOR\_WASM\_PATHS\_\_\` (object form \`{ mjs, wasm }\` with exact hashed \`$bunfs\` filenames — directory strings fail). WASM files embedded as Bun \`{ type: 'file' }\` assets. For npm/CJS builds, \`onnxruntime-node\` stays external. WASM is ~2x faster on batches than native. Importing \`onnxruntime-web\` explicitly alongside the redirect creates two ort instances — 'cannot register backend cpu using priority 10' error.
diff --git a/README.md b/README.md
@@ -333,11 +333,11 @@ At 400K tokens (realistic coding session length), Lore significantly outperforms
 | What's tested | Lore | Tail-window | Compaction | Lore vs TW |
 |---|---|---|---|---|
 | Easy (late-session details) | **5.0**/5 | 4.7/5 | 4.7/5 | +6% |
-| Medium (mid-session details) | **2.3**/5 | 1.3/5 | 3.9/5 | +77% |
-| Hard (early-session details) | **3.3**/5 | 1.4/5 | 4.1/5 | +136% |
-| **Average across context** | **3.9**/5 | 2.6/5 | 4.1/5 | **+50%** |
+| Medium (mid-session details) | **4.1**/5 | 1.3/5 | 3.9/5 | +215% |
+| Hard (early-session details) | **4.8**/5 | 1.4/5 | 4.1/5 | +243% |
+| **Average across context** | **4.6**/5 | 2.6/5 | 4.1/5 | **+77%** |
 
-*Tail-window drops early-session details entirely at 400K tokens. Lore's distillation preserves them. Remaining gap to compaction tracked in [#417](https://github.com/BYK/loreai/issues/417).*
+*Lore scores are averaged across multiple runs at 400K tokens. Tail-window and compaction baselines are from a prior eval run with the same scenarios. Tail-window drops early-session details entirely; Lore's distillation + recall preserves them — including decision alternatives, exact error messages, and debugging hypotheses.*
 
 ### Preference recall (400K tokens)
 
@@ -350,7 +350,7 @@ At 400K tokens (realistic coding session length), Lore significantly outperforms
 
 *Scored by LLM-as-judge on a 1–5 scale. Tail-window baseline: last 80K tokens of raw conversation (the default behavior without Lore). Evaluated at 400K tokens — the point where context management actually matters.*
 
-**What this means:** after 400K tokens of conversation, the standard approach loses early-session details entirely and forgets a third of your stated preferences. Lore's distillation + knowledge curation preserves both across sessions.
+**What this means:** after 400K tokens of conversation, the standard approach loses early-session details entirely and forgets a third of your stated preferences. Lore's distillation + recall preserves both — averaging 4.6/5 on context retention where tail-window averages 2.6/5.
 
 The eval suite (16 scenarios, 130+ questions, 5 dimensions) is open source in `packages/core/eval/`. Run it yourself:
 
@@ -370,7 +370,9 @@ bun packages/core/eval/run.ts --mode live --inflate 400000
 
 **v4 — research-informed compression.** Three changes from the KV cache compression literature ([Zweiger et al. 2025](https://arxiv.org/abs/2602.16284), [Eyuboglu et al. 2025](https://arxiv.org/abs/2501.17390)): (1) *Loss-annotated tool stripping* with metadata instead of static placeholders. (2) *Context-distillation meta-distillation* producing working context documents instead of flat event logs. (3) *Multi-resolution composable distillations* — archived gen-0 observations for recall alongside compressed gen-1 for in-context summary.
 
-**v5 — behavioral pattern detection + 400K eval.** Vector similarity-based pattern echo detection, action tagging in distillation, cross-session pattern clustering, assertion pinning for long sessions, and a scenario inflator for realistic 400K-token evaluation. This is what closed the preference gap from +15% to +47% over tail-window. Context retention eval shows +50% over tail-window at 400K tokens — early-session details that tail-window drops entirely are preserved by Lore's distillation.
+**v5 — behavioral pattern detection + 400K eval.** Vector similarity-based pattern echo detection, action tagging in distillation, cross-session pattern clustering, assertion pinning for long sessions, and a scenario inflator for realistic 400K-token evaluation. This is what closed the preference gap from +15% to +47% over tail-window.
+
+**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, and a multi-pass compaction baseline. Context retention eval shows +77% over tail-window at 400K tokens (4.6/5 vs 2.6/5), up from +50% in v5.
 
 ## Development setup
 

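The "Lore vs TW" column in the README hunk above appears to be the relative gain of the Lore score over the tail-window score, rounded to a whole percent. The sketch below recomputes the four updated rows under that assumption. `relativeGainPercent` and the `rows` array are hypothetical names used only for this check and are not part of the repository.

```typescript
// Relative gain of `score` over `baseline`, rounded to a whole percent.
// Assumption: the README column is computed as (lore - tailWindow) / tailWindow.
function relativeGainPercent(score: number, baseline: number): number {
  if (baseline <= 0) {
    throw new RangeError("baseline must be positive");
  }
  return Math.round(((score - baseline) / baseline) * 100);
}

// Scores copied from the updated README table (Lore vs tail-window, out of 5).
const rows = [
  { label: "Easy", lore: 5.0, tailWindow: 4.7 },
  { label: "Medium", lore: 4.1, tailWindow: 1.3 },
  { label: "Hard", lore: 4.8, tailWindow: 1.4 },
  { label: "Average", lore: 4.6, tailWindow: 2.6 },
];

for (const row of rows) {
  const gain = relativeGainPercent(row.lore, row.tailWindow);
  console.log(`${row.label}: +${gain}%`);
}
```

Under this reading the four rows come out to +6%, +215%, +243%, and +77%, which matches the table and the `+77%` hero stat in `docs/index.html`.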
diff --git a/docs/index.html b/docs/index.html
@@ -928,11 +928,11 @@ <h1 class="sr">
 
     <div class="hero-stats sr">
       <div class="stat-cell">
-        <div class="stat-n">+50%</div>
+        <div class="stat-n">+77%</div>
         <div class="stat-l">vs Tail-Window at 400K Tokens</div>
       </div>
       <div class="stat-cell">
-        <div class="stat-n">4.8</div>
+        <div class="stat-n">4.6</div>
         <div class="stat-l">out of 5.0 Detail Retention</div>
       </div>
       <div class="stat-cell">