BYK · May 21, 2026
diff --git a/.lore.md b/.lore.md
diff --git a/README.md b/README.md
@@ -326,9 +326,23 @@ Lore re-scans the `lat.md/` directory periodically (on session idle), so changes
 
 ## Eval results
 
-At 400K tokens (realistic coding session length), Lore outperforms standard compaction — the approach used by Claude Code, Codex, and other tools that summarize older context when the conversation grows too long:
+### The mega-session test: 2.3 million tokens
 
-### Context retention (400K tokens)
+Real-world coding sessions can span days and accumulate millions of tokens. We extracted a real 5-day, 2.3M-token session (getsentry/cli refactoring — 95 user turns, multiple PRs, architectural decisions, code reviews) and tested whether each approach can answer questions about details from throughout the session:
+
+| What's tested | Lore | Compaction | Lore vs Compaction |
+|---|---|---|---|
+| Easy (late-session details) | **4.0**/5 | 2.4/5 | +67% |
+| Medium (mid-session details) | **3.9**/5 | 3.0/5 | +29% |
+| Hard (early-session details) | **4.1**/5 | 1.8/5 | +136% |
+| **Average** | **4.0**/5 | 2.4/5 | **+70%** |
+| **Perfect scores (5.0)** | **13/20** | 5/20 | 2.6x as many |
+
+*At 2.3M tokens, compaction compresses the entire conversation into ~11K tokens of summary — a 200x compression that destroys most details. Lore preserves them through distillation (21 observations totaling ~10K tokens) + 64K raw tail window + searchable temporal archive via recall. The hard questions — details from the first day of a 5-day session — are where compaction fails (1.8/5) and Lore excels (4.1/5).*
+
+### At 400K tokens
+
+At more typical session lengths, Lore still outperforms compaction, though by a smaller margin:
 
 | What's tested | Lore | Compaction | Lore vs Compaction |
 |---|---|---|---|
@@ -338,7 +352,7 @@ At 400K tokens (realistic coding session length), Lore outperforms standard comp
 | **Average** | **4.8**/5 | 4.5/5 | **+7%** |
 | **Perfect scores (5.0)** | **12/15** | 9/15 | — |
 
-*Compaction baseline: multi-pass LLM summarization matching Claude Code's auto-compact behavior (~140K threshold, 2-3 cycles at 400K tokens). Scored by LLM-as-judge on a 1–5 scale. Lore's advantage is largest on medium-difficulty questions — mid-session details like decision alternatives, exact error messages, and rejected approaches that compaction summarizes away but Lore's distillation + recall preserves.*
+*Compaction baseline: multi-pass LLM summarization matching Claude Code's auto-compact behavior (~140K threshold). At 400K tokens, compaction loses only a few details, so Lore's lead is small (+7%). The lead grows to +70% in the 2.3M-token mega-session.*
 
 ### Preference recall (400K tokens)
 
@@ -351,12 +365,16 @@ At 400K tokens (realistic coding session length), Lore outperforms standard comp
 
 *Preference recall baselines are from a prior eval run with tail-window (80K). Compaction preference baselines pending re-run.*
 
-**What this means:** at 400K tokens, Lore scores 4.8/5 on context retention with 12 out of 15 perfect scores — compared to compaction's 4.5/5 with 9 perfect scores. The gap is largest on mid-session details that compaction loses through repeated summarization cycles.
+**What this means:** the longer the session, the bigger Lore's advantage. At 400K tokens, Lore scores 7% higher than compaction. At 2.3M tokens, the gap is 70%: compaction averages 2.4/5, under half the maximum score, while Lore averages 4.0/5 (80%). Early-session details, where compaction scores just 1.8/5, are largely preserved by Lore's three tiers (distillation, raw tail window, and searchable recall archive), which score 4.1/5.
 
-The eval suite (16 scenarios, 130+ questions, 5 dimensions) is open source in `packages/core/eval/`. Run it yourself:
+The eval suite is open source in `packages/core/eval/`. Run it yourself:
 
 ```bash
+# 400K inflated scenario
 bun packages/core/eval/run.ts --mode live --inflate 400000
+
+# 2.3M mega-session (real session, no inflation needed)
+bun packages/core/eval/run.ts --mode live --scenarios mega-cli-refactor
 ```
 
 **Cost:** Lore's memory layer runs at minimal additional cost — background distillation and curation use batch APIs (50% off on supported providers) and cheaper models. Local on-device embeddings (Nomic Embed v1.5) mean zero API cost for vector search. Predictive cache warming reduces expensive cache rebuilds.
@@ -373,7 +391,7 @@ bun packages/core/eval/run.ts --mode live --inflate 400000
 
 **v5 — behavioral pattern detection + 400K eval.** Vector similarity-based pattern echo detection, action tagging in distillation, cross-session pattern clustering, assertion pinning for long sessions, and a scenario inflator for realistic 400K-token evaluation. This is what closed the preference gap from +15% to +47% over tail-window.
 
-**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. Context retention: 4.8/5 with 12/15 perfect scores, +7% over compaction at 400K tokens.
+**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. 2.3M-token mega-session eval on a real 5-day coding session: Lore 4.0/5 vs compaction 2.4/5 (+70%), with 13/20 perfect scores vs 5/20.
 
 ## Development setup
 

diff --git a/docs/index.html b/docs/index.html
@@ -928,16 +928,16 @@ <h1 class="sr">
 
     <div class="hero-stats sr">
       <div class="stat-cell">
-        <div class="stat-n">12/15</div>
-        <div class="stat-l">Perfect Scores at 400K Tokens</div>
+        <div class="stat-n">+70%</div>
+        <div class="stat-l">vs Compaction at 2.3M Tokens</div>
       </div>
       <div class="stat-cell">
-        <div class="stat-n">4.8</div>
-        <div class="stat-l">out of 5.0 Detail Retention</div>
+        <div class="stat-n">13/20</div>
+        <div class="stat-l">Perfect Scores at 2.3M Tokens</div>
       </div>
       <div class="stat-cell">
-        <div class="stat-n">400K+</div>
-        <div class="stat-l">Token Sessions Supported</div>
+        <div class="stat-n">2.3M+</div>
+        <div class="stat-l">Token Sessions Tested</div>
       </div>
     </div>
   </section>

diff --git a/packages/core/eval/baselines.ts b/packages/core/eval/baselines.ts
@@ -171,29 +171,71 @@ export async function compactionBaseline(
 
     if (prefix.length === 0) break;
 
-    // Summarize the prefix via LLM
-    const prefixText = renderConversation(prefix);
-    const userPrompt = COMPACTION_USER_TEMPLATE.replace(
-      "{{conversation}}",
-      prefixText,
-    );
-
-    const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
-      maxTokens: 4096,
-      temperature: 0,
-    });
+    // Summarize the prefix via LLM. If the prefix exceeds the model's
+    // context window, chunk it into segments and summarize each, then
+    // concatenate the summaries.
+    const MAX_CHUNK_TOKENS = 800_000; // leave room for system prompt + output
+    const prefixTokens = totalTokens(prefix);
+    let summaryText: string;
+
+    if (prefixTokens <= MAX_CHUNK_TOKENS) {
+      // Fits in one call
+      const prefixText = renderConversation(prefix);
+      const userPrompt = COMPACTION_USER_TEMPLATE.replace("{{conversation}}", prefixText);
+      const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
+        maxTokens: 4096,
+        temperature: 0,
+      });
+      summaryText = result.text;
+    } else {
+      // Chunk the prefix into segments that fit
+      const chunks: ConversationTurn[][] = [];
+      let chunk: ConversationTurn[] = [];
+      let chunkTokens = 0;
+      for (const turn of prefix) {
+        const t = turn.tokens ?? estimateTokens(renderTurn(turn));
+        if (chunkTokens + t > MAX_CHUNK_TOKENS && chunk.length > 0) {
+          chunks.push(chunk);
+          chunk = [];
+          chunkTokens = 0;
+        }
+        chunk.push(turn);
+        chunkTokens += t;
+      }
+      if (chunk.length > 0) chunks.push(chunk);
+
+      console.log(
+        `  [compaction] prefix too large (${prefixTokens} tok), splitting into ${chunks.length} chunks`,
+      );
+
+      // Summarize each chunk
+      const chunkSummaries: string[] = [];
+      for (let c = 0; c < chunks.length; c++) {
+        const chunkText = renderConversation(chunks[c]);
+        const userPrompt = COMPACTION_USER_TEMPLATE.replace("{{conversation}}", chunkText);
+        const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
+          maxTokens: 4096,
+          temperature: 0,
+        });
+        chunkSummaries.push(result.text);
+        console.log(
+          `  [compaction] chunk ${c + 1}/${chunks.length}: ${totalTokens(chunks[c])} tok → ${estimateTokens(result.text)} tok`,
+        );
+      }
+      summaryText = chunkSummaries.join("\n\n---\n\n");
+    }
 
     // Replace prefix with a synthetic summary turn + keep tail
     const summaryTurn: ConversationTurn = {
       role: "assistant",
-      content: [{ type: "text", text: `## Compacted Summary (pass ${compactionCount + 1})\n\n${result.text}` }],
-      tokens: estimateTokens(result.text),
+      content: [{ type: "text", text: `## Compacted Summary (pass ${compactionCount + 1})\n\n${summaryText}` }],
+      tokens: estimateTokens(summaryText),
     };
     currentTurns = [summaryTurn, ...tail];
     compactionCount++;
 
     console.log(
-      `  [compaction] pass ${compactionCount}: ${prefix.length} turns summarized → ${estimateTokens(result.text)} tok, ${currentTurns.length} turns remaining (${totalTokens(currentTurns)} tok)`,
+      `  [compaction] pass ${compactionCount}: ${prefix.length} turns summarized → ${estimateTokens(summaryText)} tok, ${currentTurns.length} turns remaining (${totalTokens(currentTurns)} tok)`,
     );
   }
 

diff --git a/packages/core/eval/harness.ts b/packages/core/eval/harness.ts
@@ -1117,6 +1117,8 @@ async function loadScenarios(
       case "context": {
         const mod = await import("./scenarios/context-management");
         scenarios.push(...mod.scenarios);
+        const mega = await import("./scenarios/mega-session");
+        scenarios.push(mega.default);
         break;
       }
       case "recall": {

diff --git a/packages/core/eval/scenarios/cli-refactor-session.json.gz b/packages/core/eval/scenarios/cli-refactor-session.json.gz
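
The chunked path added to `compactionBaseline` in `baselines.ts` groups turns greedily under a token budget before summarizing each group. A minimal standalone sketch of that grouping step follows. The `Turn` shape, `roughTokens`, and `chunkByTokenBudget` are hypothetical names used for illustration only; they stand in for the repo's `ConversationTurn`, `estimateTokens`, and the inline loop, and the 4-characters-per-token estimate is an assumption.

```typescript
// Hypothetical minimal turn shape; the real ConversationTurn carries more fields.
interface Turn {
  role: "user" | "assistant";
  text: string;
  tokens?: number; // precomputed token count, when available
}

// Assumed stand-in estimator (about 4 characters per token), not the repo's estimateTokens.
function roughTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily pack consecutive turns into chunks whose token total stays within
// maxTokens. A chunk is closed just before the turn that would overflow it.
function chunkByTokenBudget(turns: Turn[], maxTokens: number): Turn[][] {
  const chunks: Turn[][] = [];
  let chunk: Turn[] = [];
  let used = 0;
  for (const turn of turns) {
    const t = turn.tokens ?? roughTokens(turn.text);
    if (used + t > maxTokens && chunk.length > 0) {
      chunks.push(chunk);
      chunk = [];
      used = 0;
    }
    chunk.push(turn);
    used += t;
  }
  if (chunk.length > 0) chunks.push(chunk);
  return chunks;
}

const demo: Turn[] = [
  { role: "user", text: "a", tokens: 40 },
  { role: "assistant", text: "b", tokens: 50 },
  { role: "user", text: "c", tokens: 30 },
  { role: "assistant", text: "d", tokens: 120 }, // exceeds the budget on its own
];

// Number of turns in each chunk for a budget of 100 tokens.
console.log(chunkByTokenBudget(demo, 100).map((c) => c.length)); // [ 2, 1, 1 ]
```

Note that a single turn larger than the budget still ends up in a chunk of its own, because the loop never splits a turn. If the same holds for the loop in the diff, a turn above `MAX_CHUNK_TOKENS` would be sent to the model as-is, so a guard or truncation step may be worth considering for sessions with very large tool outputs.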