diff --git a/.lore.md b/.lore.md
index b88a614..95266f7 100644
--- a/.lore.md
+++ b/.lore.md
@@ -5,7 +5,7 @@
 ### Architecture
 
 <!-- lore:019e25c5-5716-77a0-bcf9-d65f321e5736 -->
-* **3-layer gradient model: layer 2 is transient, falls back to layer 1 via urgent distillation**: 3-layer gradient model: Layer 0 (all raw messages + LTM, append-only cache). Layer 1 (distilled prefix + pinned raw window, bust once on entry then warm). Layer 2/emergency (transient hard reset: fresh LTM, 2-3 best distillations, current agentic turn — fires 1-2 turns then urgent distillation falls back to Layer 1). Layer 2 must NOT set stickiness — stickiness only applies to layers 1-3. Bug in \`gradient.ts\`: \`effectiveMinLayer = max(0, lastLayer)\` traps sessions in emergency indefinitely; fix: restrict stickiness to \`lastLayer >= 1 && lastLayer <= 3\`. Context budget caps (160K at Opus) are cost-driven. Layer-specific distLimit: layers 1-2 all non-archived distillations; layer 3: top 5 via \`selectDistillations()\`; emergency: top 2. Scoring: 70% recency + 30% \`importanceBonus()\`. Cache frozen during tool-call chains for byte-identical prefix.
+* **3-layer gradient model: layer 2 is transient, falls back to layer 1 via urgent distillation**: 3-layer gradient model: Layer 0 (all raw messages + LTM, append-only cache). Layer 1 (distilled prefix + pinned raw window, bust once on entry then warm). Layer 2/emergency (transient hard reset: fresh LTM, 2-3 best distillations, current agentic turn — fires 1-2 turns then urgent distillation falls back to Layer 1). Layer 2 must NOT set stickiness — restrict to \`lastLayer >= 1 && lastLayer <= 3\` (bug: \`effectiveMinLayer = max(0, lastLayer)\` traps sessions in emergency). Context budget caps (160K at Opus) are cost-driven. distLimit: layers 1-2 all non-archived distillations; layer 3: top 5 via \`selectDistillations()\`; emergency: top 2. Scoring: 70% recency + 30% \`importanceBonus()\`. Cache frozen during tool-call chains for byte-identical prefix.
 
 <!-- lore:019e3083-8969-732b-a269-72e6cdd1ff7d -->
 * **Background LLM rate limiting: p-limit(2) + 429 circuit breaker in background-limiter.ts**: Global concurrency limit for background LLM work in \`packages/gateway/src/background-limiter.ts\`. Uses \`p-limit(2)\` to cap simultaneous background LLM calls across all idle sessions. Circuit breaker trips on 429 responses and pauses all background work for the \`Retry-After\` duration. Wired into: idle scheduler, pipeline incremental distillation, in-flight curation. Urgent distillation is excluded (client is waiting). Without this, N idle sessions fire N×4 simultaneous background calls causing cascading rate limit failures.
@@ -13,8 +13,11 @@
 <!-- lore:019e1c62-3208-7836-a531-f92d1bb20733 -->
 * **Conversation import system: providers, detection, extraction pipeline**: Core import system in \`packages/core/src/import/\`. Key design: \`AgentHistoryProvider\` interface with \`detect()\`/\`load()\`; providers registered in global registry (\`providers/index.ts\`). Detection scans all providers, returns \`DetectedSession\[]\`. Extraction calls curator LLM sequentially per chunk, deduplicating ops via \`parseOps()\`/\`applyOps()\`. Idempotency via \`import\_history\` table (DB migration v19). Built-in providers: Claude Code (\`~/.claude/projects/\`), OpenCode (SQLite), Aider (markdown), Codex (\`~/.codex/sessions/\` JSONL), Cline (VS Code globalStorage JSON), Continue (\`~/.continue/sessions/\` JSON), Pi (\`~/.pi/agent/sessions/\` tree-structured JSONL). Auto-import triggered in \`lore run\` via \`maybeAutoImport()\`. Copilot Chat skipped (opaque leveldb).
 
+<!-- lore:019e498a-c0fa-7aa3-b3cd-cf0cc0d29df6 -->
+* **getsentry/cli output system: return-based, template-driven, no direct stdout/stderr**: getsentry/cli output system: return-based, template-driven, no direct stdout/stderr. Commands return data; \`buildCommand\` wrapper handles rendering (JSON, plain text, markdown). No manual \`writeJson()\`/\`writeOutput()\`/\`writeResponseBody()\` calls. \`--json\` and \`--fields\` lifted into \`buildCommand\`. For streaming, use \`buildStreamingCommand\` (detects AsyncGenerator, renders based on \`--json\`; JSONL mode flushes each \`yield\` immediately). Spinner is the only component holding stdout/stderr references. Throw \`exitCode\` rather than weaving it around. Verbose mode uses \`logger.debug()\` via Consola. \`api\` command always returns JSON. \`--dry-run\` validates inputs and returns mock result. Goal: 0 direct stdout/stderr usages. No dead code; delete removed functions entirely. \`auth login\` example: yield QR code + URL, then resume polling with spinner. If teams are missing during project creation, auto-create rather than error. Migration order: (1) stop exposing stdout/stderr from \`buildCommand\`; (2) refactor non-streaming list commands; (3) implement \`buildStreamingCommand\`.
+
 <!-- lore:019e458b-fe1a-77fe-886b-37ef1817e7ca -->
-* **Gradient tool\_use/tool\_result pairing: reconstruct-after-eviction pattern**: Gradient tool\_use/tool\_result pairing — reconstruct-after-eviction pattern: \`tryFit()\` has NO logic to keep pairs together; safety via: (1) \`resolveToolResults()\` (temporal-adapter.ts:239) merges tool result data onto assistant tool parts, strips user-side \`tool\_result\` blocks; (2) \`loreMessagesToGateway()\` (pipeline.ts:3401) reconstructs pairs from surviving assistant tool parts; (3) \`removeOrphanedToolResults()\` (pipeline.ts:3524) validates BOTH directions — tool\_result→tool\_use AND tool\_use→tool\_result (Pass 2, PR #424). \`sanitizeToolParts()\` (gradient.ts:1071) converts pending/running tool parts to error state. Layer 4 (emergency) never strips tool parts to avoid infinite loops. Prefix/raw boundary trap: prefix ends with assistant (text-only), rawWindow may start with assistant containing tool\_use → back-to-back assistants → Anthropic rejects. Fix: advance cutoff past leading assistant messages when prefix is present at all 3 assembly points in gradient.ts (tryFit, tryFitStable pinned path, emergency layer). PR #428.
+* **Gradient tool\_use/tool\_result pairing: reconstruct-after-eviction pattern**: Gradient tool\_use/tool\_result pairing — reconstruct-after-eviction pattern: \`tryFit()\` has NO logic to keep pairs together; safety via: (1) \`resolveToolResults()\` (temporal-adapter.ts:239) merges tool result data onto assistant tool parts, strips user-side \`tool\_result\` blocks; (2) \`loreMessagesToGateway()\` (pipeline.ts:3401) reconstructs pairs from surviving assistant tool parts; (3) \`removeOrphanedToolResults()\` (pipeline.ts:3524) validates BOTH directions. \`sanitizeToolParts()\` (gradient.ts:1071) converts pending/running tool parts to error state. Layer 4 never strips tool parts to avoid infinite loops. Prefix/raw boundary trap: prefix ends with assistant (text-only), rawWindow may start with assistant containing tool\_use → back-to-back assistants → Anthropic rejects. Fix: advance cutoff past leading assistant messages when prefix is present at all 3 assembly points in gradient.ts (tryFit, tryFitStable pinned path, emergency layer).
 
 <!-- lore:019e30a6-ff62-723e-9fd8-56f1f1f60b5a -->
 * **LTM confidence field: semantic meaning and rerankPreferences() for legacy entries**: \`ltm.create()\` accepts optional \`confidence\` param (default 1.0, clamped \[0,1]). Semantics: 1.0=unconditional directive, 0.9=strong preference, 0.8=moderate, 0.6=mild. \`CuratorOp\` create type includes \`confidence\`, wired through \`applyOps\`. \`rerankPreferences()\` in \`packages/core/src/ltm.ts\` re-scores legacy entries by directive keyword patterns (\`STRONG\_DIRECTIVE\_RE\`); skips entries whose \`confidence\` was already set to a non-default value — manual overrides are preserved. \`lore data rerank\` CLI command triggers re-ranking; also auto-runs after \`lore data recover\`. Run after deploying to fix existing preferences in DB.
@@ -46,16 +49,13 @@
 * **Remote import idempotency split-brain: isImported queries local DB, recordImport writes to remote**: When \`LORE\_REMOTE\_URL\` is set, \`lore import\` has a split-brain bug: \`isImported()\` checks the \*\*local\*\* SQLite \`import\_history\` table, but \`remotePost('/api/v1/import/record')\` writes the record to the \*\*remote\*\* DB. Result: every subsequent run re-detects all sessions as un-imported and double-extracts. Fix: (1) add \`GET /api/v1/import/history\` endpoint and query remote during dedup check when remote URL is set; (2) optionally also write locally as offline resilience. The remote DB is the source of truth for import history in remote mode.
 
 <!-- lore:019e2e20-95b3-7a9d-ab38-77d87eafecc4 -->
-* **splitSegments() infinite recursion on oversized single messages**: splitSegments() infinite recursion on oversized single messages: In \`packages/core/src/distillation.ts\`, \`splitSegments()\` recurses infinitely when a single message exceeds \`maxSegmentTokens\` (16384). \`findSplitIndex()\` returns \`messages.length\` (=1), so \`left = messages.slice(0, 1)\` produces an identical recursive call. Triggered on large tool outputs (~49KB+). Fix: add base case after the \`totalTokens <= maxTokens\` guard — \`if (messages.length <= 1) return \[messages]\`. The oversized message becomes an indivisible segment.
-
-<!-- lore:019e4683-c4b3-780f-84d7-6157698ac7c2 -->
-* **Temporal storage: store from original req.messages BEFORE resolveToolResults strips tool\_result content**: Temporal storage ordering trap: call \`gatewayMessagesToLore()\` and store temporal messages BEFORE \`resolveToolResults()\` runs. \`resolveToolResults()\` replaces tool\_result parts with \`\[tool results provided] (t:msgID)\` placeholder — if temporal storage happens after, stored user message has no searchable content. The \`(t:msgID)\` reference lets the model fetch original content via recall tool. \`postResponse\` in \`pipeline.ts\` wraps all post-response processing in a broad try/catch — errors inside \`temporal.store()\` are silently swallowed. Check for \`post-response processing failed\` log lines first when debugging temporal storage issues. Eval teardown trap: \`closeDB()\` + \`unlinkSync\` deletes DB, then subsequent QA phase creates a new empty DB — inspect a killed/timed-out eval run's DB to verify actual storage.
+* **splitSegments() infinite recursion on oversized single messages**: \`splitSegments()\` infinite recursion on oversized single messages in \`packages/core/src/distillation.ts\`: recurses infinitely when a single message exceeds \`maxSegmentTokens\` (16384). \`findSplitIndex()\` returns \`messages.length\` (=1), so \`left = messages.slice(0, 1)\` produces an identical recursive call. Triggered on large tool outputs (~49KB+). Fix: add base case after the \`totalTokens <= maxTokens\` guard — \`if (messages.length <= 1) return \[messages]\`. The oversized message becomes an indivisible segment.
 
 <!-- lore:019e1de2-7639-7b32-b4c1-e64486934c27 -->
-* **TTL downgrade hysteresis: downgradeStreak field prevents compounding cache busts**: Auto-TTL downgrade hysteresis in \`packages/gateway/src/pipeline.ts\`: downgrade from 1h→5m TTL requires 3 consecutive short-gap turns (\`ttlDowngradeStreak\` in \`SessionState\`). Block downgrade if >50% of session tokens are cached. Reset streak on any long-gap turn. Subagent turns and tool-use continuations excluded from gap recording — capture \`prevStopReason\` before line 1667 overwrites it, skip when \`prevStopReason === 'tool\_use'\` or \`isSubagentTurn\`. State persistence: immediate (session identity), per-turn (cost snapshot), 30s periodic (gradient EMAs + cache warming via dirty flag). Max data loss on crash: ~30s. Also: recall follow-up requests must set \`cacheConversation: false\` — otherwise modified message array triggers full cache write at 5m TTL pricing.
+* **TTL downgrade hysteresis: downgradeStreak field prevents compounding cache busts**: Auto-TTL downgrade hysteresis in \`packages/gateway/src/pipeline.ts\`: downgrade from 1h→5m TTL requires 3 consecutive short-gap turns (\`ttlDowngradeStreak\` in \`SessionState\`). Block downgrade if >50% of session tokens are cached. Reset streak on any long-gap turn. Subagent turns and tool-use continuations excluded from gap recording — capture \`prevStopReason\` before line 1667 overwrites it, skip when \`prevStopReason === 'tool\_use'\` or \`isSubagentTurn\`. State persistence: immediate (session identity), per-turn (cost snapshot), 30s periodic (gradient EMAs + cache warming via dirty flag). Max data loss on crash: ~30s. Recall follow-up requests must set \`cacheConversation: false\` — otherwise modified message array triggers full cache write at 5m TTL pricing.
 
 <!-- lore:019e1e9f-3131-733f-978e-dde6f41e29fd -->
-* **Upgrade lock double-acquisition bug: same process re-locks same file**: In \`packages/gateway/src/cli/lib/binary.ts\`, \`downloadBinaryToTemp()\` acquires a lock on \`\<execPath>.lock\` and holds it. Then \`installBinary()\` computes the same install path and tries to \`acquireLock()\` again. \`handleExistingLock()\` only allows re-entry if \`existingPid === process.ppid\` (parent), but the lock was written by the same process (\`existingPid === process.pid\`), so it throws 'Another upgrade is already in progress'. Fix: in \`handleExistingLock\`, also allow re-entry when \`existingPid === process.pid\`. Double \`releaseLock()\` is safe — \`releaseLock\` swallows errors so the second call is a no-op after the file is deleted.
+* **Upgrade lock double-acquisition bug: same process re-locks same file**: In \`packages/gateway/src/cli/lib/binary.ts\`, \`downloadBinaryToTemp()\` acquires a lock on \`\<execPath>.lock\` and holds it. Then \`installBinary()\` computes the same install path and tries to \`acquireLock()\` again. \`handleExistingLock()\` only allows re-entry if \`existingPid === process.ppid\` (parent), but the lock was written by the same process (\`existingPid === process.pid\`), so it throws 'Another upgrade is already in progress'. Fix: in \`handleExistingLock\`, also allow re-entry when \`existingPid === process.pid\`. Double \`releaseLock()\` is safe — swallows errors so the second call is a no-op.
 
 <!-- lore:019e1cd6-05d2-74c5-aea8-fd827a4a45e7 -->
 * **vectorSearch() is unscoped — test cleanup must delete all embedding rows**: \`vectorSearch()\` in \`packages/core/src/ltm.ts\` queries \`knowledge WHERE embedding IS NOT NULL AND confidence > 0.2\` with no \`project\_id\` filter (intentional for cross-project search). Two gotchas: (1) Test suites scoped to one project leak embedding rows into other vectorSearch tests — \`beforeEach\` must \`DELETE FROM knowledge WHERE embedding IS NOT NULL\`. (2) \`vectorSearch()\` has no \`excludeCategories\` param — category exclusions from \`forSession()\` callers have no effect; add optional \`excludeCategories\` param and propagate from callers. Also: global entries (pid=null) force \`crossProject=true\`; confidence is clamped to \[0.0, 1.0] in \`update()\`.
@@ -65,22 +65,22 @@
 <!-- lore:019e21e0-9f09-7dd1-a899-854579e160cc -->
 * **Enhanced dedup: title overlap + vector similarity (Nomic v1.5)**: Nomic Embed v1.5 dedup threshold: same-domain cosine similarity spreads 0.46–0.70 (vs BGE Small which clusters at 0.93–0.97+, making dedup unusable). Correct dedup threshold: \*\*0.935\*\* — at-or-above is genuine duplicate. Range 0.85–0.91 contains 'related but distinct' entries; 0.85 produces false positives across project boundaries. \`deduplicate()\` in \`packages/core/src/ltm.ts\` uses both title word-overlap (0.7 Jaccard + 4+ shared words) AND vector cosine similarity. BGE Small embeddings are auto-nulled by \`checkConfigChange()\` on startup; \`backfillEmbeddings()\` re-embeds with Nomic v1.5. \`lore data reindex\` triggers backfill on-demand without gateway restart.
 
-<!-- lore:019e4683-c4bd-782e-bb97-355e5fc4e0e8 -->
-* **Uniform citation format: (prefix:id) for all recall-able references**: Uniform citation format: all recall-able references use \`(prefix:id)\`: \`(d:UUID)\` for distillations, \`(t:msgID)\` for temporal messages, \`(k:entryID)\` for knowledge entries. Distillation headers render as \`(d:UUID | lossy | N sources)\`. Tool result placeholders render as \`\[tool results provided] (t:msgID)\`. Do NOT use markdown link style. Recall RRF: distillations get 4 RRF lists (BM25 + vector + quality + exact-match) vs temporal's 3 (no quality list). \`SOURCE\_WEIGHT\`: distillation=0.8, temporal=0.8, knowledge=1.0. \`charBudget\` 12K. Vector search gate skipped for session-scoped recall. Per-query: 4 RRF lists (knowledge BM25, distillation BM25, temporal BM25, temporal recency); with 3 LLM expansions = 16 lists. \`MAX\_RRF\_LISTS=10\` trims expanded-query lists first. RRF formula: \`w/(60+rank)\`. \`formatFusedResults\`: tier 0 (score≥60% of top), tier 1 (≥30%), tier 2 (rest); per-result char budget clamped \[80,1200]. Temporal in tier 2 gets ~0.35x weight vs knowledge.
+<!-- lore:019e498a-c10c-760c-af27-18cb2e16ab4f -->
+* **getsentry/cli input validation module: src/lib/input-validation.ts**: getsentry/cli input validation (\`src/lib/input-validation.ts\`): \`rejectControlChars(input, label)\`, \`validateSlug(input, label)\` (rejects ?, #, %, whitespace, control chars, /), \`rejectPreEncoded(input, label)\`, \`validateEndpoint(endpoint)\` (blocks .. segments), \`validateResourceId(input, label)\`. Applied at: \`parseOrgProjectArg\`, \`parseIssueArg\`, \`parseSlashSeparatedArg\` (slash-split components only, NOT no-slash case — plain IDs need downstream processing) in \`arg-parsing.ts\`; \`normalizeEndpoint\` in \`api.ts\`. NOT applied in \`resolve-target.ts\` — env vars and DB cache are trusted. Biome lint rule: \`lint/suspicious/noControlCharactersInRegex\` (NOT \`lint/correctness/...\`) — use \`RegExp\` constructor + \`// biome-ignore lint/suspicious/noControlCharactersInRegex: reason\`. Property tests use \`fast-check\`: \`orgSlugArb\`/\`projectSlugArb\` match \`/^\[a-z]\[a-z0-9-]{1,30}\[a-z0-9]$/\`, \`numericIdArb\` matches \`/^\[1-9]\[0-9]{0,15}$/\`. \`DEFAULT\_NUM\_RUNS=50\`. If input validation is added, update arbitraries to match new constraints or property tests will fail.
 
 ### Preference
 
 <!-- lore:019e44c8-e3b2-70c1-afb6-d3acf24c531a -->
-* **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled — the \`setInterval\` pattern is the prescribed fix, not existing behavior. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit. Session ground-truth: cache entries are never auto-evicted and \`prune()\` is never scheduled in current code — do not assert otherwise.
+* **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled in existing code — the setInterval pattern is the prescribed fix. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit.
 
-<!-- lore:019e47b2-9bf3-738e-b774-efeea35399b5 -->
-* **Always investigate root causes by requesting systematic code-path analysis across multiple specific files**: (preference) When encountering unexpected system behavior, pre-identify 3-6 candidate explanations with exact files and functions, read all named files upfront, trace the full call chain end-to-end, and report findings per-area rather than asking clarifying questions first.
+<!-- lore:019e4422-5b29-77a8-8956-488233ef16a4 -->
+* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review & investigation standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. Critical+Medium fixed before merge. (2) Investigation: read actual source, trace full call chain, enumerate 2-4 candidates, report confirmed/falsified verdict with line numbers. (3) PR discipline: critical self-review before merge, CI green, amend+force-push. (4) After bug fix: add tests (4-6 edge cases) referencing issue number. Worker test files follow a consistent 7-case spec: compute job, missing record skip, cleanup hard-delete >30 days, preserve recently archived, sync batch, sync skip missing, sync dryRun. (5) Sentry IDs start with \`LOREAI-GATEWAY-\`. (6) Run lint, typecheck, full test suite before committing. Use Vitest (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai May 2026). Use kebab-case file naming.
 
-<!-- lore:019e44c8-4e3f-7835-972f-02ed2033a842 -->
-* **Always request worker tests with a consistent 7-case spec covering compute, missing-record, cleanup retention, and sync scenarios**: (preference) Worker test files follow a consistent 7-case spec: (1) compute job — DB lookup + update, (2) missing record — skip without throw, (3) cleanup — hard-delete records archived >30 days, (4) cleanup — preserve recently archived records, (5) sync — process a batch, (6) sync — skip missing records, (7) sync — respect dryRun flag. Tests mock DB and Redis. Use Vitest project-wide (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai+ts-node May 2026 — 312ms vs 30s startup). Use kebab-case file naming.
+<!-- lore:019e498a-c0e4-70c5-ad40-d4d6d9d26ff5 -->
+* **CI/PR cycle: check failing jobs, wait for bots, resolve all comments before merging**: CI/PR cycle: After every push: (1) check failing jobs via \`gh run view --log-failed --job $(gh pr checks $PR\_NO --json state,link -q '.\[] | select(.state == "FAILURE").link | split("/")\[-1]')\`; (2) wait for 'Sentry Seer' and 'Cursor BugBot' before acting; (3) fix all failures; (4) use \`gh api graphql\` with \`reviewThreads\` filtering \`isResolved==false, isMinimized==false\` for unresolved comments (fields: diff\_hunk, line, start\_line, body); (5) address all bot/human comments, respond or mark resolved; (6) repeat until clean. PR creation: check if already on a relevant branch; follow repo branch/commit conventions; base PR description on implementation plan (not overly long); add plan as \`git notes\`; create as draft initially. Always call \`plan\_exit\` when done planning. If BugBot finds nothing, merge and move on.
 
 <!-- lore:019e3cd7-97d3-7053-8f02-bb13d727662e -->
-* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval system: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\` header. Never accept eval-gaming fixes. Eval table consistency: per-difficulty averages must match overall average. Non-deterministic LLM output causes eval variance: re-run before concluding regression. Post-replay embedding backfill runs before QA phase.
+* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval system: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\`. Non-deterministic LLM output causes variance: re-run before concluding regression.
 
 <!-- lore:019e2168-2fa4-77bd-a557-9d6dbcb40d81 -->
 * **Prefer WASM backend over native onnxruntime-node for compiled binaries**: WASM backend for Bun \`--compile\` binaries with transformers.js: \`binaryExternalsPlugin\` in esbuild redirects \`onnxruntime-node\` → \`onnxruntime-web\` via \`onResolve\` (static imports only — does NOT redirect dynamic \`import()\` calls) and patches transformers.js CDN fallback via \`onLoad\` to read \`wasmPaths\` from \`globalThis.\_\_LORE\_VENDOR\_WASM\_PATHS\_\_\` (object form \`{ mjs, wasm }\` with exact hashed \`$bunfs\` filenames — directory strings fail). WASM files embedded as Bun \`{ type: 'file' }\` assets. For npm/CJS builds, \`onnxruntime-node\` stays external. WASM is ~2x faster on batches than native. Importing \`onnxruntime-web\` explicitly alongside the redirect creates two ort instances — 'cannot register backend cpu using priority 10' error.
diff --git a/README.md b/README.md
index 4b76475..0a2dafd 100644
--- a/README.md
+++ b/README.md
@@ -326,9 +326,23 @@ Lore re-scans the `lat.md/` directory periodically (on session idle), so changes
 
 ## Eval results
 
-At 400K tokens (realistic coding session length), Lore outperforms standard compaction — the approach used by Claude Code, Codex, and other tools that summarize older context when the conversation grows too long:
+### The mega-session test: 2.3 million tokens
 
-### Context retention (400K tokens)
+Real-world coding sessions can span days and accumulate millions of tokens. We extracted a real 5-day, 2.3M-token session (getsentry/cli refactoring — 95 user turns, multiple PRs, architectural decisions, code reviews) and tested whether each approach can answer questions about details from throughout the session:
+
+| What's tested | Lore | Compaction | Lore vs Compaction |
+|---|---|---|---|
+| Easy (late-session details) | **4.0**/5 | 2.4/5 | +67% |
+| Medium (mid-session details) | **3.9**/5 | 3.0/5 | +29% |
+| Hard (early-session details) | **4.1**/5 | 1.8/5 | +136% |
+| **Average** | **4.0**/5 | 2.4/5 | **+70%** |
+| **Perfect scores (5.0)** | **13/20** | 5/20 | 2.6x as many |
+
+*At 2.3M tokens, compaction compresses the entire conversation into ~11K tokens of summary — a 200x compression that destroys most details. Lore preserves them through distillation (21 observations totaling ~10K tokens) + 64K raw tail window + searchable temporal archive via recall. The hard questions — details from the first day of a 5-day session — are where compaction fails (1.8/5) and Lore excels (4.1/5).*
+
+### At 400K tokens
+
+At more typical session lengths, Lore still outperforms compaction:
 
 | What's tested | Lore | Compaction | Lore vs Compaction |
 |---|---|---|---|
@@ -338,7 +352,7 @@ At 400K tokens (realistic coding session length), Lore outperforms standard comp
 | **Average** | **4.8**/5 | 4.5/5 | **+7%** |
 | **Perfect scores (5.0)** | **12/15** | 9/15 | — |
 
-*Compaction baseline: multi-pass LLM summarization matching Claude Code's auto-compact behavior (~140K threshold, 2-3 cycles at 400K tokens). Scored by LLM-as-judge on a 1–5 scale. Lore's advantage is largest on medium-difficulty questions — mid-session details like decision alternatives, exact error messages, and rejected approaches that compaction summarizes away but Lore's distillation + recall preserves.*
+*Compaction baseline: multi-pass LLM summarization matching Claude Code's auto-compact behavior (~140K threshold). Scored by LLM-as-judge on a 1–5 scale. At 400K tokens, compaction loses only a few details; Lore's advantage grows dramatically at larger scales.*
 
 ### Preference recall (400K tokens)
 
@@ -351,12 +365,16 @@ At 400K tokens (realistic coding session length), Lore outperforms standard comp
 
 *Preference recall baselines are from a prior eval run with tail-window (80K). Compaction preference baselines pending re-run.*
 
-**What this means:** at 400K tokens, Lore scores 4.8/5 on context retention with 12 out of 15 perfect scores — compared to compaction's 4.5/5 with 9 perfect scores. The gap is largest on mid-session details that compaction loses through repeated summarization cycles.
+**What this means:** the longer the session, the bigger Lore's advantage. At 400K tokens, Lore scores 7% higher than compaction. At 2.3M tokens, the gap is 70%: compaction averages 2.4/5, below half marks, while Lore averages 4.0/5. Early-session details, where compaction scores just 1.8/5, are preserved by Lore's three-tier architecture (4.1/5).
 
-The eval suite (16 scenarios, 130+ questions, 5 dimensions) is open source in `packages/core/eval/`. Run it yourself:
+The eval suite is open source in `packages/core/eval/`. Run it yourself:
 
 ```bash
+# 400K inflated scenario
 bun packages/core/eval/run.ts --mode live --inflate 400000
+
+# 2.3M mega-session (real session, no inflation needed)
+bun packages/core/eval/run.ts --mode live --scenarios mega-cli-refactor
 ```
 
 **Cost:** Lore's memory layer runs at minimal additional cost — background distillation and curation use batch APIs (50% off on supported providers) and cheaper models. Local on-device embeddings (Nomic Embed v1.5) mean zero API cost for vector search. Predictive cache warming reduces expensive cache rebuilds.
@@ -373,7 +391,7 @@ bun packages/core/eval/run.ts --mode live --inflate 400000
 
 **v5 — behavioral pattern detection + 400K eval.** Vector similarity-based pattern echo detection, action tagging in distillation, cross-session pattern clustering, assertion pinning for long sessions, and a scenario inflator for realistic 400K-token evaluation. This is what closed the preference gap from +15% to +47% over tail-window.
 
-**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. Context retention: 4.8/5 with 12/15 perfect scores, +7% over compaction at 400K tokens.
+**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. 2.3M-token mega-session eval on a real 5-day coding session: Lore 4.0/5 vs compaction 2.4/5 (+70%), with 13/20 perfect scores vs 5/20.
 
 ## Development setup
 
diff --git a/docs/index.html b/docs/index.html
index 0376a99..b80474e 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -928,16 +928,16 @@ <h1 class="sr">
 
     <div class="hero-stats sr">
       <div class="stat-cell">
-        <div class="stat-n">12/15</div>
-        <div class="stat-l">Perfect Scores at 400K Tokens</div>
+        <div class="stat-n">+70%</div>
+        <div class="stat-l">vs Compaction at 2.3M Tokens</div>
       </div>
       <div class="stat-cell">
-        <div class="stat-n">4.8</div>
-        <div class="stat-l">out of 5.0 Detail Retention</div>
+        <div class="stat-n">13/20</div>
+        <div class="stat-l">Perfect Recall at 2.3M Tokens</div>
       </div>
       <div class="stat-cell">
-        <div class="stat-n">400K+</div>
-        <div class="stat-l">Token Sessions Supported</div>
+        <div class="stat-n">2.3M+</div>
+        <div class="stat-l">Token Sessions Tested</div>
       </div>
     </div>
   </section>
diff --git a/packages/core/eval/baselines.ts b/packages/core/eval/baselines.ts
index 2e9dc18..6fee58e 100644
--- a/packages/core/eval/baselines.ts
+++ b/packages/core/eval/baselines.ts
@@ -171,29 +171,71 @@ export async function compactionBaseline(
 
     if (prefix.length === 0) break;
 
-    // Summarize the prefix via LLM
-    const prefixText = renderConversation(prefix);
-    const userPrompt = COMPACTION_USER_TEMPLATE.replace(
-      "{{conversation}}",
-      prefixText,
-    );
-
-    const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
-      maxTokens: 4096,
-      temperature: 0,
-    });
+    // Summarize the prefix via LLM. If the prefix exceeds MAX_CHUNK_TOKENS
+    // (a proxy for the model's context window), split it into chunks,
+    // summarize each chunk separately, then concatenate the summaries.
+    const MAX_CHUNK_TOKENS = 800_000; // leave room for system prompt + output
+    const prefixTokens = totalTokens(prefix);
+    let summaryText: string;
+
+    if (prefixTokens <= MAX_CHUNK_TOKENS) {
+      // Fits in one call
+      const prefixText = renderConversation(prefix);
+      const userPrompt = COMPACTION_USER_TEMPLATE.replace("{{conversation}}", prefixText);
+      const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
+        maxTokens: 4096,
+        temperature: 0,
+      });
+      summaryText = result.text;
+    } else {
+      // Greedily pack turns into chunks of at most MAX_CHUNK_TOKENS (a single oversized turn still gets its own chunk)
+      const chunks: ConversationTurn[][] = [];
+      let chunk: ConversationTurn[] = [];
+      let chunkTokens = 0;
+      for (const turn of prefix) {
+        const t = turn.tokens ?? estimateTokens(renderTurn(turn));
+        if (chunkTokens + t > MAX_CHUNK_TOKENS && chunk.length > 0) {
+          chunks.push(chunk);
+          chunk = [];
+          chunkTokens = 0;
+        }
+        chunk.push(turn);
+        chunkTokens += t;
+      }
+      if (chunk.length > 0) chunks.push(chunk);
+
+      console.log(
+        `  [compaction] prefix too large (${prefixTokens} tok), splitting into ${chunks.length} chunks`,
+      );
+
+      // Summarize each chunk
+      const chunkSummaries: string[] = [];
+      for (let c = 0; c < chunks.length; c++) {
+        const chunkText = renderConversation(chunks[c]);
+        const userPrompt = COMPACTION_USER_TEMPLATE.replace("{{conversation}}", chunkText);
+        const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
+          maxTokens: 4096,
+          temperature: 0,
+        });
+        chunkSummaries.push(result.text);
+        console.log(
+          `  [compaction] chunk ${c + 1}/${chunks.length}: ${totalTokens(chunks[c])} tok → ${estimateTokens(result.text)} tok`,
+        );
+      }
+      summaryText = chunkSummaries.join("\n\n---\n\n");
+    }
 
     // Replace prefix with a synthetic summary turn + keep tail
     const summaryTurn: ConversationTurn = {
       role: "assistant",
-      content: [{ type: "text", text: `## Compacted Summary (pass ${compactionCount + 1})\n\n${result.text}` }],
-      tokens: estimateTokens(result.text),
+      content: [{ type: "text", text: `## Compacted Summary (pass ${compactionCount + 1})\n\n${summaryText}` }],
+      tokens: estimateTokens(summaryText),
     };
     currentTurns = [summaryTurn, ...tail];
     compactionCount++;
 
     console.log(
-      `  [compaction] pass ${compactionCount}: ${prefix.length} turns summarized → ${estimateTokens(result.text)} tok, ${currentTurns.length} turns remaining (${totalTokens(currentTurns)} tok)`,
+      `  [compaction] pass ${compactionCount}: ${prefix.length} turns summarized → ${estimateTokens(summaryText)} tok, ${currentTurns.length} turns remaining (${totalTokens(currentTurns)} tok)`,
     );
   }
 
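Reviewer note: the chunked path in `compactionBaseline` above packs turns greedily in order, closing a chunk whenever the next turn would push it past the limit. Below is a minimal standalone sketch of that packing. It assumes a simplified `Turn` type and a rough 4-characters-per-token estimator; both are hypothetical stand-ins for `ConversationTurn` and `estimateTokens`, whose definitions this diff does not show.

```typescript
// Hypothetical minimal turn shape; the real ConversationTurn lives in ../types.
interface Turn {
  text: string;
  tokens?: number;
}

// Rough stand-in for estimateTokens(): assumes about 4 characters per token.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily pack turns, in order, into chunks whose token totals stay within
// maxTokens. A single turn larger than maxTokens still gets its own chunk,
// because a chunk is only closed when it already holds at least one turn.
function chunkByTokens(turns: Turn[], maxTokens: number): Turn[][] {
  const chunks: Turn[][] = [];
  let chunk: Turn[] = [];
  let chunkTokens = 0;
  for (const turn of turns) {
    const t = turn.tokens ?? approxTokens(turn.text);
    if (chunkTokens + t > maxTokens && chunk.length > 0) {
      chunks.push(chunk);
      chunk = [];
      chunkTokens = 0;
    }
    chunk.push(turn);
    chunkTokens += t;
  }
  if (chunk.length > 0) chunks.push(chunk);
  return chunks;
}

const demo: Turn[] = [
  { text: "a", tokens: 5 },
  { text: "b", tokens: 4 },
  { text: "c", tokens: 12 }, // larger than the limit on its own
  { text: "d", tokens: 3 },
];
const sizes = chunkByTokens(demo, 10).map((c) => c.length);
console.log(sizes); // chunk sizes: 2, 1, 1
```

One consequence worth noting: an oversized turn is never split, so a chunk can exceed the limit when a single turn does. Whether that matters depends on how large individual turns get in the fixture.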
diff --git a/packages/core/eval/harness.ts b/packages/core/eval/harness.ts
index a3df1de..367500b 100644
--- a/packages/core/eval/harness.ts
+++ b/packages/core/eval/harness.ts
@@ -1117,6 +1117,8 @@ async function loadScenarios(
       case "context": {
         const mod = await import("./scenarios/context-management");
         scenarios.push(...mod.scenarios);
+        const mega = await import("./scenarios/mega-session");
+        scenarios.push(mega.default);
         break;
       }
       case "recall": {
diff --git a/packages/core/eval/scenarios/cli-refactor-session.json.gz b/packages/core/eval/scenarios/cli-refactor-session.json.gz
new file mode 100644
index 0000000..0522c94
Binary files /dev/null and b/packages/core/eval/scenarios/cli-refactor-session.json.gz differ
diff --git a/packages/core/eval/scenarios/mega-session.ts b/packages/core/eval/scenarios/mega-session.ts
new file mode 100644
index 0000000..8194b39
--- /dev/null
+++ b/packages/core/eval/scenarios/mega-session.ts
@@ -0,0 +1,302 @@
+/**
+ * Mega-session eval scenario: Real 2.3M-token getsentry/cli refactoring session.
+ *
+ * Extracted from Lore DB session ses_33198e726ffeDyEZ4ZoowIUDJO.
+ * 5-day session (Mar 8-12, 2026) with 95 user turns, 3959 assistant turns.
+ * Multiple PRs, architectural decisions, multi-phase migration, code reviews.
+ *
+ * No inflation needed — this IS the 2.3M token scenario.
+ */
+import { readFileSync } from "node:fs";
+import { gunzipSync } from "node:zlib";
+import { join } from "node:path";
+import type {
+  ScenarioDefinition,
+  ConversationTurn,
+  EvalQuestion,
+  Dimension,
+} from "../types";
+
+// Load the extracted session turns from the gzip-compressed JSON fixture.
+const fixtureDir = import.meta.dir; // Bun-specific: directory containing this module
+const compressed = readFileSync(join(fixtureDir, "cli-refactor-session.json.gz"));
+const turns: ConversationTurn[] = JSON.parse(gunzipSync(compressed).toString());
+
+const dimension: Dimension = "context";
+const scenarioId = "mega-cli-refactor";
+
+const base = {
+  dimension,
+  scenario: scenarioId,
+  sessionRef: "cli-refactor",
+  rubric: {
+    criteria: [
+      {
+        name: "accuracy",
+        description: "Does the answer correctly match the reference?",
+        scale: {
+          1: "Wrong or fabricated answer" as const,
+          3: "Partially correct — has the right topic but wrong specifics" as const,
+          5: "Exactly matches the reference with correct specifics" as const,
+        },
+      },
+    ],
+    weights: { accuracy: 1.0 },
+  },
+};
+
+// ---------------------------------------------------------------------------
+// Questions targeting various depths of the 2.3M-token session
+// ---------------------------------------------------------------------------
+
+const questions: EvalQuestion[] = [
+  // =========================================================================
+  // EASY — late session (turns 70-95, last ~300K tokens)
+  // Recent work that should be in the raw tail window
+  // =========================================================================
+  {
+    ...base,
+    id: "mega-e1",
+    question: "What was the final phase being worked on at the end of the session?",
+    referenceAnswer:
+      "Phase 6 (and 6b) — removing direct stdout/stderr usage from remaining commands " +
+      "and switching them to the return-based output system. The user also mentioned " +
+      "'auth login' as a command that uses the same architecture but is fundamentally " +
+      "different from list commands.",
+    metadata: { difficulty: "easy", tags: ["late-session", "phase"] },
+  },
+  {
+    ...base,
+    id: "mega-e2",
+    question: "What PR was being reviewed for Bugbot comments near the end of the session?",
+    referenceAnswer:
+      "PR #394 — the user referenced https://github.com/getsentry/cli/pull/394#discussion_r2920036806 " +
+      "with review feedback about streaming output and logger.info() calls.",
+    metadata: { difficulty: "easy", tags: ["late-session", "pr"] },
+  },
+  {
+    ...base,
+    id: "mega-e3",
+    question: "What was the user's instruction about Phase 6 and 7 being marked as 'future'?",
+    referenceAnswer:
+      "The user said Phase 6 and 7 should NOT be 'future' — they should be done " +
+      "once Phase 5 is merged. The user pushed back on deferring these phases.",
+    metadata: { difficulty: "easy", tags: ["late-session", "directive"] },
+  },
+  {
+    ...base,
+    id: "mega-e4",
+    question: "What command did the user repeatedly tell the assistant to run for checking CI failures?",
+    referenceAnswer:
+      "gh run view --log-failed --job $(gh pr checks $PR_NO --json state,link " +
+      "-q '.[] | select(.state == \"FAILURE\").link | split(\"/\")[-1]') — " +
+      "used repeatedly throughout the session after each push.",
+    metadata: { difficulty: "easy", tags: ["pattern", "ci"] },
+  },
+  {
+    ...base,
+    id: "mega-e5",
+    question: "What did the user say about the AGENTS.md file in the PR?",
+    referenceAnswer:
+      "The user said 'The change in AGENTS.md is completely irrelevant. Clean up this " +
+      "file to remove all irrelevant entries.' The AGENTS.md changes were auto-managed " +
+      "and not part of the intended PR.",
+    metadata: { difficulty: "easy", tags: ["mid-session", "directive"] },
+  },
+
+  // =========================================================================
+  // MEDIUM — mid session (turns 30-60, ~500K-1.5M token range)
+  // Architectural decisions and design debates
+  // =========================================================================
+  {
+    ...base,
+    id: "mega-m1",
+    question: "What was the architectural vision for the template-based output system?",
+    referenceAnswer:
+      "Commands become 'data producers' — the framework selects a template " +
+      "(JSON, plain text, rendered markdown) based on flags. Commands describe " +
+      "*what* to output; the framework decides *how*. This was a four-phase " +
+      "convergence plan.",
+    metadata: { difficulty: "medium", tags: ["architecture", "design"] },
+  },
+  {
+    ...base,
+    id: "mega-m2",
+    question: "What was the user's position on tuples vs objects for command return values?",
+    referenceAnswer:
+      "The user asked 'are tuples really cheaper than using a simple object?' and " +
+      "the assistant confirmed. The user accepted tuples but wanted the simplest " +
+      "approach — they said they were fine with {data, footer} or [data, footer], " +
+      "whichever is cheaper and more maintainable. They objected to defining footer " +
+      "as a separate function during declaration as 'too rigid'.",
+    metadata: { difficulty: "medium", tags: ["design-debate", "decision"] },
+  },
+  {
+    ...base,
+    id: "mega-m3",
+    question: "What did the user say about consola's spinner functionality?",
+    referenceAnswer:
+      "The user asked 'Does consola have a spinner helper?' and then 'can we make " +
+      "the spinner use process.stderr or process.stdout internally?' followed by " +
+      "'wait, would that cause issues with our tests?' — showing concern about " +
+      "test compatibility with spinner output.",
+    metadata: { difficulty: "medium", tags: ["mid-session", "consola"] },
+  },
+  {
+    ...base,
+    id: "mega-m4",
+    question: "Why did the user want to remove the --include flag from the api command?",
+    referenceAnswer:
+      "The user asked to check Sentry traces (org: 'sentry', project: 'cli') to " +
+      "see if the -i or --include flag was ever used with api calls. The traces " +
+      "showed it was never used, so the user approved removal.",
+    metadata: { difficulty: "medium", tags: ["decision", "sentry-traces"] },
+  },
+  {
+    ...base,
+    id: "mega-m5",
+    question: "What was PR #373 about?",
+    referenceAnswer:
+      "PR #373 was about the --fields flag for context-window-friendly JSON output. " +
+      "The problem was that every --json command dumped the full object, wasting agent " +
+      "tokens. The --fields flag lets agents request only the specific fields they need.",
+    metadata: { difficulty: "medium", tags: ["pr", "feature"] },
+  },
+  {
+    ...base,
+    id: "mega-m6",
+    question:
+      "What was the user's core frustration with the assistant's approach to stdout/stderr in commands?",
+    referenceAnswer:
+      "The user was frustrated that the assistant kept using direct stdout/stderr " +
+      "writes and manual output in commands instead of the return-based system. " +
+      "The user explicitly said: 'which part of \"do not use stderr or stdout or " +
+      "manual writes there directly, always use return-based output\" you don't " +
+      "understand?' (referring to src/commands/api.ts).",
+    metadata: { difficulty: "medium", tags: ["frustration", "directive"] },
+  },
+  {
+    ...base,
+    id: "mega-m7",
+    question: "What was the user's argument about JSON output consistency?",
+    referenceAnswer:
+      "The user argued: (1) JSON output should be consistent as it's machine-consumed, " +
+      "conditionals make things harder especially with tools like jq. (2) The user " +
+      "suggested the api command output should NOT be conditional on --dry-run (json " +
+      "vs human readable) — it should be consistent. They considered adding --no-json " +
+      "or --json=false for users who want human-readable output from api.",
+    metadata: { difficulty: "medium", tags: ["design", "consistency"] },
+  },
+
+  // =========================================================================
+  // HARD — early session (turns 1-25, first ~500K tokens)
+  // First issue, implementation details, specific code
+  // =========================================================================
+  {
+    ...base,
+    id: "mega-h1",
+    question: "What was the very first issue selected from the open issues list, and why?",
+    referenceAnswer:
+      "Issue #350 — Input hardening against agent hallucinations. It was chosen " +
+      "because of security impact (defense-in-depth against URL injection via " +
+      "org/project slugs interpolated into API paths).",
+    metadata: { difficulty: "hard", tags: ["early-session", "issue-selection"] },
+  },
+  {
+    ...base,
+    id: "mega-h2",
+    question: "What branch name was used for the first issue's implementation?",
+    referenceAnswer: "feat/input-hardening",
+    metadata: { difficulty: "hard", tags: ["early-session", "branch"] },
+  },
+  {
+    ...base,
+    id: "mega-h3",
+    question: "What function was created for input validation in the first PR?",
+    referenceAnswer:
+      "validateResourceId — a function to validate all slug/ID components as they're " +
+      "parsed in arg-parsing.ts. It was part of the input hardening against agent " +
+      "hallucinations (Issue #350).",
+    metadata: { difficulty: "hard", tags: ["early-session", "code"] },
+  },
+  {
+    ...base,
+    id: "mega-h4",
+    question: "What was the second issue worked on after merging PR #370?",
+    referenceAnswer:
+      "Magic @ selectors (@latest, @most_frequent) for issue commands — " +
+      "PR #371 on branch feat/magic-selectors.",
+    metadata: { difficulty: "hard", tags: ["early-session", "issue-sequence"] },
+  },
+  {
+    ...base,
+    id: "mega-h5",
+    question: "How many tests were passing in the full test suite during the first PR?",
+    referenceAnswer:
+      "359 tests passed in the full test suite, plus 19 property tests for " +
+      "the input hardening work specifically.",
+    metadata: { difficulty: "hard", tags: ["early-session", "test-counts"] },
+  },
+  {
+    ...base,
+    id: "mega-h6",
+    question: "What user feedback prompted adding magic selector info to help text?",
+    referenceAnswer:
+      "The user said: 'could we add the magic selector info to sentry issue/sentry " +
+      "issue --help? this could help agents' — and then asked 'any other commands " +
+      "you think we can add this to?'",
+    metadata: { difficulty: "hard", tags: ["early-session", "user-feedback"] },
+  },
+  {
+    ...base,
+    id: "mega-h7",
+    question: "What was the coverage requirement the user enforced across PRs?",
+    referenceAnswer:
+      "Patch coverage above 80%. The user referenced Codecov reports on PRs #370 " +
+      "and #371 specifically, asking to bump the coverage above 80% each time.",
+    metadata: { difficulty: "hard", tags: ["cross-session", "coverage"] },
+  },
+  {
+    ...base,
+    id: "mega-h8",
+    question:
+      "What was the user's reasoning for wanting to remove writeResponseBody() from the codebase?",
+    referenceAnswer:
+      "The user insisted on removing writeResponseBody() and switching to the " +
+      "return-based system. When the assistant suggested leaving it as 'harmless' " +
+      "since it was 'still exported and tested, just not called internally', the " +
+      "user explicitly said 'No remove this.' The principle was: no backward-compat " +
+      "stubs — remove dead code entirely.",
+    metadata: { difficulty: "hard", tags: ["mid-session", "code-cleanup"] },
+  },
+];
+
+// ---------------------------------------------------------------------------
+// Scenario definition
+// ---------------------------------------------------------------------------
+
+const scenario: ScenarioDefinition = {
+  id: scenarioId,
+  dimension,
+  label: "Mega CLI Refactor (2.3M tokens)",
+  description:
+    "Real 5-day getsentry/cli refactoring session — 2.3M tokens, 95 user turns, " +
+    "multiple PRs, architectural decisions, multi-phase migration. Tests recall " +
+    "of specific details across extreme context depths.",
+  sessions: [
+    {
+      id: "cli-refactor",
+      label: "CLI Refactoring Session",
+      projectPath: "/workspace/getsentry-cli",
+      turns,
+      metadata: {
+        totalTokens: 2374811,
+        description: "5-day CLI refactoring: Issue #350 → PRs #370-394+, buildCommand migration",
+      },
+    },
+  ],
+  questions,
+  applicableBaselines: ["lore", "compaction"],
+};
+
+export default scenario;
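Reviewer note: `mega-session.ts` parses the gunzipped fixture and trusts the result to be `ConversationTurn[]`. Below is a minimal sketch of a gzip round-trip with a shape check, using only `node:zlib`. The `FixtureTurn` type and the `packTurns`/`unpackTurns` helpers are hypothetical names for illustration, not part of this PR.

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Hypothetical simplified turn shape; the real ConversationTurn is richer.
interface FixtureTurn {
  role: "user" | "assistant";
  text: string;
}

// Serialize turns to JSON and gzip them, as a fixture file would be stored.
function packTurns(turns: FixtureTurn[]): Buffer {
  return gzipSync(Buffer.from(JSON.stringify(turns), "utf8"));
}

// Decompress and parse, failing early if the payload is not an array.
function unpackTurns(compressed: Buffer): FixtureTurn[] {
  const parsed: unknown = JSON.parse(gunzipSync(compressed).toString("utf8"));
  if (!Array.isArray(parsed)) {
    throw new Error("fixture must be a JSON array of turns");
  }
  return parsed as FixtureTurn[];
}

const sample: FixtureTurn[] = [
  { role: "user", text: "Pick the first open issue." },
  { role: "assistant", text: "Starting on input hardening." },
];
const roundTripped = unpackTurns(packTurns(sample));
console.log(roundTripped.length === sample.length);
```

A check like this only guards against a badly malformed fixture; it does not validate individual turns.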