diff --git a/.lore.md b/.lore.md index b88a614..95266f7 100644 --- a/.lore.md +++ b/.lore.md @@ -5,7 +5,7 @@ ### Architecture -* **3-layer gradient model: layer 2 is transient, falls back to layer 1 via urgent distillation**: 3-layer gradient model: Layer 0 (all raw messages + LTM, append-only cache). Layer 1 (distilled prefix + pinned raw window, bust once on entry then warm). Layer 2/emergency (transient hard reset: fresh LTM, 2-3 best distillations, current agentic turn — fires 1-2 turns then urgent distillation falls back to Layer 1). Layer 2 must NOT set stickiness — stickiness only applies to layers 1-3. Bug in \`gradient.ts\`: \`effectiveMinLayer = max(0, lastLayer)\` traps sessions in emergency indefinitely; fix: restrict stickiness to \`lastLayer >= 1 && lastLayer <= 3\`. Context budget caps (160K at Opus) are cost-driven. Layer-specific distLimit: layers 1-2 all non-archived distillations; layer 3: top 5 via \`selectDistillations()\`; emergency: top 2. Scoring: 70% recency + 30% \`importanceBonus()\`. Cache frozen during tool-call chains for byte-identical prefix. +* **3-layer gradient model: layer 2 is transient, falls back to layer 1 via urgent distillation**: 3-layer gradient model: Layer 0 (all raw messages + LTM, append-only cache). Layer 1 (distilled prefix + pinned raw window, bust once on entry then warm). Layer 2/emergency (transient hard reset: fresh LTM, 2-3 best distillations, current agentic turn — fires 1-2 turns then urgent distillation falls back to Layer 1). Layer 2 must NOT set stickiness — restrict to \`lastLayer >= 1 && lastLayer <= 3\` (bug: \`effectiveMinLayer = max(0, lastLayer)\` traps sessions in emergency). Context budget caps (160K at Opus) are cost-driven. distLimit: layers 1-2 all non-archived distillations; layer 3: top 5 via \`selectDistillations()\`; emergency: top 2. Scoring: 70% recency + 30% \`importanceBonus()\`. Cache frozen during tool-call chains for byte-identical prefix. 
* **Background LLM rate limiting: p-limit(2) + 429 circuit breaker in background-limiter.ts**: Global concurrency limit for background LLM work in \`packages/gateway/src/background-limiter.ts\`. Uses \`p-limit(2)\` to cap simultaneous background LLM calls across all idle sessions. Circuit breaker trips on 429 responses and pauses all background work for the \`Retry-After\` duration. Wired into: idle scheduler, pipeline incremental distillation, in-flight curation. Urgent distillation is excluded (client is waiting). Without this, N idle sessions fire N×4 simultaneous background calls causing cascading rate limit failures. @@ -13,8 +13,11 @@ * **Conversation import system: providers, detection, extraction pipeline**: Core import system in \`packages/core/src/import/\`. Key design: \`AgentHistoryProvider\` interface with \`detect()\`/\`load()\`; providers registered in global registry (\`providers/index.ts\`). Detection scans all providers, returns \`DetectedSession\[]\`. Extraction calls curator LLM sequentially per chunk, deduplicating ops via \`parseOps()\`/\`applyOps()\`. Idempotency via \`import\_history\` table (DB migration v19). Built-in providers: Claude Code (\`~/.claude/projects/\`), OpenCode (SQLite), Aider (markdown), Codex (\`~/.codex/sessions/\` JSONL), Cline (VS Code globalStorage JSON), Continue (\`~/.continue/sessions/\` JSON), Pi (\`~/.pi/agent/sessions/\` tree-structured JSONL). Auto-import triggered in \`lore run\` via \`maybeAutoImport()\`. Copilot Chat skipped (opaque leveldb). + +* **getsentry/cli output system: return-based, template-driven, no direct stdout/stderr**: getsentry/cli output system: return-based, template-driven, no direct stdout/stderr. Commands return data; \`buildCommand\` wrapper handles rendering (JSON, plain text, markdown). No manual \`writeJson()\`/\`writeOutput()\`/\`writeResponseBody()\` calls. \`--json\` and \`--fields\` lifted into \`buildCommand\`. 
For streaming, use \`buildStreamingCommand\` (detects AsyncGenerator, renders based on \`--json\`; JSONL mode flushes each \`yield\` immediately). Spinner is the only component holding stdout/stderr references. Throw \`exitCode\` rather than weaving it around. Verbose mode uses \`logger.debug()\` via Consola. \`api\` command always returns JSON. \`--dry-run\` validates inputs and returns mock result. Goal: 0 direct stdout/stderr usages. No dead code; delete removed functions entirely. \`auth login\` example: yield QR code + URL, then resume polling with spinner. If teams are missing during project creation, auto-create rather than error. Migration order: (1) stop exposing stdout/stderr from \`buildCommand\`; (2) refactor non-streaming list commands; (3) implement \`buildStreamingCommand\`. + -* **Gradient tool\_use/tool\_result pairing: reconstruct-after-eviction pattern**: Gradient tool\_use/tool\_result pairing — reconstruct-after-eviction pattern: \`tryFit()\` has NO logic to keep pairs together; safety via: (1) \`resolveToolResults()\` (temporal-adapter.ts:239) merges tool result data onto assistant tool parts, strips user-side \`tool\_result\` blocks; (2) \`loreMessagesToGateway()\` (pipeline.ts:3401) reconstructs pairs from surviving assistant tool parts; (3) \`removeOrphanedToolResults()\` (pipeline.ts:3524) validates BOTH directions — tool\_result→tool\_use AND tool\_use→tool\_result (Pass 2, PR #424). \`sanitizeToolParts()\` (gradient.ts:1071) converts pending/running tool parts to error state. Layer 4 (emergency) never strips tool parts to avoid infinite loops. Prefix/raw boundary trap: prefix ends with assistant (text-only), rawWindow may start with assistant containing tool\_use → back-to-back assistants → Anthropic rejects. Fix: advance cutoff past leading assistant messages when prefix is present at all 3 assembly points in gradient.ts (tryFit, tryFitStable pinned path, emergency layer). PR #428. 
+* **Gradient tool\_use/tool\_result pairing: reconstruct-after-eviction pattern**: Gradient tool\_use/tool\_result pairing — reconstruct-after-eviction pattern: \`tryFit()\` has NO logic to keep pairs together; safety via: (1) \`resolveToolResults()\` (temporal-adapter.ts:239) merges tool result data onto assistant tool parts, strips user-side \`tool\_result\` blocks; (2) \`loreMessagesToGateway()\` (pipeline.ts:3401) reconstructs pairs from surviving assistant tool parts; (3) \`removeOrphanedToolResults()\` (pipeline.ts:3524) validates BOTH directions. \`sanitizeToolParts()\` (gradient.ts:1071) converts pending/running tool parts to error state. Layer 4 never strips tool parts to avoid infinite loops. Prefix/raw boundary trap: prefix ends with assistant (text-only), rawWindow may start with assistant containing tool\_use → back-to-back assistants → Anthropic rejects. Fix: advance cutoff past leading assistant messages when prefix is present at all 3 assembly points in gradient.ts (tryFit, tryFitStable pinned path, emergency layer). * **LTM confidence field: semantic meaning and rerankPreferences() for legacy entries**: \`ltm.create()\` accepts optional \`confidence\` param (default 1.0, clamped \[0,1]). Semantics: 1.0=unconditional directive, 0.9=strong preference, 0.8=moderate, 0.6=mild. \`CuratorOp\` create type includes \`confidence\`, wired through \`applyOps\`. \`rerankPreferences()\` in \`packages/core/src/ltm.ts\` re-scores legacy entries by directive keyword patterns (\`STRONG\_DIRECTIVE\_RE\`); skips entries whose \`confidence\` was already set to a non-default value — manual overrides are preserved. \`lore data rerank\` CLI command triggers re-ranking; also auto-runs after \`lore data recover\`. Run after deploying to fix existing preferences in DB. 
@@ -46,16 +49,13 @@ * **Remote import idempotency split-brain: isImported queries local DB, recordImport writes to remote**: When \`LORE\_REMOTE\_URL\` is set, \`lore import\` has a split-brain bug: \`isImported()\` checks the \*\*local\*\* SQLite \`import\_history\` table, but \`remotePost('/api/v1/import/record')\` writes the record to the \*\*remote\*\* DB. Result: every subsequent run re-detects all sessions as un-imported and double-extracts. Fix: (1) add \`GET /api/v1/import/history\` endpoint and query remote during dedup check when remote URL is set; (2) optionally also write locally as offline resilience. The remote DB is the source of truth for import history in remote mode. -* **splitSegments() infinite recursion on oversized single messages**: splitSegments() infinite recursion on oversized single messages: In \`packages/core/src/distillation.ts\`, \`splitSegments()\` recurses infinitely when a single message exceeds \`maxSegmentTokens\` (16384). \`findSplitIndex()\` returns \`messages.length\` (=1), so \`left = messages.slice(0, 1)\` produces an identical recursive call. Triggered on large tool outputs (~49KB+). Fix: add base case after the \`totalTokens <= maxTokens\` guard — \`if (messages.length <= 1) return \[messages]\`. The oversized message becomes an indivisible segment. - - -* **Temporal storage: store from original req.messages BEFORE resolveToolResults strips tool\_result content**: Temporal storage ordering trap: call \`gatewayMessagesToLore()\` and store temporal messages BEFORE \`resolveToolResults()\` runs. \`resolveToolResults()\` replaces tool\_result parts with \`\[tool results provided] (t:msgID)\` placeholder — if temporal storage happens after, stored user message has no searchable content. The \`(t:msgID)\` reference lets the model fetch original content via recall tool. \`postResponse\` in \`pipeline.ts\` wraps all post-response processing in a broad try/catch — errors inside \`temporal.store()\` are silently swallowed. 
Check for \`post-response processing failed\` log lines first when debugging temporal storage issues. Eval teardown trap: \`closeDB()\` + \`unlinkSync\` deletes DB, then subsequent QA phase creates a new empty DB — inspect a killed/timed-out eval run's DB to verify actual storage. +* **splitSegments() infinite recursion on oversized single messages**: \`splitSegments()\` infinite recursion on oversized single messages in \`packages/core/src/distillation.ts\`: recurses infinitely when a single message exceeds \`maxSegmentTokens\` (16384). \`findSplitIndex()\` returns \`messages.length\` (=1), so \`left = messages.slice(0, 1)\` produces an identical recursive call. Triggered on large tool outputs (~49KB+). Fix: add base case after the \`totalTokens <= maxTokens\` guard — \`if (messages.length <= 1) return \[messages]\`. The oversized message becomes an indivisible segment. -* **TTL downgrade hysteresis: downgradeStreak field prevents compounding cache busts**: Auto-TTL downgrade hysteresis in \`packages/gateway/src/pipeline.ts\`: downgrade from 1h→5m TTL requires 3 consecutive short-gap turns (\`ttlDowngradeStreak\` in \`SessionState\`). Block downgrade if >50% of session tokens are cached. Reset streak on any long-gap turn. Subagent turns and tool-use continuations excluded from gap recording — capture \`prevStopReason\` before line 1667 overwrites it, skip when \`prevStopReason === 'tool\_use'\` or \`isSubagentTurn\`. State persistence: immediate (session identity), per-turn (cost snapshot), 30s periodic (gradient EMAs + cache warming via dirty flag). Max data loss on crash: ~30s. Also: recall follow-up requests must set \`cacheConversation: false\` — otherwise modified message array triggers full cache write at 5m TTL pricing. 
+* **TTL downgrade hysteresis: downgradeStreak field prevents compounding cache busts**: Auto-TTL downgrade hysteresis in \`packages/gateway/src/pipeline.ts\`: downgrade from 1h→5m TTL requires 3 consecutive short-gap turns (\`ttlDowngradeStreak\` in \`SessionState\`). Block downgrade if >50% of session tokens are cached. Reset streak on any long-gap turn. Subagent turns and tool-use continuations excluded from gap recording — capture \`prevStopReason\` before line 1667 overwrites it, skip when \`prevStopReason === 'tool\_use'\` or \`isSubagentTurn\`. State persistence: immediate (session identity), per-turn (cost snapshot), 30s periodic (gradient EMAs + cache warming via dirty flag). Max data loss on crash: ~30s. Recall follow-up requests must set \`cacheConversation: false\` — otherwise modified message array triggers full cache write at 5m TTL pricing. -* **Upgrade lock double-acquisition bug: same process re-locks same file**: In \`packages/gateway/src/cli/lib/binary.ts\`, \`downloadBinaryToTemp()\` acquires a lock on \`\.lock\` and holds it. Then \`installBinary()\` computes the same install path and tries to \`acquireLock()\` again. \`handleExistingLock()\` only allows re-entry if \`existingPid === process.ppid\` (parent), but the lock was written by the same process (\`existingPid === process.pid\`), so it throws 'Another upgrade is already in progress'. Fix: in \`handleExistingLock\`, also allow re-entry when \`existingPid === process.pid\`. Double \`releaseLock()\` is safe — \`releaseLock\` swallows errors so the second call is a no-op after the file is deleted. +* **Upgrade lock double-acquisition bug: same process re-locks same file**: In \`packages/gateway/src/cli/lib/binary.ts\`, \`downloadBinaryToTemp()\` acquires a lock on \`\.lock\` and holds it. Then \`installBinary()\` computes the same install path and tries to \`acquireLock()\` again. 
\`handleExistingLock()\` only allows re-entry if \`existingPid === process.ppid\` (parent), but the lock was written by the same process (\`existingPid === process.pid\`), so it throws 'Another upgrade is already in progress'. Fix: in \`handleExistingLock\`, also allow re-entry when \`existingPid === process.pid\`. Double \`releaseLock()\` is safe — swallows errors so the second call is a no-op. * **vectorSearch() is unscoped — test cleanup must delete all embedding rows**: \`vectorSearch()\` in \`packages/core/src/ltm.ts\` queries \`knowledge WHERE embedding IS NOT NULL AND confidence > 0.2\` with no \`project\_id\` filter (intentional for cross-project search). Two gotchas: (1) Test suites scoped to one project leak embedding rows into other vectorSearch tests — \`beforeEach\` must \`DELETE FROM knowledge WHERE embedding IS NOT NULL\`. (2) \`vectorSearch()\` has no \`excludeCategories\` param — category exclusions from \`forSession()\` callers have no effect; add optional \`excludeCategories\` param and propagate from callers. Also: global entries (pid=null) force \`crossProject=true\`; confidence is clamped to \[0.0, 1.0] in \`update()\`. @@ -65,22 +65,22 @@ * **Enhanced dedup: title overlap + vector similarity (Nomic v1.5)**: Nomic Embed v1.5 dedup threshold: same-domain cosine similarity spreads 0.46–0.70 (vs BGE Small which clusters at 0.93–0.97+, making dedup unusable). Correct dedup threshold: \*\*0.935\*\* — at-or-above is genuine duplicate. Range 0.85–0.91 contains 'related but distinct' entries; 0.85 produces false positives across project boundaries. \`deduplicate()\` in \`packages/core/src/ltm.ts\` uses both title word-overlap (0.7 Jaccard + 4+ shared words) AND vector cosine similarity. BGE Small embeddings are auto-nulled by \`checkConfigChange()\` on startup; \`backfillEmbeddings()\` re-embeds with Nomic v1.5. \`lore data reindex\` triggers backfill on-demand without gateway restart. 
- -* **Uniform citation format: (prefix:id) for all recall-able references**: Uniform citation format: all recall-able references use \`(prefix:id)\`: \`(d:UUID)\` for distillations, \`(t:msgID)\` for temporal messages, \`(k:entryID)\` for knowledge entries. Distillation headers render as \`(d:UUID | lossy | N sources)\`. Tool result placeholders render as \`\[tool results provided] (t:msgID)\`. Do NOT use markdown link style. Recall RRF: distillations get 4 RRF lists (BM25 + vector + quality + exact-match) vs temporal's 3 (no quality list). \`SOURCE\_WEIGHT\`: distillation=0.8, temporal=0.8, knowledge=1.0. \`charBudget\` 12K. Vector search gate skipped for session-scoped recall. Per-query: 4 RRF lists (knowledge BM25, distillation BM25, temporal BM25, temporal recency); with 3 LLM expansions = 16 lists. \`MAX\_RRF\_LISTS=10\` trims expanded-query lists first. RRF formula: \`w/(60+rank)\`. \`formatFusedResults\`: tier 0 (score≥60% of top), tier 1 (≥30%), tier 2 (rest); per-result char budget clamped \[80,1200]. Temporal in tier 2 gets ~0.35x weight vs knowledge. + +* **getsentry/cli input validation module: src/lib/input-validation.ts**: getsentry/cli input validation (\`src/lib/input-validation.ts\`): \`rejectControlChars(input, label)\`, \`validateSlug(input, label)\` (rejects ?, #, %, whitespace, control chars, /), \`rejectPreEncoded(input, label)\`, \`validateEndpoint(endpoint)\` (blocks .. segments), \`validateResourceId(input, label)\`. Applied at: \`parseOrgProjectArg\`, \`parseIssueArg\`, \`parseSlashSeparatedArg\` (slash-split components only, NOT no-slash case — plain IDs need downstream processing) in \`arg-parsing.ts\`; \`normalizeEndpoint\` in \`api.ts\`. NOT applied in \`resolve-target.ts\` — env vars and DB cache are trusted. Biome lint rule: \`lint/suspicious/noControlCharactersInRegex\` (NOT \`lint/correctness/...\`) — use \`RegExp\` constructor + \`// biome-ignore lint/suspicious/noControlCharactersInRegex: reason\`. 
Property tests use \`fast-check\`: \`orgSlugArb\`/\`projectSlugArb\` match \`/^\[a-z]\[a-z0-9-]{1,30}\[a-z0-9]$/\`, \`numericIdArb\` matches \`/^\[1-9]\[0-9]{0,15}$/\`. \`DEFAULT\_NUM\_RUNS=50\`. If input validation is added, update arbitraries to match new constraints or property tests will fail. ### Preference -* **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled — the \`setInterval\` pattern is the prescribed fix, not existing behavior. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit. Session ground-truth: cache entries are never auto-evicted and \`prune()\` is never scheduled in current code — do not assert otherwise. +* **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled in existing code — the setInterval pattern is the prescribed fix. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit. 
- -* **Always investigate root causes by requesting systematic code-path analysis across multiple specific files**: (preference) When encountering unexpected system behavior, pre-identify 3-6 candidate explanations with exact files and functions, read all named files upfront, trace the full call chain end-to-end, and report findings per-area rather than asking clarifying questions first. + +* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review & investigation standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. Critical+Medium fixed before merge. (2) Investigation: read actual source, trace full call chain, enumerate 2-4 candidates, report confirmed/falsified verdict with line numbers. (3) PR discipline: critical self-review before merge, CI green, amend+force-push. (4) After bug fix: add tests (4-6 edge cases) referencing issue number. Worker test files follow a consistent 7-case spec: compute job, missing record skip, cleanup hard-delete >30 days, preserve recently archived, sync batch, sync skip missing, sync dryRun. (5) Sentry IDs start with \`LOREAI-GATEWAY-\`. (6) Run lint, typecheck, full test suite before committing. Use Vitest (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai May 2026). Use kebab-case file naming. - -* **Always request worker tests with a consistent 7-case spec covering compute, missing-record, cleanup retention, and sync scenarios**: (preference) Worker test files follow a consistent 7-case spec: (1) compute job — DB lookup + update, (2) missing record — skip without throw, (3) cleanup — hard-delete records archived >30 days, (4) cleanup — preserve recently archived records, (5) sync — process a batch, (6) sync — skip missing records, (7) sync — respect dryRun flag. Tests mock DB and Redis. 
Use Vitest project-wide (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai+ts-node May 2026 — 312ms vs 30s startup). Use kebab-case file naming. + +* **CI/PR cycle: check failing jobs, wait for bots, resolve all comments before merging**: CI/PR cycle: After every push: (1) check failing jobs via \`gh run view --log-failed --job $(gh pr checks $PR\_NO --json state,link -q '.\[] | select(.state == "FAILURE").link | split("/")\[-1]')\`; (2) wait for 'Sentry Seer' and 'Cursor BugBot' before acting; (3) fix all failures; (4) use \`gh api graphql\` with \`reviewThreads\` filtering \`isResolved==false, isMinimized==false\` for unresolved comments (fields: diff\_hunk, line, start\_line, body); (5) address all bot/human comments, respond or mark resolved; (6) repeat until clean. PR creation: check if already on a relevant branch; follow repo branch/commit conventions; base PR description on implementation plan (not overly long); add plan as \`git notes\`; create as draft initially. Always call \`plan\_exit\` when done planning. If BugBot finds nothing, merge and move on. -* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval system: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. 
Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\` header. Never accept eval-gaming fixes. Eval table consistency: per-difficulty averages must match overall average. Non-deterministic LLM output causes eval variance: re-run before concluding regression. Post-replay embedding backfill runs before QA phase. +* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval system: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\`. Non-deterministic LLM output causes variance: re-run before concluding regression. * **Prefer WASM backend over native onnxruntime-node for compiled binaries**: WASM backend for Bun \`--compile\` binaries with transformers.js: \`binaryExternalsPlugin\` in esbuild redirects \`onnxruntime-node\` → \`onnxruntime-web\` via \`onResolve\` (static imports only — does NOT redirect dynamic \`import()\` calls) and patches transformers.js CDN fallback via \`onLoad\` to read \`wasmPaths\` from \`globalThis.\_\_LORE\_VENDOR\_WASM\_PATHS\_\_\` (object form \`{ mjs, wasm }\` with exact hashed \`$bunfs\` filenames — directory strings fail). 
WASM files embedded as Bun \`{ type: 'file' }\` assets. For npm/CJS builds, \`onnxruntime-node\` stays external. WASM is ~2x faster on batches than native. Importing \`onnxruntime-web\` explicitly alongside the redirect creates two ort instances — 'cannot register backend cpu using priority 10' error.
diff --git a/README.md b/README.md
index 4b76475..0a2dafd 100644
--- a/README.md
+++ b/README.md
@@ -326,9 +326,23 @@ Lore re-scans the `lat.md/` directory periodically (on session idle), so changes

 ## Eval results

-At 400K tokens (realistic coding session length), Lore outperforms standard compaction — the approach used by Claude Code, Codex, and other tools that summarize older context when the conversation grows too long:
+### The mega-session test: 2.3 million tokens

-### Context retention (400K tokens)
+Real-world coding sessions can span days and accumulate millions of tokens. We extracted a real 5-day, 2.3M-token session (getsentry/cli refactoring — 95 user turns, multiple PRs, architectural decisions, code reviews) and tested whether each approach can answer questions about details from throughout the session:
+
+| What's tested | Lore | Compaction | Lore vs Compaction |
+|---|---|---|---|
+| Easy (late-session details) | **4.0**/5 | 2.4/5 | +67% |
+| Medium (mid-session details) | **3.9**/5 | 3.0/5 | +29% |
+| Hard (early-session details) | **4.1**/5 | 1.8/5 | +136% |
+| **Average** | **4.0**/5 | 2.4/5 | **+70%** |
+| **Perfect scores (5.0)** | **13/20** | 5/20 | 2.6x more |
+
+*At 2.3M tokens, compaction compresses the entire conversation into ~11K tokens of summary — a 200x compression that destroys most details. Lore preserves them through distillation (21 observations totaling ~10K tokens) + 64K raw tail window + searchable temporal archive via recall. The hard questions — details from the first day of a 5-day session — are where compaction fails (1.8/5) and Lore excels (4.1/5).*
+
+### At 400K tokens
+
+At more typical session lengths, Lore still outperforms:

 | What's tested | Lore | Compaction | Lore vs Compaction |
 |---|---|---|---|
@@ -338,7 +352,7 @@ At 400K tokens (realistic coding session length), Lore outperforms standard comp
 | **Average** | **4.8**/5 | 4.5/5 | **+7%** |
 | **Perfect scores (5.0)** | **12/15** | 9/15 | — |

-*Compaction baseline: multi-pass LLM summarization matching Claude Code's auto-compact behavior (~140K threshold, 2-3 cycles at 400K tokens). Scored by LLM-as-judge on a 1–5 scale. Lore's advantage is largest on medium-difficulty questions — mid-session details like decision alternatives, exact error messages, and rejected approaches that compaction summarizes away but Lore's distillation + recall preserves.*
+*Compaction baseline: multi-pass LLM summarization matching Claude Code's auto-compact behavior (~140K threshold). At 400K tokens, compaction only loses a few details — the advantage grows dramatically at larger scales.*

 ### Preference recall (400K tokens)

@@ -351,12 +365,16 @@ At 400K tokens (realistic coding session length), Lore outperforms standard comp

 *Preference recall baselines are from a prior eval run with tail-window (80K). Compaction preference baselines pending re-run.*

-**What this means:** at 400K tokens, Lore scores 4.8/5 on context retention with 12 out of 15 perfect scores — compared to compaction's 4.5/5 with 9 perfect scores. The gap is largest on mid-session details that compaction loses through repeated summarization cycles.
+**What this means:** the longer the session, the bigger Lore's advantage. At 400K tokens, Lore is +7% over compaction. At 2.3M tokens, Lore is +70% — compaction retains less than half the information (2.4/5) while Lore retains 80% (4.0/5). Early-session details that compaction destroys completely (1.8/5) are preserved by Lore's three-tier architecture (4.1/5).

-The eval suite (16 scenarios, 130+ questions, 5 dimensions) is open source in `packages/core/eval/`. Run it yourself:
+The eval suite is open source in `packages/core/eval/`. Run it yourself:

 ```bash
+# 400K inflated scenario
 bun packages/core/eval/run.ts --mode live --inflate 400000
+
+# 2.3M mega-session (real session, no inflation needed)
+bun packages/core/eval/run.ts --mode live --scenarios mega-cli-refactor
 ```

 **Cost:** Lore's memory layer runs at minimal additional cost — background distillation and curation use batch APIs (50% off on supported providers) and cheaper models. Local on-device embeddings (Nomic Embed v1.5) mean zero API cost for vector search. Predictive cache warming reduces expensive cache rebuilds.
@@ -373,7 +391,7 @@ bun packages/core/eval/run.ts --mode live --inflate 400000

 **v5 — behavioral pattern detection + 400K eval.** Vector similarity-based pattern echo detection, action tagging in distillation, cross-session pattern clustering, assertion pinning for long sessions, and a scenario inflator for realistic 400K-token evaluation. This is what closed the preference gap from +15% to +47% over tail-window.

-**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. Context retention: 4.8/5 with 12/15 perfect scores, +7% over compaction at 400K tokens.
+**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. 2.3M-token mega-session eval on a real 5-day coding session: Lore 4.0/5 vs compaction 2.4/5 (+70%), with 13/20 perfect scores vs 5/20.

 ## Development setup

diff --git a/docs/index.html b/docs/index.html
index 0376a99..b80474e 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -928,16 +928,16 @@

- 12/15
- Perfect Scores at 400K Tokens
+ +70%
+ vs Compaction at 2.3M Tokens
- 4.8
- out of 5.0 Detail Retention
+ 13/20
+ Perfect Recall at 2.3M Tokens
- 400K+
- Token Sessions Supported
+ 2.3M+
+ Token Sessions Tested
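
Reviewer note: the `compactionBaseline` change in `packages/core/eval/baselines.ts` that follows packs prefix turns greedily into chunks capped by a token budget before summarizing each chunk. A minimal standalone sketch of that packing step is given here, assuming a simplified `Turn` shape and a chars/3 token estimate (both are hypothetical stand-ins for the repo's `ConversationTurn` and `estimateTokens`, not the actual definitions):

```typescript
// Hypothetical, simplified turn shape for illustration only.
interface Turn {
  text: string;
  tokens?: number; // precomputed token count, if available
}

// Rough token estimate (chars / 3); an assumption for this sketch.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 3);
}

// Greedily pack turns, in order, into chunks whose token totals stay at or
// below maxTokens. A chunk is closed as soon as the next turn would overflow it.
// A single turn larger than maxTokens still gets a chunk of its own.
function chunkTurns(turns: Turn[], maxTokens: number): Turn[][] {
  const chunks: Turn[][] = [];
  let chunk: Turn[] = [];
  let chunkTokens = 0;

  for (const turn of turns) {
    const t = turn.tokens ?? estimateTokens(turn.text);
    if (chunkTokens + t > maxTokens && chunk.length > 0) {
      chunks.push(chunk);
      chunk = [];
      chunkTokens = 0;
    }
    chunk.push(turn);
    chunkTokens += t;
  }
  if (chunk.length > 0) chunks.push(chunk);
  return chunks;
}
```

One consequence worth keeping in mind when reading the diff: because an oversized single turn is never split, the budget is a best-effort cap rather than a strict limit.
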
diff --git a/packages/core/eval/baselines.ts b/packages/core/eval/baselines.ts
index 2e9dc18..6fee58e 100644
--- a/packages/core/eval/baselines.ts
+++ b/packages/core/eval/baselines.ts
@@ -171,29 +171,71 @@ export async function compactionBaseline(
     if (prefix.length === 0) break;
-    // Summarize the prefix via LLM
-    const prefixText = renderConversation(prefix);
-    const userPrompt = COMPACTION_USER_TEMPLATE.replace(
-      "{{conversation}}",
-      prefixText,
-    );
-
-    const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
-      maxTokens: 4096,
-      temperature: 0,
-    });
+    // Summarize the prefix via LLM. If the prefix exceeds the model's
+    // context window, chunk it into segments and summarize each, then
+    // concatenate the summaries.
+    const MAX_CHUNK_TOKENS = 800_000; // leave room for system prompt + output
+    const prefixTokens = totalTokens(prefix);
+    let summaryText: string;
+
+    if (prefixTokens <= MAX_CHUNK_TOKENS) {
+      // Fits in one call
+      const prefixText = renderConversation(prefix);
+      const userPrompt = COMPACTION_USER_TEMPLATE.replace("{{conversation}}", prefixText);
+      const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
+        maxTokens: 4096,
+        temperature: 0,
+      });
+      summaryText = result.text;
+    } else {
+      // Chunk the prefix into segments that fit
+      const chunks: ConversationTurn[][] = [];
+      let chunk: ConversationTurn[] = [];
+      let chunkTokens = 0;
+      for (const turn of prefix) {
+        const t = turn.tokens ?? estimateTokens(renderTurn(turn));
+        if (chunkTokens + t > MAX_CHUNK_TOKENS && chunk.length > 0) {
+          chunks.push(chunk);
+          chunk = [];
+          chunkTokens = 0;
+        }
+        chunk.push(turn);
+        chunkTokens += t;
+      }
+      if (chunk.length > 0) chunks.push(chunk);
+
+      console.log(
+        ` [compaction] prefix too large (${prefixTokens} tok), splitting into ${chunks.length} chunks`,
+      );
+
+      // Summarize each chunk
+      const chunkSummaries: string[] = [];
+      for (let c = 0; c < chunks.length; c++) {
+        const chunkText = renderConversation(chunks[c]);
+        const userPrompt = COMPACTION_USER_TEMPLATE.replace("{{conversation}}", chunkText);
+        const result = await llm.prompt(COMPACTION_SYSTEM, userPrompt, {
+          maxTokens: 4096,
+          temperature: 0,
+        });
+        chunkSummaries.push(result.text);
+        console.log(
+          ` [compaction] chunk ${c + 1}/${chunks.length}: ${totalTokens(chunks[c])} tok → ${estimateTokens(result.text)} tok`,
+        );
+      }
+      summaryText = chunkSummaries.join("\n\n---\n\n");
+    }
     // Replace prefix with a synthetic summary turn + keep tail
     const summaryTurn: ConversationTurn = {
       role: "assistant",
-      content: [{ type: "text", text: `## Compacted Summary (pass ${compactionCount + 1})\n\n${result.text}` }],
-      tokens: estimateTokens(result.text),
+      content: [{ type: "text", text: `## Compacted Summary (pass ${compactionCount + 1})\n\n${summaryText}` }],
+      tokens: estimateTokens(summaryText),
     };
     currentTurns = [summaryTurn, ...tail];
     compactionCount++;
     console.log(
-      ` [compaction] pass ${compactionCount}: ${prefix.length} turns summarized → ${estimateTokens(result.text)} tok, ${currentTurns.length} turns remaining (${totalTokens(currentTurns)} tok)`,
+      ` [compaction] pass ${compactionCount}: ${prefix.length} turns summarized → ${estimateTokens(summaryText)} tok, ${currentTurns.length} turns remaining (${totalTokens(currentTurns)} tok)`,
     );
   }
diff --git a/packages/core/eval/harness.ts b/packages/core/eval/harness.ts
index a3df1de..367500b 100644
--- a/packages/core/eval/harness.ts
+++ b/packages/core/eval/harness.ts
@@ -1117,6 +1117,8 @@ async function loadScenarios(
       case "context": {
         const mod = await import("./scenarios/context-management");
         scenarios.push(...mod.scenarios);
+        const mega = await import("./scenarios/mega-session");
+        scenarios.push(mega.default);
         break;
       }
       case "recall": {
diff --git a/packages/core/eval/scenarios/cli-refactor-session.json.gz b/packages/core/eval/scenarios/cli-refactor-session.json.gz
new file mode 100644
index 0000000..0522c94
Binary files /dev/null and b/packages/core/eval/scenarios/cli-refactor-session.json.gz differ
diff --git a/packages/core/eval/scenarios/mega-session.ts b/packages/core/eval/scenarios/mega-session.ts
new file mode 100644
index 0000000..8194b39
--- /dev/null
+++ b/packages/core/eval/scenarios/mega-session.ts
@@ -0,0 +1,302 @@
+/**
+ * Mega-session eval scenario: Real 2.3M-token getsentry/cli refactoring session.
+ *
+ * Extracted from Lore DB session ses_33198e726ffeDyEZ4ZoowIUDJO.
+ * 5-day session (Mar 8-12, 2026) with 95 user turns, 3959 assistant turns.
+ * Multiple PRs, architectural decisions, multi-phase migration, code reviews.
+ *
+ * No inflation needed — this IS the 2.3M token scenario.
+ */
+import { readFileSync } from "node:fs";
+import { gunzipSync } from "node:zlib";
+import { join } from "node:path";
+import type {
+  ScenarioDefinition,
+  ConversationTurn,
+  EvalQuestion,
+  Dimension,
+} from "../types";
+
+// Load the extracted session turns from compressed JSON fixture
+const fixtureDir = join(import.meta.dir, ".");
+const compressed = readFileSync(join(fixtureDir, "cli-refactor-session.json.gz"));
+const turns: ConversationTurn[] = JSON.parse(gunzipSync(compressed).toString());
+
+const dimension: Dimension = "context";
+const scenarioId = "mega-cli-refactor";
+
+const base = {
+  dimension,
+  scenario: scenarioId,
+  sessionRef: "cli-refactor",
+  rubric: {
+    criteria: [
+      {
+        name: "accuracy",
+        description: "Does the answer correctly match the reference?",
+        scale: {
+          1: "Wrong or fabricated answer" as const,
+          3: "Partially correct — has the right topic but wrong specifics" as const,
+          5: "Exactly matches the reference with correct specifics" as const,
+        },
+      },
+    ],
+    weights: { accuracy: 1.0 },
+  },
+};
+
+// ---------------------------------------------------------------------------
+// Questions targeting various depths of the 2.3M-token session
+// ---------------------------------------------------------------------------
+
+const questions: EvalQuestion[] = [
+  // =========================================================================
+  // EASY — late session (turns 70-95, last ~300K tokens)
+  // Recent work that should be in the raw tail window
+  // =========================================================================
+  {
+    ...base,
+    id: "mega-e1",
+    question: "What was the final phase being worked on at the end of the session?",
+    referenceAnswer:
+      "Phase 6 (and 6b) — removing direct stdout/stderr usage from remaining commands " +
+      "and switching them to the return-based output system. The user also mentioned " +
+      "'auth login' as a command that uses the same architecture but is fundamentally " +
+      "different from list commands.",
+    metadata: { difficulty: "easy", tags: ["late-session", "phase"] },
+  },
+  {
+    ...base,
+    id: "mega-e2",
+    question: "What PR was being reviewed for Bugbot comments near the end of the session?",
+    referenceAnswer:
+      "PR #394 — the user referenced https://github.com/getsentry/cli/pull/394#discussion_r2920036806 " +
+      "with review feedback about streaming output and logger.info() calls.",
+    metadata: { difficulty: "easy", tags: ["late-session", "pr"] },
+  },
+  {
+    ...base,
+    id: "mega-e3",
+    question: "What was the user's instruction about Phase 6 and 7 being marked as 'future'?",
+    referenceAnswer:
+      "The user said Phase 6 and 7 should NOT be 'future' — they should be done " +
+      "once Phase 5 is merged. The user pushed back on deferring these phases.",
+    metadata: { difficulty: "easy", tags: ["late-session", "directive"] },
+  },
+  {
+    ...base,
+    id: "mega-e4",
+    question: "What command did the user repeatedly tell the assistant to run for checking CI failures?",
+    referenceAnswer:
+      "gh run view --log-failed --job $(gh pr checks $PR_NO --json state,link " +
+      "-q '.[] | select(.state == \"FAILURE\").link | split(\"/\")[-1]') — " +
+      "used repeatedly throughout the session after each push.",
+    metadata: { difficulty: "easy", tags: ["pattern", "ci"] },
+  },
+  {
+    ...base,
+    id: "mega-e5",
+    question: "What did the user say about the AGENTS.md file in the PR?",
+    referenceAnswer:
+      "The user said 'The change in AGENTS.md is completely irrelevant. Clean up this " +
+      "file to remove all irrelevant entries.' The AGENTS.md changes were auto-managed " +
+      "and not part of the intended PR.",
+    metadata: { difficulty: "easy", tags: ["mid-session", "directive"] },
+  },
+
+  // =========================================================================
+  // MEDIUM — mid session (turns 30-60, ~500K-1.5M token range)
+  // Architectural decisions and design debates
+  // =========================================================================
+  {
+    ...base,
+    id: "mega-m1",
+    question: "What was the architectural vision for the template-based output system?",
+    referenceAnswer:
+      "Commands become 'data producers' — the framework selects a template " +
+      "(JSON, plain text, rendered markdown) based on flags. Commands describe " +
+      "*what* to output; the framework decides *how*. This was a four-phase " +
+      "convergence plan.",
+    metadata: { difficulty: "medium", tags: ["architecture", "design"] },
+  },
+  {
+    ...base,
+    id: "mega-m2",
+    question: "What was the user's position on tuples vs objects for command return values?",
+    referenceAnswer:
+      "The user asked 'are tuples really cheaper than using a simple object?' and " +
+      "the assistant confirmed. The user accepted tuples but wanted the simplest " +
+      "approach — they said they were fine with {data, footer} or [data, footer], " +
+      "whichever is cheaper and more maintainable. They objected to defining footer " +
+      "as a separate function during declaration as 'too rigid'.",
+    metadata: { difficulty: "medium", tags: ["design-debate", "decision"] },
+  },
+  {
+    ...base,
+    id: "mega-m3",
+    question: "What did the user say about consola's spinner functionality?",
+    referenceAnswer:
+      "The user asked 'Does consola have a spinner helper?' and then 'can we make " +
+      "the spinner use process.stderr or process.stdout internally?' followed by " +
+      "'wait, would that cause issues with our tests?' — showing concern about " +
+      "test compatibility with spinner output.",
+    metadata: { difficulty: "medium", tags: ["mid-session", "consola"] },
+  },
+  {
+    ...base,
+    id: "mega-m4",
+    question: "Why did the user want to remove the --include flag from the api command?",
+    referenceAnswer:
+      "The user asked to check Sentry traces (org: 'sentry', project: 'cli') to " +
+      "see if the -i or --include flag was ever used with api calls. The traces " +
+      "showed it was never used, so the user approved removal.",
+    metadata: { difficulty: "medium", tags: ["decision", "sentry-traces"] },
+  },
+  {
+    ...base,
+    id: "mega-m5",
+    question: "What was PR #373 about?",
+    referenceAnswer:
+      "PR #373 was about the --fields flag for context-window-friendly JSON output. " +
+      "The problem was that every --json command dumped the full object, wasting agent " +
+      "tokens. The --fields flag lets agents request only the specific fields they need.",
+    metadata: { difficulty: "medium", tags: ["pr", "feature"] },
+  },
+  {
+    ...base,
+    id: "mega-m6",
+    question:
+      "What was the user's core frustration with the assistant's approach to stdout/stderr in commands?",
+    referenceAnswer:
+      "The user was frustrated that the assistant kept using direct stdout/stderr " +
+      "writes and manual output in commands instead of the return-based system. " +
+      "The user explicitly said: 'which part of \"do not use stderr or stdout or " +
+      "manual writes there directly, always use return-based output\" you don't " +
+      "understand?' (referring to src/commands/api.ts).",
+    metadata: { difficulty: "medium", tags: ["frustration", "directive"] },
+  },
+  {
+    ...base,
+    id: "mega-m7",
+    question: "What was the user's argument about JSON output consistency?",
+    referenceAnswer:
+      "The user argued: (1) JSON output should be consistent as it's machine-consumed, " +
+      "conditionals make things harder especially with tools like jq. (2) The user " +
+      "suggested the api command output should NOT be conditional on --dry-run (json " +
+      "vs human readable) — it should be consistent. They considered adding --no-json " +
+      "or --json=false for users who want human-readable output from api.",
+    metadata: { difficulty: "medium", tags: ["design", "consistency"] },
+  },
+
+  // =========================================================================
+  // HARD — early session (turns 1-25, first ~500K tokens)
+  // First issue, implementation details, specific code
+  // =========================================================================
+  {
+    ...base,
+    id: "mega-h1",
+    question: "What was the very first issue selected from the open issues list, and why?",
+    referenceAnswer:
+      "Issue #350 — Input hardening against agent hallucinations. It was chosen " +
+      "because of security impact (defense-in-depth against URL injection via " +
+      "org/project slugs interpolated into API paths).",
+    metadata: { difficulty: "hard", tags: ["early-session", "issue-selection"] },
+  },
+  {
+    ...base,
+    id: "mega-h2",
+    question: "What branch name was used for the first issue's implementation?",
+    referenceAnswer: "feat/input-hardening",
+    metadata: { difficulty: "hard", tags: ["early-session", "branch"] },
+  },
+  {
+    ...base,
+    id: "mega-h3",
+    question: "What function was created for input validation in the first PR?",
+    referenceAnswer:
+      "validateResourceId — a function to validate all slug/ID components as they're " +
+      "parsed in arg-parsing.ts. It was part of the input hardening against agent " +
+      "hallucinations (Issue #350).",
+    metadata: { difficulty: "hard", tags: ["early-session", "code"] },
+  },
+  {
+    ...base,
+    id: "mega-h4",
+    question: "What was the second issue worked on after merging PR #370?",
+    referenceAnswer:
+      "Magic @ selectors (@latest, @most_frequent) for issue commands — " +
+      "PR #371 on branch feat/magic-selectors.",
+    metadata: { difficulty: "hard", tags: ["early-session", "issue-sequence"] },
+  },
+  {
+    ...base,
+    id: "mega-h5",
+    question: "How many tests were passing in the full test suite during the first PR?",
+    referenceAnswer:
+      "359 tests passed in the full test suite, plus 19 property tests for " +
+      "the input hardening work specifically.",
+    metadata: { difficulty: "hard", tags: ["early-session", "test-counts"] },
+  },
+  {
+    ...base,
+    id: "mega-h6",
+    question: "What user feedback prompted adding magic selector info to help text?",
+    referenceAnswer:
+      "The user said: 'could we add the magic selector info to sentry issue/sentry " +
+      "issue --help? this could help agents' — and then asked 'any other commands " +
+      "you think we can add this to?'",
+    metadata: { difficulty: "hard", tags: ["early-session", "user-feedback"] },
+  },
+  {
+    ...base,
+    id: "mega-h7",
+    question: "What was the coverage requirement the user enforced across PRs?",
+    referenceAnswer:
+      "Patch coverage above 80%. The user referenced Codecov reports on PRs #370 " +
+      "and #371 specifically, asking to bump the coverage above 80% each time.",
+    metadata: { difficulty: "hard", tags: ["cross-session", "coverage"] },
+  },
+  {
+    ...base,
+    id: "mega-h8",
+    question:
+      "What was the user's reasoning for wanting to remove writeResponseBody() from the codebase?",
+    referenceAnswer:
+      "The user insisted on removing writeResponseBody() and switching to the " +
+      "return-based system. When the assistant suggested leaving it as 'harmless' " +
+      "since it was 'still exported and tested, just not called internally', the " +
+      "user explicitly said 'No remove this.' The principle was: no backward-compat " +
+      "stubs — remove dead code entirely.",
+    metadata: { difficulty: "hard", tags: ["mid-session", "code-cleanup"] },
+  },
+];
+
+// ---------------------------------------------------------------------------
+// Scenario definition
+// ---------------------------------------------------------------------------
+
+const scenario: ScenarioDefinition = {
+  id: scenarioId,
+  dimension,
+  label: "Mega CLI Refactor (2.3M tokens)",
+  description:
+    "Real 5-day getsentry/cli refactoring session — 2.3M tokens, 95 user turns, " +
+    "multiple PRs, architectural decisions, multi-phase migration. Tests recall " +
+    "of specific details across extreme context depths.",
+  sessions: [
+    {
+      id: "cli-refactor",
+      label: "CLI Refactoring Session",
+      projectPath: "/workspace/getsentry-cli",
+      turns,
+      metadata: {
+        totalTokens: 2374811,
+        description: "5-day CLI refactoring: Issue #350 → PRs #370-394+, buildCommand migration",
+      },
+    },
+  ],
+  questions,
+  applicableBaselines: ["lore", "compaction"],
+};
+
+export default scenario;