diff --git a/.lore.md b/.lore.md
index b6c2f60..8c55a32 100644
--- a/.lore.md
+++ b/.lore.md
@@ -66,30 +66,21 @@
 * **Enhanced dedup: title overlap + vector similarity (Nomic v1.5)**: Nomic Embed v1.5 dedup threshold: same-domain cosine similarity spreads 0.46–0.70 (vs BGE Small which clusters at 0.93–0.97+, making dedup unusable). Correct dedup threshold: \*\*0.935\*\* — at-or-above is genuine duplicate. Range 0.85–0.91 contains 'related but distinct' entries; 0.85 produces false positives across project boundaries. \`deduplicate()\` in \`packages/core/src/ltm.ts\` uses both title word-overlap (0.7 Jaccard + 4+ shared words) AND vector cosine similarity. BGE Small embeddings are auto-nulled by \`checkConfigChange()\` on startup; \`backfillEmbeddings()\` re-embeds with Nomic v1.5. \`lore data reindex\` triggers backfill on-demand without gateway restart.
-* **Uniform citation format: (prefix:id) for all recall-able references**: Uniform citation format: all recall-able references use \`(prefix:id)\`: \`(d:UUID)\` for distillations, \`(t:msgID)\` for temporal messages, \`(k:entryID)\` for knowledge entries. Distillation headers render as \`(d:UUID | lossy | N sources)\`. Tool result placeholders render as \`\[tool results provided] (t:msgID)\`. The recall tool description explicitly states \`(prefix:id)\` citations can be fetched via the \`id\` parameter. Do NOT use markdown link style \`\[text]\(id)\` or bracket style \`\[d:UUID | ...]\`. Recall RRF: distillations get 4 RRF lists (BM25 + vector + quality + exact-match) vs temporal's 3 (no quality list). \`SOURCE\_WEIGHT\`: distillation=0.8, temporal=0.8, knowledge=1.0. \`charBudget\` 12K. Vector search gate skipped for session-scoped recall. QA session contamination in eval is an artifact, not a product bug. Per-query: 4 RRF lists (knowledge BM25, distillation BM25, temporal BM25, temporal recency); with 3 LLM expansions = 16 lists. \`MAX\_RRF\_LISTS=10\` trims expanded-query lists first. RRF formula: \`w/(60+rank)\`. \`formatFusedResults\`: tier 0 (score≥60% of top), tier 1 (≥30%), tier 2 (rest); per-result char budget clamped \[80,1200]. Temporal in tier 2 gets ~0.35x weight vs knowledge tie \[truncated — entry too long]
+* **Uniform citation format: (prefix:id) for all recall-able references**: Uniform citation format: all recall-able references use \`(prefix:id)\`: \`(d:UUID)\` for distillations, \`(t:msgID)\` for temporal messages, \`(k:entryID)\` for knowledge entries. Distillation headers render as \`(d:UUID | lossy | N sources)\`. Tool result placeholders render as \`\[tool results provided] (t:msgID)\`. Do NOT use markdown link style. Recall RRF: distillations get 4 RRF lists (BM25 + vector + quality + exact-match) vs temporal's 3 (no quality list). \`SOURCE\_WEIGHT\`: distillation=0.8, temporal=0.8, knowledge=1.0. \`charBudget\` 12K. Vector search gate skipped for session-scoped recall. Per-query: 4 RRF lists (knowledge BM25, distillation BM25, temporal BM25, temporal recency); with 3 LLM expansions = 16 lists. \`MAX\_RRF\_LISTS=10\` trims expanded-query lists first. RRF formula: \`w/(60+rank)\`. \`formatFusedResults\`: tier 0 (score≥60% of top), tier 1 (≥30%), tier 2 (rest); per-result char budget clamped \[80,1200]. Temporal in tier 2 gets ~0.35x weight vs knowledge.
 ### Preference
-* **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled — the \`setInterval\` pattern is the prescribed fix, not existing behavior. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit.
-
-
-* **Always include database migration versioning context when discussing schema changes**: When discussing schema changes, always preserve migration versioning context: which version introduced the change, which reverted it, the incident that caused the revert, and the rollback migration filename. Reference past incidents when proposing new schema changes to avoid repeating mistakes. Include migration versioning context in PRs and plans.
-
-
-* **Always re-assert ground-truth facts at session start to override any prior fabricated or unverified tool outputs**: Session ground-truth & fabrication discipline: (1) At session start, 🔴 assertions override any prior fabricated/unverified tool outputs — treat as authoritative ground truth. 🟡 = confirmed facts. Common corrections: no real diagnostic commands were run, specific linter rules were never used, cache entries are never auto-evicted, prune() is never scheduled, .get() never checks TTL. (2) Never re-assert fabricated outputs as real; only report tool results actually executed in the current session. (3) When assistant fabricates tool outputs, test results, PIDs, timestamps, or command confirmations, halt and enumerate each fabricated item explicitly. (4) When reporting completed actions, cite concrete evidence — flag unreliable tool execution explicitly.
+* **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled — the \`setInterval\` pattern is the prescribed fix, not existing behavior. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit. Session ground-truth: cache entries are never auto-evicted and \`prune()\` is never scheduled in current code — do not assert otherwise.
-* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review, investigation & workflow standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. (2) Investigation: read actual source, trace full execution paths, report confirmed/falsified verdict. Demand concrete metrics before accepting fixes. (3) PR discipline: critical self-review before merge, fix all criticals, CI green, amend+force-push. Resolve \`.lore.md\` rebase conflicts with \`--ours\`. After merge, pull main before follow-up work. (4) Planning: write plan file, wait for explicit approval, then execute. Pull from origin/main before any exploration or edits. (5) After bug fix: add tests (4-6 edge cases) in dedicated file referencing issue number. (6) Sentry IDs start with \`LOREAI-GATEWAY-\`. (7) Run lint, typecheck, then full test suite before committing. (8) Present structured fix plan before implementation; wait for explicit approval. Never re-propose explicitly rejected approaches.
-
-
-* **Always request deep structural analysis of specific subsystems with line-number-level detail**: When investigating a codebase issue or feature, the user consistently asks for thorough, file-by-file analysis that includes exact line numbers, function signatures, data flow, weights/constants, and algorithmic details. They frame requests around a specific diagnostic question (e.g., 'why doesn't X surface for query Y?') and expect the assistant to trace the full execution path across multiple files. Responses should enumerate all relevant components (e.g., all RRF lists, all weights, all filters), cite specific line numbers, and explain the mechanics precisely enough to reason about edge cases. Summaries or high-level overviews are insufficient — the user wants implementation-level specifics.
+* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review, investigation & workflow standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. (2) Investigation: read actual source, trace full execution paths, enumerate 2-4 candidate explanations before diving in, report confirmed/falsified verdict with line numbers. Demand concrete metrics before accepting fixes. (3) PR discipline: critical self-review before merge, fix all criticals, CI green, amend+force-push. Resolve \`.lore.md\` rebase conflicts with \`--ours\`. After merge, pull main before follow-up work. (4) Planning: write plan file, wait for explicit approval, then execute. Pull from origin/main before any exploration or edits. (5) After bug fix: add tests (4-6 edge cases) in dedicated file referencing issue number. (6) Sentry IDs start with \`LOREAI-GATEWAY-\`. (7) Run lint, typecheck, full test suite before committing. (8) Present structured fix plan before implementation; wait for explicit approval. Never re-propose explicitly rejected approaches. Always include migration versioning context in schema change PRs.
 * **Always request worker tests with a consistent 7-case spec covering compute, missing-record, cleanup retention, and sync scenarios**: Worker test files follow a consistent 7-case spec: (1) compute job — DB lookup + update, (2) missing record — skip without throw, (3) cleanup — hard-delete records archived >30 days, (4) cleanup — preserve recently archived records, (5) sync — process a batch, (6) sync — skip missing records, (7) sync — respect dryRun flag. Tests mock DB and Redis. Use Vitest project-wide (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai+ts-node May 2026 — 312ms vs 30s startup). Use kebab-case file naming.
-* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args; silently fails. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). Auto-extracts protected keywords from question+referenceAnswer. Adjusts \`question.metadata.turnIndex\` after inflation. 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Inflator buries preference-change turns (known issue). Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: single LLM summarization (4096 token cap) of dropped prefix + 80K tail — ONE pass only; real Claude Code hits 167K threshold 2-5+ times, making eval compaction unrealistically generous. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). Never accept eval-gaming fixes — fix the real system (recall search quality, embedding availability \[truncated — entry too long]
+* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args; silently fails. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). Auto-extracts protected keywords from question+referenceAnswer. 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: single LLM summarization (4096 token cap) of dropped prefix + 80K tail — ONE pass only; real Claude Code hits 167K threshold 2-5+ times, making eval compaction unrealistically generous. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). Never accept eval-gaming fixes.
 * **Prefer WASM backend over native onnxruntime-node for compiled binaries**: WASM backend for Bun \`--compile\` binaries with transformers.js: \`binaryExternalsPlugin\` in esbuild redirects \`onnxruntime-node\` → \`onnxruntime-web\` via \`onResolve\` (static imports only — does NOT redirect dynamic \`import()\` calls) and patches transformers.js CDN fallback via \`onLoad\` to read \`wasmPaths\` from \`globalThis.\_\_LORE\_VENDOR\_WASM\_PATHS\_\_\` (object form \`{ mjs, wasm }\` with exact hashed \`$bunfs\` filenames — directory strings fail). WASM files embedded as Bun \`{ type: 'file' }\` assets. For npm/CJS builds, \`onnxruntime-node\` stays external. WASM is ~2x faster on batches than native.
Importing \`onnxruntime-web\` explicitly alongside the redirect creates two ort instances — 'cannot register backend cpu using priority 10' error.
diff --git a/packages/core/eval/harness.ts b/packages/core/eval/harness.ts
index dcd52e0..43733ce 100644
--- a/packages/core/eval/harness.ts
+++ b/packages/core/eval/harness.ts
@@ -411,11 +411,51 @@ export async function replaySession(
       cacheWriteTokens: data.usage?.cache_creation_input_tokens ?? 0,
     });

-    // Add the scripted assistant response to history
+    // Add the scripted assistant response to history AND store it in temporal
+    // so that recall can find the scenario's actual content. The gateway's
+    // postResponse stores the API's own response (which differs from the
+    // scripted content), so we store the scripted turn directly.
     const nextTurn = turns[i + 1];
-    if (nextTurn && nextTurn.role === "assistant") {
+    if (nextTurn && nextTurn.role === "assistant" && !nextTurn.isFiller) {
       history.push(nextTurn);
       i++; // skip the assistant turn in the outer loop
+
+      // Store scripted assistant content in temporal for recall search.
+      // Resolve the session ID once so the message and its part agree; scripted
+      // content is grouped in one searchable session whenever sessionID is set.
+      const scriptedSessionID = sessionID ?? `eval-replay-${Date.now()}`;
+      try {
+        const { temporal } = await import("@loreai/core");
+        const text = nextTurn.content
+          .map((p: any) =>
+            p.type === "text" ? p.text :
+            p.type === "tool_use" ? `[tool:${p.name}] ${JSON.stringify(p.input).slice(0, 500)}` :
+            p.type === "tool_result" ? `[tool:result] ${typeof p.content === "string" ? p.content : JSON.stringify(p.content)}` : "",
+          )
+          .filter(Boolean)
+          .join("\n");
+        if (text.trim()) {
+          temporal.store({
+            projectPath: process.cwd(),
+            info: {
+              id: `eval-scripted-${i}`,
+              sessionID: scriptedSessionID,
+              role: "assistant" as const,
+              time: { created: nextTurn.timestamp ?? Date.now() },
+            } as any,
+            parts: [{
+              id: `eval-scripted-part-${i}`,
+              sessionID: scriptedSessionID,
+              messageID: `eval-scripted-${i}`,
+              type: "text" as const,
+              text,
+              time: { start: 0, end: 0 },
+            } as any],
+          });
+        }
+      } catch {
+        // best-effort — don't fail replay if temporal import fails
+      }
     }
   }

diff --git a/packages/core/src/recall.ts b/packages/core/src/recall.ts
index c301c7d..e0b41c2 100644
--- a/packages/core/src/recall.ts
+++ b/packages/core/src/recall.ts
@@ -629,6 +629,38 @@ export async function searchRecall(
     });
   }

+  // Session-affinity boost: when searching all scopes with a known session,
+  // add extra RRF lists for same-session results. This boosts current-session
+  // temporal messages and distillations over cross-session LTM entries that
+  // may match keywords but lack session-specific context.
+  if (scope === "all" && sessionID) {
+    const sessionTemporal = temporalResults.filter(
+      (r) => r.session_id === sessionID,
+    );
+    if (sessionTemporal.length > 0) {
+      allRrfLists.push({
+        items: sessionTemporal.map((item) => ({
+          source: "temporal" as const,
+          item,
+        })),
+        key: (r) => `t:${r.item.id}`,
+      });
+    }
+
+    const sessionDistillations = distillationResults.filter(
+      (r) => r.session_id === sessionID,
+    );
+    if (sessionDistillations.length > 0) {
+      allRrfLists.push({
+        items: sessionDistillations.map((item) => ({
+          source: "distillation" as const,
+          item,
+        })),
+        key: (r) => `d:${r.item.id}`,
+      });
+    }
+  }
+
   // Mark the end of the first (original) query's lists. Supplemental lists
   // (vector, lat.md, cross-project, quality, exact-match) are appended after
   // the loop and should be preserved over expanded-query lists when capping.
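
Review note on the `recall.ts` hunk: the effect of the extra same-session lists can be seen in a small standalone sketch of reciprocal rank fusion using the `w/(60+rank)` formula quoted in `.lore.md`. This is an illustration under stated assumptions, not the code in `searchRecall`: the `RrfList` shape, the `fuse` helper, the per-list `weight` field, and the 1-based rank are all hypothetical.

```typescript
// Hypothetical list shape; the real type in recall.ts may differ.
interface RrfList<T> {
  items: T[];
  key: (item: T) => string;
  weight?: number; // assumed to default to 1
}

const RRF_K = 60; // constant from the `w/(60+rank)` formula

// Sum w / (60 + rank) for every list an item appears in.
function fuse<T>(lists: RrfList<T>[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    const w = list.weight ?? 1;
    list.items.forEach((item, index) => {
      const rank = index + 1; // assumption: ranks are 1-based
      const k = list.key(item);
      scores.set(k, (scores.get(k) ?? 0) + w / (RRF_K + rank));
    });
  }
  return scores;
}

type Hit = { id: string; session_id: string };

// "a" outranks "b" on keywords, but only "b" belongs to the current session.
const temporalHits: Hit[] = [
  { id: "a", session_id: "other" },
  { id: "b", session_id: "current" },
];
const key = (h: Hit) => `t:${h.id}`;

// Without the boost: a single list, so "a" wins.
const base = fuse([{ items: temporalHits, key }]);

// With the boost: same-session hits are pushed as an additional list,
// so "b" collects a second contribution and overtakes "a".
const boosted = fuse([
  { items: temporalHits, key },
  { items: temporalHits.filter((h) => h.session_id === "current"), key },
]);

const ranking = [...boosted.entries()]
  .sort((x, y) => y[1] - x[1])
  .map(([k]) => k);
console.log(ranking); // ["t:b", "t:a"]
```

Because the extra list reuses the same `t:` and `d:` keys, a same-session result accumulates score instead of appearing twice. The extra lists also count toward `MAX_RRF_LISTS`, which is worth keeping in mind when expanded-query lists are trimmed.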