
fix(mcp): correct BM25 IDF, expose search score, harden memory.list#27

Open
2233admin wants to merge 1 commit into main from fix/phase8-bm25-memory-bugs
@2233admin

Owner

fix(mcp): correct BM25 IDF, expose search score, harden memory.list

Summary

Three Phase 8 bugs found while smoke-testing the new MCP tools (holon.search, memory.*, graph.export) against the bundled sample vault. Phase 8 ships in commit 01cfd2d (feat(phase8): graph export, vault write-back, persistent memory, BM25 search) and was the only release with no test coverage for any of the new features. This PR fixes the three most user-visible defects and adds regression tests.

Bugs fixed

1. BM25 IDF formula was wrong — mcp-server/src/holons/holon.ts

Before:

const idf = Math.log((1 + avgLen) / freq + 1);

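avgLen is the average tokenized document length, and freq is the term's count within the document being scored. Neither is what IDF needs, namely corpus size N and per-term document frequency df. The result was a non-standard BM25 variant whose ranking was an artefact of document lengths rather than term discriminative power; because the weight shrinks as freq grows, it also partly cancelled the TF component. Scores were not comparable across corpora of different sizes.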

After:

const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);

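Standard BM25 IDF with Laplace smoothing; the + 1 inside the log keeps the value non-negative even when a term appears in every document. df is the per-term document frequency, computed as docs.filter(d => d.includes(term)).length over the tokenized documents, so includes matches whole tokens rather than substrings. The TF component was correct and is unchanged.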
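As a rough illustration of the difference between the two formulas, here is a minimal sketch. The helper names buggyIdf and standardIdf are hypothetical and do not appear in holon.ts; only the two expressions are taken from this PR.

```typescript
// Illustrative only: `buggyIdf` and `standardIdf` are hypothetical helper names.

/** Pre-fix expression: depends on average doc length and the in-document term count. */
function buggyIdf(avgLen: number, freq: number): number {
  return Math.log((1 + avgLen) / freq + 1);
}

/** Post-fix expression: depends only on corpus size N and document frequency df. */
function standardIdf(N: number, df: number): number {
  return Math.log((N - df + 0.5) / (df + 0.5) + 1);
}

// A term found in 1 of 100 docs should weigh more than one found in 90 of 100.
const rare = standardIdf(100, 1);
const common = standardIdf(100, 90);
console.log(rare > common); // true

// The + 1 inside the log keeps the weight positive even when df === N.
console.log(standardIdf(10, 10) > 0); // true

// The pre-fix expression never sees df, and it shrinks as the in-document count grows.
console.log(buggyIdf(20, 1) > buggyIdf(20, 3)); // true
```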

2. holon.search dropped the score field — mcp-server/src/holons/holon.ts

Before:

return { holons: scored.slice(0, limit).map(x => x.h), ... };

The score value was computed (used internally for sorting) then thrown away before returning. Callers could not see the ranking signal.

After:

return { holons: scored.slice(0, limit).map(x => ({ ...x.h, score: x.score })), ... };

score is now attached to each returned holon in bm25 and hybrid modes. substring mode never computed a score, so its responses are unchanged: no score field is added, which the test "substring mode does not attach score field" verifies.
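A small sketch of the response shaping, under assumptions: the Holon and Scored interfaces and the topHits helper are hypothetical stand-ins, and the real holon type in holon.ts likely carries more fields. The sort step is included only to make the example self-contained.

```typescript
// Hypothetical shapes for illustration; not the actual types from holon.ts.
interface Holon { id: string; title: string; }
interface Scored { h: Holon; score: number; }

/** Sort by descending score, keep the top `limit`, and copy score onto each holon. */
function topHits(scored: Scored[], limit: number): Array<Holon & { score: number }> {
  return [...scored]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(x => ({ ...x.h, score: x.score }));
}

const hits = topHits(
  [
    { h: { id: 'a', title: 'Attention' }, score: 0.2 },
    { h: { id: 'b', title: 'Rotary embeddings' }, score: 1.5 },
  ],
  1,
);
console.log(hits); // [ { id: 'b', title: 'Rotary embeddings', score: 1.5 } ]
```

Spreading x.h into a new object leaves the stored holon untouched, so adding score to the response does not mutate the underlying data.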

3. memory.list crashed on non-string values — mcp-server/src/memory/memory.ts

Before:

preview: e.value.slice(0, 120),

Schema declares value: { type: 'string', required: true }, but a hand-edited _ai_memory.json or an older writer could produce numeric / boolean / object values. memory.list would throw TypeError: e.value.slice is not a function.

After:

preview: typeof e.value === 'string'
  ? e.value.slice(0, 120)
  : String(e.value).slice(0, 120),

Defensive coercion. Behaviour for the documented string case is identical.
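To show what the coerced previews look like for the value types mentioned above, here is a minimal sketch. The previewOf helper is a hypothetical name wrapping the same expression; it is not a function in memory.ts.

```typescript
// Hypothetical helper wrapping the expression from this PR.
function previewOf(value: unknown, max = 120): string {
  return typeof value === 'string'
    ? value.slice(0, max)
    : String(value).slice(0, max);
}

console.log(previewOf('hello'));          // hello
console.log(previewOf(42));               // 42
console.log(previewOf(true));             // true
console.log(previewOf(null));             // null
console.log(previewOf({ a: 1 }));         // [object Object]
console.log(previewOf('x'.repeat(200)).length); // 120
```

Note that String() renders plain objects as "[object Object]", so the preview no longer crashes but is not informative for object values. Whether to use something like JSON.stringify there instead is a possible follow-up, not part of this PR.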

Tests added

  • mcp-server/src/holons/holons.test.ts: +4 tests for holon.search
    • bm25 mode returns hits ordered by score, not insertion order (regression: score field stripped)
    • bm25 mode ranks doc-with-higher-tf higher than doc-with-lower-tf (regression: IDF used avgLen instead of N/df; ranking would invert for the high-tf/low-tf pair under the broken formula)
    • hybrid mode union: BM25 matches first, then substring-only matches
    • substring mode does not attach score field (negative — guards against the score being added to the wrong code path)
  • mcp-server/src/memory/memory.test.ts (new file): +11 tests
    • set then get round-trips a string value with tags
    • set preserves created_at on update, bumps updated_at
    • returns all entries with key, tags, preview, updated_at
    • does not crash when stored value is a number (regression for the .slice crash)
    • does not crash when stored value is a boolean
    • does not crash when stored value is null
    • does not crash when stored value is an object
    • truncates long string previews to 120 chars
    • deletes existing key and returns ok=true
    • returns ok=false for unknown key without throwing
    • writes _ai_memory.json into the vault root, not elsewhere

Test plan

  • cd mcp-server && npm ci && npm run build && npm test — 228/228 pass (40 suites), 1.6s
  • cd compiler && python -m pytest tests/ -v — 102/102 pass (was 102/102 before this PR; no Python changes)
  • End-to-end smoke against the bundled tests/fixtures/sample-vault/: python -m compiler produces context-core.json, then BM25 search returns 3 hits for "attention" with score values now exposed (previously score: undefined)

Out of scope

  • graph.export is not covered by tests in this PR. The Phase 8 commit added the tool but no test; this PR does not change graph.export code. Adding tests for it is a follow-up.
  • mcp-server/src/compile-trigger.test.ts already imports memory indirectly; no test churn there.

Risk

Low. All three changes are local to the operation handlers and the behaviour for documented input is preserved. BM25 ranking changes only for queries where the buggy IDF and the standard IDF disagree on relative order; the standard formula is the canonical one used in IR literature. memory.list is strictly more permissive after the fix.

Notes for reviewer

The BM25 test "bm25 mode ranks doc-with-higher-tf higher than doc-with-lower-tf" constructs a synthetic 2-doc corpus and checks the ranking produced by the standard formula. The buggy formula divided by freq, the term's count in the document being scored, so its IDF shrank as term frequency grew and the low-tf doc was ranked higher. The test therefore fails on the old code and passes on the new code (verified locally).
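The inversion can be reproduced with a small standalone sketch. The two documents below are made up for illustration and are not the fixture used in holons.test.ts; the score helper and IdfFn type are likewise hypothetical. K1, B, and both IDF expressions are taken from this PR.

```typescript
const K1 = 1.5, B = 0.75;

// Hypothetical signature so both IDF variants can be swapped into one scorer.
type IdfFn = (ctx: { N: number; df: number; avgLen: number; freq: number }) => number;

const buggyIdf: IdfFn = ({ avgLen, freq }) => Math.log((1 + avgLen) / freq + 1);
const standardIdf: IdfFn = ({ N, df }) => Math.log((N - df + 0.5) / (df + 0.5) + 1);

/** Single-term BM25 score for `doc` within `docs`, using the given IDF variant. */
function score(docs: string[][], doc: string[], term: string, idfFn: IdfFn): number {
  const N = docs.length;
  const avgLen = docs.reduce((s, d) => s + d.length, 0) / (N || 1);
  const df = docs.filter(d => d.includes(term)).length;
  const freq = doc.filter(t => t === term).length;
  if (df === 0 || freq === 0) return 0;
  const tf = (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * doc.length / avgLen));
  return idfFn({ N, df, avgLen, freq }) * tf;
}

// Made-up two-doc corpus: equal lengths, so only term frequency differs.
const highTf = ['attention', 'attention', 'attention', 'heads'];
const lowTf = ['attention', 'layers', 'norm', 'residual'];
const corpus = [highTf, lowTf];

const buggyInverts =
  score(corpus, lowTf, 'attention', buggyIdf) > score(corpus, highTf, 'attention', buggyIdf);
const fixedCorrect =
  score(corpus, highTf, 'attention', standardIdf) > score(corpus, lowTf, 'attention', standardIdf);

console.log({ buggyInverts, fixedCorrect }); // { buggyInverts: true, fixedCorrect: true }
```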


Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three Phase 8 bugs found while smoke-testing the new MCP tools against
the sample vault:

1. BM25 IDF formula was wrong.
   `idf = log((1 + avgLen) / freq + 1)` used average doc length
   (avgLen, a per-corpus length statistic) in place of corpus size N
   and document frequency df. The result was a non-standard BM25
   variant whose ranking was an artefact of doc lengths, not
   discriminative term power. Across corpora of different sizes the
   relative ordering is incomparable.

   Switched to the standard BM25 IDF with Laplace smoothing:
   `idf = log((N - df + 0.5) / (df + 0.5) + 1)`, computed per-term
   using `docs.filter(d => d.includes(term)).length` for df.

2. BM25/hybrid responses dropped the score field.
   `return { holons: scored.slice(...).map(x => x.h), ... }` threw
   away the computed `score` value before returning, so callers
   could not see the ranking signal. Now `map(x => ({ ...x.h, score:
   x.score }))` propagates score onto each returned holon. Substring
   mode does not compute a score, so its responses are unchanged.

3. memory.list crashed on non-string values.
   `preview: e.value.slice(0, 120)` assumed the persisted value was a
   string. Hand-edited `_ai_memory.json` files or older writers could
   produce numeric/boolean/object values and crash `memory.list`.
   Coerce defensively: `typeof e.value === 'string' ? e.value.slice(0,
   120) : String(e.value).slice(0, 120)`.

Tests: +15 (228/228 pass, 40 suites). The Phase 8 features (BM25
substring/bm25/hybrid, memory.set/get/list/forget, graph.export) had
no test coverage prior to this commit; that gap is now closed for
BM25 search and memory.*. graph.export remains untested (not changed
by this PR).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@repowise-bot

repowise-bot Bot commented Jun 21, 2026


✅ Health: 7.1 → 7.3 (+0.1)

🚨 Change risk: 9.0/10 (high)
This change's risk is driven by:

  • more lines added than baseline
  • more scattered than baseline

Last updated 2026-06-21 08:24 UTC

@coderabbitai

coderabbitai Bot commented Jun 21, 2026


Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Search results now display numeric scores in BM25 and hybrid modes for improved result ranking visibility.
  • Bug Fixes

    • Fixed memory list operation to handle non-string values without errors.
  • Tests

    • Added test coverage for search scoring behavior across all modes.
    • Added comprehensive test suite for memory operations and file persistence.

Walkthrough

Updates the BM25 scoring formula in holon.search to use a document-frequency-based smoothed IDF, exposes per-hit score in bm25 and hybrid mode responses, and adds Phase 8 regression tests. Separately, fixes memory.list to coerce non-string stored values to strings before generating preview snippets, and adds a comprehensive memory operations test suite.

Changes

BM25 Scoring and Score Exposure

| Layer / File(s) | Summary |
| --- | --- |
| BM25 IDF formula and score in responses (mcp-server/src/holons/holon.ts) | Replaces the old IDF formula with a document-frequency-based Laplace-smoothed variant, skips terms with df=0, and updates both bm25 and hybrid return payloads to embed a score field on each returned holon. |
| BM25/hybrid/default search scoring tests (mcp-server/src/holons/holons.test.ts) | Adds four regression tests: bm25 score ordering and positivity, TF/IDF ranking with a sparse corpus fixture, hybrid union ordering matching the bm25 tier, and absence of score in default substring mode. |

Memory List Non-String Value Fix

| Layer / File(s) | Summary |
| --- | --- |
| memory.list preview coercion fix and full memory tests (mcp-server/src/memory/memory.ts, mcp-server/src/memory/memory.test.ts) | memory.list now calls String(e.value) before slicing when e.value is not a string. New test suite scaffolds an isolated temp vault and covers memory.set/get round-trips, update timestamp semantics, list output structure, non-string preview coercion regression, truncation to 120 chars, memory.forget deletion and not-found behavior, and _ai_memory.json file persistence. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hoppity-hop through the index I go,
IDF smoothed with a Laplace glow,
Each holon gets a score to show,
Strings and non-strings, previews in a row,
The memory vault is tidy below! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title concisely summarizes all three main changes: BM25 IDF correction, score field exposure, and memory.list hardening, matching the primary changes in the changeset. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the three bugs fixed, the corrections made, tests added, and risk assessment. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the BM25 search algorithm to use a standard IDF formula with Laplace smoothing, exposes search scores in the results, and adds defensive type coercion to memory previews to prevent crashes on non-string values. It also introduces comprehensive unit tests for both components. The reviewer identified a high-severity performance bottleneck in the BM25 search where document frequency is calculated repeatedly inside the scoring loop, resulting in O(N^2) complexity, and provided a code suggestion to precompute these frequencies to optimize the search to O(N) complexity.


Comment on lines +81 to 96
const N = docs.length;
const avgLen = docs.reduce((s, d) => s + d.length, 0) / (N || 1);
const K1 = 1.5, B = 0.75;

const bm25Score = (docTokens: string[]) => {
const dl = docTokens.length;
return terms.reduce((sum, term) => {
+ const df = docs.filter(d => d.includes(term)).length;
+ if (df === 0) return sum;
  const freq = docTokens.filter(t => t === term).length;
  if (freq === 0) return sum;
- const idf = Math.log((1 + avgLen) / freq + 1);
+ // Standard BM25 IDF with Laplace smoothing (+1 inside log to keep it >= 0).
+ const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
const tf = (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * dl / avgLen));
return sum + idf * tf;
}, 0);


high

Performance Bottleneck: $O(N^2)$ Complexity in BM25 Search

In the current implementation, docs.filter(d => d.includes(term)).length is evaluated inside bm25Score for every single document in the corpus. This results in an $O(N^2 \cdot M)$ complexity (where $N$ is the number of documents and $M$ is the number of query terms), which can cause severe CPU starvation and block the single-threaded Node.js event loop when searching over larger vaults.

Solution

Precompute the document frequency (df) for each query term once outside the bm25Score function. This reduces the complexity to $O(N \cdot M)$, making the search highly efficient and scalable.

        const N        = docs.length;
        const avgLen   = docs.reduce((s, d) => s + d.length, 0) / (N || 1);
        const K1 = 1.5, B = 0.75;

        const dfMap = new Map<string, number>();
        for (const term of terms) {
          dfMap.set(term, docs.filter(d => d.includes(term)).length);
        }

        const bm25Score = (docTokens: string[]) => {
          const dl = docTokens.length;
          return terms.reduce((sum, term) => {
            const df   = dfMap.get(term) ?? 0;
            if (df === 0) return sum;
            const freq = docTokens.filter(t => t === term).length;
            if (freq === 0) return sum;
            // Standard BM25 IDF with Laplace smoothing (+1 inside log to keep it >= 0).
            const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
            const tf  = (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * dl / avgLen));
            return sum + idf * tf;
          }, 0);
        };

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
mcp-server/src/holons/holons.test.ts (1)

232-250: ⚡ Quick win

Add a true substring-only hybrid fixture.

This test claims to cover “BM25 matches first, then substring-only matches,” but Line 233 says rope matches both returned docs by BM25 and substring. Add a fixture with one BM25-only match and one substring-only match so the union and intended precedence are actually protected.
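One way to reason about which fixtures can land in each tier is sketched below. It rests on assumptions that are not confirmed by this PR: the tokenize function here is a hypothetical stand-in for the real one in holon.ts, BM25 is assumed to match when any query token equals a document token, and substring mode is assumed to look for the whole lowercased query inside the document text.

```typescript
// Hypothetical tokenizer; the real `tokenize` in holon.ts may behave differently.
const tokenize = (s: string): string[] =>
  s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Assumed BM25 match rule: at least one query token equals a document token.
const tokenMatch = (text: string, query: string): boolean => {
  const docTokens = tokenize(text);
  return tokenize(query).some(t => docTokens.includes(t));
};

// Assumed substring rule: the whole query appears inside the document text.
const substringMatch = (text: string, query: string): boolean =>
  text.toLowerCase().includes(query.toLowerCase());

type Tier = 'both' | 'bm25-only' | 'substring-only' | 'none';

function classify(text: string, query: string): Tier {
  const t = tokenMatch(text, query);
  const s = substringMatch(text, query);
  if (t && s) return 'both';
  if (t) return 'bm25-only';
  if (s) return 'substring-only';
  return 'none';
}

console.log(classify('RoPE positional encoding', 'rope')); // both
console.log(classify('Tightrope walking basics', 'rope')); // substring-only
console.log(classify('Scaling laws', 'rope scaling'));     // bm25-only
```

Under these assumptions a single-token query such as rope cannot produce a BM25-only hit, because any exact token match is also a substring match; a multi-term query is needed for that tier.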

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/holons/holons.test.ts` around lines 232 - 250, The test claims
to validate "BM25 matches first, then substring-only matches" but the current
fixture only contains documents that match via both BM25 and substring search,
so the union and precedence logic are not actually being tested. Add at least
two new fixture documents to the test setup: one that matches the query 'rope'
via BM25 scoring only (not substring), and another that matches via substring
only (not BM25). Then add assertions to the test to verify that when running
hybrid mode search on 'rope', the BM25-only match appears before the
substring-only match in the results, confirming the intended precedence
behavior.
mcp-server/src/holons/holon.ts (1)

85-93: ⚡ Quick win

Precompute document frequencies once per query.

bm25Score recomputes df by scanning every document for each term and each holon, making search scale quadratically with corpus size. Since df is query-global, cache it before scoring.

♻️ Proposed refactor
         const docs     = cc.holons.map(h => tokenize(`${h.title} ${h.summary}`));
         const N        = docs.length;
         const avgLen   = docs.reduce((s, d) => s + d.length, 0) / (N || 1);
         const K1 = 1.5, B = 0.75;
+        const dfByTerm = new Map<string, number>();
+        for (const term of new Set(terms)) {
+          dfByTerm.set(term, docs.reduce((count, d) => count + (d.includes(term) ? 1 : 0), 0));
+        }
 
         const bm25Score = (docTokens: string[]) => {
           const dl = docTokens.length;
           return terms.reduce((sum, term) => {
-            const df   = docs.filter(d => d.includes(term)).length;
+            const df   = dfByTerm.get(term) ?? 0;
             if (df === 0) return sum;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/holons/holon.ts` around lines 85 - 93, The bm25Score function
is recalculating document frequency (df) for each term every time it's invoked
(once per document scored), causing quadratic scaling with corpus size. Since df
is query-global and identical for all documents in the same query, precompute it
once before iterating through documents to score. Create a Map or similar
structure outside the bm25Score function that caches the document frequency for
each term in the current query, then access this precomputed cache inside
bm25Score instead of filtering the docs array repeatedly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7b705f5a-3b57-4ce2-a53c-ec5003ce68ee

📥 Commits

Reviewing files that changed from the base of the PR and between 0d46d6a and c363efc.

📒 Files selected for processing (4)
  • mcp-server/src/holons/holon.ts
  • mcp-server/src/holons/holons.test.ts
  • mcp-server/src/memory/memory.test.ts
  • mcp-server/src/memory/memory.ts
