fix(mcp): correct BM25 IDF, expose search score, harden memory.list#27
fix(mcp): correct BM25 IDF, expose search score, harden memory.list#272233admin wants to merge 1 commit into
Conversation
Three Phase 8 bugs found while smoke-testing the new MCP tools against
the sample vault:
1. BM25 IDF formula was wrong.
`idf = log((1 + avgLen) / freq + 1)` used average doc length
(avgLen, a per-corpus length statistic) in place of corpus size N
and document frequency df. The result was a non-standard BM25
variant whose ranking was an artefact of doc lengths, not
discriminative term power. Across corpora of different sizes the
relative ordering is incomparable.
Switched to the standard BM25 IDF with Laplace smoothing:
`idf = log((N - df + 0.5) / (df + 0.5) + 1)`, computed per-term
using `docs.filter(d => d.includes(term)).length` for df.
2. BM25/hybrid responses dropped the score field.
`return { holons: scored.slice(...).map(x => x.h), ... }` threw
away the computed `score` value before returning, so callers
could not see the ranking signal. Now `map(x => ({ ...x.h, score:
x.score }))` propagates score onto each returned holon. Substring
mode does not compute a score, so its responses are unchanged.
3. memory.list crashed on non-string values.
`preview: e.value.slice(0, 120)` assumed the persisted value was a
string. Hand-edited `_ai_memory.json` files or older writers could
produce numeric/boolean/object values and crash `memory.list`.
Coerce defensively: `typeof e.value === 'string' ? e.value.slice(0,
120) : String(e.value).slice(0, 120)`.
Tests: +15 (228/228 pass, 40 suites). The Phase 8 features (BM25
substring/bm25/hybrid, memory.set/get/list/forget, graph.export) had
no test coverage prior to this commit; that gap is now closed for
BM25 search and memory.*. graph.export remains untested (not changed
by this PR).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
✅ Health: 7.1 → 7.3 (+0.1) 🚨 Change risk: 9.0/10 (high)
📊 Full report · ⭐ Star Repowise · 📥 Install bot · Last updated 2026-06-21 08:24 UTC |
📝 WalkthroughSummary by CodeRabbit
WalkthroughUpdates the BM25 scoring formula in ChangesBM25 Scoring and Score Exposure
Memory List Non-String Value Fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the BM25 search algorithm to use a standard IDF formula with Laplace smoothing, exposes search scores in the results, and adds defensive type coercion to memory previews to prevent crashes on non-string values. It also introduces comprehensive unit tests for both components. The reviewer identified a high-severity performance bottleneck in the BM25 search where document frequency is calculated repeatedly inside the scoring loop, resulting in O(N^2) complexity, and provided a code suggestion to precompute these frequencies to optimize the search to O(N) complexity.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const N = docs.length; | ||
| const avgLen = docs.reduce((s, d) => s + d.length, 0) / (N || 1); | ||
| const K1 = 1.5, B = 0.75; | ||
|
|
||
| const bm25Score = (docTokens: string[]) => { | ||
| const dl = docTokens.length; | ||
| return terms.reduce((sum, term) => { | ||
| const df = docs.filter(d => d.includes(term)).length; | ||
| if (df === 0) return sum; | ||
| const freq = docTokens.filter(t => t === term).length; | ||
| if (freq === 0) return sum; | ||
| const idf = Math.log((1 + avgLen) / freq + 1); | ||
| // Standard BM25 IDF with Laplace smoothing (+1 inside log to keep it >= 0). | ||
| const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1); | ||
| const tf = (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * dl / avgLen)); | ||
| return sum + idf * tf; | ||
| }, 0); |
There was a problem hiding this comment.
Performance Bottleneck: $O(N^2)$ Complexity in BM25 Search
In the current implementation, docs.filter(d => d.includes(term)).length is evaluated inside bm25Score for every single document in the corpus. This results in an
Solution
Precompute the document frequency (df) for each query term once outside the bm25Score function. This reduces the complexity to
const N = docs.length;
const avgLen = docs.reduce((s, d) => s + d.length, 0) / (N || 1);
const K1 = 1.5, B = 0.75;
const dfMap = new Map<string, number>();
for (const term of terms) {
dfMap.set(term, docs.filter(d => d.includes(term)).length);
}
const bm25Score = (docTokens: string[]) => {
const dl = docTokens.length;
return terms.reduce((sum, term) => {
const df = dfMap.get(term) ?? 0;
if (df === 0) return sum;
const freq = docTokens.filter(t => t === term).length;
if (freq === 0) return sum;
// Standard BM25 IDF with Laplace smoothing (+1 inside log to keep it >= 0).
const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
const tf = (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * dl / avgLen));
return sum + idf * tf;
}, 0);
};There was a problem hiding this comment.
🧹 Nitpick comments (2)
mcp-server/src/holons/holons.test.ts (1)
232-250: ⚡ Quick winAdd a true substring-only hybrid fixture.
This test claims to cover “BM25 matches first, then substring-only matches,” but Line 233 says
ropematches both returned docs by BM25 and substring. Add a fixture with one BM25-only match and one substring-only match so the union and intended precedence are actually protected.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/holons/holons.test.ts` around lines 232 - 250, The test claims to validate "BM25 matches first, then substring-only matches" but the current fixture only contains documents that match via both BM25 and substring search, so the union and precedence logic are not actually being tested. Add at least two new fixture documents to the test setup: one that matches the query 'rope' via BM25 scoring only (not substring), and another that matches via substring only (not BM25). Then add assertions to the test to verify that when running hybrid mode search on 'rope', the BM25-only match appears before the substring-only match in the results, confirming the intended precedence behavior.mcp-server/src/holons/holon.ts (1)
85-93: ⚡ Quick winPrecompute document frequencies once per query.
bm25Scorerecomputesdfby scanning every document for each term and each holon, making search scale quadratically with corpus size. Sincedfis query-global, cache it before scoring.♻️ Proposed refactor
const docs = cc.holons.map(h => tokenize(`${h.title} ${h.summary}`)); const N = docs.length; const avgLen = docs.reduce((s, d) => s + d.length, 0) / (N || 1); const K1 = 1.5, B = 0.75; + const dfByTerm = new Map<string, number>(); + for (const term of new Set(terms)) { + dfByTerm.set(term, docs.reduce((count, d) => count + (d.includes(term) ? 1 : 0), 0)); + } const bm25Score = (docTokens: string[]) => { const dl = docTokens.length; return terms.reduce((sum, term) => { - const df = docs.filter(d => d.includes(term)).length; + const df = dfByTerm.get(term) ?? 0; if (df === 0) return sum;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcp-server/src/holons/holon.ts` around lines 85 - 93, The bm25Score function is recalculating document frequency (df) for each term every time it's invoked (once per document scored), causing quadratic scaling with corpus size. Since df is query-global and identical for all documents in the same query, precompute it once before iterating through documents to score. Create a Map or similar structure outside the bm25Score function that caches the document frequency for each term in the current query, then access this precomputed cache inside bm25Score instead of filtering the docs array repeatedly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@mcp-server/src/holons/holon.ts`:
- Around line 85-93: The bm25Score function is recalculating document frequency
(df) for each term every time it's invoked (once per document scored), causing
quadratic scaling with corpus size. Since df is query-global and identical for
all documents in the same query, precompute it once before iterating through
documents to score. Create a Map or similar structure outside the bm25Score
function that caches the document frequency for each term in the current query,
then access this precomputed cache inside bm25Score instead of filtering the
docs array repeatedly.
In `@mcp-server/src/holons/holons.test.ts`:
- Around line 232-250: The test claims to validate "BM25 matches first, then
substring-only matches" but the current fixture only contains documents that
match via both BM25 and substring search, so the union and precedence logic are
not actually being tested. Add at least two new fixture documents to the test
setup: one that matches the query 'rope' via BM25 scoring only (not substring),
and another that matches via substring only (not BM25). Then add assertions to
the test to verify that when running hybrid mode search on 'rope', the BM25-only
match appears before the substring-only match in the results, confirming the
intended precedence behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7b705f5a-3b57-4ce2-a53c-ec5003ce68ee
📒 Files selected for processing (4)
mcp-server/src/holons/holon.tsmcp-server/src/holons/holons.test.tsmcp-server/src/memory/memory.test.tsmcp-server/src/memory/memory.ts
fix(mcp): correct BM25 IDF, expose search score, harden memory.list
Summary
Three Phase 8 bugs found while smoke-testing the new MCP tools (
holon.search,memory.*,graph.export) against the bundled sample vault. Phase 8 ships in commit01cfd2d(feat(phase8): graph export, vault write-back, persistent memory, BM25 search) and was the only release with no test coverage for any of the new features. This PR fixes the three most user-visible defects and adds regression tests.Bugs fixed
1. BM25 IDF formula was wrong —
mcp-server/src/holons/holon.tsBefore:
avgLenis the average tokenized document length. It was used in place of corpus sizeNand document frequencydf. The result was a non-standard BM25 variant whose ranking was an artefact of doc lengths rather than term discriminative power. Across corpora of different sizes the relative ordering is incomparable.After:
Standard BM25 IDF with Laplace smoothing, using
df = docs.filter(d => d.includes(term)).lengthfor the per-term document frequency. The TF component was correct and is unchanged.2.
holon.searchdropped the score field —mcp-server/src/holons/holon.tsBefore:
The
scorevalue was computed (used internally for sorting) then thrown away before returning. Callers could not see the ranking signal.After:
scoreis now attached to each returned holon inbm25andhybridmodes.substringmode never computed a score, so its responses are unchanged (noscorefield is added — verified bysubstring mode does not attach score fieldtest).3.
memory.listcrashed on non-string values —mcp-server/src/memory/memory.tsBefore:
Schema declares
value: { type: 'string', required: true }, but a hand-edited_ai_memory.jsonor an older writer could produce numeric / boolean / object values.memory.listwould throwTypeError: e.value.slice is not a function.After:
Defensive coercion. Behaviour for the documented string case is identical.
Tests added
mcp-server/src/holons/holons.test.ts: +5 tests forholon.searchbm25 mode returns hits ordered by score, not insertion order(regression: score field stripped)bm25 mode ranks doc-with-higher-tf higher than doc-with-lower-tf(regression: IDF used avgLen instead of N/df; ranking would invert for the high-tf/low-tf pair under the broken formula)hybrid mode union: BM25 matches first, then substring-only matchessubstring mode does not attach score field(negative — guards against the score being added to the wrong code path)mcp-server/src/memory/memory.test.ts(new file): +10 testsset then get round-trips a string value with tagsset preserves created_at on update, bumps updated_atreturns all entries with key, tags, preview, updated_atdoes not crash when stored value is a number(regression for the.slicecrash)does not crash when stored value is a booleandoes not crash when stored value is nulldoes not crash when stored value is an objecttruncates long string previews to 120 charsdeletes existing key and returns ok=truereturns ok=false for unknown key without throwingwrites _ai_memory.json into the vault root, not elsewhereTest plan
cd mcp-server && npm ci && npm run build && npm test— 228/228 pass (40 suites), 1.6scd compiler && python -m pytest tests/ -v— 102/102 pass (was 102/102 before this PR; no Python changes)tests/fixtures/sample-vault/:python -m compilerproducescontext-core.json, then BM25 search returns 3 hits for "attention" with score values now exposed (previouslyscore: undefined)Out of scope
graph.exportis not covered by tests in this PR. The Phase 8 commit added the tool but no test; this PR does not changegraph.exportcode. Adding tests for it is a follow-up.mcp-server/src/compile-trigger.test.tsalready importsmemoryindirectly; no test churn there.Risk
Low. All three changes are local to the operation handlers and the behaviour for documented input is preserved. BM25 ranking changes only for queries where the buggy IDF and the standard IDF disagree on relative order; the standard formula is the canonical one used in IR literature.
memory.listis strictly more permissive after the fix.Notes for reviewer
The BM25 test
bm25 mode ranks doc-with-higher-tf higher than doc-with-lower-tfconstructs a synthetic 2-doc corpus and exercises the ranking against the standard formula. The buggy formula in the previous code would have ranked the low-tf doc higher (because its idf would have been computed againstavgLenof the high-tf doc's token stream), so this test fails on the old code and passes on the new code — verified locally.Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com