
perf(stdio): fix slow search — wrong index root, per-call merkle walk, add search --trace command#61

Merged
aeneasr merged 16 commits into main from address-perf-issues
Mar 19, 2026
aeneasr commented Mar 19, 2026

Summary

  • Fix slow/wrong index root: `findEffectiveRoot` now defaults to git root for normal subdirs but falls back to the search path itself when it's under a `SkipDir` (e.g. `testdata/`), preventing accidental whole-repo indexing
  • Fix context-length errors with BERT models: `splitOversizedChunks` now accounts for the `"// filePath\n"` embed prefix when computing chunk budgets; `createSubChunks` skips header+overlap when they'd exceed the per-chunk limit, eliminating non-convergent re-splitting cycles
  • Second split pass after merge: `mergeUndersizedChunks` can produce oversized chunks; a second `splitOversizedChunks` call in `indexWithTree` catches them
  • Restore Markdown/YAML/JSON chunkers: `MarkdownChunker` and `StructuredChunker` registrations were dropped in a prior commit; restored with tests and extension entries
  • Background index pre-warming: the `SessionStart` hook triggers indexing in the background so the first search in a new session is instant
  • TTL-based freshness caching: skip merkle walk within 30s window (configurable via `LUMEN_FRESHNESS_TTL`); pre-populate TTL from `LastIndexedAt` on startup
  • Seed from sibling worktree: new indexes copy from a sibling worktree's DB when available to avoid full re-embed
  • `search --trace` CLI command: diagnostic tool showing effective root, cache hits, and timing
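
The root-selection rule in the first bullet can be illustrated with a minimal Go sketch. It is not the code from this PR: the `skipDirs` contents, the `rootFor` helper, and both signatures are assumptions made for illustration; only the name `pathCrossesSkipDir` and the `testdata/` example come from the description.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// skipDirs is a hypothetical skip list; the real set lives in the indexer.
var skipDirs = map[string]bool{"testdata": true, "vendor": true, "node_modules": true}

// pathCrossesSkipDir reports whether any component of path below root
// is a skipped directory name. The signature is an assumption.
func pathCrossesSkipDir(root, path string) bool {
	rel, err := filepath.Rel(root, path)
	if err != nil {
		return false
	}
	for _, part := range strings.Split(filepath.ToSlash(rel), "/") {
		if skipDirs[part] {
			return true
		}
	}
	return false
}

// rootFor (hypothetical helper) defaults to the git root, but scopes the
// index to the search path itself when that path sits under a skipped dir.
func rootFor(gitRoot, path string) string {
	if pathCrossesSkipDir(gitRoot, path) {
		return path
	}
	return gitRoot
}

func main() {
	for _, p := range []string{"/repo/internal/api", "/repo/testdata/sample-project"} {
		fmt.Printf("%s -> %s\n", p, rootFor("/repo", p))
	}
}
```

Without the fallback, a search under `testdata/` would select the git root and index the whole repository to reach files the walker skips anyway.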

Test plan

  • `go test ./...` — all unit/integration tests pass
  • `go test -tags e2e ./...` — all e2e tests pass (requires Ollama + all-minilm)
  • `golangci-lint run` — zero issues
  • Cupaloy snapshots updated for all lang tests

🤖 Generated with Claude Code

aeneasr and others added 16 commits March 19, 2026 14:23
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When cwd is a large monorepo root and path is a small subdirectory, the
previous behaviour would create a new index rooted at cwd — scanning and
embedding the entire ancestor tree (e.g. 4503 files, 8+ minutes) just to
search a subdirectory.

With this fix, getOrCreate only uses preferredRoot (cwd) if a DB already
exists at that path. Otherwise it falls through to findEffectiveRoot(path),
which scopes the index to the actual path being searched. Once a cwd-level
index has been built (e.g. via lumen index), subsequent searches will reuse
it and benefit from the shared project-wide index.
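
A minimal sketch of this decision, under stated assumptions: `chooseIndexRoot`, `dbExists`, and `fallback` are hypothetical names standing in for the logic inside getOrCreate, the DB lookup, and findEffectiveRoot respectively.

```go
package main

import "fmt"

// chooseIndexRoot returns preferredRoot (typically cwd) only when an index
// DB already exists for it; otherwise it defers to fallback(path).
// All three names are illustrative, not the real lumen API.
func chooseIndexRoot(preferredRoot, path string, dbExists func(string) bool, fallback func(string) string) string {
	if preferredRoot != "" && dbExists(preferredRoot) {
		return preferredRoot
	}
	return fallback(path)
}

func main() {
	existing := map[string]bool{} // no indexes built yet
	dbExists := func(dir string) bool { return existing[dir] }
	fallback := func(path string) string { return path } // stand-in for findEffectiveRoot

	// No cwd-level DB: the index is scoped by the fallback.
	fmt.Println(chooseIndexRoot("/monorepo", "/monorepo/services/api", dbExists, fallback))

	// After a cwd-level index has been built, searches reuse it.
	existing["/monorepo"] = true
	fmt.Println(chooseIndexRoot("/monorepo", "/monorepo/services/api", dbExists, fallback))
}
```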

Within a Claude session the MCP server receives many consecutive search
calls. Previously every call re-walked the entire project tree to check
if the index was stale — 1-3s of pure filesystem I/O even when nothing
had changed (2s for a 4500-file monorepo).

Add a lastCheckedAt timestamp to each cacheEntry. ensureIndexed skips
EnsureFresh entirely if the index was confirmed fresh within the last 30s.
ForceReindex bypasses the TTL. touchChecked updates both the projectDir
entry and its effectiveRoot alias so the TTL is consistent regardless of
which key the caller uses.
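
The TTL gate can be sketched as follows. Only `cacheEntry`, `lastCheckedAt`, `touchChecked`, and the 30s window come from the commit message; the `indexCache` type, `shouldCheck`, the injectable clock, and all signatures are assumptions made so the example runs on its own.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const defaultTTL = 30 * time.Second

type cacheEntry struct {
	effectiveRoot string
	lastCheckedAt time.Time
}

// indexCache is a hypothetical container for cache entries.
type indexCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock so tests need no sleeps
	entries map[string]*cacheEntry
}

func newIndexCache(ttl time.Duration, now func() time.Time) *indexCache {
	return &indexCache{ttl: ttl, now: now, entries: map[string]*cacheEntry{}}
}

// add registers projectDir plus an alias entry keyed by its effective root.
func (c *indexCache) add(projectDir, effectiveRoot string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[projectDir] = &cacheEntry{effectiveRoot: effectiveRoot}
	if _, ok := c.entries[effectiveRoot]; !ok {
		c.entries[effectiveRoot] = &cacheEntry{effectiveRoot: effectiveRoot}
	}
}

// shouldCheck reports whether the freshness walk must run for key.
// force models ForceReindex and always bypasses the TTL.
func (c *indexCache) shouldCheck(key string, force bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if force || !ok || e.lastCheckedAt.IsZero() {
		return true
	}
	return c.now().Sub(e.lastCheckedAt) >= c.ttl
}

// touchChecked stamps the projectDir entry and its effectiveRoot alias.
func (c *indexCache) touchChecked(projectDir string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[projectDir]
	if !ok {
		return
	}
	now := c.now()
	e.lastCheckedAt = now
	if alias, ok := c.entries[e.effectiveRoot]; ok {
		alias.lastCheckedAt = now
	}
}

// fakeClock is a manually advanced clock for the demo and tests.
type fakeClock struct{ t time.Time }

func newFakeClock() *fakeClock             { return &fakeClock{t: time.Unix(1_700_000_000, 0)} }
func (f *fakeClock) now() time.Time        { return f.t }
func (f *fakeClock) advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clk := newFakeClock()
	c := newIndexCache(defaultTTL, clk.now)
	c.add("/repo/sub", "/repo")

	fmt.Println("first call checks:", c.shouldCheck("/repo/sub", false))
	c.touchChecked("/repo/sub")
	clk.advance(10 * time.Second)
	fmt.Println("within TTL checks:", c.shouldCheck("/repo", false))
	clk.advance(defaultTTL)
	fmt.Println("after TTL checks:", c.shouldCheck("/repo/sub", false))
}
```

Stamping the alias together with the primary entry is what keeps the skip decision identical whether a caller looks up the project directory or the effective root.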

Previously findEffectiveRoot returned path itself when no existing index
was found in the ancestry, causing each subdirectory search to create its
own isolated index. This meant sibling directories got separate DBs,
files were embedded multiple times, and cross-directory searches missed
results.

Now when no existing DB is found within the git repo boundary, the git
root is returned as the effective root. All first searches in a repo
share one index at the repo root. The git boundary cap is preserved so
ancestor indexes above the repo root are still never adopted.
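
The lookup order described here could look roughly like the sketch below. `effectiveRootSketch` and the `hasIndex` callback are hypothetical; the real findEffectiveRoot may discover the git root and existing DBs differently.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// effectiveRootSketch walks from path up to gitRoot looking for an existing
// index. If none is found inside the repo, it returns gitRoot, so that all
// first searches in a repo share one index. Directories above gitRoot are
// never inspected. hasIndex is a stand-in for the real DB lookup.
func effectiveRootSketch(path, gitRoot string, hasIndex func(string) bool) string {
	rel, err := filepath.Rel(gitRoot, path)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return path // path is outside the repo: scope to the path itself
	}
	for dir := path; ; dir = filepath.Dir(dir) {
		if hasIndex(dir) {
			return dir
		}
		if dir == gitRoot {
			return gitRoot // git boundary cap: stop here, adopt the repo root
		}
	}
}

func main() {
	none := func(string) bool { return false }
	fmt.Println(effectiveRootSketch("/repo/pkg/api", "/repo", none))

	pkgOnly := func(dir string) bool { return dir == "/repo/pkg" }
	fmt.Println(effectiveRootSketch("/repo/pkg/api", "/repo", pkgOnly))
}
```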

On macOS, t.TempDir() returns unresolved paths that pass through a symlink (/var/folders/...) while
git.RepoRoot() resolves them via EvalSymlinks (/private/var/folders/...).
This mismatch caused filepath.Rel(effectiveRoot, input.Path) to produce
paths with ".." components, which matched nothing in the database and
returned empty search results from subdirectories.

Fix by applying filepath.EvalSymlinks to both Path and Cwd in
validateSearchInput so all path comparisons operate on resolved paths.
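
A self-contained sketch of the normalization, assuming a Unix-like system where os.Symlink is permitted. `resolvePath` and `relViaSymlink` are hypothetical helpers, not the body of validateSearchInput.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolvePath makes p absolute and resolves every symlink in it, so two
// spellings of the same directory compare equal afterwards.
func resolvePath(p string) (string, error) {
	abs, err := filepath.Abs(p)
	if err != nil {
		return "", err
	}
	return filepath.EvalSymlinks(abs)
}

// relViaSymlink builds tmp/real/sub plus a symlink tmp/link -> tmp/real,
// then computes the relative path from the resolved root to the resolved
// path that was reached through the symlink.
func relViaSymlink() (string, error) {
	tmp, err := os.MkdirTemp("", "lumen-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(tmp)

	realDir := filepath.Join(tmp, "real")
	if err := os.MkdirAll(filepath.Join(realDir, "sub"), 0o755); err != nil {
		return "", err
	}
	link := filepath.Join(tmp, "link")
	if err := os.Symlink(realDir, link); err != nil {
		return "", err
	}

	root, err := resolvePath(realDir)
	if err != nil {
		return "", err
	}
	path, err := resolvePath(filepath.Join(link, "sub"))
	if err != nil {
		return "", err
	}
	return filepath.Rel(root, path)
}

func main() {
	rel, err := relViaSymlink()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("relative path after resolving both sides:", rel)
}
```

Note that filepath.EvalSymlinks returns an error for paths that do not exist, so a real caller would need to decide how to treat a missing search path.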

Also fix TestE2E_GitRootFallbackSharedIndex: the third assertion searched
from apiDir expecting pkg/ results, but pathPrefix filtering correctly
restricts results to the searched subdirectory. Changed to search from
the repo root to verify the shared index is complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a table-driven e2e test that systematically exercises all path
topology combinations: plain dir, git root, git subdirectory, sibling
subdirectory sharing, cwd fallback, external worktree, internal worktree
subdir, and symlink variants.

Key assertions:
- wantNoSymbols: verifies pathPrefix scoping actively excludes out-of-scope
  results (previously untested)
- wantMinFiles>=2: distinguishes correct full-root indexing from narrow
  subdir-only indexing
- second call in git-subdir-sibling: verifies sibling dirs share git-root
  index without reindexing

Closes the two structural gaps identified in PR #61: symlink path
normalization regression and git-repo subdirectory pathPrefix filtering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two tests (TestGenerateSessionContext_NoIndex, TestHookOutputJSON) called
the real generateSessionContext with a non-existent path, which triggered
spawnBackgroundIndexer. In a test binary os.Executable() returns the test
binary itself, so the spawned process ran all tests with no -test.run filter
— including those two — causing exponential process proliferation (fork bomb).

Fixed by switching both tests to generateSessionContextInternal with a no-op
bgIndexer mock, matching the pattern used by all other hook tests.
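
The injection pattern can be sketched like this. The function name `sessionContextSketch`, its signature, and the returned strings are assumptions; only the idea of passing a bgIndexer callback that tests replace with a no-op comes from the commit message.

```go
package main

import "fmt"

// bgIndexer starts background indexing for dir. In production this would
// spawn a detached process; tests pass a stub instead. Signature assumed.
type bgIndexer func(dir string) error

// sessionContextSketch is a hypothetical stand-in for
// generateSessionContextInternal: it never spawns anything itself and only
// calls the injected indexer.
func sessionContextSketch(dir string, indexExists func(string) bool, spawn bgIndexer) string {
	if indexExists(dir) {
		return "index ready"
	}
	if err := spawn(dir); err != nil {
		return "index missing; background indexing failed: " + err.Error()
	}
	return "index missing; background indexing started"
}

func main() {
	calls := 0
	noop := func(dir string) error { calls++; return nil } // no process is spawned

	missing := func(string) bool { return false }
	fmt.Println(sessionContextSketch("/does/not/exist", missing, noop))
	fmt.Println("spawn calls:", calls)
}
```

Because the test binary never reaches a real spawn through os.Executable(), it cannot re-execute itself.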

Also added t.Cleanup(idx.Close) to four getOrCreate tests that created
*index.Indexer values (holding open SQLite WAL file handles) without ever
closing them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On SessionStart, spawn a detached 'lumen index <cwd>' process so the
first search in a new session doesn't pay the full embed cost. Uses
dependency injection (findDonor/bgIndexer callbacks) so the hook is
testable without spawning real processes.

In getOrCreate, pre-populate the freshness TTL cache entry when the
index was stamped recently by background pre-warming, avoiding a
redundant merkle walk on the very first search.

Propagate seed warnings (sibling copy failures) through getOrCreate
and up to SemanticSearchOutput so callers are informed rather than
silently degraded.

Adds LastIndexedAt() to Indexer to read the last_indexed_at metadata
field written after every successful index run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Key fixes:

1. splitOversizedChunks: account for filepath prefix in embed text
   - Checks now use budget = maxChars - (3 + len(filePath) + 1) so the
     full "// path\n" + content text stays within the model's context
   - Passes the tighter budget to splitChunk and createSubChunks

2. createSubChunks: skip header+overlap when it would exceed budget
   - Adds maxChars parameter; only prepends header+overlap lines if the
     total still fits, preventing non-convergent re-splitting cycles
   - Existing tests updated with maxChars=0 (no limit) to preserve behavior

3. index.go: second splitOversizedChunks pass after mergeUndersizedChunks
   - Prevents oversized chunks from surviving after merge expands them

4. Restore MarkdownChunker and re-register .md/.yaml/.json extensions
   - Languages were removed in a prior commit; restored chunker and tests

5. Fix findEffectiveRoot git-root fallback for SkipDir paths
   - Previously always defaulted to gitRoot; now checks pathCrossesSkipDir
   - testdata/ paths (a Go SkipDir convention) now scope to their own dir,
     preventing testdata/sample-project from indexing the entire repo

6. e2e: add TTL sleeps to TestE2E_IncrementalIndex
   - File changes must wait >1s for the freshness TTL to expire before
     re-searching, matching TestE2E_FreshnessTTLSkipsMerkleWalk pattern

7. e2e lang tests: cap LUMEN_MAX_CHUNK_TOKENS=100 for BERT context window
   - all-minilm uses 4x denser tokenisation than the 4-chars/token estimate

8. Update all cupaloy snapshots to reflect new chunk boundaries
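
Items 1 and 2 can be made concrete with a small sketch. The budget formula and the `maxChars=0` convention are taken from the list above; `chunkBudget`, `embedText`, `prependIfFits`, and the naive line-based `splitToBudget` are hypothetical helpers and not the real splitOversizedChunks or createSubChunks.

```go
package main

import (
	"fmt"
	"strings"
)

// embedText builds the text sent to the embedder: "// path\n" + content.
func embedText(filePath, content string) string {
	return "// " + filePath + "\n" + content
}

// chunkBudget is maxChars minus the embed prefix (3 + len(filePath) + 1).
// Clamping to 1 is an assumption to keep the splitter below terminating.
func chunkBudget(maxChars int, filePath string) int {
	b := maxChars - (3 + len(filePath) + 1)
	if b < 1 {
		b = 1
	}
	return b
}

// prependIfFits adds header (and any overlap) only if the result still fits.
// maxChars == 0 means "no limit", matching the updated tests.
func prependIfFits(header, body string, maxChars int) string {
	if maxChars > 0 && len(header)+len(body) > maxChars {
		return body
	}
	return header + body
}

// splitToBudget greedily packs whole lines into pieces of at most budget
// bytes, hard-splitting any single line that is longer than the budget.
// Sizes are byte-based, so a hard split may cut a multi-byte character.
func splitToBudget(content string, budget int) []string {
	if budget < 1 {
		return []string{content}
	}
	var out []string
	var cur strings.Builder
	flush := func() {
		if cur.Len() > 0 {
			out = append(out, cur.String())
			cur.Reset()
		}
	}
	for _, line := range strings.SplitAfter(content, "\n") {
		for len(line) > budget {
			flush()
			out = append(out, line[:budget])
			line = line[budget:]
		}
		if cur.Len()+len(line) > budget {
			flush()
		}
		cur.WriteString(line)
	}
	flush()
	return out
}

func main() {
	const maxChars = 40
	filePath := "pkg/a.go"
	content := strings.Repeat("line of source code\n", 4)

	budget := chunkBudget(maxChars, filePath)
	for _, piece := range splitToBudget(content, budget) {
		fmt.Printf("%d chars with prefix (limit %d)\n", len(embedText(filePath, piece)), maxChars)
	}
}
```

Splitting against `maxChars` directly would let a piece plus its prefix exceed the limit, which is the context-length failure item 1 addresses.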

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aeneasr aeneasr merged commit 57db1ea into main Mar 19, 2026
4 checks passed
aeneasr added a commit that referenced this pull request Mar 19, 2026