perf(stdio): fix slow search — wrong index root, per-call merkle walk, add search --trace command by aeneasr · Pull Request #61 · ory/lumen

aeneasr · 2026-03-19T13:56:20Z

Summary

- Fix slow/wrong index root: `findEffectiveRoot` now defaults to git root for normal subdirs but falls back to the search path itself when it's under a `SkipDir` (e.g. `testdata/`), preventing accidental whole-repo indexing
- Fix context-length errors with BERT models: `splitOversizedChunks` now accounts for the `"// filePath\n"` embed prefix when computing chunk budgets; `createSubChunks` skips header+overlap when they'd exceed the per-chunk limit, eliminating non-convergent re-splitting cycles
- Second split pass after merge: `mergeUndersizedChunks` can produce oversized chunks; a second `splitOversizedChunks` call in `indexWithTree` catches them
- Restore Markdown/YAML/JSON chunkers: `MarkdownChunker` and `StructuredChunker` registrations were dropped in a prior commit; restored with tests and extension entries
- Background index pre-warming: `SessionStart` hook triggers index in background so the first search in a new session is instant
- TTL-based freshness caching: skip merkle walk within 30s window (configurable via `LUMEN_FRESHNESS_TTL`); pre-populate TTL from `LastIndexedAt` on startup
- Seed from sibling worktree: new indexes copy from a sibling worktree's DB when available to avoid full re-embed
- `search --trace` CLI command: diagnostic tool showing effective root, cache hits, and timing

Test plan

- `go test ./...`: all unit/integration tests pass
- `go test -tags e2e ./...`: all e2e tests pass (requires Ollama + all-minilm)
- `golangci-lint run`: zero issues
- Cupaloy snapshots updated for all lang tests

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When cwd is a large monorepo root and path is a small subdirectory, the previous behaviour would create a new index rooted at cwd — scanning and embedding the entire ancestor tree (e.g. 4503 files, 8+ minutes) just to search a subdirectory. With this fix, getOrCreate only uses preferredRoot (cwd) if a DB already exists at that path. Otherwise it falls through to findEffectiveRoot(path), which scopes the index to the actual path being searched. Once a cwd-level index has been built (e.g. via lumen index), subsequent searches will reuse it and benefit from the shared project-wide index.
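The root-selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `chooseRoot`, `dbPathFor`, and the `.lumen/index.db` location are hypothetical names invented for the example, and the `fallback` callback stands in for `findEffectiveRoot`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dbPathFor maps an index root to its database file. The ".lumen/index.db"
// location is a placeholder for this sketch, not Lumen's real layout.
func dbPathFor(root string) string {
	return filepath.Join(root, ".lumen", "index.db")
}

// chooseRoot returns preferredRoot (the cwd) only when an index DB already
// exists there. Otherwise it defers to fallback, which stands in for
// findEffectiveRoot and scopes the index to the path being searched.
func chooseRoot(preferredRoot, path string, fallback func(string) string) string {
	if preferredRoot != "" {
		if _, err := os.Stat(dbPathFor(preferredRoot)); err == nil {
			return preferredRoot
		}
	}
	return fallback(path)
}

func main() {
	scopeToPath := func(p string) string { return p }
	// No DB exists under the (nonexistent) cwd, so the search path wins.
	fmt.Println(chooseRoot("/no/such/monorepo", "/no/such/monorepo/svc", scopeToPath))
}
```

The point of the existence check is that a cwd-level index is reused once someone has paid for it, but is never created implicitly by a narrow search.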

Within a Claude session the MCP server receives many consecutive search calls. Previously every call re-walked the entire project tree to check if the index was stale — 1-3s of pure filesystem I/O even when nothing had changed (2s for a 4500-file monorepo). Add a lastCheckedAt timestamp to each cacheEntry. ensureIndexed skips EnsureFresh entirely if the index was confirmed fresh within the last 30s. ForceReindex bypasses the TTL. touchChecked updates both the projectDir entry and its effectiveRoot alias so the TTL is consistent regardless of which key the caller uses.
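A minimal sketch of the TTL check, assuming a simple map-backed cache. `cacheEntry`, `lastCheckedAt`, and `touchChecked` are names taken from the commit message; `indexCache`, `canSkipWalk`, and the explicit `now` parameter are assumptions made so the example can be run without sleeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const defaultTTL = 30 * time.Second

// demoStart is a fixed timestamp so the example is deterministic.
var demoStart = time.Date(2026, 3, 19, 14, 0, 0, 0, time.UTC)

type cacheEntry struct {
	lastCheckedAt time.Time
}

// indexCache is a hypothetical container for per-root cache entries.
type indexCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]*cacheEntry
}

func newIndexCache(ttl time.Duration) *indexCache {
	return &indexCache{ttl: ttl, entries: make(map[string]*cacheEntry)}
}

// touchChecked stamps every given key (for example the project dir and its
// effective-root alias) so the TTL is the same whichever key is used later.
func (c *indexCache) touchChecked(now time.Time, keys ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, k := range keys {
		e, ok := c.entries[k]
		if !ok {
			e = &cacheEntry{}
			c.entries[k] = e
		}
		e.lastCheckedAt = now
	}
}

// canSkipWalk reports whether the freshness walk can be skipped: the index
// was confirmed fresh less than ttl ago and no reindex was forced.
func (c *indexCache) canSkipWalk(key string, now time.Time, forceReindex bool) bool {
	if forceReindex {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || e.lastCheckedAt.IsZero() {
		return false
	}
	return now.Sub(e.lastCheckedAt) < c.ttl
}

func main() {
	c := newIndexCache(defaultTTL)
	c.touchChecked(demoStart, "/repo/pkg", "/repo")
	fmt.Println("10s later:", c.canSkipWalk("/repo", demoStart.Add(10*time.Second), false))
	fmt.Println("45s later:", c.canSkipWalk("/repo", demoStart.Add(45*time.Second), false))
	fmt.Println("forced:   ", c.canSkipWalk("/repo", demoStart.Add(10*time.Second), true))
}
```

Passing `now` explicitly is only a convenience for the sketch; a real implementation would more likely read the clock inside the method.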

Previously findEffectiveRoot returned path itself when no existing index was found in the ancestry, causing each subdirectory search to create its own isolated index. This meant sibling directories got separate DBs, files were embedded multiple times, and cross-directory searches missed results. Now when no existing DB is found within the git repo boundary, the git root is returned as the effective root. All first searches in a repo share one index at the repo root. The git boundary cap is preserved so ancestor indexes above the repo root are still never adopted.
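The lookup order can be sketched as below, with the `SkipDir` exception from the summary folded in. This is an approximation under stated assumptions: the signature, the `hasIndex` callback, the `relParts` helper, and the contents of `skipDirs` are invented for illustration, and paths are assumed to be clean and absolute.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// skipDirs is an illustrative set; the real list in Lumen may differ.
var skipDirs = map[string]bool{"testdata": true}

// relParts returns the path components of path below root, and whether
// path lies inside root at all.
func relParts(root, path string) ([]string, bool) {
	rel, err := filepath.Rel(root, path)
	sep := string(filepath.Separator)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+sep) {
		return nil, false
	}
	if rel == "." {
		return nil, true
	}
	return strings.Split(rel, sep), true
}

// findEffectiveRoot picks the index root for a search path:
//  1. outside a git repo, the path itself;
//  2. the nearest existing index between path and gitRoot (inclusive);
//  3. the path itself if it sits under a skip directory;
//  4. otherwise the git root, so sibling directories share one index.
func findEffectiveRoot(path, gitRoot string, hasIndex func(string) bool) string {
	if gitRoot == "" {
		return path
	}
	parts, inside := relParts(gitRoot, path)
	if !inside {
		return path
	}
	// Walk upward but never past gitRoot, so ancestor indexes above the
	// repository are not adopted.
	dir := path
	for i := 0; i <= len(parts); i++ {
		if hasIndex(dir) {
			return dir
		}
		dir = filepath.Dir(dir)
	}
	for _, p := range parts {
		if skipDirs[p] {
			return path
		}
	}
	return gitRoot
}

func main() {
	none := func(string) bool { return false }
	fmt.Println(findEffectiveRoot("/repo/pkg/api", "/repo", none))
	fmt.Println(findEffectiveRoot("/repo/testdata/sample-project", "/repo", none))
}
```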

On macOS, t.TempDir() returns symlink paths (/var/folders/...) while git.RepoRoot() resolves them via EvalSymlinks (/private/var/folders/...). This mismatch caused filepath.Rel(effectiveRoot, input.Path) to produce paths with ".." components, which matched nothing in the database and returned empty search results from subdirectories. Fix by applying filepath.EvalSymlinks to both Path and Cwd in validateSearchInput so all path comparisons operate on resolved paths.

Also fix TestE2E_GitRootFallbackSharedIndex: the third assertion searched from apiDir expecting pkg/ results, but pathPrefix filtering correctly restricts results to the searched subdirectory. Changed to search from the repo root to verify the shared index is complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
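The mismatch is easy to reproduce with a temporary symlink. The sketch below is illustrative only: `resolve`, `relToRoot`, and `makeLinkedDirs` are hypothetical helpers, not functions from the PR, and the fallback to `filepath.Clean` when resolution fails is an assumption about how nonexistent paths might be handled.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolve returns p with symlinks resolved, falling back to the cleaned
// input when resolution fails (for example, the path does not exist).
func resolve(p string) string {
	if r, err := filepath.EvalSymlinks(p); err == nil {
		return r
	}
	return filepath.Clean(p)
}

// relToRoot computes searchPath relative to effectiveRoot after resolving
// both sides, so a symlinked spelling cannot introduce ".." components.
func relToRoot(effectiveRoot, searchPath string) (string, error) {
	return filepath.Rel(resolve(effectiveRoot), resolve(searchPath))
}

// makeLinkedDirs creates <tmp>/target/pkg and a symlink <tmp>/link -> target.
// It returns tmp and a cleanup function.
func makeLinkedDirs() (string, func(), error) {
	tmp, err := os.MkdirTemp("", "lumen-symlink-")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(tmp) }
	if err := os.MkdirAll(filepath.Join(tmp, "target", "pkg"), 0o755); err != nil {
		cleanup()
		return "", nil, err
	}
	if err := os.Symlink(filepath.Join(tmp, "target"), filepath.Join(tmp, "link")); err != nil {
		cleanup()
		return "", nil, err
	}
	return tmp, cleanup, nil
}

func main() {
	tmp, cleanup, err := makeLinkedDirs()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()

	root := filepath.Join(tmp, "target")
	viaLink := filepath.Join(tmp, "link", "pkg")

	raw, _ := filepath.Rel(root, viaLink) // contains ".." because of the symlink
	fixed, _ := relToRoot(root, viaLink)
	fmt.Println("unresolved:", raw)
	fmt.Println("resolved:  ", fixed)
}
```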

Adds a table-driven e2e test that systematically exercises all path topology combinations: plain dir, git root, git subdirectory, sibling subdirectory sharing, cwd fallback, external worktree, internal worktree subdir, and symlink variants.

Key assertions:
- wantNoSymbols: verifies pathPrefix scoping actively excludes out-of-scope results (previously untested)
- wantMinFiles>=2: distinguishes correct full-root indexing from narrow subdir-only indexing
- second call in git-subdir-sibling: verifies sibling dirs share git-root index without reindexing

Closes the two structural gaps identified in PR #61: symlink path normalization regression and git-repo subdirectory pathPrefix filtering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
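To make the scoping assertions concrete, here is a small table-driven sketch of prefix filtering over a shared index. It is not the e2e test from the PR: `filterByPathPrefix` is a hypothetical stand-in for the pathPrefix filter, `wantNoFile` is a file-level analogue of `wantNoSymbols`, and the file names are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// filterByPathPrefix keeps files at or below prefix. Paths are
// slash-separated and relative to the index root; an empty prefix keeps
// everything. The trailing "/" check stops "pkg" from matching "pkgx/...".
func filterByPathPrefix(files []string, prefix string) []string {
	prefix = strings.Trim(prefix, "/")
	var out []string
	for _, f := range files {
		if prefix == "" || f == prefix || strings.HasPrefix(f, prefix+"/") {
			out = append(out, f)
		}
	}
	return out
}

func contains(files []string, name string) bool {
	for _, f := range files {
		if f == name {
			return true
		}
	}
	return false
}

func main() {
	indexed := []string{"api/handler.go", "pkg/util.go", "pkgx/other.go"}
	cases := []struct {
		name         string
		prefix       string
		wantMinFiles int
		wantNoFile   string // must be excluded by scoping; "" means no check
	}{
		{"repo root sees the whole shared index", "", 3, ""},
		{"api subdir excludes pkg", "api", 1, "pkg/util.go"},
		{"pkg does not match pkgx", "pkg", 1, "pkgx/other.go"},
	}
	for _, tc := range cases {
		got := filterByPathPrefix(indexed, tc.prefix)
		ok := len(got) >= tc.wantMinFiles &&
			(tc.wantNoFile == "" || !contains(got, tc.wantNoFile))
		fmt.Printf("%-40s ok=%v\n", tc.name, ok)
	}
}
```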

Two tests (TestGenerateSessionContext_NoIndex, TestHookOutputJSON) called the real generateSessionContext with a non-existent path, which triggered spawnBackgroundIndexer. In a test binary os.Executable() returns the test binary itself, so the spawned process ran all tests with no -test.run filter (including those two), causing exponential process proliferation (fork bomb). Fixed by switching both tests to generateSessionContextInternal with a no-op bgIndexer mock, matching the pattern used by all other hook tests.

Also added t.Cleanup(idx.Close) to four getOrCreate tests that created *index.Indexer values (holding open SQLite WAL file handles) without ever closing them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
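The injection pattern that avoids the self-spawning test binary might look like the sketch below. The names `bgIndexer` and `generateSessionContextInternal` come from the commit message, but the signature, the `hasIndex` callback, and the returned strings are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// bgIndexer starts background indexing for dir. Production code would spawn
// a detached process here; tests pass a recorder or a no-op instead, so the
// test binary never re-executes itself.
type bgIndexer func(dir string) error

// generateSessionContextInternal is a simplified stand-in for the hook
// logic: it only asks for background indexing when no index exists yet.
func generateSessionContextInternal(cwd string, hasIndex func(string) bool, spawn bgIndexer) string {
	if hasIndex(cwd) {
		return "index ready"
	}
	if err := spawn(cwd); err != nil {
		return "background indexing failed: " + err.Error()
	}
	return "background indexing started"
}

func main() {
	noIndex := func(string) bool { return false }

	// A recording mock: captures the request without starting a process.
	var spawned []string
	recorder := func(dir string) error {
		spawned = append(spawned, dir)
		return nil
	}
	fmt.Println(generateSessionContextInternal("/no/such/project", noIndex, recorder))
	fmt.Println("spawn requests:", spawned)

	failing := func(string) error { return errors.New("spawn disabled") }
	fmt.Println(generateSessionContextInternal("/no/such/project", noIndex, failing))
}
```

Because the spawner is a plain function value, a test can assert on how often and with which directory it was called, which the real process-spawning path cannot offer safely.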

On SessionStart, spawn a detached 'lumen index <cwd>' process so the first search in a new session doesn't pay the full embed cost. Uses dependency injection (findDonor/bgIndexer callbacks) so the hook is testable without spawning real processes.

In getOrCreate, pre-populate the freshness TTL cache entry when the index was stamped recently by background pre-warming, avoiding a redundant merkle walk on the very first search.

Propagate seed warnings (sibling copy failures) through getOrCreate and up to SemanticSearchOutput so callers are informed rather than silently degraded.

Adds LastIndexedAt() to Indexer to read the last_indexed_at metadata field written after every successful index run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
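The pre-population step could be decided by a small pure function like the one below. This is a sketch under assumptions: `seedLastChecked` is a hypothetical helper, and treating a future timestamp as unusable is a defensive choice of this example rather than documented behaviour.

```go
package main

import (
	"fmt"
	"time"
)

const freshnessTTL = 30 * time.Second

// demoNow is a fixed "current time" so the example is deterministic.
var demoNow = time.Date(2026, 3, 19, 14, 0, 0, 0, time.UTC)

// seedLastChecked decides whether a stored last_indexed_at stamp is recent
// enough to seed the freshness cache. It returns the timestamp to store as
// lastCheckedAt and true when the first search may skip the merkle walk.
func seedLastChecked(lastIndexedAt, now time.Time, ttl time.Duration) (time.Time, bool) {
	// A missing stamp, or one from the future (clock skew), is not trusted.
	if lastIndexedAt.IsZero() || lastIndexedAt.After(now) {
		return time.Time{}, false
	}
	if now.Sub(lastIndexedAt) >= ttl {
		return time.Time{}, false
	}
	return lastIndexedAt, true
}

func main() {
	for _, age := range []time.Duration{5 * time.Second, 2 * time.Minute} {
		_, ok := seedLastChecked(demoNow.Add(-age), demoNow, freshnessTTL)
		fmt.Printf("indexed %v ago: seed TTL cache = %v\n", age, ok)
	}
}
```

Seeding with the original stamp, rather than with the current time, keeps the skip window anchored to when the index was actually verified.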

Key fixes:

1. splitOversizedChunks: account for filepath prefix in embed text
   - Checks now use budget = maxChars - (3 + len(filePath) + 1) so the full "// path\n" + content text stays within the model's context
   - Passes the tighter budget to splitChunk and createSubChunks
2. createSubChunks: skip header+overlap when it would exceed budget
   - Adds maxChars parameter; only prepends header+overlap lines if the total still fits, preventing non-convergent re-splitting cycles
   - Existing tests updated with maxChars=0 (no limit) to preserve behavior
3. index.go: second splitOversizedChunks pass after mergeUndersizedChunks
   - Prevents oversized chunks from surviving after merge expands them
4. Restore MarkdownChunker and re-register .md/.yaml/.json extensions
   - Languages were removed in a prior commit; restored chunker and tests
5. Fix findEffectiveRoot git-root fallback for SkipDir paths
   - Previously always defaulted to gitRoot; now checks pathCrossesSkipDir
   - testdata/ paths (a Go SkipDir convention) now scope to their own dir, preventing testdata/sample-project from indexing the entire repo
6. e2e: add TTL sleeps to TestE2E_IncrementalIndex
   - File changes must wait >1s for the freshness TTL to expire before re-searching, matching TestE2E_FreshnessTTLSkipsMerkleWalk pattern
7. e2e lang tests: cap LUMEN_MAX_CHUNK_TOKENS=100 for BERT context window
   - all-minilm uses 4x denser tokenisation than the 4-chars/token estimate
8. Update all cupaloy snapshots to reflect new chunk boundaries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
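The budget arithmetic from item 1 can be checked with a small sketch. Only the formula `maxChars - (3 + len(filePath) + 1)` and the `"// path\n"` prefix come from the commit message; `chunkBudget`, `embedText`, and the naive line-based `splitToBudget` are hypothetical helpers and are much simpler than the real chunker (no headers, no overlap, byte-based lengths).

```go
package main

import (
	"fmt"
	"strings"
)

// embedText builds the text sent to the embedding model: a "// path\n"
// prefix followed by the chunk content.
func embedText(filePath, content string) string {
	return "// " + filePath + "\n" + content
}

// chunkBudget is the room left for content once the prefix is counted:
// 3 bytes for "// ", len(filePath) for the path, 1 byte for "\n".
func chunkBudget(maxChars int, filePath string) int {
	return maxChars - (3 + len(filePath) + 1)
}

// splitToBudget greedily packs whole lines into chunks of at most budget
// bytes. A single line longer than budget is hard-split (which may cut a
// multi-byte character; acceptable for a sketch). It returns nil when the
// budget is not positive, i.e. the prefix alone already exceeds the limit.
func splitToBudget(content string, budget int) []string {
	if budget <= 0 {
		return nil
	}
	var chunks []string
	var cur strings.Builder
	flush := func() {
		if cur.Len() > 0 {
			chunks = append(chunks, cur.String())
			cur.Reset()
		}
	}
	for _, line := range strings.SplitAfter(content, "\n") {
		for len(line) > budget {
			flush()
			chunks = append(chunks, line[:budget])
			line = line[budget:]
		}
		if cur.Len()+len(line) > budget {
			flush()
		}
		cur.WriteString(line)
	}
	flush()
	return chunks
}

func main() {
	const maxChars = 100
	path := "internal/chunker/split.go"
	content := strings.Repeat("x := compute(y)\n", 20)

	budget := chunkBudget(maxChars, path)
	fmt.Println("content budget:", budget)
	for i, c := range splitToBudget(content, budget) {
		fmt.Printf("chunk %d: embed length %d (limit %d)\n", i, len(embedText(path, c)), maxChars)
	}
}
```

Splitting against `maxChars` directly would let every chunk overshoot by the prefix length, which is the kind of off-by-prefix overflow the commit describes.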

aeneasr and others added 16 commits March 19, 2026 14:23

docs: add search trace implementation plan

1e0a42b

test(cmd): add tracer unit tests and tracer implementation

2e90e29

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(cmd): add search subcommand with --trace diagnostic flag

77ea7be

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(cmd): lint cleanup in search subcommand

b91e159

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(index): seed CLI index command from sibling worktree on first use

739ba46

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: add path topology e2e test matrix design spec

cd1b8be

docs: finalize path topology e2e spec with symlink and worktree details

4fe7cbd

chore: add .worktrees/ to .gitignore

983270a

aeneasr merged commit 57db1ea into main Mar 19, 2026
4 checks passed

aeneasr merged 16 commits into main from address-perf-issues
