fix(index): purge stale donor-seeded file records + freshness TTL cache + debug logging by aeneasr · Pull Request #63 · ory/lumen

aeneasr · 2026-03-21T09:50:02Z

Summary

Root cause of markdown indexing: When a new worktree index is seeded from a donor (via file copy), the donor's DB may contain file records with extensions that are no longer supported (e.g. .md from an older binary). On every subsequent EnsureFresh, these entries caused phantom "removed" diffs and their file records accumulated in the DB indefinitely.
Fix: The purge loop now calls DeleteFileChunks(path) for each stale-extension record, eagerly removing it from the DB rather than just hiding it from the in-memory diff.
Freshness TTL cache (LUMEN_FRESHNESS_TTL, default 60s): successive semantic_search calls skip the Merkle tree rebuild when the index was confirmed fresh recently, eliminating unnecessary latency on every search.
Structured debug logging to ~/.local/share/lumen/debug.log (configurable via LUMEN_LOG_LEVEL): logs cache hits/misses, seeding decisions, reindex triggers, and per-file diffs.

Regression Tests (red → green)

Four tests cover the failure scenario end-to-end:

Test	What it guards
`TestIndexer_StaleUnsupportedExtensionNotCountedAsRemoved`	Stale `.md` must not appear in `RemovedFiles`
`TestIndexer_StaleUnsupportedExtensionDeletedFromDB`	Stale `.md` must be fully purged from DB after reindex
`TestIndexer_SupportedFileRemovedFromDisk_CountedAsRemoved`	Purge filter must not suppress legitimate supported-file removals
`TestIndexer_MixedStaleAndValidRemovals`	Mixed case: only the real `.go` deletion appears in `RemovedFiles`

Test Plan

go test ./... passes
Regression tests added and verified red→green

🤖 Generated with Claude Code

…ing + freshness TTL cache + debug logging When a new worktree index is seeded from a donor (another worktree), the entire donor SQLite database is copied. If the donor was built by an older binary that indexed files with now-unsupported extensions (e.g. .md), those file-hash records are carried over. On every subsequent EnsureFresh call the Merkle tree (correctly) excludes those paths, the incremental diff treats them as "removed", DeleteFileChunks is called (a no-op – chunks are already gone), but the file record itself is never deleted, so the same phantom "removed" entry reappears on every search forever. Fix: before computing the Merkle diff, purge any oldHash entry whose extension is not in SupportedExtensions. This breaks the perpetual phantom-removal loop. Also expose AddedFiles/ModifiedFiles/RemovedFiles and FirstIndex on Stats so callers (and tests) can assert on exactly what triggered a reindex. Also adds: - Freshness TTL cache (LUMEN_FRESHNESS_TTL, default 60s) so successive semantic_search calls skip the Merkle tree rebuild when nothing has changed, eliminating unnecessary re-index latency on every search. - Structured JSON debug logging to ~/.local/share/lumen/debug.log (LUMEN_LOG_LEVEL env var; default "info") covering cache hits/misses, seeding decisions, reindex triggers and file-level diffs. - LUMEN_FRESHNESS_TTL config field + EnvOrDefaultDuration helper. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…y delete orphaned DB records The previous commit only filtered stale unsupported-extension records out of the in-memory oldHashes map. Without also calling DeleteFileChunks for those paths, the file records would linger in the DB forever (accumulating across reindex cycles), and would be returned by GetFileHashes on every future run — requiring the purge loop to run each time. Fix: call DeleteFileChunks(path) inside the purge loop so orphaned file records are removed from the DB eagerly, not just hidden from the diff. Add three regression tests: - TestIndexer_StaleUnsupportedExtensionDeletedFromDB: verifies orphaned .md records are fully removed from the DB after EnsureFresh, not just hidden from the diff. - TestIndexer_SupportedFileRemovedFromDisk_CountedAsRemoved: ensures the stale-extension filter does not suppress legitimate removals of supported files deleted from disk. - TestIndexer_MixedStaleAndValidRemovals: verifies that when both stale .md entries and a genuinely deleted .go file are present, only the .go file appears in RemovedFiles. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Implements internal/indexlock with TryAcquire/IsHeld/Release backed by flock(2) — the OS releases the lock automatically on process death (including SIGKILL), making stale locks impossible. Windows builds get no-op stubs that compile cleanly but provide no mutual-exclusion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…TERM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…t-path Pass the already-computed dbPath from handleSemanticSearch into ensureIndexed instead of recomputing it inside the function, avoiding a redundant config.DBPathForProject call on every search. Add TestEnsureIndexed_SkipsWhenLockHeld which spawns a subprocess to hold the exclusive flock and asserts that ensureIndexed returns (Reindexed=false, err=nil) without invoking EnsureFresh. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

pterm.DefaultProgressbar has ShowElapsedTime=true by default, which causes schedule.Every to spawn a background goroutine that periodically reads Title via getString(). Concurrently, pterm.Fprint() writes Title via UpdateTitle() on all active bars whenever any print call is made. Since flock(2) lacks a mutex on Title/IsActive, the goroutine and Fprint() can race after Stop() signals but before the goroutine observes IsActive=false. Disabling ShowElapsedTime prevents the goroutine from being created entirely, eliminating the race at its source.

…orts - Replace manual os.OpenFile with gopkg.in/natefinch/lumberjack.v2 for automatic log rotation (10 MB / 3 compressed backups). - Change newDebugLogger return type from *os.File to io.Closer so the caller is not coupled to the underlying writer implementation. - Remove internal/index/repro_test.go (hardcoded personal path, not suitable for a shared repo; gated behind //go:build repro but still leaked developer state). - Fix stdlib import ordering in cmd/stdio.go (io before log/slog). - Run go mod tidy: resolves duplicate testify h1 entries and pins lumberjack at v2.2.1 via gopkg.in/natefinch/lumberjack.v2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace platform-specific flock(2) / no-op Windows stubs with github.com/gofrs/flock which uses flock(2) on Unix and LockFileEx on Windows. This enables real advisory locking on all platforms. - Merge lock_unix.go + lock_windows.go into single lock.go - Implement real Windows spawnBackgroundIndexer with CREATE_NEW_PROCESS_GROUP - Add go cmd.Wait() in both spawners to reap child processes - Remove redundant LOCK_UN calls (closing FD releases flocks) - Add comment on ristretto SetWithTTL async admission - Fix errcheck lint on f.Close() - Update README platform support to include Windows Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…worktree When a search path pointed into an internal .worktrees/ subdirectory, getOrCreate treated it as an independent worktree project and created a new index from scratch (Reindexed=true), even though the parent repo was already indexed. Root cause: findEffectiveRoot could not walk above the worktree's own git root (git rev-parse --show-toplevel returns the worktree dir), so it never found the cached parent index. Additionally, on macOS the symlink /var → /private/var caused path comparisons to fail prematurely. Fix: when projectPath is a worktree and preferredRoot has an existing index (in-memory cache or on-disk DB), use the parent index directly. Also resolve symlinks in findEffectiveRoot before comparing against the git root to prevent premature walk termination on macOS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Prevents permanent data loss when embedding API times out or process crashes mid-batch. Previously the file hash was committed before chunks, making the file appear 'up to date' with zero searchable chunks. Uses a sentinel empty hash to satisfy the chunks→files FK constraint during InsertChunks, then updates to the real hash only after flushBatch succeeds. A sentinel hash never matches a real SHA-256, so a crash between placeholder write and real-hash commit causes correct re-indexing on the next run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nsions Previously force=true only re-indexed current files without cleaning up deleted files or stale extension records from the database. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The previous 5s timeout was insufficient when the background indexer holds a write transaction during Ollama/LM Studio API calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Prevents inconsistent DB state if process crashes between the DELETE statements during a vector dimension change.

The implementation returns true on error (conservative: assume lock held) but the comment incorrectly said 'Returns false on any error (fail-open)'.

Prevents concurrent SQLite writes that could exceed busy_timeout when force_reindex=true is requested while a background indexer is running.

Adds Close() to indexerCache and defers it in runStdio to prevent SQLite connection and WAL file leaks. Also defers idx.Close() in the CLI index command, and moves signal.NotifyContext before setupIndexer so signals during DB setup trigger graceful cancellation.

Symlinked files could point outside the repo boundary. Files above 10MB (e.g., generated code) could cause OOM when 8 workers read them in parallel.

Uses the same heuristic as git: check for NUL bytes in the first 512 bytes of content. Prevents wasting embedding compute on binary files that happen to have supported extensions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Prevents license headers from being included in function chunks when there is no blank line between the header and the first declaration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When multiple query rules match the same code region, keep only the most specific match. Reduces wasted embedding compute and duplicate search results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extends the existing linguist-generated filter to also exclude vendored code marked with linguist-vendored in .gitattributes.

Respects the user's global gitignore configuration so files like *.swp, .DS_Store, or custom patterns are excluded from indexing.

Prevents silent truncation of minified JS files or other source files with lines exceeding the default 64KB scanner buffer.

Read methods (Search, IsFresh, Status) now take RLock instead of bypassing the mutex entirely. Write methods continue to take Lock.

Replace lumberjack with plain os.OpenFile for log rotation — removes the external dependency. The logger() nil-safe accessor was already removed; add log: discardLog to all indexerCache test fixtures that omitted it to prevent nil-pointer panics when getOrCreate reaches ic.log.Info. Also promote gofrs/flock to a direct dependency in go.mod to match its use in internal/indexlock. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Resolved conflicts from rebasing fix-reindex-again onto origin/main. Kept HEAD's freshness-cache approach (lastCheckedAt / recentlyChecked) over main's ristretto-based cache. Ran go mod tidy to drop the now- unused ristretto, xxhash, go-humanize and dgryski/go-farm deps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cfg.FreshnessTTL was loaded from LUMEN_FRESHNESS_TTL but never passed to indexerCache.freshnessTTL. recentlyChecked() fell back to the hardcoded defaultFreshnessTTL (30s), so the env var had no effect. E2E tests set LUMEN_FRESHNESS_TTL=1s and waited 1.1s expecting TTL expiry — the 30s fallback kept the cache hot and suppressed reindexing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Chunk deduplication and comment-capping changes (added in earlier commits) altered the indexed chunks, causing all-minilm search results to diverge from stored snapshots. Regenerated all snapshots locally with UPDATE_SNAPSHOTS=true against the same all-minilm model used in CI. Also adds -timeout=30m to the unit-test step so CI does not hit Go's default 10-minute limit on slower runners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GitHub Actions kills runners that produce no stdout for ~10 minutes. Without -v, go test only prints a line when an entire package finishes. Slow packages (cmd ~57s, chunker ~107s, index ~107s locally) take 5-10× longer on Ubuntu CI runners, causing a 9-10 minute output gap that triggers the infrastructure shutdown signal. Adding -v makes Go emit === RUN / --- PASS lines for every individual test, keeping the output stream alive throughout the run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sqlite3 CGO compilation takes 8-10 minutes on GitHub's ubuntu-latest runners when the build cache is cold (e.g. after go.sum changes). GitHub Actions kills runners that produce no stdout for ~10 minutes, which causes the job to fail before tests even start. Add a background heartbeat that prints every 60 s, keeping the output stream alive. The trap ensures the heartbeat is cleaned up on exit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

With -v, the test runner emits thousands of === RUN / --- PASS lines. On GitHub's runners the OS pipe buffer fills up, blocking both the go test process and the heartbeat background process. The result is 4-5 minutes of no output reaching GitHub, which triggers the runner shutdown even though the heartbeat is running. Without -v the pipe stays clear, the 60-second heartbeat reaches GitHub reliably, and the stall-detection never fires. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Running go test ./... with default -p=GOMAXPROCS caused multiple packages to write SQLite databases simultaneously, creating extreme I/O contention: - internal/chunker went from 9s (alone) to 110s (parallel): 12x slower - internal/index went from 6.8s (sequential) to 107s (parallel): 15x slower The contention caused the CI runner to produce no output for >10 minutes, triggering GitHub Actions' stall detection and a SIGTERM kill. With -p 1, total local runtime is ~165s with coverage instrumentation. Expected CI runtime: 330-500s (5-8 minutes) — well within the 30m timeout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

CGO compilation of sqlite3+fts5 and 11 tree-sitter grammars takes ~8-10 minutes on a cold cache, causing the Test job to be killed before any tests can run. This creates an infinite loop: the build cache never gets saved because the job is always killed before completion. Fix: - Add ccache (hendrikmuhs/ccache-action) so C compilation is nearly instant on the second and subsequent runs. The action saves the ccache even on job cancellation, breaking the cold-cache loop after a single cold run. - Add timeout-minutes: 30 to the Test job so the first cold run has enough time to populate the ccache and save it before being cancelled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…exer spawnBackgroundIndexer uses os.Executable() to find the binary to spawn. In a test context, os.Executable() returns the test binary itself (cmd.test). Spawning it with \"index <path>\" args causes the test binary to execute all tests (since \"index\" is a positional arg that the test framework ignores), including TestSpawnBackgroundIndexer_DoesNotPanic, which spawns another copy, and so on — an exponential fork-bomb. The symptom was the CI Test job hanging for 8+ minutes (hundreds of test binary processes competing for SQLite locks and CPU) before the runner was killed by the OS/GitHub infrastructure. Fix: add TestMain that detects the subprocess invocation pattern (os.Args[1] == \"index\") and exits cleanly instead of running the full suite. This matches the existing guard in the generateSessionContext tests which already document this exact fork-bomb risk. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

aeneasr and others added 30 commits March 21, 2026 23:45

fix: remove worktrees

e18a0c8

feat(index): acquire flock before indexing, thread ctx, cancel on SIG…

1ea436c

…TERM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(index): write skip message to stderr, document signal-race trade-off

7d4775e

feat(hook): spawn detached background indexer on session start

c02b2b4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(hook): add license headers, smoke test, and Windows stub guidance

a56c3bf

feat(stdio): skip EnsureFresh when background indexer holds lock

3d424db

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: note Windows is not supported (flock dependency)

75e109b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(index): force reindex removes deleted files and purges stale exte…

46d49e9

…nsions Previously force=true only re-indexed current files without cleaning up deleted files or stale extension records from the database. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(store): increase busy_timeout to 30s for slow embedding batches

476b6e1

The previous 5s timeout was insufficient when the background indexer holds a write transaction during Ollama/LM Studio API calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(store): wrap dimension-reset deletes in a transaction

b01305b

Prevents inconsistent DB state if process crashes between the DELETE statements during a vector dimension change.

docs(indexlock): fix IsHeld comment to match fail-closed behavior

49d970d

The implementation returns true on error (conservative: assume lock held) but the comment incorrectly said 'Returns false on any error (fail-open)'.

fix(stdio): skip force reindex when background indexer holds lock

6bf0664

Prevents concurrent SQLite writes that could exceed busy_timeout when force_reindex=true is requested while a background indexer is running.

fix(merkle): skip symlinks and files >10MB during tree walk

113b4fd

Symlinked files could point outside the repo boundary. Files above 10MB (e.g., generated code) could cause OOM when 8 workers read them in parallel.

fix(index): simplify isBinaryContent using slices.Contains

d7284b4

fix(chunker): cap leading comments to 10 lines

1a1ddcd

Prevents license headers from being included in function chunks when there is no blank line between the header and the first declaration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(chunker): deduplicate overlapping tree-sitter chunks

9f7d462

When multiple query rules match the same code region, keep only the most specific match. Reduces wasted embedding compute and duplicate search results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(merkle): support linguist-vendored in .gitattributes

adf44fe

Extends the existing linguist-generated filter to also exclude vendored code marked with linguist-vendored in .gitattributes.

feat(merkle): load global gitignore (core.excludesFile)

1dad39a

Respects the user's global gitignore configuration so files like *.swp, .DS_Store, or custom patterns are excluded from indexing.

fix(stdio): increase scanner buffer to 1MB for long lines

9861dce

Prevents silent truncation of minified JS files or other source files with lines exceeding the default 64KB scanner buffer.

aeneasr and others added 4 commits March 21, 2026 23:51

refactor(index): switch to RWMutex for consistent locking discipline

4b78043

Read methods (Search, IsFresh, Status) now take RLock instead of bypassing the mutex entirely. Write methods continue to take Lock.

fix(lint): wrap deferred Close calls to satisfy errcheck

27ecba7

aeneasr force-pushed the fix-reindex-again branch from 0c4cd48 to 4a5caa4 Compare March 21, 2026 22:58

aeneasr enabled auto-merge (rebase) March 22, 2026 09:18

aeneasr and others added 7 commits March 22, 2026 11:17

aeneasr merged commit 19849aa into main Mar 22, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(index): purge stale donor-seeded file records + freshness TTL cache + debug logging#63

fix(index): purge stale donor-seeded file records + freshness TTL cache + debug logging#63
aeneasr merged 42 commits intomainfrom
fix-reindex-again

aeneasr commented Mar 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aeneasr commented Mar 21, 2026

Summary

Regression Tests (red → green)

Test Plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant