fix(index): purge stale donor-seeded file records + freshness TTL cache + debug logging#63
Merged
fix(index): purge stale donor-seeded file records + freshness TTL cache + debug logging#63
Conversation
…ing + freshness TTL cache + debug logging When a new worktree index is seeded from a donor (another worktree), the entire donor SQLite database is copied. If the donor was built by an older binary that indexed files with now-unsupported extensions (e.g. .md), those file-hash records are carried over. On every subsequent EnsureFresh call the Merkle tree (correctly) excludes those paths, the incremental diff treats them as "removed", DeleteFileChunks is called (a no-op – chunks are already gone), but the file record itself is never deleted, so the same phantom "removed" entry reappears on every search forever. Fix: before computing the Merkle diff, purge any oldHash entry whose extension is not in SupportedExtensions. This breaks the perpetual phantom-removal loop. Also expose AddedFiles/ModifiedFiles/RemovedFiles and FirstIndex on Stats so callers (and tests) can assert on exactly what triggered a reindex. Also adds: - Freshness TTL cache (LUMEN_FRESHNESS_TTL, default 60s) so successive semantic_search calls skip the Merkle tree rebuild when nothing has changed, eliminating unnecessary re-index latency on every search. - Structured JSON debug logging to ~/.local/share/lumen/debug.log (LUMEN_LOG_LEVEL env var; default "info") covering cache hits/misses, seeding decisions, reindex triggers and file-level diffs. - LUMEN_FRESHNESS_TTL config field + EnvOrDefaultDuration helper. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y delete orphaned DB records The previous commit only filtered stale unsupported-extension records out of the in-memory oldHashes map. Without also calling DeleteFileChunks for those paths, the file records would linger in the DB forever (accumulating across reindex cycles), and would be returned by GetFileHashes on every future run — requiring the purge loop to run each time. Fix: call DeleteFileChunks(path) inside the purge loop so orphaned file records are removed from the DB eagerly, not just hidden from the diff. Add three regression tests: - TestIndexer_StaleUnsupportedExtensionDeletedFromDB: verifies orphaned .md records are fully removed from the DB after EnsureFresh, not just hidden from the diff. - TestIndexer_SupportedFileRemovedFromDisk_CountedAsRemoved: ensures the stale-extension filter does not suppress legitimate removals of supported files deleted from disk. - TestIndexer_MixedStaleAndValidRemovals: verifies that when both stale .md entries and a genuinely deleted .go file are present, only the .go file appears in RemovedFiles. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements internal/indexlock with TryAcquire/IsHeld/Release backed by flock(2) — the OS releases the lock automatically on process death (including SIGKILL), making stale locks impossible. Windows builds get no-op stubs that compile cleanly but provide no mutual-exclusion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…TERM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t-path Pass the already-computed dbPath from handleSemanticSearch into ensureIndexed instead of recomputing it inside the function, avoiding a redundant config.DBPathForProject call on every search. Add TestEnsureIndexed_SkipsWhenLockHeld which spawns a subprocess to hold the exclusive flock and asserts that ensureIndexed returns (Reindexed=false, err=nil) without invoking EnsureFresh. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pterm.DefaultProgressbar has ShowElapsedTime=true by default, which causes schedule.Every to spawn a background goroutine that periodically reads Title via getString(). Concurrently, pterm.Fprint() writes Title via UpdateTitle() on all active bars whenever any print call is made. Since flock(2) lacks a mutex on Title/IsActive, the goroutine and Fprint() can race after Stop() signals but before the goroutine observes IsActive=false. Disabling ShowElapsedTime prevents the goroutine from being created entirely, eliminating the race at its source.
…orts - Replace manual os.OpenFile with gopkg.in/natefinch/lumberjack.v2 for automatic log rotation (10 MB / 3 compressed backups). - Change newDebugLogger return type from *os.File to io.Closer so the caller is not coupled to the underlying writer implementation. - Remove internal/index/repro_test.go (hardcoded personal path, not suitable for a shared repo; gated behind //go:build repro but still leaked developer state). - Fix stdlib import ordering in cmd/stdio.go (io before log/slog). - Run go mod tidy: resolves duplicate testify h1 entries and pins lumberjack at v2.2.1 via gopkg.in/natefinch/lumberjack.v2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace platform-specific flock(2) / no-op Windows stubs with github.com/gofrs/flock which uses flock(2) on Unix and LockFileEx on Windows. This enables real advisory locking on all platforms. - Merge lock_unix.go + lock_windows.go into single lock.go - Implement real Windows spawnBackgroundIndexer with CREATE_NEW_PROCESS_GROUP - Add go cmd.Wait() in both spawners to reap child processes - Remove redundant LOCK_UN calls (closing FD releases flocks) - Add comment on ristretto SetWithTTL async admission - Fix errcheck lint on f.Close() - Update README platform support to include Windows Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…worktree When a search path pointed into an internal .worktrees/ subdirectory, getOrCreate treated it as an independent worktree project and created a new index from scratch (Reindexed=true), even though the parent repo was already indexed. Root cause: findEffectiveRoot could not walk above the worktree's own git root (git rev-parse --show-toplevel returns the worktree dir), so it never found the cached parent index. Additionally, on macOS the symlink /var → /private/var caused path comparisons to fail prematurely. Fix: when projectPath is a worktree and preferredRoot has an existing index (in-memory cache or on-disk DB), use the parent index directly. Also resolve symlinks in findEffectiveRoot before comparing against the git root to prevent premature walk termination on macOS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents permanent data loss when embedding API times out or process crashes mid-batch. Previously the file hash was committed before chunks, making the file appear 'up to date' with zero searchable chunks. Uses a sentinel empty hash to satisfy the chunks→files FK constraint during InsertChunks, then updates to the real hash only after flushBatch succeeds. A sentinel hash never matches a real SHA-256, so a crash between placeholder write and real-hash commit causes correct re-indexing on the next run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nsions Previously force=true only re-indexed current files without cleaning up deleted files or stale extension records from the database. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous 5s timeout was insufficient when the background indexer holds a write transaction during Ollama/LM Studio API calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents inconsistent DB state if process crashes between the DELETE statements during a vector dimension change.
The implementation returns true on error (conservative: assume lock held) but the comment incorrectly said 'Returns false on any error (fail-open)'.
Prevents concurrent SQLite writes that could exceed busy_timeout when force_reindex=true is requested while a background indexer is running.
Adds Close() to indexerCache and defers it in runStdio to prevent SQLite connection and WAL file leaks. Also defers idx.Close() in the CLI index command, and moves signal.NotifyContext before setupIndexer so signals during DB setup trigger graceful cancellation.
Symlinked files could point outside the repo boundary. Files above 10MB (e.g., generated code) could cause OOM when 8 workers read them in parallel.
Uses the same heuristic as git: check for NUL bytes in the first 512 bytes of content. Prevents wasting embedding compute on binary files that happen to have supported extensions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents license headers from being included in function chunks when there is no blank line between the header and the first declaration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When multiple query rules match the same code region, keep only the most specific match. Reduces wasted embedding compute and duplicate search results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the existing linguist-generated filter to also exclude vendored code marked with linguist-vendored in .gitattributes.
Respects the user's global gitignore configuration so files like *.swp, .DS_Store, or custom patterns are excluded from indexing.
Prevents silent truncation of minified JS files or other source files with lines exceeding the default 64KB scanner buffer.
Read methods (Search, IsFresh, Status) now take RLock instead of bypassing the mutex entirely. Write methods continue to take Lock.
Replace lumberjack with plain os.OpenFile for log rotation — removes the external dependency. The logger() nil-safe accessor was already removed; add log: discardLog to all indexerCache test fixtures that omitted it to prevent nil-pointer panics when getOrCreate reaches ic.log.Info. Also promote gofrs/flock to a direct dependency in go.mod to match its use in internal/indexlock. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolved conflicts from rebasing fix-reindex-again onto origin/main. Kept HEAD's freshness-cache approach (lastCheckedAt / recentlyChecked) over main's ristretto-based cache. Ran go mod tidy to drop the now- unused ristretto, xxhash, go-humanize and dgryski/go-farm deps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
0c4cd48 to
4a5caa4
Compare
cfg.FreshnessTTL was loaded from LUMEN_FRESHNESS_TTL but never passed to indexerCache.freshnessTTL. recentlyChecked() fell back to the hardcoded defaultFreshnessTTL (30s), so the env var had no effect. E2E tests set LUMEN_FRESHNESS_TTL=1s and waited 1.1s expecting TTL expiry — the 30s fallback kept the cache hot and suppressed reindexing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Chunk deduplication and comment-capping changes (added in earlier commits) altered the indexed chunks, causing all-minilm search results to diverge from stored snapshots. Regenerated all snapshots locally with UPDATE_SNAPSHOTS=true against the same all-minilm model used in CI. Also adds -timeout=30m to the unit-test step so CI does not hit Go's default 10-minute limit on slower runners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GitHub Actions kills runners that produce no stdout for ~10 minutes. Without -v, go test only prints a line when an entire package finishes. Slow packages (cmd ~57s, chunker ~107s, index ~107s locally) take 5-10× longer on Ubuntu CI runners, causing a 9-10 minute output gap that triggers the infrastructure shutdown signal. Adding -v makes Go emit === RUN / --- PASS lines for every individual test, keeping the output stream alive throughout the run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sqlite3 CGO compilation takes 8-10 minutes on GitHub's ubuntu-latest runners when the build cache is cold (e.g. after go.sum changes). GitHub Actions kills runners that produce no stdout for ~10 minutes, which causes the job to fail before tests even start. Add a background heartbeat that prints every 60 s, keeping the output stream alive. The trap ensures the heartbeat is cleaned up on exit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With -v, the test runner emits thousands of === RUN / --- PASS lines. On GitHub's runners the OS pipe buffer fills up, blocking both the go test process and the heartbeat background process. The result is 4-5 minutes of no output reaching GitHub, which triggers the runner shutdown even though the heartbeat is running. Without -v the pipe stays clear, the 60-second heartbeat reaches GitHub reliably, and the stall-detection never fires. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Running go test ./... with default -p=GOMAXPROCS caused multiple packages to write SQLite databases simultaneously, creating extreme I/O contention: - internal/chunker went from 9s (alone) to 110s (parallel): 12x slower - internal/index went from 6.8s (sequential) to 107s (parallel): 15x slower The contention caused the CI runner to produce no output for >10 minutes, triggering GitHub Actions' stall detection and a SIGTERM kill. With -p 1, total local runtime is ~165s with coverage instrumentation. Expected CI runtime: 330-500s (5-8 minutes) — well within the 30m timeout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CGO compilation of sqlite3+fts5 and 11 tree-sitter grammars takes ~8-10 minutes on a cold cache, causing the Test job to be killed before any tests can run. This creates an infinite loop: the build cache never gets saved because the job is always killed before completion. Fix: - Add ccache (hendrikmuhs/ccache-action) so C compilation is nearly instant on the second and subsequent runs. The action saves the ccache even on job cancellation, breaking the cold-cache loop after a single cold run. - Add timeout-minutes: 30 to the Test job so the first cold run has enough time to populate the ccache and save it before being cancelled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…exer spawnBackgroundIndexer uses os.Executable() to find the binary to spawn. In a test context, os.Executable() returns the test binary itself (cmd.test). Spawning it with \"index <path>\" args causes the test binary to execute all tests (since \"index\" is a positional arg that the test framework ignores), including TestSpawnBackgroundIndexer_DoesNotPanic, which spawns another copy, and so on — an exponential fork-bomb. The symptom was the CI Test job hanging for 8+ minutes (hundreds of test binary processes competing for SQLite locks and CPU) before the runner was killed by the OS/GitHub infrastructure. Fix: add TestMain that detects the subprocess invocation pattern (os.Args[1] == \"index\") and exits cleanly instead of running the full suite. This matches the existing guard in the generateSessionContext tests which already document this exact fork-bomb risk. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
.mdfrom an older binary). On every subsequentEnsureFresh, these entries caused phantom "removed" diffs and their file records accumulated in the DB indefinitely.DeleteFileChunks(path)for each stale-extension record, eagerly removing it from the DB rather than just hiding it from the in-memory diff.LUMEN_FRESHNESS_TTL, default 60s): successivesemantic_searchcalls skip the Merkle tree rebuild when the index was confirmed fresh recently, eliminating unnecessary latency on every search.~/.local/share/lumen/debug.log(configurable viaLUMEN_LOG_LEVEL): logs cache hits/misses, seeding decisions, reindex triggers, and per-file diffs.Regression Tests (red → green)
Four tests cover the failure scenario end-to-end:
TestIndexer_StaleUnsupportedExtensionNotCountedAsRemoved.mdmust not appear inRemovedFilesTestIndexer_StaleUnsupportedExtensionDeletedFromDB.mdmust be fully purged from DB after reindexTestIndexer_SupportedFileRemovedFromDisk_CountedAsRemovedTestIndexer_MixedStaleAndValidRemovals.godeletion appears inRemovedFilesTest Plan
go test ./...passes🤖 Generated with Claude Code