Skip to content

fix(index): purge stale donor-seeded file records + freshness TTL cache + debug logging#63

Merged
aeneasr merged 42 commits intomainfrom
fix-reindex-again
Mar 22, 2026
Merged

fix(index): purge stale donor-seeded file records + freshness TTL cache + debug logging#63
aeneasr merged 42 commits intomainfrom
fix-reindex-again

Conversation

@aeneasr
Copy link
Member

@aeneasr aeneasr commented Mar 21, 2026

Summary

  • Root cause of markdown indexing: When a new worktree index is seeded from a donor (via file copy), the donor's DB may contain file records with extensions that are no longer supported (e.g. .md from an older binary). On every subsequent EnsureFresh, these entries caused phantom "removed" diffs and their file records accumulated in the DB indefinitely.
  • Fix: The purge loop now calls DeleteFileChunks(path) for each stale-extension record, eagerly removing it from the DB rather than just hiding it from the in-memory diff.
  • Freshness TTL cache (LUMEN_FRESHNESS_TTL, default 60s): successive semantic_search calls skip the Merkle tree rebuild when the index was confirmed fresh recently, eliminating unnecessary latency on every search.
  • Structured debug logging to ~/.local/share/lumen/debug.log (configurable via LUMEN_LOG_LEVEL): logs cache hits/misses, seeding decisions, reindex triggers, and per-file diffs.

Regression Tests (red → green)

Four tests cover the failure scenario end-to-end:

Test What it guards
TestIndexer_StaleUnsupportedExtensionNotCountedAsRemoved Stale .md must not appear in RemovedFiles
TestIndexer_StaleUnsupportedExtensionDeletedFromDB Stale .md must be fully purged from DB after reindex
TestIndexer_SupportedFileRemovedFromDisk_CountedAsRemoved Purge filter must not suppress legitimate supported-file removals
TestIndexer_MixedStaleAndValidRemovals Mixed case: only the real .go deletion appears in RemovedFiles

Test Plan

  • go test ./... passes
  • Regression tests added and verified red→green

🤖 Generated with Claude Code

aeneasr and others added 30 commits March 21, 2026 23:45
…ing + freshness TTL cache + debug logging

When a new worktree index is seeded from a donor (another worktree),
the entire donor SQLite database is copied.  If the donor was built by
an older binary that indexed files with now-unsupported extensions (e.g.
.md), those file-hash records are carried over.  On every subsequent
EnsureFresh call the Merkle tree (correctly) excludes those paths, the
incremental diff treats them as "removed", DeleteFileChunks is called
(a no-op – chunks are already gone), but the file record itself is never
deleted, so the same phantom "removed" entry reappears on every search
forever.

Fix: before computing the Merkle diff, purge any oldHash entry whose
extension is not in SupportedExtensions.  This breaks the perpetual
phantom-removal loop.  Also expose AddedFiles/ModifiedFiles/RemovedFiles
and FirstIndex on Stats so callers (and tests) can assert on exactly
what triggered a reindex.

Also adds:
- Freshness TTL cache (LUMEN_FRESHNESS_TTL, default 60s) so successive
  semantic_search calls skip the Merkle tree rebuild when nothing has
  changed, eliminating unnecessary re-index latency on every search.
- Structured JSON debug logging to ~/.local/share/lumen/debug.log
  (LUMEN_LOG_LEVEL env var; default "info") covering cache hits/misses,
  seeding decisions, reindex triggers and file-level diffs.
- LUMEN_FRESHNESS_TTL config field + EnvOrDefaultDuration helper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y delete orphaned DB records

The previous commit only filtered stale unsupported-extension records out
of the in-memory oldHashes map. Without also calling DeleteFileChunks for
those paths, the file records would linger in the DB forever (accumulating
across reindex cycles), and would be returned by GetFileHashes on every
future run — requiring the purge loop to run each time.

Fix: call DeleteFileChunks(path) inside the purge loop so orphaned file
records are removed from the DB eagerly, not just hidden from the diff.

Add three regression tests:
- TestIndexer_StaleUnsupportedExtensionDeletedFromDB: verifies orphaned
  .md records are fully removed from the DB after EnsureFresh, not just
  hidden from the diff.
- TestIndexer_SupportedFileRemovedFromDisk_CountedAsRemoved: ensures the
  stale-extension filter does not suppress legitimate removals of
  supported files deleted from disk.
- TestIndexer_MixedStaleAndValidRemovals: verifies that when both stale
  .md entries and a genuinely deleted .go file are present, only the .go
  file appears in RemovedFiles.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements internal/indexlock with TryAcquire/IsHeld/Release backed by
flock(2) — the OS releases the lock automatically on process death
(including SIGKILL), making stale locks impossible. Windows builds get
no-op stubs that compile cleanly but provide no mutual-exclusion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…TERM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t-path

Pass the already-computed dbPath from handleSemanticSearch into
ensureIndexed instead of recomputing it inside the function, avoiding
a redundant config.DBPathForProject call on every search.

Add TestEnsureIndexed_SkipsWhenLockHeld which spawns a subprocess to
hold the exclusive flock and asserts that ensureIndexed returns
(Reindexed=false, err=nil) without invoking EnsureFresh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pterm.DefaultProgressbar has ShowElapsedTime=true by default, which causes
schedule.Every to spawn a background goroutine that periodically reads
Title via getString(). Concurrently, pterm.Fprint() writes Title via
UpdateTitle() on all active bars whenever any print call is made.

Since flock(2) lacks a mutex on Title/IsActive, the goroutine and Fprint()
can race after Stop() signals but before the goroutine observes IsActive=false.

Disabling ShowElapsedTime prevents the goroutine from being created
entirely, eliminating the race at its source.
…orts

- Replace manual os.OpenFile with gopkg.in/natefinch/lumberjack.v2 for
  automatic log rotation (10 MB / 3 compressed backups).
- Change newDebugLogger return type from *os.File to io.Closer so the
  caller is not coupled to the underlying writer implementation.
- Remove internal/index/repro_test.go (hardcoded personal path, not
  suitable for a shared repo; gated behind //go:build repro but still
  leaked developer state).
- Fix stdlib import ordering in cmd/stdio.go (io before log/slog).
- Run go mod tidy: resolves duplicate testify h1 entries and pins
  lumberjack at v2.2.1 via gopkg.in/natefinch/lumberjack.v2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace platform-specific flock(2) / no-op Windows stubs with
github.com/gofrs/flock which uses flock(2) on Unix and LockFileEx on
Windows. This enables real advisory locking on all platforms.

- Merge lock_unix.go + lock_windows.go into single lock.go
- Implement real Windows spawnBackgroundIndexer with CREATE_NEW_PROCESS_GROUP
- Add go cmd.Wait() in both spawners to reap child processes
- Remove redundant LOCK_UN calls (closing FD releases flocks)
- Add comment on ristretto SetWithTTL async admission
- Fix errcheck lint on f.Close()
- Update README platform support to include Windows

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…worktree

When a search path pointed into an internal .worktrees/ subdirectory,
getOrCreate treated it as an independent worktree project and created a
new index from scratch (Reindexed=true), even though the parent repo was
already indexed.

Root cause: findEffectiveRoot could not walk above the worktree's own
git root (git rev-parse --show-toplevel returns the worktree dir), so it
never found the cached parent index. Additionally, on macOS the symlink
/var → /private/var caused path comparisons to fail prematurely.

Fix: when projectPath is a worktree and preferredRoot has an existing
index (in-memory cache or on-disk DB), use the parent index directly.
Also resolve symlinks in findEffectiveRoot before comparing against the
git root to prevent premature walk termination on macOS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents permanent data loss when embedding API times out or process
crashes mid-batch. Previously the file hash was committed before chunks,
making the file appear 'up to date' with zero searchable chunks.

Uses a sentinel empty hash to satisfy the chunks→files FK constraint
during InsertChunks, then updates to the real hash only after flushBatch
succeeds. A sentinel hash never matches a real SHA-256, so a crash
between placeholder write and real-hash commit causes correct re-indexing
on the next run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nsions

Previously force=true only re-indexed current files without cleaning up
deleted files or stale extension records from the database.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous 5s timeout was insufficient when the background indexer
holds a write transaction during Ollama/LM Studio API calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents inconsistent DB state if process crashes between the DELETE
statements during a vector dimension change.
The implementation returns true on error (conservative: assume lock held)
but the comment incorrectly said 'Returns false on any error (fail-open)'.
Prevents concurrent SQLite writes that could exceed busy_timeout when
force_reindex=true is requested while a background indexer is running.
Adds Close() to indexerCache and defers it in runStdio to prevent
SQLite connection and WAL file leaks. Also defers idx.Close() in the
CLI index command, and moves signal.NotifyContext before setupIndexer
so signals during DB setup trigger graceful cancellation.
Symlinked files could point outside the repo boundary. Files above 10MB
(e.g., generated code) could cause OOM when 8 workers read them in parallel.
Uses the same heuristic as git: check for NUL bytes in the first 512
bytes of content. Prevents wasting embedding compute on binary files
that happen to have supported extensions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents license headers from being included in function chunks when
there is no blank line between the header and the first declaration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When multiple query rules match the same code region, keep only the
most specific match. Reduces wasted embedding compute and duplicate
search results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the existing linguist-generated filter to also exclude vendored
code marked with linguist-vendored in .gitattributes.
Respects the user's global gitignore configuration so files like
*.swp, .DS_Store, or custom patterns are excluded from indexing.
Prevents silent truncation of minified JS files or other source files
with lines exceeding the default 64KB scanner buffer.
aeneasr and others added 4 commits March 21, 2026 23:51
Read methods (Search, IsFresh, Status) now take RLock instead of
bypassing the mutex entirely. Write methods continue to take Lock.
Replace lumberjack with plain os.OpenFile for log rotation — removes the
external dependency. The logger() nil-safe accessor was already removed;
add log: discardLog to all indexerCache test fixtures that omitted it to
prevent nil-pointer panics when getOrCreate reaches ic.log.Info.

Also promote gofrs/flock to a direct dependency in go.mod to match its
use in internal/indexlock.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolved conflicts from rebasing fix-reindex-again onto origin/main.
Kept HEAD's freshness-cache approach (lastCheckedAt / recentlyChecked)
over main's ristretto-based cache. Ran go mod tidy to drop the now-
unused ristretto, xxhash, go-humanize and dgryski/go-farm deps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aeneasr aeneasr force-pushed the fix-reindex-again branch from 0c4cd48 to 4a5caa4 Compare March 21, 2026 22:58
cfg.FreshnessTTL was loaded from LUMEN_FRESHNESS_TTL but never passed
to indexerCache.freshnessTTL. recentlyChecked() fell back to the
hardcoded defaultFreshnessTTL (30s), so the env var had no effect.
E2E tests set LUMEN_FRESHNESS_TTL=1s and waited 1.1s expecting TTL
expiry — the 30s fallback kept the cache hot and suppressed reindexing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aeneasr aeneasr enabled auto-merge (rebase) March 22, 2026 09:18
aeneasr and others added 7 commits March 22, 2026 11:17
Chunk deduplication and comment-capping changes (added in earlier
commits) altered the indexed chunks, causing all-minilm search
results to diverge from stored snapshots.  Regenerated all snapshots
locally with UPDATE_SNAPSHOTS=true against the same all-minilm model
used in CI.

Also adds -timeout=30m to the unit-test step so CI does not hit Go's
default 10-minute limit on slower runners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GitHub Actions kills runners that produce no stdout for ~10 minutes.
Without -v, go test only prints a line when an entire package finishes.
Slow packages (cmd ~57s, chunker ~107s, index ~107s locally) take
5-10× longer on Ubuntu CI runners, causing a 9-10 minute output gap
that triggers the infrastructure shutdown signal.

Adding -v makes Go emit === RUN / --- PASS lines for every individual
test, keeping the output stream alive throughout the run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sqlite3 CGO compilation takes 8-10 minutes on GitHub's ubuntu-latest
runners when the build cache is cold (e.g. after go.sum changes).
GitHub Actions kills runners that produce no stdout for ~10 minutes,
which causes the job to fail before tests even start.

Add a background heartbeat that prints every 60 s, keeping the output
stream alive. The trap ensures the heartbeat is cleaned up on exit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With -v, the test runner emits thousands of === RUN / --- PASS lines.
On GitHub's runners the OS pipe buffer fills up, blocking both the
go test process and the heartbeat background process.  The result is
4-5 minutes of no output reaching GitHub, which triggers the runner
shutdown even though the heartbeat is running.

Without -v the pipe stays clear, the 60-second heartbeat reaches
GitHub reliably, and the stall-detection never fires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Running go test ./... with default -p=GOMAXPROCS caused multiple packages to
write SQLite databases simultaneously, creating extreme I/O contention:
- internal/chunker went from 9s (alone) to 110s (parallel): 12x slower
- internal/index went from 6.8s (sequential) to 107s (parallel): 15x slower

The contention caused the CI runner to produce no output for >10 minutes,
triggering GitHub Actions' stall detection and a SIGTERM kill.

With -p 1, total local runtime is ~165s with coverage instrumentation.
Expected CI runtime: 330-500s (5-8 minutes) — well within the 30m timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CGO compilation of sqlite3+fts5 and 11 tree-sitter grammars takes ~8-10
minutes on a cold cache, causing the Test job to be killed before any tests
can run. This creates an infinite loop: the build cache never gets saved
because the job is always killed before completion.

Fix:
- Add ccache (hendrikmuhs/ccache-action) so C compilation is nearly instant
  on the second and subsequent runs. The action saves the ccache even on
  job cancellation, breaking the cold-cache loop after a single cold run.
- Add timeout-minutes: 30 to the Test job so the first cold run has enough
  time to populate the ccache and save it before being cancelled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…exer

spawnBackgroundIndexer uses os.Executable() to find the binary to spawn.
In a test context, os.Executable() returns the test binary itself (cmd.test).
Spawning it with \"index <path>\" args causes the test binary to execute all
tests (since \"index\" is a positional arg that the test framework ignores),
including TestSpawnBackgroundIndexer_DoesNotPanic, which spawns another copy,
and so on — an exponential fork-bomb.

The symptom was the CI Test job hanging for 8+ minutes (hundreds of test
binary processes competing for SQLite locks and CPU) before the runner was
killed by the OS/GitHub infrastructure.

Fix: add TestMain that detects the subprocess invocation pattern
(os.Args[1] == \"index\") and exits cleanly instead of running the full suite.
This matches the existing guard in the generateSessionContext tests which
already document this exact fork-bomb risk.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aeneasr aeneasr merged commit 19849aa into main Mar 22, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant