fix(ingestor): neighbor-builder delta scan + watermark — recovers 97% packet loss from #1289 (fixes #1339)#1341
Merged
Merged
Conversation
added 2 commits
May 24, 2026 03:11
…1339) Adds TestNeighborEdgesBuilderDeltaScan: seeds 100k observations with monotonic timestamps, runs warm-up (full scan allowed), then asserts: 1. Second build with NO new observations is a no-op (<1s). 2. After K new observations with timestamps > MAX(neighbor_edges.last_seen), next build upserts exactly K edges in <1s. 3. MAX(last_seen) advances strictly. Watermark derived from MAX(neighbor_edges.last_seen) — neighbor_edges itself is the persistence, no new metadata table or in-memory field. Current builder issues an unbounded SELECT, so the empty-delta assertion fails (sees all 200k baseline-seeded edges again). Refs #1339
…t_seen) watermark (#1339) GREEN. buildAndPersistNeighborEdges now: * Derives a watermark from MAX(neighbor_edges.last_seen) on every call. Empty edges table → watermark 0 → full warm-up scan (preserves the synchronous-warm-up intent of #1289). * Subsequent calls restrict the SELECT with 'WHERE o.timestamp > ? ORDER BY o.timestamp LIMIT 50000', so per-tick work is O(new-observations), never O(all-observations). * The 50k row cap prevents a single tick from monopolising the SQLite single-writer (#1339 root cause). Backlog drains across successive ticks instead. StartNeighborEdgesBuilder: * Warm-up loops the builder until it returns < cap so the first server snapshot load sees a fully-populated edges table even on fresh DBs. * Per-tick wallclock is logged ('tick: N edges in DUR') and a SLOW tick (>5s) is logged loudly so operators can spot a regression of #1339 immediately. Broader instrumentation tracked in #1340. Output schema unchanged — server's neighbor_recomputer.go is unaffected. neighbor_edges itself is the persistence; no metadata table. Trade-off (documented in code): an anomalously-old observation that lands after its timestamp has been crossed will be skipped. Acceptable for an approximate neighbor graph; periodic full-rebuild can land later. Test: TestNeighborEdgesBuilderDeltaScan now passes (warm-up full scan of 100k rows, no-op tick <1s, 100-row delta tick <500ms). Mutation (revert WHERE clause) reproduces the lock-starvation cliff and the test fails by timeout/lock-contention. Fixes #1339
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR #1289 moved neighbor-graph construction into the ingestor with a 60s ticker.
buildAndPersistNeighborEdgesthen issued an unboundedSELECT … FROM observations o JOIN transmissions t …every tick. On staging (3.7M observations) one tick took ~2 minutes; withmax_open_conns=1, the SQLite single-writer was held continuously and MQTT ingest collapsed (~6,500 tx/day → ~180 tx/day, 97% loss).Fix
Watermark-bounded delta scan. Each call derives the watermark from
MAX(neighbor_edges.last_seen)and restricts the SELECT toWHERE o.timestamp > ? ORDER BY o.timestamp LIMIT 50000.neighbor_edgesitself is the persistence — no new metadata table, no in-memory state, restarts resume cleanly from whatever the table reflects.tick: N edges in DUR); a tick >5s is logged loudly as a possible regression of bug(ingestor): neighbor-builder full-table scan every 60s holds SQLite write lock → 97% packet drop #1339. Broader instrumentation is tracked in feat(observability): SQLite write-lock wait/hold instrumentation per component — make starvation detectable #1340.neighbor_recomputer.gois unaffected.Trade-off
An anomalously-old observation that arrives after its timestamp has been crossed by the watermark will be skipped. Acceptable for an approximate neighbor graph; a periodic full-rebuild can land later if needed.
TDD
d88e2522):TestNeighborEdgesBuilderDeltaScanseeds 100k observations, asserts an empty-delta tick is a no-op (<1s), and a 100-row delta is upserted in <500ms with no rescan of baseline rows. Baseline builder fails the empty-delta assertion (sees all 200k baseline edges).cf6fbb4e): watermark + LIMIT — all assertions pass.WHERE o.timestamp > ?clause → the test hangs to lock-contention timeout, confirming the WHERE actually gates the behavior.Benchmark (synthetic, 100k observations, local sqlite)
Staging projection: 2–3 min ticks → <1s ticks; SQLite writer freed for MQTT ingest.
Fixes #1339