fix(ingestor): neighbor-builder delta scan + watermark — recovers 97% packet loss from #1289 (fixes #1339)#1341

Merged. Kpa-clawbot merged 2 commits into master from fix/issue-1339 on May 24, 2026

@Kpa-clawbot
Owner

Summary

PR #1289 moved neighbor-graph construction into the ingestor with a 60s ticker. buildAndPersistNeighborEdges then issued an unbounded SELECT … FROM observations o JOIN transmissions t … every tick. On staging (3.7M observations) one tick took ~2 minutes; with max_open_conns=1, the SQLite single-writer was held continuously and MQTT ingest collapsed (~6,500 tx/day → ~180 tx/day, 97% loss).

Fix

Watermark-bounded delta scan. Each call derives the watermark from MAX(neighbor_edges.last_seen) and restricts the SELECT to WHERE o.timestamp > ? ORDER BY o.timestamp LIMIT 50000. neighbor_edges itself is the persistence — no new metadata table, no in-memory state, restarts resume cleanly from whatever the table reflects.
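
The delta scan described above can be sketched in memory, without SQLite. This is only an illustration under stated assumptions: `Observation`, `EdgeKey`, `EdgeStore`, `DeltaScan`, and `BuildOnce` are hypothetical names that do not appear in the PR, a map stands in for `neighbor_edges`, and timestamps are assumed to be positive and distinct.

```go
package main

import "sort"

// Observation is a hypothetical stand-in for one joined observations row.
type Observation struct {
	Timestamp int64
	From, To  string
}

// EdgeKey identifies an edge in the hypothetical in-memory store.
type EdgeKey struct{ From, To string }

// EdgeStore stands in for neighbor_edges: edge -> last_seen.
type EdgeStore map[EdgeKey]int64

// Watermark mirrors SELECT MAX(last_seen) FROM neighbor_edges.
// An empty store yields 0, which makes the first call a full warm-up scan.
func (s EdgeStore) Watermark() int64 {
	var wm int64
	for _, ts := range s {
		if ts > wm {
			wm = ts
		}
	}
	return wm
}

// DeltaScan mirrors WHERE o.timestamp > ? ORDER BY o.timestamp LIMIT ?.
func DeltaScan(obs []Observation, watermark int64, limit int) []Observation {
	var delta []Observation
	for _, o := range obs {
		if o.Timestamp > watermark {
			delta = append(delta, o)
		}
	}
	sort.Slice(delta, func(i, j int) bool {
		return delta[i].Timestamp < delta[j].Timestamp
	})
	if len(delta) > limit {
		delta = delta[:limit]
	}
	return delta
}

// BuildOnce runs one tick: scan the delta, upsert last_seen per edge,
// and return how many observations were processed.
func BuildOnce(store EdgeStore, obs []Observation, limit int) int {
	delta := DeltaScan(obs, store.Watermark(), limit)
	for _, o := range delta {
		k := EdgeKey{From: o.From, To: o.To}
		if o.Timestamp > store[k] {
			store[k] = o.Timestamp
		}
	}
	return len(delta)
}
```

Because the watermark is recomputed from the store on every call, a restart needs no extra state: calling `BuildOnce` again with an unchanged observation set processes zero rows, and a backlog larger than the limit drains over successive calls.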

Trade-off

An anomalously-old observation that arrives after its timestamp has been crossed by the watermark will be skipped. Acceptable for an approximate neighbor graph; a periodic full-rebuild can land later if needed.
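
To make the trade-off concrete, the sketch below lists which timestamps in an incoming batch a strict `timestamp > watermark` filter would never pick up. `lateArrivals` is a hypothetical helper, not part of the PR; it could back a counter if the skipped volume ever needs measuring.

```go
package main

// lateArrivals returns the timestamps in batch that a watermark-bounded scan
// (timestamp > watermark) would skip. Rows equal to the watermark are
// included, because the comparison is strict.
func lateArrivals(batch []int64, watermark int64) []int64 {
	var skipped []int64
	for _, ts := range batch {
		if ts <= watermark {
			skipped = append(skipped, ts)
		}
	}
	return skipped
}
```

For example, with a watermark of 100, a batch holding timestamps 90, 101, and 100 would have 90 and 100 skipped and only 101 scanned.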

TDD

  • RED (d88e2522): TestNeighborEdgesBuilderDeltaScan seeds 100k observations, asserts an empty-delta tick is a no-op (<1s), and a 100-row delta is upserted in <500ms with no rescan of baseline rows. Baseline builder fails the empty-delta assertion (sees all 200k baseline edges).
  • GREEN (cf6fbb4e): watermark + LIMIT — all assertions pass.
  • Mutation: reverting the WHERE o.timestamp > ? clause makes the test hang until the lock-contention timeout, confirming that the WHERE clause actually gates the behavior.

Benchmark (synthetic, 100k observations, local SQLite)

| Scan | Duration |
| --- | --- |
| Baseline builder, full scan every tick | ~40s |
| Patched builder, empty-delta tick | <50ms |
| Patched builder, 100-row delta | <50ms |

Staging projection: 2–3 min ticks → <1s ticks; SQLite writer freed for MQTT ingest.

Fixes #1339

openclaw-bot added 2 commits May 24, 2026 03:11
…1339)

Adds TestNeighborEdgesBuilderDeltaScan: seeds 100k observations
with monotonic timestamps, runs warm-up (full scan allowed), then
asserts:

  1. Second build with NO new observations is a no-op (<1s).
  2. After K new observations with timestamps > MAX(neighbor_edges.last_seen),
     next build upserts exactly K edges in <1s.
  3. MAX(last_seen) advances strictly.

Watermark derived from MAX(neighbor_edges.last_seen) — neighbor_edges
itself is the persistence, no new metadata table or in-memory field.

Current builder issues an unbounded SELECT, so the empty-delta
assertion fails (sees all 200k baseline-seeded edges again).

Refs #1339
…t_seen) watermark (#1339)

GREEN. buildAndPersistNeighborEdges now:

  * Derives a watermark from MAX(neighbor_edges.last_seen) on every
    call. Empty edges table → watermark 0 → full warm-up scan
    (preserves the synchronous-warm-up intent of #1289).
  * Subsequent calls restrict the SELECT with
    'WHERE o.timestamp > ? ORDER BY o.timestamp LIMIT 50000', so
    per-tick work is O(new-observations), never O(all-observations).
  * The 50k row cap prevents a single tick from monopolising the
    SQLite single-writer (#1339 root cause). Backlog drains across
    successive ticks instead.

StartNeighborEdgesBuilder:

  * Warm-up loops the builder until it returns < cap so the first
    server snapshot load sees a fully-populated edges table even on
    fresh DBs.
  * Per-tick wallclock is logged ('tick: N edges in DUR') and a SLOW
    tick (>5s) is logged loudly so operators can spot a regression
    of #1339 immediately. Broader instrumentation tracked in #1340.
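
The warm-up loop and slow-tick logging described above might look roughly like the following. This is a sketch under assumptions: `buildFunc`, `warmUp`, and `runTick` are hypothetical names, the builder is assumed to return the number of rows it processed, and the log wording only approximates the 'tick: N edges in DUR' pattern quoted in this commit message.

```go
package main

import (
	"log"
	"time"
)

const (
	batchCap = 50000           // per-tick row cap named in the PR
	slowTick = 5 * time.Second // threshold for the loud SLOW log
)

// buildFunc is a hypothetical signature for one builder call:
// it returns how many rows were processed.
type buildFunc func() (int, error)

// warmUp calls build until it returns fewer rows than limit,
// which signals that the backlog has been drained.
func warmUp(build buildFunc, limit int) (int, error) {
	total := 0
	for {
		n, err := build()
		if err != nil {
			return total, err
		}
		total += n
		if n < limit {
			return total, nil
		}
	}
}

// runTick times one builder call, logs it, and reports whether it was slow.
func runTick(build buildFunc) (int, bool, error) {
	start := time.Now()
	n, err := build()
	dur := time.Since(start)
	if err != nil {
		return n, false, err
	}
	log.Printf("tick: %d edges in %s", n, dur)
	slow := dur > slowTick
	if slow {
		log.Printf("SLOW tick: %s exceeds %s", dur, slowTick)
	}
	return n, slow, nil
}
```

A backlog that is an exact multiple of the cap costs one extra call that processes zero rows, which is cheap once the delta is empty.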

Output schema unchanged — server's neighbor_recomputer.go is unaffected.
neighbor_edges itself is the persistence; no metadata table.

Trade-off (documented in code): an anomalously-old observation that
lands after its timestamp has been crossed will be skipped. Acceptable
for an approximate neighbor graph; periodic full-rebuild can land later.

Test: TestNeighborEdgesBuilderDeltaScan now passes (warm-up full scan
of 100k rows, no-op tick <1s, 100-row delta tick <500ms). Mutation
(revert WHERE clause) reproduces the lock-starvation cliff and the
test fails by timeout/lock-contention.

Fixes #1339
@Kpa-clawbot Kpa-clawbot merged commit eeddf46 into master May 24, 2026
6 checks passed
@Kpa-clawbot Kpa-clawbot deleted the fix/issue-1339 branch May 24, 2026 03:54

Development

Successfully merging this pull request may close these issues.

bug(ingestor): neighbor-builder full-table scan every 60s holds SQLite write lock → 97% packet drop