Summary
The #1339 root cause (neighbor-builder write-lock starvation) was invisible to operators for ~3 days. We need first-class instrumentation for SQLite write-lock acquisition latency + duration so the next regression of this class is detectable from operator metrics.
Why this matters now
What to add
Wrap every Exec/tx.Begin on the writer connection in instrumentation:
- wait_ms histogram: time spent waiting for the write lock before the query actually starts
- hold_ms histogram: time the writer holds the lock during the query
- contention_total counter: count of writes that had to wait > 100ms
- slow_writer_log: log writer name + duration whenever hold_ms > 500ms (configurable)
Expose via the existing /api/perf endpoint structure (already has per-component metrics from #1123).
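For the /api/perf side, the recorded samples need to be reduced to percentiles. The sketch below shows one way to do that with the nearest-rank method. The `percentile` and `summarize` helpers are hypothetical, and the flat key layout is an assumption rather than the existing /api/perf schema from #1123.

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of an
// already sorted slice, or 0 when there are no samples.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return 0
	}
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// summarize builds keys such as "db.writer.wait_ms_p50" from raw samples.
// The input slice is copied before sorting, so the caller's data is untouched.
func summarize(prefix string, samples []float64) map[string]float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	return map[string]float64{
		prefix + "_p50": percentile(sorted, 50),
		prefix + "_p95": percentile(sorted, 95),
		prefix + "_p99": percentile(sorted, 99),
	}
}
```

Calling `summarize("db.writer.wait_ms", samples)` once per component tag would give the per-component breakdown; how those maps nest inside the /api/perf response is left open here.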
Per-writer attribution
The infra to distinguish writers exists in #1123's per-component IO metrics. Use the same component tags:
- neighbor_builder
- mqtt_handler (transmission/observation inserts)
- prune_packets, prune_observers, prune_metrics
- mbcap_persist
- vacuum
Acceptance
- /api/perf includes db.writer.wait_ms_p50/p95/p99 and db.writer.hold_ms_p50/p95/p99 per component tag
- Slow-writer log line: [db-slow-writer] component=X duration=Y query=<truncated>
Out of scope
- Replacing SQLite with a multi-writer DB
- Per-statement timing (too granular)
- Cross-process tracing
References
- cmd/ingestor/db.go: cachedRW / Exec call sites