fix(mqtt): persistent session + parallel handler — paho receiving 200× more messages (fixes #1337) by Kpa-clawbot · Pull Request #1338 · Kpa-clawbot/CoreScope

Kpa-clawbot · 2026-05-24T03:01:20Z

Red: 2fd579bc — assertion failures on CleanSession/ClientID/Order (CI link will be added after creation).
Green: 4ea12087 — buildMQTTOpts now sets the four paho opts below.

Problem

Staging ingestor received ~7 msg/h from mqtt2.wcmesh.com; mosquitto_sub against the same broker / creds / topics inside the same container received ~6720 msg/h — a 200× gap. Prod (same creds, presumably warmer session) sees 21k/h. Issue: paho client misconfiguration was silently dropping the backlog on every reconnect.

Root cause (hypothesis 1 + 5 from the triage)

paho defaults bit us:

CleanSession=true + empty ClientID (paho generates a fresh random one per reconnect) → broker treats every dial as a brand-new session, discards the queued backlog from the previous disconnect. Combined with the bug(ingestor): MQTT client does not reconnect after disconnect — prod stayed down 6+ hours on 2026-05-15 #1212/fix(#1212): MQTT per-attempt logging + stall watchdog — prevent silent reconnect-loop death #1216 watchdog reconnecting every ~5 min, ~99% of messages were lost.
Order=true (paho default) serializes the default publish handler — one slow packet stalled all others under burst load.
KeepAlive was never set explicitly, leaving paho relying on OS keepalive for half-open TCP detection.

Fix — `cmd/ingestor/main.go` `buildMQTTOpts`

SetClientID("corescope-ingestor-<hostname>-<source-tag>") — persistent, unique per source, stable across restarts.
SetCleanSession(false) — broker retains subscription state across reconnects and replays the queued messages we missed.
SetKeepAlive(30 * time.Second) — paho-level half-open detection.
SetOrderMatters(false) — parallel handler dispatch.

Watchdog (#1212/#1216) untouched. MaxReconnectInterval=30s unchanged — no reconnect storm.

Tests

Three new tests in cmd/ingestor/mqtt_session_test.go:

TestBuildMQTTOpts_PersistentSession_Issue1337 — pins the four opts above.
TestBuildMQTTOpts_ClientIDStableAcrossBuilds_Issue1337 — ClientID stable across two builds (otherwise reconnect = new session).
TestBuildMQTTOpts_ClientIDUniquePerSource_Issue1337 — distinct sources get distinct ClientIDs (duplicate IDs cause the broker to disconnect the older session, infinite flap).

Red commit (2fd579bc) verified: assertion failures on CleanSession / empty ClientID / Order — not a build error. Green commit (4ea12087) flips all three to pass. Full cd cmd/ingestor && go test ./... passes locally (57s).

Staging verification (post-merge, manual)

After deploy: SSH staging, docker logs corescope-staging-go | grep "\[stats\]" | tail -1 10 min after restart. Expect tx_dupes growth at ~1000/min (matching mosquitto_sub rate), not 30/min.

Preflight

bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master → exit 0 (all gates clean).

Fixes #1337

Three tests that fail on master: - TestBuildMQTTOpts_PersistentSession_Issue1337 — asserts CleanSession=false, non-empty ClientID embedding hostname+source name, KeepAlive=30s, Order=false - TestBuildMQTTOpts_ClientIDStableAcrossBuilds_Issue1337 — same source name + hostname must yield identical ClientID across two builds (otherwise reconnect = new session = broker drops the backlog) - TestBuildMQTTOpts_ClientIDUniquePerSource_Issue1337 — distinct source names must yield distinct ClientIDs (duplicate ClientID = broker disconnects the older session, infinite flap) Refs #1337

paho defaults (CleanSession=true, empty random ClientID per reconnect, Order=true) caused the staging ingestor to receive ~7 msg/h while mosquitto_sub on the same broker/creds/topics received ~6720/h — a 200x gap. Every watchdog-driven reconnect (~every 5min) made the broker treat us as a brand-new session and drop the queued backlog. buildMQTTOpts now sets: - SetClientID("corescope-ingestor-<hostname>-<source-tag>") persistent + unique across sources, stable across restarts - SetCleanSession(false) broker keeps subscription state across reconnects and replays the backlog we missed - SetKeepAlive(30 * time.Second) paho-level half-open detection (was unset; relying on OS keepalive) - SetOrderMatters(false) handler dispatch is parallel; one slow packet no longer stalls all others under burst load The existing watchdog (#1212/#1216) is untouched. Reconnect throttle (MaxReconnectInterval=30s) is unchanged — no reconnect storm. Fixes #1337

Kpa-clawbot · 2026-05-24T03:05:04Z

Closing — wrong diagnosis on my part. Real root cause is #1339: neighbor-builder full-table scan every 60s holds the SQLite write lock and starves the MQTT handler. The paho client itself is fine; messages do arrive at the handler but can't be persisted because the write lock is held by the builder.

The changes here (CleanSession=false + persistent ClientID + OrderMatters=false) aren't WRONG — they're general MQTT hygiene improvements — but they don't address the actual symptom and risk introducing other behavior changes without need. If we want to ship the hygiene changes later as a separate small PR, that's fine, but doing it under the wrong issue is misleading.

Real fix is in PR #.

openclaw-bot added 2 commits May 24, 2026 02:59

Kpa-clawbot closed this May 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mqtt): persistent session + parallel handler — paho receiving 200× more messages (fixes #1337)#1338

fix(mqtt): persistent session + parallel handler — paho receiving 200× more messages (fixes #1337)#1338
Kpa-clawbot wants to merge 2 commits into
masterfrom
fix-1337

Kpa-clawbot commented May 24, 2026

Uh oh!

Kpa-clawbot commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Kpa-clawbot commented May 24, 2026

Problem

Root cause (hypothesis 1 + 5 from the triage)

Fix — cmd/ingestor/main.go buildMQTTOpts

Tests

Staging verification (post-merge, manual)

Preflight

Uh oh!

Kpa-clawbot commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix — `cmd/ingestor/main.go` `buildMQTTOpts`