fix(mqtt): persistent session + parallel handler — paho receiving 200× more messages (fixes #1337)#1338
Closed
Kpa-clawbot wants to merge 2 commits into
Closed
fix(mqtt): persistent session + parallel handler — paho receiving 200× more messages (fixes #1337)#1338Kpa-clawbot wants to merge 2 commits into
Kpa-clawbot wants to merge 2 commits into
Conversation
added 2 commits
May 24, 2026 02:59
Three tests that fail on master: - TestBuildMQTTOpts_PersistentSession_Issue1337 — asserts CleanSession=false, non-empty ClientID embedding hostname+source name, KeepAlive=30s, Order=false - TestBuildMQTTOpts_ClientIDStableAcrossBuilds_Issue1337 — same source name + hostname must yield identical ClientID across two builds (otherwise reconnect = new session = broker drops the backlog) - TestBuildMQTTOpts_ClientIDUniquePerSource_Issue1337 — distinct source names must yield distinct ClientIDs (duplicate ClientID = broker disconnects the older session, infinite flap) Refs #1337
paho defaults (CleanSession=true, empty random ClientID per reconnect,
Order=true) caused the staging ingestor to receive ~7 msg/h while
mosquitto_sub on the same broker/creds/topics received ~6720/h — a 200x
gap. Every watchdog-driven reconnect (~every 5min) made the broker treat
us as a brand-new session and drop the queued backlog.
buildMQTTOpts now sets:
- SetClientID("corescope-ingestor-<hostname>-<source-tag>")
persistent + unique across sources, stable across restarts
- SetCleanSession(false)
broker keeps subscription state across reconnects and replays the
backlog we missed
- SetKeepAlive(30 * time.Second)
paho-level half-open detection (was unset; relying on OS keepalive)
- SetOrderMatters(false)
handler dispatch is parallel; one slow packet no longer stalls all
others under burst load
The existing watchdog (#1212/#1216) is untouched. Reconnect throttle
(MaxReconnectInterval=30s) is unchanged — no reconnect storm.
Fixes #1337
Owner
Author
|
Closing — wrong diagnosis on my part. Real root cause is #1339: neighbor-builder full-table scan every 60s holds the SQLite write lock and starves the MQTT handler. The paho client itself is fine; messages do arrive at the handler but can't be persisted because the write lock is held by the builder. The changes here (CleanSession=false + persistent ClientID + OrderMatters=false) aren't WRONG — they're general MQTT hygiene improvements — but they don't address the actual symptom and risk introducing other behavior changes without need. If we want to ship the hygiene changes later as a separate small PR, that's fine, but doing it under the wrong issue is misleading. Real fix is in PR #. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Red:
2fd579bc— assertion failures on CleanSession/ClientID/Order (CI link will be added after creation).Green:
4ea12087— buildMQTTOpts now sets the four paho opts below.Problem
Staging ingestor received ~7 msg/h from
mqtt2.wcmesh.com;mosquitto_subagainst the same broker / creds / topics inside the same container received ~6720 msg/h — a 200× gap. Prod (same creds, presumably warmer session) sees 21k/h. Issue: paho client misconfiguration was silently dropping the backlog on every reconnect.Root cause (hypothesis 1 + 5 from the triage)
paho defaults bit us:
CleanSession=true+ emptyClientID(paho generates a fresh random one per reconnect) → broker treats every dial as a brand-new session, discards the queued backlog from the previous disconnect. Combined with the bug(ingestor): MQTT client does not reconnect after disconnect — prod stayed down 6+ hours on 2026-05-15 #1212/fix(#1212): MQTT per-attempt logging + stall watchdog — prevent silent reconnect-loop death #1216 watchdog reconnecting every ~5 min, ~99% of messages were lost.Order=true(paho default) serializes the default publish handler — one slow packet stalled all others under burst load.KeepAlivewas never set explicitly, leaving paho relying on OS keepalive for half-open TCP detection.Fix —
cmd/ingestor/main.gobuildMQTTOptsSetClientID("corescope-ingestor-<hostname>-<source-tag>")— persistent, unique per source, stable across restarts.SetCleanSession(false)— broker retains subscription state across reconnects and replays the queued messages we missed.SetKeepAlive(30 * time.Second)— paho-level half-open detection.SetOrderMatters(false)— parallel handler dispatch.Watchdog (#1212/#1216) untouched.
MaxReconnectInterval=30sunchanged — no reconnect storm.Tests
Three new tests in
cmd/ingestor/mqtt_session_test.go:TestBuildMQTTOpts_PersistentSession_Issue1337— pins the four opts above.TestBuildMQTTOpts_ClientIDStableAcrossBuilds_Issue1337— ClientID stable across two builds (otherwise reconnect = new session).TestBuildMQTTOpts_ClientIDUniquePerSource_Issue1337— distinct sources get distinct ClientIDs (duplicate IDs cause the broker to disconnect the older session, infinite flap).Red commit (
2fd579bc) verified: assertion failures on CleanSession / empty ClientID / Order — not a build error. Green commit (4ea12087) flips all three to pass. Fullcd cmd/ingestor && go test ./...passes locally (57s).Staging verification (post-merge, manual)
After deploy: SSH staging,
docker logs corescope-staging-go | grep "\[stats\]" | tail -110 min after restart. Expecttx_dupesgrowth at ~1000/min (matchingmosquitto_subrate), not 30/min.Preflight
bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master→ exit 0 (all gates clean).Fixes #1337