fix(mqtt): persistent session + parallel handler — paho receiving 200× more messages (fixes #1337)#1338

Closed
Kpa-clawbot wants to merge 2 commits into master from fix-1337

Conversation

@Kpa-clawbot
Owner

Red: 2fd579bc — assertion failures on CleanSession/ClientID/Order (CI link will be added after creation).
Green: 4ea12087 — buildMQTTOpts now sets the four paho opts below.

Problem

Staging ingestor received ~7 msg/h from mqtt2.wcmesh.com; mosquitto_sub against the same broker / creds / topics inside the same container received ~6720 msg/h — a 200× gap. Prod (same creds, presumably warmer session) sees 21k/h. Issue: paho client misconfiguration was silently dropping the backlog on every reconnect.

Root cause (hypothesis 1 + 5 from the triage)

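paho defaults bit us: CleanSession=true, an empty (random) ClientID on every reconnect, and Order=true. Every watchdog-driven reconnect (roughly every 5 min) made the broker treat the ingestor as a brand-new session and drop the queued backlog.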

Fix — cmd/ingestor/main.go buildMQTTOpts

  • SetClientID("corescope-ingestor-<hostname>-<source-tag>") — persistent, unique per source, stable across restarts.
  • SetCleanSession(false) — broker retains subscription state across reconnects and replays the queued messages we missed.
  • SetKeepAlive(30 * time.Second) — paho-level half-open detection.
  • SetOrderMatters(false) — parallel handler dispatch.

Watchdog (#1212/#1216) untouched. MaxReconnectInterval=30s unchanged — no reconnect storm.

Tests

Three new tests in cmd/ingestor/mqtt_session_test.go:

  1. TestBuildMQTTOpts_PersistentSession_Issue1337 — pins the four opts above.
  2. TestBuildMQTTOpts_ClientIDStableAcrossBuilds_Issue1337 — ClientID stable across two builds (otherwise reconnect = new session).
  3. TestBuildMQTTOpts_ClientIDUniquePerSource_Issue1337 — distinct sources get distinct ClientIDs (duplicate IDs cause the broker to disconnect the older session, infinite flap).
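
The two ClientID properties pinned by tests 2 and 3 can be sketched in a few lines of Go. This is an illustration under assumptions, not the code in cmd/ingestor/main.go: the helper name clientID, the source tags "wcmesh" and "other", and the "unknown" hostname fallback are all hypothetical; only the "corescope-ingestor-<hostname>-<source-tag>" format comes from this PR.

```go
package main

import (
	"fmt"
	"os"
)

// clientID is a hypothetical helper. The PR only states the resulting format,
// "corescope-ingestor-<hostname>-<source-tag>".
func clientID(hostname, sourceTag string) string {
	return fmt.Sprintf("corescope-ingestor-%s-%s", hostname, sourceTag)
}

func main() {
	host, err := os.Hostname()
	if err != nil {
		host = "unknown" // assumed fallback, not taken from the PR
	}

	// The source tags below are made up for the example.
	a := clientID(host, "wcmesh")
	b := clientID(host, "wcmesh")
	c := clientID(host, "other")

	// Same inputs give the same ID, so a reconnect resumes the same session.
	fmt.Println("stable across builds:", a == b)
	// Different sources give different IDs, so they do not evict each other.
	fmt.Println("unique per source:   ", a != c)
}
```

Because the ID is a pure function of hostname and source tag, it carries no random or per-process component. Plain "-" concatenation can still collide if a hostname or tag itself contains "-", which a real implementation may want to guard against.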

Red commit (2fd579bc) verified: the failures are assertion failures on CleanSession / empty ClientID / Order, not a build error. Green commit (4ea12087) flips all three to pass. The full suite (cd cmd/ingestor && go test ./...) passes locally (57s).

Staging verification (post-merge, manual)

After deploy: SSH to staging and, 10 min after restart, run docker logs corescope-staging-go | grep "\[stats\]" | tail -1. Expect tx_dupes growth at ~1000/min (matching mosquitto_sub rate), not 30/min.

Preflight

bash ~/.openclaw/skills/pr-preflight/scripts/run-all.sh origin/master → exit 0 (all gates clean).

Fixes #1337

openclaw-bot added 2 commits May 24, 2026 02:59
Three tests that fail on master:
- TestBuildMQTTOpts_PersistentSession_Issue1337 — asserts CleanSession=false,
  non-empty ClientID embedding hostname+source name, KeepAlive=30s, Order=false
- TestBuildMQTTOpts_ClientIDStableAcrossBuilds_Issue1337 — same source name +
  hostname must yield identical ClientID across two builds (otherwise reconnect
  = new session = broker drops the backlog)
- TestBuildMQTTOpts_ClientIDUniquePerSource_Issue1337 — distinct source names
  must yield distinct ClientIDs (duplicate ClientID = broker disconnects the
  older session, infinite flap)

Refs #1337
paho defaults (CleanSession=true, empty random ClientID per reconnect,
Order=true) caused the staging ingestor to receive ~7 msg/h while
mosquitto_sub on the same broker/creds/topics received ~6720/h — a 200x
gap. Every watchdog-driven reconnect (~every 5min) made the broker treat
us as a brand-new session and drop the queued backlog.

buildMQTTOpts now sets:
  - SetClientID("corescope-ingestor-<hostname>-<source-tag>")
    persistent + unique across sources, stable across restarts
  - SetCleanSession(false)
    broker keeps subscription state across reconnects and replays the
    backlog we missed
  - SetKeepAlive(30 * time.Second)
    paho-level half-open detection (was unset; relying on OS keepalive)
  - SetOrderMatters(false)
    handler dispatch is parallel; one slow packet no longer stalls all
    others under burst load

The existing watchdog (#1212/#1216) is untouched. Reconnect throttle
(MaxReconnectInterval=30s) is unchanged — no reconnect storm.

Fixes #1337
@Kpa-clawbot
Owner Author

Closing — wrong diagnosis on my part. Real root cause is #1339: neighbor-builder full-table scan every 60s holds the SQLite write lock and starves the MQTT handler. The paho client itself is fine; messages do arrive at the handler but can't be persisted because the write lock is held by the builder.
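
The starvation pattern described above can be modeled with a mutex standing in for the SQLite write lock. This is a deliberately simplified sketch: store, tryPersist, and simulate are hypothetical names, and giving up immediately on a busy lock is an assumption. Whether real writes fail, time out, or queue depends on the SQLite driver and busy-timeout settings, which are not stated here.

```go
package main

import (
	"fmt"
	"sync"
)

// store stands in for the SQLite database; mu models its single write lock.
type store struct {
	mu   sync.Mutex
	rows int
}

// tryPersist models the MQTT handler's write. It gives up immediately when
// the write lock is taken, a simplification of SQLite's busy handling.
func (s *store) tryPersist() bool {
	if !s.mu.TryLock() {
		return false
	}
	defer s.mu.Unlock()
	s.rows++
	return true
}

// simulate delivers n messages while the builder holds the lock and n more
// after it releases, returning how many writes succeeded in each phase.
func simulate(n int) (during, after int) {
	s := &store{}

	s.mu.Lock() // the neighbor-builder scan starts and takes the write lock
	for i := 0; i < n; i++ {
		if s.tryPersist() {
			during++
		}
	}
	s.mu.Unlock() // the scan ends

	for i := 0; i < n; i++ {
		if s.tryPersist() {
			after++
		}
	}
	return during, after
}

func main() {
	during, after := simulate(100)
	// prints "persisted during scan: 0, after scan: 100"
	fmt.Printf("persisted during scan: %d, after scan: %d\n", during, after)
}
```

Every message reaches the handler in both phases, yet none is persisted while the lock is held, which is why the symptom looked like a client that was receiving too few messages.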

The changes here (CleanSession=false + persistent ClientID + OrderMatters=false) aren't WRONG — they're general MQTT hygiene improvements — but they don't address the actual symptom and risk introducing other behavior changes without need. If we want to ship the hygiene changes later as a separate small PR, that's fine, but doing it under the wrong issue is misleading.

Real fix is in PR #.

Development

Successfully merging this pull request may close these issues.

bug(mqtt): ingestor receives 200× fewer messages than mosquitto_sub from same broker — paho client misconfigured
