fix(mqtt): watchdog forces paho reconnect on stall — recovers from half-open TCP (closes #1335)#1336
Open
Kpa-clawbot wants to merge 2 commits into
Open
fix(mqtt): watchdog forces paho reconnect on stall — recovers from half-open TCP (closes #1335)#1336Kpa-clawbot wants to merge 2 commits into
Kpa-clawbot wants to merge 2 commits into
Conversation
added 2 commits
May 23, 2026 22:32
Adds failing tests for #1335: when checkSourceLiveness returns LivenessStalled (paho.IsConnected==true but no messages past threshold — half-open TCP), processLivenessTransition must invoke a per-source ForceReconnectFn so paho drops the dead socket and re-dials. Also asserts: - throttle: a second stall within forceReconnectThrottle (60s) does NOT fire a second reconnect (no broker hammering) - NeverReceived MUST NOT force-reconnect (likely ACL deny — TCP churn won't help; the distinct cold-start alarm is the right signal) - nil ForceReconnectFn is a safe no-op Compiles, fails on assertion. GREEN commit follows.
…en TCP #1335: PR #1216 added per-source stall DETECTION but only logged. Staging's lincomatic source has been losing ~14k pkts/hr behind a half-open TCP socket the Azure NAT silently abandons: paho reports IsConnected==true, no messages arrive for an hour, container restart is the only known recovery. Prod (MikroTik networking) doesn't see it. Changes (all additive — no behavior change for currently-flowing sources): - SourceLivenessState.ForceReconnectFn: per-source closure that wraps client.Disconnect(250) + client.Connect(). Wired in main.go right next to IsConnectedFn. - processLivenessTransition: on the LivenessStalled edge AND on every heartbeat re-emit while still Stalled, invoke maybeForceReconnect. LivenessNeverReceived (cold-start ACL deny / wrong hash) is deliberately NOT force-reconnected — a new TCP socket won't fix an ACL deny and would just churn the broker. - maybeForceReconnect: throttled at forceReconnectThrottle (60s) per source so a stall→reconnect→re-stall loop self-recovers without hammering the broker. Runs the Disconnect+Connect in a goroutine so a single slow source can't stall the watchdog tick. - buildMQTTOpts: explicit SetKeepAlive(30s). paho's default happens to be 30s but #1335 RCA called this out specifically — making it explicit so it can't drift, and so operators reading the code know it's intentional. - WATCHDOG telemetry: 'forcing reconnect' (intent), 'reconnect attempt issued' (post-goroutine), 'suppressing forced reconnect' (throttle). Test mqtt_watchdog_force_reconnect_test.go covers: stall edge fires, throttle suppresses second attempt within 60s, throttle re-arms after, NeverReceived does NOT force-reconnect, nil ForceReconnectFn is safe.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
RED
f06887— GREEN8f53c1. CI: (will populate on PR open)Fixes #1335Problem
PR #1216 added per-source stall detection (
LivenessStalled) but only logged. Staging'slincomaticsource has been silently losing ~14k pkts/hr behind a half-open TCP socket the Azure NAT abandons: paho reportsIsConnected==true, no messages arrive for 1h+, container restart is the only known recovery. Prod (MikroTik networking) doesn't see it.Fix
Make the watchdog actually recover.
SourceLivenessState.ForceReconnectFn— per-source closure wired inmain.gonext toIsConnectedFn, wrapsclient.Disconnect(250) + client.Connect().processLivenessTransition— on theLivenessStallededge AND on every heartbeat re-emit while still Stalled, invokemaybeForceReconnect.LivenessNeverReceived(cold-start ACL deny / wrong hash) is deliberately not force-reconnected — a new TCP socket won't fix an ACL deny and would just churn the broker.maybeForceReconnect— throttled atforceReconnectThrottle = 60sper source so a stall→reconnect→re-stall loop self-recovers without hammering the broker. The Disconnect+Connect runs in a goroutine so a single slow source can't stall the watchdog tick.buildMQTTOpts— explicitSetKeepAlive(30 * time.Second). paho's default happens to be 30s, but the bug(mqtt): watchdog detects stall but does NOT force reconnect — staging losing ~14k pkts/hr behind half-open TCP #1335 RCA called this out — making it explicit so it can't drift and so operators reading the code know it's intentional.WATCHDOG forcing reconnect(intent),WATCHDOG reconnect attempt issued(post-goroutine),WATCHDOG suppressing forced reconnect(throttle window).TDD
f06887—mqtt_watchdog_force_reconnect_test.go. Stub field + constant added so the file compiles; assertions fail becauseprocessLivenessTransitionnever invokesForceReconnectFn. Reverting just thes.ForceReconnectFn()call line from GREEN re-fails the same assertion (mutation verified).8f53c1— wiring + throttle + keepalive.Scope discipline
Additive only. No regression to currently-flowing sources:
LivenessOK,LivenessRecovered,LivenessDisconnected,LivenessHeartbeat, andLivenessNeverReceivedtransitions are unchanged. Throttle bound = ≤1 reconnect/min/source = ≤60/hr worst-case across all sources, well within any broker rate limit.Preflight: clean (all gates pass).