Summary
Staging's lincomatic MQTT source (mqtt2.wcmesh.com) is stuck in a recurring stall→detect→never-recover cycle, dropping ~14k packets/hr that prod receives from the same broker.
Repro on staging
- 19:59:51 - watchdog detects lincomatic stalled (1h5m no messages despite client reporting connected)
- 20:00 - container restart
- 20:00:03 - reconnect attempts start
- 21:25:13 - lincomatic re-connects + re-subscribes to 5 topics ✓
- 21:30:46 - watchdog detects new 5m32s stall (5 min after reconnect)
- Stat: tx_inserted=9, tx_dupes=189 cumulative over 2h post-restart
- Compare prod (same broker): packetsLastHour=13974
Root cause (in code review)
cmd/ingestor/mqtt_watchdog.go DETECTS stall states (LivenessTransientStall, LivenessStalled) and LOGS warnings, but does NOT force a paho disconnect+reconnect. It relies on paho's automatic reconnect, which never fires because the underlying TCP socket still reports connected (half-open).
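To illustrate why the watchdog can see a stall while the client still reports connected, here is a minimal sketch of liveness classification driven only by message arrival time. It is not the code in mqtt_watchdog.go: the LivenessHealthy name, the function signature, and both thresholds are assumptions made for illustration. Only LivenessTransientStall and LivenessStalled come from the issue.

```go
package main

import "time"

// LivenessState mirrors the state names from the issue.
// LivenessHealthy is a hypothetical name for the non-stalled state.
type LivenessState int

const (
	LivenessHealthy LivenessState = iota
	LivenessTransientStall
	LivenessStalled
)

// Hypothetical thresholds; the real watchdog's values are not stated in the issue.
const (
	transientAfter = 1 * time.Minute
	stalledAfter   = 5 * time.Minute
)

// classify derives liveness from the time since the last received message.
// It deliberately ignores the client's own "connected" flag, because a
// half-open socket keeps that flag true while no traffic arrives.
func classify(sinceLastMsg, transientAfter, stalledAfter time.Duration) LivenessState {
	switch {
	case sinceLastMsg >= stalledAfter:
		return LivenessStalled
	case sinceLastMsg >= transientAfter:
		return LivenessTransientStall
	default:
		return LivenessHealthy
	}
}
```

Under these assumed thresholds, the 1h5m silence from the repro classifies as LivenessStalled regardless of what the client reports about its connection.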
Why prod doesn't see it
Prod runs on a MikroTik container with different networking. The Azure staging VM appears to suffer from a NAT/keepalive issue that creates half-open sockets the paho client doesn't notice.
Fix
On stall detection (LivenessStalled event), force a paho disconnect+reconnect:
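if newState == LivenessStalled && oldState != LivenessStalled {
    log.Printf("[ingestor] MQTT [%s] WATCHDOG forcing reconnect...", source)
    client.Disconnect(250) // drop the existing connection, waiting up to 250 ms for in-flight work
    // Connect is asynchronous: wait on the token and log any failure.
    if token := client.Connect(); token.Wait() && token.Error() != nil {
        log.Printf("[ingestor] MQTT [%s] WATCHDOG reconnect failed: %v", source, token.Error())
    }
}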
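A testable variant of the same idea can be sketched by hiding the client behind a narrow interface, so the transition logic can be exercised without a broker. Everything named here is an assumption for illustration: the reconnecter interface, forceReconnectOnStall, fakeClient, errBrokerDown, the LivenessHealthy state, and the 10 s timeout are not from the ingestor code. An adapter around the real paho client would need to wait on the token returned by Connect and return its error.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// LivenessState mirrors the state names from the issue.
// LivenessHealthy is a hypothetical name for the non-stalled state.
type LivenessState int

const (
	LivenessHealthy LivenessState = iota
	LivenessTransientStall
	LivenessStalled
)

// reconnecter is a hypothetical narrow view of an MQTT client.
type reconnecter interface {
	Disconnect(quiesceMs uint)
	Connect(timeout time.Duration) error
}

// errBrokerDown is a hypothetical error used by the fake below.
var errBrokerDown = errors.New("broker unreachable")

// forceReconnectOnStall tears down and re-establishes the connection only on
// the transition INTO LivenessStalled, so a persistent stall does not trigger
// a reconnect on every watchdog tick. It reports whether a reconnect was attempted.
func forceReconnectOnStall(c reconnecter, source string, oldState, newState LivenessState) (bool, error) {
	if newState != LivenessStalled || oldState == LivenessStalled {
		return false, nil
	}
	log.Printf("[ingestor] MQTT [%s] WATCHDOG forcing reconnect...", source)
	c.Disconnect(250) // drop the (possibly half-open) connection
	if err := c.Connect(10 * time.Second); err != nil {
		return true, fmt.Errorf("reconnect %s: %w", source, err)
	}
	return true, nil
}

// fakeClient records calls so the transition logic can be checked in isolation.
type fakeClient struct {
	disconnects int
	connects    int
	connectErr  error
}

func (f *fakeClient) Disconnect(uint) { f.disconnects++ }

func (f *fakeClient) Connect(time.Duration) error {
	f.connects++
	return f.connectErr
}
```

Calling forceReconnectOnStall with a transition from LivenessTransientStall to LivenessStalled performs one disconnect and one connect. Calling it again while the state remains LivenessStalled does nothing, which keeps the watchdog from hammering the broker during a long outage.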
Acceptance
Out of scope
- Investigating WHY Azure VM's NAT path causes half-open (network, not app)
- Switching to MQTT 5 (paho-mqtt3 limitation)
References
- cmd/ingestor/mqtt_watchdog.go LivenessStalled transition