bug(mqtt): watchdog detects stall but does NOT force reconnect — staging losing ~14k pkts/hr behind half-open TCP #1335

@Kpa-clawbot

Summary

Staging's lincomatic MQTT source (mqtt2.wcmesh.com) is stuck in a recurring stall→detect→never-recover cycle, dropping ~14k packets/hr that prod receives from the same broker.

Repro on staging

  • 19:59:51 — watchdog detects lincomatic stalled (1h5m no messages despite client reporting connected)
  • 20:00 — container restart
  • 20:00:03 — reconnect attempts start
  • 21:25:13 — lincomatic re-connects + re-subscribes to 5 topics ✓
  • 21:30:46 — watchdog detects new 5m32s stall (5 min after reconnect)
  • Stat: tx_inserted=9, tx_dupes=189 cumulative over 2h post-restart
  • Compare prod (same broker): packetsLastHour=13974

Root cause (in code review)

cmd/ingestor/mqtt_watchdog.go DETECTS stall states (LivenessTransientStall, LivenessStalled) and LOGS warnings, but does NOT force a paho.Disconnect+reconnect. It relies on paho's automatic reconnect to kick in — which doesn't, because the underlying TCP socket reports connected (half-open).

Why prod doesn't see it

Prod runs on a MikroTik container with different networking. The Azure staging VM appears to suffer from a NAT/keepalive issue that creates half-open sockets the paho client doesn't notice.

Fix

On stall detection (LivenessStalled event), force a paho disconnect+reconnect:

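if newState == LivenessStalled && oldState != LivenessStalled {
    log.Printf("[ingestor] MQTT [%s] WATCHDOG forcing reconnect...", source)
    client.Disconnect(250) // close the half-open connection, allowing up to 250 ms for in-flight work
    // Connect is asynchronous: wait on the token so a failed reconnect is logged.
    if tok := client.Connect(); tok.Wait() && tok.Error() != nil {
        log.Printf("[ingestor] MQTT [%s] WATCHDOG reconnect failed: %v", source, tok.Error())
    }
}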

Acceptance

  • Stall detected → watchdog forces reconnect within 10s
  • Test: synthetic stall scenario (mock client returns IsConnected=true but produces no messages for >threshold) → assert Disconnect+Connect called
  • Mutation: remove the force-reconnect → test fails
  • Optionally: TCP-keepalive on the socket dialer to surface half-open earlier

Out of scope

  • Investigating WHY Azure VM's NAT path causes half-open (network, not app)
  • Switching to MQTT 5 (paho-mqtt3 limitation)
