bug(mqtt): watchdog detects stall but does NOT force reconnect — staging losing ~14k pkts/hr behind half-open TCP

## Summary
Staging's `lincomatic` MQTT source (`mqtt2.wcmesh.com`) is stuck in a recurring stall→detect→never-recover cycle, dropping ~14k packets/hr that prod receives from the same broker.

## Repro on staging
- 19:59:51 — watchdog detects `lincomatic` stalled (1h5m no messages despite client reporting connected)
- 20:00 — container restart
- 20:00:03 — reconnect attempts start
- 21:25:13 — `lincomatic` re-connects + re-subscribes to 5 topics ✓
- 21:30:46 — watchdog detects new 5m32s stall (**5 min after reconnect**)
- Stat: `tx_inserted=9, tx_dupes=189` cumulative over 2h post-restart
- Compare prod (same broker): `packetsLastHour=13974`

## Root cause (in code review)
`cmd/ingestor/mqtt_watchdog.go` DETECTS stall states (`LivenessTransientStall`, `LivenessStalled`) and LOGS warnings, but does NOT force a paho.Disconnect+reconnect. It relies on paho's automatic reconnect to kick in — which doesn't, because the underlying TCP socket reports connected (half-open).

## Why prod doesn't see it
Prod runs on a MikroTik container with different networking. Azure staging VM appears to suffer from NAT/keepalive issue that creates half-open sockets the paho client doesn't notice.

## Fix
On stall detection (`LivenessStalled` event), force a paho disconnect+reconnect:
```go
if newState == LivenessStalled && oldState != LivenessStalled {
    log.Printf("[ingestor] MQTT [%s] WATCHDOG forcing reconnect...", source)
    client.Disconnect(250) // drop existing connection
    client.Connect()       // paho will re-establish
}
```

## Acceptance
- [ ] Stall detected → watchdog forces reconnect within 10s
- [ ] Test: synthetic stall scenario (mock client returns IsConnected=true but produces no messages for >threshold) → assert Disconnect+Connect called
- [ ] Mutation: remove the force-reconnect → test fails
- [ ] Optionally: TCP-keepalive on the socket dialer to surface half-open earlier

## Out of scope
- Investigating WHY Azure VM's NAT path causes half-open (network, not app)
- Switching to MQTT 5 (paho-mqtt3 limitation)

## References
- PR #1216 (original watchdog detection — detection-only, not recovery)
- `cmd/ingestor/mqtt_watchdog.go` LivenessStalled transition
- Compare prod (working) vs staging (stuck) — same broker


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(mqtt): watchdog detects stall but does NOT force reconnect — staging losing ~14k pkts/hr behind half-open TCP #1335

Summary

Repro on staging

Root cause (in code review)

Why prod doesn't see it

Fix

Acceptance

Out of scope

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

bug(mqtt): watchdog detects stall but does NOT force reconnect — staging losing ~14k pkts/hr behind half-open TCP #1335

Description

Summary

Repro on staging

Root cause (in code review)

Why prod doesn't see it

Fix

Acceptance

Out of scope

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions