bug(mqtt): ingestor receives 200× fewer messages than mosquitto_sub from same broker — paho client misconfigured #1337

@Kpa-clawbot

Summary

Staging ingestor receives ~0.5 messages/min from mqtt2.wcmesh.com while mosquitto_sub against the SAME broker / SAME credentials / SAME topics from inside the SAME container receives ~112 messages/min — a 200× gap.

Repro on staging (current state)

# Inside the staging container, exact same credentials + topic patterns staging subscribes to:
docker exec corescope-staging-go timeout 30 mosquitto_sub \
  -h mqtt2.wcmesh.com -p 8883 --tls-version tlsv1.2 --insecure \
  -u <REDACTED> -P <REDACTED> \
  -t "meshcore/SJC/+/packets" -t "meshcore/OAK/+/packets" \
  -t "meshcore/PRB/+/packets" -t "meshcore/SFO/+/packets" \
  -t "meshcore/MRY/+/packets" 2>&1 | grep raw | wc -l
# → 56 packets in 30s

# Compare ingestor's cumulative stats (same broker, same creds):
docker logs corescope-staging-go 2>&1 | grep "\[stats\]" | tail -1
# → tx_inserted=20 tx_dupes=44 obs_inserted=64 over 1h35m
# → ~44 messages received total in 95 minutes

Suspects (in priority order)

  1. paho CleanSession=true: the broker discards the session, including its subscriptions, on every disconnect; staging reconnects every ~5min
  2. Subscription QoS mismatch: broker only delivers QoS 0 to non-persistent sessions; paho subscribes at QoS 1 but the session is non-persistent → broker drops the messages
  3. Subscribe ACK race: paho considers the subscribe "complete" before the broker acks, so messages in flight before the SUBACK are silently dropped
  4. Network MTU / TLS-fragment issue on larger payloads: small status messages get through, larger packet messages don't (would explain the ratio if packets are ~800 bytes; observers may need MTU diagnosis)
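
For suspects 1 to 3, a configuration sketch of how a paho.mqtt.golang client could be set up to rule them out. This is not the ingestor's actual code: the client ID, the 60 s keepalive, the 30 s reconnect cap, and the logging format are assumptions chosen for illustration. For comparison, mosquitto_sub defaults to QoS 0, a 60-second keepalive, and a clean session unless `-c` is given.

```go
package main

import (
	"crypto/tls"
	"log"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// topics mirrors the filters from the mosquitto_sub repro. QoS 0 matches
// the mosquitto_sub default, so delivery does not depend on session state.
var topics = map[string]byte{
	"meshcore/SJC/+/packets": 0,
	"meshcore/OAK/+/packets": 0,
	"meshcore/PRB/+/packets": 0,
	"meshcore/SFO/+/packets": 0,
	"meshcore/MRY/+/packets": 0,
}

func onMessage(_ mqtt.Client, m mqtt.Message) {
	log.Printf("[mqtt] topic=%s bytes=%d", m.Topic(), len(m.Payload()))
}

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("ssl://mqtt2.wcmesh.com:8883").
		SetClientID("corescope-staging-ingestor"). // hypothetical stable ID
		SetUsername("<REDACTED>").
		SetPassword("<REDACTED>").
		// Mirrors --tls-version tlsv1.2 --insecure from the repro; do not
		// skip certificate verification outside of debugging.
		SetTLSConfig(&tls.Config{MinVersion: tls.VersionTLS12, InsecureSkipVerify: true}).
		SetCleanSession(false). // suspect 1: keep the session across reconnects
		SetAutoReconnect(true).
		SetKeepAlive(60 * time.Second).
		SetMaxReconnectInterval(30 * time.Second)

	// Subscribe inside the on-connect handler so every reconnect restores
	// the subscriptions, and wait for the SUBACK before treating the
	// subscription as live (suspect 3).
	opts.SetOnConnectHandler(func(c mqtt.Client) {
		tok := c.SubscribeMultiple(topics, onMessage)
		if !tok.WaitTimeout(10 * time.Second) {
			log.Printf("[mqtt] subscribe timed out")
			return
		}
		if err := tok.Error(); err != nil {
			log.Printf("[mqtt] subscribe failed: %v", err)
			return
		}
		log.Printf("[mqtt] subscribed to %d filters", len(topics))
	})
	opts.SetConnectionLostHandler(func(_ mqtt.Client, err error) {
		log.Printf("[mqtt] connection lost: %v", err)
	})

	client := mqtt.NewClient(opts)
	if tok := client.Connect(); tok.Wait() && tok.Error() != nil {
		log.Fatalf("[mqtt] connect failed: %v", tok.Error())
	}
	select {} // block forever; messages arrive via onMessage
}
```

Logging the connection-lost error is useful on its own: the reason for the ~5min reconnects may point at the root cause more directly than any of the option changes.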

Acceptance

  • Compare staging ingestor's paho options vs what mosquitto_sub uses (KeepAlive, CleanSession, AutoReconnect, MaxReconnectInterval, SubscriptionRetry)
  • Call SetOrderMatters(false) on the paho ClientOptions, or equivalent, for parallel processing if relevant
  • Add MQTT-level instrumentation: log every message received in handler (briefly) + count by topic to confirm broker actually delivers
  • Confirm subscription persistence by calling SetCleanSession(false) on the paho ClientOptions with a persistent clientID, then see if the delivery rate matches mosquitto_sub
  • Mutation test: reverting the fix drops the rate back to the broken baseline (7/h)
  • Test on staging: receive rate matches mosquitto_sub within ±10%
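
The instrumentation and the ±10% check from the list above could look roughly like the sketch below. It uses only the standard library; `topicCounter`, `regionOf`, and `withinTolerance` are hypothetical names, and grouping by the second topic level (the region code) is an assumption about what "count by topic" should mean here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// topicCounter tallies received messages per region. It is safe to call
// from concurrent message handlers.
type topicCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newTopicCounter() *topicCounter {
	return &topicCounter{counts: make(map[string]int)}
}

// regionOf extracts the second topic level, e.g. "SJC" from
// "meshcore/SJC/<id>/packets".
func regionOf(topic string) string {
	parts := strings.Split(topic, "/")
	if len(parts) < 2 {
		return "unknown"
	}
	return parts[1]
}

// Observe records one received message for the topic's region.
func (c *topicCounter) Observe(topic string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[regionOf(topic)]++
}

// Snapshot renders the counts as "REGION=n" pairs in sorted order.
func (c *topicCounter) Snapshot() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%d", k, c.counts[k]))
	}
	return strings.Join(pairs, " ")
}

// withinTolerance reports whether got is within frac (e.g. 0.10) of want.
func withinTolerance(got, want, frac float64) bool {
	diff := got - want
	if diff < 0 {
		diff = -diff
	}
	return diff <= frac*want
}

func main() {
	c := newTopicCounter()
	for _, t := range []string{
		"meshcore/SJC/a1/packets",
		"meshcore/SJC/b2/packets",
		"meshcore/OAK/c3/packets",
	} {
		c.Observe(t)
	}
	fmt.Println("[mqtt-stats]", c.Snapshot())     // [mqtt-stats] OAK=1 SJC=2
	fmt.Println(withinTolerance(105, 112, 0.10)) // true
	fmt.Println(withinTolerance(0.5, 112, 0.10)) // false
}
```

Calling `Observe` first thing in the message handler, before any dedup or parsing, separates "the broker never delivered it" from "the ingestor dropped it downstream".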

Out of scope

  • Reorganizing the ingest pipeline architecture
  • Switching MQTT libraries
  • Investigating broker-side rate limits (probably not the cause given mosquitto_sub works)

Critical context

  • Prod has the same user authenticated to the same broker and sees 21,733 pkts/hr (about 3 orders of magnitude higher than staging). This suggests staging's paho config is the differentiator, not the broker or auth.
  • Staging container has NORMAL network (mosquitto_sub gets 56 in 30s)
  • The watchdog detects the stall and reconnects, but that doesn't help → each reconnect re-subscribes, but messages published between reconnect cycles are missed
