Add Prometheus metrics for circuit breaker monitoring #6850

@negz

What problem are you facing?

The circuit breaker implementation in #6777 prevents XR reconciliation thrashing by opening when watch events arrive too frequently. When the circuit opens, users see a Responsive condition on their XR indicating the problem.

Operators need to monitor circuit breaker behavior at the cluster level to detect and alert on thrashing patterns. The status condition provides per-XR visibility, but there's no way to:

  • Alert when circuits are opening frequently across the cluster
  • Track how many circuits are currently open
  • Measure the rate of dropped reconciliation events
  • Identify which XRD types are experiencing thrashing

Without metrics, operators can't proactively detect thrashing patterns or set up alerts before users notice degraded responsiveness.

How could Crossplane help solve your problem?

Add Prometheus metrics for circuit breaker state transitions and event handling. The metrics should use the controller label (format: composite/<plural>.<group>) to provide per-XRD visibility without per-XR cardinality explosion.

Proposed metrics:

  1. crossplane_circuit_breaker_opens_total - Counter incremented when the circuit transitions from closed to open. Tracks thrashing frequency per XRD type.

  2. crossplane_circuit_breaker_closes_total - Counter incremented when the circuit transitions from open to closed. Combined with opens, allows deriving the current open circuit count.

  3. crossplane_circuit_breaker_events_total{result="allowed|dropped|halfopen_allowed"} - Counter tracking all events by outcome. The result label distinguishes normal events, events dropped while the circuit is open, and probe events allowed in the half-open state.
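To make the label scheme concrete, here is a minimal, dependency-free Go sketch of the three proposed counters. It is only an illustration: `BreakerMetrics`, `ControllerLabel`, and the `Record*` methods are hypothetical names, and the in-memory maps stand in for whatever Prometheus client the real implementation would use. The point it shows is that every series is keyed by controller (and result), never by individual XR.

```go
package main

import "sync"

// Result values mirror the proposed result label on
// crossplane_circuit_breaker_events_total.
const (
	ResultAllowed         = "allowed"
	ResultDropped         = "dropped"
	ResultHalfOpenAllowed = "halfopen_allowed"
)

// eventKey identifies one events_total series: a controller plus a result.
type eventKey struct {
	controller string
	result     string
}

// BreakerMetrics is a hypothetical in-memory stand-in for the three
// proposed counters. It is not Crossplane's implementation.
type BreakerMetrics struct {
	mu     sync.Mutex
	opens  map[string]uint64   // opens_total, keyed by controller
	closes map[string]uint64   // closes_total, keyed by controller
	events map[eventKey]uint64 // events_total, keyed by controller and result
}

// NewBreakerMetrics returns an empty set of counters.
func NewBreakerMetrics() *BreakerMetrics {
	return &BreakerMetrics{
		opens:  map[string]uint64{},
		closes: map[string]uint64{},
		events: map[eventKey]uint64{},
	}
}

// ControllerLabel builds the label value in the format given in the
// issue: composite/<plural>.<group>.
func ControllerLabel(plural, group string) string {
	return "composite/" + plural + "." + group
}

// RecordOpen counts a closed-to-open transition.
func (m *BreakerMetrics) RecordOpen(controller string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.opens[controller]++
}

// RecordClose counts an open-to-closed transition.
func (m *BreakerMetrics) RecordClose(controller string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.closes[controller]++
}

// RecordEvent counts one watch event with the given outcome.
func (m *BreakerMetrics) RecordEvent(controller, result string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.events[eventKey{controller, result}]++
}

// Opens returns the current opens_total value for a controller.
func (m *BreakerMetrics) Opens(controller string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.opens[controller]
}

// Closes returns the current closes_total value for a controller.
func (m *BreakerMetrics) Closes(controller string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.closes[controller]
}

// Events returns the current events_total value for a controller and result.
func (m *BreakerMetrics) Events(controller, result string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.events[eventKey{controller, result}]
}
```

With this shape, the number of series grows with the number of XRDs (times three result values for events), not with the number of XRs, which is the cardinality property the proposal asks for.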

These counters enable operators to:

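  • Alert on frequent thrashing when rate(crossplane_circuit_breaker_opens_total[5m]) exceeds a threshold
  • Track currently open circuits via crossplane_circuit_breaker_opens_total - crossplane_circuit_breaker_closes_total
  • Monitor drop rates via rate(crossplane_circuit_breaker_events_total{result="dropped"}[5m])
  • Identify problematic XRD types for investigation via the controller label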
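The arithmetic behind these queries can be sketched in a few lines of Go. This is a simplified model, not how Prometheus evaluates them: `Sample`, `OpenCircuits`, `PerSecondRate`, and `ShouldAlert` are hypothetical helpers, the rate is computed from just two samples without the extrapolation that PromQL's rate() applies, and the open-circuit count assumes both counters started from zero together.

```go
package main

// Sample is one scrape of a monotonically increasing counter.
type Sample struct {
	Seconds float64 // scrape time, in seconds
	Value   float64 // counter value at that time
}

// OpenCircuits derives the number of currently open circuits for one
// controller as opens_total minus closes_total. It assumes both counters
// were reset at the same moment (for example, at process start).
func OpenCircuits(opensTotal, closesTotal float64) float64 {
	return opensTotal - closesTotal
}

// PerSecondRate approximates a per-second rate from two samples.
// A decrease between samples is treated as a counter reset, so only the
// value accumulated since the reset is counted.
func PerSecondRate(first, last Sample) float64 {
	elapsed := last.Seconds - first.Seconds
	if elapsed <= 0 {
		return 0
	}
	increase := last.Value - first.Value
	if increase < 0 {
		increase = last.Value // counter restarted from zero
	}
	return increase / elapsed
}

// ShouldAlert reports whether a rate exceeds the operator's threshold.
func ShouldAlert(rate, threshold float64) bool {
	return rate > threshold
}
```

For example, a counter that rises from 10 to 40 over a 300-second window gives a rate of about 0.1 per second, which would trip an alert with a threshold of 0.05 but not one with a threshold of 0.5.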

Status

Done
