Backend: Expose endpoint SLA breach counters as Prometheus metrics #865

Description

@Junirezz

Context

endpointSlaRegistry.ts documents per-route P95 latency budgets and availability targets. latencyMonitoring.ts tracks rolling P95 per endpoint but does not export SLO breach state to Prometheus.

Problem / Gap

On-call engineers cannot alert on endpoint-level SLO violations using standard Prometheus/Grafana tooling. SLA metadata exists in code but is not observable at runtime.

Proposed approach

  • Emit counters/gauges for SLO breaches per registered route (e.g. backend_slo_breach_total{path,tier}).
  • Expose the current rolling P95 and its budget as gauges, evaluated at scrape time.
  • Wire breach events through existing latency monitoring cooldown logic to avoid alert storms.
  • Document metric names in backend observability docs.
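
The approach above could be sketched roughly as follows. This is a minimal illustration, not the existing implementation: the `SlaEntry` shape, the `SloBreachTracker` class, the gauge names `backend_slo_p95_ms` and `backend_slo_p95_budget_ms`, and the cooldown handling are all assumptions, and a real change would presumably reuse the cooldown logic already in latencyMonitoring.ts and whatever metrics client metrics.ts uses. Only `backend_slo_breach_total{path,tier}` comes from this issue.

```typescript
// Hypothetical registry shape; the real ENDPOINT_SLA_REGISTRY may differ.
type Tier = "critical" | "standard";

interface SlaEntry {
  path: string;
  tier: Tier;
  p95BudgetMs: number;
}

// Hypothetical tracker: counts breaches per route, at most one per cooldown window.
class SloBreachTracker {
  private readonly entries = new Map<string, SlaEntry>();
  private readonly breachTotal = new Map<string, number>();
  private readonly lastCountedAt = new Map<string, number>();
  private readonly currentP95Ms = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(registry: SlaEntry[], cooldownMs: number) {
    this.cooldownMs = cooldownMs;
    for (const entry of registry) {
      this.entries.set(entry.path, entry);
      this.breachTotal.set(entry.path, 0); // expose a zero-valued series from the start
    }
  }

  // Records the latest rolling P95; returns true only if a breach was counted.
  observe(path: string, p95Ms: number, nowMs: number): boolean {
    const entry = this.entries.get(path);
    if (entry === undefined) return false; // unregistered routes are ignored
    this.currentP95Ms.set(path, p95Ms);
    if (p95Ms <= entry.p95BudgetMs) return false; // within budget

    const last = this.lastCountedAt.get(path);
    if (last !== undefined && nowMs - last < this.cooldownMs) return false; // still cooling down

    this.lastCountedAt.set(path, nowMs);
    this.breachTotal.set(path, (this.breachTotal.get(path) ?? 0) + 1);
    return true;
  }

  breachCount(path: string): number {
    return this.breachTotal.get(path) ?? 0;
  }

  // Renders all series in the Prometheus text exposition format.
  render(): string {
    const counters: string[] = ["# TYPE backend_slo_breach_total counter"];
    const p95s: string[] = ["# TYPE backend_slo_p95_ms gauge"];
    const budgets: string[] = ["# TYPE backend_slo_p95_budget_ms gauge"];
    this.entries.forEach((entry, path) => {
      const labels = `{path="${path}",tier="${entry.tier}"}`;
      counters.push(`backend_slo_breach_total${labels} ${this.breachCount(path)}`);
      const p95 = this.currentP95Ms.get(path);
      if (p95 !== undefined) p95s.push(`backend_slo_p95_ms${labels} ${p95}`);
      budgets.push(`backend_slo_p95_budget_ms${labels} ${entry.p95BudgetMs}`);
    });
    return [...counters, ...p95s, ...budgets].join("\n") + "\n";
  }
}

// Usage with made-up budgets and a 60 s cooldown.
const tracker = new SloBreachTracker(
  [
    { path: "/health", tier: "critical", p95BudgetMs: 100 },
    { path: "/api/items", tier: "standard", p95BudgetMs: 500 },
  ],
  60_000,
);
tracker.observe("/health", 150, 0);      // counted
tracker.observe("/health", 150, 30_000); // suppressed by cooldown
tracker.observe("/health", 150, 60_000); // counted again
```

Passing the timestamp in explicitly keeps the cooldown deterministic under test. Carrying `tier` as a label is one way to make critical routes distinguishable in alert rules.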

Acceptance criteria

  • /metrics exposes per-endpoint SLO breach counters aligned with ENDPOINT_SLA_REGISTRY.
  • Breach increments respect the existing alert cooldown window.
  • Critical-tier routes (/health, /ready) carry distinct metric labels.
  • Tests verify breach counter increments when synthetic latency exceeds budget.
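
A test for the last criterion might look roughly like the sketch below. It assumes a nearest-rank P95 over a fixed window of samples and a 200 ms budget. Both are illustrative: the real rolling-window logic in latencyMonitoring.ts may compute P95 differently, and `p95`, `checkBreach`, and the sample values are hypothetical.

```typescript
// Nearest-rank P95 over a window of latency samples in milliseconds.
function p95(samplesMs: number[]): number {
  if (samplesMs.length === 0) return 0;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical stand-in for the breach check: bumps the counter when P95 exceeds budget.
function checkBreach(
  samplesMs: number[],
  budgetMs: number,
  counter: { value: number },
): boolean {
  const breached = p95(samplesMs) > budgetMs;
  if (breached) counter.value += 1;
  return breached;
}

const budgetMs = 200; // made-up budget for the test
const counter = { value: 0 };

// One slow request in 20 leaves the P95 at 50 ms, so no breach is counted.
const healthy = [...new Array<number>(19).fill(50), 500];
checkBreach(healthy, budgetMs, counter);

// Two slow requests in 20 push the P95 to 500 ms, so the counter increments.
const degraded = [...new Array<number>(18).fill(50), 500, 500];
checkBreach(degraded, budgetMs, counter);
```

The two windows differ by a single slow sample, which pins the test to the P95 boundary rather than to the mean or the maximum.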

Files/areas affected

  • backend/src/endpointSlaRegistry.ts
  • backend/src/latencyMonitoring.ts
  • backend/src/metrics.ts
  • backend/src/index.ts
