Context
endpointSlaRegistry.ts documents per-route P95 latency budgets and availability targets. latencyMonitoring.ts tracks rolling P95 per endpoint but does not export SLO breach state to Prometheus.
Problem / Gap
On-call engineers cannot alert on endpoint-level SLO violations using standard Prometheus/Grafana tooling. SLA metadata exists in code but is not observable at runtime.
Proposed approach
- Emit counters/gauges for SLO breaches per registered route (e.g. backend_slo_breach_total{path,tier}).
- Expose the current P95 relative to its budget as a gauge, computed at scrape time.
- Wire breach events through existing latency monitoring cooldown logic to avoid alert storms.
- Document metric names in backend observability docs.
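The approach above could be sketched roughly as follows. This is a minimal, dependency-free illustration, not the actual implementation: the SlaEntry shape, the SloBreachTracker class, the cooldownMs parameter, and the gauge names backend_slo_p95_ms and backend_slo_p95_budget_ms are all assumptions made for the example. Only backend_slo_breach_total{path,tier} comes from this proposal. A real change would presumably reuse the metrics client in backend/src/metrics.ts and the existing cooldown logic in latencyMonitoring.ts.

```typescript
// Hypothetical shape of a registry entry; the real ENDPOINT_SLA_REGISTRY may differ.
interface SlaEntry {
  path: string;
  tier: string;
  p95BudgetMs: number;
}

// In-memory sketch: counts budget breaches per route, with a cooldown so a
// sustained breach does not increment the counter on every observation.
class SloBreachTracker {
  private registry: SlaEntry[];
  private cooldownMs: number; // assumed cooldown window, in milliseconds
  private breachTotals = new Map<string, number>();
  private lastBreachAt = new Map<string, number>();
  private currentP95 = new Map<string, number>();

  constructor(registry: SlaEntry[], cooldownMs: number) {
    this.registry = registry;
    this.cooldownMs = cooldownMs;
  }

  // Record the latest rolling P95 for a route.
  // Returns true only when a new breach was counted.
  observe(path: string, p95Ms: number, nowMs: number): boolean {
    const entry = this.registry.find((e) => e.path === path);
    if (!entry) return false; // unregistered routes are ignored
    this.currentP95.set(path, p95Ms);
    if (p95Ms <= entry.p95BudgetMs) return false; // within budget
    const last = this.lastBreachAt.get(path);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false; // still cooling down from the previous breach
    }
    this.lastBreachAt.set(path, nowMs);
    this.breachTotals.set(path, (this.breachTotals.get(path) ?? 0) + 1);
    return true;
  }

  // Render the Prometheus text exposition format for a scrape.
  // Assumes label values contain no quotes, backslashes, or newlines.
  render(): string {
    const label = (e: SlaEntry) => `{path="${e.path}",tier="${e.tier}"}`;
    const lines: string[] = ["# TYPE backend_slo_breach_total counter"];
    for (const e of this.registry) {
      lines.push(`backend_slo_breach_total${label(e)} ${this.breachTotals.get(e.path) ?? 0}`);
    }
    lines.push("# TYPE backend_slo_p95_ms gauge");
    for (const e of this.registry) {
      const p95 = this.currentP95.get(e.path);
      // Only emit a P95 sample once the route has been observed.
      if (p95 !== undefined) lines.push(`backend_slo_p95_ms${label(e)} ${p95}`);
    }
    lines.push("# TYPE backend_slo_p95_budget_ms gauge");
    for (const e of this.registry) {
      lines.push(`backend_slo_p95_budget_ms${label(e)} ${e.p95BudgetMs}`);
    }
    return lines.join("\n") + "\n";
  }
}
```

In this sketch the counter starts at 0 for every registered route, so alert rules based on rate() or increase() have a baseline from the first scrape. Emitting the budget as its own gauge is one option; a single ratio gauge would also satisfy the proposal.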
Acceptance criteria
- /metrics exposes per-endpoint SLO breach counters aligned with ENDPOINT_SLA_REGISTRY.
- ... /health, /ready) have distinct metric labels.
Files/areas affected
backend/src/endpointSlaRegistry.ts
backend/src/latencyMonitoring.ts
backend/src/metrics.ts
backend/src/index.ts