Operations Runbook
Nyk edited this page Mar 6, 2026
Last reviewed: 2026-03-06
Owner: operations
Use this page for day-2 operations and incident response.
- /api/status healthy (memory/disk within limits).
- Scheduler jobs are running.
- Notifications queue is not stalled.
- Agent heartbeat recency is within expected window.
- Error logs are not spiking.
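
The heartbeat recency check above can be sketched as a small helper. This is a minimal illustration, not the project's own tooling: the agent names, the `last_seen` mapping, and the 5-minute window are all assumptions, since this page does not state the expected window or how heartbeats are stored.

```python
from datetime import datetime, timedelta, timezone


def stale_agents(last_seen, now, window):
    """Return the names of agents whose last heartbeat is older than `window`."""
    return sorted(name for name, ts in last_seen.items() if now - ts > window)


# Hypothetical data: in practice these timestamps would come from the app's own store.
now = datetime(2026, 3, 6, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "agent-a": now - timedelta(seconds=30),
    "agent-b": now - timedelta(minutes=12),
}

# The 5-minute window is an assumed value; substitute the real expected window.
print(stale_agents(last_seen, now, timedelta(minutes=5)))  # ['agent-b']
```

Passing `now` explicitly keeps the check deterministic and easy to test.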
- Backup integrity test for .data.
- Review webhook retry/circuit-breaker behavior.
- Review access requests and admin accounts.
- Review token cost anomalies.
- P1: Login unavailable, data loss risk, task board unusable.
- P2: Gateway degraded, delayed events, intermittent write errors.
- P3: Non-blocking panel/API regressions.
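
One way to apply the severity levels above consistently is a lookup table that returns the most severe level among the observed symptoms. The symptom labels and the `classify` helper below are hypothetical names chosen for illustration; only the P1/P2/P3 groupings come from this page.

```python
from enum import Enum


class Severity(Enum):
    P1 = 1
    P2 = 2
    P3 = 3


# Symptom labels are hypothetical; the grouping mirrors the severity list above.
SYMPTOM_SEVERITY = {
    "login_unavailable": Severity.P1,
    "data_loss_risk": Severity.P1,
    "task_board_unusable": Severity.P1,
    "gateway_degraded": Severity.P2,
    "delayed_events": Severity.P2,
    "intermittent_write_errors": Severity.P2,
    "panel_regression": Severity.P3,
    "api_regression": Severity.P3,
}


def classify(symptoms):
    """Return the most severe level among the symptoms (unknown labels raise KeyError)."""
    if not symptoms:
        raise ValueError("at least one symptom is required")
    return min((SYMPTOM_SEVERITY[s] for s in symptoms), key=lambda sev: sev.value)


print(classify(["delayed_events", "login_unavailable"]).name)  # P1
```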
- Confirm scope (single user, single tenant, global).
- Check server logs and recent deploy/change.
- Verify DB lock/contention status.
- Verify gateway connectivity and allowed origins.
- Apply rollback if user-facing impact is sustained.
- Restart app process/container.
- Restart gateway and reconnect.
- Free stale lock holder (single writer against .data).
- Rotate API key if auth compromise suspected.
- API availability: 99.5% monthly.
- P1 acknowledgement: 10 minutes.
- P1 mitigation started: 20 minutes.
- Recovery target (P1): 60 minutes.
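
To make these targets concrete, the sketch below converts the 99.5% monthly availability target into an allowed-downtime budget and computes P1 deadlines from a detection time. Two assumptions are not stated on this page: a 30-day month, and all three P1 clocks starting at detection. The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Targets copied from the list above, in minutes.
P1_TARGETS_MIN = {"acknowledge": 10, "mitigation_started": 20, "recovery": 60}


def monthly_error_budget_minutes(availability_pct, days_in_month=30):
    """Minutes of downtime allowed in a month at the given availability target."""
    return days_in_month * 24 * 60 * (100 - availability_pct) / 100


def p1_deadlines(detected_at):
    """Deadline for each P1 target, assuming every clock starts at detection."""
    return {
        name: detected_at + timedelta(minutes=minutes)
        for name, minutes in P1_TARGETS_MIN.items()
    }


budget = monthly_error_budget_minutes(99.5)
print(budget)  # 216.0 minutes (3 h 36 min) in a 30-day month

# Share of the monthly budget used by one P1 that runs to the full recovery target.
print(f"{P1_TARGETS_MIN['recovery'] / budget:.0%}")  # 28%

detected = datetime(2026, 3, 6, 12, 0, tzinfo=timezone.utc)
for name, due in p1_deadlines(detected).items():
    print(name, due.isoformat())
```

Under these assumptions, a single P1 that takes the full 60 minutes to recover uses a little over a quarter of the month's budget, so roughly three such incidents fit within the availability target.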