Operations Runbook

Nyk edited this page Mar 6, 2026 · 1 revision

Last reviewed: 2026-03-06
Owner: operations

Use this page for day-2 operations and incident response.

Daily Checks

/api/status reports healthy (memory and disk within limits).
Scheduler jobs are running.
Notifications queue is not stalled.
Agent heartbeat recency is within expected window.
Error logs are not spiking.
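
The first and fourth checks above can be partly automated. The following is a minimal sketch, not the project's tooling: the payload field names (memory_used_pct, disk_used_pct), the thresholds, and the 5-minute heartbeat window are all assumptions, since this page does not document the /api/status schema or the expected window. Adjust them to match the real deployment.

```python
from datetime import datetime, timedelta, timezone


def evaluate_status(payload, max_memory_pct=90.0, max_disk_pct=85.0):
    """Return a list of problems in a status payload (empty list means healthy)."""
    problems = []
    # Hypothetical field names; replace with the real /api/status schema.
    memory = payload.get("memory_used_pct")
    disk = payload.get("disk_used_pct")
    # A missing value is treated as a problem rather than silently passing.
    if memory is None or memory > max_memory_pct:
        problems.append(f"memory: {memory}")
    if disk is None or disk > max_disk_pct:
        problems.append(f"disk: {disk}")
    return problems


def heartbeat_is_fresh(last_seen, now, window=timedelta(minutes=5)):
    """True if the last agent heartbeat falls within the expected window."""
    return now - last_seen <= window


if __name__ == "__main__":
    # Example payload and heartbeat, both made up for illustration.
    print(evaluate_status({"memory_used_pct": 95.0, "disk_used_pct": 50.0}))
    now = datetime.now(timezone.utc)
    print(heartbeat_is_fresh(now - timedelta(minutes=2), now))
```

Keeping the evaluation separate from the HTTP call makes the thresholds easy to test without a running server.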

Weekly Checks

Backup integrity test for .data.
Review webhook retry/circuit-breaker behavior.
Review access requests and admin accounts.
Review token cost anomalies.
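
One way to run the backup integrity test is to restore the backup to a scratch directory and compare file checksums against a snapshot of .data taken at backup time. The sketch below assumes the backup is a plain directory copy; this page does not describe the backup format, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks so large files do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_backup(source_dir, restored_dir):
    """Report files missing from, extra in, or different in the restored copy."""
    src, dst = manifest(source_dir), manifest(restored_dir)
    return {
        "missing": sorted(set(src) - set(dst)),
        "extra": sorted(set(dst) - set(src)),
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

Comparing against the live .data directory will report spurious changes if the app has written since the backup ran, so the source side should be a snapshot from the same moment.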

Incident Priorities

P1: Login unavailable, data loss risk, task board unusable.
P2: Gateway degraded, delayed events, intermittent write errors.
P3: Non-blocking panel/API regressions.

Triage Checklist

Confirm scope (single user, single tenant, global).
Check server logs and recent deploy/change.
Verify DB lock/contention status.
Verify gateway connectivity and allowed origins.
Apply rollback if user-facing impact is sustained.

Common Recovery Actions

Restart app process/container.
Restart gateway and reconnect.
Release the stale lock holder (only a single writer may run against .data).
Rotate the API key if an auth compromise is suspected.

Operational SLO Targets (Initial)

API availability: 99.5% monthly.
P1 acknowledgement: within 10 minutes.
P1 mitigation started: within 20 minutes.
Recovery target (P1): within 60 minutes.
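
To make these targets concrete: 99.5% availability over a 30-day month allows 0.5% of 43,200 minutes, which is 216 minutes of downtime. The sketch below computes that budget and checks a P1 timeline against the three targets. It assumes all three P1 clocks start when the incident is opened, which this page does not state; the milestone names are hypothetical.

```python
from datetime import datetime, timedelta

AVAILABILITY_TARGET_PCT = 99.5

# Assumption: every target is measured from the moment the incident is opened.
P1_TARGETS = {
    "acknowledged": timedelta(minutes=10),
    "mitigation_started": timedelta(minutes=20),
    "recovered": timedelta(minutes=60),
}


def downtime_budget_minutes(days_in_month, target_pct=AVAILABILITY_TARGET_PCT):
    """Minutes of downtime allowed in a month at the given availability target."""
    total_minutes = days_in_month * 24 * 60
    return round(total_minutes * (100 - target_pct) / 100, 1)


def p1_breaches(opened_at, events):
    """Return the milestones that were missed or never reached.

    events maps a milestone name to the datetime it was reached.
    """
    breaches = []
    for name, limit in P1_TARGETS.items():
        reached_at = events.get(name)
        if reached_at is None or reached_at - opened_at > limit:
            breaches.append(name)
    return breaches


if __name__ == "__main__":
    print(downtime_budget_minutes(30))  # 216.0
    opened = datetime(2026, 3, 6, 9, 0)
    timeline = {
        "acknowledged": opened + timedelta(minutes=8),
        "mitigation_started": opened + timedelta(minutes=25),
        "recovered": opened + timedelta(minutes=55),
    }
    print(p1_breaches(opened, timeline))  # ['mitigation_started']
```

A single P1 that runs to the full 60-minute recovery target consumes more than a quarter of the monthly budget, which is worth keeping in mind when deciding how quickly to roll back.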