Operations Runbook

Nyk edited this page Mar 6, 2026 · 1 revision

Last reviewed: 2026-03-06 · Owner: operations

Use this page for day-2 operations and incident response.

Daily Checks

  1. /api/status reports healthy (memory and disk within limits).
  2. Scheduler jobs are running.
  3. Notifications queue is not stalled.
  4. Agent heartbeat recency is within the expected window.
  5. Error logs are not spiking.
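
The status and heartbeat checks above could be automated along the following lines. This is a minimal sketch, not the project's actual tooling: the payload keys `memory_pct` and `disk_pct`, the threshold values, and the five-minute heartbeat window are all assumptions, since this page does not define the /api/status response shape or the limits.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; the runbook does not state the actual limits.
MEMORY_LIMIT_PCT = 90.0
DISK_LIMIT_PCT = 85.0
HEARTBEAT_WINDOW = timedelta(minutes=5)


def daily_check_failures(status, last_heartbeat, now):
    """Return a list of failure descriptions; an empty list means all passed.

    `status` is an assumed payload shape such as
    {"memory_pct": 41.0, "disk_pct": 63.5}. Adapt the keys to the real
    /api/status response. A missing key is treated as a failure.
    """
    failures = []
    if status.get("memory_pct", 100.0) > MEMORY_LIMIT_PCT:
        failures.append("memory above limit")
    if status.get("disk_pct", 100.0) > DISK_LIMIT_PCT:
        failures.append("disk above limit")
    if now - last_heartbeat > HEARTBEAT_WINDOW:
        failures.append("agent heartbeat is stale")
    return failures
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.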

Weekly Checks

  1. Run a backup integrity test for .data.
  2. Review webhook retry/circuit-breaker behavior.
  3. Review access requests and admin accounts.
  4. Review token cost anomalies.
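
One way to approach the backup integrity test is to restore the backup to a scratch directory and compare file checksums against a snapshot of .data. The sketch below assumes the backup restores to a plain directory tree; the function names and the choice of SHA-256 are illustrative, not part of the existing tooling.

```python
import hashlib
from pathlib import Path


def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def backup_mismatches(source_dir, restored_dir):
    """Return relative paths that are missing or differ between the two trees."""
    src = manifest(source_dir)
    dst = manifest(restored_dir)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

An empty result means the restored copy matches the source byte for byte. Files are read fully into memory here, so large files may call for chunked hashing, and a live .data directory that is still being written can differ from its backup for legitimate reasons.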

Incident Priorities

  • P1: Login unavailable, data loss risk, task board unusable.
  • P2: Gateway degraded, delayed events, intermittent write errors.
  • P3: Non-blocking panel/API regressions.

Triage Checklist

  1. Confirm scope (single user, single tenant, global).
  2. Check server logs and recent deploy/change.
  3. Verify DB lock/contention status.
  4. Verify gateway connectivity and allowed origins.
  5. Apply rollback if user-facing impact is sustained.

Common Recovery Actions

  • Restart app process/container.
  • Restart gateway and reconnect.
  • Free a stale lock holder (.data allows a single writer).
  • Rotate the API key if an auth compromise is suspected.

Operational SLO Targets (Initial)

  • API availability: 99.5% monthly.
  • P1 acknowledgement: 10 minutes.
  • P1 mitigation started: 20 minutes.
  • Recovery target (P1): 60 minutes.
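
To make these targets concrete, the sketch below converts the availability target into a monthly downtime budget and turns the P1 targets into deadlines. It assumes each P1 target is measured from the moment the incident is detected, which this page does not state; the function and constant names are illustrative.

```python
from datetime import datetime, timedelta

# P1 targets from this page, in minutes. Measuring them from detection
# time is an assumption.
P1_TARGETS_MINUTES = {
    "acknowledged": 10,
    "mitigation_started": 20,
    "recovered": 60,
}


def downtime_budget_minutes(availability_pct, days_in_month):
    """Minutes of downtime allowed in a month at the given availability."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


def p1_deadlines(detected_at):
    """Return the deadline for each P1 target, counted from detection."""
    return {
        name: detected_at + timedelta(minutes=minutes)
        for name, minutes in P1_TARGETS_MINUTES.items()
    }
```

For a 30-day month, `downtime_budget_minutes(99.5, 30)` gives 216 minutes, or about 3.6 hours of allowed downtime. A single P1 that runs to the full 60-minute recovery target consumes more than a quarter of that budget.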
