feat(ops): Build a /status health dashboard page showing live system component health, event throughput, and translation success rate

Body:
Summary
Open-Audit already exposes Prometheus metrics at /metrics and has a circuit breaker and rate limiter in lib/resilience/. However, there is no human-readable status page that shows the health of the system at a glance — not just for operators, but for any contributor or user who wants to understand whether the system is working correctly. This issue adds a public /status page and the backend API that powers it.
Required work
New API route: app/api/status/route.ts
Return a structured JSON health report:
{
  "status": "healthy" | "degraded" | "down",
  "timestamp": "ISO8601",
  "components": {
    "stellarRpc": {
      "status": "healthy" | "degraded" | "down",
      "latencyMs": 142,
      "lastChecked": "ISO8601",
      "circuitBreakerState": "closed" | "open" | "half-open"
    },
    "database": {
      "status": "healthy" | "down",
      "latencyMs": 8,
      "lastChecked": "ISO8601"
    },
    "redis": {
      "status": "healthy" | "down" | "not-configured",
      "latencyMs": 2,
      "lastChecked": "ISO8601"
    },
    "worker": {
      "status": "healthy" | "down" | "not-configured",
      "lastHeartbeat": "ISO8601"
    }
  },
  "metrics": {
    "eventsIndexedLast1h": 1452,
    "eventsIndexedLast24h": 18934,
    "translationSuccessRate1h": 0.94,
    "translationSuccessRate24h": 0.97,
    "averageTranslationLatencyMs": 12,
    "activeWebSocketConnections": 7
  }
}
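The report shape above could be captured as TypeScript types so the route and the page share one contract. This is only a sketch: the type names are hypothetical, the optional fields (latencyMs on Redis, lastHeartbeat on the worker) assume those values are omitted when the component is not configured, and the timestamps in the example are placeholders.

```typescript
// Hypothetical type names; field names and allowed values mirror the JSON report.
type OverallStatus = "healthy" | "degraded" | "down";
type CircuitBreakerState = "closed" | "open" | "half-open";

interface StellarRpcHealth {
  status: OverallStatus;
  latencyMs: number;
  lastChecked: string; // ISO 8601
  circuitBreakerState: CircuitBreakerState;
}

interface DatabaseHealth {
  status: "healthy" | "down";
  latencyMs: number;
  lastChecked: string;
}

interface RedisHealth {
  status: "healthy" | "down" | "not-configured";
  latencyMs?: number; // assumption: omitted when Redis is not configured
  lastChecked: string;
}

interface WorkerHealth {
  status: "healthy" | "down" | "not-configured";
  lastHeartbeat?: string; // assumption: omitted when no heartbeat exists
}

interface StatusMetrics {
  eventsIndexedLast1h: number;
  eventsIndexedLast24h: number;
  translationSuccessRate1h: number; // fraction between 0 and 1
  translationSuccessRate24h: number;
  averageTranslationLatencyMs: number;
  activeWebSocketConnections: number;
}

interface StatusReport {
  status: OverallStatus;
  timestamp: string;
  components: {
    stellarRpc: StellarRpcHealth;
    database: DatabaseHealth;
    redis: RedisHealth;
    worker: WorkerHealth;
  };
  metrics: StatusMetrics;
}

// Example instance using the sample numbers from the report above.
const placeholderTime = "2024-01-01T00:00:00.000Z";
const exampleReport: StatusReport = {
  status: "healthy",
  timestamp: placeholderTime,
  components: {
    stellarRpc: { status: "healthy", latencyMs: 142, lastChecked: placeholderTime, circuitBreakerState: "closed" },
    database: { status: "healthy", latencyMs: 8, lastChecked: placeholderTime },
    redis: { status: "healthy", latencyMs: 2, lastChecked: placeholderTime },
    worker: { status: "healthy", lastHeartbeat: placeholderTime },
  },
  metrics: {
    eventsIndexedLast1h: 1452,
    eventsIndexedLast24h: 18934,
    translationSuccessRate1h: 0.94,
    translationSuccessRate24h: 0.97,
    averageTranslationLatencyMs: 12,
    activeWebSocketConnections: 7,
  },
};
```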
Implementation:

Stellar RPC health: ping getLatestLedger and record latency; read circuit breaker state from lib/resilience/circuit-breaker.ts
Database health: run SELECT 1 via Prisma and record latency
Redis health: ping Redis if REDIS_URL is set; return "not-configured" otherwise
Worker health: the indexer worker writes a heartbeat key to Redis every 30 seconds; the status API reads it and marks the worker "down" if the key is older than 90 seconds
Metrics: query the database for event counts and translation outcomes in the last 1h and 24h windows
Overall status is "down" if Stellar RPC or the database is unreachable; otherwise "degraded" if any configured component is degraded or down but the system is still partially functional; otherwise "healthy" (all configured components are healthy)
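The checks above share a pattern: run a probe, time it, and treat a failure or timeout as "down". A minimal sketch of that pattern and of the overall-status rule follows. The names timedCheck and overallStatus and the 400 ms default timeout are assumptions, not existing project code. The probe is passed in as a function, so the real getLatestLedger, Prisma, or Redis call can be supplied by the caller. The sketch also assumes a failed non-critical component (Redis or worker) yields "degraded".

```typescript
type ComponentStatus = "healthy" | "degraded" | "down" | "not-configured";

interface CheckResult {
  status: "healthy" | "down";
  latencyMs: number;
  lastChecked: string; // ISO 8601
}

// Runs a probe and records its latency. A rejection, a thrown error,
// or exceeding timeoutMs all count as "down".
async function timedCheck(
  probe: () => Promise<unknown>,
  timeoutMs: number = 400, // assumed default, chosen to fit the 500ms budget
): Promise<CheckResult> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("probe timed out")), timeoutMs);
  });
  let status: "healthy" | "down" = "healthy";
  try {
    await Promise.race([probe(), timeout]);
  } catch {
    status = "down";
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
  return {
    status,
    latencyMs: Date.now() - started,
    lastChecked: new Date().toISOString(),
  };
}

interface ComponentStatuses {
  stellarRpc: ComponentStatus;
  database: ComponentStatus;
  redis: ComponentStatus;
  worker: ComponentStatus;
}

// Critical components (Stellar RPC, database) being down means "down".
// Otherwise any configured component that is not healthy means "degraded".
function overallStatus(c: ComponentStatuses): "healthy" | "degraded" | "down" {
  if (c.stellarRpc === "down" || c.database === "down") return "down";
  const configured = [c.stellarRpc, c.database, c.redis, c.worker].filter(
    (s) => s !== "not-configured",
  );
  return configured.every((s) => s === "healthy") ? "healthy" : "degraded";
}
```

Running the four probes with Promise.all keeps the total response time close to the slowest single check rather than the sum of all of them.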

New page: app/status/page.tsx

Server-rendered page that fetches /api/status on load and refreshes every 30 seconds. Because setInterval only runs in the browser, put the timer in a small client component ("use client") that calls router.refresh()
Display each component as a status row with a green/amber/red indicator dot, component name, latency, and last-checked time
Display the metrics section as a simple stats grid (events per hour, translation success rate, active connections)
Show a banner at the top: "All systems operational" (green), "Partial outage" (amber), or "Major outage" (red)
The page must render meaningfully even when the API is partially down — show what is available and grey out what is not
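The mapping from status values to banner text and indicator colours can live in pure functions, which keeps the page component thin and easy to unit test. This is a sketch: the function names are hypothetical, and the "Status unavailable" text and the grey treatment of "not-configured" are assumptions about how unavailable data should be shown.

```typescript
type OverallStatus = "healthy" | "degraded" | "down";
type ComponentStatus = OverallStatus | "not-configured";
type IndicatorColor = "green" | "amber" | "red" | "grey";

interface Banner {
  text: string;
  color: IndicatorColor;
}

// Banner for the top of the page. An undefined status means the API
// response was missing or unusable, so the banner is greyed out.
function bannerFor(status: OverallStatus | undefined): Banner {
  switch (status) {
    case "healthy":
      return { text: "All systems operational", color: "green" };
    case "degraded":
      return { text: "Partial outage", color: "amber" };
    case "down":
      return { text: "Major outage", color: "red" };
    default:
      return { text: "Status unavailable", color: "grey" }; // assumed wording
  }
}

// Indicator dot colour for a single component row.
function dotColor(status: ComponentStatus | undefined): IndicatorColor {
  switch (status) {
    case "healthy":
      return "green";
    case "degraded":
      return "amber";
    case "down":
      return "red";
    default:
      return "grey"; // not configured, or data unavailable
  }
}
```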

Worker heartbeat:

Add a setInterval in src/worker/indexer.ts that writes HSET open-audit:worker:heartbeat lastSeen <timestamp> to Redis every 30 seconds
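A sketch of the heartbeat writer and the matching staleness check is below. The HeartbeatStore interface is a hypothetical stand-in for whichever Redis client the project uses, so the real client's method names may differ. The sketch also assumes the timestamp is stored as an ISO 8601 string, matching the lastHeartbeat field in the status response.

```typescript
const HEARTBEAT_KEY = "open-audit:worker:heartbeat";
const HEARTBEAT_FIELD = "lastSeen";
const STALE_AFTER_MS = 90000; // worker is "down" once the heartbeat is older than 90s

// Hypothetical minimal view of a Redis client; adapt to the real client.
interface HeartbeatStore {
  hset(key: string, field: string, value: string): Promise<unknown>;
  hget(key: string, field: string): Promise<string | null>;
}

// Writes a heartbeat immediately and then every intervalMs.
// Returns a function that stops the timer (useful on worker shutdown).
function startHeartbeat(store: HeartbeatStore, intervalMs: number = 30000): () => void {
  const beat = (): void => {
    store
      .hset(HEARTBEAT_KEY, HEARTBEAT_FIELD, new Date().toISOString())
      .catch(() => {
        // A failed write is ignored; the heartbeat simply goes stale.
      });
  };
  beat();
  const timer = setInterval(beat, intervalMs);
  return () => clearInterval(timer);
}

// Classifies the worker from the stored timestamp. A missing or
// unparseable value is treated as "down".
function workerStatusFrom(lastSeen: string | null, nowMs: number): "healthy" | "down" {
  if (lastSeen === null) return "down";
  const seenMs = Date.parse(lastSeen);
  if (isNaN(seenMs)) return "down";
  return nowMs - seenMs > STALE_AFTER_MS ? "down" : "healthy";
}
```

With a 30-second interval and a 90-second threshold, the worker can miss two consecutive writes before the status API reports it as "down".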

Acceptance criteria

GET /api/status returns a valid JSON response in under 500ms under normal conditions
Each component's health reflects its actual state: simulate a Stellar RPC timeout and confirm stellarRpc.status becomes "down" and overall status becomes "down", since an unreachable Stellar RPC is one of the two conditions that make the whole system "down"
The /status page renders correctly in a browser with all component rows and metric stats visible
The page auto-refreshes every 30 seconds without a full page reload
Worker heartbeat is written to Redis correctly and the status API correctly detects a stale heartbeat as "down"
Unit tests cover: all-healthy response shape, degraded when one component fails, down when database is unreachable, stale worker heartbeat detection
npm run lint and npm test pass with no regressions

feat(ops): Build a /status health dashboard page showing live system component health, event throughput, and translation success rate #236
