This document describes the design, implementation, and operational integration of the production-grade health check probes (Liveness and Readiness) developed for the CarbonScribe Corporate Platform Backend.
Standard health checks often only verify that the backend process is running, failing to identify silent dependency outages. To ensure production stability, automated recovery, and orchestration integration (e.g., Kubernetes, ECS), this implementation separates health checks into two distinct probes:
- Liveness Probe (`GET /health/liveness`): A quick, non-blocking check that confirms the NestJS application process is running and that its event loop can still handle incoming requests.
- Readiness Probe (`GET /health/readiness`): A comprehensive check verifying the real status and reachability of all external dependencies. It returns `200 OK` only if all critical dependencies are healthy, and `503 Service Unavailable` with details if any dependency is down or degraded.
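The split between the two probes can be sketched as below. This is a minimal illustration rather than the platform's controller code: `CheckResult`, `livenessBody`, and `readinessHttpStatus` are hypothetical names, and treating a `disabled` dependency as non-blocking is an assumption based on how the Kafka check is described in this document.

```typescript
// Hypothetical shapes for illustration; the real HealthService types may differ.
type CheckStatus = 'healthy' | 'unhealthy' | 'degraded' | 'disabled';

interface CheckResult {
  status: CheckStatus;
  latencyMs?: number;
  details?: string;
  error?: string;
}

// Liveness never touches dependencies: it only reports that the process can respond.
function livenessBody(now: Date) {
  return {
    status: 'healthy',
    timestamp: now.toISOString(),
    service: 'corporate-platform-backend',
    liveness: 'up',
  };
}

// Readiness is 200 only when no dependency is down or degraded.
// Assumption: a dependency reported as 'disabled' does not block readiness.
function readinessHttpStatus(checks: Record<string, CheckResult>): 200 | 503 {
  const blocking = Object.keys(checks).some((name) => {
    const status = checks[name].status;
    return status === 'unhealthy' || status === 'degraded';
  });
  return blocking ? 503 : 200;
}
```

Keeping the status decision in a pure function like this makes it easy to unit test without starting any real dependency.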
To prevent health checks from causing cascading failures or degrading application throughput under load, the readiness probe is built with the following performance guarantees:
- Parallel Execution (Bounded Time): Rather than executing dependency checks sequentially ($O(N)$ in the number of dependencies), checks run concurrently using `Promise.all`. The total execution time is bounded by the slowest single service timeout ($O(\max_i t_i)$) rather than the sum of all timeouts, so with fixed timeouts it stays effectively constant ($O(1)$) as dependencies are added.
- Fast Timeouts & Context Cancellation: Every external request (Database, Redis, Kafka, IPFS, Stellar) is protected by a strict `Promise.race` timeout (2–3 seconds). This guarantees the readiness probe responds quickly, even if a dependency is completely unresponsive. `Promise.race` only stops the probe from waiting; aborting the in-flight request itself depends on the client's own timeout, such as the axios `timeout` option.
- Space Complexity ($O(1)$): Checks are ephemeral. They keep no state between probe invocations and reuse the existing shared client connection pools.
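The combination of per-check timeouts and concurrent execution can be sketched as follows. The helper names `withTimeout` and `runChecks`, the `CheckOutcome` type, and the error message format are hypothetical; clearing the timer after the race settles is an addition in this sketch and is not claimed to be part of the actual implementation.

```typescript
type CheckOutcome = { status: 'healthy' } | { status: 'unhealthy'; error: string };

// Hypothetical helper: rejects if `work` does not settle within `ms` milliseconds.
// The timer is cleared once the race settles so no timers pile up between probes.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} check timed out`)), ms);
  });
  return Promise.race([work, timeout]).then(
    (value) => { clearTimeout(timer); return value; },
    (err) => { clearTimeout(timer); throw err; },
  );
}

// Starts every check at once. Each check is capped individually, so the total
// wait is roughly the slowest single timeout rather than the sum of all of them.
async function runChecks(
  checks: Record<string, { run: () => Promise<unknown>; timeoutMs: number }>,
): Promise<Record<string, CheckOutcome>> {
  const names = Object.keys(checks);
  const outcomes = await Promise.all(
    names.map((name): Promise<CheckOutcome> =>
      // Promise.resolve().then(...) also converts a synchronous throw into a rejection.
      withTimeout(Promise.resolve().then(() => checks[name].run()), checks[name].timeoutMs, name).then(
        (): CheckOutcome => ({ status: 'healthy' }),
        (err: unknown): CheckOutcome => ({
          status: 'unhealthy',
          error: err instanceof Error ? err.message : String(err),
        }),
      ),
    ),
  );
  const result: Record<string, CheckOutcome> = {};
  names.forEach((name, i) => { result[name] = outcomes[i]; });
  return result;
}
```

Because each check converts its own failure into an `unhealthy` outcome, `Promise.all` never rejects early and every dependency is always reported.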
Every critical dependency check is implemented cleanly within the standalone HealthService domain module:

Database (PostgreSQL via Prisma)
- Strategy: Executes a lightweight `SELECT 1` query to verify that the Prisma connection pool and the target PostgreSQL database are actively processing queries.
- Timeout: 2 seconds.
- Implementation snippet:

      await Promise.race([
        this.prisma.$queryRaw`SELECT 1`,
        new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 2000)),
      ]);

Redis
- Strategy: Fetches the active `ioredis` client and executes an explicit `.ping()` call to verify connection status, rather than relying solely on connection lifecycle flags.
- Timeout: 2 seconds.
- Implementation snippet:

      await Promise.race([
        client.ping(),
        new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 2000)),
      ]);

Kafka
- Strategy: Uses the active `kafkajs` Admin client to fetch topic metadata (`fetchTopicMetadata({ topics: [] })`). If Kafka is explicitly disabled in the configuration (e.g. local-only in dev), the check reports the dependency as `disabled` and does not block readiness.
- Timeout: 3 seconds.
- Implementation snippet:

      await Promise.race([
        admin.fetchTopicMetadata({ topics: [] }),
        new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 3000)),
      ]);
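
The disabled-by-configuration path for Kafka might look like the sketch below. The `kafkaEnabled` flag, the `KafkaCheckResult` type, the `TopicMetadataFetcher` interface (a stand-in for the part of the admin client the check uses), and the error message are all hypothetical; only the `fetchTopicMetadata({ topics: [] })` call and the 3 second timeout come from the description above.

```typescript
// Hypothetical result shape; names are illustrative only.
type KafkaCheckResult =
  | { status: 'disabled' }
  | { status: 'healthy'; latencyMs: number }
  | { status: 'unhealthy'; error: string };

// Minimal stand-in for the part of the admin client the check needs.
interface TopicMetadataFetcher {
  fetchTopicMetadata(options: { topics: string[] }): Promise<unknown>;
}

async function checkKafka(
  kafkaEnabled: boolean,
  admin: TopicMetadataFetcher,
  timeoutMs = 3000,
): Promise<KafkaCheckResult> {
  // Short-circuit before any network call when Kafka is switched off.
  if (!kafkaEnabled) return { status: 'disabled' };

  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    await Promise.race([
      admin.fetchTopicMetadata({ topics: [] }),
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('Kafka check timed out')), timeoutMs);
      }),
    ]);
    return { status: 'healthy', latencyMs: Date.now() - started };
  } catch (err) {
    return { status: 'unhealthy', error: err instanceof Error ? err.message : String(err) };
  } finally {
    // Avoid leaving a pending timer behind after the check has settled.
    clearTimeout(timer);
  }
}
```

Returning a distinct `disabled` status, instead of `healthy`, keeps the readiness payload honest about which dependencies were actually exercised.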

IPFS (Pinata)
- Strategy: Issues an HTTP GET request to the Pinata gateway authentication test endpoint (`https://api.pinata.cloud/data/testAuthentication`). To avoid failing readiness tests in development or staging where mock credentials are used, any response from the remote API (including HTTP 401 or 403) is classified as reachable, because it proves network connectivity is up.
- Timeout: 2 seconds.
- Implementation snippet:

      try {
        await Promise.race([
          axios.get('https://api.pinata.cloud/data/testAuthentication', { headers, timeout: 2000 }),
          new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 2000)),
        ]);
      } catch (err) {
        // Any HTTP response (even 401/403) means the gateway is reachable.
        if (err.response) {
          return { status: 'healthy', details: `Reachable (HTTP Status: ${err.response.status})` };
        }
        throw err;
      }
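
The "any response counts as reachable" rule can be isolated into a small pure function, sketched here without importing axios. `HttpLikeError` is a structural stand-in for the fields of an axios error that the rule reads, and `classifyGatewayError` and `ReachabilityResult` are hypothetical names.

```typescript
// Structural stand-in for the parts of an axios error this sketch reads.
interface HttpLikeError {
  response?: { status: number };
  message?: string;
}

type ReachabilityResult =
  | { status: 'healthy'; details: string }
  | { status: 'unhealthy'; error: string };

// Any HTTP response, even 401 or 403, proves the network path works.
// Errors without a response (DNS failure, refused connection, timeout) count as down.
function classifyGatewayError(err: unknown): ReachabilityResult {
  const e = (typeof err === 'object' && err !== null ? err : {}) as HttpLikeError;
  if (e.response && typeof e.response.status === 'number') {
    return { status: 'healthy', details: `Reachable (HTTP Status: ${e.response.status})` };
  }
  return { status: 'unhealthy', error: e.message ?? String(err) };
}
```

With this shape, a rejected request carrying `{ response: { status: 401 } }` is reported as healthy, while a plain `Error('Timeout')` is reported as unhealthy.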

Stellar (Soroban RPC)
- Strategy: Leverages the `rpc.Server` connection in `SorobanService` to retrieve the latest ledger sequence (`rpcClient.getLatestLedger()`). This verifies live network reachability and RPC server responsiveness.
- Timeout: 2 seconds.
- Implementation snippet:

      await Promise.race([
        rpcClient.getLatestLedger(),
        new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 2000)),
      ]);

The system has been fully integrated into the existing modular codebase across the following paths:
- Service Logic: `src/health/health.service.ts`
- HTTP Controller Routing: `src/health/health.controller.ts`
- NestJS Module Setup: `src/health/health.module.ts`
- System Integration: Registered within the main `src/app.module.ts` import list.
- Unit and Integration Specs:
  - `src/health/health.service.spec.ts`
  - `src/health/health.controller.spec.ts`

Liveness probe (`GET /health/liveness`)
- HTTP Status: `200 OK`
- JSON Payload:

      {
        "status": "healthy",
        "timestamp": "2026-05-26T19:00:00.000Z",
        "service": "corporate-platform-backend",
        "liveness": "up"
      }

Readiness probe (`GET /health/readiness`), all dependencies healthy
- HTTP Status: `200 OK`
- JSON Payload:

      {
        "status": "healthy",
        "timestamp": "2026-05-26T19:00:02.000Z",
        "version": "0.0.1",
        "uptimeSeconds": 120,
        "checks": {
          "database": { "status": "healthy", "latencyMs": 14 },
          "redis": { "status": "healthy", "latencyMs": 8 },
          "kafka": { "status": "healthy", "latencyMs": 45 },
          "ipfs": { "status": "healthy", "latencyMs": 110, "details": "Reachable (HTTP Status: 401)" },
          "stellar": { "status": "healthy", "latencyMs": 95 }
        }
      }

Readiness probe (`GET /health/readiness`), dependency failure
- HTTP Status: `503 Service Unavailable`
- JSON Payload:

      {
        "status": "unhealthy",
        "timestamp": "2026-05-26T19:01:10.000Z",
        "version": "0.0.1",
        "uptimeSeconds": 188,
        "checks": {
          "database": { "status": "unhealthy", "error": "Database check timed out" },
          "redis": { "status": "healthy", "latencyMs": 7 },
          "kafka": { "status": "healthy", "latencyMs": 32 },
          "ipfs": { "status": "healthy", "latencyMs": 89, "details": "Reachable (HTTP Status: 401)" },
          "stellar": { "status": "unhealthy", "error": "Stellar RPC request timed out" }
        }
      }

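
The per-dependency entries and the top-level fields in the readiness payloads could be assembled along the following lines. This is a sketch under assumptions: `measure`, `buildReadinessPayload`, and `DependencyReport` are hypothetical names, and the version and uptime are passed in as plain arguments rather than read from the running process.

```typescript
interface DependencyReport {
  status: 'healthy' | 'unhealthy';
  latencyMs?: number;
  details?: string;
  error?: string;
}

// Times a single probe call and converts success or failure into a report entry.
async function measure(probe: () => Promise<unknown>): Promise<DependencyReport> {
  const started = Date.now();
  try {
    await probe();
    return { status: 'healthy', latencyMs: Date.now() - started };
  } catch (err) {
    return { status: 'unhealthy', error: err instanceof Error ? err.message : String(err) };
  }
}

// Overall status is healthy only if every reported dependency is healthy.
function buildReadinessPayload(
  checks: Record<string, DependencyReport>,
  version: string,
  uptimeSeconds: number,
  now: Date,
) {
  const allHealthy = Object.keys(checks).every((name) => checks[name].status === 'healthy');
  return {
    status: allHealthy ? 'healthy' : 'unhealthy',
    timestamp: now.toISOString(),
    version,
    uptimeSeconds,
    checks,
  };
}
```

Passing the clock and uptime in as arguments keeps the payload builder deterministic, which simplifies assertions in the spec files.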
Expose these endpoints directly to orchestrators to automate traffic routing and recovery actions:
livenessProbe:
httpGet:
path: /health/liveness
port: 8080
initialDelaySeconds: 15
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/readiness
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3
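
As a sanity check on these settings, the sketch below compares the readiness endpoint's worst-case response time with the probe's `timeoutSeconds`, and estimates how long a failing pod may keep receiving traffic. The helper names are hypothetical, and the second figure is only a rough upper bound that ignores kubelet scheduling jitter.

```typescript
// Field names mirror the probe configuration above; the helpers are hypothetical.
interface ProbeSettings {
  periodSeconds: number;
  timeoutSeconds: number;
  failureThreshold: number;
}

// Checks run in parallel, so the endpoint's worst-case response time is the
// slowest dependency timeout. It should fit inside the probe's own timeout.
function fitsProbeTimeout(dependencyTimeoutsMs: number[], probe: ProbeSettings): boolean {
  const worstCaseMs = Math.max(...dependencyTimeoutsMs);
  return worstCaseMs < probe.timeoutSeconds * 1000;
}

// Rough upper bound on the time until the pod is marked unready after a dependency
// fails: failureThreshold probe periods, plus one probe timeout for the last attempt.
function approxSecondsUntilUnready(probe: ProbeSettings): number {
  return probe.failureThreshold * probe.periodSeconds + probe.timeoutSeconds;
}

const readinessProbe: ProbeSettings = { periodSeconds: 15, timeoutSeconds: 5, failureThreshold: 3 };
// Dependency timeouts from this document: database, redis, kafka, ipfs, stellar.
const dependencyTimeoutsMs = [2000, 2000, 3000, 2000, 2000];

console.log(fitsProbeTimeout(dependencyTimeoutsMs, readinessProbe)); // true: 3000 ms < 5000 ms
console.log(approxSecondsUntilUnready(readinessProbe)); // 50
```

If a dependency timeout were ever raised above the probe's `timeoutSeconds`, the orchestrator would start counting slow responses as probe failures before the endpoint had a chance to report details.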