
feat(server): DB health + pool/webhook metrics endpoints (PR3/3) #66

Open

gavinelder wants to merge 1 commit into pr2-postgres-tls-pool-tuning from pr3-postgres-health-metrics

Conversation

@gavinelder (Collaborator)

Summary

PR 3 of 3: the final leg of the postgres enterprise hardening plan. Stacked on #65 and #64; please review/merge those first.

Adds four operational endpoints on the existing HTTP server, wires Kubernetes probes to them, and ships a postgres operations runbook.

| Endpoint | Purpose | Probe target |
|---|---|---|
| GET /healthz | Process liveness, no DB dependency. Returns 200 whenever the server can respond. | livenessProbe |
| GET /healthz/db | Pings the pool with a 750ms timeout. Returns 200 or 503. | readinessProbe |
| GET /metrics/db | pgxpool.Stat() snapshot as JSON. | (scrape) |
| GET /metrics/webhook | BatchServiceAdapter.GetMetrics() snapshot. | (scrape) |
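
For orientation, here is a minimal sketch of the /healthz/db behavior. The identifiers (dbHealthHandler, the handler-constructor shape) are illustrative assumptions, not the exact code in this PR; ErrorResponse stands in for the existing struct reused for 503 bodies.

```go
package server

import (
	"context"
	"encoding/json"
	"log/slog"
	"net/http"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// ErrorResponse mirrors the struct the webhook handler already uses.
type ErrorResponse struct {
	Error string `json:"error"`
}

// dbHealthHandler pings the pool with a 750ms budget so the probe
// completes inside the Kubernetes default 1s probe timeout.
func dbHealthHandler(pool *pgxpool.Pool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")

		ctx, cancel := context.WithTimeout(r.Context(), 750*time.Millisecond)
		defer cancel()

		if pool == nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(ErrorResponse{Error: "database unavailable"})
			return
		}
		if err := pool.Ping(ctx); err != nil {
			// Log the pgx error server-side only; the probe body stays
			// generic so connection details (host, user) never leak.
			slog.Error("db health check failed", "error", err)
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(ErrorResponse{Error: "database unavailable"})
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	}
}
```

The nil-pool branch is what the unit tests in the test plan below exercise.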

These routes are registered before ignoreUserAgentMiddleware so a probe client cannot be silenced by an operator's --ignored-user-agent flag. 503 responses reuse the existing ErrorResponse struct for consistency with the webhook handler.
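
Continuing the sketch above, the registration order is the important property. Again, newRouter, healthHandler, dbMetricsHandler, and webhookMetricsHandler are assumptions about shape, not the literal identifiers in the PR:

```go
// Operational routes are registered directly on the mux, outside the
// user-agent filter, so --ignored-user-agent can never silence a
// kubelet probe or a metrics scraper.
func newRouter(pool *pgxpool.Pool, app http.Handler, ignored []string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthHandler)
	mux.HandleFunc("/healthz/db", dbHealthHandler(pool))
	mux.HandleFunc("/metrics/db", dbMetricsHandler(pool))
	mux.HandleFunc("/metrics/webhook", webhookMetricsHandler)

	// Everything else falls through to the application handler, which
	// is the only part wrapped in ignoreUserAgentMiddleware.
	mux.Handle("/", ignoreUserAgentMiddleware(app, ignored))
	return mux
}
```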

manifests/deployment.yml gains livenessProbe (/healthz) and readinessProbe (/healthz/db) on the staticreg container.

Documentation

  • docs/CONFIGURATION.md — adds Health/Metrics endpoints reference and a Production Checklist.
  • docs/ARCHITECTURE.md — refreshes the database section with schema-isolation, goose migrations, and the new health/observability endpoints.
  • docs/POSTGRES_OPERATIONS.md — new runbook covering: schema layout, inspecting state via SQL/HTTP, adding new migrations, rolling back via the goose CLI image, pool-tuning guidance, and the v0.7 → v0.8 data-consolidation procedure operators may want to run after PR1 lands.

Behavior change

Adding probes is opt-in for non-Kubernetes deployments (the YAML changes are inert if you don't use that manifest). For deployments using manifests/deployment.yml directly, Kubernetes will mark a pod unready and stop routing traffic to it while /healthz/db returns 503. This is the desired behavior: it drains traffic from a pod whose DB connection is broken.

Test plan

  • go build ./... && go vet ./... clean

  • go test ./... (unit) passes; new tests cover all four handler nil-paths plus the no-pool webhook-adapter happy path (a sketch of the nil-pool test shape follows this list)

  • go test -tags=integration ./pkg/server/... passes against docker compose up postgres (verifies /healthz/db returns 200 and /metrics/db exposes the expected stat keys with a live pool)

  • End-to-end smoke test: built binary, started against compose postgres, hit all four endpoints with curl. Confirmed 200 status and correct JSON shapes:

    /healthz         → {"status":"ok"}
    /healthz/db      → {"status":"ok"}
    /metrics/db      → {"acquired_conns":0,"idle_conns":3,"max_conns":25,...}
    /metrics/webhook → {"events_received":0,"events_dropped":0,...}
    
  • Reviewer to verify a staging k8s rollout: probe failure when postgres is killed, recovery after restoration; /metrics/db consumed by an existing scraper.
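
As referenced in the unit-test bullet, the nil-pool tests are essentially this shape, reusing the hypothetical dbHealthHandler from the sketch above:

```go
package server

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// A server started without a database pool must degrade to 503 on the
// readiness endpoint rather than panic.
func TestDBHealthHandler_NilPool(t *testing.T) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz/db", nil)

	dbHealthHandler(nil).ServeHTTP(rec, req)

	if rec.Code != http.StatusServiceUnavailable {
		t.Fatalf("status = %d, want %d", rec.Code, http.StatusServiceUnavailable)
	}
}
```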

🤖 Generated with Claude Code

@gavinelder force-pushed the pr2-postgres-tls-pool-tuning branch from 26ee276 to 127c327 on April 27, 2026 at 11:31
@gavinelder force-pushed the pr3-postgres-health-metrics branch from 3d5deae to a95bcd0 on April 27, 2026 at 11:33
@gavinelder marked this pull request as ready for review on April 27, 2026 at 11:36
@gavinelder force-pushed the pr2-postgres-tls-pool-tuning branch from 127c327 to 2463758 on April 27, 2026 at 11:42
Operators have no visibility into the postgres pool's health and no
Kubernetes-friendly probe targets. Add four endpoints on the existing
HTTP server, wire k8s probes to them, and document end-to-end ops
guidance.

- GET /healthz: process liveness, no DB dependency. Safe for k8s
  livenessProbe — a transient DB outage does not trigger restarts.
- GET /healthz/db: pings the pool with a 750ms timeout (fits inside k8s'
  default 1s probe timeout). 200 if healthy, 503 with a generic
  "database unavailable" otherwise. The pgx error is logged
  server-side via slog so the probe response cannot leak connection
  details (host, user). Wired to k8s readinessProbe so a pod loses
  traffic when its DB connection breaks.
- GET /metrics/db: pgxpool.Stat() snapshot as JSON (acquired/idle/max
  conns, acquire counts, lifetime destroy counts); a shape sketch
  follows this list. The same shape can be wrapped in a Prometheus
  collector later.
- GET /metrics/webhook: BatchServiceAdapter.GetMetrics() snapshot
  (events received/dropped/flushed, batches, queue size).
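
A shape sketch of that /metrics/db snapshot, continuing the earlier sketch's package and using the field names from the test plan's curl output (the exact struct in the PR may differ):

```go
// dbMetricsHandler serializes a point-in-time pgxpool.Stat() snapshot.
// Field names follow the curl output in the test plan; treat them as
// illustrative rather than the PR's exact schema.
func dbMetricsHandler(pool *pgxpool.Pool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if pool == nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(ErrorResponse{Error: "database unavailable"})
			return
		}
		s := pool.Stat()
		json.NewEncoder(w).Encode(map[string]int64{
			"acquired_conns":             int64(s.AcquiredConns()),
			"idle_conns":                 int64(s.IdleConns()),
			"max_conns":                  int64(s.MaxConns()),
			"acquire_count":              s.AcquireCount(),
			"max_lifetime_destroy_count": s.MaxLifetimeDestroyCount(),
			"max_idle_destroy_count":     s.MaxIdleDestroyCount(),
		})
	}
}
```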

Endpoints register before ignoreUserAgentMiddleware so probes and
scrapers cannot be silenced by an operator's --ignored-user-agent flag.
503 responses reuse the existing ErrorResponse struct for consistency
with the webhook handler. Integration test uses pkg/db/dbtest.SetEnv
(introduced in PR1) to share the postgres-env bootstrap.

Add unit tests (nil-pool / nil-adapter paths, no-pool adapter happy
path) and an integration test (build tag: integration) that exercises
the live pool behind /healthz/db and /metrics/db.

Update manifests/deployment.yml with livenessProbe (/healthz) and
readinessProbe (/healthz/db) on the existing port. The probes use the
k8s default scheme (HTTP) since the staticreg server runs
ListenAndServe (plaintext); --tls-enable controls the upstream
registry client, not the local listener.

Add a Production Checklist and Health/Metrics Endpoints section to
docs/CONFIGURATION.md, refresh the schema-management/observability
section in docs/ARCHITECTURE.md, and add a new
docs/POSTGRES_OPERATIONS.md runbook covering migrations, rollback, pool
tuning, and the v0.7→v0.8 data consolidation procedure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gavinelder force-pushed the pr3-postgres-health-metrics branch from a95bcd0 to 16367cd on April 27, 2026 at 11:43
