
feat(server): DB health + pool/webhook metrics endpoints (PR3/3) #66

Open

gavinelder wants to merge 1 commit into pr2-postgres-tls-pool-tuning from pr3-postgres-health-metrics

Conversation

@gavinelder (Collaborator)

Summary

PR 3 of 3: the final leg of the postgres enterprise hardening plan. Stacked on #65 and #64; please review/merge those first.

Adds four operational endpoints on the existing HTTP server, wires Kubernetes probes to them, and ships a postgres operations runbook.

| Endpoint | Purpose | Probe target |
|---|---|---|
| GET /healthz | Process liveness, no DB dependency. Returns 200 whenever the server can respond. | livenessProbe |
| GET /healthz/db | Pings the pool with a 750ms timeout. Returns 200 or 503. | readinessProbe |
| GET /metrics/db | pgxpool.Stat() snapshot as JSON. | (scrape) |
| GET /metrics/webhook | BatchServiceAdapter.GetMetrics() snapshot. | (scrape) |
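
For orientation, here is a minimal sketch of the /healthz/db behavior. The identifiers (dbHealthHandler, the handler-constructor shape) are illustrative assumptions, not the exact code in this PR; ErrorResponse stands in for the existing struct reused for 503 bodies.

```go
package server

import (
	"context"
	"encoding/json"
	"log/slog"
	"net/http"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// ErrorResponse mirrors the struct the webhook handler already uses.
type ErrorResponse struct {
	Error string `json:"error"`
}

// dbHealthHandler pings the pool with a 750ms budget so the probe
// completes inside the Kubernetes default 1s probe timeout.
func dbHealthHandler(pool *pgxpool.Pool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")

		ctx, cancel := context.WithTimeout(r.Context(), 750*time.Millisecond)
		defer cancel()

		if pool == nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(ErrorResponse{Error: "database unavailable"})
			return
		}
		if err := pool.Ping(ctx); err != nil {
			// Log the pgx error server-side only; the probe body stays
			// generic so connection details (host, user) never leak.
			slog.Error("db health check failed", "error", err)
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(ErrorResponse{Error: "database unavailable"})
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	}
}
```

The nil-pool branch is what the unit tests in the test plan below exercise.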

These routes are registered before ignoreUserAgentMiddleware so a probe client cannot be silenced by an operator's --ignored-user-agent flag. 503 responses reuse the existing ErrorResponse struct for consistency with the webhook handler.
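
Continuing the sketch above, the registration order is the important property. Again, newRouter, healthHandler, dbMetricsHandler, and webhookMetricsHandler are assumptions about shape, not the literal identifiers in the PR:

```go
// Operational routes are registered directly on the mux, outside the
// user-agent filter, so --ignored-user-agent can never silence a
// kubelet probe or a metrics scraper.
func newRouter(pool *pgxpool.Pool, app http.Handler, ignored []string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthHandler)
	mux.HandleFunc("/healthz/db", dbHealthHandler(pool))
	mux.HandleFunc("/metrics/db", dbMetricsHandler(pool))
	mux.HandleFunc("/metrics/webhook", webhookMetricsHandler)

	// Everything else falls through to the application handler, which
	// is the only part wrapped in ignoreUserAgentMiddleware.
	mux.Handle("/", ignoreUserAgentMiddleware(app, ignored))
	return mux
}
```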

manifests/deployment.yml gains livenessProbe (/healthz) and readinessProbe (/healthz/db) on the staticreg container.

Documentation

  • docs/CONFIGURATION.md — adds Health/Metrics endpoints reference and a Production Checklist.
  • docs/ARCHITECTURE.md — refreshes the database section with schema-isolation, goose migrations, and the new health/observability endpoints.
  • docs/POSTGRES_OPERATIONS.md — new runbook covering: schema layout, inspecting state via SQL/HTTP, adding new migrations, rolling back via the goose CLI image, pool-tuning guidance, and the v0.7 → v0.8 data-consolidation procedure operators may want to run after PR1 lands.

Behavior change

Adding probes is opt-in for non-Kubernetes deployments (the YAML changes are inert if you don't use that manifest). For deployments using manifests/deployment.yml directly, Kubernetes will mark a pod unready and stop routing traffic to it while /healthz/db returns 503. This is the desired behavior: it drains traffic from a pod whose DB connection is broken.

Test plan

  • go build ./... && go vet ./... clean

  • go test ./... (unit) passes; new tests cover all four handler nil-paths plus the no-pool webhook-adapter happy path (a sketch of the nil-pool test shape follows this list)

  • go test -tags=integration ./pkg/server/... passes against docker compose up postgres (verifies /healthz/db returns 200 and /metrics/db exposes the expected stat keys with a live pool)

  • End-to-end smoke test: built binary, started against compose postgres, hit all four endpoints with curl. Confirmed 200 status and correct JSON shapes:

    /healthz         → {"status":"ok"}
    /healthz/db      → {"status":"ok"}
    /metrics/db      → {"acquired_conns":0,"idle_conns":3,"max_conns":25,...}
    /metrics/webhook → {"events_received":0,"events_dropped":0,...}
    
  • Reviewer to verify a staging k8s rollout: probe failure when postgres is killed, recovery after restoration; /metrics/db consumed by an existing scraper.
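
As referenced in the unit-test bullet, the nil-pool tests are essentially this shape, reusing the hypothetical dbHealthHandler from the sketch above:

```go
package server

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// A server started without a database pool must degrade to 503 on the
// readiness endpoint rather than panic.
func TestDBHealthHandler_NilPool(t *testing.T) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz/db", nil)

	dbHealthHandler(nil).ServeHTTP(rec, req)

	if rec.Code != http.StatusServiceUnavailable {
		t.Fatalf("status = %d, want %d", rec.Code, http.StatusServiceUnavailable)
	}
}
```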

🤖 Generated with Claude Code

@gavinelder force-pushed the pr2-postgres-tls-pool-tuning branch from 26ee276 to 127c327 on April 27, 2026 at 11:31
@gavinelder force-pushed the pr3-postgres-health-metrics branch from 3d5deae to a95bcd0 on April 27, 2026 at 11:33
@gavinelder marked this pull request as ready for review on April 27, 2026 at 11:36
@gavinelder force-pushed the pr2-postgres-tls-pool-tuning branch from 127c327 to 2463758 on April 27, 2026 at 11:42
Operators have no visibility into the postgres pool's health and no
Kubernetes-friendly probe targets. Add four endpoints on the existing
HTTP server, wire k8s probes to them, and document end-to-end ops
guidance.

- GET /healthz: process liveness, no DB dependency. Safe for k8s
  livenessProbe — a transient DB outage does not trigger restarts.
- GET /healthz/db: pings the pool with a 750ms timeout (fits inside k8s'
  default 1s probe timeout). 200 if healthy, 503 with a generic
  "database unavailable" otherwise. The pgx error is logged
  server-side via slog so the probe response cannot leak connection
  details (host, user). Wired to k8s readinessProbe so a pod loses
  traffic when its DB connection breaks.
- GET /metrics/db: pgxpool.Stat() snapshot as JSON (acquired/idle/max
  conns, acquire counts, lifetime destroy counts); a shape sketch
  follows this list. The same shape can be wrapped in a Prometheus
  collector later.
- GET /metrics/webhook: BatchServiceAdapter.GetMetrics() snapshot
  (events received/dropped/flushed, batches, queue size).
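
A shape sketch of that /metrics/db snapshot, continuing the earlier sketch's package and using the field names from the test plan's curl output (the exact struct in the PR may differ):

```go
// dbMetricsHandler serializes a point-in-time pgxpool.Stat() snapshot.
// Field names follow the curl output in the test plan; treat them as
// illustrative rather than the PR's exact schema.
func dbMetricsHandler(pool *pgxpool.Pool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if pool == nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(ErrorResponse{Error: "database unavailable"})
			return
		}
		s := pool.Stat()
		json.NewEncoder(w).Encode(map[string]int64{
			"acquired_conns":             int64(s.AcquiredConns()),
			"idle_conns":                 int64(s.IdleConns()),
			"max_conns":                  int64(s.MaxConns()),
			"acquire_count":              s.AcquireCount(),
			"max_lifetime_destroy_count": s.MaxLifetimeDestroyCount(),
			"max_idle_destroy_count":     s.MaxIdleDestroyCount(),
		})
	}
}
```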

Endpoints register before ignoreUserAgentMiddleware so probes and
scrapers cannot be silenced by an operator's --ignored-user-agent flag.
503 responses reuse the existing ErrorResponse struct for consistency
with the webhook handler. Integration test uses pkg/db/dbtest.SetEnv
(introduced in PR1) to share the postgres-env bootstrap.

Add unit tests (nil-pool / nil-adapter paths, no-pool adapter happy
path) and an integration test (build tag: integration) that exercises
the live pool behind /healthz/db and /metrics/db.

Update manifests/deployment.yml with livenessProbe (/healthz) and
readinessProbe (/healthz/db) on the existing port. The probes use the
k8s default scheme (HTTP) since the staticreg server runs
ListenAndServe (plaintext); --tls-enable controls the upstream
registry client, not the local listener.

Add a Production Checklist and Health/Metrics Endpoints section to
docs/CONFIGURATION.md, refresh the schema-management/observability
section in docs/ARCHITECTURE.md, and add a new
docs/POSTGRES_OPERATIONS.md runbook covering migrations, rollback, pool
tuning, and the v0.7→v0.8 data consolidation procedure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gavinelder force-pushed the pr3-postgres-health-metrics branch from a95bcd0 to 16367cd on April 27, 2026 at 11:43
