feat(server): DB health + pool/webhook metrics endpoints (PR3/3) #66
Open
gavinelder wants to merge 1 commit into pr2-postgres-tls-pool-tuning from
Force-pushed from 26ee276 to 127c327
Force-pushed from 3d5deae to a95bcd0
Force-pushed from 127c327 to 2463758
Operators have no visibility into the postgres pool's health and no Kubernetes-friendly probe targets. Add four endpoints on the existing HTTP server, wire k8s probes to them, and document end-to-end ops guidance.

- GET /healthz: process liveness, no DB dependency. Safe for k8s livenessProbe: a transient DB outage does not trigger restarts.
- GET /healthz/db: pings the pool with a 750ms timeout (fits inside k8s' default 1s probe timeout). 200 if healthy, 503 with a generic "database unavailable" otherwise. The pgx error is logged server-side via slog so the probe response cannot leak connection details (host, user). Wired to k8s readinessProbe so a pod loses traffic when its DB connection breaks.
- GET /metrics/db: pgxpool.Stat() snapshot as JSON (acquired/idle/max conns, acquire counts, lifetime destroy counts). Same shape can be wrapped in a Prometheus collector later.
- GET /metrics/webhook: BatchServiceAdapter.GetMetrics() snapshot (events received/dropped/flushed, batches, queue size).

Endpoints register before ignoreUserAgentMiddleware so probes and scrapers cannot be silenced by an operator's --ignored-user-agent flag. 503 responses reuse the existing ErrorResponse struct for consistency with the webhook handler. Integration test uses pkg/db/dbtest.SetEnv (introduced in PR1) to share the postgres-env bootstrap.

Add unit tests (nil-pool / nil-adapter paths, no-pool adapter happy path) and an integration test (build tag: integration) that exercises the live pool behind /healthz/db and /metrics/db.

Update manifests/deployment.yml with livenessProbe (/healthz) and readinessProbe (/healthz/db) on the existing port. The probes use the k8s default scheme (HTTP) since the staticreg server runs ListenAndServe (plaintext); --tls-enable controls the upstream registry client, not the local listener.

Add a Production Checklist and Health/Metrics Endpoints section to docs/CONFIGURATION.md, refresh the schema-management/observability section in docs/ARCHITECTURE.md, and add a new docs/POSTGRES_OPERATIONS.md runbook covering migrations, rollback, pool tuning, and the v0.7→v0.8 data consolidation procedure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from a95bcd0 to 16367cd
Summary
PR 3 of 3 — final leg of the postgres enterprise hardening plan. Stacked on #65 → #64; please review/merge those first.
Adds four operational endpoints on the existing HTTP server, wires Kubernetes probes to them, and ships a postgres operations runbook.
- GET /healthz: process liveness, no DB dependency.
- GET /healthz/db: pings the pool; 200 if healthy, 503 otherwise.
- GET /metrics/db: pgxpool.Stat() snapshot as JSON.
- GET /metrics/webhook: BatchServiceAdapter.GetMetrics() snapshot.

Routing is registered before ignoreUserAgentMiddleware so a probe client cannot be silenced by an operator's --ignored-user-agent flag. 503 responses reuse the existing ErrorResponse struct for consistency with the webhook handler.

manifests/deployment.yml gains livenessProbe (/healthz) and readinessProbe (/healthz/db) on the staticreg container.

Documentation
- docs/CONFIGURATION.md: adds Health/Metrics endpoints reference and a Production Checklist.
- docs/ARCHITECTURE.md: refreshes the database section with schema-isolation, goose migrations, and the new health/observability endpoints.
- docs/POSTGRES_OPERATIONS.md: new runbook covering schema layout, inspecting state via SQL/HTTP, adding new migrations, rolling back via the goose CLI image, pool-tuning guidance, and the v0.7 → v0.8 data-consolidation procedure operators may want to run after PR1 lands.

Behavior change
Adding probes is opt-in for non-Kubernetes deployments (the YAML changes are inert if you don't use that manifest). For deployments using manifests/deployment.yml directly, k8s will mark a pod not ready whenever /healthz/db returns 503; this is the desired behavior (it drains traffic from a pod whose DB connection is broken).

Test plan
- go build ./... && go vet ./... clean
- go test ./... (unit) passes: new tests for all four handler nil-paths plus the no-pool webhook-adapter happy path
- go test -tags=integration ./pkg/server/... passes against docker compose up postgres (verifies /healthz/db returns 200 and /metrics/db exposes the expected stat keys with a live pool)
- End-to-end smoke test: built binary, started against compose postgres, hit all four endpoints with curl. Confirmed 200 status and correct JSON shapes.
- Reviewer to verify a staging k8s rollout: probe failure when postgres is killed, recovery after restoration; /metrics/db consumed by an existing scraper.

🤖 Generated with Claude Code