diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md index 804fde7..702f3b4 100644 --- a/docs/DEPLOYMENT.md +++ b/docs/DEPLOYMENT.md @@ -1,8 +1,20 @@ -# NeuroWealth — Production Deployment Runbook +# NeuroWealth Backend Deployment Guide -End-to-end guide for deploying the NeuroWealth backend to Kubernetes with safe migrations, health checks, and secrets management. +Complete guide for deploying the NeuroWealth backend across all environments. -For Docker image build details see also `docs/PRODUCTION_DEPLOYMENT.md`. +## Table of Contents + +- [Prerequisites](#prerequisites) +- [Local Development (Docker Compose)](#local-development-docker-compose) +- [Staging (Docker)](#staging-docker) +- [Production (Kubernetes)](#production-kubernetes) +- [Environment Variables Reference](#environment-variables-reference) +- [Health Probes](#health-probes) +- [Secrets Management](#secrets-management) +- [Database Migrations](#database-migrations) +- [Rollback Procedure](#rollback-procedure) +- [Monitoring and Observability](#monitoring-and-observability) +- [Troubleshooting](#troubleshooting) --- @@ -10,16 +22,104 @@ For Docker image build details see also `docs/PRODUCTION_DEPLOYMENT.md`. | Requirement | Notes | |-------------|-------| -| Kubernetes 1.25+ | Any managed cluster (EKS, GKE, AKS) | -| PostgreSQL 14+ | Managed database (RDS, Cloud SQL, etc.) 
— **not** the in-repo `docker-compose.yml` Postgres | -| Container registry | Push the image built from the root `Dockerfile` | +| Node.js 20 | Matches `Dockerfile` and CI | +| Docker | For local development and container builds | +| Kubernetes 1.25+ | For production (EKS, GKE, AKS) | +| PostgreSQL 14+ | Managed database (RDS, Cloud SQL) | +| Container registry | For pushing Docker images | | Stellar Soroban RPC | `STELLAR_RPC_URL` or comma-separated `STELLAR_RPC_URLS` for failover | | TLS certificate | cert-manager, cloud LB, or manual `Secret` for ingress | | Secrets store | External Secrets Operator, Sealed Secrets, or `kubectl create secret` | --- -## Manifest layout +## Local Development (Docker Compose) + +### Quick Start + +```bash +# Clone the repository +git clone https://github.com/Neurowealth/Backend.git +cd Backend + +# Copy environment variables +cp .env.example .env + +# Start PostgreSQL with Docker Compose +docker-compose up -d + +# Install dependencies +npm install + +# Generate Prisma client +npx prisma generate + +# Apply database migrations +npx prisma migrate deploy + +# Start development server +npm run dev +``` + +### Docker Compose Services + +The `docker-compose.yml` runs: +- **PostgreSQL** on port 5432 +- **Application** on port 3001 (optional) + +### Verifying Local Setup + +```bash +# Check health endpoint +curl http://localhost:3001/health/live + +# Check readiness (requires database connection) +curl http://localhost:3001/health/ready +``` + +--- + +## Staging (Docker) + +### Build Docker Image + +```bash +# Build the image +docker build -t neurowealth-backend:staging . 
+ +# Tag for your registry +docker tag neurowealth-backend:staging /neurowealth-backend:staging + +# Push to registry +docker push /neurowealth-backend:staging +``` + +### Deploy with Docker + +```bash +# Run the container +docker run -d \ + --name neurowealth-backend-staging \ + -p 3001:3001 \ + --env-file /path/to/staging.env \ + /neurowealth-backend:staging +``` + +### Staging Environment Variables + +```bash +NODE_ENV=staging +STELLAR_NETWORK=testnet +STELLAR_RPC_URL=https://soroban-testnet.stellar.org +CORS_ORIGINS=https://staging.neurowealth.io +LOG_LEVEL=debug +``` + +--- + +## Production (Kubernetes) + +### Manifest Layout All manifests live under `deploy/k8s/`: @@ -27,7 +127,7 @@ All manifests live under `deploy/k8s/`: |------|---------| | `namespace.yaml` | `neurowealth` namespace | | `configmap.yaml` | Non-secret environment (CORS, rate limits, RPC URLs, contract IDs) | -| `secret.yaml.example` | **Template only** — copy values into a real Secret; never commit plaintext | +| `secret.yaml.example` | **Template only** — copy values into a real Secret | | `serviceaccount.yaml` | Pod service account | | `deployment.yaml` | App Deployment with initContainer migration + probes | | `service.yaml` | ClusterIP on port 3001 | @@ -35,9 +135,7 @@ All manifests live under `deploy/k8s/`: | `migration-job.yaml` | Standalone pre-deploy migration Job | | `hpa.yaml` | HPA pinned to 1 replica (see scaling constraints) | ---- - -## Environment matrix +### Environment Matrix | Setting | Staging | Production | |---------|---------|------------| @@ -49,33 +147,7 @@ All manifests live under `deploy/k8s/`: | `replicas` | `1` | `1` (until worker split) | | Secrets | Staging Secret / external store | Production Secret / external store | -Override `configmap.yaml` values per environment (separate ConfigMaps or Kustomize overlays recommended). 
- ---- - -## Secrets - -Create the live Secret from the template — **do not** apply `secret.yaml.example` with real values: - -```bash -kubectl create namespace neurowealth - -kubectl create secret generic neurowealth-secrets \ - --namespace=neurowealth \ - --from-literal=DATABASE_URL='postgresql://...' \ - --from-literal=JWT_SEED='...' \ - --from-literal=WALLET_ENCRYPTION_KEY='...' \ - --from-literal=STELLAR_AGENT_SECRET_KEY='...' \ - --from-literal=ANTHROPIC_API_KEY='...' \ - --from-literal=ADMIN_API_TOKEN='...' \ - --from-literal=TWILIO_AUTH_TOKEN='...' -``` - -Required keys match `src/config/env.ts` startup validation. Optional keys: `TWILIO_ACCOUNT_SID`, `INTERNAL_SERVICE_TOKEN`, `SLACK_WEBHOOK_URL`, `PAGERDUTY_ROUTING_KEY`. - ---- - -## Build and push image +### Build and Push Image ```bash docker build -t /neurowealth-backend: . @@ -84,13 +156,13 @@ docker push /neurowealth-backend: Update the `image:` field in `deployment.yaml` and `migration-job.yaml` to your registry tag. -**Migration strategy:** The default `Dockerfile` CMD runs `prisma migrate deploy && node dist/index.js`. In Kubernetes, the Deployment **overrides** the command to `node dist/index.js` only. Migrations run in the **initContainer** (or standalone Job) so a failed migration blocks the rollout instead of leaving a half-started pod serving traffic. +### Migration Strategy ---- +The default `Dockerfile` CMD runs `prisma migrate deploy && node dist/index.js`. In Kubernetes, the Deployment **overrides** the command to `node dist/index.js` only. Migrations run in the **initContainer** (or standalone Job) so a failed migration blocks the rollout instead of leaving a half-started pod serving traffic. -## Rollout procedure +### Rollout Procedure -### 1. Migrate +#### 1. Migrate **Option A — initContainer (default in `deployment.yaml`):** migrations run automatically before each pod starts. 
@@ -102,7 +174,7 @@ kubectl apply -f deploy/k8s/migration-job.yaml kubectl wait --for=condition=complete job/neurowealth-migrate -n neurowealth --timeout=300s ``` -### 2. Deploy +#### 2. Deploy ```bash kubectl apply -f deploy/k8s/namespace.yaml @@ -114,7 +186,7 @@ kubectl apply -f deploy/k8s/service.yaml kubectl apply -f deploy/k8s/ingress.yaml ``` -### 3. Verify readiness +#### 3. Verify Readiness ```bash kubectl rollout status deployment/neurowealth-backend -n neurowealth @@ -128,83 +200,289 @@ curl -s http://localhost:3001/health/ready Readiness returns **200** only when database, event listener, and agent loop are all healthy. During rollout or shutdown it returns **503**. -### 4. Smoke test +#### 4. Smoke Test ```bash curl -s -o /dev/null -w "%{http_code}" https://api.neurowealth.io/health ``` +### Scaling Guidance + +#### Current Constraint: Single Active Consumer + +The monolith starts three subsystems in every pod (`src/index.ts`): +1. HTTP API +2. **Stellar event listener** — polls Soroban RPC every 5 s, persists cursor to `event_cursors` +3. **Agent cron loop** — hourly rebalance, snapshots, daily protocol scan + +There is **no leader election**. Running multiple replicas will: +- Duplicate event processing (mitigated by `processed_events` idempotency, but wastes RPC quota and risks race conditions) +- Run duplicate cron jobs (rebalance checks, snapshots) + +**Recommendation:** keep `replicas: 1` until the architecture is split. + +#### Future Scaling Path + +1. Add feature flags: `ENABLE_EVENT_LISTENER`, `ENABLE_AGENT_LOOP` +2. Split deployments: + - `neurowealth-api` — stateless HTTP, `replicas: N`, HPA enabled + - `neurowealth-worker` — listener + agent, `replicas: 1` +3. Optional: K8s Lease or Postgres advisory lock for worker leader election before scaling workers beyond 1 + +#### HPA + +`deploy/k8s/hpa.yaml` is pinned to `minReplicas: 1` / `maxReplicas: 1`. Re-enable scaling only after the worker/API split. 
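The feature flags proposed under "Future Scaling Path" could gate the background subsystems roughly as follows. This is a minimal sketch, not code from `src/index.ts`: only the flag names `ENABLE_EVENT_LISTENER` and `ENABLE_AGENT_LOOP` come from the guide, while `flagEnabled`, `resolveSubsystems`, the accepted "off" spellings, and the default-to-enabled rule are assumptions made for illustration.

```typescript
// Hypothetical sketch: the flag names come from the scaling path above;
// the parsing rules and function names are assumptions, not repository code.
type Subsystems = { http: boolean; eventListener: boolean; agentLoop: boolean };

// Treat an unset or empty flag as enabled so today's single-pod behaviour is unchanged.
function flagEnabled(value: string | undefined): boolean {
  if (value === undefined || value.trim() === "") return true;
  return !["0", "false", "off", "no"].includes(value.trim().toLowerCase());
}

function resolveSubsystems(env: Record<string, string | undefined>): Subsystems {
  return {
    http: true, // every pod keeps serving HTTP, including the health endpoints
    eventListener: flagEnabled(env.ENABLE_EVENT_LISTENER),
    agentLoop: flagEnabled(env.ENABLE_AGENT_LOOP),
  };
}

// neurowealth-api pods: both background subsystems switched off.
const apiPod = resolveSubsystems({
  ENABLE_EVENT_LISTENER: "false",
  ENABLE_AGENT_LOOP: "false",
});

// neurowealth-worker pod: no flags set, so listener and agent loop both run.
const workerPod = resolveSubsystems({});

console.log(apiPod, workerPod);
```

Defaulting to "enabled" keeps an unflagged deployment identical to the current monolith, so the split can be rolled out by adding flags to the API Deployment only.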
+ +--- + +## Environment Variables Reference + +Copy `.env.example` as a checklist. Set every value via your secrets manager — never commit production secrets. + +### Required (app will not start without these) + +| Variable | Notes | +|----------|-------| +| `NODE_ENV` | Must be `production` | +| `PORT` | Default `3001` | +| `DATABASE_URL` | PostgreSQL connection string | +| `STELLAR_NETWORK` | `mainnet`, `testnet`, or `futurenet` | +| `STELLAR_RPC_URL` | Soroban RPC endpoint for the chosen network | +| `STELLAR_AGENT_SECRET_KEY` | 56-char Stellar secret (`S…`) | +| `VAULT_CONTRACT_ID` | Deployed vault contract ID | +| `USDC_TOKEN_ADDRESS` | USDC token contract on Stellar | +| `ANTHROPIC_API_KEY` | Claude API key for the agent | +| `JWT_SEED` | 64-hex secret for signing sessions — rotate every 90 days | +| `WALLET_ENCRYPTION_KEY` | 64-hex (32 bytes) — `openssl rand -hex 32` | +| `TWILIO_AUTH_TOKEN` | Required for WhatsApp webhook signature validation | + +### Required in Production Only + +| Variable | Notes | +|----------|-------| +| `ADMIN_API_TOKEN` | Strong token (≥ 8 chars) for `/api/admin/*` — inject via secrets manager | +| `CORS_ORIGINS` or `ALLOWED_ORIGINS` | Comma-separated frontend origins (e.g. `https://app.example.com`) — **do not use `*`** | + +### Recommended + +| Variable | Default | Notes | +|----------|---------|-------| +| `LOG_LEVEL` | `info` | Winston log level in production | +| `RATE_LIMIT_*` / `AUTH_RATE_LIMIT_*` | see `.env.example` | Tune per environment | +| `INTERNAL_SERVICE_TOKEN` | — | Service-to-service bypass for rate limits | +| `TRUSTED_IPS` | — | Comma-separated IPs that skip rate limits (probes, internal scrapers) | +| `DLQ_ALERT_THRESHOLD` | `50` | Alert when dead-letter queue exceeds this count | + +### Reverse Proxy + +Express `trust proxy` is set to `1` in `src/index.ts` so `req.ip` reflects the client behind a single load balancer. If you run behind CDN + LB (two hops), adjust that setting before deploy. 
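To see why a second proxy hop matters, the sketch below models how a numeric hop count selects the client address from `X-Forwarded-For`. It is a simplified model of hop-count trust as I understand it, not Express's own code; the function name `clientIp` and the sample addresses are hypothetical, and the behaviour should be checked against the Express documentation before relying on it.

```typescript
// Simplified model of hop-count trust; not Express's implementation.
// socketAddr:   address of the peer that opened the TCP connection (nearest proxy).
// forwardedFor: raw X-Forwarded-For value, original client first, nearest proxy last.
// trustedHops:  how many proxies in front of the app are trusted.
function clientIp(socketAddr: string, forwardedFor: string, trustedHops: number): string {
  const forwarded = forwardedFor
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);

  // Order the chain from the nearest address outward to the original client.
  const chain = [socketAddr, ...forwarded.reverse()];

  // Skip one address per trusted hop, never stepping past the end of the chain.
  const index = Math.min(trustedHops, chain.length - 1);
  return chain[index] ?? socketAddr;
}

// Request path: client 203.0.113.1 -> CDN edge 198.51.100.7 -> load balancer 10.0.0.9 -> app
const header = "203.0.113.1, 198.51.100.7";
const loadBalancer = "10.0.0.9";

const oneHop = clientIp(loadBalancer, header, 1); // resolves to the CDN edge address
const twoHops = clientIp(loadBalancer, header, 2); // resolves to the real client address
console.log({ oneHop, twoHops });
```

With one trusted hop behind a CDN plus load balancer, every request appears to come from the CDN edge, which would make per-IP rate limits apply to the CDN rather than to individual clients.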
+ +Full list and defaults: `.env.example`. + --- -## Health probes +## Health Probes + +| Probe | Path | Success | Failure | Use | +|-------|------|---------|---------|-----| +| **Liveness** | `GET /health/live` | `200` | n/a | Process is running; restart if unreachable | +| **Readiness** | `GET /health/ready` | `200` when DB, event listener, and agent loop are ready | `503` during startup or shutdown | Route traffic only to healthy instances | + +Additional endpoints: +- `GET /health` — basic JSON status (also available via `healthRouter`) +- `GET /metrics` — Prometheus scrape target + +### Kubernetes Example + +```yaml +livenessProbe: + httpGet: + path: /health/live + port: 3001 + initialDelaySeconds: 10 + periodSeconds: 15 + timeoutSeconds: 5 + +readinessProbe: + httpGet: + path: /health/ready + port: 3001 + initialDelaySeconds: 15 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3 +``` -Configured in `deployment.yaml` to match existing endpoints in `src/index.ts`: +### AWS ALB Example -| Probe | Path | Purpose | -|-------|------|---------| -| Liveness | `GET /health/live` | Process is running — always 200 | -| Readiness | `GET /health/ready` | DB + event listener + agent loop ready | +- **Health check path:** `/health/ready` +- **Matcher:** `200` +- **Interval:** 30 s +- **Unhealthy threshold:** 3 -`terminationGracePeriodSeconds: 35` — the app drains in-flight requests for up to 30 s on `SIGTERM` before stopping background services. +During graceful shutdown (`SIGTERM`/`SIGINT`), readiness returns `503` with `status: shutting_down` so load balancers stop sending new requests before the process exits. 
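The readiness rules described above can be summarised as a small decision function. This is a sketch under stated assumptions rather than the handler in `src/index.ts`: the `Health` shape, the `readiness` function, and the `ready` / `not_ready` status strings are hypothetical; only the 200/503 codes and `status: shutting_down` come from this guide.

```typescript
// Hypothetical model of the readiness decision; field and function names are assumptions.
type Health = {
  database: boolean;
  eventListener: boolean;
  agentLoop: boolean;
  shuttingDown: boolean;
};

type ProbeResult = { httpStatus: 200 | 503; body: { status: string } };

function readiness(health: Health): ProbeResult {
  // Shutdown wins over everything else so load balancers drain the pod first.
  if (health.shuttingDown) {
    return { httpStatus: 503, body: { status: "shutting_down" } };
  }
  // Ready only when all three subsystems report healthy.
  const ready = health.database && health.eventListener && health.agentLoop;
  return ready
    ? { httpStatus: 200, body: { status: "ready" } }
    : { httpStatus: 503, body: { status: "not_ready" } };
}

const starting = readiness({ database: true, eventListener: false, agentLoop: false, shuttingDown: false });
const serving = readiness({ database: true, eventListener: true, agentLoop: true, shuttingDown: false });
const draining = readiness({ database: true, eventListener: true, agentLoop: true, shuttingDown: true });
console.log(starting, serving, draining);
```

Checking the shutdown flag first means a healthy but terminating pod still fails readiness, which matches the drain behaviour described for `SIGTERM`/`SIGINT`.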
--- -## Rollback procedure +## Secrets Management + +### Required Production Secrets + +| Secret | Rotation | Notes | +|--------|----------|-------| +| `JWT_SEED` | Every 90 days | Invalidates active sessions; schedule maintenance window | +| `WALLET_ENCRYPTION_KEY` | Coordinated migration | Re-encrypt `custodial_wallets` rows before swapping the key; loss of key = unrecoverable wallets | +| `STELLAR_AGENT_SECRET_KEY` | Rare | Fund a new agent key and update contract permissions before swap | +| `ADMIN_API_TOKEN` | On compromise | Rotate immediately; update secrets store and redeploy | +| `TWILIO_AUTH_TOKEN` | On compromise | Update Twilio console and redeploy | +| `DATABASE_URL` | Per provider policy | Use least-privilege DB user; enable SSL | + +### Secret Managers (Recommended) + +| Provider | Best for | Notes | +|----------|----------|-------| +| **AWS Secrets Manager** | AWS-hosted production | Automatic rotation hooks; inject via ECS task secrets or Lambda env | +| **HashiCorp Vault** | Multi-cloud / on-prem | Dynamic secrets, audit trail; use AppRole or K8s auth | +| **GitHub Actions secrets** | CI/CD and staging | Store `DATABASE_URL`, `JWT_SEED`, etc. per environment; never log values | + +### Creating Kubernetes Secrets ```bash -# Roll back to previous ReplicaSet -kubectl rollout undo deployment/neurowealth-backend -n neurowealth +kubectl create namespace neurowealth -# Or pin a known-good image: -kubectl set image deployment/neurowealth-backend \ - api=/neurowealth-backend: \ - -n neurowealth +kubectl create secret generic neurowealth-secrets \ + --namespace=neurowealth \ + --from-literal=DATABASE_URL='postgresql://...' \ + --from-literal=JWT_SEED='...' \ + --from-literal=WALLET_ENCRYPTION_KEY='...' \ + --from-literal=STELLAR_AGENT_SECRET_KEY='...' \ + --from-literal=ANTHROPIC_API_KEY='...' \ + --from-literal=ADMIN_API_TOKEN='...' \ + --from-literal=TWILIO_AUTH_TOKEN='...' 
+``` -kubectl rollout status deployment/neurowealth-backend -n neurowealth +Required keys match `src/config/env.ts` startup validation. Optional keys: `TWILIO_ACCOUNT_SID`, `INTERNAL_SERVICE_TOKEN`, `SLACK_WEBHOOK_URL`, `PAGERDUTY_ROUTING_KEY`. + +### Secrets Best Practices + +**Do:** +- Inject secrets at runtime from a secrets manager +- Use separate credentials per environment (staging vs production) +- Take a DB snapshot before running `prisma migrate deploy` + +**Do not:** +- Commit `.env` files or bake secrets into Docker image layers +- Log secret values (Winston redacts in production but avoid passing secrets in error messages) + +--- + +## Database Migrations + +### Pre-Deployment Checklist + +- [ ] Review pending Prisma migrations (`npx prisma migrate status`) +- [ ] Confirm migration SQL is non-destructive or has a documented data backfill +- [ ] Take a database backup/snapshot (provider console or `pg_dump`) +- [ ] Schedule during low traffic; notify on-call +- [ ] Staging deploy passed CI (`migration-smoke` job green) + +### Applying Migrations + +**Option A — Docker Container (single-instance):** + +```bash +docker run -d \ + --name neurowealth-migrate \ + --env-file /path/to/production.env \ + /neurowealth-backend: \ + npx prisma migrate deploy +``` + +**Option B — Kubernetes initContainer (recommended):** + +```yaml +initContainers: + - name: migrate + image: /neurowealth-backend: + command: ["npx", "prisma", "migrate", "deploy"] + envFrom: + - secretRef: + name: neurowealth-backend-secrets +``` + +**Option C — Standalone Job (large migrations):** + +```bash +kubectl apply -f deploy/k8s/migration-job.yaml +kubectl wait --for=condition=complete job/neurowealth-migrate -n neurowealth --timeout=300s +``` + +**Option D — Safe script with smoke test:** + +```bash +DATABASE_URL=postgresql://... bash scripts/apply-migration.sh ``` -**Database rollback:** Prisma migrations are forward-only. 
If a migration introduced a breaking schema change, restore from a database backup or deploy a hotfix migration — do not rely on `migrate reset` in production. +### Verifying Migration + +```bash +# Check migration status +npx prisma migrate status + +# Verify database tables +psql $DATABASE_URL -c "\dt event_cursors" +psql $DATABASE_URL -c "\dt processed_events" +``` --- -## Scaling guidance +## Rollback Procedure -### Current constraint: single active consumer +### Application Rollback -The monolith starts three subsystems in every pod (`src/index.ts`): +```bash +# Roll back to previous ReplicaSet +kubectl rollout undo deployment/neurowealth-backend -n neurowealth -1. HTTP API -2. **Stellar event listener** — polls Soroban RPC every 5 s, persists cursor to `event_cursors` -3. **Agent cron loop** — hourly rebalance, snapshots, daily protocol scan +# Or pin a known-good image: +kubectl set image deployment/neurowealth-backend \ + api=/neurowealth-backend: \ + -n neurowealth -There is **no leader election**. Running multiple replicas will: +kubectl rollout status deployment/neurowealth-backend -n neurowealth +``` -- Duplicate event processing (mitigated by `processed_events` idempotency, but wastes RPC quota and risks race conditions) -- Run duplicate cron jobs (rebalance checks, snapshots) +### Database Rollback -**Recommendation:** keep `replicas: 1` until the architecture is split. +Prisma migrations are forward-only. If a migration introduced a breaking schema change, restore from a database backup or deploy a hotfix migration — do not rely on `migrate reset` in production. -### Future scaling path +#### When to Rollback -1. Add feature flags: `ENABLE_EVENT_LISTENER`, `ENABLE_AGENT_LOOP` -2. Split deployments: - - `neurowealth-api` — stateless HTTP, `replicas: N`, HPA enabled - - `neurowealth-worker` — listener + agent, `replicas: 1` -3. 
Optional: K8s Lease or Postgres advisory lock for worker leader election before scaling workers beyond 1 +| Situation | Action | +|-----------|--------| +| Migration applied, app bug only | Roll back **application** image to previous tag; DB unchanged | +| Bad migration, no data loss yet | Restore DB from pre-deploy snapshot; redeploy previous app + migration set | +| Bad migration with partial writes | Restore snapshot; replay DLQ after fix; document manual reconciliation | -### HPA +#### Rollback Steps -`deploy/k8s/hpa.yaml` is pinned to `minReplicas: 1` / `maxReplicas: 1`. Re-enable scaling only after the worker/API split. +1. Stop traffic to new instances (drain load balancer). +2. Restore database from the pre-deploy backup/snapshot. +3. Deploy the **previous** application image (matching the restored schema). +4. Run `npm run smoke` against the restored DB. +5. Re-enable traffic; post-mortem and fix-forward migration in a new release. --- -## Observability +## Monitoring and Observability + +### Metrics - **Metrics:** `GET /metrics` on port 3001 (Prometheus) - **Request tracing:** clients may send `X-Request-ID` or `X-Correlation-ID`; the server echoes `X-Request-ID` on every response and includes `correlationId` in structured logs - **DLQ:** monitor `dead_letter_events` count and `event_cursors.lastProcessedLedger` lag — see `docs/OBSERVABILITY.md` and `docs/RUNBOOK.md` -### Monitoring assets +### Monitoring Assets Pre-built alert rules and Grafana dashboards live under `deploy/monitoring/`: @@ -218,14 +496,14 @@ Pre-built alert rules and Grafana dashboards live under `deploy/monitoring/`: | `deploy/monitoring/grafana/provisioning/datasources.yaml` | Grafana datasource provisioning | | `deploy/monitoring/grafana/provisioning/dashboards.yaml` | Grafana dashboard provisioning | -**Prometheus:** add the alert rules file to your Prometheus configuration: +### Prometheus Configuration ```yaml rule_files: - /etc/prometheus/rules/alert-rules.yaml ``` -**Grafana:** 
copy the provisioning files and dashboards to your Grafana instance: +### Grafana Setup ```bash cp deploy/monitoring/grafana/provisioning/* /etc/grafana/provisioning/ @@ -234,20 +512,25 @@ cp deploy/monitoring/grafana/dashboards/*.json /etc/grafana/dashboards/ Grafana will auto-load the dashboards on next restart. ---- +### Logs and Metrics Commands -## CI validation +```bash +# Docker logs +docker logs neurowealth-backend --tail 200 -f -Manifests are validated in CI with `kubeconform` (see `.github/workflows/k8s-validate.yml`). Run locally: +# Kubernetes logs +kubectl logs -l app=neurowealth-backend --tail=200 -f -```bash -kubeconform -summary deploy/k8s/*.yaml +# Prometheus metrics +curl -sS http://localhost:3001/metrics | head -50 ``` --- ## Troubleshooting +### Common Issues + | Symptom | Check | |---------|-------| | Pod `CrashLoopBackOff` | `kubectl logs deployment/neurowealth-backend -n neurowealth`; verify all required secrets | @@ -256,11 +539,73 @@ kubeconform -summary deploy/k8s/*.yaml | Events not processing | `SELECT * FROM event_cursors;` — cursor lag; ensure only one replica runs the listener | | Duplicate rebalances | Confirm `replicas: 1`; check agent cron is not running on multiple pods | +### Health and Readiness Commands + +```bash +# Liveness — should always return 200 once the process is up +curl -sS http://localhost:3001/health/live | jq . + +# Readiness — 200 when all subsystems ready, 503 during startup/shutdown +curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:3001/health/ready + +# Detailed subsystem status +curl -sS http://localhost:3001/health/ready | jq . +``` + +### Database and Migration Commands + +```bash +# Check migration status +DATABASE_URL="postgresql://..." npx prisma migrate status + +# Apply pending migrations (production-safe, non-destructive) +DATABASE_URL="postgresql://..." npx prisma migrate deploy + +# Safe migration + smoke test +DATABASE_URL="postgresql://..." 
bash scripts/apply-migration.sh + +# Verify DB connectivity from the app host +psql "$DATABASE_URL" -c "SELECT 1" +``` + +### Common Startup Failures + +```bash +# Missing or invalid env — app prints all validation errors at once +NODE_ENV=production node dist/index.js + +# Verify required production vars are set (prints variable names only, never values) +env | grep -E '^(NODE_ENV|DATABASE_URL|ADMIN_API_TOKEN|CORS_ORIGINS|JWT_SEED|WALLET_ENCRYPTION_KEY)=' | cut -d= -f1 + +# Stellar network mismatch warning +# Ensure STELLAR_NETWORK=mainnet only when NODE_ENV=production and keys are mainnet +``` + +### Rate-Limit / Proxy Issues + +```bash +# The app trusts one reverse-proxy hop (trust proxy = 1). +# If client IPs look wrong behind CDN + LB, update src/index.ts before redeploying. +curl -H "X-Forwarded-For: 203.0.113.1" http://localhost:3001/health/live +``` + +--- + +## CI Validation + +Manifests are validated in CI with `kubeconform` (see `.github/workflows/k8s-validate.yml`). Run locally: + +```bash +kubeconform -summary deploy/k8s/*.yaml +``` + --- -## Related docs +## Related Documentation -- `docs/PRODUCTION_DEPLOYMENT.md` — Docker build, secret rotation +- `.env.example` — full environment variable reference +- `readme.md` — local development, auth flow, rate limiting +- `Dockerfile` — image build stages and default CMD +- `scripts/apply-migration.sh` — CI/CD migration gate with smoke test - `docs/RUNBOOK.md` — DLQ replay, cursor management -- `docs/OBSERVABILITY.md` — alerting and metrics -- `readme.md` — API request tracing headers +- `docs/OBSERVABILITY.md` — alerting and metrics \ No newline at end of file diff --git a/docs/DEPLOYMENT_GUIDE.md b/docs/DEPLOYMENT_GUIDE.md deleted file mode 100644 index 87cd3c8..0000000 --- a/docs/DEPLOYMENT_GUIDE.md +++ /dev/null @@ -1,372 +0,0 @@ -# Deployment Guide - Vault Events Persistence - -> For production secrets, migration roll-forward/rollback, and the full release checklist, see **[PRODUCTION_DEPLOYMENT.md](./PRODUCTION_DEPLOYMENT.md)**. 
- -## Pre-Deployment Checklist - -- [x] Code review completed -- [x] Tests passing -- [x] Documentation complete -- [x] Migration created -- [x] No breaking changes - -## Deployment Steps - -### 1. Apply Database Migration - -```bash -# Generate Prisma client -npm run prisma:generate - -# Apply migration -npx prisma migrate deploy - -# Verify migration -npx prisma db execute --stdin < prisma/migrations/20260326152030_add_event_tracking/migration.sql -``` - -### 2. Verify Database Changes - -```bash -# Check EventCursor table -psql $DATABASE_URL -c "SELECT * FROM event_cursors;" - -# Check ProcessedEvent table -psql $DATABASE_URL -c "SELECT * FROM processed_events;" - -# Verify indexes -psql $DATABASE_URL -c "\d event_cursors" -psql $DATABASE_URL -c "\d processed_events" -``` - -### 3. Build and Test - -```bash -# Install dependencies -npm install - -# Run linting -npm run lint - -# Run tests -npm test -- --run - -# Build -npm run build -``` - -### 4. Start Event Listener - -```bash -# In your application startup code -import { startEventListener } from './src/stellar/events'; - -// Start listening for events -await startEventListener(); -``` - -### 5. 
Monitor Event Processing - -```bash -# Check last processed ledger -psql $DATABASE_URL -c "SELECT * FROM event_cursors WHERE contractId = '$VAULT_CONTRACT_ID';" - -# Check processed events count -psql $DATABASE_URL -c "SELECT COUNT(*) FROM processed_events;" - -# Check recent transactions -psql $DATABASE_URL -c "SELECT * FROM transactions ORDER BY createdAt DESC LIMIT 10;" - -# Check recent positions -psql $DATABASE_URL -c "SELECT * FROM positions ORDER BY updatedAt DESC LIMIT 10;" -``` - -## Rollback Procedure - -### If Issues Occur - -```bash -# Stop event listener -import { stopEventListener } from './src/stellar/events'; -stopEventListener(); - -# Rollback migration -npx prisma migrate resolve --rolled-back 20260326152030_add_event_tracking - -# Verify rollback -psql $DATABASE_URL -c "\dt" # Should not show event_cursors or processed_events -``` - -## Post-Deployment Verification - -### 1. Check Event Processing - -```bash -# Wait 30 seconds for events to be processed -sleep 30 - -# Verify cursor was updated -psql $DATABASE_URL -c "SELECT * FROM event_cursors;" - -# Verify events were processed -psql $DATABASE_URL -c "SELECT COUNT(*) FROM processed_events;" - -# Verify transactions were created -psql $DATABASE_URL -c "SELECT COUNT(*) FROM transactions WHERE status = 'CONFIRMED';" -``` - -### 2. Monitor Logs - -```bash -# Check for errors -grep -i "error" logs/*.log - -# Check for event processing -grep -i "event" logs/*.log - -# Check for warnings -grep -i "warn" logs/*.log -``` - -### 3. 
Verify Data Integrity - -```bash -# Check for duplicate transactions -psql $DATABASE_URL -c "SELECT txHash, COUNT(*) FROM transactions GROUP BY txHash HAVING COUNT(*) > 1;" - -# Check for orphaned transactions -psql $DATABASE_URL -c "SELECT * FROM transactions WHERE positionId IS NULL AND type IN ('DEPOSIT', 'WITHDRAWAL');" - -# Check position balances -psql $DATABASE_URL -c "SELECT userId, protocolName, depositedAmount, currentValue FROM positions WHERE status = 'ACTIVE';" -``` - -## Performance Monitoring - -### Key Metrics to Track - -```bash -# Event processing rate -psql $DATABASE_URL -c "SELECT COUNT(*) FROM processed_events WHERE processedAt > NOW() - INTERVAL '1 hour';" - -# Average processing time -psql $DATABASE_URL -c "SELECT AVG(EXTRACT(EPOCH FROM (processedAt - createdAt))) FROM processed_events WHERE processedAt > NOW() - INTERVAL '1 hour';" - -# Ledger lag -psql $DATABASE_URL -c "SELECT lastProcessedLedger FROM event_cursors WHERE contractId = '$VAULT_CONTRACT_ID';" - -# Database size -psql $DATABASE_URL -c "SELECT pg_size_pretty(pg_total_relation_size('processed_events'));" -``` - -## Troubleshooting - -### Events Not Processing - -1. **Check listener is running** - ```bash - # In application logs - grep "Event Listener" logs/*.log - ``` - -2. **Check VAULT_CONTRACT_ID** - ```bash - echo $VAULT_CONTRACT_ID - ``` - -3. **Check database connection** - ```bash - psql $DATABASE_URL -c "SELECT 1;" - ``` - -4. **Check RPC connection** - ```bash - # In application logs - grep "RPC" logs/*.log - ``` - -### Duplicate Events - -1. **Check ProcessedEvent table** - ```bash - psql $DATABASE_URL -c "SELECT * FROM processed_events WHERE txHash = 'YOUR_TX_HASH';" - ``` - -2. **Verify unique constraint** - ```bash - psql $DATABASE_URL -c "\d processed_events" - ``` - -3. **Check for concurrent listeners** - ```bash - # Should only have one listener running - ps aux | grep "node" - ``` - -### Listener Not Resuming - -1. 
**Check EventCursor table** - ```bash - psql $DATABASE_URL -c "SELECT * FROM event_cursors;" - ``` - -2. **Check migration was applied** - ```bash - npx prisma migrate status - ``` - -3. **Check database logs** - ```bash - # PostgreSQL logs - tail -f /var/log/postgresql/postgresql.log - ``` - -## Scaling Considerations - -### For High Event Volume - -1. **Increase poll frequency** (if needed) - - Modify POLL_INTERVAL_MS in events.ts - - Default: 5000ms (5 seconds) - -2. **Batch processing** - - Process multiple events in single transaction - - Reduces database round trips - -3. **Connection pooling** - - Use PgBouncer or similar - - Reduce connection overhead - -4. **Database optimization** - - Add more indexes if needed - - Archive old processed events - - Partition large tables - -## Maintenance Tasks - -### Daily - -```bash -# Check event processing status -psql $DATABASE_URL -c "SELECT * FROM event_cursors;" - -# Check for errors in logs -grep -i "error" logs/*.log | tail -20 -``` - -### Weekly - -```bash -# Check database size -psql $DATABASE_URL -c "SELECT pg_size_pretty(pg_total_relation_size('processed_events'));" - -# Archive old processed events (optional) -psql $DATABASE_URL -c "DELETE FROM processed_events WHERE processedAt < NOW() - INTERVAL '30 days';" - -# Analyze query performance -psql $DATABASE_URL -c "ANALYZE processed_events;" -``` - -### Monthly - -```bash -# Vacuum database -psql $DATABASE_URL -c "VACUUM ANALYZE;" - -# Check index usage -psql $DATABASE_URL -c "SELECT * FROM pg_stat_user_indexes WHERE schemaname = 'public';" - -# Review slow queries -# Check PostgreSQL slow query log -``` - -## Disaster Recovery - -### If Database Corrupted - -1. **Stop event listener** - ```bash - stopEventListener(); - ``` - -2. **Restore from backup** - ```bash - # Restore database from backup - pg_restore -d $DATABASE_URL backup.dump - ``` - -3. 
**Verify data integrity** - ```bash - psql $DATABASE_URL -c "SELECT COUNT(*) FROM event_cursors;" - psql $DATABASE_URL -c "SELECT COUNT(*) FROM processed_events;" - ``` - -4. **Restart event listener** - ```bash - startEventListener(); - ``` - -### If Events Lost - -1. **Reset cursor to earlier ledger** - ```bash - psql $DATABASE_URL -c "UPDATE event_cursors SET lastProcessedLedger = 100 WHERE contractId = '$VAULT_CONTRACT_ID';" - ``` - -2. **Clear processed events (optional)** - ```bash - psql $DATABASE_URL -c "DELETE FROM processed_events WHERE contractId = '$VAULT_CONTRACT_ID';" - ``` - -3. **Restart listener** - ```bash - stopEventListener(); - startEventListener(); - ``` - -## Success Criteria - -✅ Migration applied successfully -✅ EventCursor table created -✅ ProcessedEvent table created -✅ Event listener starts without errors -✅ Events are processed and persisted -✅ No duplicate events -✅ Listener resumes on restart -✅ All tests passing -✅ No errors in logs -✅ Database performance acceptable - -## Support - -For issues or questions: -1. Check logs for error messages -2. Review database state -3. Refer to QUICK_REFERENCE.md -4. Refer to IMPLEMENTATION_DETAILS.md -5. 
Contact development team - -## Rollback Timeline - -- **Immediate**: Stop listener, rollback migration -- **5 minutes**: Verify rollback, check data -- **15 minutes**: Restart with previous version -- **30 minutes**: Full system verification - -## Communication - -### Before Deployment -- Notify team of deployment -- Schedule maintenance window if needed -- Prepare rollback plan - -### During Deployment -- Monitor logs in real-time -- Check database performance -- Verify event processing - -### After Deployment -- Confirm all systems operational -- Monitor for 24 hours -- Document any issues -- Update team on status diff --git a/docs/DEPLOYMENT_PRODUCTION.md b/docs/DEPLOYMENT_PRODUCTION.md deleted file mode 100644 index c0f7178..0000000 --- a/docs/DEPLOYMENT_PRODUCTION.md +++ /dev/null @@ -1,298 +0,0 @@ -# Production deployment guide - -Concise runbook for building, deploying, and operating the NeuroWealth backend in production. - -## Prerequisites - -- Node.js 20 (matches `Dockerfile` and CI) -- PostgreSQL 14+ -- A secrets store (AWS Secrets Manager, HashiCorp Vault, GitHub Actions secrets, etc.) -- TLS termination at your load balancer or ingress - ---- - -## Build - -### Docker image (recommended) - -Multi-stage build compiles TypeScript, generates the Prisma client, and produces a slim runtime image running as non-root user `app`. - -```bash -# From repository root -docker build -t neurowealth-backend:latest . - -# Tag and push to your registry -docker tag neurowealth-backend:latest /neurowealth-backend: -docker push /neurowealth-backend: -``` - -The image CMD runs `prisma migrate deploy && node dist/index.js`. For Kubernetes, prefer running migrations in an **initContainer** so rollouts stay atomic and failed migrations do not leave a half-started pod serving traffic. 
- -### Bare-metal / VM build - -```bash -npm ci -npx prisma generate -npm run build -``` - -Start with: - -```bash -npx prisma migrate deploy -node dist/index.js -``` - -Or use the safe migration script (applies migrations then runs smoke test): - -```bash -DATABASE_URL=postgresql://... bash scripts/apply-migration.sh -``` - ---- - -## Deploy - -### Docker Compose (Postgres only) - -`docker-compose.yml` runs Postgres for local development. In production, use a managed database (RDS, Cloud SQL, etc.) and run the app container separately: - -```bash -docker run -d \ - --name neurowealth-backend \ - -p 3001:3001 \ - --env-file /path/to/production.env \ - /neurowealth-backend: -``` - -### Kubernetes (outline) - -1. **Secrets** — mount required env vars via `Secret` + `envFrom` or an external secrets operator. -2. **initContainer** — run `npx prisma migrate deploy` against `DATABASE_URL` before the main container starts. -3. **Probes** — configure liveness and readiness (see [Health probes](#health-probes)). -4. **Ingress** — terminate TLS at ingress; the app sets `trust proxy` to `1` (one reverse-proxy hop) in `src/index.ts`. -5. **Graceful shutdown** — allow at least 30 s `terminationGracePeriodSeconds`; the app drains in-flight requests on `SIGTERM`. - -Example initContainer: - -```yaml -initContainers: - - name: migrate - image: /neurowealth-backend: - command: ["npx", "prisma", "migrate", "deploy"] - envFrom: - - secretRef: - name: neurowealth-backend-secrets -``` - -Override the main container command to skip inline migration when using an initContainer: - -```yaml -command: ["node", "dist/index.js"] -``` - -### Startup behaviour - -The HTTP server **does not accept traffic** until all critical services initialise: - -1. Database connection -2. Stellar event listener -3. Agent loop - -If any step fails, the process exits with a non-zero code so your orchestrator restarts the pod. - ---- - -## Environment variables - -Copy `.env.example` as a checklist. 
Set every value via your secrets manager — never commit production secrets. - -### Required (app will not start without these) - -| Variable | Notes | -|----------|-------| -| `NODE_ENV` | Must be `production` | -| `PORT` | Default `3001` | -| `DATABASE_URL` | PostgreSQL connection string | -| `STELLAR_NETWORK` | `mainnet`, `testnet`, or `futurenet` | -| `STELLAR_RPC_URL` | Soroban RPC endpoint for the chosen network | -| `STELLAR_AGENT_SECRET_KEY` | 56-char Stellar secret (`S…`) | -| `VAULT_CONTRACT_ID` | Deployed vault contract ID | -| `USDC_TOKEN_ADDRESS` | USDC token contract on Stellar | -| `ANTHROPIC_API_KEY` | Claude API key for the agent | -| `JWT_SEED` | 64-hex secret for signing sessions — rotate every 90 days | -| `WALLET_ENCRYPTION_KEY` | 64-hex (32 bytes) — `openssl rand -hex 32` | -| `TWILIO_AUTH_TOKEN` | Required for WhatsApp webhook signature validation | - -### Required in production only - -| Variable | Notes | -|----------|-------| -| `ADMIN_API_TOKEN` | Strong token (≥ 8 chars) for `/api/admin/*` — inject via secrets manager | -| `CORS_ORIGINS` or `ALLOWED_ORIGINS` | Comma-separated frontend origins (e.g. `https://app.example.com`) — **do not use `*`** | - -### Recommended - -| Variable | Default | Notes | -|----------|---------|-------| -| `LOG_LEVEL` | `info` | Winston log level in production | -| `RATE_LIMIT_*` / `AUTH_RATE_LIMIT_*` | see `.env.example` | Tune per environment | -| `INTERNAL_SERVICE_TOKEN` | — | Service-to-service bypass for rate limits | -| `TRUSTED_IPS` | — | Comma-separated IPs that skip rate limits (probes, internal scrapers) | -| `DLQ_ALERT_THRESHOLD` | `50` | Alert when dead-letter queue exceeds this count | - -**Reverse proxy:** Express `trust proxy` is set to `1` in `src/index.ts` so `req.ip` reflects the client behind a single load balancer. If you run behind CDN + LB (two hops), adjust that setting before deploy. - -Full list and defaults: `.env.example`. 
- ---- - -## Health probes - -| Probe | Path | Success | Failure | Use | -|-------|------|---------|---------|-----| -| **Liveness** | `GET /health/live` | `200` | n/a | Process is running; restart if unreachable | -| **Readiness** | `GET /health/ready` | `200` when DB, event listener, and agent loop are ready | `503` during startup or shutdown | Route traffic only to healthy instances | - -Additional endpoints: - -- `GET /health` — basic JSON status (also available via `healthRouter`) -- `GET /health/ready` (router) — subsystem readiness via `getReadiness()` -- `GET /metrics` — Prometheus scrape target - -### Kubernetes example - -```yaml -livenessProbe: - httpGet: - path: /health/live - port: 3001 - initialDelaySeconds: 10 - periodSeconds: 15 - timeoutSeconds: 5 - -readinessProbe: - httpGet: - path: /health/ready - port: 3001 - initialDelaySeconds: 15 - periodSeconds: 10 - timeoutSeconds: 5 - failureThreshold: 3 -``` - -### AWS ALB example - -- **Health check path:** `/health/ready` -- **Matcher:** `200` -- **Interval:** 30 s -- **Unhealthy threshold:** 3 - -During graceful shutdown (`SIGTERM`/`SIGINT`), readiness returns `503` with `status: shutting_down` so load balancers stop sending new requests before the process exits. 
- ---- - -## Secrets guidance - -| Secret | Rotation | Notes | -|--------|----------|-------| -| `JWT_SEED` | Every 90 days | Invalidates active sessions; schedule maintenance window | -| `WALLET_ENCRYPTION_KEY` | Coordinated migration | Re-encrypt `custodial_wallets` rows before swapping the key; loss of key = unrecoverable wallets | -| `STELLAR_AGENT_SECRET_KEY` | Rare | Fund a new agent key and update contract permissions before swap | -| `ADMIN_API_TOKEN` | On compromise | Rotate immediately; update secrets store and redeploy | -| `TWILIO_AUTH_TOKEN` | On compromise | Update Twilio console and redeploy | -| `DATABASE_URL` | Per provider policy | Use least-privilege DB user; enable SSL | - -**Do:** - -- Inject secrets at runtime from a secrets manager -- Use separate credentials per environment (staging vs production) -- Take a DB snapshot before running `prisma migrate deploy` - -**Do not:** - -- Commit `.env` files or bake secrets into Docker image layers -- Log secret values (Winston redacts in production but avoid passing secrets in error messages) - ---- - -## Runbook — troubleshooting commands - -### Health and readiness - -```bash -# Liveness — should always return 200 once the process is up -curl -sS http://localhost:3001/health/live | jq . - -# Readiness — 200 when all subsystems ready, 503 during startup/shutdown -curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:3001/health/ready - -# Detailed subsystem status -curl -sS http://localhost:3001/health/ready | jq . -``` - -### Database and migrations - -```bash -# Check migration status -DATABASE_URL="postgresql://..." npx prisma migrate status - -# Apply pending migrations (production-safe, non-destructive) -DATABASE_URL="postgresql://..." npx prisma migrate deploy - -# Safe migration + smoke test -DATABASE_URL="postgresql://..." 
bash scripts/apply-migration.sh - -# Verify DB connectivity from the app host -psql "$DATABASE_URL" -c "SELECT 1" -``` - -### Logs and metrics - -```bash -# Docker logs -docker logs neurowealth-backend --tail 200 -f - -# Kubernetes logs -kubectl logs -l app=neurowealth-backend --tail=200 -f - -# Prometheus metrics -curl -sS http://localhost:3001/metrics | head -50 -``` - -### Common startup failures - -```bash -# Missing or invalid env — app prints all validation errors at once -NODE_ENV=production node dist/index.js - -# Verify required production vars are set (no values printed) -env | grep -E '^(NODE_ENV|DATABASE_URL|ADMIN_API_TOKEN|CORS_ORIGINS|JWT_SEED|WALLET_ENCRYPTION_KEY)=' - -# Stellar network mismatch warning -# Ensure STELLAR_NETWORK=mainnet only when NODE_ENV=production and keys are mainnet -``` - -### Rollback - -1. Scale down or stop new pods/tasks. -2. Restore database from pre-migration snapshot if the migration was destructive. -3. Redeploy the previous image tag. -4. Confirm `/health/ready` returns `200` and monitor logs for 15 minutes. - -### Rate-limit / proxy issues - -```bash -# The app trusts one reverse-proxy hop (trust proxy = 1). -# If client IPs look wrong behind CDN + LB, update src/index.ts before redeploying. 
-curl -H "X-Forwarded-For: 203.0.113.1" http://localhost:3001/health/live -``` - ---- - -## Related docs - -- `.env.example` — full environment variable reference -- `readme.md` — local development, auth flow, rate limiting -- `Dockerfile` — image build stages and default CMD -- `scripts/apply-migration.sh` — CI/CD migration gate with smoke test diff --git a/docs/DLQ_ALERTING_RUNBOOK.md b/docs/DLQ_ALERTING_RUNBOOK.md index 8a60589..6bac655 100644 --- a/docs/DLQ_ALERTING_RUNBOOK.md +++ b/docs/DLQ_ALERTING_RUNBOOK.md @@ -205,4 +205,4 @@ If queue continues to grow after recovery attempts: - [OBSERVABILITY.md](./OBSERVABILITY.md) - Complete metrics and alert thresholds - [API_REFERENCE.md](./API_REFERENCE.md) - Admin DLQ endpoints -- [DEPLOYMENT_GUIDE.md](./DEPLOYMENT_GUIDE.md) - Configuration and secrets management +- [DEPLOYMENT.md](./DEPLOYMENT.md) - Configuration and secrets management diff --git a/docs/DOCUMENTATION_INDEX.md b/docs/DOCUMENTATION_INDEX.md index 744718d..a1962db 100644 --- a/docs/DOCUMENTATION_INDEX.md +++ b/docs/DOCUMENTATION_INDEX.md @@ -11,14 +11,9 @@ ### For DevOps/Deployment -- **[DEPLOYMENT_GUIDE.md](DEPLOYMENT_GUIDE.md)** - Step-by-step deployment instructions +- **[DEPLOYMENT.md](DEPLOYMENT.md)** - Complete deployment guide for all environments - **[IMPLEMENTATION_CHECKLIST.md](IMPLEMENTATION_CHECKLIST.md)** - Verification checklist -### For Project Managers - -- **[FINAL_SUMMARY.md](FINAL_SUMMARY.md)** - Executive summary and status -- **[PR_DESCRIPTION.md](PR_DESCRIPTION.md)** - PR summary for code review - ### For Reference - **[IMPLEMENTATION_SUMMARY.md](IMPLEMENTATION_SUMMARY.md)** - High-level overview @@ -87,24 +82,23 @@ --- -### DEPLOYMENT_GUIDE.md +### DEPLOYMENT.md -**Purpose**: Step-by-step deployment instructions +**Purpose**: Complete deployment guide for all environments **Contents**: -- Pre-deployment checklist -- Deployment steps (migration, testing, verification) +- Local development (Docker Compose) +- Staging (Docker) +- 
Production (Kubernetes) +- Environment variables reference +- Health probes +- Secrets management +- Database migrations - Rollback procedure -- Post-deployment verification -- Performance monitoring -- Troubleshooting guide -- Scaling considerations -- Maintenance tasks -- Disaster recovery -- Success criteria -- Communication plan +- Monitoring and observability +- Troubleshooting -**Read this if**: You are deploying to production or need rollback procedures +**Read this if**: You are deploying to any environment or need rollback procedures --- @@ -130,48 +124,6 @@ --- -### FINAL_SUMMARY.md - -**Purpose**: Executive summary and project status -**Contents**: - -- Executive summary -- What was delivered -- Key features -- Testing summary -- Documentation overview -- Technical details -- Acceptance criteria verification -- Files created/modified -- Code quality metrics -- Performance characteristics -- Security features -- Deployment readiness -- Testing summary -- Next steps -- Key metrics -- Success criteria -- Known limitations -- Conclusion - -**Read this if**: You need high-level overview or project status - ---- - -### PR_DESCRIPTION.md - -**Purpose**: PR summary for code review -**Contents**: - -- Summary of changes -- Changes made -- Acceptance criteria -- Files changed - -**Read this if**: You are reviewing the PR or need a concise summary - ---- - ### IMPLEMENTATION_SUMMARY.md **Purpose**: High-level overview of implementation @@ -307,20 +259,18 @@ grep "RPC" logs/*.log ### For DevOps -1. DEPLOYMENT_GUIDE.md - Deployment steps +1. DEPLOYMENT.md - Deployment steps 2. IMPLEMENTATION_CHECKLIST.md - Verification 3. QUICK_REFERENCE.md - Troubleshooting ### For Project Managers -1. FINAL_SUMMARY.md - Status and metrics -2. PR_DESCRIPTION.md - Changes summary -3. IMPLEMENTATION_CHECKLIST.md - Verification +1. IMPLEMENTATION_SUMMARY.md - Overview +2. IMPLEMENTATION_CHECKLIST.md - Verification ### For Code Review -1. PR_DESCRIPTION.md - Summary -2. 
CODE_STRUCTURE.md - Architecture +1. CODE_STRUCTURE.md - Architecture 3. IMPLEMENTATION_DETAILS.md - Technical details 4. Review src/stellar/events.ts - Code review 5. Review tests - Test coverage @@ -342,7 +292,7 @@ For questions or issues: 1. Check QUICK_REFERENCE.md for common questions 2. Review IMPLEMENTATION_DETAILS.md for technical details -3. Check DEPLOYMENT_GUIDE.md for deployment issues +3. Check DEPLOYMENT.md for deployment issues 4. Review logs for error messages 5. Contact development team @@ -366,12 +316,10 @@ For questions or issues: | QUICK_REFERENCE.md | 150 | Quick lookup | | CODE_STRUCTURE.md | 350 | Architecture | | IMPLEMENTATION_DETAILS.md | 400 | Technical details | -| DEPLOYMENT_GUIDE.md | 350 | Deployment | +| DEPLOYMENT.md | 450 | Deployment | | IMPLEMENTATION_CHECKLIST.md | 200 | Verification | -| FINAL_SUMMARY.md | 300 | Executive summary | -| PR_DESCRIPTION.md | 30 | PR summary | | IMPLEMENTATION_SUMMARY.md | 100 | Overview | -| **Total Documentation** | **1880** | **Complete** | +| **Total Documentation** | **1650** | **Complete** | --- @@ -380,7 +328,7 @@ For questions or issues: 1. **Review**: Review all documentation 2. **Code Review**: Review implementation and tests 3. **Merge**: Merge to main branch -4. **Deploy**: Follow DEPLOYMENT_GUIDE.md +4. **Deploy**: Follow DEPLOYMENT.md 5. **Monitor**: Monitor event processing 6. **Verify**: Confirm all systems operational diff --git a/docs/PRODUCTION_DEPLOYMENT.md b/docs/PRODUCTION_DEPLOYMENT.md deleted file mode 100644 index a3380d7..0000000 --- a/docs/PRODUCTION_DEPLOYMENT.md +++ /dev/null @@ -1,279 +0,0 @@ -# Production deployment, secrets, and migrations - -This guide covers container image build, secret management, CI/CD injection, database migrations, health/readiness checks, and rollback for the NeuroWealth backend. 
- -## Container image (Dockerfile) - -The repo ships a multi-stage `Dockerfile`: - -| Stage | Base | Purpose | -|-------|------|---------| -| `builder` | `node:20-alpine` | `npm ci` → `prisma generate` → `tsc` | -| `runtime` | `node:20-alpine` | Slim image; only `dist/`, production `node_modules`, and Prisma artefacts | - -### Build - -```bash -docker build -t neurowealth-backend:latest . -``` - -Push to your registry: - -```bash -docker tag neurowealth-backend:latest registry.example.com/neurowealth-backend:$(git rev-parse --short HEAD) -docker push registry.example.com/neurowealth-backend:$(git rev-parse --short HEAD) -``` - -### Environment variables (production minimum) - -```bash -NODE_ENV=production -DATABASE_URL=postgresql://user:pass@db-host:5432/neurowealth -JWT_SEED=<64 hex chars> -WALLET_ENCRYPTION_KEY=<64 hex chars> -STELLAR_NETWORK=mainnet -STELLAR_RPC_URL=https://soroban-mainnet.stellar.org -STELLAR_AGENT_SECRET_KEY=S... -VAULT_CONTRACT_ID=C... -USDC_TOKEN_ADDRESS=C... -ANTHROPIC_API_KEY=sk-ant-... -TWILIO_AUTH_TOKEN= -TWILIO_ACCOUNT_SID=AC... -WHATSAPP_FROM=whatsapp:+1234567890 -ADMIN_API_TOKEN= -CORS_ORIGINS=https://app.neurowealth.io -``` - -Generate secrets locally (never commit raw values): - -```bash -openssl rand -hex 64 # JWT_SEED -openssl rand -hex 32 # WALLET_ENCRYPTION_KEY -openssl rand -hex 32 # ADMIN_API_TOKEN -``` - -### Database migrations (pre-start / init container) - -Run `prisma migrate deploy` **before** starting the app. 
The container `CMD` -does this automatically for simple single-instance deploys: - -``` -CMD ["sh", "-c", "npx prisma migrate deploy && node dist/index.js"] -``` - -For Kubernetes, use a dedicated `initContainer` so the migration completes -before any app replica starts: - -```yaml -initContainers: - - name: migrate - image: registry.example.com/neurowealth-backend:$(TAG) - command: ["npx", "prisma", "migrate", "deploy"] - env: - - name: DATABASE_URL - valueFrom: - secretKeyRef: - name: neurowealth-secrets - key: DATABASE_URL -``` - -### Health and readiness probes for load balancers / Kubernetes - -| Endpoint | HTTP method | Expected status | Use | -|----------|-------------|-----------------|-----| -| `GET /health/live` | GET | 200 always | Liveness — is the process running? | -| `GET /health/ready` | GET | 200 ready / 503 not ready | Readiness — are DB, event listener, and agent loop healthy? | -| `GET /health` | GET | 200 | Legacy; returns subsystem map from `readiness.ts` | - -Kubernetes example: - -```yaml -livenessProbe: - httpGet: - path: /health/live - port: 3001 - initialDelaySeconds: 10 - periodSeconds: 15 - -readinessProbe: - httpGet: - path: /health/ready - port: 3001 - initialDelaySeconds: 15 - periodSeconds: 10 - failureThreshold: 3 -``` - -AWS ALB / Nginx: configure the target group health check to `GET /health/ready` -and mark targets unhealthy at HTTP 5xx. During rolling deploys new instances -will return 503 until all three subsystems (`database`, `eventListener`, -`agentLoop`) are ready. - -### Key rotation / backup expectations - -- **WALLET_ENCRYPTION_KEY** — custodial wallet secrets are stored AES-256-GCM - encrypted in the `custodial_wallets` table. Rotate by running a migration - job that decrypts with the old key and re-encrypts with the new one before - swapping the env var. Back up the database; losing the encryption key makes - wallets unrecoverable. -- **JWT_SEED** — rotate every 90 days. 
All active sessions are invalidated; - users re-authenticate. Use a maintenance window. -- **Auth nonces** — stored in `auth_nonces` table with a 5-minute TTL. No - special rotation needed; expired rows are pruned lazily on each challenge - request. - -## Secret managers (recommended) - -| Provider | Best for | Notes | -|----------|----------|-------| -| **AWS Secrets Manager** | AWS-hosted production | Automatic rotation hooks; inject via ECS task secrets or Lambda env | -| **HashiCorp Vault** | Multi-cloud / on-prem | Dynamic secrets, audit trail; use AppRole or K8s auth | -| **GitHub Actions secrets** | CI/CD and staging | Store `DATABASE_URL`, `JWT_SEED`, etc. per environment; never log values | - -Never commit raw secrets. Use `.env.example` as a template only. - -## Required production secrets - -| Variable | Purpose | Rotation | -|----------|---------|----------| -| `JWT_SEED` | Signs session JWTs (64-hex) | Every 90 days; invalidate all sessions on rotate | -| `WALLET_ENCRYPTION_KEY` | Encrypts stored wallet material (32-byte hex) | Coordinated re-encryption migration required | -| `STELLAR_AGENT_SECRET_KEY` | On-chain agent signing (56-char `S…` key) | Generate new keypair, fund, update env, drain old key | -| `DATABASE_URL` | PostgreSQL connection | Rotate DB password in provider; update URL; restart app | -| `ANTHROPIC_API_KEY` | AI agent | Rotate in Anthropic console; update secret store | - -Generate locally (development only): - -```bash -openssl rand -hex 64 # JWT_SEED -openssl rand -hex 32 # WALLET_ENCRYPTION_KEY -``` - -### JWT_SEED rotation - -1. Generate a new 64-hex value and store it in your secret manager. -2. Deploy with the new `JWT_SEED` during a maintenance window. -3. All existing sessions become invalid; users re-authenticate via Stellar challenge. -4. Monitor auth error rates and `/api/auth` traffic. - -### WALLET_ENCRYPTION_KEY rotation - -1. Provision `WALLET_ENCRYPTION_KEY_NEW` alongside the current key. -2. 
Run a one-off migration job that decrypts with the old key and re-encrypts with the new key. -3. Swap env to the new key only after verification. -4. Remove the old key from the secret store. - -### STELLAR_AGENT_SECRET_KEY rotation - -1. Create and fund a new Stellar keypair on the target network. -2. Update contract/agent permissions if your vault requires an allowlist. -3. Set `STELLAR_AGENT_SECRET_KEY` in the secret manager and redeploy. -4. Verify agent loop and deposit/withdraw paths on testnet before mainnet. - -## CI/CD secret injection - -- Map GitHub Environment secrets (`staging`, `production`) to job `env` blocks. -- Use OIDC to AWS/GCP where possible instead of long-lived access keys. -- Restrict `workflow_dispatch` and production deploy jobs to protected branches. -- The `migration-smoke` CI job validates `npx prisma migrate deploy` + `npm run smoke` before release promotion. - -Example (GitHub Actions): - -```yaml -env: - DATABASE_URL: ${{ secrets.DATABASE_URL }} - JWT_SEED: ${{ secrets.JWT_SEED }} - WALLET_ENCRYPTION_KEY: ${{ secrets.WALLET_ENCRYPTION_KEY }} - STELLAR_AGENT_SECRET_KEY: ${{ secrets.STELLAR_AGENT_SECRET_KEY }} -``` - -## Deploy checklist (use at every release) - -### Pre-deploy - -- [ ] Review pending Prisma migrations (`npx prisma migrate status`) -- [ ] Confirm migration SQL is non-destructive or has a documented data backfill -- [ ] Take a database backup/snapshot (provider console or `pg_dump`) -- [ ] Schedule during low traffic; notify on-call -- [ ] Staging deploy passed CI (`migration-smoke` job green) - -### Deploy (roll-forward) - -1. **Backup** — snapshot or `pg_dump` of production DB. -2. **Apply migrations** — use the safe script (non-interactive in CI): - - ```bash - export DATABASE_URL="postgresql://..." - CI=1 bash scripts/apply-migration.sh - ``` - - Or manually: - - ```bash - npx prisma migrate deploy - npm run smoke - ``` - -3. **Smoke test** — `npm run smoke` must exit 0 (connectivity + core tables). -4. 
**Deploy application** — roll out new containers/instances with updated image. -5. **Promote traffic** — only after health checks pass (see below). - -### Post-deploy verification - -- [ ] `GET /health` returns 200 -- [ ] Readiness subsystems show `database`, `eventListener`, `agentLoop` ready -- [ ] No spike in DLQ size (`dead_letter_events` table) -- [ ] Monitor logs (Winston → CloudWatch/Datadog/etc.) - -## Health and readiness - -| Endpoint | Purpose | -|----------|---------| -| `GET /health` | Liveness; returns subsystem readiness from `src/config/readiness.ts` | -| Load balancer | Use readiness: return 503 until `database` (and optionally `eventListener`) are marked ready | - -If the event listener fails to start, the API may still serve read-only routes but on-chain ingestion will lag—treat sustained DLQ growth as a rollback trigger. - -## Migration rollback - -Prisma does not auto-reverse `migrate deploy`. Plan rollbacks explicitly: - -### When to rollback - -- `migrate deploy` or `npm run smoke` fails in CI or production -- Application errors correlated with a specific migration -- Data integrity issues in `processed_events`, `transactions`, or `positions` - -### Roll-forward vs rollback - -| Situation | Action | -|-----------|--------| -| Migration applied, app bug only | Roll back **application** image to previous tag; DB unchanged | -| Bad migration, no data loss yet | Restore DB from pre-deploy snapshot; redeploy previous app + migration set | -| Bad migration with partial writes | Restore snapshot; replay DLQ after fix; document manual reconciliation | - -### Rollback steps - -1. Stop traffic to new instances (drain load balancer). -2. Restore database from the pre-deploy backup/snapshot. -3. Deploy the **previous** application image (matching the restored schema). -4. Run `npm run smoke` against the restored DB. -5. Re-enable traffic; post-mortem and fix-forward migration in a new release. 
- -## Automated checks in CI - -The `migration-smoke` workflow job: - -1. Spins up an isolated Postgres service -2. Runs `npx prisma migrate status` and `npx prisma migrate deploy` -3. Fails if pending migrations remain after deploy -4. Runs `npm run smoke` - -A failing job blocks merge/deploy—treat it as the migration alert for staging. - -## Related files - -- `scripts/apply-migration.sh` — interactive checklist + migrate + smoke (for operators) -- `scripts/smoke-test.ts` — minimal schema connectivity test -- `.github/workflows/node-ci.yml` — `migration-smoke` job -- `docs/DEPLOYMENT_GUIDE.md` — vault event listener operational notes
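
The four `migration-smoke` steps listed under "Automated checks in CI" could be expressed roughly as the GitHub Actions job below. This is a sketch under stated assumptions, not the contents of `.github/workflows/node-ci.yml`: the Postgres image tag, the CI database name and credentials, and the action versions are all hypothetical. It also assumes a Prisma version whose `migrate status` command exits non-zero while migrations are unapplied (documented for 4.3.0 and later), which is what lets the second status call act as the gate.

```yaml
# Hypothetical sketch of the migration-smoke job; names and versions are assumptions.
jobs:
  migration-smoke:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:14            # assumed tag; the docs only require PostgreSQL 14+
        env:
          POSTGRES_USER: postgres     # throwaway CI credentials, not production values
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: neurowealth_ci
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    env:
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/neurowealth_ci
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Informational only: a fresh database has pending migrations, so tolerate a non-zero exit.
      - name: Report migration status before deploy
        run: npx prisma migrate status || true
      - name: Apply migrations
        run: npx prisma migrate deploy
      # Gate: this step fails the job if any migration is still unapplied.
      - name: Fail if migrations remain pending
        run: npx prisma migrate status
      - name: Smoke test
        run: npm run smoke
```

Running the status check twice keeps the pre-deploy report visible in the job log while reserving the failure condition for the post-deploy state, which matches the "fails if pending migrations remain after deploy" behaviour described above.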