Event-sourced state and internal API foundation#73

Merged
chrisbliss18 merged 11 commits into v2 from stack-01-event-api-foundation
Apr 28, 2026

@chrisbliss18
Contributor

Stacked PR 1 of 9 extracted from the broader v2-chris-events branch.

Base: v2
Head: stack-01-event-api-foundation

Summary:

  • Adds the event-sourced state model with jetmon_events and jetmon_event_transitions.
  • Adds the Phase 1/2 internal REST API foundation: auth, read endpoints, write endpoints, SLA stats, idempotency, and scope enforcement tests.
  • Captures the initial API and Phase 3 design documentation needed by later stack entries.

Review notes:
This is the foundation PR for the stack. Later PRs layer webhooks, alert contacts, rollout hardening, gateway tenant enforcement, and Docker/dev-env polish on top.

This stack supersedes the original all-in-one PR #72, but the original branch is left intact as a fallback while the split is reviewed.

Chris Jean added 11 commits April 25, 2026 00:47

The orchestrator's view of site state is now event-sourced across two new
tables: jetmon_events (one row per incident, mutable while open, frozen on
close) and jetmon_event_transitions (append-only history of every mutation
to an events row). Together they preserve full incident history including
intra-event severity bumps and state changes that the previous mutable-row
design silently overwrote.

Schema (migrations 9-11):
- jetmon_audit_log narrowed to operational-only (drop http_code, error_code,
  rtt_ms, old_status, new_status; add event_id, metadata JSON; relax blog_id
  to NULL; add idx_event_id, idx_event_type_created)
- jetmon_events with dedup_key generated column + UNIQUE KEY for one-open-
  event-per-tuple idempotency without partial indexes
- jetmon_event_transitions keyed on event_id with severity_before/after,
  state_before/after, reason, source, metadata, changed_at

New internal/eventstore package is the sole writer for both events tables.
Open/UpdateSeverity/UpdateState/Promote/LinkCause/Close all run their event
mutation and the matching transition row in a single transaction. A Tx
wrapper exposes the same surface for callers (the orchestrator) that need
to coordinate event writes with their own SQL — used to project v1
site_status onto jetpack_monitor_sites in the same transaction.
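
The event-plus-transition pattern described above can be sketched in memory. This is only an illustration: the `Store`, `Event`, and `Transition` shapes, the key format, the "opened" reason, and the "Resolved" state name are all hypothetical, and the mutex merely stands in for the single SQL transaction that the real internal/eventstore package uses.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Transition mirrors one append-only history row (hypothetical shape).
type Transition struct {
	EventID     int64
	StateBefore string
	StateAfter  string
	Reason      string
}

// Event is the incident row: mutable while open, frozen on close.
type Event struct {
	ID     int64
	State  string
	Closed bool
}

// Store is the only writer; every mutation appends exactly one transition
// while holding the same lock (the stand-in for "same transaction").
type Store struct {
	mu          sync.Mutex
	nextID      int64
	open        map[string]*Event // dedup key -> the single open event
	Transitions []Transition
}

func NewStore() *Store { return &Store{open: map[string]*Event{}} }

// dedupKey is an assumed key format: one open event per (site, check type).
func dedupKey(blogID int64, checkType string) string {
	return fmt.Sprintf("%d:%s", blogID, checkType)
}

// Open returns the already-open event for the tuple, or creates a new one.
func (s *Store) Open(blogID int64, checkType, state string) *Event {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := dedupKey(blogID, checkType)
	if ev, ok := s.open[key]; ok {
		return ev // idempotent: no second open event, no extra transition
	}
	s.nextID++
	ev := &Event{ID: s.nextID, State: state}
	s.open[key] = ev
	s.Transitions = append(s.Transitions,
		Transition{EventID: ev.ID, StateAfter: state, Reason: "opened"})
	return ev
}

// Close freezes the open event and records the closing transition.
func (s *Store) Close(blogID int64, checkType, reason string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := dedupKey(blogID, checkType)
	ev, ok := s.open[key]
	if !ok {
		return errors.New("no open event for " + key)
	}
	delete(s.open, key)
	s.Transitions = append(s.Transitions,
		Transition{EventID: ev.ID, StateBefore: ev.State, StateAfter: "Resolved", Reason: reason})
	ev.State, ev.Closed = "Resolved", true
	return nil
}
```

Calling `Open` twice for the same site and check type returns the same event and leaves a single transition, which is the behavior the dedup_key UNIQUE KEY provides at the schema level.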

Orchestrator integration:
- handleFailure opens a Seems Down event on the first local failure and
  projects site_status=SITE_DOWN in the same tx
- confirmDown promotes Seems Down → Down with reason=verifier_confirmed
- false-positive branch closes with reason=false_alarm
- handleRecovery closes with reason=verifier_cleared (was Down) or
  probe_cleared (still in Seems Down)
- checkSSLAlerts opens a site-level tls_expiry event with severity laddered
  Warning (≤30/14 days) → Degraded (≤7 days), closes on cert renewal

Audit package refactored to operational-only. EventCheck, EventStatusTransition,
and EventVeriflierResult constants dropped (per-probe data lives in
jetmon_check_history; site-state changes flow through the events tables).
LogTransition removed. The Log signature is replaced with an Entry struct
carrying optional EventID and Metadata fields so audit rows can link to
incidents for operator drill-down.

Documentation: EVENTS.md, AGENTS.md, and TAXONOMY.md describe the two-table
split, the "open on first failure" lifecycle, the dedup_key idempotency
trick, transition reasons vocabulary, and the same-transaction invariant
that holds events, transitions, and the v1 site_status projection in sync.

Server-side hardening:
- Replace bare ListenAndServe with http.Server timeouts (ReadHeaderTimeout 5s,
  ReadTimeout 30s, WriteTimeout 35s, IdleTimeout 120s) so a slow client cannot
  pin a goroutine indefinitely
- Expose Shutdown(ctx) for graceful drain. veriflier2 binary's SIGINT/SIGTERM
  handler now drains in-flight checks for up to 30s before closing the
  listener instead of os.Exit(0)
- Optional StatsD metrics (verifier.checks.received.count,
  verifier.checks.duration.timer, verifier.auth.rejected.count) initialized
  from STATSD_ADDR env var; skipped cleanly when unset

Client-side performance:
- Tuned http.Transport: MaxIdleConns 100, MaxIdleConnsPerHost 20,
  IdleConnTimeout 90s, ForceAttemptHTTP2 true, explicit DialContext timeouts.
  The default MaxIdleConnsPerHost of 2 was forcing reconnects under any
  concurrency and was a latent bottleneck during outage waves
- Drop the hardcoded 30s http.Client.Timeout — caller-supplied ctx deadline
  is now the single source of truth. Orchestrator wraps each escalation with
  context.WithTimeout(NET_COMMS_TIMEOUT + 5s headroom) so a wedged verifier
  no longer hangs for the orchestrator's lifetime

Request correlation:
- Add RequestID field to CheckRequest and CheckResult (16-byte hex,
  crypto/rand backed). Client auto-generates if caller leaves it empty;
  server logs and echoes back. Orchestrator stamps the same id on the
  "escalating to N verifliers" audit row and on each verifier's reply row,
  so the full lifecycle of an escalation can be reconstructed via
  jetmon_audit_log.metadata.request_id without timestamp matching
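
A small sketch of the id generation and the "fill in only when empty" rule. It reads "16-byte hex" as 16 random bytes encoded to 32 hex characters, which is an assumption; the trimmed `CheckRequest` struct and the helper names are hypothetical.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// CheckRequest is a trimmed, hypothetical view of the real request type.
type CheckRequest struct {
	RequestID string
	URL       string
}

// newRequestID returns 16 random bytes as 32 lowercase hex characters.
func newRequestID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return hex.EncodeToString(b[:]), nil
}

// ensureRequestID generates an id only when the caller left it empty, so an
// orchestrator-supplied id survives unchanged and can be echoed back.
func ensureRequestID(req *CheckRequest) error {
	if req.RequestID != "" {
		return nil
	}
	id, err := newRequestID()
	if err != nil {
		return err
	}
	req.RequestID = id
	return nil
}
```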

Tests cover RequestID generation/echo, server graceful drain, and the
existing handler paths.

Two targeted fixes that surfaced during docker integration testing:

Verifier config validation: an empty grpc_port (typically a typo of "port"
instead of "grpc_port" in config.json) silently parsed to "" and the
orchestrator then dialed "host:" which resolves to port 80, producing a
generic connection-refused error. validate() now rejects any VERIFIERS[]
entry with empty host or grpc_port at startup with a precise message
naming the offending entry.
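
The failure mode and the guard can be illustrated like this. The struct layout, the `name` field, the error wording, and the function names are hypothetical; the point is that encoding/json silently ignores the unknown "port" key, so the check has to happen after decoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VerifierConfig is an assumed shape for one VERIFIERS[] entry.
type VerifierConfig struct {
	Name     string `json:"name"`
	Host     string `json:"host"`
	GRPCPort string `json:"grpc_port"`
}

// validateVerifiers rejects entries that would later dial "host:" or ":port".
func validateVerifiers(vs []VerifierConfig) error {
	for i, v := range vs {
		if v.Host == "" {
			return fmt.Errorf("VERIFIERS[%d] (%q): host is empty", i, v.Name)
		}
		if v.GRPCPort == "" {
			return fmt.Errorf("VERIFIERS[%d] (%q): grpc_port is empty", i, v.Name)
		}
	}
	return nil
}

// loadVerifiers decodes and validates in one step so a misspelled key fails
// at startup rather than at the first escalation.
func loadVerifiers(data []byte) ([]VerifierConfig, error) {
	var vs []VerifierConfig
	if err := json.Unmarshal(data, &vs); err != nil {
		return nil, err
	}
	if err := validateVerifiers(vs); err != nil {
		return nil, err
	}
	return vs, nil
}
```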

PID file location: run-jetmon.sh exported JETMON_PID_FILE=/jetmon/jetmon2.pid,
but /jetmon is owned by the jetmon user from the Dockerfile while the
container runs as ${JETMON_UID:-1000} via docker-compose, so the write
failed with permission denied and reload/drain commands could not find
the file. Move the PID file to /jetmon/stats/jetmon2.pid (the stats/
directory is chmod 0777 by the Dockerfile) and surface the env var via
docker-compose so docker compose exec ./jetmon2 reload picks it up too.

API.md is a design proposal for the internal Jetmon REST API. No code yet;
this drives review and alignment before Phase 1 implementation.

Scope and audience: Jetmon does not expose this API to end customers
directly. A separate gateway service handles tenant isolation, public-facing
errors, customer rate limiting, and plan-based feature gating, and calls
Jetmon over this internal interface. Only known internal systems (gateway,
operator dashboard, alerting workers, batch jobs) are direct callers.

Design principles documented: read API as source-of-truth not snapshot;
severity and state both first-class fields (not collapsed); cursor pagination
only; honest 401/403/404 distinction (no info-leak hiding); per-consumer
audit logging via the existing jetmon_audit_log; verbose error messages
for incident response.

Authentication: per-consumer Bearer tokens with three coarse scopes
(read/write/admin), sha256-hashed at rest in jetmon_api_keys. No live/test
key split, no OAuth, no self-service key management — keys are created and
revoked via an ops-only ./jetmon2 keys CLI.

Endpoints described across six families: sites + current state (Family 1),
events and history (Family 2), SLA and statistics (Family 3), webhooks
(Family 4), alert contacts (Family 5), identity and utility (Family 6).
Build order recommended in four phases.

Resolved design questions section captures the rationale behind: raw numeric
IDs everywhere (no type prefix or ULID); 200/page list cap with no
include_inactive flag; Stripe-style versioned HMAC for webhook signing;
synchronous trigger-now with 30s timeout; single metadata field per event
(gateway sanitizes before forwarding to customers).

Implements the read-only foundation described in API.md. The API server runs
on a dedicated port (config: API_PORT, 0 disables) inside the jetmon2
binary. Internal-only — a separate gateway service handles all customer-
facing concerns and calls this surface.

Schema (migration 12):
- jetmon_api_keys with sha256-hashed key_hash, consumer_name, scope enum
  (read|write|admin), rate_limit_per_minute, expires_at, revoked_at,
  last_used_at, created_at, created_by

internal/apikeys package:
- GenerateToken returns a 32-byte crypto/rand token, base32-encoded with
  jm_ prefix
- Lookup resolves a raw token to the Key record, distinguishing
  ErrInvalidToken / ErrKeyRevoked / ErrKeyExpired and touching last_used_at
- Create / List / Revoke / Rotate cover the full management lifecycle
- Scope.Includes enforces the read < write < admin hierarchy
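
A sketch of token generation, hashing, and the scope hierarchy described above. `GenerateToken` and `Scope.Includes` are named in the text, but these bodies are guesses: unpadded standard base32, hex encoding of the hash, the `hashToken` helper, and the integer scope values are all assumptions.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base32"
	"encoding/hex"
)

// Scope is ordered so that a higher value implies every lower one.
type Scope int

const (
	ScopeRead Scope = iota + 1
	ScopeWrite
	ScopeAdmin
)

// Includes reports whether s grants at least what required grants.
func (s Scope) Includes(required Scope) bool { return s >= required }

// GenerateToken returns "jm_" followed by 32 random bytes in base32.
func GenerateToken() (string, error) {
	var b [32]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	enc := base32.StdEncoding.WithPadding(base32.NoPadding)
	return "jm_" + enc.EncodeToString(b[:]), nil
}

// hashToken is what would be stored at rest; the raw token is never saved.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}
```

Because only the hash is stored, a lookup hashes the presented token and compares against key_hash, which is why the raw token can be shown only once at creation.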

CLI: ./jetmon2 keys create|list|revoke|rotate. The token is shown only
once at creation; the API has no /keys endpoints (key management is ops-
only by design).

internal/api package:
- Server.Listen / Shutdown with the verifier-style timeouts
- requireScope middleware: parses Bearer token, resolves via apikeys.Lookup,
  enforces scope, applies per-key token-bucket rate limiting (in-memory,
  with periodic GC), audits to jetmon_audit_log under event_type=api_access
  with consumer_name as source
- Standard X-RateLimit-{Limit,Remaining,Reset} headers; 429 with
  Retry-After when exceeded
- Honest 401 vs 403 vs 404 (no info-leak hiding) and verbose error
  messages for incident response — gateway sanitizes for customers

Endpoints (all GET, scope=read):
- /api/v1/health (unauthenticated)
- /api/v1/me — token introspection
- /api/v1/sites — cursor pagination, filters: state, severity__gte,
  monitor_active, q (URL substring)
- /api/v1/sites/{id} — single site with active_events array
- /api/v1/sites/{id}/events — incident history with duration_ms and
  transition_count, filters: state, check_type, started_at__gte/lt, active
- /api/v1/sites/{id}/events/{event_id} — event detail with embedded
  transitions (cross-site protection: event must belong to named site)
- /api/v1/sites/{id}/events/{event_id}/transitions — paginated transition list
- /api/v1/events/{event_id} — direct event lookup
- /api/v1/sites/{id}/uptime — uptime % from event durations, with
  per-state seconds, MTTR, MTBF over 1h/24h/7d/30d/90d window or from/to
- /api/v1/sites/{id}/response-time — p50/p95/p99/max/mean from
  jetmon_check_history.rtt_ms, sample cap 100k
- /api/v1/sites/{id}/timing-breakdown — same percentile shape per
  DNS/TCP/TLS/TTFB component

Tests use go-sqlmock with QueryMatcherEqual for precise SQL contract
assertions: 63 tests covering rate limiter behavior, auth middleware
(missing/invalid/revoked/expired/insufficient scope/rate-limited paths),
all read endpoint happy paths, 404s, cross-site protection, filter parsing,
cursor pagination, window math, and percentile correctness.

Audit package gains EventAPIAccess constant; main.go wires the API server
into runServe with graceful Shutdown(ctx) on SIGINT/SIGTERM. Keys CLI
shares the same db handle as the rest of the binary.

Implements the write-side endpoints described in API.md "Family 1 + 2"
for the build order's Phase 2. All endpoints require write scope and route
through an Idempotency-Key middleware so retries with the same key are
safe; PATCH/DELETE skip the middleware because they're inherently
idempotent on this schema.

Idempotency middleware (`internal/api/idempotency.go`):
- In-memory store keyed on (api_key_id, idempotency_key) — scoped by API
  key so two consumers can't collide on the same opaque value
- 24h TTL with hourly GC of expired entries
- On replay with same body: returns cached status/headers/body verbatim
  plus an `Idempotency-Replayed: true` header for debugging
- On same key + different body: 409 idempotency_conflict
- Only caches 2xx and 4xx responses — 5xx are re-attempted by retries
- State is bound to this jetmon2 instance; multi-instance would need
  Redis or a backing table. Adequate for the current single-instance
  internal API; documented as a future migration if scaling demands it

Site write endpoints (`handlers_sites_write.go`):
- POST /api/v1/sites — caller-supplied blog_id (canonical from WPCOM),
  validates URL parses as http/https with non-empty host, validates
  redirect_policy ∈ {follow,alert,fail}, rejects duplicates with 409
  site_exists. Returns 201 with the full site record
- PATCH /api/v1/sites/{id} — partial update via dynamic SET clause from
  non-nil body fields. Empty body returns the current state (idempotent
  no-op). Validates inputs before existence check so bad shapes get 400
  even on nonexistent sites
- DELETE /api/v1/sites/{id} — soft delete: sets monitor_active=0 and
  closes any open events with reason=manual_override + metadata noting
  the deletion. Preserves audit trail and historical rows. Returns 204
- POST /api/v1/sites/{id}/pause — closes active events, sets
  monitor_active=0
- POST /api/v1/sites/{id}/resume — sets monitor_active=1. Does not
  reopen previously-closed events; the orchestrator's regular flow
  detects any current failure on the next round
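
The input validation for POST and PATCH could be sketched as follows. `validateMonitorURL` is named among the tested helpers, but this body is a guess at its behavior, and `validRedirectPolicy` is a hypothetical name.

```go
package main

import (
	"errors"
	"net/url"
)

// validateMonitorURL accepts only http/https URLs with a non-empty hostname.
func validateMonitorURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("url scheme must be http or https")
	}
	if u.Hostname() == "" {
		return errors.New("url must include a host")
	}
	return nil
}

// validRedirectPolicy reports whether p is one of follow, alert, or fail.
func validRedirectPolicy(p string) bool {
	switch p {
	case "follow", "alert", "fail":
		return true
	}
	return false
}
```

Running these checks before the existence lookup is what lets a malformed PATCH return a validation error even when the site id does not exist.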

Manual close + trigger-now (`handlers_events_write.go`):
- POST /api/v1/sites/{id}/events/{event_id}/close — closes an event with
  caller-supplied reason (defaults to manual_override) and stamps the
  optional note in metadata. Validates the event belongs to the named
  site (cross-site protection). Already-closed events return 200 with
  the existing record (idempotent close). If no other active events
  remain, projects site_status back to running
- POST /api/v1/sites/{id}/trigger-now — runs a checker.Check directly
  with a 30s context timeout, returns the raw timing result inline. On
  success, closes any open events with reason=probe_cleared (matches
  the orchestrator's no-verifier-on-recovery semantics from EVENTS.md).
  Does NOT open new events on failure; the orchestrator owns the
  failure-detection state machine on its regular round

Tests (34 new, 97 total in the api package):
- Idempotency: hash stability, store lookup/store/expiry, middleware
  passthrough/cache-and-replay/conflict/key-isolation
- Site create: happy path, missing/invalid blog_id, bad URL variants
  (empty, malformed, ftp scheme, missing host), bad redirect_policy
  (422), duplicate (409)
- Site update: happy path, empty body returns current, not-found,
  validates URL/redirect_policy before existence check
- Site delete: soft-deletes with monitor_active=0 and closes events
- Pause: closes active events with manual_override + projects state;
  the underlying close transaction is asserted with sqlmock
- Resume: sets monitor_active=1
- Manual close: happy path with read-back, default reason fallback,
  not-found, cross-site rejection (404), already-closed idempotency,
  invalid id parsing
- Trigger-now: site-not-found 404, success path with no active events,
  success path that closes one active event via probe_cleared, invalid
  id 400
- Helpers: validateMonitorURL, encodeCustomHeaders, boolToTinyint,
  buildUpdateSetClause empty/full

Verified in docker against the running stack: create / patch /
pause / resume / trigger-now / 422 validation / 204 delete all returned
expected responses. Idempotency-Replayed header confirmed on a same-key
replay.

Closes items 7 and 10 from the verifier review punch-list.

Body size cap: handleCheck wraps r.Body in http.MaxBytesReader (10MB) before
the JSON decoder runs. An overlong payload now returns 413 Request Entity
Too Large rather than streaming through the decoder until something else
times out. 10MB is generous headroom — a typical 200-site batch is ~50KB.

Empty auth-token guard: veriflier2/cmd/main.go now refuses to start if the
resolved auth token is empty. Previously an empty token created a subtle
auth-bypass edge case where any request with the literal "Bearer " header
(no token after the space) would pass the equality check. Mirrors the same
pattern as the existing empty-port guard.
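
The bypass and the guard are easy to see side by side. The function names are hypothetical, and the constant-time comparison in `authorized` is an addition for the sketch; the text only says the original used an equality check.

```go
package main

import (
	"crypto/subtle"
	"errors"
)

// naiveAuthorized shows the pre-fix behavior: with an empty token the
// expected header is "Bearer ", which any client can send.
func naiveAuthorized(header, token string) bool {
	return header == "Bearer "+token
}

// requireToken is the startup guard: an empty token is a fatal config error.
func requireToken(token string) error {
	if token == "" {
		return errors.New("auth token is empty; refusing to start")
	}
	return nil
}

// authorized also refuses an empty token at request time, as a second line
// of defense, and compares in constant time.
func authorized(header, token string) bool {
	if token == "" {
		return false
	}
	want := "Bearer " + token
	return subtle.ConstantTimeCompare([]byte(header), []byte(want)) == 1
}
```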

Item 11 (log.Fatalf on Listen failure) was reviewed and left as-is. Listen
only returns on startup port-binding failures (no in-flight work to drain)
or extremely rare mid-serve listener errors; clean shutdown via SIGINT goes
through srv.Shutdown which makes Listen return ErrServerClosed cleanly. The
current code is correct.

Three test races flagged by `go test -race ./...` predate this branch but
are easy enough to clean up while we're here. CI is now race-clean across
the entire module.

orchestrator: TestEscalateToVerifliersRecordsFalsePositiveWhenQuorumMissed
shared a `call` counter across the verifier-RPC goroutines escalateToVerifliers
spawns. Replace with sync/atomic.Int64.Add — same "first verifier returns
Success=false, subsequent ones return true" semantics, no race.
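
The counter fix can be sketched in isolation, assuming Go 1.19 or later for the `atomic.Int64` type. The stub and the driver function are hypothetical stand-ins for the test's verifier RPC stub and for escalateToVerifliers.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// newStubVerifier returns a check function that is safe to call from many
// goroutines: exactly one call observes count 1 and reports failure, and
// every other call reports success.
func newStubVerifier() func() bool {
	var calls atomic.Int64
	return func() bool {
		return calls.Add(1) > 1
	}
}

// runConcurrently invokes check from n goroutines and counts the successes.
func runConcurrently(n int, check func() bool) int64 {
	var wg sync.WaitGroup
	var succeeded atomic.Int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if check() {
				succeeded.Add(1)
			}
		}()
	}
	wg.Wait()
	return succeeded.Load()
}
```

With a plain `int` counter the same code would be flagged by the race detector, because the increment and the read happen on different goroutines without synchronization.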

checker: TestQueueDepth, TestActiveCount, and TestScaleUpWhenQueueDeep all
used a two-Cleanup pattern that, due to LIFO ordering, restored the package-
level poolCheckFunc stub before the worker goroutines had finished reading
it. Consolidate into a single Cleanup that unblocks workers, drains the
pool to completion, and only then restores the stub. Functionally equivalent;
race-free.

The existing handler-level tests invoke handlers directly, bypassing the
requireScope middleware. These tests close that gap by going through
s.routes() so the middleware actually fires:

- TestPhase2WriteEndpointsRejectReadToken: a read-scope key on every
  Phase 2 write endpoint (POST /sites, PATCH /sites/{id}, DELETE,
  pause/resume, trigger-now, manual close) returns 403 insufficient_scope
- TestPhase2WriteEndpointsAcceptWriteToken: a write-scope key reaches
  the handler (asserts NOT 401/403; the handler may then 400/404 due to
  test-scoped DB state, but that's downstream of scope enforcement)
- TestPhase2ReadEndpointsAcceptReadToken: read scope passes on read
  endpoints
- TestPhase2WriteEndpointsRejectMissingToken: no Authorization header →
  401 missing_token across all write endpoints
- TestAdminTokenCanReachAllScopes: admin includes write includes read

Each subtest sets up the auth lookup expectations (key SELECT + last_used_at
UPDATE) via a small expectAuthLookup helper. That way the boilerplate
stays out of the test bodies and the scope assertion is the focus.

Resolved Phase 3 questions land in API.md's webhooks section:
- Detection: pull-based, 1s poll interval on jetmon_event_transitions.
  Long-term answer for the architecture; not a stepping stone toward push.
  Multi-instance via row-claim, no pub/sub layer needed.
- Retry/dead-letter: 6 attempts on the 1m/5m/30m/1h/6h/24h schedule;
  abandoned status in the same jetmon_webhook_deliveries table; manual
  retry endpoint for re-firing after a consumer fixes their endpoint.
- Filter semantics: empty = match all, AND across dimensions, whitelist
  only. Stripe/GitHub/Slack convention.
- Signing/rotation: HMAC-SHA256 over {timestamp}.{body}; immediate
  revocation only in v1.
- event.* webhook types fire 1:1 with jetmon_event_transitions rows.
  site.state_changed deferred.
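
The signing scheme, HMAC-SHA256 over {timestamp}.{body}, can be sketched with the standard library. Treating the timestamp as Unix seconds in decimal and hex-encoding the digest are assumptions, as are the function names; the header format that would carry the signature is not shown.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"strconv"
)

// sign returns the hex HMAC-SHA256 of "<timestamp>.<body>" under secret.
func sign(secret []byte, timestamp int64, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(strconv.FormatInt(timestamp, 10)))
	mac.Write([]byte("."))
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares it in constant time.
func verify(secret []byte, timestamp int64, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	want, err := hex.DecodeString(sign(secret, timestamp, body))
	if err != nil {
		return false
	}
	return hmac.Equal(got, want)
}
```

Including the timestamp in the signed string lets a receiver reject stale deliveries, since changing the timestamp invalidates the signature.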

Deferred items captured in ROADMAP.md:
- site.state_changed webhooks (rollup from events to site-row projection)
- Grace-period secret rotation (server signs both old + new for a window)
- Multi-repo / multi-binary split: orchestrator, API, deliverer, dashboard,
  and a renamed verifier as separately deployable services. Schema is
  already the implicit bus; split would extract each concern into its
  own cmd/ entry and move shared types out of internal/. "veriflier" is
  a long-standing typo and a split is the natural moment to rename
  (candidates: verifier, witness, probe-worker, vantage).

Each deferred entry includes the trigger condition that would prompt
revisiting and the upgrade path that keeps it non-breaking.

API.md webhooks section gains:
- Backpressure (Q6): shared 50-goroutine pool with per-webhook in-flight
  cap of 3, enforced via map[webhook_id]int counter under a mutex.
  Prevents a slow URL from monopolizing the pool and starving other
  webhooks' deliveries.
- Schema (Q7): jetmon_webhooks and jetmon_webhook_deliveries column
  layouts, the (status, next_attempt_at) and (webhook_id, created_at)
  indexes the worker and list-deliveries endpoints need, and the
  frozen-at-fire-time payload contract.
- Signing rationale (Q8): brief catalogue of the alternatives considered
  (GitHub-style, Slack-style, JWT, RFC 9421 HTTP Message Signatures,
  asymmetric Ed25519) with the conditions under which each would
  become attractive. Stripe-style HMAC-SHA256 over {timestamp}.{body}
  is the right call for our internal-API shape; asymmetric is the most
  compelling future migration if/when a public API without a gateway
  becomes a requirement.
- Webhook ownership (Q9): write-scope manages all webhooks today;
  created_by is audit-only. Section explicitly enumerates the
  ramifications if Jetmon ever becomes a public API: per-tenant
  ownership column, filtered queries, possibly a webhooks scope,
  backfill migration. created_by is forward-compatible.
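
The per-webhook in-flight cap from the backpressure item (Q6) can be sketched as a counter map under a mutex. The type and method names are hypothetical, and the shared 50-goroutine pool that would call them is not shown.

```go
package main

import "sync"

const perWebhookCap = 3 // max concurrent deliveries per webhook

// inflight tracks how many deliveries each webhook currently has running.
type inflight struct {
	mu     sync.Mutex
	counts map[int64]int
}

func newInflight() *inflight {
	return &inflight{counts: map[int64]int{}}
}

// tryAcquire reserves a slot for webhookID, or returns false when the
// webhook already has perWebhookCap deliveries in flight.
func (f *inflight) tryAcquire(webhookID int64) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.counts[webhookID] >= perWebhookCap {
		return false
	}
	f.counts[webhookID]++
	return true
}

// release frees a slot and drops the map entry once it reaches zero so the
// map does not grow with every webhook ever seen.
func (f *inflight) release(webhookID int64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.counts[webhookID] <= 1 {
		delete(f.counts, webhookID)
		return
	}
	f.counts[webhookID]--
}
```

A worker that fails `tryAcquire` would leave the delivery for a later pass instead of blocking, so one slow URL can hold at most three of the pool's goroutines.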

ROADMAP.md gains a "Path to a public API" section under Architectural
roadmap that consolidates every internal-API design decision that would
need to change for direct customer access (auth scopes, error semantics,
error verbosity, webhook ownership and signing, rate limiting model,
idempotency key scoping, site id semantics). The migrations are
individually clean but touch most of the surface — public-API exposure
would be a project, not a flag flip.

chrisbliss18 merged commit f0be708 into v2 Apr 28, 2026
chrisbliss18 deleted the stack-01-event-api-foundation branch April 28, 2026 15:08