Merged
105 commits
b1e6bfc
Add event-sourced state model with transitions table
Apr 25, 2026
8592e2e
Improve verifier performance, robustness, and observability
Apr 25, 2026
5fd8c0b
Fix verifier config validation and PID file location
Apr 25, 2026
110c0b2
Add internal API design document
Apr 25, 2026
0b48202
Add Phase 1 internal API: auth + read endpoints + SLA stats
Apr 25, 2026
e104637
Add Phase 2 API: write surface + idempotency keys
Apr 25, 2026
28c3c74
Verifier hardening: body size cap and empty-token guard
Apr 25, 2026
1731e7e
Fix pre-existing race conditions in test code
Apr 25, 2026
75d901b
Add Phase 2 scope-enforcement tests through the full mux
Apr 25, 2026
6358192
Capture Phase 3 design decisions and deferrals
Apr 25, 2026
4fe474c
Capture remaining Phase 3 design decisions and public-API ramifications
Apr 25, 2026
dc3f821
Add Phase 3 webhooks: CRUD, delivery worker, manual retry
Apr 25, 2026
c3d15b3
Reframe jetmon-deliverer as all-outbound-dispatch binary
Apr 25, 2026
7e68bf2
Document Phase 3.x alert contacts design
Apr 25, 2026
40e1ad8
Fix severity values in alert contacts doc to match eventstore constants
Apr 25, 2026
809748a
Add migrations 16-18 for alert contacts schema
Apr 25, 2026
fc729a0
Add internal/alerting package skeleton
Apr 25, 2026
55949cd
Add email transport with wpcom/smtp/stub senders
Apr 25, 2026
adbc8e4
Add PagerDuty, Slack, and Teams transports
Apr 25, 2026
32972a5
Add alert contact CRUD and send-test endpoints
Apr 25, 2026
2536c98
Add alert delivery worker with retry and rate cap
Apr 25, 2026
26aa91f
Add alert delivery list/retry endpoints and wire worker into main
Apr 25, 2026
524356a
Soft-lock claimed rows so the deliver loop doesn't re-claim them
Apr 25, 2026
56d6e8e
Update CHANGELOG and AGENTS.md for v2 branch work
Apr 25, 2026
0de25b9
Tighten alerting input validation, MIME header safety, and test idemp…
Apr 25, 2026
29b380c
Document send-test idempotency and v2 polish fixes
Apr 25, 2026
a06ee1d
Fix dns_ms / tcp_ms / tls_ms overflow on partially-failed checks
Apr 25, 2026
63bb091
Refresh AGENTS.md architecture for the v2 health-platform shape
Apr 25, 2026
f9ad4d7
Backfill architecture decision records for the v2 branch
Apr 25, 2026
bfca180
Honor future revoked_at timestamps as API key grace windows
Apr 27, 2026
2e73f25
Replace the unsafe DB update flag with legacy status projection config
Apr 27, 2026
3bb1c3a
Derive API site state from open v2 events
Apr 27, 2026
e873adc
Document the shadow-state rollout and related migration constraints
Apr 27, 2026
9905d84
Make active-event rollup MySQL 5.7-compatible and align API key cutoffs
Apr 27, 2026
531e1a5
Clean up unused symbols and gopls-flagged inefficiencies
Apr 27, 2026
ca10fd8
Apply gofmt to pre-existing nonconforming files
Apr 27, 2026
712aaa0
Drop the dead nil-siteID branch from listEvents
Apr 27, 2026
5519c74
Bring architecture docs back in line with the current code
Apr 27, 2026
74628df
Align the API reference with implemented routes
Apr 27, 2026
8e6c780
Validate email transport configuration
Apr 27, 2026
a9afebf
Return only delivery rows that win the soft lock
Apr 27, 2026
df6ebcf
Warn at startup when email transport will not deliver
Apr 27, 2026
2b05f9f
Document the in-place filter idiom in delivery claim loops
Apr 27, 2026
cba7a54
Cover email transport startup helpers
Apr 27, 2026
4a4928c
Document email transport warning behavior
Apr 27, 2026
ed22772
Align dashboard docs with the current health surface
Apr 27, 2026
4430811
Handle health checks without a database handle
Apr 27, 2026
b85e302
Capture post-v2 probe-agent architecture options
Apr 27, 2026
c660e6f
Add a top-level docs index
Apr 27, 2026
99e3281
Clarify public API roadmap scope
Apr 27, 2026
9043cb8
Align Veriflier transport wording
Apr 27, 2026
714aa72
Keep make all independent of code generation
Apr 27, 2026
cc65332
Document the Makefile build path
Apr 27, 2026
1d18619
Record recent docs and tooling polish
Apr 27, 2026
7f485e2
Make Go resolution explicit in build targets
Apr 27, 2026
8d8977a
Preserve request ids in rejected API audit rows
Apr 27, 2026
d77f3c0
Use a writable Go build cache for Makefile targets
Apr 27, 2026
947ec05
Pass request id explicitly to the audit helper
Apr 27, 2026
4e69766
Unify health-check error wording across nil-DB and ping failure
Apr 27, 2026
7469c62
Document the SMTP-style reply code in the stub email test
Apr 27, 2026
1557d29
Cover audit rows on the success and rate-limit paths
Apr 27, 2026
3cc50ef
Thread context.Context through audit.Log
Apr 27, 2026
b17fdd2
Refresh README to lead with the v2 health-platform story
Apr 27, 2026
68932dc
Keep filtered site-list pagination advancing
Apr 27, 2026
9790ffb
Expand coverage across core packages
Apr 27, 2026
0cd7c6e
Cover list handlers and lifecycle helpers
Apr 27, 2026
653a1c8
Prioritize remaining roadmap work
Apr 27, 2026
4a4d036
Guard delivery workers behind an owner host
Apr 28, 2026
93b836e
Measure the v2 downtime decision flow
Apr 28, 2026
b518ab7
Detect legacy projection drift during rounds
Apr 28, 2026
edb3abf
Introduce a standalone deliverer entrypoint
Apr 28, 2026
7b44f42
Claim deliveries with row locks
Apr 28, 2026
18d753d
Bless JSON-over-HTTP for Veriflier transport
Apr 28, 2026
b79d3e9
Publish route-driven OpenAPI contract
Apr 28, 2026
9a31f88
Plan outbound credential encryption
Apr 28, 2026
839da6c
Document deliverer rollout policy
Apr 28, 2026
8f2ea05
Publish OpenAPI component schemas
Apr 28, 2026
5cec3ea
Validate OpenAPI client generation smoke
Apr 28, 2026
0a4752b
Define public API tenant boundary
Apr 28, 2026
ade7a2f
Add tenant ownership hooks for outbound resources
Apr 28, 2026
f8f8ca1
Enforce gateway tenant context for outbound API resources
Apr 28, 2026
3d600cf
Enforce gateway tenant ownership for site APIs
Apr 28, 2026
9ccc903
Clean up gateway docs and lint compatibility
Apr 28, 2026
768723d
Import gateway site tenant mappings
Apr 28, 2026
d5faf10
Add pinned bucket rollout mode
Apr 28, 2026
481c8a2
Add pinned rollout preflight command
Apr 28, 2026
95e7e4b
Add dynamic rollout ownership check
Apr 28, 2026
edf51fe
Split detection metrics by outcome source
Apr 28, 2026
68ceaf4
List legacy projection drift details
Apr 28, 2026
25f5ced
Track WPCOM notification parity metrics
Apr 28, 2026
58ad6f9
Surface rollout checks in validate-config
Apr 28, 2026
c108b2d
Surface rollout health in dashboard
Apr 28, 2026
6531b54
Simplify Docker env port overrides
Apr 28, 2026
37a7dd3
Document Docker env sample groups
Apr 28, 2026
d6e905f
Clarify Docker API port binding
Apr 28, 2026
7f6857d
Handle Docker runtime permissions
Apr 28, 2026
9439dd4
Refine Docker development setup
Apr 28, 2026
81cb306
Wait for Docker MySQL TCP readiness
Apr 28, 2026
3de5c3b
Expose Docker API bind separately
Apr 28, 2026
a758770
Add Docker Mailpit and healthchecks
Apr 28, 2026
b02d6f1
Expose site scheduling fields in API responses
Apr 28, 2026
67cf7f5
Document site soft-delete contract
Apr 28, 2026
5bd3f45
Package standalone deliverer service
Apr 28, 2026
c0866e8
Cover gateway tenant event access paths
Apr 28, 2026
faaef24
Document completed roadmap work
Apr 28, 2026
1 change: 1 addition & 0 deletions .dockerignore
@@ -8,6 +8,7 @@ config/db-config.conf
certs/

# Runtime output dirs
+docker/volumes/
logs/
stats/

3 changes: 2 additions & 1 deletion .gitignore
@@ -1,6 +1,6 @@
# Compiled binaries
bin/
-jetmon2
+/jetmon2

# Editor and OS files
.DS_Store
@@ -24,6 +24,7 @@ veriflier2/config/veriflier.json
*.pb.go

# Runtime output dirs
+docker/volumes/
logs/*.log
stats/*
!logs/.gitkeep
125 changes: 93 additions & 32 deletions AGENTS.md

Large diffs are not rendered by default.

1,077 changes: 1,077 additions & 0 deletions API.md

Large diffs are not rendered by default.

77 changes: 64 additions & 13 deletions ARCHITECTURE.md
@@ -8,12 +8,15 @@ call flow used to determine and report site status.
System Overview
---------------

-Jetmon 2 is a single Go binary. Multiple instances can run on different hosts,
-each owning a non-overlapping range of site buckets claimed from MySQL.
+Jetmon 2 runs as a Go monitor binary (`jetmon2`). Multiple monitor instances can
+run on different hosts, each owning a non-overlapping range of site buckets
+claimed from MySQL. Outbound webhooks and alert contacts can still run embedded
+inside one API-enabled `jetmon2` process, or through the standalone
+`jetmon-deliverer` binary as the first step toward the post-v2 process split.

```
┌─────────────────────────────────────────┐
-│ jetmon2 (single binary)
+jetmon2
│ │
┌──────────┐ sites │ ┌─────────────┐ ┌─────────────────┐ │
│ MySQL │──────────► │ │ Orchestrator│───►│ Checker Pool │ │
@@ -40,13 +43,26 @@ Multiple jetmon2 instances coordinate through MySQL bucket leases:
Host C ────── (takes over Host B's range if B goes offline)
```

Shadow-v2-state migration model:

- `jetmon_events` and `jetmon_event_transitions` are the authoritative incident
state for Jetmon v2.
- `jetpack_monitor_sites` remains the legacy site/config table during migration.
- While `LEGACY_STATUS_PROJECTION_ENABLE` is true, every v2 incident mutation
also projects the v1-compatible `site_status` / `last_status_change` fields
back to `jetpack_monitor_sites` in the same transaction.
- Once legacy readers have moved to the v2 API/event tables, disable
`LEGACY_STATUS_PROJECTION_ENABLE`; v2 incident state continues to be written
to the event tables.


Package Map
-----------

```
jetmon/
├── cmd/jetmon2/ Entry point, CLI subcommands, signal handling
├── cmd/jetmon-deliverer/ Standalone outbound delivery worker
├── internal/
│ ├── orchestrator/ Round loop, bucket coordination, retry queue,
│ │ failure escalation, status notifications
@@ -58,6 +74,11 @@ jetmon/
│ ├── veriflier/ Veriflier client (JSON-over-HTTP) and server
│ ├── wpcom/ WPCOM notification client with circuit breaker
│ ├── audit/ Structured audit log (read + write)
│ ├── eventstore/ Authoritative incident event + transition writer
│ ├── api/ Internal REST API, auth, rate limits, idempotency
│ ├── deliverer/ Shared webhook + alert-contact worker wiring
│ ├── webhooks/ Webhook registry + HMAC-signed delivery worker
│ ├── alerting/ Managed alert-contact registry + delivery worker
│ ├── metrics/ StatsD UDP client, stats file writer
│ └── dashboard/ HTTP + SSE operator dashboard
└── veriflier2/cmd/ Standalone veriflier binary
@@ -129,8 +150,8 @@ This is the end-to-end path from database query to WPCOM notification.
└─────────────┘ │ │
│ Stage 3 — Confirm down │
│ confirmDown(site, entry, vResults) │
-│ if DB_UPDATES_ENABLE:
-dbUpdateSiteStatus(→ confirmed_down)
+│ if LEGACY_STATUS_PROJECTION_ENABLE:
+project site_status(→ confirmed_down) │
│ if inMaintenance(): suppress + audit │
│ else if !isAlertSuppressed(): Notify() │
│ retries.clear(blogID) │
@@ -315,7 +336,9 @@ Veriflier Transport
◄── {"status":"OK","version":"1.2.3"}
```

-The transport is JSON-over-HTTP (a placeholder for gRPC; swap after `make generate`).
+The transport is JSON-over-HTTP for v2 production. `proto/veriflier.proto`
+remains as a schema reference for a possible future transport, but generated
+gRPC stubs are not required to build or deploy v2.


Bucket Distribution — Multi-Host Scaling
@@ -366,11 +389,12 @@ Database Tables
----------------

```
-jetpack_monitor_sites Core site list (pre-existing, extended by Jetmon 2)
+jetpack_monitor_sites Legacy site/config table plus compatibility projection
blog_id WordPress site identifier
bucket_no Determines which monitor instance owns this site
monitor_url URL to check
-site_status 1=running, 2=confirmed_down
+site_status Legacy v1 projection; derived from v2 events
+last_status_change Legacy v1 projection; derived from v2 transitions
last_checked_at Used to order fetch by least-recently-checked
ssl_expiry_date Updated after each TLS handshake
check_keyword Optional body text to require
@@ -387,20 +411,47 @@ Database Tables
last_heartbeat Updated every round; expiry triggers rebalance
status active / draining

-jetmon_audit_log Immutable event record for compliance/debugging
-event_type check | status_transition | wpcom_sent |
-wpcom_retry | retry_dispatched | veriflier_sent |
jetmon_events Authoritative v2 incident current state
id Incident identifier
blog_id Site identifier
check_type Probe family (http, tls_expiry, ...)
severity/state Current incident projection
started_at/ended_at Incident window
resolution_reason Required close reason

jetmon_event_transitions Append-only mutation history for jetmon_events
event_id Incident row being mutated
severity/state before/after
reason/source Why and who caused the mutation
changed_at Transition time

jetmon_audit_log Operational trail for compliance/debugging
event_type check | wpcom_sent | wpcom_retry |
retry_dispatched | veriflier_sent |
veriflier_result | maintenance_active |
-alert_suppressed
+alert_suppressed | api_access | config_reload
blog_id, source, http_code, error_code, rtt_ms
-old_status, new_status (for transition events)

jetmon_check_history Per-check timing samples
rtt_ms, dns_ms, tcp_ms, tls_ms, ttfb_ms

jetmon_false_positives Checks local failed but verifliers passed
blog_id, http_code, error_code, rtt_ms

jetmon_api_keys Internal API Bearer-token registry
key_hash, consumer_name, scope, rate_limit_per_minute

jetmon_webhooks Registered webhook receivers and filters
jetmon_webhook_deliveries
Per-transition webhook delivery attempts
jetmon_webhook_dispatch_progress
Webhook worker transition high-water marks

jetmon_alert_contacts Managed notification destinations
jetmon_alert_deliveries Per-transition alert delivery attempts
jetmon_alert_dispatch_progress
Alert worker transition high-water marks

jetmon_schema_migrations Idempotent migration tracking
```

134 changes: 133 additions & 1 deletion CHANGELOG.md
@@ -8,6 +8,136 @@ Breaking changes are marked **BREAKING**.

## Unreleased

### v2 branch — site health platform

The v2 branch builds on the Go rewrite to turn Jetmon from a status-flipper
into a full event-sourced health platform with an internal REST API,
HMAC-signed webhooks, and managed alert contacts. Kept on a parallel branch
because it is intentionally **not** drop-in with the Jetmon 1 wire format
(see PR #61 — DO NOT MERGE).

**New — event sourcing:**
- `jetmon_events` (current authoritative state per incident) and
`jetmon_event_transitions` (every status/severity change, append-only)
tables; `internal/eventstore` writes both in a single transaction
- Shadow-v2-state migration: while `LEGACY_STATUS_PROJECTION_ENABLE` is
true, event mutations also maintain the v1 `site_status` /
`last_status_change` projection for legacy consumers
- Five-layer severity ladder: `Up < Warning < Degraded < SeemsDown < Down`
matching `internal/eventstore.Severity*` constants
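The severity ladder above can be pictured as an ordered integer type, so that "at or above a threshold" is a plain comparison. The following is only a sketch: the constant names follow the `Severity*` pattern mentioned above, but the numeric values, the `AtLeast` helper, and the `String` method are assumptions made for illustration and may not match `internal/eventstore`.

```go
package main

import "fmt"

// Severity is an ordered incident level. The ordering mirrors the ladder
// Up < Warning < Degraded < SeemsDown < Down; the numeric values are an
// assumption of this sketch, not taken from internal/eventstore.
type Severity int

const (
	SeverityUp Severity = iota
	SeverityWarning
	SeverityDegraded
	SeveritySeemsDown
	SeverityDown
)

var severityNames = [...]string{"Up", "Warning", "Degraded", "SeemsDown", "Down"}

// String returns the ladder name, or a numeric fallback for unknown values.
func (s Severity) String() string {
	if s < SeverityUp || s > SeverityDown {
		return fmt.Sprintf("Severity(%d)", int(s))
	}
	return severityNames[s]
}

// AtLeast reports whether s is at or above threshold on the ladder.
// (Hypothetical helper, e.g. for a min_severity style filter.)
func (s Severity) AtLeast(threshold Severity) bool {
	return s >= threshold
}

func main() {
	fmt.Println(SeveritySeemsDown.AtLeast(SeverityDegraded)) // true
	fmt.Println(SeverityWarning.AtLeast(SeverityDown))       // false
}
```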

**New — internal REST API (`/api/v1/`, internal-only behind a gateway):**
- Per-consumer Bearer token auth with three scopes (`read` / `write` /
`admin`); `./jetmon2 keys create/list/revoke/rotate` CLI
- Per-key token-bucket rate limiter with `X-RateLimit-*` headers
- Stripe-style idempotency keys on POST endpoints
- Sites CRUD + pause/resume/trigger-now
- Events list + single + transitions list + manual close
- SLA endpoints: uptime, response-time, timing-breakdown
- Audit logging via `jetmon_audit_log` with `event_type=api_access`
- See API.md for full surface and design rationale

**New — webhooks (Phase 3):**
- `jetmon_webhooks` registry + `jetmon_webhook_deliveries` per-fire records
- Stripe-style HMAC-SHA256 signatures (`t=<unix>,v1=<hex>` over
`{ts}.{body}`); plaintext secret storage with documented threat model
- Filter dimensions: `events` + `site_filter` + `state_filter` (AND across,
whitelist within, empty=match all)
- Delivery worker with per-webhook in-flight cap (default 3) and shared
pool (default 50), retry ladder 1m / 5m / 30m / 1h / 6h then abandon
- Frozen-at-fire-time payload contract — consumer sees the event as it was
when the webhook fired, not as it is now
- POST `/webhooks/{id}/rotate-secret` (immediate revocation; grace-period
rotation deferred — see ROADMAP.md)
- POST `/webhooks/{id}/deliveries/{delivery_id}/retry` for operator manual
retry of abandoned rows

**New — alert contacts (Phase 3.x):**
- Managed channels for human destinations: `email`, `pagerduty`, `slack`,
`teams`. Boundary with webhooks: alert contacts deliver Jetmon-rendered
notifications through Jetmon-owned transports; webhooks deliver the raw
signed event stream for custom rendering
- Filter shape: `site_filter` + `min_severity` (default `Down`); per-contact
`max_per_hour` rate cap (default 60) as pager-storm insurance
- POST `/alert-contacts/{id}/test` for synthetic send-tests through the
same dispatch path
- Email transport pluggable via `EMAIL_TRANSPORT` config: `wpcom`
(production), `smtp` (dev / staging with MailHog), `stub` (default
log-only / tests, with startup and validate-config warnings)
- PagerDuty Events API v2 with severity mapping and event_action
trigger/resolve based on the recovery flag
- Slack Block Kit + Microsoft Teams Adaptive Card rendering
- Plaintext credential storage in `destination` JSON; same outbound-dispatch
rationale as webhook secrets, threat model documented inline
- Legacy WPCOM notification flow continues alongside; migration tracked
in ROADMAP.md

**Verifier hardening:**
- Body size cap and empty-token guard on the JSON-over-HTTP transport
- Verifier config validation: required `host` and `grpc_port` per entry,
PID file location now respects `JETMON_PID_FILE` env var

**Worker fixes:**
- Soft-lock fix for both webhooks and alerting deliver loops: `ClaimReady`
pushes `next_attempt_at` out by 60s so the 1s tick doesn't re-claim a
still-in-flight row. Without this, the per-contact in-flight cap (3)
was producing concurrent dispatches that inflated the attempt counter
and effectively skipped retry-schedule steps; the documented 7h36m
retry window was being collapsed to ~1h.
- `ClaimReady` now repeats the readiness predicate during the soft-lock
update and returns only rows whose update affected a row, so overlapping
claim attempts skip stale SELECT results instead of doing duplicate
dispatch work. Multi-instance row-claim caveat (SELECT ... FOR UPDATE
SKIP LOCKED) still tracked alongside the deliverer-binary extraction in
ROADMAP.md.

**Docs / tooling:**
- `make all` now builds the currently implemented `jetmon2` and
`veriflier2` binaries without requiring `protoc`; generated Veriflier
gRPC stubs remain an explicit `make generate` step for the future
transport swap.
- Makefile targets now share a configurable `GO` command and fall back to
`/usr/local/go/bin/go` when `go` is not on `PATH`; they also use an
overrideable `/tmp` Go build cache so checks do not depend on a
writable home-directory cache.
- Developer docs now point at the Makefile build path and document why
code generation is separate from the default build.
- Added a top-level docs index and a post-v2 probe-agent architecture
options document for revisiting the v3 direction after v2 is stable in
production.
- Clarified that the current Veriflier transport is JSON-over-HTTP and
that the public API roadmap is about a future customer-facing contract,
not the already-implemented internal `/api/v1`.

**Polish:**
- `alerting.Update` now validates `label` (must be non-empty) and
`max_per_hour` (must be ≥ 0) at input time, surfacing 422
`invalid_alert_contact` instead of letting an empty label silently
persist or a negative `max_per_hour` surface as a generic 500 from
MySQL's `INT UNSIGNED` constraint. Validations that don't depend on
the existing row run before the DB lookup so obviously bad PATCH
bodies don't pay for a round-trip.
- Email transport strips CR and LF from MIME header values
(`From` / `To` / `Subject`) as defense-in-depth against header
injection via untrusted strings (`monitor_url` is operator-controlled
but the column doesn't enforce CRLF-free). Body content with newlines
is unaffected.
- `POST /api/v1/alert-contacts/{id}/test` now honors `Idempotency-Key`
like the other write POSTs, so a retried "click to test" during a
network blip doesn't double-page the destination.
- API list-site rollup of the worst open event no longer relies on
`ROW_NUMBER()` window functions, so the query is compatible with
MySQL 5.7. Pagination caps the IN list and a site rarely has more
than one open event, so reducing in Go is cheap.
- API key cutoffs (`revoked_at` and `expires_at`) now share half-open
semantics: a key is valid for times strictly before the cutoff and
rejected at or after it. Future `revoked_at` continues to act as a
rotation grace window. See API.md.
- `LEGACY_STATUS_PROJECTION_ENABLE` is announced at startup
(`config: legacy_status_projection=enabled|disabled`) and surfaced by
`./jetmon2 validate-config`, so operators can confirm projection
state without reading the running config file.

### Jetmon 2 — initial Go rewrite

Complete rewrite of the Node.js + C++ uptime monitor as a single static Go binary.
@@ -22,7 +152,9 @@ Drop-in replacement for Jetmon 1; all existing MySQL schema columns are preserve
- `jetmon2 audit` — query per-site audit log from CLI
- Operator dashboard on configurable port with SSE state stream
- pprof debug server on localhost-only `DEBUG_PORT` (default 6060)
-- `DB_UPDATES_ENABLE` double-gate: requires both config flag and `JETMON_UNSAFE_DB_UPDATES=1` env var
+- `LEGACY_STATUS_PROJECTION_ENABLE` controls v1 `site_status` /
+`last_status_change` compatibility writes; `DB_UPDATES_ENABLE` remains
+as a deprecated alias
- Graceful shutdown with 30-second hard-exit backstop
- Non-root Docker images (`jetmon` / `veriflier` system users)
- Healthcheck-gated MySQL dependency in docker-compose