Add the streaming monitor engine for high-scale checks #104
Draft
chrisbliss18 wants to merge 71 commits into
Conversation
added 30 commits
May 9, 2026 22:55
Add a SCHEDULER_ENGINE=streaming path that replaces round/page dispatch with an in-memory phase scheduler. Active sites are spread across their configured check interval, results are processed as they arrive, and the checker pool target is derived from required rate and observed latency instead of using NUM_WORKERS as a hard throughput ceiling. Keep existing incident handling intact by routing failures, recoveries, verifier escalation, TLS/SSL checks, audit entries, and WPCOM notifications through the current orchestrator paths. Streaming mode skips healthy check-history writes and batches legacy freshness projection so rollback has the accepted bounded freshness loss instead of per-probe last_checked_at accuracy. Add the additive jetmon_check_targets table and documentation for the streaming-engine migration path. The first prototype still reloads active config from jetpack_monitor_sites directly so uptime-bench can validate correctness before moving derived scheduling state fully onto the v2-native target table. Validated with go test ./....
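As a rough illustration of the phase-spread idea (not the PR's actual implementation), the sketch below gives each active site a stable offset inside its own check interval so dispatch stays flat instead of bursting at interval boundaries. Deriving the offset from a hash of the site ID is an assumption; the Target, phaseFor, and firstDue names are illustrative only.

```go
// Sketch: spread sites across their own interval with a deterministic phase.
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

type Target struct {
	SiteID   uint64
	Interval time.Duration // assumed > 0
}

// phaseFor derives a stable offset in [0, interval) from the site ID, so a
// restart keeps the same spread without persisting extra scheduling state.
func phaseFor(siteID uint64, interval time.Duration) time.Duration {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d", siteID)
	return time.Duration(h.Sum64() % uint64(interval))
}

// firstDue computes the first due time for a target after the engine starts.
func firstDue(t Target, start time.Time) time.Time {
	return start.Add(phaseFor(t.SiteID, t.Interval))
}

func main() {
	start := time.Now()
	for _, t := range []Target{{101, 5 * time.Minute}, {102, 5 * time.Minute}, {103, time.Minute}} {
		fmt.Printf("site %d first due in %s\n", t.SiteID, firstDue(t, start).Sub(start).Round(time.Second))
	}
}
```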
Treat NUM_WORKERS=0 as the default worker floor and BUCKET_TARGET=0 as BUCKET_TOTAL. The capacity test service already uses those values to avoid hard-coded tuning caps, and the streaming scheduler should preserve that posture instead of requiring manual config edits before a run. Document the zero-value behavior in the config reference and operations guide, and update validation tests so the compatibility contract is explicit. Validated with go test ./....
Let DATASET_SIZE, BODY_READ_MAX_BYTES, BODY_READ_MAX_MS, and KEYWORD_READ_MAX_BYTES use their documented defaults when capacity-test configs leave them at zero or null. This keeps the streaming branch compatible with the existing Jetmon v2 test service config instead of requiring manual tuning edits before every run. Keep negative values invalid so obvious typos are still caught, and document the zero-value default behavior for operators. Validated with go test ./....
Treat zero or omitted local sizing fields as defaults for the capacity-test configs that intentionally avoid hand-tuned caps. NUM_WORKERS remains the streaming worker floor, BUCKET_TARGET=0 expands to the full bucket range, and DATASET_SIZE/body-read budgets fall back to their documented defaults while negative values still fail validation. Also update validate-config/startup output so SCHEDULER_ENGINE=streaming is reported explicitly with reload, rollback projection, worker-floor, and fetch-page details instead of looking like the legacy variable-interval scheduler. Validated with go test ./....
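A minimal sketch of the zero-value compatibility contract described in these commits, assuming illustrative field names and default constants rather than the real config structs: zero or omitted sizing fields expand to documented defaults, and negative values still fail validation.

```go
// Sketch: zero means "use the documented default"; negative is invalid.
package main

import "fmt"

const (
	defaultWorkerFloor = 4      // hypothetical documented default
	defaultDatasetSize = 100000 // hypothetical documented default
)

type Config struct {
	NumWorkers   int
	BucketTarget int
	BucketTotal  int
	DatasetSize  int
}

func (c *Config) applyDefaults() error {
	switch {
	case c.NumWorkers < 0, c.BucketTarget < 0, c.DatasetSize < 0:
		return fmt.Errorf("negative sizing values are invalid: %+v", *c)
	}
	if c.NumWorkers == 0 {
		c.NumWorkers = defaultWorkerFloor // NUM_WORKERS=0 -> streaming worker floor
	}
	if c.BucketTarget == 0 {
		c.BucketTarget = c.BucketTotal // BUCKET_TARGET=0 -> full bucket range
	}
	if c.DatasetSize == 0 {
		c.DatasetSize = defaultDatasetSize // DATASET_SIZE=0 -> documented default
	}
	return nil
}

func main() {
	cfg := Config{BucketTotal: 16}
	if err := cfg.applyDefaults(); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```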
Add a lightweight active-site count check to the streaming scheduler so newly activated benchmark rows are picked up quickly instead of waiting for the full target reload interval. Empty schedulers poll every five seconds; populated schedulers check every thirty seconds and trigger a full reload only when the database count differs from the in-memory target count. This keeps the prototype safe for uptime-bench ladders that activate sites after the service is already running, without turning full config reloads into the steady-state hot path. Validated with go test ./....
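A sketch of that activation poll under the cadences described above (5 seconds when empty, 30 seconds when populated, full reload only on a count mismatch). countActiveSites, currentTargets, and reloadTargets are hypothetical stand-ins for the real DB and scheduler calls.

```go
// Sketch: cheap count polling that only triggers a full reload on divergence.
package scheduler

import (
	"context"
	"time"
)

func pollCadence(inMemoryTargets int) time.Duration {
	if inMemoryTargets == 0 {
		return 5 * time.Second // empty scheduler: look for activations quickly
	}
	return 30 * time.Second // populated scheduler: cheap steady-state check
}

func watchActiveCount(ctx context.Context,
	countActiveSites func(context.Context) (int, error),
	currentTargets func() int,
	reloadTargets func(context.Context) error,
) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(pollCadence(currentTargets())):
		}
		dbCount, err := countActiveSites(ctx)
		if err != nil {
			continue // transient DB error: try again on the next poll
		}
		if dbCount != currentTargets() {
			_ = reloadTargets(ctx) // counts diverged: pick up new/removed sites now
		}
	}
}
```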
Remove the non-blocking fallback that could drop completed check results when the result channel was full. Streaming mode depends on receiving every result so it can clear in-flight targets and reschedule them; dropping a result could permanently strand a site until process restart. Cancel the pool context before waiting during Drain so a worker blocked on result delivery can exit during shutdown, while still preserving the existing behavior of waiting for in-flight checks to finish before Drain returns. Add a regression test that fills the result channel, completes a second check, and verifies that the second result is delivered after the consumer catches up. Validated with go test ./....
Add a checker pool EnsureSize helper so the streaming scheduler can start workers proactively when active targets appear or the computed worker target increases. This avoids waiting for the queue-depth autoscaler to discover pressure after a large uptime-bench activation or production bucket takeover. Keep the worker target bounded by the pool max and only prewarm when streaming has active targets, so an idle process can still shrink down while a loaded process ramps immediately to the rate-derived target. Validated with go test ./....
Add checker pool size bounds so streaming mode can set a real worker floor while active targets are loaded. The deployed prototype was able to compute a worker target, but the generic pool autoscaler kept retiring workers during quiet queue moments, leaving the pool below target until pressure rebuilt. Streaming mode now sets min=max to the computed target while targets are active, then relaxes back to a one-worker floor when no targets are loaded. This should reduce warmup lag and avoid periodic shrink/rewarm churn during continuous capacity runs. Validated with go test ./....
Keep STREAMING_LEGACY_PROJECTION_INTERVAL_MIN inside the accepted 5-15 minute rollback window, and cap the effective projection cadence to the site interval when that interval is at least five minutes. The first streaming test proved HTTP coverage was correct, but the 10-minute projection cadence made 5-minute sites look stale to the persisted freshness verifier. This preserves reduced write pressure for faster one-minute checks by keeping the minimum projection window at five minutes, while allowing 5-minute production-style checks to maintain a 5-minute legacy freshness view. Validated with go test ./....
The 42beef8 capacity run showed healthy target-side coverage but a persistent legacy freshness failure: roughly 35% of rows were still older than the 5-minute DB freshness gate even though every target was checked. The tail age shape pointed at the projection cadence skipping checks that completed slightly less than 300 seconds after the previous projection. Treat a check as projection-eligible whenever the site's own check interval is at least the configured projection interval, and add a small slack window for faster sites that project less often. This keeps the batched legacy projection writer while avoiding a missed full interval caused by normal sub-second timing variance. Add regression coverage for the 5-minute interval case so checks that are 299 seconds apart still update the legacy freshness projection.
The 9f04c34 capacity run dropped stale rows from 1,773 to 131, but the remaining rows were clustered just past the 5-minute DB freshness cutoff while target-observer freshness stayed below 300 seconds. That points at projection accounting lag rather than missing checks. Streaming mode now uses the result timestamp only to decide whether a legacy projection is due. The actual last_checked_at value is the projection/commit time, bounded so it can never move before the result timestamp. This better represents when Jetmon has completed and persisted the check while keeping scheduling cadence based on the probe timestamp. Update projection tests so the expected timestamp is the projection commit time, including the 5-minute interval regression case.
The 14addde capacity run still showed a narrow 5-minute freshness boundary miss even though target coverage and replay detection were clean. At exact interval cadence, normal scheduling jitter and persistence timing leave no budget before a row can cross the strict freshness cutoff. Have streaming mode schedule normal checks slightly ahead of the configured site interval using bounded automatic headroom. This preserves per-site interval intent while giving the continuous engine room for latency, result processing, and projection flushing without operator tuning. The headroom is derived from the interval rather than a new config knob: five-minute sites run on a 285-second cadence, one-minute sites on a 57-second cadence, and long intervals are capped at 15 seconds of headroom. Failure retry cadence remains unchanged.
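The cadence arithmetic implied by the examples above (285 seconds for 5-minute sites, 57 seconds for 1-minute sites, 15-second cap for long intervals) matches a five-percent-of-interval headroom; that ratio is inferred from the examples rather than confirmed by the PR.

```go
// Sketch: effective cadence = interval minus min(5% of interval, 15s).
package main

import (
	"fmt"
	"time"
)

const maxHeadroom = 15 * time.Second

func effectiveCadence(interval time.Duration) time.Duration {
	headroom := interval / 20 // 5% of the configured interval (assumed ratio)
	if headroom > maxHeadroom {
		headroom = maxHeadroom
	}
	return interval - headroom
}

func main() {
	for _, iv := range []time.Duration{time.Minute, 5 * time.Minute, time.Hour} {
		fmt.Printf("%s interval -> %s cadence\n", iv, effectiveCadence(iv))
	}
}
```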
The 100k capacity run showed the first real streaming-engine break: request coverage collapsed while CPU and memory stayed below thresholds. Logs showed worker targets climbing into the ten-thousand range because timeout/failure latency was fed back into the sizing formula, which amplified external slowness into a connection and queue storm. Scale the streaming worker pool from successful probe latency instead of all result latency, and clamp the latency used for sizing between the default healthy estimate and a one-second ceiling. This keeps the pool large enough for normal variance without letting failure timeouts drive unbounded concurrency. Expose scale_latency alongside avg_latency in streaming metrics and logs so future capacity reports can distinguish normal site latency from the latency value used for concurrency control. Add regression coverage for the scaling cap.
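A hedged sketch of the clamped sizing described here: the worker target follows Little's Law from the required rate and the latency of successful probes only, with that latency clamped between a default healthy estimate and a one-second ceiling. The constants and the headroom factor are assumptions (a later commit in this PR reduces the headroom from 3x to 2x).

```go
// Sketch: workers ~= required rate * clamped success latency * headroom.
package sizing

import "time"

const (
	defaultHealthyLatency = 200 * time.Millisecond // assumed healthy estimate
	scaleLatencyCeiling   = time.Second            // cap introduced in this commit
	headroomFactor        = 3                      // later reduced to 2 in this PR
)

func workerTarget(requiredChecksPerSec float64, avgSuccessLatency time.Duration) int {
	scaleLatency := avgSuccessLatency
	if scaleLatency < defaultHealthyLatency {
		scaleLatency = defaultHealthyLatency // never size below the healthy baseline
	}
	if scaleLatency > scaleLatencyCeiling {
		scaleLatency = scaleLatencyCeiling // timeouts/failures cannot inflate the pool
	}
	target := requiredChecksPerSec * scaleLatency.Seconds() * headroomFactor
	if target < 1 {
		return 1
	}
	return int(target)
}
```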
Move streaming event, history, and SSL side effects onto sharded per-site processors so the scheduler can keep draining completed checker results and rescheduling probes even when persistence work slows down. This addresses the 100k/5m failure mode where workers finished checks but blocked behind the result channel while DB/event writes ran inline. Scale streaming queue capacity with active targets on activation, expose result and side-effect queue depth metrics, and auto-size the MySQL connection pool from runtime capacity instead of retaining the old fixed 10-connection ceiling. Increase batched DB write chunks to reduce round trips during large projection flushes. Verified with go test ./...
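One way the per-site sharding could look, assuming a hypothetical SideEffect type and hash-by-site routing; the real processor also distinguishes event, retry, history, and SSL work, but the ordering guarantee comes from every side effect for a site landing on the same shard goroutine.

```go
// Sketch: shard side effects by site ID so per-site order is preserved while
// persistence work runs off the scheduler hot path.
package sideeffects

import "sync"

type SideEffect struct {
	SiteID uint64
	Apply  func() // event/history/SSL persistence work
}

type Processor struct {
	shards []chan SideEffect
	wg     sync.WaitGroup
}

func NewProcessor(shardCount, queueDepth int) *Processor {
	p := &Processor{shards: make([]chan SideEffect, shardCount)}
	for i := range p.shards {
		ch := make(chan SideEffect, queueDepth)
		p.shards[i] = ch
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for se := range ch {
				se.Apply() // one goroutine per shard keeps per-site ordering
			}
		}()
	}
	return p
}

// Enqueue routes every side effect for a given site to the same shard.
func (p *Processor) Enqueue(se SideEffect) {
	p.shards[se.SiteID%uint64(len(p.shards))] <- se
}

func (p *Processor) Close() {
	for _, ch := range p.shards {
		close(ch)
	}
	p.wg.Wait()
}
```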
Avoid sending ordinary successful HTTP checks through the streaming side-effect pipeline when the site has no pending side effects, retry state, cached non-running status, or TLS observations. This keeps the healthy continuous-check path focused on scheduling and coarse freshness projection instead of filling event/history queues with no-op recovery work. Preserve correctness by forcing successes through side effects whenever prior failure work is pending for the same site, a retry entry exists, cached status is non-running, or TLS metadata needs processing. Also auto-size side-effect shards from GOMAXPROCS within safety bounds so failure/event handling can scale with the host without adding another operator tuning knob. Verified with go test ./...
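A compact sketch of that skip decision, with illustrative result and state fields standing in for the orchestrator's actual bookkeeping: failures always take the side-effect path, and successes only do when prior state or TLS metadata requires it.

```go
// Sketch: decide whether a completed check must enter the side-effect pipeline.
package orchestrator

type CheckResult struct {
	Success bool
	HasTLS  bool // TLS/SSL observations that still need processing
}

type SiteState struct {
	PendingSideEffects bool // prior failure work queued for this site
	HasRetryEntry      bool
	CachedStatusDown   bool // cached non-running status that must be cleared
}

func needsSideEffects(r CheckResult, s SiteState) bool {
	if !r.Success {
		return true // failures always go through events/retries/history
	}
	return s.PendingSideEffects || s.HasRetryEntry || s.CachedStatusDown || r.HasTLS
}
```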
Add WPCOM_NOTIFY_ENABLE as a real runtime config switch for the legacy WPCOM status-change path. The default remains enabled for production compatibility, but isolated test fleets can set it false so capacity and outage simulations do not attempt outbound WPCOM calls. Guard sendNotification before payload construction or last-alert updates when notifications are disabled, emit a disabled counter, surface the setting in startup/validate-config output, and document the flag near the WPCOM auth settings. Verified with go test ./...
When the streaming scheduler reloads from an active target set to zero active targets, rebuild the checker pool and side-effect processor and clear pending scheduler/side-effect state. This prevents queued checks and stale failure side effects from continuing to mutate events for sites that have just been deactivated by a capacity run or operator action. The previous 0608146 capacity run showed this failure mode directly: side-effect backlog kept opening/recovering events after the test targets were deactivated, leaving open test events that blocked the next preflight. Verified with go test ./...
Capacity testing showed that WPCOM_NOTIFY_ENABLE=false still wrote one log line for every skipped legacy status-change notification. At high site counts that turns the intentionally disabled path into avoidable journal I/O and obscures the scheduler signals we need to read while tuning the streaming engine. Keep the per-notification counter so skipped delivery volume remains measurable, but emit the operator-facing log once per orchestrator process. This preserves the configuration signal without making disabled WPCOM delivery a hot-path logging bottleneck. Verified with go test ./internal/orchestrator and go test ./....
The streaming side-effect processor was still recording failure check-history samples one row at a time. Capacity runs with many transient failures showed side_effect_depth climbing while the main result queue stayed drained, which points at side-effect database work as the next bottleneck. Buffer failure history rows per side-effect shard and flush them through the existing batch insert path on size or short interval. Event, retry, recovery, and SSL side effects still run in per-site shard order; only the trending history writes move to an independent batched path where strict per-site ordering is not required. Verified with go test ./internal/orchestrator and go test ./....
The 100k streaming run showed side_effect_depth pinned near queue capacity while the main result queue eventually backed up. The side-effect path performs event, retry, history, SSL, and notification bookkeeping, most of which is database or network I/O rather than CPU-bound work. Size the side-effect shard count from both GOMAXPROCS and active target volume, capped at a higher ceiling, so large fleets get enough independent side-effect workers to keep result draining from stalling. This keeps idle fleets modest while letting capacity runs use available database concurrency. Verified with go test ./internal/orchestrator and go test ./....
The streaming engine now scales side-effect concurrency with active target count, but the database pool still used the legacy CPU-only sizing. On the 100k run the service was side-effect-bound and the database showed active concurrent work, so the streaming path needs more I/O room without requiring operators to hand-tune a connection knob. Keep the legacy scheduler pool sizing unchanged. When SCHEDULER_ENGINE=streaming is active, raise the floor and per-GOMAXPROCS multiplier while preserving the existing hard cap so one process cannot grow connections without bound. Verified with go test ./internal/db ./internal/orchestrator and go test ./....
The 5298863 capacity run showed the side-effect queue staying healthy while the check queue began to lag. Live summaries reported successful-check latency above two seconds, but the worker target stayed at 1,053 because the scaler capped latency at a hard-coded one second. Replace that arbitrary ceiling with the configured request timeout. The pool still cannot exceed the active target count, but it can now add the concurrency needed when real network latency rises, instead of leaving capacity idle behind a stale fixed cap. Verified with go test ./internal/orchestrator and go test ./....
The first 8757266 run showed that removing the one-second latency cap let the check pool expand into the several-thousand-worker range. That cleared the old queue cap, but it also amplified target latency and failures, causing a new side-effect backlog. Keep latency-aware scaling, but reduce the Little's Law headroom multiplier from 3x to 2x. This still allows the worker pool to grow well past the old 1,053-worker ceiling when real latency rises, while lowering the risk that transient latency creates a self-reinforcing surge of checks and failure side effects. Verified with go test ./internal/orchestrator and go test ./....
The 100k-site streaming run against 770c0ab showed that worker growth was still too reactive and that a failure-heavy interval could build large result and side-effect queues. Workers climbed quickly from partial-minute latency samples, while confirmed-down or false-alarm side-effect outcomes did not move targets back to their normal phase-spread cadence. Dampen streaming worker target changes, apply scaling at a slower interval, and pause new dispatch while result or side-effect queues are already behind. Also update streaming target status from side-effect reports so confirmed-down and cleared failures stop staying on the fast local retry cadence. This keeps the fast retry path for Seems Down incidents, but prevents completed downstream outcomes from turning into a permanent 10-second check stream during large synthetic failure bursts.
The f4d1b27 run showed that queue backpressure prevented immediate overload, but once the queues cleared the scheduler flushed accumulated pending work in a single burst. One report dispatched almost ninety thousand checks, refilling the worker queue and pushing freshness lag above three minutes. Add an automatically sized per-tick dispatch budget based on the required steady-state rate, pending backlog, and current worker count. This lets Jetmon catch up after backpressure without dumping the whole backlog into the checker pool at once. The budget scales with fleet size, so larger batches can still use more of the host while keeping dispatch pressure paced and observable through a new dispatch-budget-limited metric.
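A sketch of the budget shape described above, under assumed weights (a bounded share of the backlog per tick and a cap tied to the worker pool); the PR derives its exact numbers from the steady-state rate, pending backlog, and current worker count, so treat the constants here as placeholders.

```go
// Sketch: pace catch-up dispatch instead of dumping the whole backlog at once.
package dispatch

import "math"

// budget returns how many due checks may be dispatched this tick.
//   - steadyRate: required checks/sec for the active fleet
//   - elapsed:    seconds covered by this tick
//   - backlog:    due-but-undispatched checks
//   - workers:    current checker pool size
func budget(steadyRate, elapsed float64, backlog, workers int) int {
	base := steadyRate * elapsed       // keep up with steady-state demand
	catchUp := float64(backlog) * 0.10 // drain at most 10% of backlog per tick (assumed)
	poolCap := float64(workers) * 2    // never enqueue far beyond what the pool can absorb (assumed)
	b := math.Min(base+catchUp, math.Max(base, poolCap))
	if b < 1 && backlog > 0 {
		b = 1
	}
	return int(b)
}
```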
The paced-dispatch run showed that the streaming event loop can spend long stretches processing completed results and side effects before it observes the next tick. The old budget treated that delayed tick like a normal one-second tick, which caused under-dispatch and allowed pending lag to keep growing even when the host had queue and CPU headroom. Use wall-clock elapsed time for dispatch budgeting and report summaries. This lets catch-up dispatch account for missed or delayed ticks without reverting to unbounded backlog dumps, and makes the logged SPS and last-round duration reflect the actual reporting interval under load.
The 6cb6261 run showed the elapsed-time dispatch budget keeping bursts under control, but the worker target was still sized almost entirely for steady-state throughput. Queue depth climbed while CPU remained well below saturation, leaving capacity unused and allowing freshness lag to grow. Add backlog-aware worker target growth when side effects are not already backpressured. Pending work and checker queue depth can now raise the desired worker target automatically, while the existing damping still prevents sudden worker explosions.
The backlog-aware run proved that elapsed dispatch budgeting can still become too permissive after a long backpressure pause. Once result and side-effect queues cleared, the scheduler was allowed to enqueue nearly a full pending window in one tick, recreating the same burst shape through a different path. Keep elapsed-time accounting, but cap each dispatch tick relative to the current worker pool. This preserves catch-up behavior while preventing minutes of accumulated due work from being dumped into the checker queue at once.
The latest 100k streaming run showed that most synthetic failures were DNS timeouts or SERVFAIL responses from the host-local systemd-resolved stub at 127.0.0.53. Those resolver failures then cascaded into Seems Down events, retry audit rows, false-positive verifier work, and side-effect backpressure. Teach the checker transport to use a Go resolver pointed at systemd-resolved's direct upstream nameserver list when available, skipping loopback stub resolvers. This keeps normal host resolver configuration as the source of truth while avoiding the local stub as a single high-volume bottleneck for monitor checks. The fallback remains the standard resolver behavior when no non-loopback nameserver is discoverable.
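A minimal sketch of the resolver override, assuming the upstream nameserver list has already been discovered from systemd-resolved; the fallback remains the standard resolver when no non-loopback server is found. The discovery step and timeout value are assumptions.

```go
// Sketch: point the Go resolver at a non-loopback upstream, skipping 127.0.0.53.
package dnsx

import (
	"context"
	"net"
	"time"
)

func directResolver(upstreams []string) *net.Resolver {
	var upstream string
	for _, ns := range upstreams {
		if ip := net.ParseIP(ns); ip != nil && !ip.IsLoopback() {
			upstream = net.JoinHostPort(ns, "53")
			break
		}
	}
	if upstream == "" {
		return net.DefaultResolver // no usable upstream: keep standard behavior
	}
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, upstream) // bypass the local stub
		},
	}
}
```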
The 100k-site streaming run still routed checker DNS through the local systemd-resolved stub and then amplified resolver/timeouts into a retry and side-effect storm. Force the checker dial path to resolve hostnames through the direct resolver before dialing, preserving DNS trace timings and preferring IPv4 addresses when both families are available. This makes the bypass explicit instead of relying on net.Dialer.Resolver behavior. Teach the streaming controller to recognize high global failure pressure once enough checks have completed. While pressure is active, local failures return to the normal phase-spread cadence instead of the 10-second retry cadence, backlog-based worker growth is disabled, and worker shrinkage happens faster. The intent is to avoid treating monitor-side systemic pressure as customer-site downtime work that should be retried immediately. Verified with go test ./internal/checker ./internal/orchestrator and go test ./....
added 30 commits
May 10, 2026 17:44
Move periodic streaming target reload scans into a background loader so the main scheduling loop can continue draining completed checks and dispatching due work during large active batches. The previous synchronous reload blocked result handling while scanning hundreds of thousands of rows, which showed up in the 500k internal capacity run as result backlog growth, stale checks, and low CPU utilization. Also avoid invalidating cached checker requests when a reload only refreshes non-check fields such as site status. This keeps large unchanged reloads from creating avoidable request rebuild work while still marking requests dirty when URL, interval, headers, keyword, timeout, or redirect settings change.
Avoid starting routine full target reload scans when the streaming scheduler already has meaningful pending work, result backlog, side-effect backlog, or scheduler lag. Activation and bucket-change reloads still run immediately so newly enabled sites and ownership changes are picked up without waiting. The 500k internal capacity run on 76917f0 showed that asynchronous reloads removed the hard scheduler stall, but the periodic 500k-row scan still competed with the hot path while the batch was behind. This change lets Jetmon prioritize check freshness under pressure and retry the periodic reload shortly instead of waiting a full reload interval.
Extend streaming pressure handling so a hot scheduler path can suppress new timeout/connect side effects and immediate retries before those transient local failures turn into incident churn. HTTP failures still flow through the normal event path, preserving real customer-site outage detection. This follows the internal 500k capacity evidence where Jetmon was behind, timing out locally, then opening and recovering many incidents. Those logs and side effects add pressure without improving correctness because the failures are monitor-side overload signals, not proof that the checked sites are down.
Turn off HTTP keep-alives in the shared checker transport. Jetmon's monitor workload checks mostly unique hostnames on minute-scale cadences, so idle connections rarely get reused before expiry, but Go's shared idle-connection pool becomes a global lock and goroutine-pressure point at high concurrency. The 500k internal run showed many checker goroutines contending in net/http Transport idle-connection bookkeeping. Closing each connection after its check better matches the workload and should reduce mutex contention, idle goroutine buildup, and memory pressure without changing site-check semantics.
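The transport change amounts to disabling keep-alives on the shared checker transport so each connection is closed after its check and the global idle-connection pool stops being a contention point. A sketch with illustrative timeout values:

```go
// Sketch: a checker transport that closes every connection after use.
package checker

import (
	"net"
	"net/http"
	"time"
)

func newCheckerTransport() *http.Transport {
	return &http.Transport{
		DisableKeepAlives: true, // close after each check; no shared idle pool
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: -1, // also disable TCP keep-alive probes on the socket
		}).DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 15 * time.Second,
	}
}
```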
The second internal-only 500k capacity run reached every site, but it still ended with 197,552 stale sites and only about 645 requests per second. Jetmon was repeatedly pausing dispatch while result depth was around 25k-80k, which left active checks at zero even though pending work was huge. Raise the result backlog threshold used to pause dispatch so the checker pipeline can stay fed until the result channel is near a real safety margin. Keep periodic reload deferral sensitive to backlog, increase the result drain batch, and cache hot-path pressure at tick cadence instead of recomputing shard queue depth for every result. Verified with go test ./internal/orchestrator ./internal/checker and go test ./... .
The f8de496 500k run removed shallow result-pause stalls, but live logs still showed large result backlogs. Completed checks can remain marked in-flight until the scheduler consumes their results, which delays their next due time and creates stale-site drift even when the HTTP work itself has finished. Drain a larger batch of completed results at the start of each scheduler tick before selecting and dispatching more targets. Also stream no-keyword response bodies to io.Discard instead of allocating a byte slice just to validate body integrity; keyword checks still retain the bounded body needed for matching. Verified with go test ./internal/orchestrator ./internal/checker, go test ./..., and a focused checker benchmark for no-keyword and keyword large-body checks.
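A sketch of the body-handling split with an assumed read budget: no-keyword checks drain the response to io.Discard to confirm the body is readable without retaining it, while keyword checks keep the bounded body for matching.

```go
// Sketch: stream no-keyword bodies to io.Discard; keep bounded bodies for keywords.
package checker

import (
	"io"
	"net/http"
)

const maxBodyBytes = 1 << 20 // bounded read budget (assumed value)

func readBody(resp *http.Response, keyword string) ([]byte, error) {
	limited := io.LimitReader(resp.Body, maxBodyBytes)
	if keyword == "" {
		// No keyword to match: drain without allocating a buffer.
		_, err := io.Copy(io.Discard, limited)
		return nil, err
	}
	// Keyword check: retain the bounded body for matching.
	return io.ReadAll(limited)
}
```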
The d5a548b live run kept result backlog controlled, but it still fell behind because failure-pressure mode capped the pool around 3.5k workers and the normal growth damping took several minutes to approach the concurrency demanded by a 500k-site five-minute schedule. Raise the conservative pressure latency from one second to three seconds so timeout pressure still limits runaway concurrency without collapsing below the observed capacity need. Increase the upward damping step from 25 percent to 50 percent so large activations and catch-up windows can use available resources sooner. Verified with go test ./internal/orchestrator and go test ./... .
The 500k internal capacity run showed the scheduler falling behind while completed-result handling dominated the event loop. Workers could become idle with hundreds of thousands of due checks pending because tick handling was not guaranteed to run while the result channel stayed hot. Factor the tick path into a reusable closure and service a ready tick immediately after each result-drain burst. This preserves the result-drain behavior that reduced stale coverage on the prior run while preventing result traffic from starving new dispatch decisions. Verified with go test ./internal/orchestrator and go test ./....
The d663ad9 500k internal run showed the scheduler blocking behind large legacy last_checked_at/next_check_at projection batches. During the failure window those batches hit DB timeouts and deadlocks while dispatch lag and stale-site counts continued to climb. Flush the coarse legacy freshness projection asynchronously, keep at most one projection write in flight, and requeue failed rows back into the coalesced pending map. This keeps passive rollback-compatibility writes from stopping result handling and check dispatch while preserving best-effort freshness projection for rollback. Verified with go test ./internal/orchestrator and go test ./....
The 8cda973 500k internal run improved the scheduler hot path but still showed the first wave falling behind while worker targets rose slowly and large result-drain bursts delayed regular dispatch decisions. Scale worker targets every five seconds, bootstrap the streaming pool with a one-second latency assumption for newly loaded fleets, and shorten each result-drain burst. This favors steadier dispatch cadence and faster startup throughput without changing check semantics or external interfaces. Verified with go test ./internal/orchestrator and go test ./....
The a81ba69 active capacity profile showed most CPU in socket creation, DNS/dial work, net/http connection setup, and GC. Internal HTTP capacity targets all resolve synthetic hostnames to the same target IP, but the default transport keyed idle connections by hostname and disabled keep-alive reuse, forcing a new socket for every check. Add an HTTP-only transport path that resolves the original hostname through the existing cached resolver, rewrites only the cleartext dial URL to the resolved IP, and preserves the original Host header. HTTPS remains on the existing transport so SNI and certificate validation semantics are unchanged. This should cut dial/syscall/GC pressure for cleartext fleets and internal capacity runs without changing external check results. Verified with go test ./internal/checker and go test ./....
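A sketch of the cleartext-only rewrite under the stated constraints (Host header preserved, HTTPS left untouched); the lookup function is a stand-in for the cached resolver. As written it uses only the first resolved address, which is exactly the limitation a later commit below removes with multi-address fallthrough.

```go
// Sketch: rewrite http:// dial URLs to a resolved IP so connections pool by
// address, while keeping the original hostname in the Host header.
package checker

import (
	"context"
	"net"
	"net/http"
)

func rewriteForIPPooling(ctx context.Context, req *http.Request,
	lookup func(context.Context, string) ([]string, error)) (*http.Request, error) {

	if req.URL.Scheme != "http" {
		return req, nil // HTTPS keeps hostname dialing so SNI/cert checks are unchanged
	}
	host, port, err := net.SplitHostPort(req.URL.Host)
	if err != nil {
		host, port = req.URL.Host, "80" // no explicit port in the URL
	}
	addrs, err := lookup(ctx, host)
	if err != nil || len(addrs) == 0 {
		return req, nil // resolver trouble: fall back to normal dialing
	}
	clone := req.Clone(ctx)                           // Clone deep-copies the URL
	clone.Host = req.URL.Host                         // preserve the original Host header
	clone.URL.Host = net.JoinHostPort(addrs[0], port) // dial and pool by resolved IP
	return clone, nil
}
```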
The 500k internal-only capacity run for 7801ecd reached full coverage but failed freshness after result backlog waves grew above 100k completed checks. Those waves came from delayed scheduler ticks catching up too aggressively and from worker scaling that treated hot-path pressure differently than failure pressure. Use combined hot-path pressure when choosing the streaming worker target, lower the result backlog threshold that pauses new dispatch, and cap delayed catch-up budgets so a busy orchestrator does not dump an oversized burst into the checker pool. The goal is steadier throughput with fewer timeout/backlog cascades at fleet scale. Verified with go test ./internal/orchestrator and go test ./....
The first run after dispatch smoothing stopped the large result backlog waves, but the scheduler became underpaced because ticks were still delayed behind large result-drain bursts and the delayed-tick budget was capped too tightly. Reduce the per-pass result drain batch so the scheduler services ticks more frequently while results are flowing, and allow a delayed tick to account for up to six seconds of missed pacing. This keeps the previous burst guardrails while giving the scheduler enough room to keep a 500k five-minute fleet near the required check rate. Verified with go test ./internal/orchestrator and go test ./....
Capacity logs from the 4fc05d8 500k run showed the scheduler recovering from backlog by raising the worker target above 10k. That amplified target latency and lowered completed-check throughput even though CPU still had headroom. Reduce the pressure-mode latency assumption from three seconds to one second. Pressure mode should recover steadily without chasing overload-induced latency spikes; normal non-pressure scaling can still raise the pool for genuinely high-latency healthy checks. Verified with go test ./internal/orchestrator and go test ./....
The cleartext IP-pooling transport reused connections by rewriting HTTP requests to a resolved address while preserving the original Host header. That fast path only tried the first address returned by DNS, which could turn a single bad A record into a false outage even when another resolved address was healthy. Attempt each resolved address until one returns a response or the request context expires. This keeps the IP-pooling optimization while preserving the legacy checker behavior of falling through usable resolved addresses before reporting a connection failure. Added a regression test with a cached multi-address hostname whose first loopback address has no listener and whose second address succeeds. Verified with make test, make lint, and make test-race.
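A sketch of the multi-address fallthrough, with doWithAddr standing in for issuing the rewritten request against a single resolved IP; each address is tried in order until one responds or the request context expires.

```go
// Sketch: fall through resolved addresses instead of failing on the first bad one.
package checker

import (
	"context"
	"fmt"
	"net/http"
)

func tryResolvedAddrs(ctx context.Context, addrs []string,
	doWithAddr func(context.Context, string) (*http.Response, error)) (*http.Response, error) {

	var lastErr error
	for _, addr := range addrs {
		if ctx.Err() != nil {
			return nil, ctx.Err() // request deadline/cancellation wins
		}
		resp, err := doWithAddr(ctx, addr)
		if err == nil {
			return resp, nil // any response ends the fallthrough
		}
		lastErr = err // remember the failure and try the next address
	}
	if lastErr == nil {
		lastErr = fmt.Errorf("no resolved addresses to try")
	}
	return nil, lastErr
}
```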
The streaming planner was scanning every scheduled due bucket once per tick, which became visible in pprof at 500k-site scale even when only a small set of buckets was ready to dispatch. Keep the per-second due buckets in a min-heap so popDue only visits buckets that are actually due. This preserves stale-entry protection for targets that were rescheduled before their old bucket fired, and adds tests for future bucket retention plus stale bucket skipping.
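A sketch of the per-second bucket min-heap with the stale-entry guard described above; the bucket layout and the dueAt callback are illustrative, not the PR's actual types.

```go
// Sketch: popDue only visits buckets whose second has arrived, and skips
// entries whose authoritative due time has since moved to a later bucket.
package schedule

import "container/heap"

type bucket struct {
	dueSec  int64    // unix second this bucket fires
	siteIDs []uint64 // targets scheduled into this second
}

type bucketHeap []*bucket

func (h bucketHeap) Len() int            { return len(h) }
func (h bucketHeap) Less(i, j int) bool  { return h[i].dueSec < h[j].dueSec }
func (h bucketHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *bucketHeap) Push(x interface{}) { *h = append(*h, x.(*bucket)) }
func (h *bucketHeap) Pop() interface{} {
	old := *h
	b := old[len(old)-1]
	*h = old[:len(old)-1]
	return b
}

// popDue returns site IDs from buckets due at or before nowSec.
func popDue(h *bucketHeap, nowSec int64, dueAt func(uint64) int64) []uint64 {
	var due []uint64
	for h.Len() > 0 && (*h)[0].dueSec <= nowSec {
		b := heap.Pop(h).(*bucket)
		for _, id := range b.siteIDs {
			if dueAt(id) <= nowSec {
				due = append(due, id) // still due: dispatch
			}
			// otherwise: stale entry; the target now lives in a later bucket
		}
	}
	return due
}
```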
Hot-path pressure from a growing pending queue was treated the same as actual failure pressure. That kept backlog-based worker growth disabled exactly when the scheduler was behind but the target fleet was healthy. Separate the failure-pressure cap from backlog pressure. Failure floods still stay conservative, while a pure scheduler backlog can temporarily raise the worker target and catch up before freshness margin erodes. Add tests for both paths.
The 500k internal run showed a feedback loop where a short RTT spike caused the scheduler to raise the worker target aggressively. That increased result-channel pressure and freshness lag instead of improving throughput. Keep latency-driven growth conservative while freshness is still on time, only use backlog growth once max lag shows real schedule risk, and skip backlog growth when result handling is already pressured. This preserves failure-pressure throttling separately from healthy backlog catch-up.
The latest 500k run showed the scheduler recovering too slowly after a backlog formed. Once the service fell behind, normal dispatch pacing kept throughput near the steady-state rate, so freshness lag persisted for too long. Add a guarded fast catch-up mode for dispatch budgeting. It only activates when max lag has crossed the report interval and result handling is not already under pressure, so the scheduler can drain a pending backlog without feeding an existing result-channel bottleneck.
The 56fbd02 live run showed that once max lag crossed one minute, backlog catch-up could still raise worker count into the thousands and risk reintroducing the result-backlog feedback loop. Cap non-failure latency scaling at the steady-state default and slow backlog worker growth to a smaller 2x base ceiling. Dispatch catch-up can still drain queued work, but worker growth no longer reacts aggressively to latency generated by the load test itself.
The 56fbd02 capacity run showed that once overload produced real timeouts, the failure-pressure path still allowed the worker target to remain at the one-second latency cap. That kept too much concurrency active while the service needed to cool down. Use the steady-state default latency for failure-pressure sizing so timeout floods pull concurrency back toward the normal target instead of preserving a high overload cap.
The 8cd4ada run showed that a full periodic reload of 500k targets can contend with the hot path and create a temporary scheduling dip. Active-count polling still catches additions/removals quickly, but unchanged large fleets do not need to rescan every five minutes. Keep the configured interval for smaller fleets, then raise the periodic reload interval with active target count up to a 30 minute ceiling. This reduces full-fleet reload pressure without removing the explicit active-count-change reload path.
The 500k internal capacity run on b8ad83b stayed fresh at the target observer but still crossed the host CPU threshold after a periodic full target reload and large rollback projection writes competed with the check loop. This change stretches periodic full reload cadence for large fleets, drains completed results more aggressively when the result channel backs up, and keeps rollback projection writes on the configured 5-15 minute compatibility window instead of shrinking them to 5-minute site cadence. Projection flushes are also capped into rate-sized batches, and deadlocked legacy freshness chunks retry locally so one lock conflict does not requeue a very large batch. This keeps active/deactivated site count polling intact while reducing the broad legacy-table scans and write bursts that showed up during the capacity loop. Verified with go test ./..., make lint, make test-race, and make build.
The df088e1 capacity run reduced MySQL and reload pressure but still let a backlog wave build late in the 500k window. The scheduler was dispatching and draining in one-second bursts, which left workers underfed during short latency spikes and allowed a single target to cross the target-observer stale threshold. Run the streaming scheduler tick every 250ms instead. Dispatch budgets still scale from elapsed time, so the total target rate is unchanged, but work is fed to the checker pool and completed results are drained more smoothly. Verified with go test ./..., make lint, make test-race, and make build.
The 250ms ticker experiment improved average pacing slightly, but the latest 500k internal-only run still failed the target-observer stale check and regressed the stale tail from one site to 277. That points to dispatch being too dependent on ticker delivery when result handling is busy, not just the base tick frequency. Move the dispatch/backpressure decision into a shared helper, restore the steadier one-second scheduler tick, and let the result-drain path opportunistically dispatch pending work at a bounded 100ms wake interval. This keeps workers fed when ticks are coalesced or delayed, while still respecting result-channel and side-effect backpressure. Validated with go test ./internal/orchestrator, go test ./..., make lint, make test-race, and make build.
The e3c8f2e capacity run fixed target-observer staleness at 500k, but the live logs still showed late-window pending backlog growth while the scheduler spent long stretches draining result bursts. CPU also failed the host threshold, which points to bursty recovery work rather than a steady pacing problem. Dispatch pending work during large result drains so worker queues do not run dry while the scheduler is handling completed checks. Reuse the already-computed result timestamp for scheduling/projection paths to remove duplicate time lookups in the hot path. Also let hot-path backlog pressure raise worker targets after ten seconds of lag instead of waiting for a full minute, while still refusing to grow workers when result-channel pressure is already high. This should smooth backlog recovery without amplifying result-processing contention. Validated with go test ./internal/orchestrator, go test ./..., make lint, make test-race, and make build.
The 5c00a67 internal 500k run kept target-observer freshness clean, but it still stopped on host CPU. Live scheduler logs showed result_depth crossing the previous active/10 pause threshold, result_pauses increasing, active_checks draining to zero, and pending work growing even while the host still had useful CPU headroom. Raise the dispatch-pause depth for large fleets to active/5 and reserve the final quarter of the result channel as the hard buffer. This keeps the result backlog from prematurely starving new dispatch at 500k+ sites while retaining a safety margin for completed checks to drain.
The 72968de internal 500k run passed target-observer freshness with a request ratio above 1.0, but it still stopped on host CPU. The run also showed late result bursts delaying scheduler ticks and briefly pushing result depth past the dispatch-pause threshold. Avoid cloning cached DNS address slices on every repeated check, keep the single-address resolver ordering path allocation-free, cap per-drain result bursts at 16k so the scheduler loop returns to ticks sooner, and widen the large-fleet dispatch-pause threshold to active/3. Together these changes target CPU p95 and burstiness without changing check semantics. Validated with go test ./internal/checker ./internal/orchestrator, go test ./..., make lint, make test-race, and make build.
The 6512fdd internal 500k run passed target-observer coverage but showed that the 16k result-drain cap was too low. Late in the window the scheduler processed roughly four capped result chunks per minute, result_depth climbed to the dispatch-pause threshold, and active_checks drained to zero while pending work remained. Restore the larger result-drain ceiling so completed checks can be processed fast enough, and make result-depth backpressure pause dispatch only while there is still active or queued check work. If the pool is already idle, dispatch continues so the scheduler does not self-starve behind a result backlog. Validated with go test ./internal/orchestrator, go test ./..., make lint, make test-race, and make build.
The latest Jetmon v2 capacity reports showed RSS staying around 0.45 GiB while CPU, pending work, result backlog, and host disk activity rose sharply at large active-site counts. This commit makes the streaming engine deliberately spend more memory before throttling the hot path. Replace the heap-backed due schedule with a bucketed due-time wheel, widen the in-memory work/result buffer ceiling, drain large result backlogs more aggressively, and default the coarse legacy freshness projection interval to the accepted 15-minute rollback window. The roadmap now tracks the remaining larger follow-ups: sharded result ingestion, deeper request/runtime caches, memory-backed success rollups, larger DNS/HTTP cache evaluation, and uptime-bench I/O attribution. Verification: go test ./...
Summary
This draft PR adds the streaming monitor engine intended to replace the legacy round-oriented scheduler for large Jetmon v2 fleets. The new engine keeps a long-lived in-memory target set, schedules checks by per-site due time, auto-scales checker workers from required throughput and observed latency, moves legacy freshness projection off the hot path, batches side effects, and adds pressure controls so local overload does not become customer-site downtime noise.
It also includes several capacity-loop refinements from the internal-only uptime-bench runs: DNS cache reuse for repeated checks, explicit resolver control for capacity isolation, cleartext HTTP pooling by resolved IP, result-drain and dispatch pacing changes, dynamic target reload deferral, sticky failure-pressure handling, and result-backlog safeguards that keep the worker pool fed while completed results drain.
Why
The previous scheduler architecture was still shaped around full-fleet rounds and config caps. That made it hard to scale toward very large fleets without either falling behind check windows or relying on hand-tuned values. The streaming engine is designed around the actual work rate: active sites divided by each site's configured interval, with scheduling headroom and backpressure tied to live queue/result/failure signals rather than fixed batch limits.
Highlights
- internal/orchestrator/streaming.go and focused streaming scheduler tests.
- SCHEDULER_MODE=streaming, streaming reload/projection intervals, worker auto defaults, resolver overrides, and body-read guardrails.
- docs/adr/0009-streaming-monitor-engine.md and updates to the operations/config docs.

Capacity Evidence So Far
The most recent deployed/tested build is 6512fdd, not the newest branch head. On the internal-only 500k-site, 5-minute cadence run, Jetmon v2 observed all 500,000 target sites with 0 never-seen and 0 stale target-observer sites. The target-observer request ratio was 1.0032, p95 target last-seen age was 294.88s, and max target last-seen age was 305.45s.

That run still stopped before 1M because host_cpu_used exceeded the current 85% stop threshold: average 77.03%, p95 92.26%, max 92.91%. DB freshness/projection also lagged badly in the benchmark health query, even though target-observer traffic stayed fresh. The latest commit, c976551, is intended to address the late-run result-drain/dispatch starvation seen in the 6512fdd logs, but it has not yet been deployed through a full capacity loop.

Known Follow-Ups Before Marking Ready
- Run one more full capacity pass on c976551 or newer.
- Settle the DB freshness/reporting decision.

Local Validation
- go test ./internal/checker ./internal/orchestrator
- go test ./...
- make lint
- make test-race
- make build

This PR is intentionally draft until the latest branch head has one more capacity pass and the DB freshness/reporting decision is settled.