chore: Harden keycast autoscaling for a possible viral spike by dcadenas · Pull Request #196 · divinevideo/keycast

dcadenas · 2026-04-30T20:07:57Z

Plain summary

On a scale-down event, keycast could drop in-flight NIP-46 sign requests before their responses got back to the client. That would cause a reconnect storm right when we are trying to reduce pod count. This PR reorders shutdown into four explicit bounded phases so K8s stops routing traffic to the pod before we stop accepting, HTTP drains first, and only then do we tear down the relay client that the signer workers need. Every phase budget is env-var-tunable and validated at startup against a grace-period ceiling.

Part of the umbrella autoscaling hardening in issue #692. The Kubernetes-side pieces (HPA tune, PDB, topology-spread, terminationGracePeriodSeconds) land in a sibling PR against divinevideo/divine-iac-coreconfig — this PR is the keycast (Rust) slice.

No UI change. No public brand/partner references in the diff.

Motivation

Keycast holds two pieces of per-pod state that matter during shutdown: long-lived NIP-46 WebSocket relay connections, and a bounded queue of in-flight sign requests being processed by relay workers. The old shutdown block did two things in the wrong order:

It flipped /healthz/ready to 503 and immediately told axum to stop accepting. K8s EndpointSlice propagation is asynchronous, so the load balancer kept routing new HTTP and NIP-46 subscriptions to us for several seconds while we rejected them.
It called client.shutdown() before awaiting HTTP drain. In-flight NIP-46 requests in the relay worker queue lost their WebSocket back to the requesting client and could not deliver responses.

On a scale-down that amplifies into a reconnect/retry storm.

Related Issue

Relates to #692. Does not close it — the iac-side sibling PR covers HPA/PDB/topology/grace-period.

What Changed

The shutdown sequence in async_main is now four explicit phases, with five defensive additions accumulated across three review cycles.

Pre-drain: set shutting_down = true (readiness returns 503) and sleep SHUTDOWN_PRE_DRAIN_SECS (default 10s). This is the window during which K8s observes the failing probe and removes us from the Service EndpointSlice.
HTTP drain: notify axum to stop accepting and wait up to SHUTDOWN_HTTP_DRAIN_SECS (default 45s). On timeout the axum task is aborted via drain_http_or_abort and awaited to cancellation, so phases 3 and 4 do not race a still-live accept loop — previously the JoinHandle was dropped, which does NOT cancel the task under tokio+std semantics.
Signer drain: tear down the nostr_sdk Client (which stops signer.run() and the relay worker loop so in-flight sign responses can still be published on the way out) AND wait for tracked background tasks, both under a single shared SHUTDOWN_SIGNER_DRAIN_SECS (default 10s) budget via drain_signer_or_abort. Previously client.shutdown() was unbounded and only the trailing task_tracker.wait() was timeout-wrapped. On timeout the signer JoinHandle is aborted.
Cleanup: close the DB pool, bounded by SHUTDOWN_TEARDOWN_MARGIN (10s) via close_within_margin. sqlx's Pool::close waits for every checked-out connection to be returned, so a stuck query could previously block past the kubelet SIGKILL and swallow the final log line. Now the margin is a real bound; on timeout we still emit a warn-level 'Graceful shutdown complete with warning' so operators have a positive signal.

At startup validate_shutdown_timings checks that pre + http + signer + 10s teardown margin <= SHUTDOWN_GRACE_PERIOD_CEILING_SECS (default 120s). If the configured budget does not fit, a warn-level log names the four phase values and the ceiling. We warn rather than refuse to boot because the ceiling is cross-repo-coupled to the iac repo. Marked with TODO(#692) so the warn-only fallback can tighten once the iac PR pins terminationGracePeriodSeconds at >=75s everywhere.

Every SHUTDOWN_*_SECS env var is clamped through a per-field sanity ceiling of 1 day (SHUTDOWN_PER_FIELD_CEILING_SECS = 24*3600). Values above that fall back to the per-field default with a warn-level log. This prevents a typo like SHUTDOWN_PRE_DRAIN_SECS=18446744073709551600 from overflowing ShutdownTimings::total_budget()'s sum and crashlooping before the validate-and-warn path can surface an operator-actionable message.

New env vars (all optional, all have safe defaults):

SHUTDOWN_PRE_DRAIN_SECS — default 10.
SHUTDOWN_HTTP_DRAIN_SECS — default 45.
SHUTDOWN_SIGNER_DRAIN_SECS — default 10.
SHUTDOWN_GRACE_PERIOD_CEILING_SECS — default 120.

Design choice: one shared phase-3 bound vs two independent ones

The cycle-3 review suggested 'wrap client_for_shutdown.shutdown() in tokio::time::timeout(timings.signer_drain, ...) so it shares the same bound as task_tracker.wait()'. Taken literally that means two independent timeouts each sized at signer_drain, which in the worst case makes phase 3 run for 2 * signer_drain and quietly busts the design contract that validate_shutdown_timings enforces. Chose the stricter option: one outer timeout(signer_drain, ...) over the whole phase via drain_signer_or_abort. Same happy-path behavior, strictly tighter worst-case bound, no env-var surface change, no iac contract change.

Docs consulted

tokio graceful shutdown topic — three-part model (signal → cancel → wait). Existing code already followed this via Notify + TaskTracker; this change tightens phase ordering and makes aborts actually happen on timeout.
K8s HPA v2 configurable scaling behavior. Used to size the phase budgets against the iac PR's numbers.
K8s pods-and-endpoint-termination-flow tutorial — endpoint-slice removal is asynchronous vs SIGTERM, hence the in-process sleep before closing the accept loop.

Cross-repo coupling note (one-way door)

The commit b74aaba already on main (#174) widened registered_clients.tenant_id from INTEGER to BIGINT via database/migrations/20260429000000_widen_registered_clients_tenant_id.sql AND removed the ::BIGINT casts from the repository queries in core/src/repositories/registered_client.rs in the same commit. cloudbuild.yaml runs migrations before deploy so the happy path is safe, but if a deploy boots against a DB that has not applied that migration (dev DB restored from a pre-migration snapshot, revision rollback reverting the column but not the code, run-migrations step failing before deploy proceeds) every query in that repository fails at decode time with mismatched types: Rust type i64 (as SQL type int8) is not compatible with SQL type int4 and OAuth client management breaks entirely. Raised as a one-way-door risk by review. No code in this PR touches that file; this callout is for awareness.

Testing

I ran the keycast-bin unit tests, workspace lib+bin tests, clippy, and fmt. The new code has 27 shutdown-specific unit tests: 7 for env parsing (defaults, overrides, zero, invalid, per-field ceiling rejection/acceptance, u64::MAX regression), 2 for the pre-drain flag-flip ordering, 7 for the grace-period ceiling validation, 2 for the HTTP-drain abort-on-timeout helper, 3 for the DB-pool-close margin helper, 1 for total_budget not overflowing at the per-field ceiling, and 5 for the cycle-3 drain_signer_or_abort helper.

cargo test -p keycast --bin keycast — 45 passed / 0 failed.
cargo test --workspace --lib --bins — all passed.
cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated — clean.
cargo fmt --all -- --check — clean.
cargo test --workspace --verbose — 11 DB-backed integration tests (api/tests/atproto_http_test.rs, core/tests/auth_event_repository_test.rs) fail locally with Failed to run migrations: VersionMissing(20260429030500). That migration was removed/renumbered in source but my local keycast_test Postgres still has a record of having run it; reproduces on unmodified main and is unrelated to this PR's diff. CI with a fresh DB should be green.

Not tested in this PR: full kill-pod-and-observe-reconnect-spike verification. That is integrated acceptance (keycast + iac) and needs the sibling iac PR to land first so terminationGracePeriodSeconds is wide enough for the new phase budget.

Risks

The main risk is a grace-period mismatch with the sibling iac PR. The defaults target the iac PR's 75s goal, and the startup ceiling validation plus per-field sanity ceiling surface misconfig in boot logs rather than letting it silently kill shutdown mid-flight. All five defensive changes (abort-on-timeout for HTTP, pool-close margin, startup ceiling validation, per-field sanity clamp, shared phase-3 bound) are strictly safer than the original phased-shutdown diff.

SIGKILL mid-drain if the iac PR sets terminationGracePeriodSeconds below 75s. The validation logs a warn-level event at boot naming the exact mismatch. Env vars let ops tune down without a rebuild.
Long-running handlers stretching HTTP drain: axum is now aborted instead of left running.
Hung relay-close handshake or task_tracker.wait(): phase 3 is bounded by signer_drain; on timeout the signer task is aborted.
DB pool close stuck on a long-running query: bounded by SHUTDOWN_TEARDOWN_MARGIN. On timeout we still emit the final log (with warn level) so operators have a positive signal.
Operator typo on an env var: values above the 1-day per-field ceiling fall back to default with a warn log. u64::MAX-style input no longer panics total_budget.
Local dev restart latency: SHUTDOWN_PRE_DRAIN_SECS=0 short-circuits the pre-drain sleep.
Cross-repo one-way door: see the registered_clients.tenant_id callout above. Not touched by this PR; flagged for awareness.

Follow-ups

Tracked in issue #692:

KEDA migration to request-rate / connection-count scaling.
Balloon pods (PriorityClass: -1) to absorb cluster-autoscaler node-provisioning latency.
Tighten validate_shutdown_timings from warn to hard-fail once the iac-side terminationGracePeriodSeconds is pinned at >=75s everywhere.

Visuals

UI/web change with screenshots/video attached
No visual change
Visuals and text avoid sensitive external brand or partner names unless explicitly approved

Rework the SIGTERM handler in keycast/src/main.rs so that scale-down events on GKE don't drop in-flight NIP-46 RPCs mid-request. Phase ordering is now: 1. Flip /healthz/ready to 503 and sleep SHUTDOWN_PRE_DRAIN_SECS (default 10s). During this window the kubelet sees the failing probe and the Service controller removes the pod from the EndpointSlice, so the GCLB stops routing new HTTP and new NIP-46 subscriptions to us before we stop accepting. 2. Tell axum to stop accepting and wait up to SHUTDOWN_HTTP_DRAIN_SECS (default 45s) for in-flight HTTP requests to finish. 3. Tear down the nostr_sdk Client (stops signer.run() and the relay worker loop, flushes in-flight sign responses back to clients) and wait up to SHUTDOWN_SIGNER_DRAIN_SECS (default 10s) for tracked background tasks. This is the key fix: previously client.shutdown() ran BEFORE the drain waits, so any in-flight NIP-46 request in the relay queue lost its WebSocket back to the requesting client. 4. Close the DB pool. The three windows are env-var-tunable so the iac repo can adjust them without a rebuild. Sum of defaults (65s) plus DB/tracing teardown headroom fits inside a terminationGracePeriodSeconds of 75s. The iac-side PR bumps grace accordingly. Covered by 6 new unit tests: - parse_shutdown_timings defaults / overrides / zero-is-valid / invalid-falls-back-to-default - pre_drain_pause flips the readiness flag BEFORE sleeping (tokio::time pause + advance, deterministic) - pre_drain_pause with zero duration returns immediately Doc references for the design choices: - K8s HPA v2 behavior defaults (Percent 100/15s + Pods 4/15s, scaleDown stabilizationWindowSeconds: 300). HPA tune itself lands in the iac PR, but the phase budgets here are sized against the iac PR's terminationGracePeriodSeconds bump to 75s. - Tokio graceful-shutdown topic (https://tokio.rs/tokio/topics/shutdown) — three-part model (signal → cancel → wait). Already followed by existing code via Notify + TaskTracker; this change tightens phase ordering. - K8s pods-and-endpoint-termination-flow — endpoint-slice removal is asynchronous vs SIGTERM, hence the in-process sleep before closing the accept loop. Out of scope for this repo: HPA tune, PodDisruptionBudget, topologySpreadConstraint, terminationGracePeriodSeconds — those are in divine-iac-coreconfig. Adding tokio test-util to dev-dependencies so we can test the pre-drain pause deterministically with tokio::time::advance.

…startup Adds a startup-time sanity check so a misconfigured phase duration is reported via tracing::warn rather than surfacing as a kubelet SIGKILL mid-drain on scale-down. Addresses two concrete failure modes flagged in review: 1. Operator sets SHUTDOWN_HTTP_DRAIN_SECS=120 against a 75s grace period. Previously silent; now a loud warn-level log at boot includes the three phase values, the computed total + margin, and the configured ceiling. 2. Sibling divinevideo/divine-iac-coreconfig PR (tracked in #692) lands with terminationGracePeriodSeconds below the current keycast defaults. Same warning surface. New surface area: - SHUTDOWN_TEARDOWN_MARGIN (const, 10s): reserved headroom between the end of the phased drain and SIGKILL for DB pool close + tracing flush. Subtracted from the ceiling during validation. - DEFAULT_SHUTDOWN_GRACE_CEILING_SECS (const, 120s): conservative upper bound covering the issue spec's 60-90s range with headroom. Override via SHUTDOWN_GRACE_PERIOD_CEILING_SECS. - ShutdownTimings::total_budget(): pre_drain + http_drain + signer_drain (no margin). - validate_shutdown_timings(&ShutdownTimings, Duration) -> Result<(), String>: returns Err with an operator-actionable message if total + margin > ceiling. Exact equality is accepted (margin fully fits). The call site in async_main warns rather than refusing to boot: the ceiling is cross-repo-coupled to the iac repo, and an operator may legitimately need to start before the iac repo catches up. Marked with TODO(#692) so the warn-only fallback can be tightened once the iac PR lands and terminationGracePeriodSeconds is pinned at >=75s everywhere. Default phase values (65s sum + 10s margin = 75s) are unchanged, matching the iac PR target. Decision: keep defaults rather than shrink to fit a 60s grace, because the iac PR targets 75s and shrinking would reduce the HTTP-drain budget available for legitimately long in-flight sign requests. The validation path is the defense-in-depth for the smaller-grace case, with SHUTDOWN_* env vars as the tuning knobs. Covered by 7 new unit tests (31 total in keycast bin, all passing): - parse_shutdown_grace_ceiling default / override - ShutdownTimings::total_budget arithmetic - validate_shutdown_timings ok when under ceiling (75s, 90s) - validate_shutdown_timings err when over ceiling (60s grace vs 65s + 10s margin, from review) - validate_shutdown_timings err when single field overshoots (SHUTDOWN_HTTP_DRAIN_SECS=120 case from review) - validate_shutdown_timings boundary: equal-to-ceiling accepted, one-below rejected, one-above accepted cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated and cargo fmt --all -- --check are clean.

Previously the HTTP-drain timeout arm in async_main just logged 'API server shutdown timed out' and fell through. Dropping the axum task's JoinHandle at end of async_main does NOT cancel the underlying task (tokio+std semantics), so axum would keep running concurrently with phase 3 (relay client teardown) and phase 4 (DB pool close). Pool close racing a still-live accept loop surfaces as DB errors on any late-arriving requests, and the kubelet SIGKILL at terminationGracePeriodSeconds would be the actual stop — defeating the point of a bounded HTTP drain budget. Extracted the drain into a helper (drain_http_or_abort) that takes the JoinHandle and budget, awaits the task up to the budget via tokio::time::timeout with &mut handle (JoinHandle: Unpin + Future, so timeout borrows rather than consumes it), and on timeout calls handle.abort() then awaits the cancellation before returning. Returns a two-variant HttpDrainOutcome so the caller logs an honest message in both arms instead of the old 'timed out' line that implied axum had stopped. The caller in async_main now logs: 'API server shutdown timed out; axum task aborted to prevent DB pool close from racing late requests' on the timeout path, which is what actually happens. Covered by 2 new tokio::test(start_paused = true) unit tests: - drain_http_or_abort returns Completed when the task finishes inside the budget (fast path). - drain_http_or_abort aborts the task when the budget is exceeded, using a Drop-guarded cancellation flag in the spawned task to prove the task is actually stopped (not just left pending with the join result thrown away). This directly reproduces the review failure mode. 33 unit tests in the keycast bin now all pass. cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated and cargo fmt --all -- --check are clean.

Phase 4 (DB pool close) previously awaited sqlx's Pool::close() with no upper bound, contradicting the design contract that documented SHUTDOWN_TEARDOWN_MARGIN (10s) as the headroom reserved inside terminationGracePeriodSeconds for post-drain cleanup. sqlx's Pool::close() waits for every checked-out connection to be returned, so a connection stuck in a long-running query — e.g. a tenant-cache preload task mid-query when task_tracker.wait() timed out and signer_handle.abort() was called but not awaited — could block past the kubelet SIGKILL at terminationGracePeriodSeconds. That would swallow the final 'Graceful shutdown complete' log, leaving operators with no positive signal of clean shutdown. Extracted close_within_margin(close_fut, margin) -> PoolCloseOutcome which wraps the future in tokio::time::timeout. Generic over Future<Output = ()> so the behaviour can be tested deterministically with tokio::time::pause without standing up a real Postgres pool. sqlx's Pool::close is cancellation-safe, so dropping the future on timeout is safe. The call site in async_main now logs two distinct lines: - 'Graceful shutdown complete' on Completed (the happy path). - 'Graceful shutdown complete with warning: DB pool close exceeded teardown margin; stuck checked-out connections were dropped' on TimedOut (still emitted so operators have a positive signal). Covered by 3 new tokio::test(start_paused = true) unit tests: - returns Completed when the future finishes inside the margin. - returns TimedOut when the future blocks past the margin. - enforces the real bound: measures virtual-time elapsed and asserts the helper returns within the margin + 100ms even when the inner future would take 1h. This is the safety property — phase 4 cannot eat into the kubelet's SIGKILL window. 36 unit tests in the keycast bin now all pass. cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated and cargo fmt --all -- --check are clean.

Regression defense. `parse_duration_secs_env` previously accepted any u64, so a misconfig like SHUTDOWN_PRE_DRAIN_SECS=18446744073709551600 parsed cleanly. The value then flowed into ShutdownTimings, where total_budget()'s `+` sum panicked on Duration overflow — aborting the process at startup with a raw 'overflow when adding durations' panic BEFORE validate_shutdown_timings could surface the operator- actionable 'does not fit inside ceiling' warn. Net result: crashloop instead of loud warning. Added SHUTDOWN_PER_FIELD_CEILING_SECS const (24*3600 = 86400s = 1 day). parse_duration_secs_env now rejects any value strictly greater than the ceiling and falls back to the per-field default with a warn-level log naming the env var, the bad value, and the ceiling. Three fields at the ceiling sum to 3 * 86400 = 259200s = 3 days, comfortably inside u64, so total_budget() can stay infallible. One day is far above any plausible operational value — real drain budgets are seconds, not hours — and the ceiling is documented on the constant so an operator who genuinely wants a larger value knows where to look. The warn log still emits at boot so the misconfig doesn't silently revert. Covered by 4 new unit tests: - u64::MAX-ish input falls back to default AND downstream total_budget() does not panic (the actual safety property). - value strictly above the ceiling falls back to default. - value exactly at the ceiling is accepted (legitimate opt-in). - total_budget() with every field at the ceiling computes 3 * ceiling without overflow and equals Duration::from_secs(3 * ceiling). 40 unit tests in the keycast bin now all pass. cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated and cargo fmt --all -- --check are clean.

Phase 3's `client_for_shutdown.shutdown().await` was previously awaited unbounded in `async_main` — only the trailing `task_tracker.wait()` was wrapped in `tokio::time::timeout(signer_drain, ...)`. A hung WebSocket close handshake, a bug in `nostr_sdk::Client::shutdown`, or a dead relay socket could therefore block phase 3 past `terminationGracePeriodSeconds` and swallow the kubelet SIGKILL entry path: phase 4 (DB pool close, already bounded by `SHUTDOWN_TEARDOWN_MARGIN`) would never run, and the final `Graceful shutdown complete` log would be lost. This silently busted the design contract enforced by `validate_shutdown_timings`: pre_drain + http_drain + signer_drain + SHUTDOWN_TEARDOWN_MARGIN <= terminationGracePeriodSeconds Relay teardown was not in that sum. Extracted the phase into `drain_signer_or_abort(client_shutdown, tracker_wait, abort_signer, budget)` which wraps the whole phase in ONE outer `tokio::time::timeout(budget, ...)` rather than per-await timeouts. One shared outer bound is strictly more correct than two independent inner timeouts: two independent `signer_drain`-sized timeouts would make phase 3 worst-case `2 * signer_drain` and quietly bust the same contract we were trying to restore. The helper awaits `client_shutdown` then `tracker_wait` sequentially inside the timeout; on timeout it invokes the caller-supplied `abort_signer` callback (which in production calls `signer_handle.abort()`) and returns `AbortedAfterTimeout`. The helper is generic over both inner futures and the abort callback so the timeout behaviour can be exercised deterministically with `tokio::time::pause` in tests without standing up a real `nostr_sdk::Client` or a real signer `JoinHandle`. Call site in `async_main`: match drain_signer_or_abort( client_for_shutdown.shutdown(), task_tracker.wait(), || signer_handle_for_abort.abort(), timings.signer_drain, ).await { ... } replaces the separate `client_for_shutdown.shutdown().await` + `tokio::time::timeout(signer_drain, task_tracker.wait())` block. The log line on timeout is updated to reflect the new truth — either `client.shutdown()` or `task_tracker.wait()` (or both) overran the budget. Covered by 5 new `tokio::test(start_paused = true)` unit tests: - happy path: both futures finish inside budget → Completed, abort NOT called - client.shutdown() hangs → AbortedAfterTimeout, abort called (the reviewer-flagged regression target) - task_tracker.wait() hangs → AbortedAfterTimeout, abort called (pre-patch behavior preserved under new outer-timeout structure) - safety bound: helper returns within `budget` even when both inner futures would sleep 1h - shared-budget semantic: 8s + 8s sequential awaits against a 10s budget must time out at 10s, not extend to 16s (pins the single-outer-timeout vs two-independent-timeouts choice) 45 unit tests in the keycast bin now all pass (40 previously + 5 new). cargo test --workspace --lib --bins is clean. cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated and cargo fmt --all -- --check are clean. Relates to #692.

dcadenas added 6 commits April 29, 2026 18:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: Harden keycast autoscaling for a possible viral spike#196

chore: Harden keycast autoscaling for a possible viral spike#196
dcadenas wants to merge 6 commits into
mainfrom
feat/692-harden-keycast-autoscaling-for-a-possible-viral-spi

dcadenas commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dcadenas commented Apr 30, 2026

Plain summary

Motivation

Related Issue

What Changed

Design choice: one shared phase-3 bound vs two independent ones

Docs consulted

Cross-repo coupling note (one-way door)

Testing

Risks

Follow-ups

Visuals

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant