Skip to content

chore: Harden keycast autoscaling for a possible viral spike#196

Draft
dcadenas wants to merge 6 commits into
mainfrom
feat/692-harden-keycast-autoscaling-for-a-possible-viral-spi
Draft

chore: Harden keycast autoscaling for a possible viral spike#196
dcadenas wants to merge 6 commits into
mainfrom
feat/692-harden-keycast-autoscaling-for-a-possible-viral-spi

Conversation

@dcadenas
Copy link
Copy Markdown
Contributor

Plain summary

On a scale-down event, keycast could drop in-flight NIP-46 sign requests before their responses got back to the client. That would cause a reconnect storm right when we are trying to reduce pod count. This PR reorders shutdown into four explicit bounded phases so K8s stops routing traffic to the pod before we stop accepting, HTTP drains first, and only then do we tear down the relay client that the signer workers need. Every phase budget is env-var-tunable and validated at startup against a grace-period ceiling.

Part of the umbrella autoscaling hardening in issue #692. The Kubernetes-side pieces (HPA tune, PDB, topology-spread, terminationGracePeriodSeconds) land in a sibling PR against divinevideo/divine-iac-coreconfig — this PR is the keycast (Rust) slice.

No UI change. No public brand/partner references in the diff.

Motivation

Keycast holds two pieces of per-pod state that matter during shutdown: long-lived NIP-46 WebSocket relay connections, and a bounded queue of in-flight sign requests being processed by relay workers. The old shutdown block did two things in the wrong order:

  1. It flipped /healthz/ready to 503 and immediately told axum to stop accepting. K8s EndpointSlice propagation is asynchronous, so the load balancer kept routing new HTTP and NIP-46 subscriptions to us for several seconds while we rejected them.
  2. It called client.shutdown() before awaiting HTTP drain. In-flight NIP-46 requests in the relay worker queue lost their WebSocket back to the requesting client and could not deliver responses.

On a scale-down that amplifies into a reconnect/retry storm.

Related Issue

  • Relates to #692. Does not close it — the iac-side sibling PR covers HPA/PDB/topology/grace-period.

What Changed

The shutdown sequence in async_main is now four explicit phases, with five defensive additions accumulated across three review cycles.

  1. Pre-drain: set shutting_down = true (readiness returns 503) and sleep SHUTDOWN_PRE_DRAIN_SECS (default 10s). This is the window during which K8s observes the failing probe and removes us from the Service EndpointSlice.
  2. HTTP drain: notify axum to stop accepting and wait up to SHUTDOWN_HTTP_DRAIN_SECS (default 45s). On timeout the axum task is aborted via drain_http_or_abort and awaited to cancellation, so phases 3 and 4 do not race a still-live accept loop — previously the JoinHandle was dropped, which does NOT cancel the task under tokio+std semantics.
  3. Signer drain: tear down the nostr_sdk Client (which stops signer.run() and the relay worker loop so in-flight sign responses can still be published on the way out) AND wait for tracked background tasks, both under a single shared SHUTDOWN_SIGNER_DRAIN_SECS (default 10s) budget via drain_signer_or_abort. Previously client.shutdown() was unbounded and only the trailing task_tracker.wait() was timeout-wrapped. On timeout the signer JoinHandle is aborted.
  4. Cleanup: close the DB pool, bounded by SHUTDOWN_TEARDOWN_MARGIN (10s) via close_within_margin. sqlx's Pool::close waits for every checked-out connection to be returned, so a stuck query could previously block past the kubelet SIGKILL and swallow the final log line. Now the margin is a real bound; on timeout we still emit a warn-level 'Graceful shutdown complete with warning' so operators have a positive signal.

At startup validate_shutdown_timings checks that pre + http + signer + 10s teardown margin <= SHUTDOWN_GRACE_PERIOD_CEILING_SECS (default 120s). If the configured budget does not fit, a warn-level log names the four phase values and the ceiling. We warn rather than refuse to boot because the ceiling is cross-repo-coupled to the iac repo. Marked with TODO(#692) so the warn-only fallback can tighten once the iac PR pins terminationGracePeriodSeconds at >=75s everywhere.

Every SHUTDOWN_*_SECS env var is clamped through a per-field sanity ceiling of 1 day (SHUTDOWN_PER_FIELD_CEILING_SECS = 24*3600). Values above that fall back to the per-field default with a warn-level log. This prevents a typo like SHUTDOWN_PRE_DRAIN_SECS=18446744073709551600 from overflowing ShutdownTimings::total_budget()'s sum and crashlooping before the validate-and-warn path can surface an operator-actionable message.

New env vars (all optional, all have safe defaults):

  • SHUTDOWN_PRE_DRAIN_SECS — default 10.
  • SHUTDOWN_HTTP_DRAIN_SECS — default 45.
  • SHUTDOWN_SIGNER_DRAIN_SECS — default 10.
  • SHUTDOWN_GRACE_PERIOD_CEILING_SECS — default 120.

Design choice: one shared phase-3 bound vs two independent ones

The cycle-3 review suggested 'wrap client_for_shutdown.shutdown() in tokio::time::timeout(timings.signer_drain, ...) so it shares the same bound as task_tracker.wait()'. Taken literally that means two independent timeouts each sized at signer_drain, which in the worst case makes phase 3 run for 2 * signer_drain and quietly busts the design contract that validate_shutdown_timings enforces. Chose the stricter option: one outer timeout(signer_drain, ...) over the whole phase via drain_signer_or_abort. Same happy-path behavior, strictly tighter worst-case bound, no env-var surface change, no iac contract change.

Docs consulted

  • tokio graceful shutdown topic — three-part model (signal → cancel → wait). Existing code already followed this via Notify + TaskTracker; this change tightens phase ordering and makes aborts actually happen on timeout.
  • K8s HPA v2 configurable scaling behavior. Used to size the phase budgets against the iac PR's numbers.
  • K8s pods-and-endpoint-termination-flow tutorial — endpoint-slice removal is asynchronous vs SIGTERM, hence the in-process sleep before closing the accept loop.

Cross-repo coupling note (one-way door)

The commit b74aaba already on main (#174) widened registered_clients.tenant_id from INTEGER to BIGINT via database/migrations/20260429000000_widen_registered_clients_tenant_id.sql AND removed the ::BIGINT casts from the repository queries in core/src/repositories/registered_client.rs in the same commit. cloudbuild.yaml runs migrations before deploy so the happy path is safe, but if a deploy boots against a DB that has not applied that migration (dev DB restored from a pre-migration snapshot, revision rollback reverting the column but not the code, run-migrations step failing before deploy proceeds) every query in that repository fails at decode time with mismatched types: Rust type i64 (as SQL type int8) is not compatible with SQL type int4 and OAuth client management breaks entirely. Raised as a one-way-door risk by review. No code in this PR touches that file; this callout is for awareness.

Testing

I ran the keycast-bin unit tests, workspace lib+bin tests, clippy, and fmt. The new code has 27 shutdown-specific unit tests: 7 for env parsing (defaults, overrides, zero, invalid, per-field ceiling rejection/acceptance, u64::MAX regression), 2 for the pre-drain flag-flip ordering, 7 for the grace-period ceiling validation, 2 for the HTTP-drain abort-on-timeout helper, 3 for the DB-pool-close margin helper, 1 for total_budget not overflowing at the per-field ceiling, and 5 for the cycle-3 drain_signer_or_abort helper.

  • cargo test -p keycast --bin keycast — 45 passed / 0 failed.
  • cargo test --workspace --lib --bins — all passed.
  • cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated — clean.
  • cargo fmt --all -- --check — clean.
  • cargo test --workspace --verbose — 11 DB-backed integration tests (api/tests/atproto_http_test.rs, core/tests/auth_event_repository_test.rs) fail locally with Failed to run migrations: VersionMissing(20260429030500). That migration was removed/renumbered in source but my local keycast_test Postgres still has a record of having run it; reproduces on unmodified main and is unrelated to this PR's diff. CI with a fresh DB should be green.

Not tested in this PR: full kill-pod-and-observe-reconnect-spike verification. That is integrated acceptance (keycast + iac) and needs the sibling iac PR to land first so terminationGracePeriodSeconds is wide enough for the new phase budget.

Risks

The main risk is a grace-period mismatch with the sibling iac PR. The defaults target the iac PR's 75s goal, and the startup ceiling validation plus per-field sanity ceiling surface misconfig in boot logs rather than letting it silently kill shutdown mid-flight. All five defensive changes (abort-on-timeout for HTTP, pool-close margin, startup ceiling validation, per-field sanity clamp, shared phase-3 bound) are strictly safer than the original phased-shutdown diff.

  • SIGKILL mid-drain if the iac PR sets terminationGracePeriodSeconds below 75s. The validation logs a warn-level event at boot naming the exact mismatch. Env vars let ops tune down without a rebuild.
  • Long-running handlers stretching HTTP drain: axum is now aborted instead of left running.
  • Hung relay-close handshake or task_tracker.wait(): phase 3 is bounded by signer_drain; on timeout the signer task is aborted.
  • DB pool close stuck on a long-running query: bounded by SHUTDOWN_TEARDOWN_MARGIN. On timeout we still emit the final log (with warn level) so operators have a positive signal.
  • Operator typo on an env var: values above the 1-day per-field ceiling fall back to default with a warn log. u64::MAX-style input no longer panics total_budget.
  • Local dev restart latency: SHUTDOWN_PRE_DRAIN_SECS=0 short-circuits the pre-drain sleep.
  • Cross-repo one-way door: see the registered_clients.tenant_id callout above. Not touched by this PR; flagged for awareness.

Follow-ups

Tracked in issue #692:

  • KEDA migration to request-rate / connection-count scaling.
  • Balloon pods (PriorityClass: -1) to absorb cluster-autoscaler node-provisioning latency.
  • Tighten validate_shutdown_timings from warn to hard-fail once the iac-side terminationGracePeriodSeconds is pinned at >=75s everywhere.

Visuals

  • UI/web change with screenshots/video attached
  • No visual change
  • Visuals and text avoid sensitive external brand or partner names unless explicitly approved

Rework the SIGTERM handler in keycast/src/main.rs so that scale-down events
on GKE don't drop in-flight NIP-46 RPCs mid-request.

Phase ordering is now:

  1. Flip /healthz/ready to 503 and sleep SHUTDOWN_PRE_DRAIN_SECS (default
     10s). During this window the kubelet sees the failing probe and the
     Service controller removes the pod from the EndpointSlice, so the GCLB
     stops routing new HTTP and new NIP-46 subscriptions to us before we
     stop accepting.
  2. Tell axum to stop accepting and wait up to SHUTDOWN_HTTP_DRAIN_SECS
     (default 45s) for in-flight HTTP requests to finish.
  3. Tear down the nostr_sdk Client (stops signer.run() and the relay
     worker loop, flushes in-flight sign responses back to clients) and
     wait up to SHUTDOWN_SIGNER_DRAIN_SECS (default 10s) for tracked
     background tasks. This is the key fix: previously client.shutdown()
     ran BEFORE the drain waits, so any in-flight NIP-46 request in the
     relay queue lost its WebSocket back to the requesting client.
  4. Close the DB pool.

The three windows are env-var-tunable so the iac repo can adjust them
without a rebuild. Sum of defaults (65s) plus DB/tracing teardown headroom
fits inside a terminationGracePeriodSeconds of 75s. The iac-side PR bumps
grace accordingly.

Covered by 6 new unit tests:
  - parse_shutdown_timings defaults / overrides / zero-is-valid /
    invalid-falls-back-to-default
  - pre_drain_pause flips the readiness flag BEFORE sleeping
    (tokio::time pause + advance, deterministic)
  - pre_drain_pause with zero duration returns immediately

Doc references for the design choices:
  - K8s HPA v2 behavior defaults (Percent 100/15s + Pods 4/15s, scaleDown
    stabilizationWindowSeconds: 300). HPA tune itself lands in the iac
    PR, but the phase budgets here are sized against the iac PR's
    terminationGracePeriodSeconds bump to 75s.
  - Tokio graceful-shutdown topic
    (https://tokio.rs/tokio/topics/shutdown) — three-part model (signal →
    cancel → wait). Already followed by existing code via Notify +
    TaskTracker; this change tightens phase ordering.
  - K8s pods-and-endpoint-termination-flow — endpoint-slice removal is
    asynchronous vs SIGTERM, hence the in-process sleep before closing the
    accept loop.

Out of scope for this repo: HPA tune, PodDisruptionBudget,
topologySpreadConstraint, terminationGracePeriodSeconds — those are in
divine-iac-coreconfig. Adding tokio test-util to dev-dependencies so we
can test the pre-drain pause deterministically with tokio::time::advance.
…startup

Adds a startup-time sanity check so a misconfigured phase duration is
reported via tracing::warn rather than surfacing as a kubelet SIGKILL
mid-drain on scale-down. Addresses two concrete failure modes flagged
in review:

  1. Operator sets SHUTDOWN_HTTP_DRAIN_SECS=120 against a 75s grace
     period. Previously silent; now a loud warn-level log at boot
     includes the three phase values, the computed total + margin, and
     the configured ceiling.
  2. Sibling divinevideo/divine-iac-coreconfig PR (tracked in #692)
     lands with terminationGracePeriodSeconds below the current keycast
     defaults. Same warning surface.

New surface area:

  - SHUTDOWN_TEARDOWN_MARGIN (const, 10s): reserved headroom between
    the end of the phased drain and SIGKILL for DB pool close + tracing
    flush. Subtracted from the ceiling during validation.
  - DEFAULT_SHUTDOWN_GRACE_CEILING_SECS (const, 120s): conservative
    upper bound covering the issue spec's 60-90s range with headroom.
    Override via SHUTDOWN_GRACE_PERIOD_CEILING_SECS.
  - ShutdownTimings::total_budget(): pre_drain + http_drain +
    signer_drain (no margin).
  - validate_shutdown_timings(&ShutdownTimings, Duration) -> Result<(),
    String>: returns Err with an operator-actionable message if total +
    margin > ceiling. Exact equality is accepted (margin fully fits).

The call site in async_main warns rather than refusing to boot: the
ceiling is cross-repo-coupled to the iac repo, and an operator may
legitimately need to start before the iac repo catches up. Marked with
TODO(#692) so the warn-only fallback can be tightened once the iac PR
lands and terminationGracePeriodSeconds is pinned at >=75s everywhere.

Default phase values (65s sum + 10s margin = 75s) are unchanged,
matching the iac PR target. Decision: keep defaults rather than shrink
to fit a 60s grace, because the iac PR targets 75s and shrinking would
reduce the HTTP-drain budget available for legitimately long in-flight
sign requests. The validation path is the defense-in-depth for the
smaller-grace case, with SHUTDOWN_* env vars as the tuning knobs.

Covered by 7 new unit tests (31 total in keycast bin, all passing):
  - parse_shutdown_grace_ceiling default / override
  - ShutdownTimings::total_budget arithmetic
  - validate_shutdown_timings ok when under ceiling (75s, 90s)
  - validate_shutdown_timings err when over ceiling (60s grace vs 65s
    + 10s margin, from review)
  - validate_shutdown_timings err when single field overshoots
    (SHUTDOWN_HTTP_DRAIN_SECS=120 case from review)
  - validate_shutdown_timings boundary: equal-to-ceiling accepted,
    one-below rejected, one-above accepted

cargo clippy --workspace --all-targets --all-features -- -D warnings
-A deprecated and cargo fmt --all -- --check are clean.
Previously the HTTP-drain timeout arm in async_main just logged 'API
server shutdown timed out' and fell through. Dropping the axum task's
JoinHandle at end of async_main does NOT cancel the underlying task
(tokio+std semantics), so axum would keep running concurrently with
phase 3 (relay client teardown) and phase 4 (DB pool close). Pool
close racing a still-live accept loop surfaces as DB errors on any
late-arriving requests, and the kubelet SIGKILL at
terminationGracePeriodSeconds would be the actual stop — defeating
the point of a bounded HTTP drain budget.

Extracted the drain into a helper (drain_http_or_abort) that takes the
JoinHandle and budget, awaits the task up to the budget via
tokio::time::timeout with &mut handle (JoinHandle: Unpin + Future, so
timeout borrows rather than consumes it), and on timeout calls
handle.abort() then awaits the cancellation before returning. Returns
a two-variant HttpDrainOutcome so the caller logs an honest message
in both arms instead of the old 'timed out' line that implied axum
had stopped.

The caller in async_main now logs:
  'API server shutdown timed out; axum task aborted to prevent DB
   pool close from racing late requests'
on the timeout path, which is what actually happens.

Covered by 2 new tokio::test(start_paused = true) unit tests:
  - drain_http_or_abort returns Completed when the task finishes inside
    the budget (fast path).
  - drain_http_or_abort aborts the task when the budget is exceeded,
    using a Drop-guarded cancellation flag in the spawned task to
    prove the task is actually stopped (not just left pending with
    the join result thrown away). This directly reproduces the review
    failure mode.

33 unit tests in the keycast bin now all pass. cargo clippy --workspace
--all-targets --all-features -- -D warnings -A deprecated and cargo fmt
--all -- --check are clean.
Phase 4 (DB pool close) previously awaited sqlx's Pool::close() with
no upper bound, contradicting the design contract that documented
SHUTDOWN_TEARDOWN_MARGIN (10s) as the headroom reserved inside
terminationGracePeriodSeconds for post-drain cleanup. sqlx's
Pool::close() waits for every checked-out connection to be returned,
so a connection stuck in a long-running query — e.g. a tenant-cache
preload task mid-query when task_tracker.wait() timed out and
signer_handle.abort() was called but not awaited — could block past
the kubelet SIGKILL at terminationGracePeriodSeconds. That would
swallow the final 'Graceful shutdown complete' log, leaving operators
with no positive signal of clean shutdown.

Extracted close_within_margin(close_fut, margin) -> PoolCloseOutcome
which wraps the future in tokio::time::timeout. Generic over
Future<Output = ()> so the behaviour can be tested deterministically
with tokio::time::pause without standing up a real Postgres pool.
sqlx's Pool::close is cancellation-safe, so dropping the future on
timeout is safe.

The call site in async_main now logs two distinct lines:
  - 'Graceful shutdown complete' on Completed (the happy path).
  - 'Graceful shutdown complete with warning: DB pool close exceeded
     teardown margin; stuck checked-out connections were dropped' on
    TimedOut (still emitted so operators have a positive signal).

Covered by 3 new tokio::test(start_paused = true) unit tests:
  - returns Completed when the future finishes inside the margin.
  - returns TimedOut when the future blocks past the margin.
  - enforces the real bound: measures virtual-time elapsed and asserts
    the helper returns within the margin + 100ms even when the inner
    future would take 1h. This is the safety property — phase 4
    cannot eat into the kubelet's SIGKILL window.

36 unit tests in the keycast bin now all pass. cargo clippy
--workspace --all-targets --all-features -- -D warnings -A deprecated
and cargo fmt --all -- --check are clean.
Regression defense. `parse_duration_secs_env` previously accepted any
u64, so a misconfig like SHUTDOWN_PRE_DRAIN_SECS=18446744073709551600
parsed cleanly. The value then flowed into ShutdownTimings, where
total_budget()'s `+` sum panicked on Duration overflow — aborting
the process at startup with a raw 'overflow when adding durations'
panic BEFORE validate_shutdown_timings could surface the operator-
actionable 'does not fit inside ceiling' warn. Net result: crashloop
instead of loud warning.

Added SHUTDOWN_PER_FIELD_CEILING_SECS const (24*3600 = 86400s = 1 day).
parse_duration_secs_env now rejects any value strictly greater than
the ceiling and falls back to the per-field default with a warn-level
log naming the env var, the bad value, and the ceiling. Three fields
at the ceiling sum to 3 * 86400 = 259200s = 3 days, comfortably inside
u64, so total_budget() can stay infallible.

One day is far above any plausible operational value — real drain
budgets are seconds, not hours — and the ceiling is documented on
the constant so an operator who genuinely wants a larger value knows
where to look. The warn log still emits at boot so the misconfig
doesn't silently revert.

Covered by 4 new unit tests:
  - u64::MAX-ish input falls back to default AND downstream
    total_budget() does not panic (the actual safety property).
  - value strictly above the ceiling falls back to default.
  - value exactly at the ceiling is accepted (legitimate opt-in).
  - total_budget() with every field at the ceiling computes 3 *
    ceiling without overflow and equals Duration::from_secs(3 *
    ceiling).

40 unit tests in the keycast bin now all pass. cargo clippy
--workspace --all-targets --all-features -- -D warnings -A deprecated
and cargo fmt --all -- --check are clean.
Phase 3's `client_for_shutdown.shutdown().await` was previously awaited
unbounded in `async_main` — only the trailing `task_tracker.wait()`
was wrapped in `tokio::time::timeout(signer_drain, ...)`. A hung
WebSocket close handshake, a bug in `nostr_sdk::Client::shutdown`, or
a dead relay socket could therefore block phase 3 past
`terminationGracePeriodSeconds` and swallow the kubelet SIGKILL entry
path: phase 4 (DB pool close, already bounded by
`SHUTDOWN_TEARDOWN_MARGIN`) would never run, and the final
`Graceful shutdown complete` log would be lost.

This silently busted the design contract enforced by
`validate_shutdown_timings`:

    pre_drain + http_drain + signer_drain + SHUTDOWN_TEARDOWN_MARGIN
    <= terminationGracePeriodSeconds

Relay teardown was not in that sum.

Extracted the phase into `drain_signer_or_abort(client_shutdown,
tracker_wait, abort_signer, budget)` which wraps the whole phase in
ONE outer `tokio::time::timeout(budget, ...)` rather than per-await
timeouts. One shared outer bound is strictly more correct than two
independent inner timeouts: two independent `signer_drain`-sized
timeouts would make phase 3 worst-case `2 * signer_drain` and quietly
bust the same contract we were trying to restore. The helper awaits
`client_shutdown` then `tracker_wait` sequentially inside the
timeout; on timeout it invokes the caller-supplied `abort_signer`
callback (which in production calls `signer_handle.abort()`) and
returns `AbortedAfterTimeout`.

The helper is generic over both inner futures and the abort callback
so the timeout behaviour can be exercised deterministically with
`tokio::time::pause` in tests without standing up a real
`nostr_sdk::Client` or a real signer `JoinHandle`.

Call site in `async_main`:

  match drain_signer_or_abort(
      client_for_shutdown.shutdown(),
      task_tracker.wait(),
      || signer_handle_for_abort.abort(),
      timings.signer_drain,
  ).await { ... }

replaces the separate `client_for_shutdown.shutdown().await` +
`tokio::time::timeout(signer_drain, task_tracker.wait())` block. The
log line on timeout is updated to reflect the new truth — either
`client.shutdown()` or `task_tracker.wait()` (or both) overran the
budget.

Covered by 5 new `tokio::test(start_paused = true)` unit tests:
  - happy path: both futures finish inside budget → Completed, abort
    NOT called
  - client.shutdown() hangs → AbortedAfterTimeout, abort called
    (the reviewer-flagged regression target)
  - task_tracker.wait() hangs → AbortedAfterTimeout, abort called
    (pre-patch behavior preserved under new outer-timeout structure)
  - safety bound: helper returns within `budget` even when both inner
    futures would sleep 1h
  - shared-budget semantic: 8s + 8s sequential awaits against a 10s
    budget must time out at 10s, not extend to 16s (pins the
    single-outer-timeout vs two-independent-timeouts choice)

45 unit tests in the keycast bin now all pass (40 previously + 5
new). cargo test --workspace --lib --bins is clean.
cargo clippy --workspace --all-targets --all-features -- -D warnings
-A deprecated and cargo fmt --all -- --check are clean.

Relates to #692.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant